mirror of https://github.com/rkd77/elinks.git synced 2024-06-26 01:15:37 +00:00
Witold Filipczyk c5a7f87c43 Bug 1060: Use libtre for regexp searches.
When the user tells ELinks to search for a regexp, ELinks 0.11.0
passes the regexp to regcomp() and the formatted document to
regexec(), both in the terminal charset.  This works OK for unibyte
ASCII-compatible charsets because the regexp metacharacters are all in
the ASCII range.  And ELinks 0.11.0 doesn't support multibyte or
ASCII-incompatible (e.g. EBCDIC) charsets in terminals, so it is no
big deal if regexp searches fail in such locales.

ELinks 0.12pre1 attempts to support UTF-8 as the terminal charset if
CONFIG_UTF8 is defined.  Then, struct search contains unicode_val_T c
rather than unsigned char c, and get_srch() and add_srch_chr()
together save UTF-32 values there if the terminal charset is UTF-8.
In plain-text searches, is_in_range_plain() compares those values
directly if the search is case sensitive, or folds them to lower case
if the search is case insensitive: with towlower() if the terminal
charset is UTF-8, or with tolower() otherwise.  In regexp searches
however, get_search_region_from_search_nodes() still truncates all
values to 8 bits in order to generate the string that
search_for_pattern() then passes to regexec().  In UTF-8 locales,
regexec() expects this string to be in UTF-8 and can't make sense of
the truncated characters.  There is also a possible conflict in
regcomp() if the locale is UTF-8 but the terminal charset is not, or
vice versa.
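
The information loss from that truncation can be shown with a minimal
sketch.  It is not the ELinks code: the typedef below is only a stand-in
for unicode_val_T, and truncate_to_bytes() is a hypothetical helper that
mimics "keep the low 8 bits of each UTF-32 value".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for ELinks' unicode_val_T (one UTF-32 code point). */
typedef uint32_t unicode_val_T;

/* Hypothetical helper: keep only the low 8 bits of each code point and
 * return how many characters changed value in the process. */
size_t
truncate_to_bytes(const unicode_val_T *src, size_t len, unsigned char *dst)
{
	size_t lost = 0;
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < len; i++) {
		dst[i] = (unsigned char) (src[i] & 0xFF);
		if (dst[i] != src[i])
			lost++;
	}
	return lost;
}
```

For the code points U+007A, U+0142 and U+20AC, the resulting bytes are
0x7A, 0x42 and 0xAC: the first survives, the second turns into an ASCII
'B', and the third becomes a lone byte that is not valid UTF-8 on its own.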

Rejected ways of fixing the charset mismatches:

* When the terminal charset is UTF-8, recode the formatted document
  from UTF-32 to UTF-8 for regexp searching.  This would work if the
  terminal and the locale both use UTF-8, or if both use unibyte
  ASCII-compatible charsets, but not if only one of them uses UTF-8.

* Convert both the regexp and the formatted document to the charset of
  the locale, as that is what regcomp() and regexec() expect.  ELinks
  would have to somehow keep track of which bytes in the converted
  string correspond to which characters in the document; not entirely
  trivial because convert_string() can replace a single unconvertible
  character with a string of ASCII characters.  If ELinks were
  eventually changed to use iconv() for unrecognized charsets, such
  tracking would become even harder.

* Temporarily switch to a locale that uses the charset of the
  terminal.  Unfortunately, it seems there is no portable way to
  construct a name for such a locale.  It is also possible that no
  suitable locale is available; especially on Windows, whose C library
  defines MB_LEN_MAX as 2 and thus cannot support UTF-8 locales.
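
To illustrate the bookkeeping that byte-oriented matching would need,
here is a minimal sketch that recodes UTF-32 to UTF-8 while recording
which source character produced each output byte.  The names
encode_utf8() and recode_with_map() are hypothetical and the typedef is
a stand-in for unicode_val_T; a conversion through convert_string() to an
arbitrary locale charset would need a similar but more involved map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for ELinks' unicode_val_T (one UTF-32 code point). */
typedef uint32_t unicode_val_T;

/* Encode one code point as UTF-8.  Returns the number of bytes written
 * (1 to 4), or 0 for surrogates and values above U+10FFFF. */
static size_t
encode_utf8(unicode_val_T c, unsigned char *out)
{
	if (c < 0x80) {
		out[0] = (unsigned char) c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char) (0xC0 | (c >> 6));
		out[1] = (unsigned char) (0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000) {
		if (c >= 0xD800 && c <= 0xDFFF)
			return 0;
		out[0] = (unsigned char) (0xE0 | (c >> 12));
		out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char) (0x80 | (c & 0x3F));
		return 3;
	}
	if (c < 0x110000) {
		out[0] = (unsigned char) (0xF0 | (c >> 18));
		out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
		out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		out[3] = (unsigned char) (0x80 | (c & 0x3F));
		return 4;
	}
	return 0;
}

/* Recode src[0..len) into dst (capacity cap bytes).  char_of_byte[b] is
 * set to the index of the source character that produced byte b, so a
 * byte offset reported by a matcher can be mapped back to a character.
 * Returns the number of bytes written, or (size_t) -1 on error. */
size_t
recode_with_map(const unicode_val_T *src, size_t len,
		unsigned char *dst, size_t *char_of_byte, size_t cap)
{
	size_t used = 0;
	size_t i, k;

	assert(src != NULL && dst != NULL && char_of_byte != NULL);
	for (i = 0; i < len; i++) {
		unsigned char buf[4];
		size_t n = encode_utf8(src[i], buf);

		if (n == 0 || n > cap - used)
			return (size_t) -1;
		for (k = 0; k < n; k++) {
			dst[used] = buf[k];
			char_of_byte[used] = i;
			used++;
		}
	}
	return used;
}
```

A match that starts at byte offset 3 in the recoded string would then be
reported at character index char_of_byte[3] in the document.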

Instead, this commit makes ELinks do the regexp matching with regwcomp
and regwexec from the TRE library.  This way, ELinks can losslessly
recode both the pattern and the document to Unicode and rely on the
regexp code in TRE decoding them properly, regardless of locale.

There are some possible problems though:

1. ELinks stores strings as UTF-32 in arrays of unicode_val_T, but TRE
   uses wchar_t instead.  If wchar_t is UTF-16, as it is on Microsoft
   Windows, then TRE will misdecode the strings.  It wouldn't be too
   hard to make ELinks convert to UTF-16 in this case, but (a) TRE
   doesn't currently support UTF-16 either, and it seems possible that
   wchar_t-independent UTF-32 interfaces will be added to TRE; and (b)
   there seems to be little interest in using ELinks on Windows anyway.

2. The Citrus Project apparently wanted BSD to use a locale-dependent
   wchar_t: e.g. UTF-32 in some locales and an ISO 2022 derivative in
   others.  Regexp searches in ELinks now do not support the latter.
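
The first problem can be made concrete with a minimal sketch of the copy
from UTF-32 into a wchar_t buffer, as would be needed before handing the
text to a wide-character matcher.  The names wchar_holds_utf32() and
utf32_to_wcs() are hypothetical and the typedef is a stand-in for
unicode_val_T.  The sketch only checks that the values fit; it cannot
detect a locale-dependent wchar_t encoding as in the second problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Assumption: stand-in for ELinks' unicode_val_T (one UTF-32 code point). */
typedef uint32_t unicode_val_T;

/* Nonzero if wchar_t is wide enough for every Unicode code point.
 * This is false where wchar_t is 16 bits wide. */
int
wchar_holds_utf32(void)
{
	return (uintmax_t) WCHAR_MAX >= 0x10FFFF;
}

/* Copy len code points into dst and append a terminating L'\0'.
 * dst must have room for len + 1 elements.  Returns 0 on success, or -1
 * if a value is above U+10FFFF or does not fit in wchar_t. */
int
utf32_to_wcs(const unicode_val_T *src, size_t len, wchar_t *dst)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < len; i++) {
		if (src[i] > 0x10FFFF
		    || (uintmax_t) src[i] > (uintmax_t) WCHAR_MAX)
			return -1;
		dst[i] = (wchar_t) src[i];
	}
	dst[len] = L'\0';
	return 0;
}
```

Where wchar_holds_utf32() returns zero, any character outside the Basic
Multilingual Plane makes utf32_to_wcs() fail instead of being silently
misdecoded.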

[ Adapted to elinks-0.12 from bug 1060 attachment 506.
  Commit message by me.  --KON ]
2009-02-08 18:26:22 +02:00
config Win32: Get socklen_t from <ws2tcpip.h>. 2007-07-18 00:41:08 +03:00
contrib Move most of contrib/smjs/README into the manual. 2008-07-10 20:31:22 +03:00
doc bug 153, 1066: Add codepage parameter to update_bookmark(). 2009-02-08 18:26:18 +02:00
po pl.po: Statystyki -> Statystyka 2008-11-20 15:53:35 +01:00
src Bug 1060: Use libtre for regexp searches. 2009-02-08 18:26:22 +02:00
test Fix assertion failure if IMG/@usemap refers to a different file. 2009-01-01 19:12:41 +00:00
Unicode Bug 932: Redisable 0x80...0x9F mappings in some charsets. 2008-10-11 15:35:34 +03:00
.gitignore Autogenerate .vimrc files and put the master in config/vimrc 2006-01-15 18:38:58 +01:00
.mailmap Add .mailmap file to help git-shortlog 2007-04-15 22:08:11 +02:00
ABOUT-NLS Initial commit of the HEAD branch of the ELinks CVS repository, as of 2005-09-15 15:58:31 +02:00
AUTHORS AUTHORS: Peter Collingbourne allows relicensing 2008-11-10 00:02:44 +02:00
autogen.sh autogen.sh: Also remove autom4te.cache. 2007-07-19 21:28:33 +03:00
BUGS Drop .or from elinks.or.cz. 2005-12-29 04:35:02 +00:00
ChangeLog Remove Cogito from ChangeLog and INSTALL too 2008-07-01 02:17:51 +03:00
configure.in Bug 1060: Use libtre for regexp searches. 2009-02-08 18:26:22 +02:00
COPYING Refresh charsets from www.unicode.org. 2008-10-11 15:35:09 +03:00
features.conf lzma: rephrase note about LZMA SDK 2008-03-01 13:55:01 +02:00
INSTALL INSTALL: autoconf-2.13 has not been supported for a while 2008-07-01 02:21:46 +03:00
Makefile Document that GNU Make >= 3.78 is needed, and check it. 2007-12-09 08:16:29 +02:00
Makefile.config.in Bug 1060: Use libtre for regexp searches. 2009-02-08 18:26:22 +02:00
Makefile.lib Continue if a test fails 2008-07-03 01:55:03 +02:00
NEWS bug 153: UTF-8 bookmark.title has been fully implemented. 2009-02-08 18:26:21 +02:00
README Drop .or from elinks.or.cz. 2005-12-29 04:35:02 +00:00
SITES SITES: delete or replace dead links 2008-06-30 20:36:47 +03:00
THANKS THANKS: Remove link to HSTI webpage as the domain is for sale. 2006-11-08 20:53:01 +02:00
TODO TODO: minor rephrasing and cleanups 2007-05-28 12:04:43 +02:00

ELinks - an advanced web browser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ELinks is an advanced and well-established feature-rich text mode web
(HTTP/FTP/..) browser. ELinks can render both frames and tables, is highly
customizable and can be extended via scripts. It is very portable and runs
on a variety of platforms.

The ELinks official website is available at

	http://elinks.cz/

Please see the SITES file for mirrors or other recommended sites.  If you
want to install ELinks on your computer, see the INSTALL file for further
instructions.

A good starting point is the documentation available in doc/, especially the
file named index.txt.

If you want to request features or report bugs, see community information at
http://elinks.cz/community.html and feedback information available at
http://elinks.cz/feedback.html.

If you want to write some patches, please first read the doc/hacking.txt
document.

If you want to add a new language or update the translation for an existing
one, please read the po/README document.

If you want to write some documentation, well, you're welcome! ;)



Historical notes
~~~~~~~~~~~~~~~~

Initially, ELinks was a development version of Links (Lynx-like text WWW
browser), with a more liberal feature policy and development style.  Its
purpose was to provide an alternative to Links, and to test and tune various
new features, but still provide good rock-solid releases inside stable
branches.

Why not contribute to Links instead?  Well, first I made a bunch of patches for
the original Links, but Mikulas wasn't around to integrate them, so I started
releasing my fork. When he came back, a significant number of them got refused
because Mikulas did not like them, as he just wouldn't have any use for them
himself.  He wants to keep Links with a relatively closed feature set and merge
only new features which he himself needs.  The advantage is that the tree is
very narrow and the code is small and contains very little bloat.

ELinks, on the contrary, aims to provide a full-featured web browser, superior
to both lynx and w3m and with the power (but not slowness and memory usage) of
Mozilla, Konqueror and similar browsers. However, to prevent drastic bloating
of the code, development is steered toward modularization and the separation
of add-on modules (like cookies, bookmarks, SSL, scripting etc).

For more details about ELinks history, please see

	http://elinks.cz/history.html

If you are more interested in the history of the various Links clones and
versions, you can examine the website at

	http://links.sf.net/




vim: textwidth=80