mirror of
https://github.com/rkd77/elinks.git
synced 2024-12-04 14:46:47 -05:00
c5a7f87c43
When the user tells ELinks to search for a regexp, ELinks 0.11.0
passes the regexp to regcomp() and the formatted document to
regexec(), both in the terminal charset. This works OK for unibyte
ASCII-compatible charsets because the regexp metacharacters are all
in the ASCII range. And ELinks 0.11.0 doesn't support multibyte or
ASCII-incompatible (e.g. EBCDIC) charsets in terminals, so it is no
big deal if regexp searches fail in such locales.

ELinks 0.12pre1 attempts to support UTF-8 as the terminal charset if
CONFIG_UTF8 is defined. Then, struct search contains unicode_val_T c
rather than unsigned char c, and get_srch() and add_srch_chr()
together save UTF-32 values there if the terminal charset is UTF-8.
In plain-text searches, is_in_range_plain() compares those values
directly if the search is case sensitive, or folds them to lower case
if the search is case insensitive: with towlower() if the terminal
charset is UTF-8, or with tolower() otherwise. In regexp searches
however, get_search_region_from_search_nodes() still truncates all
values to 8 bits in order to generate the string that
search_for_pattern() then passes to regexec(). In UTF-8 locales,
regexec() expects this string to be in UTF-8 and can't make sense of
the truncated characters. There is also a possible conflict in
regcomp() if the locale is UTF-8 but the terminal charset is not, or
vice versa.

Rejected ways of fixing the charset mismatches:

* When the terminal charset is UTF-8, recode the formatted document
  from UTF-32 to UTF-8 for regexp searching. This would work if the
  terminal and the locale both use UTF-8, or if both use unibyte
  ASCII-compatible charsets, but not if only one of them uses UTF-8.

* Convert both the regexp and the formatted document to the charset
  of the locale, as that is what regcomp() and regexec() expect.
  ELinks would have to somehow keep track of which bytes in the
  converted string correspond to which characters in the document;
  not entirely trivial because convert_string() can replace a single
  unconvertible character with a string of ASCII characters. If
  ELinks were eventually changed to use iconv() for unrecognized
  charsets, such tracking would become even harder.

* Temporarily switch to a locale that uses the charset of the
  terminal. Unfortunately, it seems there is no portable way to
  construct a name for such a locale. It is also possible that no
  suitable locale is available; especially on Windows, whose C
  library defines MB_LEN_MAX as 2 and thus cannot support UTF-8
  locales.

Instead, this commit makes ELinks do the regexp matching with
regwcomp and regwexec from the TRE library. This way, ELinks can
losslessly recode both the pattern and the document to Unicode and
rely on the regexp code in TRE decoding them properly, regardless of
locale. There are some possible problems though:

1. ELinks stores strings as UTF-32 in arrays of unicode_val_T, but
   TRE uses wchar_t instead. If wchar_t is UTF-16, as it is on
   Microsoft Windows, then TRE will misdecode the strings. It
   wouldn't be too hard to make ELinks convert to UTF-16 in this
   case, but (a) TRE doesn't currently support UTF-16 either, and it
   seems possible that wchar_t-independent UTF-32 interfaces will be
   added to TRE; and (b) there seems to be little interest in using
   ELinks on Windows anyway.

2. The Citrus Project apparently wanted BSD to use a locale-dependent
   wchar_t: e.g. UTF-32 in some locales and an ISO 2022 derivative in
   others. Regexp searches in ELinks now do not support the latter.

[ Adapted to elinks-0.12 from bug 1060 attachment 506.
  Commit message by me. --KON ]
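The encoding issues described in the commit message can be illustrated with a small standalone C sketch. It is not ELinks or TRE code: codepoint_T is a hypothetical stand-in for unicode_val_T, and truncate_to_byte(), encode_utf8() and copy_to_wchar() are names made up for this illustration. The sketch shows why keeping only the low 8 bits of a UTF-32 value does not yield valid UTF-8, and what a range check might look like before UTF-32 data is handed to a wchar_t-based regexp API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical stand-in for ELinks' unicode_val_T: one UTF-32 code point. */
typedef uint32_t codepoint_T;

/* Keep only the low 8 bits, which is the kind of truncation the commit
 * message attributes to the old regexp search path. */
static unsigned char
truncate_to_byte(codepoint_T c)
{
	return (unsigned char) (c & 0xFF);
}

/* Encode one code point as UTF-8.  Returns the number of bytes written
 * (1 to 4), or 0 if c is a surrogate or lies beyond U+10FFFF. */
static size_t
encode_utf8(codepoint_T c, unsigned char out[4])
{
	if (c < 0x80) {
		out[0] = (unsigned char) c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char) (0xC0 | (c >> 6));
		out[1] = (unsigned char) (0x80 | (c & 0x3F));
		return 2;
	}
	if (c >= 0xD800 && c <= 0xDFFF)
		return 0;	/* surrogates are not valid scalar values */
	if (c < 0x10000) {
		out[0] = (unsigned char) (0xE0 | (c >> 12));
		out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char) (0x80 | (c & 0x3F));
		return 3;
	}
	if (c <= 0x10FFFF) {
		out[0] = (unsigned char) (0xF0 | (c >> 18));
		out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
		out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		out[3] = (unsigned char) (0x80 | (c & 0x3F));
		return 4;
	}
	return 0;
}

/* Copy len code points into a NUL-terminated wchar_t buffer.  Returns 0
 * on success, or -1 if dst is too small or a code point does not fit in
 * wchar_t (as can happen where wchar_t is only 16 bits wide). */
static int
copy_to_wchar(const codepoint_T *src, size_t len,
	      wchar_t *dst, size_t dst_size)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	if (dst_size < len + 1)
		return -1;
	for (i = 0; i < len; i++) {
		if ((uintmax_t) src[i] > (uintmax_t) WCHAR_MAX)
			return -1;
		dst[i] = (wchar_t) src[i];
	}
	dst[len] = L'\0';
	return 0;
}
```

For U+20AC (the euro sign), truncate_to_byte() gives the single byte 0xAC, which is a lone continuation byte in UTF-8, whereas encode_utf8() produces the three bytes E2 82 AC. A caller on a platform with a 16-bit wchar_t would get -1 from copy_to_wchar() for code points above U+FFFF, which corresponds to problem 1 in the list above.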
config
contrib
doc
po
src
test
Unicode
.gitignore
.mailmap
ABOUT-NLS
AUTHORS
autogen.sh
BUGS
ChangeLog
configure.in
COPYING
features.conf
INSTALL
Makefile
Makefile.config.in
Makefile.lib
NEWS
README
SITES
THANKS
TODO
ELinks - an advanced web browser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ELinks is an advanced and well-established feature-rich text mode web
(HTTP/FTP/..) browser. ELinks can render both frames and tables, is highly
customizable and can be extended via scripts. It is very portable and runs
on a variety of platforms.

The ELinks official website is available at

    http://elinks.cz/

Please see the SITES file for mirrors or other recommended sites. If you
want to install ELinks on your computer, see the INSTALL file for further
instructions.

A good starting point is the documentation available in doc/, especially
the file named index.txt.

If you want to request features or report bugs, see community information
at http://elinks.cz/community.html and feedback information available at
http://elinks.cz/feedback.html.

If you want to write some patches, please first read the doc/hacking.txt
document.

If you want to add a new language or update the translation for an existing
one, please read the po/README document.

If you want to write some documentation, well, you're welcome! ;)


Historical notes
~~~~~~~~~~~~~~~~

Initially, ELinks was a development version of Links (Lynx-like text WWW
browser), with a more liberal feature policy and development style. Its
purpose was to provide an alternative to Links, and to test and tune
various new features, but still provide good rock-solid releases inside
stable branches.

Why not contribute to Links instead? Well, first I made a bunch of patches
for the original Links, but Mikulas wasn't around to integrate them, so I
started releasing my fork. When he came back, a significant number of them
got refused because Mikulas did not like them, as he just wouldn't have any
use for them himself. He wants to keep Links with a relatively closed
feature set and merge only new features which he himself needs. This has
the advantage that the tree is very narrow and the code is small and
contains very little bloat.

ELinks, on the contrary, aims to provide a full-featured web browser,
superior to both lynx and w3m and with the power (but not the slowness and
memory usage) of Mozilla, Konqueror and similar browsers. However, to
prevent drastic bloating of the code, development is steered toward
modularization and the separation of add-on modules (like cookies,
bookmarks, ssl, scripting etc).

For more details about ELinks history, please see

    http://elinks.cz/history.html

If you are more interested in the history and various Links clones and
versions, you can examine the website at

    http://links.sf.net/

vim: textwidth=80