Previously, each mapping between a codepage byte and a Unicode
character was stored as a struct table_entry, which listed both the
byte and the character. This representation may be optimal for sparse
mappings, but codepages map almost every possible byte to a character,
so it is more efficient to just have an array that lists the Unicode
character corresponding to each byte from 0x80 to 0xFF. The bytes are
not stored but rather implied by the array index. The tcvn5712 and
viscii codepages have a total of four mappings that do not fit in the
arrays, so we still use struct table_entry for those.
This change also makes cp2u() operate in O(1) time and may speed up
other functions as well.
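
A minimal sketch of the idea, not the actual ELinks code: every name below (demo_highhalf, demo_extra, demo_cp2u, DEMO_UNKNOWN) is hypothetical, the sample mappings are illustrative only, and treating 0 as "unmapped" is an assumption made to keep the example short. The dense array is indexed by (byte - 0x80), so the lookup is a single array access. The few leftover mappings stay in a small struct table_entry list.

```c
#include <stddef.h>

typedef unsigned int demo_unicode_val;	/* hypothetical type name */

#define DEMO_UNKNOWN 0xFFFDu		/* assumed "no mapping" result */

/* Sparse representation: the byte is stored next to the character. */
struct table_entry {
	unsigned char c;
	demo_unicode_val u;
};

/* Dense representation: one slot per byte 0x80..0xFF.  The byte is
 * implied by the index; slots left at 0 mean "unmapped" in this sketch. */
static const demo_unicode_val demo_highhalf[128] = {
	[0x80 - 0x80] = 0x20AC,	/* illustrative: byte 0x80 */
	[0xA9 - 0x80] = 0x00A9,	/* illustrative: byte 0xA9 */
};

/* Leftover mappings that do not fit in the array (illustrative value),
 * terminated by an entry whose byte is 0. */
static const struct table_entry demo_extra[] = {
	{ 0x02, 0x1EB2 },
	{ 0, 0 }
};

static demo_unicode_val
demo_cp2u(unsigned char byte)
{
	const struct table_entry *e;

	if (byte >= 0x80) {
		/* O(1): no search, just index the array. */
		demo_unicode_val u = demo_highhalf[byte - 0x80];

		return u ? u : DEMO_UNKNOWN;
	}

	/* Short linear scan over the handful of exceptions. */
	for (e = demo_extra; e->c; e++)
		if (e->c == byte)
			return e->u;

	return byte;	/* plain ASCII maps to itself */
}
```

With these sample tables, demo_cp2u(0xA9) reads demo_highhalf[0x29] directly, while demo_cp2u(0x02) is resolved by the short exception list.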
The "sed | while read" concoction in Unicode/gen-cp looks rather
unhealthy. It would probably be faster and more readable if rewritten
in Perl, but IMO that goes for the previous version as well, so I
suppose whoever wrote it had a reason not to use Perl here.

Before:

   text    data     bss     dec     hex filename
  38948   28528    3311   70787   11483 src/intl/charsets.o
 500096   85568   82112  667776   a3080 src/elinks

After:

   text    data     bss     dec     hex filename
  31558   28528    3311   63397    f7a5 src/intl/charsets.o
 492878   85568   82112  660558   a144e src/elinks

So the text section of src/intl/charsets.o shrank by 7390 bytes, and
that of src/elinks by 7218 bytes.
Measured on i686-pc-linux-gnu with: --disable-xbel --disable-nls
--disable-cookies --disable-formhist --disable-globhist
--disable-mailcap --disable-mimetypes --disable-smb --disable-mouse
--disable-sysmouse --disable-leds --disable-marks --disable-css
--enable-small --enable-utf-8 --without-gpm --without-bzlib
--without-idn --without-spidermonkey --without-lua --without-gnutls
--without-openssl CFLAGS="-Os -ggdb -Wall"

Before:

   text    data     bss     dec     hex filename
  25726   62992    3343   92061   1679d src/intl/charsets.o
 653856  120020   82144  856020   d0fd4 src/elinks

After:

   text    data     bss     dec     hex filename
  60190   28528    3311   92029   1677d src/intl/charsets.o
 688320   85556   82112  855988   d0fb4 src/elinks

So 34464 bytes moved from the data section to the text section, where
they should be more likely to be shared between ELinks processes.
Measured on i686-pc-linux-gnu with: --disable-xbel --disable-nls
--disable-cookies --disable-formhist --disable-globhist
--disable-mailcap --disable-mimetypes --disable-smb --disable-mouse
--disable-sysmouse --disable-leds --disable-marks --disable-css
--enable-small --enable-utf-8 --without-gpm --without-bzlib
--without-idn --without-spidermonkey --without-lua --without-gnutls
--without-openssl CFLAGS="-O2 -ggdb -Wall"

This also fixes b.delta to have the correct value 0x03B4. The main
differences from ELinks' entity database are:

- entities not in the Unicode database from 1997:
  Scomma, Tcomma, euro, scomma, tcomma
- obsolete entities kept for compatibility:
  emdash, endash, hibar

The root makefile is converted, as are some leaf Makefiles. This also
brings in the required infrastructure and adjusts configure.in
accordingly.

I converted only makefiles that contain nothing configurable, since
the configurable ones will require more consideration.