Interface and function as proposed by cls.
The reasoning behind this function is that cls expressed his
interest to keep memory allocation out of libutf, which is a
very good motive.
This simplifies the function a lot and should also increase the
speed a bit, but the most important factor here is that there's
no malloc anywhere in libutf, making it a lot smaller and more
robust with a smaller attack-surface.
Look at the paste(1) and tr(1) changes for an idiomatic way to
allocate the right amount of space for the Rune-array.
Interface as proposed by cls, but internally rewritten after a few
The code is much shorter and to the point, aligning itself with other
standard functions. It should also be much faster, which is not bad.
This optimizes the binary size for each tool that uses these functions.
Previously, if a program just used one single function, maybe even a
one-liner, it would statically compile in all lookup-tables, bloating
the binary by up to 20K.
All these changes are derived from a local libutf where I do the
primary changes. So I hope that I can merge these things into libutf
sooner or later, as discussed on the ml.
This basically means that we now have an autogenerating typecheck
and case-conversion tool.
Don't freak out when you see the added LOC. Given we now have
an additional mapping to the uppercase-characters, some ranges got
"lost" and have to be written literally by the generating awk-script.
The runetypebody.h was generated by myself using my modified version
of mkrunetype.awk and I'll push the changed version as soon as this
has been discussed on the ml.
If you worry about speed, consider, that bsearch is just the right
tool for this job and can even handle a long array like this.
tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:
- UTF-8: not allowed in POSIX, but in my opinion a must. This
finally allows you to work with UTF-8 streams without
problems or unexpected behaviour.
- Equivalence classes: Left out, even GNU coreutils ignore them
and depending on LC_COLLATE, which sucks.
- Character classes: No experiments or environment-variable-trickery.
Just plain definitions derived from the POSIX-
standard, working as expected.
I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for literal
expressions ('\n', '\t', '\001', ...) and probably rethinking
the way [_*n] is unnecessarily restricted to string2.