Commit Graph

18 Commits

Author SHA1 Message Date
FRIGN
ab9b240dc6 Fix warnings and update isalpharune() 2015-02-12 17:08:02 +01:00
FRIGN
4f6d696894 Also add "B"-type characters to isspacerune() 2015-02-12 16:48:22 +01:00
FRIGN
24d6cb90e7 Amend isspacerune() properly with WS and S Unicode characters 2015-02-12 16:41:57 +01:00
FRIGN
ce11e1f195 Add section for laces in lowerrune and upperrune and more ranges
This is a special third kind of structure found in Unicode, besides
singletons and ranges.
This dramatically reduces the number of explicit singletons in the
lookup tables.
Also, I changed the awk-script so that it can sort trivial
translations as well, breaking down the LOC even more.

The binary size of tr dropped from 67K to 51K.
2015-02-12 16:18:02 +01:00
FRIGN
9565eef895 Refactor uppercase-inclusion in libutf
Previously, the to*rune function would have to jiggle with two
arrays, and it somehow evaded me that it is actually way simpler
to just add another entry to the arrays if needed.
Binary size goes slightly down, e.g. tr statically linked against
musl: 68072 -> 67688

Behind the scenes though the conversion should be a bit faster and,
more importantly, the scary case-conversion function is simplified
and easier to understand.

It also drops nearly half the LOC in upperrune.c and lowerrune.c.
2015-02-12 12:28:45 +01:00
FRIGN
73577f10a0 Scrap chartorunearr(), introducing utftorunestr()
Interface and function as proposed by cls.

The reasoning behind this function is that cls expressed his
interest to keep memory allocation out of libutf, which is a
very good motive.
This simplifies the function a lot and should also increase the
speed a bit, but the most important factor here is that there's
no malloc anywhere in libutf, making it a lot smaller and more
robust with a smaller attack-surface.

Look at the paste(1) and tr(1) changes for an idiomatic way to
allocate the right amount of space for the Rune-array.
2015-02-11 21:32:09 +01:00
FRIGN
1c462012e4 Rename variable Rune * p -> r in fgetrune() 2015-02-11 21:14:28 +01:00
FRIGN
7c578bf5b0 Scrap writerune(), introducing fputrune()
Interface and function as proposed by cls.
Code is also shorter, everything else analogous to fgetrune().
2015-02-11 20:58:00 +01:00
FRIGN
a5ae899a48 Scrap readrune(), introducing fgetrune()
Interface as proposed by cls, but internally rewritten after a few
considerations.
The code is much shorter and to the point, aligning itself with other
standard functions. It should also be much faster, which is not bad.
2015-02-11 20:16:49 +01:00
FRIGN
f9846a9a6b Split up is*rune() and to*rune() functions into individual source files
This optimizes the binary size for each tool that uses these functions.
Previously, if a program just used one single function, maybe even a
one-liner, it would statically compile in all lookup-tables, bloating
the binary by up to 20K.
All these changes are derived from a local libutf where I do the
primary changes. So I hope that I can merge these things into libutf
sooner or later, as discussed on the ml.
2015-02-11 15:48:18 +01:00
FRIGN
22f40b5a03 Add explicit boundary to loop in readrune()
You never know what could happen. Better have a "blind" read than
a segmentation fault.
2015-02-01 04:20:02 +01:00
FRIGN
696bb992c3 Return number of bytes read even on a partial read
and set the rune to Runeerror for later checking (if desired).
2015-02-01 03:54:56 +01:00
FRIGN
f2e6af7350 Return number of bytes read in readrune()
Could be useful in some cases...
2015-02-01 03:43:54 +01:00
Hiltjo Posthuma
4469b0b641 chartorunearr: initialize ret 2015-01-10 17:00:01 +00:00
FRIGN
a582cb8a2f Rewrite tr(1) in a sane way
tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:

 - UTF-8: not allowed in POSIX, but in my opinion a must. This
          finally allows you to work with UTF-8 streams without
          problems or unexpected behaviour.
 - Equivalence classes: Left out, even GNU coreutils ignore them
                        and depending on LC_COLLATE, which sucks.
 - Character classes: No experiments or environment-variable-trickery.
                      Just plain definitions derived from the POSIX-
                      standard, working as expected.

I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for literal
expressions ('\n', '\t', '\001', ...) and probably rethinking
the way [_*n] is unnecessarily restricted to string2.
2015-01-10 14:26:30 +00:00
sin
18850f5dfa writerune() should operate on a FILE * 2014-11-21 16:34:57 +00:00
sin
5b5bb82ec0 Factor out readrune and writerune 2014-11-21 16:31:16 +00:00
sin
56709a2414 Import libutf from http://git.suckless.org/libutf 2014-11-17 15:46:01 +00:00