After a short correspondence with Otto Moerbeek it turned out
mallocarray() is only in the OpenBSD-Kernel, because the kernel-
malloc doesn't have realloc.
Userspace applications should rather use reallocarray with an
explicit NULL-pointer.
Assuming reallocarray() will become available in c-stdlibs in the
next few years, we nip mallocarray() in the bud to allow an easy
transition to a system-provided version when the day comes.
A function used only in the OpenBSD-Kernel as of now, but it surely
provides a helpful interface when you just don't want to make sure
the incoming pointer to erealloc() is really NULL so it behaves
like malloc, making it a bit more safer.
Talking about *allocarray(): It's definitely a major step in code-
hardening. Especially as a system administrator, you should be
able to trust your core tools without having to worry about segfaults
like this, which can easily lead to privilege escalation.
How do the GNU coreutils handle this?
$ strings -n 4611686018427387903
strings: invalid minimum string length -1
$ strings -n 4611686018427387904
strings: invalid minimum string length 0
They silently overflow...
In comparison, sbase:
$ strings -n 4611686018427387903
mallocarray: out of memory
$ strings -n 4611686018427387904
mallocarray: out of memory
The first out of memory is actually a true OOM returned by malloc,
whereas the second one is a detected overflow, which is not marked
in a special way.
Now tell me which diagnostic error-messages are easier to understand.
Stateless and I stumbled upon this issue while discussing the
semantics of read, accepting a size_t but only being able to return
ssize_t, effectively lacking the ability to report successful
reads > SSIZE_MAX.
The discussion went along and we came to the topic of input-based
memory allocations. Basically, it was possible for the argument
to a memory-allocation-function to overflow, leading to a segfault
later.
The OpenBSD-guys came up with the ingenious reallocarray-function,
and I implemented it as ereallocarray, which automatically returns
on error.
Read more about it here[0].
A simple testcase is this (courtesy to stateless):
$ sbase-strings -n (2^(32|64) / 4)
This will segfault before this patch and properly return an OOM-
situation afterwards (thanks to the overflow-check in reallocarray).
[0]: http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man3/calloc.3
Interface and function as proposed by cls.
The reasoning behind this function is that cls expressed his
interest to keep memory allocation out of libutf, which is a
very good motive.
This simplifies the function a lot and should also increase the
speed a bit, but the most important factor here is that there's
no malloc anywhere in libutf, making it a lot smaller and more
robust with a smaller attack-surface.
Look at the paste(1) and tr(1) changes for an idiomatic way to
allocate the right amount of space for the Rune-array.
Interface as proposed by cls, but internally rewritten after a few
considerations.
The code is much shorter and to the point, aligning itself with other
standard functions. It should also be much faster, which is not bad.
Equivalence classes are a hard matter and there's still no "standard"
way to solve the issue.
Previously, tr would just skip those classes, but it's much
better when it resolves a [=c=] to a normal c instead of treating
it as a literal.
Also, reflect recent changes in the manpage (octal escapes) and fix
the markup in some areas.
This is one aspect which I think has blown up the complexity of many
tr-implementations around today.
Instead of complicating the set-theory-based parser itself (he should
still be relying on one rune per char, not multirunes), I added a
preprocessor, which basically scans the code for upcoming '\'s, reads
what he finds, substitutes the real character onto '\'s index and shifts
the entire following array so there are no "holes".
What is left to reflect on is what to do with octal sequences.
I have a local implementation here, which works fine, but imho,
given tr is already so focused on UTF-8, we might as well ignore
POSIX at this point and rather implement the unicode UTF-8 code points,
which are way more contemporary and future-proof.
Reading in \uC3A4 as a an array of 0xC3 and 0xA4 is not the issue,
but I'm still struggling to find a way to turn it into a well-formed
byte sequence. Hit me with a mail if you have a simple solution for
that.
It's standard behaviour to map a whole class of matched objects
to the last element of a given simple set2 instead of just passing
it through.
Also, error out more strictly when the user gives us bogus sets.
tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:
- UTF-8: not allowed in POSIX, but in my opinion a must. This
finally allows you to work with UTF-8 streams without
problems or unexpected behaviour.
- Equivalence classes: Left out, even GNU coreutils ignore them
and depending on LC_COLLATE, which sucks.
- Character classes: No experiments or environment-variable-trickery.
Just plain definitions derived from the POSIX-
standard, working as expected.
I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for literal
expressions ('\n', '\t', '\001', ...) and probably rethinking
the way [_*n] is unnecessarily restricted to string2.
It actually makes the binaries smaller, the code easier to read
(gems like "val == true", "val == false" are gone) and actually
predictable in the sense of that we actually know what we're
working with (one bitwise operator was quite adventurous and
should now be fixed).
This is also more consistent with the other suckless projects
around which don't use boolean types.
- Added support for character ranges ( a-z )
- Added support for complementary charset ( -c ), only in delete mode
- Added support for octal escape sequences
- Unicode now only works when there are no octal escape sequences,
otherwise behavior is not predictable at first sight.
- tr now supports null characters in the input
- Does not yet have support for character classes ( [:upper:] )