For sort(1) we need memmem(), which I imported from OpenBSD.
Inside sort(1), the changes involved working with the explicit lengths
given by getlines() earlier and rewriting some of the functions.
Now we can handle NUL-characters in the input just fine.
strmem() was not very well thought out. The thing is the following:
If the string contains a zero character, we want to match it, and not
stop right there in place.
The "real" solution is to use memmem() where needed and replace all
functions that assume zero-terminated-strings from standard input, which
could lead to early string-breakoffs.
This requires a strict tracking of string lengths.
We want our delimiters to also contain 0 characters and have them
handled gracefully.
To accomplish this, I wrote a function strmem(), which looks for a
certain, arbitrarily long memory subset in a given string.
memmem() is a GNU extension and forces you to call strlen every time.
Yeah well, the old topic. POSIX allows \0123 and \123 octals in
different tools, in printf, depending on %b or other things.
We'll just keep it simple and just allow 4 digits. the 0 does not make
a difference anyway.
When we move the exit() out of venprintf(), we can reuse it for
weprintf(), which basically had duplicate code.
I also renamed venprintf() to xvprintf (extended vprintf) so it's
more obvious what it actually does.
This reverts commit a564a67c4ea70e90a4dc543814458e4903869d3e.
Not as trivial as I thought. This breaks cp when used as:
cp -r /foo/bar /baz
The old code expands this to:
cp -r /foo/bar /baz/bar
This is a utility function to allow easy parsing of file or other
offsets, automatically taking in regard suffixes, proper bases and
so on, for instance used in split(1) -b or od -j, -N(1).
Of course, POSIX is very arbitrary when it comes to defining the
parsing rules for different tools.
The main focus here lies on being as flexible and consistent as
possible. One central utility-function handling the parsing makes
this stuff a lot more trivial.
Otherwise, we run into problems in a typical autoconf-based build
system:
- config.status is created at some point between two seconds.
- config.status is run, generating Makefile by first writing to a file
in /tmp, and then mv-ing it to Makefile.
- If this mv happens before the beginning of the next second, Makefile
will be created with the same tv_sec as config.status, but with
tv_nsec = 0.
- When make runs, it sees that Makefile is older than config.status,
and re-runs config.status to generate Makefile.
In general, POSIX does not define /dev/std{in, out, err} because it
does not want to depend on the dev-filesystem.
For utilities, it thus introduced the '-'-keyword to denote standard
input (and output in some cases) and the programs have to deal with
it accordingly.
Sadly, the design of many tools doesn't allow strict shell-redirections
and many scripts don't even use this feature when possible.
Thus, we made the decision to implement it consistently across all
tools where it makes sense (namely those which read files).
Along the way, I spotted some behavioural bugs in libutil/crypt.c and
others where it was forgotten to fshut the files after use.
General convention is to use size_t to store sizes of all kinds.
Internally, the function uses double anyway, but at least this
doesn't clobber up the API any more and there's a chance in the
future to make this function a bit cleaner and not use this dirty
static buffer hack any more.
recurse() is getting smarter every day. I expect it to pass the Turing
test in a few months.
Along the way, it was reported that "rm -f" on nonexistant files reports
their missing as an internal recurse()-error.
So recurse() knows when to shut up, I added the SILENT flag to fix all
these things.
The restructuring of recurse() in the last few weeks actually broke
the recursion-flags in different tools.
As a long-term goal, the recursor should have a field "maxdepth"
which should be "1" for the non-Rflag-case. "0" stands for unlimited.
Basically, it's a conflict between POSIX and ISO C what do to when
input streams are passed to fflush().
POSIX mandates that the seeking-position should be synced, but ISO C
says it's undefined behaviour.
We love POSIX, but the standard-documents specify that in all conflict
cases, ISO C wins, so this breaks with EBADF on BSD's.
musl and glibc follow POSIX behaviour, which makes sense, but involves
numerous portability concerns.
To get around this, we just don't check fflush() and rely on the fact
that no implementation sets ferror on the file-stream in fflush if it
is an input stream, so every issue caught in fflush() is caught later
with ferror() and fclose().
Add a comment to fshut() because this stuff is so complicated, it
took us a day to figure out.
This has been a known issue for a long time. Example:
printf "word" > /dev/full
wouldn't report there's not enough space on the device.
This is due to the fact that every libc has internal buffers
for stdout which store fragments of written data until they reach
a certain size or on some callback to flush them all at once to the
kernel.
You can force the libc to flush them with fflush(). In case flushing
fails, you can check the return value of fflush() and report an error.
However, previously, sbase didn't have such checks and without fflush(),
the libc silently flushes the buffers on exit without checking the errors.
No offense, but there's no way for the libc to report errors in the exit-
condition.
GNU coreutils solve this by having onexit-callbacks to handle the flushing
and report issues, but they have obvious deficiencies.
After long discussions on IRC, we came to the conclusion that checking the
return value of every io-function would be a bit too much, and having a
general-purpose fclose-wrapper would be the best way to go.
It turned out that fclose() alone is not enough to detect errors. The right
way to do it is to fflush() + check ferror on the fp and then to a fclose().
This is what fshut does and that's how it's done before each return.
The return value is obviously affected, reporting an error in case a flush
or close failed, but also when reading failed for some reason, the error-
state is caught.
the !!( ... + ...) construction is used to call all functions inside the
brackets and not "terminating" on the first.
We want errors to be reported, but there's no reason to stop flushing buffers
when one other file buffer has issues.
Obviously, functionales come before the flush and ret-logic comes after to
prevent early exits as well without reporting warnings if there are any.
One more advantage of fshut() is that it is even able to report errors
on obscure NFS-setups which the other coreutils are unable to detect,
because they only check the return-value of fflush() and fclose(),
not ferror() as well.
It's not useful when 0 is returned anyway, so be sure that we have a
string with length > 0, this also solves some indexing-gotchas like
"len - 1" and so on.
Also, add checked getline()'s whenever it has been forgotten and
clean up the error-messages.
I've been wanting to do this for a while now, as tar(1) used to
be one of messiest and cruftiest tools.
First off, before walking through the audit, I'll talk about
what the DIRFIRST-flag for recurse() does.
It basically calls fn() on the first-level-dir before calling
it's subentries. It's necessary here, because else the order
of the tar-files would've been wrong (it would try to create
dir/file before creating dir/).
Now, to the audit:
1) Update manpage, fix mistake that compression is also available
for compressing. It's only available for extracting.
2) Define the major, minor and makedev macros from glibc by ourselves.
No need to rely on them, as they are common sense.
decomp()
3) Simple refactorization.
putoctal()
4) Add a truncation check for snprintf().
archive()
5) BUGFIX: Add checks to any checkable function, don't blindly call
them, this is harmful and there are 100 ways to exploit that.
6) Use estrlcpy() instead of snprintf() wherever possible, fix
alignment.
7) BUGFIX: Terminate the result-buffer of readlink(), check if
it even succeeded.
8) Fix sizeof()-formatting.
unarchive()
9) BUGFIX: Add checks to any checkable function, don't blindly call
them, this is harmful and there are 100 ways to exploit that.
10) BUGFIX: strtoul can happily return negative numbers. Add checks
for that and also if the full string has been processed.
11) Remove calls to perror(). We have eprintf, use it.
12) BUGFIX: "minor = strtoul(h->mode, 0, 8);". We need h->minor of
course.
13) Fix typo "usupported", remove fprintf-call.
print()
14) Check fread().
xt()
15) Get rid of snprintf-magic. Use estrlcat().
16) BUGFIX: check for ferror() on the tarfile.
usage()
17) Update it. The old usage() was like 1000 years old.
main()
18) Add DIRFIRST-flag to the recursor.
19) Don't print usage() when a mode is re-set. We allow this in
general.
20) Add function checks and fix error messages.
21) Add tarfilename-global for proper error-messages.
1) Rename cp_HLPflag -> cp_follow for consistency.
2) Use function-pointers for stat to clear up the code.
3) BUGFIX: TERMINATE THE RESULT BUFFER OF READLINK !!!
It's something I noticed earlier and it actually lead to some
pretty insane behaviour on our side using glibc (musl somehow
magically solves this).
Basically, symlinks used to contain the data of the file they
pointed to. I wondered for weeks where this came from and now
this has finally been solved.
4) BUGFIX: Do not unconditionally unlink target-files. Even GNU
coreutils do it wrong.
The basic idea is this:
If fflag == 0 --> don't touch target files if they exist.
If fflag == 1 --> unlink all and don't error out when we try
to unlink a file which doesn't exist.
5) Use estrlcpy and estrlcat instead of snprintf for path building.
6) Make it clearer what happens in preserve.
Okay, why yet another recurse()-refactor?
The last one added the recursor-struct, which simplified things
on the user-end, but there was still one thing that bugged me a lot:
Previously, all fn()'s were forced to (l)stat the paths themselves.
This does not work well when you try to keep up with H-, L- and P-
flags at the same time, as each utility-function would have to set
the right function-pointer for (l)stat every single time.
This is not desirable. Furthermore, recurse should be easy to use
and not involve trouble finding the right (l)stat-function to do it
right.
So, what we needed was a stat-argument for each fn(), so it is
directly accessible. This was impossible to do though when the
fn()'s are still directly called by the programs to "start" the
recurse.
Thus, the fundamental change is to make recurse() the function to
go, while designing the fn()'s in a way they can "live" with st
being NULL (we don't want a null-pointer-deref).
What you can see in this commit is the result of this work. Why
all this trouble instead of using nftw?
The special thing about recurse() is that you tell the function
when to recurse() in your fn(). You don't need special flags to
tell nftw() to skip the subtree, just to give an example.
The only single downside to this is that now, you are not allowed
to unconditionally call recurse() from your fn(). It has to be
a directory.
However, that is a cost I think is easily weighed up by the
advantages.
Another thing is the history: I added a procedure at the end of
the outmost recurse to free the history. This way we don't leak
memory.
A simple optimization on the side:
- if (h->dev == st.st_dev && h->ino == st.st_ino)
+ if (h->ino == st.st_ino && h->dev == st.st_dev)
First compare the likely difference in inode-numbers instead of
checking the unlikely condition that the device-numbers are
different.
Be more pedantic about the error-checking, fread can also return
values > 0 even though there has been a read-error.
We want to write the last incoming data and then bail.