I botched readrec's definition of a record when I implemented
RS regular expression support. This is the relevant hunk from the
old diff:
```
- return c == EOF && rr == buf ? 0 : 1;
+ isrec = *buf || !feof(inf);
+ dprintf( ("readrec saw <%s>, returns %d\n", buf, isrec) );
+ return isrec;
```
Problem #1:
Unlike the old test against EOF, `*buf || !feof(inf)` is blind to
stdio errors. This can cause an infinite loop in which each iteration
fabricates an empty record.
The following demonstration uses standard terminal access control
policy to produce a persistent error condition. Note that the "i/o
error" message does not come from readrec(). It's produced much later
by closeall() at shutdown.
```
$ trap '' SIGTTIN && awk 'END {print NR}' &
[1] 33517
$ # After fg, type ^D
$ fg
trap '' SIGTTIN && awk 'END {print NR}'
13847376
awk: i/o error occurred on /dev/stdin
input record number 13847376, file
source line number 1
```
Each time awk tries to read the terminal from the background while
ignoring SIGTTIN, the read fails with EIO, getc returns EOF, the
stream's end-of-file indicator remains clear, and `!feof`
erroneously promotes the empty buffer to an empty record. So long
as the error persists, the stream's position does not advance and
end-of-file is never set.
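The distinction that `!feof` misses can be sketched in C. This is an
illustration only, not code from awk: the `read_end` enum and the
`classify_getc` helper are hypothetical names. It shows that an EOF
return from getc has two possible causes, which only ferror and feof
can tell apart.

```c
#include <stdio.h>

/* Hypothetical names, for illustration only; not part of awk. */
enum read_end { READ_CHAR, READ_EOF, READ_ERROR };

/*
 * Classify the result of c = getc(fp).  getc returns EOF both at
 * end-of-file and on a read error, so a test that consults only
 * feof(fp) mistakes a persistent error for "more input to come".
 */
static enum read_end
classify_getc(FILE *fp, int c)
{
	if (c != EOF)
		return READ_CHAR;	/* an ordinary character */
	if (ferror(fp))
		return READ_ERROR;	/* e.g. EIO; feof(fp) is not set */
	return READ_EOF;		/* genuine end of input */
}
```

A reader that stops on either READ_EOF or READ_ERROR cannot spin on a
stream whose error never clears.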
Problem #2:
When RS is a regex, `*buf || !feof(inf)` can't see an empty record's
terminator at the end of a stream.
```
$ echo a | awk 1 RS='a\n'
$
```
That pipeline should have found one empty record and printed a blank
line, but `*buf || !feof(inf)` treats reaching the end of the stream
as the conclusion of a fruitless search. That's only correct when
the terminator is a single character, because a regex RS search can
set the end-of-file indicator even when it succeeds.
The Fix
`isrec` must be 0 **iff** no record is found. The correct definition
of "no record" is a failure to find a record terminator and a
failure to find any data (possibly from a final, unterminated
record). Conceptually, for any RS:
```
isrec = (noTERM && noDATA) ? 0 : 1
```
noDATA is an expression that's true if `buf` is empty, false otherwise.
When RS is null or a single character, noTERM is an expression
that is true when the sought-after character is not found, false
otherwise. Since the search for a single character can only end with
that character or EOF, noTERM is `c == EOF`.
```
isrec = (c == EOF && rr == buf) ? 0 : 1
```
When RS is a regular expression, noTERM is an expression that is
true if a match for RS is not found, false otherwise. This is simply
the negation of the result of the function that conducts the search,
`!found`.
```
isrec = (found == 0 && *buf == '\0') ? 0 : 1
```
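Placing the botched test next to the two corrected ones makes the
difference concrete. This is a minimal sketch, not the code in the
commit: the function names are hypothetical, and `c`, `rr`, `buf` and
`found` are passed as parameters instead of being readrec's locals.
`at_eof` stands in for `feof(inf)`, and `rr` is assumed to be the
write position in `buf`, so `rr == buf` means no data was stored.

```c
#include <stdio.h>

/* The botched test: blind to errors and to a match found at EOF. */
static int
isrec_botched(const char *buf, int at_eof)
{
	return *buf || !at_eof;
}

/* RS is null or a single character: noTERM is c == EOF. */
static int
isrec_char(int c, const char *buf, const char *rr)
{
	return (c == EOF && rr == buf) ? 0 : 1;
}

/* RS is a regular expression: noTERM is !found. */
static int
isrec_regex(int found, const char *buf)
{
	return (found == 0 && *buf == '\0') ? 0 : 1;
}
```

For `echo a | awk 1 RS='a\n'` the search succeeds with an empty buffer
at end-of-file: `isrec_regex(1, "")` yields 1, whereas
`isrec_botched("", 1)` yields 0 and the empty record is lost.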
RS ^-anchoring needs to know if it's reading the first record of a file.
Unfortunately, innew, the flag that the main i/o loop uses to track
this, didn't make it from NetBSD unscathed. This commit restores the
last of the wayward lines.
Without this fix, when reading the first record of an input file named
on the command line, the regular expression machinery will be
misconfigured, precluding a successful match.
Relevant commits:
1. 643a5a3dad (Initial import)
2. ffee7780fe (Restoring innew)
If awk prints an error message while compile_time is still set
to ERROR_PRINTING, don't try to print the context, since there is
none. This can happen due to a problem with, e.g., unknown command
line options.
POSIX specifies a dprintf function that operates on an fd instead of
a stdio stream. Using upper case for macros is more idiomatic too.
We no longer need to use an extra set of parentheses for debugging
printf statements.
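A variadic macro is what removes the need for the doubled parentheses.
The sketch below assumes C99 `__VA_ARGS__`, an upper-case name
`DPRINTF`, a `dbg` flag, and a `dbgfp` stream pointer; all of these
are illustrative, and the definition in awk's own headers may differ.

```c
#include <stdio.h>

static int dbg = 0;		/* assumed debug flag */
static FILE *dbgfp = NULL;	/* hypothetical: debug stream, stderr if NULL */

/* Takes printf-style arguments directly; a format string is required. */
#define DPRINTF(...) \
	do { \
		if (dbg) \
			fprintf(dbgfp ? dbgfp : stderr, __VA_ARGS__); \
	} while (0)

/* Old style, for comparison:  dprintf( ("saw <%s>\n", buf) );  */
static void
trace_record(const char *buf, int isrec)
{
	DPRINTF("readrec saw <%s>, returns %d\n", buf, isrec);
}
```

The upper-case name also keeps the macro from shadowing the POSIX
dprintf declared in <stdio.h>.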
* LC_NUMERIC radix issue.
According to https://pubs.opengroup.org/onlinepubs/7990989775/xcu/awk.html,
the period character is the radix character recognized when
processing awk programs. Make it so that during output we also print
the period character, since this is what other awk implementations
do, and it makes sense from an interoperability point of view.
* print "T.builtin" in the error message
* Fix backslash continuation line handling.
* Keep track of RS processing so we apply the regex properly only once
per record.
* - enhance fpe handler to print the error type
- cleanup argument parsing
- dynamically allocate program filename array
* bison uses enums now, not #defines; make it work with that.
* We need to use either the enums or the defines but not both. This
is because bison -y will create both enums and #defines, while bison
without -y produces only the enums, and byacc produces just #defines.
* fix indentation
* Set the tokentype when we have a match in the scan, and reset it later
when we decide that the match was bad. Fixes nbyacc.
* - don't use pattern rules for portability
- try to move both flavors of generated names for portability
* Amend tests for the new error messages
MB_CUR_MAX is the maximum number of bytes in a multibyte character
for the current locale, and might not be a constant expression.
MB_LEN_MAX is the maximum number of bytes in a multibyte character
for any locale, and always expands to a constant-expression.
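The practical consequence is that only MB_LEN_MAX can size a plain
array. A minimal sketch follows; `mb_length` is a hypothetical helper
written for this note, not a function from awk.

```c
#include <limits.h>	/* MB_LEN_MAX */
#include <stdlib.h>	/* MB_CUR_MAX */
#include <string.h>
#include <wchar.h>

/*
 * Return the number of bytes wc occupies in the current locale's
 * multibyte encoding, or -1 if it cannot be represented.
 */
static int
mb_length(wchar_t wc)
{
	/*
	 * MB_LEN_MAX is a constant expression, so this is an ordinary
	 * fixed-size array.  MB_CUR_MAX may expand to a function call,
	 * which would make the array variable-length.
	 */
	char buf[MB_LEN_MAX];
	mbstate_t st;
	size_t n;

	memset(&st, 0, sizeof(st));
	n = wcrtomb(buf, wc, &st);
	return n == (size_t)-1 ? -1 : (int)n;
}
```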
* sprinkle const, static
* account for lineno in unput
* Add an EMPTY string that is used when a non-const empty string is needed.
* make inputFS static and dynamically allocated
* Simplify and in the process avoid -Wwritable-strings
* make fs const to avoid -Wwritable-strings
* More cleanups:
- sprinkle const
- add a macro (setptr) that cheats const to temporarily NUL terminate strings
- remove casts from allocations
- use strdup instead of strlen+strcpy
- use x = malloc(sizeof(*x)) instead of x = malloc(sizeof(type of *x))
- add -Wcast-qual (and casts through uintptr_t in the two macros we
cheat (xfree, setptr)).
* More cleanups:
- add const
- use bounded sscanf
- use snprintf instead of sprintf
* More cleanup:
- use snprintf/strlcat instead of sprintf/strcat
- use %j instead of %l since we are casting to intmax_t/uintmax_t
* Merge the 3 copies of the code that evaluated array strings with separators
and convert them to keep track of lengths and use memcpy instead of strcat.
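The length-tracking approach used when merging the array-string code
can be sketched as follows. `join_subscripts` and its parameters are
hypothetical and do not reproduce awk's actual routine; the point is
only that each piece is appended with memcpy at a known offset, so the
buffer is never rescanned the way strcat would rescan it.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Join n strings with sep between them.  Returns a malloc'd,
 * NUL-terminated string that the caller frees, or NULL on failure.
 */
static char *
join_subscripts(const char *const *parts, size_t n, const char *sep)
{
	size_t seplen = strlen(sep);
	size_t len = 0, cap = 1;	/* 1 for the terminating NUL */
	size_t i;
	char *buf;

	/* First pass: compute the exact size needed. */
	for (i = 0; i < n; i++)
		cap += strlen(parts[i]) + (i > 0 ? seplen : 0);
	buf = malloc(cap);
	if (buf == NULL)
		return NULL;

	/* Second pass: append at the tracked offset. */
	for (i = 0; i < n; i++) {
		size_t plen = strlen(parts[i]);

		if (i > 0) {
			memcpy(buf + len, sep, seplen);
			len += seplen;
		}
		memcpy(buf + len, parts[i], plen);
		len += plen;
	}
	buf[len] = '\0';
	return buf;
}
```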