A crafted archive could crash the reader with an out-of-bounds read in
the directory-skip path (uc2_finish_cdir -> uc2_read_cdir -> uc2_get_tag).
decompress_cdir allocates cdir_buf inside its decode loop but, on its
error paths (decode failure or a checksum mismatch), returned before
setting cdir_range.end -- leaving cdir_buf non-NULL with a stale end. A
later uc2_read_cdir/uc2_finish_cdir then saw cdir_buf != NULL, skipped
re-reading, and walked a range whose end pointed below its start, so
range_len wrapped and range_get handed out wild pointers. Free cdir_buf
on every error path so the invariant "cdir_buf != NULL iff cdir_range is
valid" holds, and make range_len report an empty range (rather than a
huge one) if end ever precedes ptr, as defense in depth for the whole
parser.
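
A minimal sketch of the hardened length check, assuming a two-pointer
range type. The struct layout and the range_get signature are
illustrative guesses, not the parser's actual definitions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parser's range type. */
struct range {
    const uint8_t *ptr;  /* next unread byte */
    const uint8_t *end;  /* one past the last valid byte */
};

/* Number of readable bytes; an inverted range counts as empty. */
static size_t range_len(const struct range *r)
{
    if (r->end < r->ptr)
        return 0;
    return (size_t)(r->end - r->ptr);
}

/* Hands out n bytes and advances, or returns NULL when too few remain. */
static const uint8_t *range_get(struct range *r, size_t n)
{
    if (n > range_len(r))
        return NULL;
    const uint8_t *p = r->ptr;
    r->ptr += n;
    return p;
}
```

With the clamp in place, a stale end can only make reads fail; it can
no longer turn into a huge length.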
Also add a compression-ratio ceiling to the cdir decode: a tiny crafted
stream can expand via long matches, so abort once the output far
outgrows the compressed bytes consumed.
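
One way such a ceiling could look. The two constants and the function
name are assumptions for illustration; the actual thresholds are not
stated here:

```c
#include <stdint.h>

/* Assumed limits; the real thresholds are not given in the text. */
#define CDIR_RATIO_SLACK 4096u  /* output bytes that are always allowed */
#define CDIR_RATIO_MAX   1024u  /* further output bytes allowed per input byte */

/* Returns 1 when `produced` output bytes are implausible for
 * `consumed` input bytes, 0 otherwise. */
static int ratio_exceeded(uint64_t consumed, uint64_t produced)
{
    if (produced <= CDIR_RATIO_SLACK)
        return 0;
    /* Divide instead of multiplying so a huge `consumed` cannot overflow. */
    return (produced - CDIR_RATIO_SLACK) / CDIR_RATIO_MAX > consumed;
}
```

The decode loop would call this after each block and abort on 1.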
Found with a new libFuzzer harness (tests/fuzz/, not built by default).
No memory-safety findings over sustained fuzzing after this change;
ctest passes 22/22 on Release and ASan. A residual slow-input timeout
via a separate decode path is tracked for follow-up.
The read handler now composes full directory paths from the cdir's
directory ids rather than emitting bare leaf names: build_dir_path
walks the parent chain (root dirid 0, depth-capped against cyclic
cdirs), so multi-file archives with subdirectories list correctly.
Master-block resolution (M4) and tagged long names (M6) already work
through libuc2's extract and tag paths; this adds a libarchive
round-trip test that creates archives at Huffman and rANS levels and
verifies every byte back through libarchive's public API. Documents
the plugin build recipe (libarchive source tree + static lib).
Verified against libarchive 3.7.7; round-trip clean under valgrind.
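
The parent-chain walk can be sketched as below. The dir_entry record,
the '/' separator, the depth cap of 64, and this build_dir_path
signature are assumptions for illustration, not the plugin's code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DIR_DEPTH 64  /* assumed cap; stops cyclic parent chains */

struct dir_entry {        /* hypothetical flattened cdir directory record */
    uint32_t id;
    uint32_t parent;      /* 0 means the archive root */
    const char *name;
};

static const struct dir_entry *find_dir(const struct dir_entry *dirs,
                                        size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (dirs[i].id == id)
            return &dirs[i];
    return NULL;
}

/* Writes "a/b/leaf" into out. Returns 0 on success, -1 on a missing
 * id, a chain deeper than MAX_DIR_DEPTH, or a path that does not fit. */
static int build_dir_path(const struct dir_entry *dirs, size_t n,
                          uint32_t dirid, const char *leaf,
                          char *out, size_t outsz)
{
    size_t pos = outsz;                /* fill the buffer from the back */
    size_t len = strlen(leaf);

    if (len + 1 > outsz)
        return -1;
    pos -= len + 1;
    memcpy(out + pos, leaf, len + 1);  /* includes the terminating NUL */

    for (int depth = 0; dirid != 0; depth++) {
        const struct dir_entry *d = find_dir(dirs, n, dirid);
        if (d == NULL || depth >= MAX_DIR_DEPTH)
            return -1;
        len = strlen(d->name);
        if (len + 1 > pos)
            return -1;
        pos -= len + 1;
        memcpy(out + pos, d->name, len);
        out[pos + len] = '/';
        dirid = d->parent;
    }
    memmove(out, out + pos, outsz - pos);
    return 0;
}
```

Building from the leaf backwards avoids a second pass to reverse the
component order.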
Same defect class as test_ingest (ac01b32): hardcoded /tmp and a
shell rm -rf gave the test nothing real to do on the Windows runner.
Temp store now lands in %TEMP% and cleanup uses a portable rmtree
(dirent on POSIX, _findfirst on MSVC) over the store's two-level
layout.
The test hardcoded /tmp, which does not exist on the Windows runner.
With NDEBUG compiling the asserts out, the NULL stream from the failed
fopen reached fclose() and tripped the UCRT invalid-parameter fail-fast
(0xc0000409). Temp files now go to %TEMP% on Windows; rm -rf and unlink
are replaced with ISO C remove(); file-handle acquisition failures now
exit loudly instead of relying on assert.
Add an mdoc man page covering all modes and the OTS/ingest long
options, verified with groff and NetBSD mandoc. CMake installs the
binary and the man page (guarded against add_subdirectory
embedding). Also
corrects the stale direction-1 comment in the DOSBox round-trip
script: multi-file archives created by v3 have extracted fine in the
original since the custom-Huffman-tree fix.
Extraction of level 6-9 archives crashed (first seen on NetBSD/sdf.org,
reproducible everywhere), and files larger than the 64KB sliding window
silently corrupted at every level. Four causes:
- cli: master COMPRESS records hardcoded method 1 while master data was
compressed at opt.level, so rANS masters were fed to the Huffman
decoder. Records now carry method 10 at levels 6-9; levels 2-5 keep
method 1 for original UC2 Pro compatibility.
- decompress: decompressor_rans stopped at remaining == 0 without
consuming the end-of-block pair and its 12 extra bits, leaving the
bit cursor desynchronized; the next block-present read landed inside
the EOB extras and parsed a phantom block. The loop now decodes all
nsyms symbols and guards output writes instead.
- decompress: a refill read returning a single byte into an empty
buffer let head overtake tail in bits_feed; the unsigned difference
wrapped and head walked off the 4KB buffer (the actual segfault).
The refill now loops until a full byte pair is available, and a
sticky error flag stops the decoder treating negative bits_get
returns as data.
- compress/decompress: chunk loads wrote linearly past the circular
window edge, and the rANS decoder flushed output in one linear write
that cannot express ring wrap. Loads are now capped at the edge and
the decoder flushes incrementally in ring order.
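
For the last item, a write that respects the ring edge can be sketched
as follows. The function name and signature are illustrative; only the
64KB window size comes from the text above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 65536u  /* 64KB sliding window, as in the text */

/* Copies n bytes into the circular window starting at *pos, splitting
 * the copy at the window edge, and advances *pos modulo WINDOW_SIZE. */
static void ring_write(uint8_t *window, size_t *pos,
                       const uint8_t *src, size_t n)
{
    while (n > 0) {
        size_t room = WINDOW_SIZE - *pos;  /* bytes left before the edge */
        size_t take = n < room ? n : room;
        memcpy(window + *pos, src, take);
        *pos = (*pos + take) % WINDOW_SIZE;
        src += take;
        n -= take;
    }
}
```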
Also: BCJ E8/E9 byte assembly no longer shifts promoted ints into the
sign bit, and the libarchive plugin uses timegm on NetBSD/OpenBSD/
DragonFly so DOS timestamps are not offset by the local timezone.
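
On the BCJ byte assembly: a sketch of the safe form, with an
illustrative helper name. Each byte is widened before it is shifted:

```c
#include <stdint.h>

/* Assembles a little-endian 32-bit operand. Each byte is cast to
 * uint32_t first; a bare `p[3] << 24` would promote to int and shift
 * a set top bit into the sign bit, which is undefined behavior. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```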
New cli_bigfile regression test (>128KB round-trip at L5 and L6); it
fails against the previous binary and passes now. Verified: 22/22
ctest including the DOSBox-X round-trip against original uc2pro.exe,
ASan/UBSan clean, and the full matrix on NetBSD 10 (sdf.org).
Scope shift from the original "make output a real UC2 v3 archive"
issue: that requires a new entry type or compress.c refactor (UC2
archives have one master per file, not a chain). This commit ships
the closest-in-spirit upgrade -- a self-contained format that solves
v1's main UX wart, the sidecar <archive>.blocks/ directory.
Format v2:
  +0    8B   magic "UC2INGST"
  +8    1B   version (2)
  +9    1B   cdc_bits
  +10   2B   reserved
  +12   4B   chunk_count
  +16   ...  chunk_count * 16B: 8B hash, 4B length, 4B offset
  ...        chunk pool: unique chunks back-to-back at recorded offsets
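
A sketch of validating the fixed header. Little-endian integers are an
assumption (the byte order is not stated above), and the struct and
function names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ingest_hdr {      /* hypothetical in-memory view of the header */
    uint8_t  version;
    uint8_t  cdc_bits;
    uint32_t chunk_count;
};

/* Assumption: multi-byte fields are little-endian. */
static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 0 and fills *h, or -1 on bad magic, a version other than 2,
 * or a buffer too short to hold the chunk table. */
static int parse_ingest_v2(const uint8_t *buf, size_t len,
                           struct ingest_hdr *h)
{
    if (len < 16 || memcmp(buf, "UC2INGST", 8) != 0)
        return -1;
    if (buf[8] != 2)
        return -1;
    h->version = buf[8];
    h->cdc_bits = buf[9];
    h->chunk_count = rd_le32(buf + 12);
    /* Each table entry is 16 bytes; compare by division to avoid overflow. */
    if ((len - 16) / 16 < h->chunk_count)
        return -1;
    return 0;
}
```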
The dedup map's capacity must be a power of two for the mask-based
linear probe to terminate. Caught when test_ingest hung at 25 chunks:
initial_cap=50 is not a power of two, so probing cycled over a subset
of slots and kept landing on occupied ones. The capacity is now
rounded up in dedup_map_init.
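
A sketch of the rounding and of why the mask needs it. Both helper
names are illustrative, not the ones in dedup_map_init:

```c
#include <stddef.h>

/* Rounds n up to the next power of two (minimum 1). Returns 0 if the
 * result would not fit in size_t. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        if (p > (size_t)-1 / 2)
            return 0;
        p <<= 1;
    }
    return p;
}

/* Next slot in a mask-based linear probe. This only visits every slot
 * when cap is a power of two, because cap - 1 is then all one-bits. */
static size_t probe_next(size_t i, size_t cap)
{
    return (i + 1) & (cap - 1);
}
```

With cap=50 the mask is 49 (binary 110001), so a probe starting at
slot 0 alternates between slots 0 and 1 and never reaches the rest.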
Trade-off: cross-archive dedup is not preserved (each --ingest call
overwrites the archive). v1 archives remain restorable through the
sidecar blockstore; the writer defaults to v2.
Tests: 6 cases (was 5). test_intra_call_dedup verifies that
identical chunks within a single ingest dedup correctly
(buffer-twice produces > 0 saved bytes). test_v2_self_contained
asserts the .blocks/ directory is NOT created for v2 archives.
Closes 96ef9b8. v3 (real UC2 v3 archive output) is filed at 59bec0d.
tests/scripts/dos_smoke.sh runs the DJGPP-built uc2 inside DOSBox-X
via the flatpak and asserts:
- uc2 -h loads under a real DPMI host and prints the banner
- uc2 -l <archive> opens an existing UC2 archive and produces output
Skips cleanly when any of uc2.exe, CWSDPMI.EXE, or DOSBox-X are
missing. CWSDPMI.EXE is the standard DJGPP DPMI extender from
csdpmi7b.zip; fetch recipe added to cmake/README-djgpp.md.
Verified locally against build-djgpp/cli/uc2.exe +
tests/archives/basic.uc2.
Closes 20019aa. CI matrix entry (9379647) remains a separate
follow-up.
Reads stdin, splits via CDC, deduplicates chunks against a sidecar
block store at <archive>.blocks/, writes a chunk-hash manifest at
<archive>. The reverse operation reads the manifest and reassembles
the byte stream from the block store.
Manifest format (magic UC2INGST) is a standalone container, not yet
unified with the master-block archive layout. Tar boundaries are not
preserved; the input is treated as an opaque byte stream. Follow-ups
filed for both.
Builds entirely on existing CDC + blockstore + merkle infrastructure.
No new compression or hashing primitives.
Tests cover small + 200 KB multichunk round-trip, idempotent dedup
(repeat ingest of the same data reports zero new chunks and exact
bytes_saved), empty stream, bad-magic rejection. Lint gate stays
green.
Closes fa0c7d4.
Same bug class as dae8a50 and 6d8087f: under -DNDEBUG (CMake's default
for Release, which CI uses) the assert macro expands to ((void)0) and
the wrapped expression is not evaluated. Calls inside assert() are
silently dropped.
Found 6 occurrences in test_ots.c (uc2_ots_varint_decode, parse_file)
where the call writes through output pointers. Under Release builds
these tests silently no-op rather than testing anything. Converted to
capture-then-check.
Audit otherwise clean: production code (lib/, cli/) has only one
assert-on-call, and it wraps a pure arithmetic helper.
Adds tests/scripts/check_assert_side_effects.py as a CI gate to keep
this class of bug out: matches assert(IDENT(...)) where IDENT contains
a side-effect verb (encode/decode/parse/...). Pure queries (_equal,
_match, _verify, _has_, _is_, _id, _root, _attest_name, memcmp, ...)
are not flagged. Wired into build.yml on the Linux runner.
Also gitignore Testing/ (CTest run outputs) and __pycache__/.
Same root cause as 97e05ad and dae8a50: assert(call(...)) under NDEBUG
strips the entire expression, including the function call. In Release
builds, uc2_delta_encode and uc2_delta_apply never ran in test_delta,
leaving 'delta' and 'recon' uninitialized. Subsequent free(delta) /
free(recon) of garbage pointers triggered Windows STATUS_HEAP_CORRUPTION
(0xc0000374). glibc on Linux happened not to notice.
Convert all assert(uc2_delta_*(...)) to the capture pattern from
97e05ad: { int _r = call; (void)_r; assert(_r == 0); }. Now the call
runs unconditionally; the assert (still NDEBUG-stripped in Release)
only loses the post-condition check, not the call itself.
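
The capture pattern in a self-contained form. make_copy is a
hypothetical stand-in for a call such as uc2_delta_encode that writes
through an output pointer:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: allocates the result through an output
 * pointer and returns 0 on success. */
static int make_copy(const char *src, char **out)
{
    size_t n = strlen(src) + 1;
    *out = malloc(n);
    if (*out == NULL)
        return -1;
    memcpy(*out, src, n);
    return 0;
}

/* Capture-then-check: the call sits outside assert(), so it still
 * runs when NDEBUG turns assert() into ((void)0). */
static char *copy_checked(const char *src)
{
    char *out = NULL;
    int r = make_copy(src, &out);
    (void)r;             /* avoids an unused-variable warning under NDEBUG */
    assert(r == 0);
    return out;
}
```

Writing assert(make_copy(src, &out) == 0) instead would leave out
untouched in Release builds.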
Windows MSVC test_delta failed with STATUS_HEAP_CORRUPTION (0xc0000374).
ASan/UBSan on Linux find nothing; the most likely Windows-specific
issue is malloc(0) in uc2_delta_apply when the target is empty
(test_empty_target). Bump to malloc(1) to get a canonical
free()-safe pointer.
Add fflush(stdout) between tests so the next CI run shows which
test (if any) still fails on Windows.
Round 2 of c67b631 cleanup. After the dirent + utime fixes, the
MSVC link surface still had:
- LNK2005 'fopen already defined': dropped g_fopen so we no longer
override the SDK's fopen. UTF-8 paths still work on Windows 10
with the active-codepage manifest; non-Unicode codepages will see
ANSI translation. This is good enough for the public release; a
full UTF-8 fopen wrapper can be added later if needed.
- LNK2019 'unresolved S_ISDIR / S_ISREG': MSVC's <sys/stat.h> defines
_S_IFDIR / _S_IFREG but not the POSIX S_IS* macros. Add them in
the unistd.h shim (which main.c already pulls).
- LNK1181 'cannot open input file m.lib': test_merkle and test_rans
linked libm unconditionally. Math is in the default CRT on MSVC;
link 'm' only on non-Windows.
- 'unistd.h' not found in test_blockstore.c: it actually only needs
getpid(). Use <process.h> + #define getpid _getpid on MSVC, keep
<unistd.h> elsewhere.
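
The S_IS* and getpid shims can be sketched roughly as below. This is
an illustration rather than the shim header itself, and is_directory
is a hypothetical helper added to exercise the macro:

```c
#include <sys/types.h>
#include <sys/stat.h>

#ifdef _MSC_VER
#  include <process.h>           /* _getpid */
#  define getpid _getpid
#  ifndef S_ISDIR
#    define S_ISDIR(m) (((m) & _S_IFMT) == _S_IFDIR)
#  endif
#  ifndef S_ISREG
#    define S_ISREG(m) (((m) & _S_IFMT) == _S_IFREG)
#  endif
#else
#  include <unistd.h>            /* getpid */
#endif

/* Returns 1 if path names a directory, 0 otherwise (including on error). */
static int is_directory(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```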
uc2_sha256: pure-C FIPS 180-4 implementation, one-shot and incremental
API, validated against published vectors (empty, abc, 56-byte,
1M 'a', byte-by-byte, every-split-point boundary).
uc2_ots: parser, serializer, and walker for the standard .ots binary
format. Strict canonical varint with 64-bit overflow check, depth-
bounded recursion, varbytes cap, max-digest cap. Walker supports
the calendar-path subset (APPEND, PREPEND, SHA256); proofs that
include other crypto ops (SHA1, RIPEMD160, KECCAK256) are accepted
as structurally valid but flagged for follow-up via the standard
'ots verify'.
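
A sketch of a strict varint decoder in that spirit, assuming the
LEB128-style encoding (7 data bits per byte, low group first, high bit
as continuation flag). The name and return convention are illustrative
and differ from uc2_ots_varint_decode:

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes an unsigned LEB128 varint. Returns the number of bytes
 * consumed, or 0 on truncation, 64-bit overflow, or a non-canonical
 * (zero-padded) encoding. */
static size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t v = 0;
    unsigned shift = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t b = buf[i];
        uint64_t group = b & 0x7f;

        if (shift > 63 || (shift == 63 && group > 1))
            return 0;                /* would not fit in 64 bits */
        v |= group << shift;
        if ((b & 0x80) == 0) {
            if (i > 0 && b == 0)
                return 0;            /* padded with a zero group */
            *out = v;
            return i + 1;
        }
        shift += 7;
    }
    return 0;                        /* ran out of input */
}
```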
UC2-OTS trailer: magic-bracketed sidecar appended after the recorded
archive bytes. Reverse-scan-safe; original UC2 Pro reader ignores
trailing bytes past its recorded length so backward compatibility is
preserved. Layout (all integers little-endian uint32):
front-magic + version + archive_len + proof_len + proof
+ proof_len + back-magic.
CLI: --ots-attach validates that the proof's leaf digest equals
SHA-256(archive[0..archive_len)) before appending and refuses to
overwrite an existing trailer unless -f is given. --ots-extract
writes the proof verbatim, byte-compatible with the standard
'ots verify'. --ots-info parses and prints the leaf, archive-match
status, and attestation list. uc2 -t recomputes the archive
SHA-256 and walks the proof.
Tests: 17 OTS unit tests (varint round-trip, canonical/overflow
rejection, file-envelope round-trip, walker on append/sha256/
sibling/unsupported-op/truncated/trailing-garbage, attest_name,
trailer round-trip + corruption rejection in 4 scenarios).
Plus an optional ctest target ots_cross_check that round-trips
the .ots through python-opentimestamps when the package is
installed; skipped (return code 77) otherwise.
uc2_merkle_root() and uc2_dict_id() return uint64_t; the int _r
temporaries from 97e05ad's NDEBUG fix truncated the high 32 bits.
Under Release the assertion was stripped, hiding the bug; under
Debug the truncated _r never matched the second uint64_t call.
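
The corrected shape, with a hypothetical stand-in for uc2_dict_id():
the temporary takes the callee's return type, so no bits are dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for uc2_dict_id(): the high 32 bits are nonzero. */
static uint64_t fake_dict_id(void)
{
    return UINT64_C(0x1234567800000042);
}

/* Capture into a temporary of the callee's return type, then check. */
static int id_is_stable(void)
{
    uint64_t r = fake_dict_id();
    (void)r;
    assert(r == fake_dict_id());
    return r == fake_dict_id();
}
```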
New library (uc2_blake3.h / uc2_blake3.c) for Phase 7:
- Pure C BLAKE3 implementation (~300 lines)
- 256-bit (32-byte) digests using BLAKE2s round function
- Bao tree hashing structure for inputs > 1024 bytes
- Incremental API (init/update/final) and one-shot helper
- Constant-time hash comparison (timing-attack resistant)
Suitable for content verification, block integrity checking,
and content-addressable storage (replacing or supplementing
the 64-bit FNV-1a hashes used in Merkle DAG and block store).
7 unit tests:
- Empty input, determinism, collision avoidance
- Incremental vs one-shot consistency
- Single-byte-at-a-time update consistency
- Avalanche effect (1-bit change → ~50% output bits flip)
- Constant-time comparison
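
A common shape for the constant-time comparison, shown as a sketch
with an illustrative name. An aggressive optimizer could in principle
change the timing, so treat it as a starting point:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the two n-byte buffers are equal, 0 otherwise. Every
 * byte is always examined, so the loop does not exit early at the
 * first difference. */
static int ct_equal(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    return diff == 0;
}
```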
New library (uc2_preprocess.h / uc2_preprocess.c) for Phase 4:
BCJ (Branch/Call/Jump) filter:
- E8/E9 x86 address normalization (relative → absolute)
- Makes calls to the same function from different locations produce
identical byte sequences, improving LZ77 matching
- Round-trip verified; address normalization confirmed
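
A minimal E8/E9 filter along these lines. The function name and the
choice of base address (offsets counted from the start of the buffer,
target measured from the end of the 5-byte instruction) are
assumptions, not necessarily what uc2_preprocess.c does:

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Converts E8/E9 operands in place: relative -> absolute when encode
 * is nonzero, absolute -> relative otherwise. */
static void bcj_x86(uint8_t *buf, size_t n, int encode)
{
    size_t i = 0;
    while (i + 5 <= n) {
        if (buf[i] != 0xE8 && buf[i] != 0xE9) {
            i++;
            continue;
        }
        uint32_t next = (uint32_t)(i + 5);  /* offset after the instruction */
        uint32_t v = load_le32(buf + i + 1);
        v = encode ? v + next : v - next;   /* wraps modulo 2^32 */
        store_le32(buf + i + 1, v);
        i += 5;                             /* operand bytes are not rescanned */
    }
}
```

Both directions skip the four operand bytes after each opcode, so they
visit the same positions and the transform is reversible.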
BWT (Burrows-Wheeler Transform):
- Suffix-array-based forward transform
- LF-mapping inverse with reverse reconstruction
- Groups similar contexts for better entropy coding
- Round-trip verified for text ("banana") and binary data
Delta filter:
- Byte-wise delta encoding with configurable stride
- Stride 1 for sequential data, stride 2+ for interleaved channels
- Constant-delta sequences (arithmetic progressions) reduce to
repeated single values
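
The stride delta can be sketched in a few lines. Function names are
illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Replaces each byte with its difference from the byte `stride`
 * positions earlier. The first `stride` bytes are left unchanged. */
static void delta_encode(uint8_t *buf, size_t n, size_t stride)
{
    if (stride == 0)
        return;
    /* Walk backwards so each byte still sees its original predecessor. */
    for (size_t i = n; i-- > stride; )
        buf[i] = (uint8_t)(buf[i] - buf[i - stride]);
}

/* Inverse of delta_encode: a running sum per stride lane. */
static void delta_decode(uint8_t *buf, size_t n, size_t stride)
{
    if (stride == 0)
        return;
    for (size_t i = stride; i < n; i++)
        buf[i] = (uint8_t)(buf[i] + buf[i - stride]);
}
```

An arithmetic progression such as 10, 13, 16, 19 becomes 10, 3, 3, 3,
which the entropy coder then handles cheaply.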
Content detection:
- Automatic content type identification (text/x86/structured/binary)
- MZ/PE and ELF header recognition for x86
- Printable ASCII ratio for text detection
11 unit tests covering all filters and detection.
New library (uc2_dict.h / uc2_dict.c) formalizes master blocks as
proper dictionaries with:
- 64-bit content hash ID (FNV-1a) for cross-archive sharing
- 32-bit integrity checksum with verification
- Portable serialization format (24-byte header + data)
- Deserialization with magic number and size validation
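
For reference, 64-bit FNV-1a with the standard offset basis and prime.
The function name is illustrative, and which bytes uc2_dict feeds into
the hash is not shown here:

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a: XOR each byte into the hash, then multiply by the
 * FNV prime. */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = UINT64_C(0xcbf29ce484222325);  /* offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= UINT64_C(0x100000001b3);           /* FNV prime */
    }
    return h;
}
```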
Combined with the block store (uc2_blockstore.h), this enables
distributed dedup: archives in different locations can reference
shared dictionaries by content hash, with integrity verification
before decompression.
6 unit tests including serialization round-trip, corruption
detection, and bad-magic rejection.
Also added plausible deniability (multi-archive with separate
passwords) to Phase 5 roadmap.
New library (uc2_rans.h / uc2_rans.c) — table-based range Asymmetric
Numeral Systems (rANS) entropy coder:
- 32-bit state with 12-bit probability precision
- Supports up to 344 symbols (matching UC2's LZ77 alphabet)
- Frequency table normalization with minimum-frequency guarantee
- Reverse-order encoding with automatic renormalization
- Fast O(1) decoding via cumulative frequency lookup table
Performance: <5% overhead vs Shannon entropy on tested distributions.
Single-symbol streams compress to ~4 bytes (near-zero information).
Skewed distributions (90% one symbol) achieve sub-bit-per-symbol rates.
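
One way to normalize a frequency table to the 12-bit scale while
keeping every used symbol at a frequency of at least 1. This is a
sketch with an illustrative name; the adjustment strategy (move the
rounding error onto the largest entry) is an assumption, not
necessarily what uc2_rans.c does:

```c
#include <stddef.h>
#include <stdint.h>

#define PROB_BITS  12
#define PROB_SCALE (1u << PROB_BITS)  /* 4096, from the 12-bit precision */

/* Scales freq[0..n) in place so the entries sum to PROB_SCALE while
 * every symbol that occurred keeps a frequency of at least 1.
 * Returns 0 on success, -1 if the table is empty or has more used
 * symbols than PROB_SCALE. */
static int normalize_freqs(uint32_t *freq, size_t n)
{
    uint64_t total = 0;
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        total += freq[i];
        used += freq[i] != 0;
    }
    if (total == 0 || used > PROB_SCALE)
        return -1;

    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (freq[i] == 0)
            continue;
        uint32_t s = (uint32_t)((uint64_t)freq[i] * PROB_SCALE / total);
        freq[i] = s ? s : 1;          /* minimum-frequency guarantee */
        sum += freq[i];
    }

    /* Rounding leaves the sum slightly off; move the difference onto
     * the currently largest entry, repeating if one entry is not enough. */
    while (sum != PROB_SCALE) {
        size_t big = 0;
        for (size_t i = 1; i < n; i++)
            if (freq[i] > freq[big])
                big = i;
        if (sum < PROB_SCALE) {
            freq[big] += PROB_SCALE - sum;
            sum = PROB_SCALE;
        } else {
            uint32_t excess = sum - PROB_SCALE;
            uint32_t room = freq[big] - 1;
            uint32_t take = excess < room ? excess : room;
            freq[big] -= take;
            sum -= take;
        }
    }
    return 0;
}
```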
6 unit tests:
- Table construction with frequency normalization
- Round-trip: uniform, skewed, 344-symbol alphabet, single-symbol
- Comparison vs Shannon entropy (verified <5% overhead)
New library (uc2_cdc.h / uc2_cdc.c) for Phase 3 deduplication:
- Gear rolling hash: O(1) per-byte update, uniform distribution,
content-aware boundary detection via mask-based matching
- Configurable chunker: min/max/target chunk sizes (default avg 8KB),
streaming API with reset support
- FNV-1a content hash for chunk dedup addressing
- 256-entry random lookup table for Gear hash distribution
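
A compact sketch of Gear-based boundary detection. The table seed, the
splitmix64 generator used to fill it, testing the top hash bits, and
the function names are all assumptions for illustration; uc2_cdc.c may
differ on each point:

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t gear[256];

/* Fills the table from a fixed-seed splitmix64 generator. The seed
 * and generator are assumptions; any well-mixed table would do. */
static void gear_init(void)
{
    uint64_t x = UINT64_C(0x9E3779B97F4A7C15);
    for (int i = 0; i < 256; i++) {
        uint64_t z = (x += UINT64_C(0x9E3779B97F4A7C15));
        z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
        gear[i] = z ^ (z >> 31);
    }
}

/* Returns the length of the next chunk at the start of data[0..len).
 * A boundary is declared where the top `bits` bits of the rolling
 * hash are all zero, subject to the min and max sizes. `bits` must be
 * in 1..63. */
static size_t next_chunk(const uint8_t *data, size_t len,
                         size_t min, size_t max, unsigned bits)
{
    uint64_t mask = ((UINT64_C(1) << bits) - 1) << (64 - bits);
    uint64_t h = 0;

    if (len <= min)
        return len;
    if (len > max)
        len = max;
    for (size_t i = 0; i < len; i++) {
        h = (h << 1) + gear[data[i]];  /* O(1) update per byte */
        if (i + 1 >= min && (h & mask) == 0)
            return i + 1;
    }
    return len;
}
```

With bits=13 a boundary fires on average once every 8192 bytes past
the minimum, in line with the 8KB default target.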
8 unit tests covering:
- Hash determinism and collision avoidance
- Complete data coverage (no bytes lost)
- Min/max chunk size enforcement
- Content-defined boundary alignment across shifted data
- Cross-file dedup detection (shared 256KB block found between
two files with different unique prefixes/suffixes)
Always assign custom master indices (>= FIRSTMASTER=2) to all files,
never SuperMaster (index 0). The original's ExtractFiles() routes
SuperMaster files through a code path that hangs. The original itself
never uses SuperMaster in file COMPRESS records — it always creates
at least one custom master, even for archives without dedup groups.
For ungrouped files, a default custom master is built from the largest
file's first 64KB. All files reference this master, matching the
original's archive structure.
The automated DOSBox-X test now validates multi-file round-trip in
both directions: 4 files UC2 v3 -> original, 5 files original -> UC2 v3.
All content verified byte-for-byte.
Single-file UC2 v3 archives are now fully backward compatible with the
original UC2 Pro — listing and extraction verified in automated DOSBox-X
test. SFX extraction timeout increased to 600s with 22-file completeness
check (incomplete extraction caused false test results throughout the
earlier investigation). Direction 1 (UC2 v3 -> original) test added.
Root cause: the original UC2 Pro expects csize=0 in the cdir COMPRESS
record (it ignores the field entirely). UC2 v3 was writing the actual
compressed size, which confused the original's archive reader.
Additional changes:
- Use default Huffman tree for all blocks (ensures tree encoding compat)
- Write method=compression_level in cdir COMPRESS (was hardcoded to 1)
- Add tests/scripts/bitdump.py for bit-level bitstream analysis
Single-file UC2 v3 archives are now fully readable by the original UC2
Pro (listing and extraction verified in DOSBox-X). Multi-file archives
still hang — the cdir bitstream decodes correctly in our Python analyzer
but fails in the original's ASM decompressor kernel. Investigation
continues; the bitdump.py tool enables targeted comparison.
Port the original TreeGen/RepairLengths/CodeGen algorithms faithfully
from TREEGEN.CPP for bitstream compatibility with the 1992 UC2 Pro:
- treegen() now accepts max_code_bits parameter (13 for main trees,
7 for tree-encoding meta-tree)
- Heap uses >= for child comparison (prefer right child on ties),
matching original Reheap()
- BuildCodeTree uses extract-one-then-combine pattern
- RepairLengths uses sorted linked lists with cascading space-fill
- Single/zero symbol cases assign length 1 to two symbols
- tree_enc RLE: trigger at run > 6 (not >= 6), max 20 per chunk,
single RepeatCode per run
- First block uses default tree (tree-changed=0) matching original
behavior for small blocks
Full backward compatibility with original UC2 Pro archives (Direction 2)
is maintained. Forward compatibility (UC2 v3 -> original, Direction 1)
remains in progress — the original still hangs, likely due to residual
bitstream-level differences in the ASM decompressor kernel.
Automated test that runs the original 1992 UC2 Pro (UC.EXE) in DOSBox-X
headlessly to create archives from the test corpus, then extracts with
UC2 v3 and verifies byte-for-byte file identity.
Key findings during development:
- uc2pro.exe is a UCEXE self-extracting archive, not the tool itself;
the actual archiver is UC.EXE inside the distribution
- UC.EXE must be run from its own directory (needs DOS.SEA overlay)
- DOSBox-X flatpak requires work dirs under $HOME (filesystem=home)
- The reverse direction (UC2 v3 → original) does not work: the original
UC2 Pro hangs reading UC2 v3 archives due to compression bitstream
differences (added as a roadmap item)
Also fixes create_archives.sh to use the same two-session DOSBox pattern
(extract SFX first, then use UC.EXE).
Recursive directory scanning with parent/child ID tracking, directory
entries in the central directory (OSMETA + DIRMETA + EXTMETA long name
tags), and a CLI round-trip test verifying nested directory hierarchies.
Content-fingerprint grouping via an FNV-1a hash of file headers: files
whose first 4096 bytes are identical are assigned a custom master block
built from the largest file in the group. Masters are compressed with
SuperMaster and written as MASMETA records in the central directory.
Files below 1 KB or without a group continue using the SuperMaster.
Includes CLI integration test and documentation updates (format spec,
usage, roadmap).
Implement a compressor that produces bitstreams compatible with the
existing Bobrowski decompressor. The engine uses LZ77 sliding-window
match finding with hash chains, Huffman entropy coding, and delta-coded
tree serialization matching the original UC2 format exactly.
New files:
- lib/src/compress.c: LZ77+Huffman compressor (~950 lines)
- lib/src/uc2_internal.h: shared constants, types, checksums
- lib/src/uc2_tables.c: vval/ivval delta tables, default Huffman tree
- tests/src/test_roundtrip.c: compress→archive→decompress→verify tests
Key details:
- 4 compression levels (Fast/Normal/Tight/Ultra) with tunable search
- Lazy evaluation for better match selection at higher levels
- Delta-coded Huffman tree serialization with RLE
- Fletcher/XOR checksum computation
- Round-trip test covers 8 patterns × 4 levels (32 test cases)
Fixed 28 errors in the hand-computed ivval inverse delta table (rows
9-13) that caused the decompressor to reconstruct wrong Huffman trees
from compressor output.
Test corpus (empty, text, binary, compressible, incompressible) with
reference archives created by original UC2 v2.3 in DOSBox. Two CTest
tests: test_identify (magic detection) and test_extract (full
extraction pipeline verified byte-for-byte against corpus).