76 Commits

Author SHA1 Message Date
Eremey Valetov
3dc9f9684c docs: roadmap maintenance log for the 2026-06-13 security pass
2026-06-13 10:56:48 -04:00
Eremey Valetov
fc767a1739 cli: report a write error at fclose on extraction
uc2_extract's output file was closed without checking fclose, so a
deferred write error (a full disk, for example) could silently
truncate the extracted file. Fail loudly instead, unless extraction
already reported an error.
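
A minimal sketch of the close-time check described above, not the project's actual code. The helper name `close_output` and its `already_failed` flag are hypothetical; the point is only that both the stream's latched error state and the `fclose` result are inspected.

```c
#include <stdio.h>

/* Hypothetical helper: close an output stream and surface deferred
 * write errors. `already_failed` is nonzero when extraction has
 * already reported an error, so no second message is printed. */
int close_output(FILE *f, const char *path, int already_failed)
{
    int write_err = ferror(f);   /* error latched by an earlier write */

    if (fclose(f) != 0)          /* the final flush at close can fail too */
        write_err = 1;

    if (write_err && !already_failed) {
        fprintf(stderr, "%s: write error\n", path);
        return -1;
    }
    return already_failed ? -1 : 0;
}
```

Checking `fclose` matters because buffered data may only reach the disk at close, so a full disk can go unnoticed until then.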
v3.0.0-alpha.3
2026-06-13 10:55:25 -04:00
Eremey Valetov
ad923d7ea0 fix heap overflow parsing a damaged central directory
A crafted archive could crash the reader with an out-of-bounds read in
the directory-skip path (uc2_finish_cdir -> uc2_read_cdir -> uc2_get_tag).

decompress_cdir allocates cdir_buf inside its decode loop but, on its
error paths (decode failure or a checksum mismatch), returned before
setting cdir_range.end -- leaving cdir_buf non-NULL with a stale end. A
later uc2_read_cdir/uc2_finish_cdir then saw cdir_buf != NULL, skipped
re-reading, and walked a range whose end pointed below its start, so
range_len wrapped and range_get handed out wild pointers. Free cdir_buf
on every error path so the invariant "cdir_buf != NULL iff cdir_range is
valid" holds, and make range_len report an empty range (rather than a
huge one) if end ever precedes ptr, as defense in depth for the whole
parser.

Also add a compression-ratio ceiling to the cdir decode: a tiny crafted
stream can expand via long matches, so abort once the output far
outgrows the compressed bytes consumed.

Found with a new libFuzzer harness (tests/fuzz/, not built by default).
Memory-safety is clean over sustained fuzzing after this change; 22/22
ctest on Release and ASan. A residual slow-input timeout via a separate
decode path is tracked for follow-up.
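
The defensive `range_len` behaviour described above could look roughly like this. The `struct range` layout is an assumption made for illustration; only the names `range_len`, `ptr`, and `end` come from the message.

```c
#include <stddef.h>

/* Assumed shape of a parser range: a cursor and a one-past-the-end bound. */
struct range {
    const unsigned char *ptr;
    const unsigned char *end;
};

/* Bytes left in the range. An inverted range (end before ptr) is
 * reported as empty instead of letting the subtraction wrap to a
 * huge size_t. */
size_t range_len(const struct range *r)
{
    if (r->end <= r->ptr)
        return 0;
    return (size_t)(r->end - r->ptr);
}
```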
2026-06-13 10:53:49 -04:00
Eremey Valetov
62a90af101 guard allocation sizes against integer overflow
Several allocation sizes were computed from input-controlled counts or
lengths and could wrap before the malloc/fread, yielding an undersized
buffer that is then indexed past its end (mainly on 32-bit targets such
as DJGPP, where size_t is 32 bits):

- ingest restore_v2 multiplied an untrusted 32-bit chunk count from the
  archive header by the entry size; cap the count (also bounds memory).
- ingest write and uc2_dict_serialize had the same multiply/add on
  locally-derived sizes; cap them too.
- uc2_blockstore_ingest checked off + clen > len, which can wrap;
  rewrite as off > len || clen > len - off.
- the libarchive plugin's extract_write grew its buffer with an
  unchecked len addition and power-of-two doubling that could wrap;
  guard both.
- uc2_bwt_revert used the caller-supplied primary_index to index its
  buffers without a bound, and multiplied len by sizeof(uint32_t)
  without an overflow check.

Also: uc2_merkle_build used the realloc result without checking it, so
an OOM left tree->chunks NULL and the next write dereferenced it; keep
the chunks gathered so far instead. 22/22 ctest on Release and ASan.
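
The two overflow guards in the list above can be sketched as follows. `range_ok` implements the `off > len || clen > len - off` rewrite; `mul_size_checked` is a hypothetical helper showing one common way to check a count-times-size product before allocating.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when [off, off + clen) lies inside a buffer of len bytes.
 * No intermediate sum is formed, so nothing can wrap. */
int range_ok(size_t off, size_t clen, size_t len)
{
    if (off > len)
        return 0;
    return clen <= len - off;
}

/* Hypothetical helper: stores count * elem in *out and returns 1,
 * or returns 0 if the product does not fit in size_t. */
int mul_size_checked(size_t count, size_t elem, size_t *out)
{
    if (elem != 0 && count > SIZE_MAX / elem)
        return 0;
    *out = count * elem;
    return 1;
}
```

With the naive form, `off = 8` and `clen = SIZE_MAX` sum to 7 after wrapping, which passes a `> len` test against a 10-byte buffer; the rewritten check rejects it.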
2026-06-13 08:43:03 -04:00
Eremey Valetov
43cf875dfe cli: reject path-traversal in archive entry names on extraction
extract_cb appended a decoded entry name to the destination path with
no validation, so a crafted archive whose entry name contained "..",
a path separator, or an absolute form could write files outside the
chosen destination directory (a Zip-Slip). Each UC2 entry name is a
single path component -- the directory tree is rebuilt from dirid
parents -- so reject any name that is empty, ".", "..", or contains
'/' or '\'. The bundled writer only ever stores basenames, so this
affects malformed or hostile archives only; normal extraction
(including names like "..foo" and nested directories) is unchanged.
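
A sketch of the single-component rule stated above. The function name `entry_name_ok` is hypothetical; the rejected cases are exactly those the message lists.

```c
#include <string.h>

/* Returns 1 if name is acceptable as a single path component:
 * not empty, not "." or "..", and free of '/' and '\'. */
int entry_name_ok(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return 0;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return 0;
    if (strchr(name, '/') != NULL || strchr(name, '\\') != NULL)
        return 0;
    return 1;
}
```

Names such as "..foo" pass, because only the exact components "." and ".." are special.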
2026-06-13 08:35:59 -04:00
Eremey Valetov
5e0f3852c6 harden decoder against crafted archives: tree overrun, LZ distance, delta stride
A malformed archive could drive several out-of-bounds accesses in the
decoder, all reachable from untrusted input:

- ht_dec() expanded a Huffman RepeatCode without checking the
  destination against the end of the local stream[] array, so a crafted
  tree wrote past it on the stack. Reject the overrun as UC2_Damaged.

- The LZ match copy in both the rANS and the Huffman paths used a match
  distance straight from the bitstream. A distance larger than the
  bytes written so far (or one wrapped huge by a short bits_get on the
  distance extra-bits) made (u16)(tail - dist) reference window bytes
  that were never written, copying uninitialised memory into the
  output. Track produced history (master fill + output, saturating at
  the 64KB window) and reject dist beyond it.

- struct delta carried val[8], but decompressor() accepts methods up to
  49, giving strides up to 10; strides 9 and 10 indexed past the array
  (and silently mis-decoded). Size val[] to cover the accepted range.

Found by a code-review pass. Valid round-trips are unchanged: 22/22
ctest on Release and ASan, plus ASan round-trips across all levels for
inputs spanning the 64KB window. The assemble_name NULL-deref raised in
the same review is not reachable (dos_name is a fixed 11 bytes, far
under the 300-byte name buffer), so it is left as-is.
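
The LZ distance check in the second item can be illustrated with a small history tracker. This is a sketch under assumptions: `struct history`, `history_add`, and `dist_ok` are hypothetical names, and the real decoder tracks master fill plus output in its own state.

```c
#include <stdint.h>

#define WINDOW_SIZE 65536u   /* the 64KB sliding window */

/* Hypothetical tracker: how many window bytes hold real data. */
struct history {
    uint32_t produced;
};

/* Account for n newly written bytes, saturating at the window size. */
void history_add(struct history *h, uint32_t n)
{
    if (n >= WINDOW_SIZE - h->produced)
        h->produced = WINDOW_SIZE;
    else
        h->produced += n;
}

/* A match distance is valid only if it reaches back into bytes that
 * were actually produced. */
int dist_ok(const struct history *h, uint32_t dist)
{
    return dist >= 1 && dist <= h->produced;
}
```

A decoder following this pattern would treat a failed `dist_ok` as a damaged archive rather than performing the copy.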
2026-06-13 08:33:37 -04:00
Eremey Valetov
13e29ee211 ci: install libfl2 for the DJGPP binutils
The prebuilt DJGPP ar and ld from the andrewwutw release are linked
against the flex runtime (libfl.so.2), which a clean GitHub runner
does not have, so linking libuc2.a failed with a loader error.
Install libfl2 before extracting the toolchain.
2026-06-13 07:56:32 -04:00
Eremey Valetov
247de54352 harden decoding of damaged archives
A truncated or corrupt archive could overrun memory during decode.
decompress_block guarded its match-copy length with an assert that
NDEBUG compiles out, so a short bits_get that underflowed the length
would overrun the 64KB window in release builds. Replace the assert
with a runtime check: an out-of-range length ends the block with
UC2_Damaged before the copy, and the existing checksum and size
validation then reports the archive as damaged. decompress_cdir bound
the walkable range to the buffer allocation rather than the bytes
actually decompressed, so a damaged directory that happened to match
the 16-bit checksum could be parsed into uninitialised heap; bound the
range to the decompressed length. The CLI also leaked the archive
handle and FILE on the directory-read and integrity-test error paths;
close both.

A prefix-sweep fuzzer drove these fixes. It still finds a rare,
heap-state-dependent out-of-bounds read in the directory-skip path
that these changes do not fully close; that and a stable fuzz harness
are tracked separately.
2026-06-13 07:53:53 -04:00
Eremey Valetov
09cdc80986 ci: build the DOS (DJGPP) target and consolidate the toolchain file
A new Linux job installs the andrewwutw DJGPP v3.4 cross-toolchain
(gcc 12.2.0, sha256-pinned), cross-compiles uc2.exe with
cmake/djgpp.cmake, and verifies the result is a DJGPP go32 DOS
executable. The DOS build had no CI coverage and could regress
silently.

The repo carried two diverged DJGPP toolchain files. djgpp.cmake
(referenced by the build docs) forces -nostdinc with explicit DJGPP
include paths, so it builds cleanly even on hosts where /usr/include
would otherwise leak past the cross-compiler. djgpp-toolchain.cmake
(previously referenced by the README) relied on the cross-gcc finding
its own headers and broke in that case. Keep djgpp.cmake as the single
toolchain file, point the README and roadmap at it, and drop
djgpp-toolchain.cmake.
2026-06-13 07:53:46 -04:00
Eremey Valetov
c394106c56 ci: build and test the libarchive plugin on Linux
New job fetches libarchive 3.7.7 (sha256-pinned), builds it as a
dependency-free static library, then configures UC2 with the plugin
and runs the libarchive_roundtrip test. Keeps the plugin's
source-tree build path verified on every push without adding a
libarchive dependency to the default matrix.
2026-06-13 02:27:56 -04:00
Eremey Valetov
d26791bfbd libarchive plugin: directory paths, round-trip test (M5-M6)
The read handler now composes full directory paths from the cdir's
directory ids rather than emitting bare leaf names: build_dir_path
walks the parent chain (root dirid 0, depth-capped against cyclic
cdirs), so multi-file archives with subdirectories list correctly.
Master-block resolution (M4) and tagged long names (M6) already work
through libuc2's extract and tag paths; this adds a libarchive
round-trip test that creates archives at Huffman and rANS levels and
verifies every byte back through libarchive's public API. Documents
the plugin build recipe (libarchive source tree + static lib).

Verified against libarchive 3.7.7; round-trip clean under valgrind.
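
The depth-capped parent walk could be sketched like this. The flat `parent[]` table, the cap value, and the name `dir_depth` are assumptions for illustration; the message only states that the walk ends at root dirid 0 and is capped against cyclic cdirs.

```c
#include <stddef.h>

#define MAX_DIR_DEPTH 64   /* hypothetical cap, not the plugin's value */

/* parent[i] is the parent dirid of directory i; dirid 0 is the root.
 * Returns the depth of dirid, or -1 if the chain leaves the table or
 * exceeds the cap (which a cyclic directory would do). */
int dir_depth(const unsigned *parent, size_t ndirs, unsigned dirid)
{
    int depth = 0;

    while (dirid != 0) {
        if (dirid >= ndirs || depth >= MAX_DIR_DEPTH)
            return -1;
        dirid = parent[dirid];
        depth++;
    }
    return depth;
}
```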
2026-06-13 02:10:56 -04:00
Eremey Valetov
b86309542d cli: fail loudly when archive offsets would exceed 4 GiB
The UC2 container stores 32-bit offsets; ftell results were cast to
unsigned at four sites, so positions past 4 GiB would wrap silently
and corrupt the directory. tell32() now reports the format limit and
exits. Also checks the ftell result reserved for the ingest manifest
instead of seeking to -1 on error. Multi-volume spanning (2b65f0a)
remains the route for larger payloads.
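
A sketch of the range check behind a helper like `tell32()`. Unlike the CLI, which reports the format limit and exits, this hypothetical `offset_to_u32` returns an error code so it can be shown in isolation.

```c
#include <stdint.h>

/* Converts a file position to a 32-bit container offset.
 * Returns 0 on success, or -1 if pos is negative (as ftell returns on
 * error) or too large for the 32-bit field. */
int offset_to_u32(long long pos, uint32_t *out)
{
    if (pos < 0 || pos > (long long)UINT32_MAX)
        return -1;
    *out = (uint32_t)pos;
    return 0;
}
```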
2026-06-12 06:29:12 -04:00
Eremey Valetov
217bf9e53f test_blockstore: portable temp paths and recursive cleanup
Same defect class as test_ingest (ac01b32): hardcoded /tmp and a
shell rm -rf gave the test nothing real to do on the Windows runner.
Temp store now lands in %TEMP% and cleanup uses a portable rmtree
(dirent on POSIX, _findfirst on MSVC) over the store's two-level
layout.
2026-06-12 06:29:12 -04:00
Eremey Valetov
ac01b32273 test_ingest: portable temp paths for Windows CI
The test hardcoded /tmp, which does not exist on the Windows runner.
With NDEBUG compiling the asserts out, the NULL stream from the failed
fopen reached fclose() and tripped the UCRT invalid-parameter fail-fast
(0xc0000409). Temp files now go to %TEMP% on Windows; rm -rf and unlink
are replaced with ISO C remove(); file-handle acquisition failures now
exit loudly instead of relying on assert.
2026-06-11 17:01:29 -04:00
Eremey Valetov
efd41dceb1 add uc2.1 man page and install rules
mdoc man page covering all modes and the OTS/ingest long options,
verified with groff and NetBSD mandoc. CMake installs the binary and
the man page (guarded against add_subdirectory embedding). Also
corrects the stale direction-1 comment in the DOSBox round-trip
script: multi-file archives created by v3 have extracted fine in the
original since the custom-Huffman-tree fix.
v3.0.0-alpha.2
2026-06-11 15:17:50 -04:00
Eremey Valetov
84672c00b6 fix rANS extraction crash and >64KB window corruption
Extraction of level 6-9 archives crashed (first seen on NetBSD/sdf.org,
reproducible everywhere), and files larger than the 64KB sliding window
silently corrupted at every level. Four causes:

- cli: master COMPRESS records hardcoded method 1 while master data was
  compressed at opt.level, so rANS masters were fed to the Huffman
  decoder. Records now carry method 10 at levels 6-9; levels 2-5 keep
  method 1 for original UC2 Pro compatibility.

- decompress: decompressor_rans stopped at remaining == 0 without
  consuming the end-of-block pair and its 12 extra bits, leaving the
  bit cursor desynchronized; the next block-present read landed inside
  the EOB extras and parsed a phantom block. The loop now decodes all
  nsyms symbols and guards output writes instead.

- decompress: a refill read returning a single byte into an empty
  buffer let head overtake tail in bits_feed; the unsigned difference
  wrapped and head walked off the 4KB buffer (the actual segfault).
  The refill now loops until a full byte pair is available, and a
  sticky error flag stops the decoder treating negative bits_get
  returns as data.

- compress/decompress: chunk loads wrote linearly past the circular
  window edge, and the rANS decoder flushed output in one linear write
  that cannot express ring wrap. Loads are now capped at the edge and
  the decoder flushes incrementally in ring order.

Also: BCJ E8/E9 byte assembly no longer shifts promoted ints into the
sign bit, and the libarchive plugin uses timegm on NetBSD/OpenBSD/
DragonFly so DOS timestamps are not offset by the local timezone.

New cli_bigfile regression test (>128KB round-trip at L5 and L6); it
fails against the previous binary and passes now. Verified: 22/22
ctest including the DOSBox-X round-trip against original uc2pro.exe,
ASan/UBSan clean, and the full matrix on NetBSD 10 (sdf.org).
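
The ring-wrap problem in the fourth item is a general one: a single linear copy cannot cross the edge of a circular buffer. The sketch below shows the usual split copy; it is a generic illustration with a hypothetical `ring_write`, not the decoder's incremental flush.

```c
#include <stddef.h>
#include <string.h>

/* Copies n bytes into a ring buffer of `size` bytes starting at pos,
 * splitting the copy at the edge. Requires pos < size and n <= size.
 * Returns the new write position. */
size_t ring_write(unsigned char *ring, size_t size, size_t pos,
                  const unsigned char *src, size_t n)
{
    size_t first = size - pos;              /* room before the edge */

    if (first > n)
        first = n;
    memcpy(ring + pos, src, first);
    memcpy(ring, src + first, n - first);   /* wrapped part, may be 0 bytes */
    return (pos + n) % size;
}
```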
2026-06-11 13:14:01 -04:00
Eremey Valetov
7825eb47b2 ingest v2: self-contained archive (chunk pool inside the file)
Scope shift from the original "make output a real UC2 v3 archive"
issue: that requires a new entry type or compress.c refactor (UC2
archives have one master per file, not a chain).  This commit ships
the closest-in-spirit upgrade -- a self-contained format that solves
v1's main UX wart, the sidecar <archive>.blocks/ directory.

Format v2:
  +0   8B   magic "UC2INGST"
  +8   1B   version (2)
  +9   1B   cdc_bits
  +10  2B   reserved
  +12  4B   chunk_count
  +16  ...  chunk_count * 16B:  8B hash, 4B length, 4B offset
  ...       chunk pool: unique chunks back-to-back at recorded offsets

One implementation note on the dedup map: cap must be a power of two
for the mask-based linear probe to terminate.  Caught when test_ingest
hung at 25 chunks: initial_cap=50 is not a power of two, so probing
kept wrapping onto non-empty slots indefinitely.  The capacity is now
rounded up in dedup_map_init.

Trade-off: cross-archive dedup is not preserved (each --ingest call
overwrites the archive).  v1 archives remain restorable through the
sidecar blockstore; the writer defaults to v2.

Tests: 6 cases (was 5).  test_intra_call_dedup verifies that
identical chunks within a single ingest dedup correctly
(buffer-twice produces > 0 saved bytes).  test_v2_self_contained
asserts the .blocks/ directory is NOT created for v2 archives.

Closes 96ef9b8.  v3 (real UC2 v3 archive output) is filed at 59bec0d.
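
The power-of-two requirement on the dedup map can be seen with a small sketch. `round_up_pow2` and `probe_next` are hypothetical names, not the functions in dedup_map_init.

```c
#include <stddef.h>

/* Smallest power of two >= n (1 for n == 0), or 0 if it would overflow. */
size_t round_up_pow2(size_t n)
{
    size_t cap = 1;

    while (cap < n) {
        if (cap > ((size_t)-1) / 2)
            return 0;
        cap <<= 1;
    }
    return cap;
}

/* Next slot in a mask-based linear probe. Only visits every slot when
 * cap is a power of two, so that cap - 1 is an all-ones mask. */
size_t probe_next(size_t slot, size_t cap)
{
    return (slot + 1) & (cap - 1);
}
```

With cap = 50 the mask is 49 (binary 110001), so from slot 1 the probe steps to slot 0 and then back to slot 1, never reaching most of the table. Rounding 50 up to 64 gives the mask 63, and the probe visits all slots in order.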
2026-05-05 03:25:45 -04:00
Eremey Valetov
bd0d1911b1 djgpp: DOSBox-X smoke test for the cross-compiled uc2.exe
tests/scripts/dos_smoke.sh runs the DJGPP-built uc2 inside DOSBox-X
via the flatpak and asserts:
- uc2 -h loads under a real DPMI host and prints the banner
- uc2 -l <archive> opens an existing UC2 archive and produces output

Skips cleanly when any of uc2.exe, CWSDPMI.EXE, or DOSBox-X are
missing.  CWSDPMI.EXE is the standard DJGPP DPMI extender from
csdpmi7b.zip; fetch recipe added to cmake/README-djgpp.md.

Verified locally against build-djgpp/cli/uc2.exe +
tests/archives/basic.uc2.

Closes 20019aa.  CI matrix entry (9379647) remains a separate
follow-up.
2026-05-05 03:00:23 -04:00
Eremey Valetov
1a7b760848 libarchive plugin: milestones 2-3 -- read_header + read_data via libuc2
read_header() slurps the archive on first call (using
__archive_read_ahead + __archive_read_consume), opens libuc2 against
the slurped buffer, walks uc2_read_cdir to cache every entry, and
yields one per call mapped onto archive_entry's pathname / size /
mtime / mode.  Tagged entries are resolved via uc2_get_tag.  Memory
scales with archive size in v1; seekable adapter via
__archive_read_seek is a future revision.

read_data() runs uc2_extract through a buffering write callback, then
yields the decompressed entry as a single slice (libarchive's pull
API permits this).  read_data_skip and cleanup are correct.

Build verified clean against libarchive 3.7.7.  End-to-end runtime
test via bsdtar requires a custom libarchive build that links the
plugin (the read-format API is internal).  Integration recipe added
to contrib/libarchive/README.md.

Closes 591db60.  M4 (master-block dep tracking regression test) and
M7 (bsdtar round-trip) tracked separately.
2026-05-05 02:57:43 -04:00
Eremey Valetov
c4db7cc58f libarchive plugin: milestone 1 -- working bid()
Replaces the skeleton with a real implementation of the bid callback,
self-registration, and graceful-EOF stubs for the rest of the
read-format vtable.  Builds against a libarchive source tree
(LIBARCHIVE_SOURCE_DIR option) because the read-format API is
internal -- the public -devel package only ships archive.h and
archive_entry.h, not archive_read_private.h.

Key changes:
- __archive_read_ahead reads the first 4 bytes; magic check returns
  bid 64 on 0x55 0x43 0x32 0x1A.
- __archive_read_register_format wired with the correct 12-argument
  signature against libarchive 3.7.7.
- archive_platform_config.uc2.h.in stands in for the generated
  config.h, satisfying archive_platform.h's include-or-error gate
  without us needing to run libarchive's own configure.

Resulting libuc2_libarchive.a exports archive_read_support_format_uc2
with three undefined references (__archive_check_magic,
__archive_read_ahead, __archive_read_register_format) that resolve
when linked into a libarchive tree.

Read_header / read_data / cleanup are EOF stubs.  Wiring to libuc2
is milestone 2+.

Closes b0b06a5; M2-3 tracked at 591db60.
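
The magic check behind the bid can be sketched without libarchive's internal API. `uc2_magic_bid` is a hypothetical stand-alone function; the bytes and the bid value 64 come from the message, while returning -1 on a mismatch is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Returns 64 if buf starts with the UC2 magic, else -1 (assumed). */
int uc2_magic_bid(const unsigned char *buf, size_t avail)
{
    /* 'U' 'C' '2' followed by 0x1A */
    static const unsigned char magic[4] = { 0x55, 0x43, 0x32, 0x1A };

    if (buf == NULL || avail < sizeof magic)
        return -1;
    return memcmp(buf, magic, sizeof magic) == 0 ? 64 : -1;
}
```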
2026-05-04 19:06:58 -04:00
Eremey Valetov
779c8b1a28 djgpp: CMake toolchain file + setup notes
cmake/djgpp-toolchain.cmake builds uc2.exe against the andrewwutw/
build-djgpp prebuilt cross toolchain.  Verified with gcc 7.2.0 and
12.2.0; output is a 359 KB MZ + COFF + go32 DOS executable.

libuc2 (14 source files including the new uc2_ingest.c) compiles
unmodified.  CLI uses the existing cli/src/compat/compat_dos.c shim
for BSD err.h and POSIX fnmatch -- already in tree, just needed
the toolchain file to set DJGPP=TRUE so cli/CMakeLists.txt picks it
up.

Documented gotcha: GCC honours CPATH and CPLUS_INCLUDE_PATH from
the build shell regardless of -nostdinc.  On hosts that export them
(e.g. via Intel oneAPI's setvars.sh), host glibc headers end up first
in the cross-compiler's search path and the build fails on stdint.h.
The README walks through 'unset CPATH' as the remediation.

DOSBox-X end-to-end smoke test and CI matrix entry tracked as P2
follow-ups.  Closes 195be9a.
2026-05-04 19:01:01 -04:00
Eremey Valetov
446158e855 ingest v1: streaming dedup sink (--ingest / --ingest-restore)
Reads stdin, splits via CDC, deduplicates chunks against a sidecar
block store at <archive>.blocks/, writes a chunk-hash manifest at
<archive>.  The reverse operation reads the manifest and reassembles
the byte stream from the block store.

Manifest format (magic UC2INGST) is a standalone container, not yet
unified with the master-block archive layout.  Tar boundaries are not
preserved; the input is treated as an opaque byte stream.  Follow-ups
filed for both.

Builds entirely on existing CDC + blockstore + merkle infrastructure.
No new compression or hashing primitives.

Tests cover small + 200 KB multichunk round-trip, idempotent dedup
(repeat ingest of the same data reports zero new chunks and exact
bytes_saved), empty stream, bad-magic rejection.  Lint gate stays
green.

Closes fa0c7d4.
2026-05-04 18:37:18 -04:00
Eremey Valetov
4a51918b83 ci: lint gate + test_ots fixes against assert(side-effect) NDEBUG bug
Same bug class as dae8a50 and 6d8087f: under -DNDEBUG (CMake's default
for Release, which CI uses) the assert macro expands to ((void)0) and
the wrapped expression is not evaluated.  Calls inside assert() are
silently dropped.

Found 6 occurrences in test_ots.c (uc2_ots_varint_decode, parse_file)
where the call writes through output pointers.  Under Release builds
these tests silently no-op rather than testing anything.  Converted to
capture-then-check.

Audit otherwise clean: production code (lib/, cli/) has only one
assert-on-call, and it wraps a pure arithmetic helper.

Adds tests/scripts/check_assert_side_effects.py as a CI gate to keep
this class of bug out: matches assert(IDENT(...)) where IDENT contains
a side-effect verb (encode/decode/parse/...).  Pure queries (_equal,
_match, _verify, _has_, _is_, _id, _root, _attest_name, memcmp, ...)
are not flagged.  Wired into build.yml on the Linux runner.

Also gitignore Testing/ (CTest run outputs) and __pycache__/.
2026-05-04 18:23:55 -04:00
Eremey Valetov
6d8087fd6f test_delta: convert side-effecting asserts to capture-and-check
Same root cause as 97e05ad and dae8a50: assert(call(...)) under NDEBUG
strips the entire expression, including the function call.  In Release
builds, uc2_delta_encode and uc2_delta_apply never ran in test_delta,
leaving 'delta' and 'recon' uninitialized.  Subsequent free(delta) /
free(recon) of garbage pointers triggered Windows STATUS_HEAP_CORRUPTION
(0xc0000374).  On Linux, glibc happened not to notice.

Convert all assert(uc2_delta_*(...)) to the capture pattern from
97e05ad: { int _r = call; (void)_r; assert(_r == 0); }.  Now the call
runs unconditionally; the assert (still NDEBUG-stripped in Release)
only loses the post-condition check, not the call itself.
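
The capture pattern quoted above, as a compilable sketch. `encode_step` and `encode_calls` are stand-ins for a side-effecting call such as uc2_delta_encode.

```c
#include <assert.h>

int encode_calls;   /* counts how often the stand-in call ran */

/* Stand-in for a call that writes through an output pointer. */
int encode_step(int *out)
{
    encode_calls++;
    *out = 42;
    return 0;
}

void run_capture_and_check(int *out)
{
    /* Broken form: assert(encode_step(out) == 0);
     * under NDEBUG the whole expression, call included, disappears. */
    int r = encode_step(out);   /* the call always runs */
    (void)r;                    /* silences unused warnings under NDEBUG */
    assert(r == 0);             /* only this check is stripped */
}
```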
2026-05-04 16:53:35 -04:00
Eremey Valetov
79e0505fc3 test_delta: defensive malloc(0) fix + per-test fflush
Windows MSVC test_delta failed with STATUS_HEAP_CORRUPTION (0xc0000374).
ASan/UBSan on Linux finds nothing; the most likely Windows-specific
issue is malloc(0) in uc2_delta_apply when the target is empty
(test_empty_target).  Bump to malloc(1) to get a canonical
free()-safe pointer.

Add fflush(stdout) between tests so the next CI run shows which
test (if any) still fails on Windows.
2026-05-04 16:49:32 -04:00
Eremey Valetov
87c5cf3b48 Windows MSVC build: more compat-layer fixes
Round 2 of c67b631 cleanup.  After the dirent + utime fixes, the
MSVC link surface still had:

- LNK2005 'fopen already defined': dropped g_fopen so we no longer
  override the SDK's fopen.  UTF-8 paths still work on Windows 10
  with the active-codepage manifest; non-Unicode codepages will see
  ANSI translation.  This is good enough for the public release; a
  full UTF-8 fopen wrapper can be added later if needed.

- LNK2019 'unresolved S_ISDIR / S_ISREG': MSVC's <sys/stat.h> defines
  _S_IFDIR / _S_IFREG but not the POSIX S_IS* macros.  Add them in
  the unistd.h shim (which main.c already pulls).

- LNK1181 'cannot open input file m.lib': test_merkle and test_rans
  linked libm unconditionally.  Math is in the default CRT on MSVC;
  link 'm' only on non-Windows.

- 'unistd.h' not found in test_blockstore.c: it actually only needs
  getpid().  Use <process.h> + #define getpid _getpid on MSVC, keep
  <unistd.h> elsewhere.
2026-05-04 16:45:20 -04:00
Eremey Valetov
345aabd423 Fix Windows utime conflict: rename to compat__utime + macro shim
Win10 SDK 26100's <sys/utime.h> provides an inline utime() wrapper
that forwards to _utime32 (ANSI-codepath, not UTF-8).  Defining our
own utime() collided with the inline (C2084: function already has a
body).

Rename the compat function to compat__utime and have the utime.h
shim translate utime -> compat__utime via #define so UC2's UTF-8
paths still go through compat__wpath at the call site.
2026-05-04 16:41:28 -04:00
Eremey Valetov
994c584918 Fix Windows MSVC build: dirent.h shim + utimbuf
Two pre-existing issues that have failed every Windows CI run since
2026-03-12 (when archive creation added <dirent.h> via 9525a81):

1. cli/src/main.c:33 includes <dirent.h>, which MSVC does not
   provide.  Add a minimal shim under cli/src/compat/include/msvc/
   exposing DIR / struct dirent / opendir / readdir / closedir.
   The implementation in compat_win32.c uses FindFirstFileW /
   FindNextFileW and round-trips filenames through UTF-8 to match
   the rest of the compat layer.

2. cli/src/compat/compat_win32.c:314 redefined struct utimbuf, which
   collides with the Win10 SDK 10.0.26100+ <sys/utime.h>.  The local
   utime.h shim now pulls <sys/utime.h> directly so utimbuf comes
   from the system, and compat_win32.c stops redefining it.  An
   opt-in _COMPAT_UTIMBUF_FALLBACK is provided for older SDKs that
   hide utimbuf behind _CRT_DECLARE_NONSTDC_NAMES.

Linux and macOS builds continue to pass; this commit only touches
the MSVC compat path.  Closes git-bug c67b631.
2026-05-04 16:38:18 -04:00
Eremey Valetov
b697baef43 Add libarchive read-format plugin skeleton (closes 3668a7b)
contrib/libarchive/ contains a design doc, an annotated skeleton of
archive_read_support_format_uc2(), and a CMake target that gates the
build on -DUC2_BUILD_LIBARCHIVE_PLUGIN=ON plus find_package(LibArchive).

The skeleton has the five required callbacks (bid, read_header,
read_data, read_data_skip, cleanup) with TODO markers at each
implementation point.  The bid function has the magic-byte check
ready; the rest call into libuc2 for parsing and decompression.

libarchive's read-format API is internal; an out-of-tree .so cannot
be loaded into an unmodified libarchive.  The integration plan in
contrib/libarchive/README.md is to upstream the file as a PR against
libarchive/libarchive.  Full implementation is tracked as
git-bug b0b06a5.
2026-05-03 12:50:21 -04:00
Eremey Valetov
844c1ab092 Add HN/Lobsters writeup; fix Bezemer/Sagunov attribution
The 2015 LGPL re-release was initiated by Vladislav Sagunov's request
to Nico de Vries; de Vries personally re-released the source.  Earlier
CREDITS.md (and the inherited license-audit.md) misattributed this to
Danny Bezemer.  Bezemer was a co-developer of UC2 during the original
1992-1996 work, per de Vries's release notes.

Add docs/blog/uc2-revival-writeup.md (1200 words) drafted from the
codex-reviewed revision: tightened scoping language ('byte-bitstream-
compatible' rather than ambiguous 'byte-compatible'), removed
overclaims about borg/restic/Kopia equivalence, dropped speculative
PQ/Filecoin/ZK details, and trimmed the demo to one compatibility
example + one OTS example.

Closes git-bug 98904d0.
2026-05-03 12:46:51 -04:00
Eremey Valetov
3dcfb3c4c4 License audit: SPDX headers + per-file provenance (closes 7cbbf97)
Add SPDX-License-Identifier to every source file in lib/ and cli/.
Files derived from Bobrowski's libunuc2 retain LGPL-3.0-only;
cli/src/main.c (derived from his GPL-licensed unuc2 tool) and all
new Phase 2-7 work by Valetov are GPL-3.0-or-later.  No silent
LGPL-to-GPL upgrade has been applied.

CREDITS.md now lists each Bobrowski-derived file specifically rather
than crediting libunuc2 as generic 'inspiration'.

docs/license-audit.md records the full per-file provenance table,
the LGPL-3.0 -> GPL-3.0 chain rationale (LGPL sec. 4 Combined Works
is the operative clause; LGPL sec. 3 single-direction upgrade is
documented but not exercised), and confirms that:
- the 2015 LGPL-3.0 release in original/UC2_source/ is preserved
  unchanged;
- the 2020-2021 LGPL/GPL releases in original/unuc2-0.6/ are preserved
  unchanged;
- lib/src/super.bin is bit-identical to upstream and to de Vries's
  1992 distribution data.
2026-05-03 12:20:19 -04:00
Eremey Valetov
5c01fec996 Add Phase 7 OpenTimestamps integration
uc2_sha256: pure-C FIPS 180-4 implementation, one-shot and incremental
API, validated against published vectors (empty, abc, 56-byte,
1M 'a', byte-by-byte, every-split-point boundary).

uc2_ots: parser, serializer, and walker for the standard .ots binary
format.  Strict canonical varint with 64-bit overflow check, depth-
bounded recursion, varbytes cap, max-digest cap.  Walker supports
the calendar-path subset (APPEND, PREPEND, SHA256); proofs that
include other crypto ops (SHA1, RIPEMD160, KECCAK256) are accepted
as structurally valid but flagged for follow-up via the standard
'ots verify'.

UC2-OTS trailer: magic-bracketed sidecar appended after the recorded
archive bytes.  Reverse-scan-safe; original UC2 Pro reader ignores
trailing bytes past its recorded length so backward compatibility is
preserved.  Layout (all integers little-endian uint32):
  front-magic + version + archive_len + proof_len + proof
  + proof_len + back-magic.

CLI: --ots-attach validates that the proof's leaf digest equals
SHA-256(archive[0..archive_len)) before appending and refuses to
overwrite an existing trailer unless -f is given.  --ots-extract
writes the proof verbatim, byte-compatible with the standard
'ots verify'.  --ots-info parses and prints the leaf, archive-match
status, and attestation list.  uc2 -t recomputes the archive
SHA-256 and walks the proof.

Tests: 17 OTS unit tests (varint round-trip, canonical/overflow
rejection, file-envelope round-trip, walker on append/sha256/
sibling/unsupported-op/truncated/trailing-garbage, attest_name,
trailer round-trip + corruption rejection in 4 scenarios).
Plus an optional ctest target ots_cross_check that round-trips
the .ots through python-opentimestamps when the package is
installed; skipped (return code 77) otherwise.
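
A sketch of a strict varint decoder in the spirit of the one described above. It assumes a LEB128-style encoding (7 payload bits per byte, high bit as continuation flag); `varint_decode` is a hypothetical name and this is not the code in uc2_ots.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one varint from buf[0..len). Returns the number of bytes
 * consumed, or 0 if the input is truncated, non-canonical (overlong),
 * or does not fit in 64 bits. */
size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    unsigned shift = 0;

    for (size_t i = 0; i < len; i++) {
        uint64_t group = buf[i] & 0x7Fu;

        /* At shift 63 only one payload bit is left in a uint64_t. */
        if (shift == 63 && group > 1)
            return 0;
        value |= group << shift;

        if ((buf[i] & 0x80u) == 0) {
            /* Canonical: a multi-byte value must not end in a zero group. */
            if (i > 0 && group == 0)
                return 0;
            *out = value;
            return i + 1;
        }
        shift += 7;
        if (shift > 63)
            return 0;           /* continuation past 64 bits */
    }
    return 0;                   /* ran out of input */
}
```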
2026-05-03 12:15:30 -04:00
Eremey Valetov
dae8a503e4 Fix int-truncation in test_merkle and test_dict Debug builds
uc2_merkle_root() and uc2_dict_id() return uint64_t; the int _r
temporaries from 97e05ad's NDEBUG fix truncated the high 32 bits.
Under Release the assertion was stripped, hiding the bug; under
Debug the truncated _r never matched the second uint64_t call.
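
The truncation is easy to reproduce in isolation. `sample_root` is a stand-in for a function returning uint64_t such as uc2_merkle_root; the value is chosen so its high 32 bits are non-zero.

```c
#include <stdint.h>

/* Stand-in for a 64-bit hash or id. */
uint64_t sample_root(void)
{
    return 0x1234567800000001ULL;
}

/* Buggy capture: the int temporary keeps only the low bits (the
 * conversion is implementation-defined; common compilers truncate). */
int matches_via_int(void)
{
    int r = (int)sample_root();
    return (uint64_t)r == sample_root();
}

/* Fixed capture: the temporary has the function's return type. */
int matches_via_u64(void)
{
    uint64_t r = sample_root();
    return r == sample_root();
}
```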
2026-05-03 12:15:09 -04:00
Eremey Valetov
97e05ad81a Fix assert side effects with NDEBUG (Release mode CI fix)
2026-03-30 17:26:18 -04:00
Eremey Valetov
157a517006 Fix test corpus line endings and source formatting
2026-03-30 17:09:58 -04:00
Eremey Valetov
162cf462b6 Fix CI failures and formatting issues
- Mark test corpus/archives as binary in .gitattributes to prevent
  line ending conversion on CI (fixes extract test size mismatch)
- Fix alignment-unsafe struct cast in uc2_dict.c serialize/deserialize
  (use memcpy-based byte access instead; fixes SEGFAULT on CI)
- Fix formatting issues in docs
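
The memcpy-based access mentioned in the second item follows a common pattern, sketched here with hypothetical helper names. Casting a byte pointer to a wider type can fault on unaligned addresses; copying through memcpy does not.

```c
#include <stdint.h>
#include <string.h>

/* Reads a uint32_t from any byte address, in host byte order. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Writes a uint32_t to any byte address, in host byte order. */
void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```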
2026-03-30 16:57:47 -04:00
Eremey Valetov
d65c9ba9e2 Update README and CREDITS for public release
README: rewritten to reflect current state (Phases 1-4 complete),
with feature list, compression levels table, full usage examples,
and project structure overview.

CREDITS: expanded with detailed attribution for all contributions
to the UC2 v3 revival (compression engine, dedup libraries, backward
compat, testing infrastructure).
2026-03-30 16:43:07 -04:00
Eremey Valetov
d121c2083f Add benchmark mode: uc2 -B tests all methods on input
$ uc2 -B textfile.txt allbytes.bin zeros.bin
  UC2 Benchmark: 67511 bytes input (3 files)

  Method           Compressed    Ratio   Enc (ms)
  Huffman Fast           9938    14.7%       6.3 ms
  Huffman Tight          9938    14.7%       6.0 ms
  rANS Tight             4360     6.5%       7.7 ms
  LZ4 Ultra-fast         1905     2.8%       0.6 ms

Tests all 8 Huffman/rANS levels (2-9) plus LZ4 on the input files,
reporting compressed size, ratio percentage, and encoding time.

Completes Phase 4 (Modern Compression Backends).
2026-03-30 04:48:18 -04:00
Eremey Valetov
b93f1b2a8f Add BLAKE3 cryptographic hashing for archive integrity
New library (uc2_blake3.h / uc2_blake3.c) for Phase 7:

- Pure C BLAKE3 implementation (~300 lines)
- 256-bit (32-byte) digests using BLAKE2s round function
- Bao tree hashing structure for inputs > 1024 bytes
- Incremental API (init/update/final) and one-shot helper
- Constant-time hash comparison (timing-attack resistant)

Suitable for content verification, block integrity checking,
and content-addressable storage (replacing or supplementing
the 64-bit FNV-1a hashes used in Merkle DAG and block store).

7 unit tests:
- Empty input, determinism, collision avoidance
- Incremental vs one-shot consistency
- Single-byte-at-a-time update consistency
- Avalanche effect (1-bit change → ~50% output bits flip)
- Constant-time comparison
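
A minimal sketch of a constant-time digest comparison, as named in
the feature and test lists above. The function name is hypothetical;
the actual uc2_blake3.h interface is not shown in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Returns 1 if the two buffers are equal, else 0.
 * Every byte is always examined, so the running time does not depend
 * on the position of the first mismatch. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);   /* accumulate any difference */
    return diff == 0;
}
```

An early-exit memcmp leaks how many leading bytes matched; the
accumulate-then-test form does not, although an optimizing compiler
is free to restructure the loop, so security-critical builds should
inspect the generated code.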
2026-03-29 22:21:14 -04:00
Eremey Valetov
33773e6220 Add LZ4 ultra-fast compression
New library (uc2_lz4.h / uc2_lz4.c) for Phase 4:

- Single-probe hash table: O(1) match finding per position
- 4-byte minimum match, 16-bit offset (64KB window)
- Variable-length token encoding (literal/match pairs)
- Handles overlapping matches correctly (byte-by-byte copy)
- Incompressible data passes through with minimal overhead

6 unit tests:
- Text round-trip (90 bytes repeated → compresses to ~60%)
- Binary round-trip (16KB semi-random)
- All-same (4KB of 'A' → >75% savings)
- Fully random (1KB → expands slightly but round-trips)
- Small input (3 bytes) and empty input
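
The byte-by-byte copy for overlapping matches mentioned above can be
sketched as follows. This is an illustrative helper with a
hypothetical name and signature, not the uc2_lz4.c decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Copies len bytes that start `offset` bytes
 * behind out[pos] to out[pos]. Returns the new write position, or 0
 * if the match is invalid or does not fit in the buffer. */
size_t copy_match(uint8_t *out, size_t pos, size_t cap,
                  size_t offset, size_t len)
{
    if (pos > cap || offset == 0 || offset > pos || len > cap - pos)
        return 0;
    /* Forward byte-by-byte copy: when offset < len the source range
     * overlaps the destination, and each byte written becomes input
     * for a later step. memcpy must not be used here. */
    for (size_t i = 0; i < len; i++)
        out[pos + i] = out[pos + i - offset];
    return pos + len;
}
```

With offset 1 the loop repeats the previous byte len times, which is
how a run such as 4KB of 'A' can be expressed as a single match.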
2026-03-29 22:14:49 -04:00
Eremey Valetov
38c0898bc2 Add content-aware preprocessing filters (BCJ, BWT, delta)
New library (uc2_preprocess.h / uc2_preprocess.c) for Phase 4:

BCJ (Branch/Call/Jump) filter:
- E8/E9 x86 address normalization (relative → absolute)
- Makes calls to the same function from different locations produce
  identical byte sequences, improving LZ77 matching
- Round-trip verified; address normalization confirmed
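
A minimal sketch of E8/E9 address normalization. It assumes 32-bit
little-endian operands and uses the buffer offset as the instruction
address; names are hypothetical, and uc2_preprocess.c may differ in
details such as range checks on the target.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void wr32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Shared scan: the opcode bytes are never modified and operands are
 * skipped, so encode and decode visit exactly the same positions. */
static void bcj_x86(uint8_t *buf, size_t n, int encode)
{
    size_t i = 0;
    while (i + 5 <= n) {
        if (buf[i] == 0xE8 || buf[i] == 0xE9) {
            uint32_t next = (uint32_t)(i + 5);   /* address after the insn */
            uint32_t v = rd32(buf + i + 1);
            wr32(buf + i + 1, encode ? v + next : v - next);
            i += 5;
        } else {
            i++;
        }
    }
}

/* Hypothetical entry points: relative -> absolute and back. */
void bcj_x86_encode(uint8_t *buf, size_t n) { bcj_x86(buf, n, 1); }
void bcj_x86_decode(uint8_t *buf, size_t n) { bcj_x86(buf, n, 0); }
```

Two calls to the same target from offsets 0 and 10 carry different
relative operands (95 and 85 for target 100) but the same absolute
operand (100) after encoding.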

BWT (Burrows-Wheeler Transform):
- Suffix-array-based forward transform
- LF-mapping inverse with reverse reconstruction
- Groups similar contexts for better entropy coding
- Round-trip verified for text ("banana") and binary data

Delta filter:
- Byte-wise delta encoding with configurable stride
- Stride 1 for sequential data, stride 2+ for interleaved channels
- Constant-delta sequences (arithmetic progressions) reduce to
  repeated single values
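
The stride-based delta filter described above can be sketched as an
in-place transform. Function names are hypothetical; the
uc2_preprocess.h interface is not shown in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers. Replace each byte with its difference from
 * the byte `stride` positions earlier (arithmetic modulo 256). */
void delta_encode(uint8_t *buf, size_t n, size_t stride)
{
    if (stride == 0)
        return;
    /* Walk backwards so each byte is differenced against the
     * original, not yet transformed, predecessor. */
    for (size_t i = n; i-- > stride; )
        buf[i] = (uint8_t)(buf[i] - buf[i - stride]);
}

/* Inverse: walk forwards so each predecessor is already restored. */
void delta_decode(uint8_t *buf, size_t n, size_t stride)
{
    if (stride == 0)
        return;
    for (size_t i = stride; i < n; i++)
        buf[i] = (uint8_t)(buf[i] + buf[i - stride]);
}
```

With stride 1 the progression 10, 13, 16, 19, 22 becomes
10, 3, 3, 3, 3. With stride 2, two interleaved channels are each
differenced against their own previous sample.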

Content detection:
- Automatic content type identification (text/x86/structured/binary)
- MZ/PE and ELF header recognition for x86
- Printable ASCII ratio for text detection

11 unit tests covering all filters and detection.
2026-03-29 20:44:32 -04:00
Eremey Valetov
6d59bc27db Add dictionary metadata for zstd-inspired cross-archive sharing
New library (uc2_dict.h / uc2_dict.c) formalizes master blocks as
proper dictionaries with:

- 64-bit content hash ID (FNV-1a) for cross-archive sharing
- 32-bit integrity checksum with verification
- Portable serialization format (24-byte header + data)
- Deserialization with magic number and size validation

Combined with the block store (uc2_blockstore.h), this enables
distributed dedup: archives in different locations can reference
shared dictionaries by content hash, with integrity verification
before decompression.

6 unit tests including serialization round-trip, corruption
detection, and bad-magic rejection.

Also added plausible deniability (multi-archive with separate
passwords) to Phase 5 roadmap.
2026-03-29 19:39:56 -04:00
Eremey Valetov
e8f0ba5628 Integrate rANS into archive format as method 10 (levels 6-9)
rANS entropy coding is now a usable compression option:

  uc2 -w -L 8 archive.uc2 files...   # rANS Tight
  uc2 archive.uc2                     # decompresses (auto-detects method)

Block format for method 10:
  [block-present:1] [nsyms:16] [rans_len:16]
  [freq_table:344x12bits] [rans_data] [extra_bits]

Symbol IDs (0-343) encoded with rANS for near-optimal entropy.
Extra bits (distance/length parameters) stored separately in the
bitstream, preserving the existing variable-length encoding.

Integration:
- Compressor: flush_block_rans() dispatched when level >= 6
- Decompressor: decompressor_rans() dispatched for method 10
- CLI: levels 6-9 map to rANS Fast/Normal/Tight/Ultra
- COMPRESS records store method=10 for rANS files/cdir
- End-to-end round-trip verified (create/list/extract/verify)

Levels 2-5 (Huffman) remain the default for backward compatibility
with the original UC2 Pro.
2026-03-29 19:26:40 -04:00
Eremey Valetov
db94be6043 Add rANS entropy coder for near-optimal compression
New library (uc2_rans.h / uc2_rans.c) — table-based range Asymmetric
Numeral Systems (rANS) entropy coder:

- 32-bit state with 12-bit probability precision
- Supports up to 344 symbols (matching UC2's LZ77 alphabet)
- Frequency table normalization with minimum-frequency guarantee
- Reverse-order encoding with automatic renormalization
- Fast O(1) decoding via cumulative frequency lookup table

Performance: <5% overhead vs Shannon entropy on tested distributions.
Single-symbol streams compress to ~4 bytes (near-zero information).
Skewed distributions (90% one symbol) achieve sub-bit-per-symbol rates.

6 unit tests:
- Table construction with frequency normalization
- Round-trip: uniform, skewed, 344-symbol alphabet, single-symbol
- Comparison vs Shannon entropy (verified <5% overhead)
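
One way to normalize a frequency table to 12-bit precision with a
minimum-frequency guarantee is sketched below. This illustrates the
general technique only; the name and the adjustment strategy are
assumptions and may not match uc2_rans.c.

```c
#include <stddef.h>
#include <stdint.h>

#define PROB_BITS  12
#define PROB_SCALE (1u << PROB_BITS)   /* normalized frequencies sum to 4096 */

/* Hypothetical helper. Scales raw counts so they sum to PROB_SCALE,
 * keeping every symbol that occurs at a frequency of at least 1.
 * Returns 0 on success, -1 if the table is empty or has more used
 * symbols than there are probability slots. */
int normalize_freqs(const uint32_t *raw, uint16_t *norm, size_t nsyms)
{
    uint64_t total = 0;
    size_t used = 0;
    for (size_t i = 0; i < nsyms; i++) {
        total += raw[i];
        if (raw[i])
            used++;
    }
    if (total == 0 || used > PROB_SCALE)
        return -1;

    uint32_t sum = 0;
    for (size_t i = 0; i < nsyms; i++) {
        uint32_t s = 0;
        if (raw[i]) {
            s = (uint32_t)(((uint64_t)raw[i] * PROB_SCALE) / total);
            if (s == 0)
                s = 1;                 /* minimum-frequency guarantee */
        }
        norm[i] = (uint16_t)s;
        sum += s;
    }

    /* Rounding leaves the sum slightly off; move one count at a time
     * to or from the currently largest entry until it is exact. */
    while (sum != PROB_SCALE) {
        size_t best = 0;
        for (size_t i = 1; i < nsyms; i++)
            if (norm[i] > norm[best])
                best = i;
        if (sum > PROB_SCALE) { norm[best]--; sum--; }
        else                  { norm[best]++; sum++; }
    }
    return 0;
}
```

A symbol normalized to zero could never be encoded, which is why rare
symbols are bumped to 1 and the excess is taken from the most
frequent symbol, where the relative error is smallest.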
2026-03-29 18:33:32 -04:00
Eremey Valetov
6a71c8ec95 Give UC2 a voice: personality messages and -q quiet flag
UC2 now talks during operations, continuing the original's tradition:

  $ uc2 -w archive.uc2 files...
  UC2 compression level: Tight
  Created archive.uc2 (3 files, 0 dirs, 1 master, 4096 bytes)
  Everything went OK

  $ uc2 -t archive.uc2
  Testing archive integrity...
  Everything went OK

  $ uc2 -h
  UC2 3.0.0 (UltraCompressor II)
  "Fast, reliable and superior compression."

Messages are warm, confident, and slightly quirky — not a fun flag,
just how UC2 talks.  Suppressed by -q for scripting:

  $ uc2 -qt archive.uc2  # silent, exit code only

Compression level names: Fast (2), Normal (3), Tight (4), Ultra (5).
"Everything went OK" comes directly from the original (MAIN.CPP:918).

Completes Phase 2 (Original Compression Engine).
2026-03-29 18:23:50 -04:00
Eremey Valetov
7b1833a94c Add SimHash near-duplicate detection and delta compression
Completes Phase 3 (Modernized Master-Block Deduplication).

SimHash (uc2_simhash.h): 64-bit locality-sensitive fingerprint using
4-byte shingles.  Similar files produce fingerprints with small Hamming
distance.  Detects patched executables (16 bytes changed in 8KB: dist<=8),
slightly edited documents, and minor file revisions.  6 unit tests.
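
A minimal sketch of a 64-bit SimHash over 4-byte shingles with a
Hamming-distance helper. Hashing each shingle with FNV-1a is an
assumption made for this example, as are the function names;
uc2_simhash.h may use a different shingle hash.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a, used here only to hash individual shingles. */
static uint64_t shingle_hash(const uint8_t *p, size_t n)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* Hypothetical helper. Each overlapping 4-byte shingle votes +1 or -1
 * on every bit position; the fingerprint keeps the majority per bit. */
uint64_t simhash64(const uint8_t *data, size_t n)
{
    int64_t votes[64] = {0};
    for (size_t i = 0; i + 4 <= n; i++) {
        uint64_t h = shingle_hash(data + i, 4);
        for (int b = 0; b < 64; b++)
            votes[b] += ((h >> b) & 1) ? 1 : -1;
    }
    uint64_t fp = 0;
    for (int b = 0; b < 64; b++)
        if (votes[b] > 0)
            fp |= UINT64_C(1) << b;
    return fp;
}

/* Number of differing bits between two fingerprints. */
int hamming64(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    int count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

A small edit changes only the few shingles that overlap it, so most
votes are unchanged and the two fingerprints differ in few bits.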

Delta compression (uc2_delta.h): binary diff with COPY (from source)
and INSERT (new data) instructions.  Hash-based source matching for
fast encoding.  16KB file with 96 patched bytes: >50% delta size
savings.  Full round-trip verified for identical, different, patched,
appended, and empty inputs.  6 unit tests.

All Phase 3 items now complete:
- [x] Content-fingerprint grouping (FNV-1a)
- [x] Custom master-block generation
- [x] MASMETA cdir records
- [x] SuperMaster-compressed masters
- [x] CDC with Gear rolling hash
- [x] Merkle DAG content addressing
- [x] Cross-archive block store
- [x] Near-duplicate detection (SimHash)
- [x] Delta compression
2026-03-29 18:05:59 -04:00
Eremey Valetov
5107b659bc Add cross-archive block store for content-addressable dedup
New library (uc2_blockstore.h / uc2_blockstore.c) for Phase 3:

- Content-addressable chunk storage indexed by 64-bit hash
- Two-level directory layout (hash prefix subdirectories)
- Ingest with automatic dedup (existing chunks are skipped)
- Read-back for chunk reconstruction
- Dedup statistics (blocks stored, bytes saved)

6 unit tests:
- Open/close, single file ingest
- Identical data: second ingest stores 0 new chunks
- Read-back: chunk content verified byte-for-byte
- Cross-archive dedup: shared 32KB block detected between
  two different "archives" (ingested sequentially)
- Has/not-has queries
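
A two-level, hash-prefix directory layout could map a chunk hash to a
path as sketched below. The exact layout (one byte of prefix, 16 hex
digits for the file name) and the function name are assumptions for
illustration; the on-disk format of uc2_blockstore.c is not described
in this log.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper. Builds "<root>/<top byte as 2 hex digits>/
 * <hash as 16 hex digits>". Returns 0 on success, -1 if the path
 * does not fit in `out`. */
int chunk_path(char *out, size_t cap, const char *root, uint64_t hash)
{
    int n = snprintf(out, cap, "%s/%02x/%016" PRIx64,
                     root, (unsigned)(hash >> 56), hash);
    return (n > 0 && (size_t)n < cap) ? 0 : -1;
}
```

Spreading chunks over prefix subdirectories keeps any single
directory small, and an "already stored" query reduces to checking
whether the computed path exists.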
2026-03-29 17:49:19 -04:00
Eremey Valetov
72669a01bb Add Merkle DAG for content-addressable deduplication
New library (uc2_merkle.h / uc2_merkle.c) for Phase 3:

- 64-bit FNV-1a content hashing for chunk addressing
- Merkle tree: file -> list of chunk hashes -> root hash
- Structural similarity comparison and shared chunk counting
- Root hash changes on any content change (integrity)
- Single-byte change affects only 1-2 chunks (locality)

8 unit tests including partial overlap and change resilience.
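
A minimal sketch of the file -> chunk hashes -> root hash structure
using 64-bit FNV-1a. The function names and the choice to hash the
chunk-hash list as little-endian bytes are assumptions; this is not
the uc2_merkle.c implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV64_OFFSET UINT64_C(0xcbf29ce484222325)
#define FNV64_PRIME  UINT64_C(0x100000001b3)

/* Standard 64-bit FNV-1a: XOR the byte in, then multiply. */
uint64_t fnv1a64(const void *data, size_t n)
{
    const uint8_t *p = data;
    uint64_t h = FNV64_OFFSET;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= FNV64_PRIME;
    }
    return h;
}

/* Hypothetical helper. Root hash of a flat list of chunk hashes: feed
 * each hash through FNV-1a as 8 little-endian bytes (assumed layout),
 * so the result does not depend on host byte order. */
uint64_t root_of_chunks(const uint64_t *chunk_hashes, size_t count)
{
    uint64_t h = FNV64_OFFSET;
    for (size_t i = 0; i < count; i++) {
        for (int b = 0; b < 8; b++) {
            h ^= (uint8_t)(chunk_hashes[i] >> (8 * b));
            h *= FNV64_PRIME;
        }
    }
    return h;
}
```

Because only the chunks touched by an edit get new hashes, two
versions of a file share most of their chunk list while their root
hashes differ.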
2026-03-29 17:43:39 -04:00
Eremey Valetov
4c5661eb33 Integrate CDC into archive creation for position-independent dedup
Replace the fixed first-4KB FNV-1a prefix matching with content-defined
chunking.  Files are now split into ~4KB CDC chunks (Gear rolling hash),
each chunk hashed with FNV-1a.  Files sharing any chunk hash are grouped
for master-block deduplication.

This detects shared content at ANY position in the files, not just
identical file prefixes.  The improvement matters for:
- Patched executables (same code, different version strings)
- Edited documents (same body, different headers)
- Similar data files (shared structures at varying offsets)

The fallback master assignment (for backward compat with original UC2
Pro) still applies to ungrouped files, ensuring all files use custom
master indices (>= 2).

All 7 tests pass including master dedup round-trip and CDC unit tests.
2026-03-29 17:33:35 -04:00
Eremey Valetov
92e1b85cea Add content-defined chunking (CDC) library with Gear rolling hash
New library (uc2_cdc.h / uc2_cdc.c) for Phase 3 deduplication:

- Gear rolling hash: O(1) per-byte update, uniform distribution,
  content-aware boundary detection via mask-based matching
- Configurable chunker: min/max/target chunk sizes (default avg 8KB),
  streaming API with reset support
- FNV-1a content hash for chunk dedup addressing
- 256-entry random lookup table for Gear hash distribution

8 unit tests covering:
- Hash determinism and collision avoidance
- Complete data coverage (no bytes lost)
- Min/max chunk size enforcement
- Content-defined boundary alignment across shifted data
- Cross-file dedup detection (shared 256KB block found between
  two files with different unique prefixes/suffixes)
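
A minimal sketch of Gear-hash chunking with min/max sizes and
mask-based boundary detection. The table here is filled by a
splitmix64 generator purely for illustration, and all names and
parameters are hypothetical; uc2_cdc.c has its own table and API.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t gear_table[256];

/* Fill the 256-entry table with pseudo-random values (splitmix64).
 * Assumption: any fixed, well-mixed table serves for a sketch. */
void gear_init(uint64_t seed)
{
    for (int i = 0; i < 256; i++) {
        seed += UINT64_C(0x9E3779B97F4A7C15);
        uint64_t z = seed;
        z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
        gear_table[i] = z ^ (z >> 31);
    }
}

/* Hypothetical helper. Returns the length of the next chunk starting
 * at data[0]. A boundary is declared once at least `min` bytes have
 * been consumed and the masked hash bits are all zero; the chunk is
 * cut at `max` bytes if no boundary is found. */
size_t cdc_next(const uint8_t *data, size_t n,
                size_t min, size_t max, uint64_t mask)
{
    if (n <= min)
        return n;                  /* tail shorter than a minimum chunk */
    if (n > max)
        n = max;
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++) {
        /* Gear update: one shift and one add per byte. Old bytes are
         * shifted out of the hash, so no explicit window is kept. */
        h = (h << 1) + gear_table[data[i]];
        if (i + 1 >= min && (h & mask) == 0)
            return i + 1;
    }
    return n;
}
```

Because a boundary depends only on the bytes just before it, data
inserted early in a file shifts the chunk boundaries along with the
content instead of changing every later chunk. A mask with 13 bits
set gives boundaries on average every 8KB past the minimum.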
2026-03-29 17:07:01 -04:00