Commit Graph

22 Commits

Author SHA1 Message Date
Eremey Valetov
efd41dceb1 add uc2.1 man page and install rules
mdoc man page covering all modes and the OTS/ingest long options,
verified with groff and NetBSD mandoc. CMake installs the binary and
the man page (guarded against add_subdirectory embedding). Also
corrects the stale direction-1 comment in the DOSBox round-trip
script: multi-file archives created by v3 have extracted fine in the
original since the custom-Huffman-tree fix.
2026-06-11 15:17:50 -04:00
Eremey Valetov
84672c00b6 fix rANS extraction crash and >64KB window corruption
Extraction of level 6-9 archives crashed (first seen on NetBSD/sdf.org,
reproducible everywhere), and files larger than the 64KB sliding window
silently corrupted at every level. Four causes:

- cli: master COMPRESS records hardcoded method 1 while master data was
  compressed at opt.level, so rANS masters were fed to the Huffman
  decoder. Records now carry method 10 at levels 6-9; levels 2-5 keep
  method 1 for original UC2 Pro compatibility.

- decompress: decompressor_rans stopped at remaining == 0 without
  consuming the end-of-block pair and its 12 extra bits, leaving the
  bit cursor desynchronized; the next block-present read landed inside
  the EOB extras and parsed a phantom block. The loop now decodes all
  nsyms symbols and guards output writes instead.

- decompress: a refill read returning a single byte into an empty
  buffer let head overtake tail in bits_feed; the unsigned difference
  wrapped and head walked off the 4KB buffer (the actual segfault).
  The refill now loops until a full byte pair is available, and a
  sticky error flag stops the decoder from treating negative
  bits_get return values as data.

- compress/decompress: chunk loads wrote linearly past the circular
  window edge, and the rANS decoder flushed output in one linear write
  that cannot express ring wrap. Loads are now capped at the edge and
  the decoder flushes incrementally in ring order.
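
The ring-order handling in the last item can be sketched as follows.
This is a minimal illustration, not the project's C code: the helper
names ring_write and ring_read are hypothetical, and the window is a
plain bytearray. Each copy is capped at the window edge, so one logical
write becomes at most two linear copies.

```python
def ring_write(window: bytearray, pos: int, data: bytes) -> int:
    """Copy data into the circular window starting at pos.

    Every linear copy is capped at the window edge; the position then
    wraps to 0. Returns the new write position. (Hypothetical helper.)
    """
    size = len(window)
    offset = 0
    while offset < len(data):
        n = min(len(data) - offset, size - pos)  # never cross the edge
        window[pos:pos + n] = data[offset:offset + n]
        pos = (pos + n) % size
        offset += n
    return pos


def ring_read(window: bytearray, pos: int, length: int) -> bytes:
    """Read length bytes in ring order starting at pos (hypothetical)."""
    size = len(window)
    out = bytearray()
    while length > 0:
        n = min(length, size - pos)  # flush up to the edge, then wrap
        out += window[pos:pos + n]
        pos = (pos + n) % size
        length -= n
    return bytes(out)
```

With an 8-byte window, writing 4 bytes at position 6 places two bytes at
the end of the buffer and two at the start, and a read from position 6
returns them in the original order.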

Also: BCJ E8/E9 byte assembly no longer shifts promoted ints into the
sign bit, and the libarchive plugin uses timegm on NetBSD/OpenBSD/
DragonFly so DOS timestamps are not offset by the local timezone.
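
For the timestamp part, a small sketch of why a timegm-style conversion
avoids the offset. It assumes the standard packed DOS date/time layout
and that the stored fields are meant to be read as UTC; the function
name dos_to_unix_utc is hypothetical and this is not the plugin's code.

```python
import calendar


def dos_to_unix_utc(dos_date: int, dos_time: int) -> int:
    """Convert packed DOS date/time words to a Unix timestamp.

    calendar.timegm treats the broken-down fields as UTC, so the result
    does not depend on the local timezone (time.mktime would).
    """
    year = 1980 + ((dos_date >> 9) & 0x7F)   # 7 bits, years since 1980
    month = (dos_date >> 5) & 0x0F           # 4 bits, 1-12
    day = dos_date & 0x1F                    # 5 bits, 1-31
    hour = (dos_time >> 11) & 0x1F           # 5 bits
    minute = (dos_time >> 5) & 0x3F          # 6 bits
    second = (dos_time & 0x1F) * 2           # 5 bits, 2-second units
    return calendar.timegm((year, month, day, hour, minute, second))
```

The DOS epoch 1980-01-01 00:00:00 (date word 33, time word 0) maps to
Unix time 315532800 regardless of the TZ setting of the host.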

New cli_bigfile regression test (>128KB round-trip at L5 and L6); it
fails against the previous binary and passes now. Verified: 22/22
ctest including the DOSBox-X round-trip against original uc2pro.exe,
ASan/UBSan clean, and the full matrix on NetBSD 10 (sdf.org).
2026-06-11 13:14:01 -04:00
Eremey Valetov
446158e855 ingest v1: streaming dedup sink (--ingest / --ingest-restore)
Reads stdin, splits via CDC, deduplicates chunks against a sidecar
block store at <archive>.blocks/, writes a chunk-hash manifest at
<archive>.  The reverse operation reads the manifest and reassembles
the byte stream from the block store.
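
A rough sketch of the ingest/restore flow, under stated assumptions:
fixed-size 4 KB chunks stand in for the CDC splitter, SHA-256 stands in
for the chunk hash, and a dict stands in for the on-disk block store.
The names ingest and restore are hypothetical, and the UC2INGST
manifest encoding is not modeled.

```python
import hashlib

CHUNK = 4096  # assumed chunk size; the real splitter is content-defined


def ingest(stream: bytes, store: dict):
    """Split stream, store unseen chunks, and return the manifest.

    Returns (manifest, new_chunks, bytes_saved), where manifest is the
    ordered list of chunk hashes needed to rebuild the stream.
    """
    manifest = []
    new_chunks = 0
    bytes_saved = 0
    for off in range(0, len(stream), CHUNK):
        chunk = stream[off:off + CHUNK]
        key = hashlib.sha256(chunk).digest()
        if key in store:
            bytes_saved += len(chunk)   # already present: deduplicated
        else:
            store[key] = chunk
            new_chunks += 1
        manifest.append(key)
    return manifest, new_chunks, bytes_saved


def restore(manifest, store: dict) -> bytes:
    """Reassemble the byte stream from the block store."""
    return b"".join(store[key] for key in manifest)
```

Ingesting the same data twice into one store reports zero new chunks on
the second pass, with bytes_saved equal to the input length.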

Manifest format (magic UC2INGST) is a standalone container, not yet
unified with the master-block archive layout.  Tar boundaries are not
preserved; the input is treated as an opaque byte stream.  Follow-ups
filed for both.

Builds entirely on existing CDC + blockstore + merkle infrastructure.
No new compression or hashing primitives.

Tests cover small + 200 KB multichunk round-trip, idempotent dedup
(repeat ingest of the same data reports zero new chunks and exact
bytes_saved), empty stream, bad-magic rejection.  Lint gate stays
green.

Closes fa0c7d4.
2026-05-04 18:37:18 -04:00
Eremey Valetov
87c5cf3b48 Windows MSVC build: more compat-layer fixes
Round 2 of c67b631 cleanup.  After the dirent + utime fixes, the
MSVC link surface still had:

- LNK2005 'fopen already defined': dropped g_fopen so we no longer
  override the SDK's fopen.  UTF-8 paths still work on Windows 10
  with the active-codepage manifest; non-Unicode codepages will see
  ANSI translation.  This is good enough for the public release; a
  full UTF-8 fopen wrapper can be added later if needed.

- LNK2019 'unresolved S_ISDIR / S_ISREG': MSVC's <sys/stat.h> defines
  _S_IFDIR / _S_IFREG but not the POSIX S_IS* macros.  Add them in
  the unistd.h shim (which main.c already pulls).

- LNK1181 'cannot open input file m.lib': test_merkle and test_rans
  linked libm unconditionally.  Math is in the default CRT on MSVC;
  link 'm' only on non-Windows.

- 'unistd.h' not found in test_blockstore.c: it actually only needs
  getpid().  Use <process.h> + #define getpid _getpid on MSVC, keep
  <unistd.h> elsewhere.
2026-05-04 16:45:20 -04:00
Eremey Valetov
345aabd423 Fix Windows utime conflict: rename to compat__utime + macro shim
Win10 SDK 26100's <sys/utime.h> provides an inline utime() wrapper
that forwards to _utime32 (ANSI-codepath, not UTF-8).  Defining our
own utime() collided with the inline (C2084: function already has a
body).

Rename the compat function to compat__utime and have the utime.h
shim translate utime -> compat__utime via #define so UC2's UTF-8
paths still go through compat__wpath at the call site.
2026-05-04 16:41:28 -04:00
Eremey Valetov
994c584918 Fix Windows MSVC build: dirent.h shim + utimbuf
Two pre-existing issues that have failed every Windows CI run since
2026-03-12 (when archive creation added <dirent.h> via 9525a81):

1. cli/src/main.c:33 includes <dirent.h>, which MSVC does not
   provide.  Add a minimal shim under cli/src/compat/include/msvc/
   exposing DIR / struct dirent / opendir / readdir / closedir.
   The implementation in compat_win32.c uses FindFirstFileW /
   FindNextFileW and round-trips filenames through UTF-8 to match
   the rest of the compat layer.

2. cli/src/compat/compat_win32.c:314 redefined struct utimbuf, which
   collides with the Win10 SDK 10.0.26100+ <sys/utime.h>.  The local
   utime.h shim now pulls <sys/utime.h> directly so utimbuf comes
   from the system, and compat_win32.c stops redefining it.  An
   opt-in _COMPAT_UTIMBUF_FALLBACK is provided for older SDKs that
   hide utimbuf behind _CRT_DECLARE_NONSTDC_NAMES.

Linux and macOS builds continue to pass; this commit only touches
the MSVC compat path.  Closes git-bug c67b631.
2026-05-04 16:38:18 -04:00
Eremey Valetov
3dcfb3c4c4 License audit: SPDX headers + per-file provenance (closes 7cbbf97)
Add SPDX-License-Identifier to every source file in lib/ and cli/.
Files derived from Bobrowski's libunuc2 retain LGPL-3.0-only;
cli/src/main.c (derived from his GPL-licensed unuc2 tool) and all
new Phase 2-7 work by Valetov are GPL-3.0-or-later.  No silent
LGPL-to-GPL upgrade has been applied.

CREDITS.md now lists each Bobrowski-derived file specifically rather
than crediting libunuc2 as generic 'inspiration'.

docs/license-audit.md records the full per-file provenance table,
the LGPL-3.0 -> GPL-3.0 chain rationale (LGPL sec. 4 Combined Works
is the operative clause; LGPL sec. 3 single-direction upgrade is
documented but not exercised), and confirms that:
- the 2015 LGPL-3.0 release in original/UC2_source/ is preserved
  unchanged;
- the 2020-2021 LGPL/GPL releases in original/unuc2-0.6/ are preserved
  unchanged;
- lib/src/super.bin is bit-identical to upstream and to de Vries's
  1992 distribution data.
2026-05-03 12:20:19 -04:00
Eremey Valetov
5c01fec996 Add Phase 7 OpenTimestamps integration
uc2_sha256: pure-C FIPS 180-4 implementation, one-shot and incremental
API, validated against published vectors (empty, abc, 56-byte,
1M 'a', byte-by-byte, every-split-point boundary).

uc2_ots: parser, serializer, and walker for the standard .ots binary
format.  Strict canonical varint with 64-bit overflow check, depth-
bounded recursion, varbytes cap, max-digest cap.  Walker supports
the calendar-path subset (APPEND, PREPEND, SHA256); proofs that
include other crypto ops (SHA1, RIPEMD160, KECCAK256) are accepted
as structurally valid but flagged for follow-up via the standard
'ots verify'.
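
The strict varint rules can be illustrated with a short sketch. It
assumes the LEB128-style unsigned encoding (7 payload bits per byte,
low group first, high bit as continuation flag); encode_varint and
decode_varint are hypothetical names, not the uc2_ots API.

```python
MAX_U64 = (1 << 64) - 1


def encode_varint(value: int) -> bytes:
    """Encode an unsigned 64-bit value in its shortest form."""
    if not 0 <= value <= MAX_U64:
        raise ValueError("value out of 64-bit range")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # more groups follow
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Decode one varint; returns (value, next_pos).

    Rejects truncated input, values above 64 bits, and non-canonical
    encodings (a zero final group after at least one other group).
    """
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if value > MAX_U64:
            raise ValueError("varint overflows 64 bits")
        if not byte & 0x80:
            if byte == 0 and shift > 0:
                raise ValueError("non-canonical varint")
            return value, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")
```

Under these rules 300 encodes as the two bytes AC 02, while 80 00 (a
padded encoding of zero) is rejected.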

UC2-OTS trailer: magic-bracketed sidecar appended after the recorded
archive bytes.  Reverse-scan-safe; original UC2 Pro reader ignores
trailing bytes past its recorded length so backward compatibility is
preserved.  Layout (all integers little-endian uint32):
  front-magic + version + archive_len + proof_len + proof
  + proof_len + back-magic.
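
A sketch of how that layout supports a reverse scan. The magic byte
strings, their 4-byte length, and the version value below are
placeholders (the commit does not state them), and attach_trailer and
split_trailer are hypothetical names. The sketch also assumes the
trailer starts exactly at archive_len.

```python
import struct

FRONT_MAGIC = b"<OTS"  # placeholder, not the real magic
BACK_MAGIC = b"OTS>"   # placeholder, not the real magic
VERSION = 1            # placeholder version number


def attach_trailer(archive: bytes, proof: bytes) -> bytes:
    """Append front-magic, version, lengths, proof, length, back-magic."""
    header = FRONT_MAGIC + struct.pack("<III", VERSION, len(archive), len(proof))
    footer = struct.pack("<I", len(proof)) + BACK_MAGIC
    return archive + header + proof + footer


def split_trailer(blob: bytes):
    """Locate the trailer from the end of the file.

    Returns (archive, proof) or raises ValueError if the trailer is
    missing or inconsistent.
    """
    tail = 4 + len(BACK_MAGIC)
    if len(blob) < tail or blob[-len(BACK_MAGIC):] != BACK_MAGIC:
        raise ValueError("no back-magic")
    (proof_len,) = struct.unpack_from("<I", blob, len(blob) - tail)
    header_len = len(FRONT_MAGIC) + 12
    start = len(blob) - tail - proof_len - header_len
    if start < 0 or blob[start:start + len(FRONT_MAGIC)] != FRONT_MAGIC:
        raise ValueError("no front-magic")
    version, archive_len, front_len = struct.unpack_from(
        "<III", blob, start + len(FRONT_MAGIC))
    if version != VERSION or front_len != proof_len or archive_len != start:
        raise ValueError("inconsistent trailer")
    proof_start = start + header_len
    return blob[:start], blob[proof_start:proof_start + proof_len]
```

Storing proof_len on both sides of the proof is what lets a reader jump
from the back-magic straight to the front-magic and cross-check the two
copies.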

CLI: --ots-attach validates that the proof's leaf digest equals
SHA-256(archive[0..archive_len)) before appending and refuses to
overwrite an existing trailer unless -f is given.  --ots-extract
writes the proof verbatim, byte-compatible with the standard
'ots verify'.  --ots-info parses and prints the leaf, archive-match
status, and attestation list.  uc2 -t recomputes the archive
SHA-256 and walks the proof.

Tests: 17 OTS unit tests (varint round-trip, canonical/overflow
rejection, file-envelope round-trip, walker on append/sha256/
sibling/unsupported-op/truncated/trailing-garbage, attest_name,
trailer round-trip + corruption rejection in 4 scenarios).
Plus an optional ctest target ots_cross_check that round-trips
the .ots through python-opentimestamps when the package is
installed; skipped (return code 77) otherwise.
2026-05-03 12:15:30 -04:00
Eremey Valetov
d121c2083f Add benchmark mode: uc2 -B tests all methods on input
$ uc2 -B textfile.txt allbytes.bin zeros.bin
  UC2 Benchmark: 67511 bytes input (3 files)

  Method           Compressed    Ratio   Enc (ms)
  Huffman Fast           9938    14.7%       6.3 ms
  Huffman Tight          9938    14.7%       6.0 ms
  rANS Tight             4360     6.5%       7.7 ms
  LZ4 Ultra-fast         1905     2.8%       0.6 ms

Tests all 8 Huffman/rANS levels (2-9) plus LZ4 on the input files,
reporting compressed size, ratio percentage, and encoding time.
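
The ratio column appears to be compressed size over input size, so
smaller is better. A minimal sketch of that arithmetic (the function
name ratio_percent is hypothetical):

```python
def ratio_percent(compressed: int, original: int) -> float:
    """Compressed size as a percentage of the input, one decimal place."""
    if original <= 0:
        raise ValueError("original size must be positive")
    return round(100.0 * compressed / original, 1)
```

For the sample run, 9938 of 67511 bytes gives 14.7, 4360 gives 6.5, and
1905 gives 2.8, matching the table.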

Completes Phase 4 (Modern Compression Backends).
2026-03-30 04:48:18 -04:00
Eremey Valetov
e8f0ba5628 Integrate rANS into archive format as method 10 (levels 6-9)
rANS entropy coding is now a usable compression option:

  uc2 -w -L 8 archive.uc2 files...   # rANS Tight
  uc2 archive.uc2                     # decompresses (auto-detects method)

Block format for method 10:
  [block-present:1] [nsyms:16] [rans_len:16]
  [freq_table:344x12bits] [rans_data] [extra_bits]

Symbol IDs (0-343) encoded with rANS for near-optimal entropy.
Extra bits (distance/length parameters) stored separately in the
bitstream, preserving the existing variable-length encoding.
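
A sketch of reading the fixed part of that block header. The field
widths come from the layout above; the MSB-first bit order, the
BitReader class, and read_block_header are assumptions made for
illustration, and the unit of rans_len is not stated in this commit.

```python
NUM_SYMBOLS = 344   # symbol IDs 0-343
FREQ_BITS = 12      # bits per frequency-table entry
HEADER_BITS = 1 + 16 + 16 + NUM_SYMBOLS * FREQ_BITS  # 4161 bits


class BitReader:
    """Reads bits MSB-first from a bytes object (assumed bit order)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def read_block_header(reader: BitReader):
    """Return (nsyms, rans_len, freqs), or None if no block is present."""
    if reader.read(1) == 0:
        return None
    nsyms = reader.read(16)
    rans_len = reader.read(16)
    freqs = [reader.read(FREQ_BITS) for _ in range(NUM_SYMBOLS)]
    return nsyms, rans_len, freqs
```

The fixed header costs 4161 bits (about 521 bytes) per block, almost all
of it the frequency table.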

Integration:
- Compressor: flush_block_rans() dispatched when level >= 6
- Decompressor: decompressor_rans() dispatched for method 10
- CLI: levels 6-9 map to rANS Fast/Normal/Tight/Ultra
- COMPRESS records store method=10 for rANS files/cdir
- End-to-end round-trip verified (create/list/extract/verify)

Levels 2-5 (Huffman) remain the default for backward compatibility
with the original UC2 Pro.
2026-03-29 19:26:40 -04:00
Eremey Valetov
6a71c8ec95 Give UC2 a voice: personality messages and -q quiet flag
UC2 now talks during operations, continuing the original's tradition:

  $ uc2 -w archive.uc2 files...
  UC2 compression level: Tight
  Created archive.uc2 (3 files, 0 dirs, 1 master, 4096 bytes)
  Everything went OK

  $ uc2 -t archive.uc2
  Testing archive integrity...
  Everything went OK

  $ uc2 -h
  UC2 3.0.0 (UltraCompressor II)
  "Fast, reliable and superior compression."

Messages are warm, confident, and slightly quirky — not a fun flag,
just how UC2 talks.  Suppressed by -q for scripting:

  $ uc2 -qt archive.uc2  # silent, exit code only

Compression level names: Fast (2), Normal (3), Tight (4), Ultra (5).
"Everything went OK" is taken directly from the original (MAIN.CPP:918).


Completes Phase 2 (Original Compression Engine).
2026-03-29 18:23:50 -04:00
Eremey Valetov
4c5661eb33 Integrate CDC into archive creation for position-independent dedup
Replace the fixed first-4KB FNV-1a prefix matching with content-defined
chunking.  Files are now split into ~4KB CDC chunks (Gear rolling hash),
each chunk hashed with FNV-1a.  Files sharing any chunk hash are grouped
for master-block deduplication.
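
A minimal sketch of Gear-based chunking with FNV-1a chunk hashes. The
gear table (seeded pseudo-random values), the 12-bit cut mask, the
minimum and maximum chunk sizes, and the 32-bit FNV width are all
assumptions for illustration; the commit only states ~4KB chunks, a
Gear rolling hash, and FNV-1a.

```python
import random

_rng = random.Random(0)  # fixed seed so the table is reproducible
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = 0xFFF00000        # 12 bits must be zero: ~4 KB average chunk
MIN_CHUNK = 1024         # assumed lower bound
MAX_CHUNK = 16384        # assumed upper bound


def cdc_chunks(data: bytes):
    """Split data at positions chosen by the content, not by offset."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # Gear rolling hash
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fnv1a32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h
```

Because a cut depends only on the bytes just before it, prepending data
to a file moves the early boundaries but later chunks tend to realign,
so their hashes still match the unshifted file.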

This detects shared content at ANY position in the files, not just
identical file prefixes.  The improvement matters for:
- Patched executables (same code, different version strings)
- Edited documents (same body, different headers)
- Similar data files (shared structures at varying offsets)

The fallback master assignment (for backward compat with original UC2
Pro) still applies to ungrouped files, ensuring all files use custom
master indices (>= 2).

All 7 tests pass including master dedup round-trip and CDC unit tests.
2026-03-29 17:33:35 -04:00
Eremey Valetov
6e62a7aa28 Fix multi-file backward compatibility with original UC2 Pro
Always assign custom master indices (>= FIRSTMASTER=2) to all files,
never SuperMaster (index 0).  The original's ExtractFiles() routes
SuperMaster files through a code path that hangs.  The original itself
never uses SuperMaster in file COMPRESS records — it always creates
at least one custom master, even for archives without dedup groups.

For ungrouped files, a default custom master is built from the largest
file's first 64KB.  All files reference this master, matching the
original's archive structure.

The automated DOSBox-X test now validates multi-file round-trip in
both directions: 4 files UC2 v3 -> original, 5 files original -> UC2 v3.
All content verified byte-for-byte.
2026-03-29 15:21:30 -04:00
Eremey Valetov
8a7326d668 Add comment clarifying SuperMaster masterPrefix in cdir
Investigation of multi-file extraction hang:
- LocMacNtx(0) returns VNULL for SuperMaster, but V(VNULL) is safe
  in the original's virtual memory system (not a null dereference)
- 0xDEDEDEDE is a legacy sentinel for old archives, not the standard
  value (original uses masterPrefix=0)
- The hang is in the file data decompression phase, not the cdir
  parsing (listing of multi-file archives works correctly)

Multi-file backward compat remains under investigation.
2026-03-29 11:33:00 -04:00
Eremey Valetov
382f4ae6ce Set versionMadeBy=203 and method=level for original UC2 Pro compat
The original UC2 Pro handles archives differently based on versionMadeBy:
300 causes multi-file listing to fail, while 203 (the original's own
version) works correctly.  Also write method=compression_level in file
COMPRESS records instead of hardcoded 1.

Combined with the earlier csize=0 and default-tree fixes, single-file
UC2 v3 archives are now fully backward compatible with the original
UC2 Pro (listing + extraction verified).  Multi-file archives can be
listed but extraction still hangs — under investigation.
2026-03-29 10:58:04 -04:00
Eremey Valetov
c736b19bae Fix single-file backward compatibility with original UC2 Pro
Root cause: the original UC2 Pro expects csize=0 in the cdir COMPRESS
record (it ignores the field entirely).  UC2 v3 was writing the actual
compressed size, which confused the original's archive reader.

Additional changes:
- Use default Huffman tree for all blocks (ensures tree encoding compat)
- Write method=compression_level in cdir COMPRESS (was hardcoded to 1)
- Add tests/scripts/bitdump.py for bit-level bitstream analysis

Single-file UC2 v3 archives are now fully readable by the original UC2
Pro (listing and extraction verified in DOSBox-X).  Multi-file archives
still hang — the cdir bitstream decodes correctly in our Python analyzer
but fails in the original's ASM decompressor kernel.  Investigation
continues; the bitdump.py tool enables targeted comparison.
2026-03-29 09:58:36 -04:00
Eremey Valetov
de51cfea7c Add directory archival support for archive creation
Recursive directory scanning with parent/child ID tracking, directory
entries in the central directory (OSMETA + DIRMETA + EXTMETA long name
tags), and a CLI round-trip test verifying nested directory hierarchies.
2026-03-28 18:06:28 -04:00
Eremey Valetov
8e70d4cab9 Add custom master-block deduplication for archive creation
Content-fingerprint grouping via FNV-1a hash of file headers: files
whose first 4096 bytes are identical are assigned a custom master block
built from the largest file in the group. Masters are compressed with
SuperMaster and written as MASMETA records in the central directory.
Files below 1 KB or without a group continue using the SuperMaster.
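
The grouping rule can be sketched like this. It assumes a 32-bit FNV-1a
and an in-memory mapping of file names to contents; group_for_masters
is a hypothetical name and the MASMETA encoding is not modeled.

```python
from collections import defaultdict

PREFIX = 4096    # bytes of each file that are fingerprinted
MIN_SIZE = 1024  # smaller files are left without a group


def fnv1a32(data: bytes) -> int:
    """32-bit FNV-1a hash (width is an assumption)."""
    h = 0x811C9DC5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def group_for_masters(files: dict) -> dict:
    """Map each master source file to the sorted members of its group.

    Files are bucketed by the hash of their first PREFIX bytes; a bucket
    with at least two members becomes a group whose master is built
    from its largest file.
    """
    buckets = defaultdict(list)
    for name, data in files.items():
        if len(data) >= MIN_SIZE:
            buckets[fnv1a32(data[:PREFIX])].append(name)
    groups = {}
    for names in buckets.values():
        if len(names) >= 2:
            source = max(names, key=lambda n: len(files[n]))
            groups[source] = sorted(names)
    return groups
```

Files that end up in no group are the ones that keep using the
SuperMaster in this commit.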

Includes CLI integration test and documentation updates (format spec,
usage, roadmap).
2026-03-12 02:18:12 -04:00
Eremey Valetov
a30c8cf694 Add archive creation with SuperMaster compression
CLI: uc2 -w [-L level] archive.uc2 files...
Creates UC2 archives with long filename tags and the built-in 49KB
SuperMaster dictionary for improved compression via LZ77 prefix matching.

Library: uc2_compress_ex() accepts master data to pre-fill the sliding
window and hash chains. uc2_get_supermaster() decompresses the embedded
super.bin. uc2_compress() unchanged (backward compatible, NoMaster).
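
The effect of pre-filling the window with master data is similar to a
preset dictionary in DEFLATE. The sketch below uses Python's zlib zdict
parameter as an analogy only; it is not the UC2 codec, and
compress_with_master / decompress_with_master are hypothetical names.

```python
import hashlib
import zlib


def compress_with_master(data: bytes, master: bytes) -> bytes:
    """Compress data with master preloaded as the dictionary."""
    comp = zlib.compressobj(level=9, zdict=master)
    return comp.compress(data) + comp.flush()


def decompress_with_master(blob: bytes, master: bytes) -> bytes:
    """Decompress; the same master must be supplied again."""
    decomp = zlib.decompressobj(zdict=master)
    return decomp.decompress(blob) + decomp.flush()


# Pseudo-random master so the data does not compress on its own.
master = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
data = master[512:1536] + b"tail"

packed = compress_with_master(data, master)
plain = zlib.compress(data, 9)
```

Here packed is much smaller than plain because almost all of data can
be encoded as matches into the preloaded master, which is the same
reason a shared master helps similar files.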

Tests: 5 SuperMaster roundtrip tests, CLI create/extract CTest script.
2026-03-12 02:04:13 -04:00
Eremey Valetov
40af7e877e Add Windows MSVC compilation target and CI
Restructure compat layer: #include_next headers moved to posix/ for
MinGW, new standalone headers in msvc/ for MSVC (unistd.h, utime.h,
getopt.h). Add getopt() implementation, chmod/unlink/chdir compat
functions, MSVC CRT initializer for UTF-8 console, _pgmptr fix.
2026-03-11 08:16:11 -04:00
Eremey Valetov
145c948804 Add DOS/DJGPP cross-compilation support and CI for Linux + macOS
Add a DJGPP CMake toolchain file and DOS compatibility layer (err.h,
fnmatch, getprogname/setprogname) so UC2 builds as a native DOS
executable via cross-compilation from Linux.  The toolchain works
around a baked-in /usr/include in the DJGPP GCC binary by using -I
instead of -isystem to ensure DJGPP headers take precedence.

Add GitHub Actions CI workflow that builds and smoke-tests on both
ubuntu-latest and macos-latest.
2026-03-08 08:23:09 -04:00
Eremey Valetov
9bb8153cef UC2 v3.0.0-alpha.1: cross-platform revival of UltraCompressor II
Decompression MVP based on Jan Bobrowski's portable unuc2/libunuc2.
CMake build system targeting Linux (GCC/Clang) with MSVC fallback.
Includes original UC2 source by Nico de Vries and unuc2-0.6 for reference.
2026-02-24 13:32:45 -05:00