evv/uc2 - uc2 - SDF GIT Society

evv/uc2

Author	SHA1	Message	Date
Eremey Valetov	d26791bfbd	libarchive plugin: directory paths, round-trip test (M5-M6) The read handler now composes full directory paths from the cdir's directory ids rather than emitting bare leaf names: build_dir_path walks the parent chain (root dirid 0, depth-capped against cyclic cdirs), so multi-file archives with subdirectories list correctly. Master-block resolution (M4) and tagged long names (M6) already work through libuc2's extract and tag paths; this adds a libarchive round-trip test that creates archives at Huffman and rANS levels and verifies every byte back through libarchive's public API. Documents the plugin build recipe (libarchive source tree + static lib). Verified against libarchive 3.7.7; round-trip clean under valgrind.	2026-06-13 02:10:56 -04:00
Eremey Valetov	844c1ab092	Add HN/Lobsters writeup; fix Bezemer/Sagunov attribution The 2015 LGPL re-release was initiated by Vladislav Sagunov's request to Nico de Vries; de Vries personally re-released the source. Earlier CREDITS.md (and the inherited license-audit.md) misattributed this to Danny Bezemer. Bezemer was a co-developer of UC2 during the original 1992-1996 work, per de Vries's release notes. Add docs/blog/uc2-revival-writeup.md (1200 words) drafted from the codex-reviewed revision: tightened scoping language ('byte-bitstream- compatible' rather than ambiguous 'byte-compatible'), removed overclaims about borg/restic/Kopia equivalence, dropped speculative PQ/Filecoin/ZK details, and trimmed the demo to one compatibility example + one OTS example. Closes git-bug 98904d0.	2026-05-03 12:46:51 -04:00
Eremey Valetov	3dcfb3c4c4	License audit: SPDX headers + per-file provenance (closes 7cbbf97) Add SPDX-License-Identifier to every source file in lib/ and cli/. Files derived from Bobrowski's libunuc2 retain LGPL-3.0-only; cli/src/main.c (derived from his GPL-licensed unuc2 tool) and all new Phase 2-7 work by Valetov are GPL-3.0-or-later. No silent LGPL-to-GPL upgrade has been applied. CREDITS.md now lists each Bobrowski-derived file specifically rather than crediting libunuc2 as generic 'inspiration'. docs/license-audit.md records the full per-file provenance table, the LGPL-3.0 -> GPL-3.0 chain rationale (LGPL sec. 4 Combined Works is the operative clause; LGPL sec. 3 single-direction upgrade is documented but not exercised), and confirms that: - the 2015 LGPL-3.0 release in original/UC2_source/ is preserved unchanged; - the 2020-2021 LGPL/GPL releases in original/unuc2-0.6/ are preserved unchanged; - lib/src/super.bin is bit-identical to upstream and to de Vries's 1992 distribution data.	2026-05-03 12:20:19 -04:00
Eremey Valetov	162cf462b6	Fix CI failures and formatting issues - Mark test corpus/archives as binary in .gitattributes to prevent line ending conversion on CI (fixes extract test size mismatch) - Fix alignment-unsafe struct cast in uc2_dict.c serialize/deserialize (use memcpy-based byte access instead; fixes SEGFAULT on CI) - Fix formatting issues in docs	2026-03-30 16:57:47 -04:00
Eremey Valetov	d121c2083f	Add benchmark mode: uc2 -B tests all methods on input $ uc2 -B textfile.txt allbytes.bin zeros.bin UC2 Benchmark: 67511 bytes input (3 files) Method Compressed Ratio Enc (ms) Huffman Fast 9938 14.7% 6.3 ms Huffman Tight 9938 14.7% 6.0 ms rANS Tight 4360 6.5% 7.7 ms LZ4 Ultra-fast 1905 2.8% 0.6 ms Tests all 8 Huffman/rANS levels (2-9) plus LZ4 on the input files, reporting compressed size, ratio percentage, and encoding time. Completes Phase 4 (Modern Compression Backends).	2026-03-30 04:48:18 -04:00
Eremey Valetov	b93f1b2a8f	Add BLAKE3 cryptographic hashing for archive integrity New library (uc2_blake3.h / uc2_blake3.c) for Phase 7: - Pure C BLAKE3 implementation (~300 lines) - 256-bit (32-byte) digests using BLAKE2s round function - Bao tree hashing structure for inputs > 1024 bytes - Incremental API (init/update/final) and one-shot helper - Constant-time hash comparison (timing-attack resistant) Suitable for content verification, block integrity checking, and content-addressable storage (replacing or supplementing the 64-bit FNV-1a hashes used in Merkle DAG and block store). 7 unit tests: - Empty input, determinism, collision avoidance - Incremental vs one-shot consistency - Single-byte-at-a-time update consistency - Avalanche effect (1-bit change → ~50% output bits flip) - Constant-time comparison	2026-03-29 22:21:14 -04:00
Eremey Valetov	33773e6220	Add LZ4 ultra-fast compression New library (uc2_lz4.h / uc2_lz4.c) for Phase 4: - Single-probe hash table: O(1) match finding per position - 4-byte minimum match, 16-bit offset (64KB window) - Variable-length token encoding (literal/match pairs) - Handles overlapping matches correctly (byte-by-byte copy) - Incompressible data passes through with minimal overhead 6 unit tests: - Text round-trip (90 bytes repeated → compresses to ~60%) - Binary round-trip (16KB semi-random) - All-same (4KB of 'A' → >75% savings) - Fully random (1KB → expands slightly but round-trips) - Small input (3 bytes) and empty input	2026-03-29 22:14:49 -04:00
Eremey Valetov	38c0898bc2	Add content-aware preprocessing filters (BCJ, BWT, delta) New library (uc2_preprocess.h / uc2_preprocess.c) for Phase 4: BCJ (Branch/Call/Jump) filter: - E8/E9 x86 address normalization (relative → absolute) - Makes calls to the same function from different locations produce identical byte sequences, improving LZ77 matching - Round-trip verified; address normalization confirmed BWT (Burrows-Wheeler Transform): - Suffix-array-based forward transform - LF-mapping inverse with reverse reconstruction - Groups similar contexts for better entropy coding - Round-trip verified for text ("banana") and binary data Delta filter: - Byte-wise delta encoding with configurable stride - Stride 1 for sequential data, stride 2+ for interleaved channels - Constant-delta sequences (arithmetic progressions) reduce to repeated single values Content detection: - Automatic content type identification (text/x86/structured/binary) - MZ/PE and ELF header recognition for x86 - Printable ASCII ratio for text detection 11 unit tests covering all filters and detection.	2026-03-29 20:44:32 -04:00
Eremey Valetov	6d59bc27db	Add dictionary metadata for zstd-inspired cross-archive sharing New library (uc2_dict.h / uc2_dict.c) formalizes master blocks as proper dictionaries with: - 64-bit content hash ID (FNV-1a) for cross-archive sharing - 32-bit integrity checksum with verification - Portable serialization format (24-byte header + data) - Deserialization with magic number and size validation Combined with the block store (uc2_blockstore.h), this enables distributed dedup: archives in different locations can reference shared dictionaries by content hash, with integrity verification before decompression. 6 unit tests including serialization round-trip, corruption detection, and bad-magic rejection. Also added plausible deniability (multi-archive with separate passwords) to Phase 5 roadmap.	2026-03-29 19:39:56 -04:00
Eremey Valetov	e8f0ba5628	Integrate rANS into archive format as method 10 (levels 6-9) rANS entropy coding is now a usable compression option: uc2 -w -L 8 archive.uc2 files... # rANS Tight uc2 archive.uc2 # decompresses (auto-detects method) Block format for method 10: [block-present:1] [nsyms:16] [rans_len:16] [freq_table:344x12bits] [rans_data] [extra_bits] Symbol IDs (0-343) encoded with rANS for near-optimal entropy. Extra bits (distance/length parameters) stored separately in the bitstream, preserving the existing variable-length encoding. Integration: - Compressor: flush_block_rans() dispatched when level >= 6 - Decompressor: decompressor_rans() dispatched for method 10 - CLI: levels 6-9 map to rANS Fast/Normal/Tight/Ultra - COMPRESS records store method=10 for rANS files/cdir - End-to-end round-trip verified (create/list/extract/verify) Levels 2-5 (Huffman) remain the default for backward compatibility with the original UC2 Pro.	2026-03-29 19:26:40 -04:00
Eremey Valetov	db94be6043	Add rANS entropy coder for near-optimal compression New library (uc2_rans.h / uc2_rans.c) — table-based range Asymmetric Numeral Systems (rANS) entropy coder: - 32-bit state with 12-bit probability precision - Supports up to 344 symbols (matching UC2's LZ77 alphabet) - Frequency table normalization with minimum-frequency guarantee - Reverse-order encoding with automatic renormalization - Fast O(1) decoding via cumulative frequency lookup table Performance: <5% overhead vs Shannon entropy on tested distributions. Single-symbol streams compress to ~4 bytes (near-zero information). Skewed distributions (90% one symbol) achieve sub-bit-per-symbol rates. 6 unit tests: - Table construction with frequency normalization - Round-trip: uniform, skewed, 344-symbol alphabet, single-symbol - Comparison vs Shannon entropy (verified <5% overhead)	2026-03-29 18:33:32 -04:00
Eremey Valetov	6a71c8ec95	Give UC2 a voice: personality messages and -q quiet flag UC2 now talks during operations, continuing the original's tradition: $ uc2 -w archive.uc2 files... UC2 compression level: Tight Created archive.uc2 (3 files, 0 dirs, 1 master, 4096 bytes) Everything went OK $ uc2 -t archive.uc2 Testing archive integrity... Everything went OK $ uc2 -h UC2 3.0.0 (UltraCompressor II) "Fast, reliable and superior compression." Messages are warm, confident, and slightly quirky — not a fun flag, just how UC2 talks. Suppressed by -q for scripting: $ uc2 -qt archive.uc2 # silent, exit code only Compression level names: Fast (2), Normal (3), Tight (4), Ultra (5). "Everything went OK" directly from the original (MAIN.CPP:918). Completes Phase 2 (Original Compression Engine).	2026-03-29 18:23:50 -04:00
Eremey Valetov	7b1833a94c	Add SimHash near-duplicate detection and delta compression Completes Phase 3 (Modernized Master-Block Deduplication). SimHash (uc2_simhash.h): 64-bit locality-sensitive fingerprint using 4-byte shingles. Similar files produce fingerprints with small Hamming distance. Detects patched executables (16 bytes changed in 8KB: dist<=8), slightly edited documents, and minor file revisions. 6 unit tests. Delta compression (uc2_delta.h): binary diff with COPY (from source) and INSERT (new data) instructions. Hash-based source matching for fast encoding. 16KB file with 96 patched bytes: >50% delta size savings. Full round-trip verified for identical, different, patched, appended, and empty inputs. 6 unit tests. All Phase 3 items now complete: - [x] Content-fingerprint grouping (FNV-1a) - [x] Custom master-block generation - [x] MASMETA cdir records - [x] SuperMaster-compressed masters - [x] CDC with Gear rolling hash - [x] Merkle DAG content addressing - [x] Cross-archive block store - [x] Near-duplicate detection (SimHash) - [x] Delta compression	2026-03-29 18:05:59 -04:00
Eremey Valetov	5107b659bc	Add cross-archive block store for content-addressable dedup New library (uc2_blockstore.h / uc2_blockstore.c) for Phase 3: - Content-addressable chunk storage indexed by 64-bit hash - Two-level directory layout (hash prefix subdirectories) - Ingest with automatic dedup (existing chunks are skipped) - Read-back for chunk reconstruction - Dedup statistics (blocks stored, bytes saved) 6 unit tests: - Open/close, single file ingest - Identical data: second ingest stores 0 new chunks - Read-back: chunk content verified byte-for-byte - Cross-archive dedup: shared 32KB block detected between two different "archives" (ingested sequentially) - Has/not-has queries	2026-03-29 17:49:19 -04:00
Eremey Valetov	72669a01bb	Add Merkle DAG for content-addressable deduplication New library (uc2_merkle.h / uc2_merkle.c) for Phase 3: - 64-bit FNV-1a content hashing for chunk addressing - Merkle tree: file -> list of chunk hashes -> root hash - Structural similarity comparison and shared chunk counting - Root hash changes on any content change (integrity) - Single-byte change affects only 1-2 chunks (locality) 8 unit tests including partial overlap and change resilience.	2026-03-29 17:43:39 -04:00
Eremey Valetov	4c5661eb33	Integrate CDC into archive creation for position-independent dedup Replace the fixed first-4KB FNV-1a prefix matching with content-defined chunking. Files are now split into ~4KB CDC chunks (Gear rolling hash), each chunk hashed with FNV-1a. Files sharing any chunk hash are grouped for master-block deduplication. This detects shared content at ANY position in the files, not just identical file prefixes. The improvement matters for: - Patched executables (same code, different version strings) - Edited documents (same body, different headers) - Similar data files (shared structures at varying offsets) The fallback master assignment (for backward compat with original UC2 Pro) still applies to ungrouped files, ensuring all files use custom master indices (>= 2). All 7 tests pass including master dedup round-trip and CDC unit tests.	2026-03-29 17:33:35 -04:00
Eremey Valetov	92e1b85cea	Add content-defined chunking (CDC) library with Gear rolling hash New library (uc2_cdc.h / uc2_cdc.c) for Phase 3 deduplication: - Gear rolling hash: O(1) per-byte update, uniform distribution, content-aware boundary detection via mask-based matching - Configurable chunker: min/max/target chunk sizes (default avg 8KB), streaming API with reset support - FNV-1a content hash for chunk dedup addressing - 256-entry random lookup table for Gear hash distribution 8 unit tests covering: - Hash determinism and collision avoidance - Complete data coverage (no bytes lost) - Min/max chunk size enforcement - Content-defined boundary alignment across shifted data - Cross-file dedup detection (shared 256KB block found between two files with different unique prefixes/suffixes)	2026-03-29 17:07:01 -04:00
Eremey Valetov	b042b4b48b	Enable custom Huffman trees for large blocks (37% better compression) Use the default tree for the first block when ibuf < 256 entries, matching the original's bFlag logic (ULTRACMP.CPP:1105). For larger blocks, generate custom Huffman trees from actual symbol frequencies. Compression improvement on text data (textfile.txt, 1719 bytes): Before (default-only): 1688 bytes compressed After (custom trees): 1066 bytes compressed (37% smaller) All tests pass including the bidirectional DOSBox-X round-trip.	2026-03-29 16:06:39 -04:00
Eremey Valetov	6e62a7aa28	Fix multi-file backward compatibility with original UC2 Pro Always assign custom master indices (>= FIRSTMASTER=2) to all files, never SuperMaster (index 0). The original's ExtractFiles() routes SuperMaster files through a code path that hangs. The original itself never uses SuperMaster in file COMPRESS records — it always creates at least one custom master, even for archives without dedup groups. For ungrouped files, a default custom master is built from the largest file's first 64KB. All files reference this master, matching the original's archive structure. The automated DOSBox-X test now validates multi-file round-trip in both directions: 4 files UC2 v3 -> original, 5 files original -> UC2 v3. All content verified byte-for-byte.	2026-03-29 15:21:30 -04:00
Eremey Valetov	7c832ac7dd	Identify multi-file extraction root cause: SuperMaster path hangs Compared raw cdir bytes between original UC2 Pro and UC2 v3 archives (using fixed bitdump.py match decoder). Key finding: the original NEVER uses SuperMaster (masterPrefix=0) in file COMPRESS records — it always assigns custom master indices (>= 2). UC2 v3 uses SuperMaster for ungrouped files. The original's ExtractFiles() processes SuperMaster files through ToToWalk(TTefl, SUPERMASTER, ...) which hangs. Custom master files go through the masroot chain walk, which works correctly. Fix: always assign custom master indices to all files, generating at least one custom master block even for archives without dedup groups. This matches the original's behavior. Other differences found: original uses hidden=0xDE (sentinel), tag=0 (no EXTMETA), method=3, and csize=8 (much smaller due to custom master compression).	2026-03-29 14:59:24 -04:00
Eremey Valetov	eddecfcfc2	Add bidirectional cross-tool round-trip test (both directions pass) Single-file UC2 v3 archives are now fully backward compatible with the original UC2 Pro — listing and extraction verified in automated DOSBox-X test. SFX extraction timeout increased to 600s with 22-file completeness check (incomplete extraction caused false test results throughout the earlier investigation). Direction 1 (UC2 v3 -> original) test added.	2026-03-29 14:29:33 -04:00
Eremey Valetov	09ba31da1c	Identify extraction hang root cause in NUKE.CPP buffer management Located the nuke1 ASM decompressor source at NUKE.CPP (305 lines of C++ with inline x86 assembly). Analysis of the half-buffer flush logic in FlushIt (ULTRACMP.CPP) reveals the extraction hang root cause: For a 49152-byte SuperMaster, the original adjusts the output position: wTOE = 49152 - 32768 = 16384 (lower buffer half) wComp = 0 (flush when output enters lower half) FlushIt triggers immediately after the first decompressed byte (because wTOE is already in the lower half with wComp=0). It reads decompressed output from position 32768 + wSkip = 49152, but the actual output was written at position 16384. This produces garbage and triggers early termination. UC2 v3's compressor starts the LZ77 window at position 49152 (raw SuperMaster size), while the original's decompressor expects the window at position 16384 (49152 mod 32768). Fixing this requires aligning the compressor's starting position to match the decompressor's adjusted wTOE. Also corrected: the earlier "single-file extraction works" finding was a false positive from incomplete UC2DIST extraction (15/22 files). With complete UC2DIST, listing works but extraction hangs for all files.	2026-03-29 13:47:51 -04:00
Eremey Valetov	c731bd75c2	Confirm default tree required; custom trees break single-file extraction Testing revealed that custom Huffman trees from our treegen cause the original UC2 Pro to hang even for single-file archives. The original's ASM decompressor kernel (nuke1) has undocumented assumptions about tree shapes that our treegen doesn't match. Default tree is the only working option for backward compatibility. Multi-file extraction remains a separate open issue (hangs even with default tree, while listing works).	2026-03-29 12:05:54 -04:00
Eremey Valetov	c736b19bae	Fix single-file backward compatibility with original UC2 Pro Root cause: the original UC2 Pro expects csize=0 in the cdir COMPRESS record (it ignores the field entirely). UC2 v3 was writing the actual compressed size, which confused the original's archive reader. Additional changes: - Use default Huffman tree for all blocks (ensures tree encoding compat) - Write method=compression_level in cdir COMPRESS (was hardcoded to 1) - Add tests/scripts/bitdump.py for bit-level bitstream analysis Single-file UC2 v3 archives are now fully readable by the original UC2 Pro (listing and extraction verified in DOSBox-X). Multi-file archives still hang — the cdir bitstream decodes correctly in our Python analyzer but fails in the original's ASM decompressor kernel. Investigation continues; the bitdump.py tool enables targeted comparison.	2026-03-29 09:58:36 -04:00
Eremey Valetov	be7085c4d3	Rewrite Huffman tree generation to match original UC2 Pro Port the original TreeGen/RepairLengths/CodeGen algorithms faithfully from TREEGEN.CPP for bitstream compatibility with the 1992 UC2 Pro: - treegen() now accepts max_code_bits parameter (13 for main trees, 7 for tree-encoding meta-tree) - Heap uses >= for child comparison (prefer right child on ties), matching original Reheap() - BuildCodeTree uses extract-one-then-combine pattern - RepairLengths uses sorted linked lists with cascading space-fill - Single/zero symbol cases assign length 1 to two symbols - tree_enc RLE: trigger at run > 6 (not >= 6), max 20 per chunk, single RepeatCode per run - First block uses default tree (tree-changed=0) matching original behavior for small blocks Full backward compatibility with original UC2 Pro archives (Direction 2) is maintained. Forward compatibility (UC2 v3 -> original, Direction 1) remains in progress — the original still hangs, likely due to residual bitstream-level differences in the ASM decompressor kernel.	2026-03-29 06:25:21 -04:00
Eremey Valetov	ab2d37286c	Add cross-tool round-trip test vs original UC2 Pro in DOSBox-X Automated test that runs the original 1992 UC2 Pro (UC.EXE) in DOSBox-X headlessly to create archives from the test corpus, then extracts with UC2 v3 and verifies byte-for-byte file identity. Key findings during development: - uc2pro.exe is a UCEXE self-extracting archive, not the tool itself; the actual archiver is UC.EXE inside the distribution - UC.EXE must be run from its own directory (needs DOS.SEA overlay) - DOSBox-X flatpak requires work dirs under $HOME (filesystem=home) - The reverse direction (UC2 v3 → original) does not work: the original UC2 Pro hangs reading UC2 v3 archives due to compression bitstream differences (added as a roadmap item) Also fixes create_archives.sh to use the same two-session DOSBox pattern (extract SFX first, then use UC.EXE).	2026-03-28 18:55:03 -04:00
Eremey Valetov	8e70d4cab9	Add custom master-block deduplication for archive creation Content-fingerprint grouping via FNV-1a hash of file headers: files sharing identical first 4096 bytes are assigned a custom master block built from the largest file in the group. Masters are compressed with SuperMaster and written as MASMETA records in the central directory. Files below 1 KB or without a group continue using the SuperMaster. Includes CLI integration test and documentation updates (format spec, usage, roadmap).	2026-03-12 02:18:12 -04:00
Eremey Valetov	a30c8cf694	Add archive creation with SuperMaster compression CLI: uc2 -w [-L level] archive.uc2 files... Creates UC2 archives with long filename tags and the built-in 49KB SuperMaster dictionary for improved compression via LZ77 prefix matching. Library: uc2_compress_ex() accepts master data to pre-fill the sliding window and hash chains. uc2_get_supermaster() decompresses the embedded super.bin. uc2_compress() unchanged (backward compatible, NoMaster). Tests: 5 SuperMaster roundtrip tests, CLI create/extract CTest script.	2026-03-12 02:04:13 -04:00
Eremey Valetov	7691dcc4fa	Add Sphinx documentation and GitHub Pages deployment Sphinx docs with Furo theme covering CLI usage, library API reference, archive format specification, build instructions, history, and roadmap. GitHub Actions workflow deploys to Pages on push to main.	2026-03-12 00:52:47 -04:00

29 Commits