evv/uc2 - uc2 - SDF GIT Society

evv/uc2

Author	SHA1	Message	Date
Eremey Valetov	ad923d7ea0	fix heap overflow parsing a damaged central directory A crafted archive could crash the reader with an out-of-bounds read in the directory-skip path (uc2_finish_cdir -> uc2_read_cdir -> uc2_get_tag). decompress_cdir allocates cdir_buf inside its decode loop but, on its error paths (decode failure or a checksum mismatch), returned before setting cdir_range.end -- leaving cdir_buf non-NULL with a stale end. A later uc2_read_cdir/uc2_finish_cdir then saw cdir_buf != NULL, skipped re-reading, and walked a range whose end pointed below its start, so range_len wrapped and range_get handed out wild pointers. Free cdir_buf on every error path so the invariant "cdir_buf != NULL iff cdir_range is valid" holds, and make range_len report an empty range (rather than a huge one) if end ever precedes ptr, as defense in depth for the whole parser. Also add a compression-ratio ceiling to the cdir decode: a tiny crafted stream can expand via long matches, so abort once the output far outgrows the compressed bytes consumed. Found with a new libFuzzer harness (tests/fuzz/, not built by default). Memory-safety is clean over sustained fuzzing after this change; 22/22 ctest on Release and ASan. A residual slow-input timeout via a separate decode path is tracked for follow-up.	2026-06-13 10:53:49 -04:00
Eremey Valetov	62a90af101	guard allocation sizes against integer overflow Several allocation sizes were computed from input-controlled counts or lengths and could wrap before the malloc/fread, yielding an undersized buffer that is then indexed past its end (mainly on 32-bit targets such as DJGPP, where size_t is 32 bits): - ingest restore_v2 multiplied an untrusted 32-bit chunk count from the archive header by the entry size; cap the count (also bounds memory). - ingest write and uc2_dict_serialize had the same multiply/add on locally-derived sizes; cap them too. - uc2_blockstore_ingest checked off + clen > len, which can wrap; rewrite as off > len || clen > len - off. - the libarchive plugin's extract_write grew its buffer with an unchecked len addition and power-of-two doubling that could wrap; guard both. - uc2_bwt_revert used the caller-supplied primary_index to index its buffers without a bound, and multiplied len by sizeof(uint32_t) without an overflow check. Also: uc2_merkle_build used the realloc result without checking it, so an OOM left tree->chunks NULL and the next write dereferenced it; keep the chunks gathered so far instead. 22/22 ctest on Release and ASan.	2026-06-13 08:43:03 -04:00
Eremey Valetov	5e0f3852c6	harden decoder against crafted archives: tree overrun, LZ distance, delta stride A malformed archive could drive several out-of-bounds accesses in the decoder, all reachable from untrusted input: - ht_dec() expanded a Huffman RepeatCode without checking the destination against the end of the local stream[] array, so a crafted tree wrote past it on the stack. Reject the overrun as UC2_Damaged. - The LZ match copy in both the rANS and the Huffman paths used a match distance straight from the bitstream. A distance larger than the bytes written so far (or one wrapped huge by a short bits_get on the distance extra-bits) made (u16)(tail - dist) reference window bytes that were never written, copying uninitialised memory into the output. Track produced history (master fill + output, saturating at the 64KB window) and reject dist beyond it. - struct delta carried val[8], but decompressor() accepts methods up to 49, giving strides up to 10; strides 9 and 10 indexed past the array (and silently mis-decoded). Size val[] to cover the accepted range. Found by a code-review pass. Valid round-trips are unchanged: 22/22 ctest on Release and ASan, plus ASan round-trips across all levels for inputs spanning the 64KB window. The assemble_name NULL-deref raised in the same review is not reachable (dos_name is a fixed 11 bytes, far under the 300-byte name buffer), so it is left as-is.	2026-06-13 08:33:37 -04:00
Eremey Valetov	247de54352	harden decoding of damaged archives A truncated or corrupt archive could overrun memory during decode. decompress_block guarded its match-copy length with an assert that NDEBUG compiles out, so a short bits_get that underflowed the length would overrun the 64KB window in release builds. Replace the assert with a runtime check: an out-of-range length ends the block with UC2_Damaged before the copy, and the existing checksum and size validation then reports the archive as damaged. decompress_cdir bound the walkable range to the buffer allocation rather than the bytes actually decompressed, so a damaged directory that happened to match the 16-bit checksum could be parsed into uninitialised heap; bound the range to the decompressed length. The CLI also leaked the archive handle and FILE on the directory-read and integrity-test error paths; close both. A prefix-sweep fuzzer drove these fixes. It still finds a rare, heap-state-dependent out-of-bounds read in the directory-skip path that these changes do not fully close; that and a stable fuzz harness are tracked separately.	2026-06-13 07:53:53 -04:00
Eremey Valetov	b86309542d	cli: fail loudly when archive offsets would exceed 4 GiB The UC2 container stores 32-bit offsets; ftell results were cast to unsigned at four sites, so positions past 4 GiB would wrap silently and corrupt the directory. tell32() now reports the format limit and exits. Also checks the ftell result reserved for the ingest manifest instead of seeking to -1 on error. Multi-volume spanning (2b65f0a) remains the route for larger payloads.	2026-06-12 06:29:12 -04:00
Eremey Valetov	84672c00b6	fix rANS extraction crash and >64KB window corruption Extraction of level 6-9 archives crashed (first seen on NetBSD/sdf.org, reproducible everywhere), and files larger than the 64KB sliding window silently corrupted at every level. Four causes: - cli: master COMPRESS records hardcoded method 1 while master data was compressed at opt.level, so rANS masters were fed to the Huffman decoder. Records now carry method 10 at levels 6-9; levels 2-5 keep method 1 for original UC2 Pro compatibility. - decompress: decompressor_rans stopped at remaining == 0 without consuming the end-of-block pair and its 12 extra bits, leaving the bit cursor desynchronized; the next block-present read landed inside the EOB extras and parsed a phantom block. The loop now decodes all nsyms symbols and guards output writes instead. - decompress: a refill read returning a single byte into an empty buffer let head overtake tail in bits_feed; the unsigned difference wrapped and head walked off the 4KB buffer (the actual segfault). The refill now loops until a full byte pair is available, and a sticky error flag stops the decoder treating negative bits_get returns as data. - compress/decompress: chunk loads wrote linearly past the circular window edge, and the rANS decoder flushed output in one linear write that cannot express ring wrap. Loads are now capped at the edge and the decoder flushes incrementally in ring order. Also: BCJ E8/E9 byte assembly no longer shifts promoted ints into the sign bit, and the libarchive plugin uses timegm on NetBSD/OpenBSD/DragonFly so DOS timestamps are not offset by the local timezone. New cli_bigfile regression test (>128KB round-trip at L5 and L6); it fails against the previous binary and passes now. Verified: 22/22 ctest including the DOSBox-X round-trip against original uc2pro.exe, ASan/UBSan clean, and the full matrix on NetBSD 10 (sdf.org).	2026-06-11 13:14:01 -04:00
Eremey Valetov	7825eb47b2	ingest v2: self-contained archive (chunk pool inside the file) Scope shift from the original "make output a real UC2 v3 archive" issue: that requires a new entry type or compress.c refactor (UC2 archives have one master per file, not a chain). This commit ships the closest-in-spirit upgrade -- a self-contained format that solves v1's main UX wart, the sidecar <archive>.blocks/ directory. Format v2: +0 8B magic "UC2INGST" +8 1B version (2) +9 1B cdc_bits +10 2B reserved +12 4B chunk_count +16 ... chunk_count * 16B: 8B hash, 4B length, 4B offset ... chunk pool: unique chunks back-to-back at recorded offsets The dedup map has a small implementation note: cap must be a power of two for the mask-based linear probe to terminate. Caught when test_ingest hung at 25 chunks -- initial_cap=50 is not power-of-two, so probing wrapped to a non-empty slot indefinitely. Now rounded up in dedup_map_init. Trade-off: cross-archive dedup is not preserved (each --ingest call overwrites the archive). v1 archives remain restorable through the sidecar blockstore; the writer defaults to v2. Tests: 6 cases (was 5). test_intra_call_dedup verifies that identical chunks within a single ingest dedup correctly (buffer-twice produces > 0 saved bytes). test_v2_self_contained asserts the .blocks/ directory is NOT created for v2 archives. Closes 96ef9b8. v3 (real UC2 v3 archive output) is filed at 59bec0d.	2026-05-05 03:25:45 -04:00
Eremey Valetov	446158e855	ingest v1: streaming dedup sink (--ingest / --ingest-restore) Reads stdin, splits via CDC, deduplicates chunks against a sidecar block store at <archive>.blocks/, writes a chunk-hash manifest at <archive>. The reverse operation reads the manifest and reassembles the byte stream from the block store. Manifest format (magic UC2INGST) is a standalone container, not yet unified with the master-block archive layout. Tar boundaries are not preserved; the input is treated as an opaque byte stream. Follow-ups filed for both. Builds entirely on existing CDC + blockstore + merkle infrastructure. No new compression or hashing primitives. Tests cover small + 200 KB multichunk round-trip, idempotent dedup (repeat ingest of the same data reports zero new chunks and exact bytes_saved), empty stream, bad-magic rejection. Lint gate stays green. Closes fa0c7d4.	2026-05-04 18:37:18 -04:00
Eremey Valetov	79e0505fc3	test_delta: defensive malloc(0) fix + per-test fflush Windows MSVC test_delta failed with STATUS_HEAP_CORRUPTION (0xc0000374). ASan/UBSan on Linux finds nothing; the most likely Windows-specific issue is malloc(0) in uc2_delta_apply when the target is empty (test_empty_target). Bump to malloc(1) to get a canonical free()-safe pointer. Add fflush(stdout) between tests so the next CI run shows which test (if any) still fails on Windows.	2026-05-04 16:49:32 -04:00
Eremey Valetov	3dcfb3c4c4	License audit: SPDX headers + per-file provenance (closes 7cbbf97) Add SPDX-License-Identifier to every source file in lib/ and cli/. Files derived from Bobrowski's libunuc2 retain LGPL-3.0-only; cli/src/main.c (derived from his GPL-licensed unuc2 tool) and all new Phase 2-7 work by Valetov are GPL-3.0-or-later. No silent LGPL-to-GPL upgrade has been applied. CREDITS.md now lists each Bobrowski-derived file specifically rather than crediting libunuc2 as generic 'inspiration'. docs/license-audit.md records the full per-file provenance table, the LGPL-3.0 -> GPL-3.0 chain rationale (LGPL sec. 4 Combined Works is the operative clause; LGPL sec. 3 single-direction upgrade is documented but not exercised), and confirms that: - the 2015 LGPL-3.0 release in original/UC2_source/ is preserved unchanged; - the 2020-2021 LGPL/GPL releases in original/unuc2-0.6/ are preserved unchanged; - lib/src/super.bin is bit-identical to upstream and to de Vries's 1992 distribution data.	2026-05-03 12:20:19 -04:00
Eremey Valetov	5c01fec996	Add Phase 7 OpenTimestamps integration uc2_sha256: pure-C FIPS 180-4 implementation, one-shot and incremental API, validated against published vectors (empty, abc, 56-byte, 1M 'a', byte-by-byte, every-split-point boundary). uc2_ots: parser, serializer, and walker for the standard .ots binary format. Strict canonical varint with 64-bit overflow check, depth-bounded recursion, varbytes cap, max-digest cap. Walker supports the calendar-path subset (APPEND, PREPEND, SHA256); proofs that include other crypto ops (SHA1, RIPEMD160, KECCAK256) are accepted as structurally valid but flagged for follow-up via the standard 'ots verify'. UC2-OTS trailer: magic-bracketed sidecar appended after the recorded archive bytes. Reverse-scan-safe; original UC2 Pro reader ignores trailing bytes past its recorded length so backward compatibility is preserved. Layout (all integers little-endian uint32): front-magic + version + archive_len + proof_len + proof + proof_len + back-magic. CLI: --ots-attach validates that the proof's leaf digest equals SHA-256(archive[0..archive_len)) before appending and refuses to overwrite an existing trailer unless -f is given. --ots-extract writes the proof verbatim, byte-compatible with the standard 'ots verify'. --ots-info parses and prints the leaf, archive-match status, and attestation list. uc2 -t recomputes the archive SHA-256 and walks the proof. Tests: 17 OTS unit tests (varint round-trip, canonical/overflow rejection, file-envelope round-trip, walker on append/sha256/sibling/unsupported-op/truncated/trailing-garbage, attest_name, trailer round-trip + corruption rejection in 4 scenarios). Plus an optional ctest target ots_cross_check that round-trips the .ots through python-opentimestamps when the package is installed; skipped (return code 77) otherwise.	2026-05-03 12:15:30 -04:00
Eremey Valetov	157a517006	Fix test corpus line endings and source formatting	2026-03-30 17:09:58 -04:00
Eremey Valetov	162cf462b6	Fix CI failures and formatting issues - Mark test corpus/archives as binary in .gitattributes to prevent line ending conversion on CI (fixes extract test size mismatch) - Fix alignment-unsafe struct cast in uc2_dict.c serialize/deserialize (use memcpy-based byte access instead; fixes SEGFAULT on CI) - Fix formatting issues in docs	2026-03-30 16:57:47 -04:00
Eremey Valetov	b93f1b2a8f	Add BLAKE3 cryptographic hashing for archive integrity New library (uc2_blake3.h / uc2_blake3.c) for Phase 7: - Pure C BLAKE3 implementation (~300 lines) - 256-bit (32-byte) digests using BLAKE2s round function - Bao tree hashing structure for inputs > 1024 bytes - Incremental API (init/update/final) and one-shot helper - Constant-time hash comparison (timing-attack resistant) Suitable for content verification, block integrity checking, and content-addressable storage (replacing or supplementing the 64-bit FNV-1a hashes used in Merkle DAG and block store). 7 unit tests: - Empty input, determinism, collision avoidance - Incremental vs one-shot consistency - Single-byte-at-a-time update consistency - Avalanche effect (1-bit change → ~50% output bits flip) - Constant-time comparison	2026-03-29 22:21:14 -04:00
Eremey Valetov	33773e6220	Add LZ4 ultra-fast compression New library (uc2_lz4.h / uc2_lz4.c) for Phase 4: - Single-probe hash table: O(1) match finding per position - 4-byte minimum match, 16-bit offset (64KB window) - Variable-length token encoding (literal/match pairs) - Handles overlapping matches correctly (byte-by-byte copy) - Incompressible data passes through with minimal overhead 6 unit tests: - Text round-trip (90 bytes repeated → compresses to ~60%) - Binary round-trip (16KB semi-random) - All-same (4KB of 'A' → >75% savings) - Fully random (1KB → expands slightly but round-trips) - Small input (3 bytes) and empty input	2026-03-29 22:14:49 -04:00
Eremey Valetov	38c0898bc2	Add content-aware preprocessing filters (BCJ, BWT, delta) New library (uc2_preprocess.h / uc2_preprocess.c) for Phase 4: BCJ (Branch/Call/Jump) filter: - E8/E9 x86 address normalization (relative → absolute) - Makes calls to the same function from different locations produce identical byte sequences, improving LZ77 matching - Round-trip verified; address normalization confirmed BWT (Burrows-Wheeler Transform): - Suffix-array-based forward transform - LF-mapping inverse with reverse reconstruction - Groups similar contexts for better entropy coding - Round-trip verified for text ("banana") and binary data Delta filter: - Byte-wise delta encoding with configurable stride - Stride 1 for sequential data, stride 2+ for interleaved channels - Constant-delta sequences (arithmetic progressions) reduce to repeated single values Content detection: - Automatic content type identification (text/x86/structured/binary) - MZ/PE and ELF header recognition for x86 - Printable ASCII ratio for text detection 11 unit tests covering all filters and detection.	2026-03-29 20:44:32 -04:00
Eremey Valetov	6d59bc27db	Add dictionary metadata for zstd-inspired cross-archive sharing New library (uc2_dict.h / uc2_dict.c) formalizes master blocks as proper dictionaries with: - 64-bit content hash ID (FNV-1a) for cross-archive sharing - 32-bit integrity checksum with verification - Portable serialization format (24-byte header + data) - Deserialization with magic number and size validation Combined with the block store (uc2_blockstore.h), this enables distributed dedup: archives in different locations can reference shared dictionaries by content hash, with integrity verification before decompression. 6 unit tests including serialization round-trip, corruption detection, and bad-magic rejection. Also added plausible deniability (multi-archive with separate passwords) to Phase 5 roadmap.	2026-03-29 19:39:56 -04:00
Eremey Valetov	e8f0ba5628	Integrate rANS into archive format as method 10 (levels 6-9) rANS entropy coding is now a usable compression option: uc2 -w -L 8 archive.uc2 files... # rANS Tight uc2 archive.uc2 # decompresses (auto-detects method) Block format for method 10: [block-present:1] [nsyms:16] [rans_len:16] [freq_table:344x12bits] [rans_data] [extra_bits] Symbol IDs (0-343) encoded with rANS for near-optimal entropy. Extra bits (distance/length parameters) stored separately in the bitstream, preserving the existing variable-length encoding. Integration: - Compressor: flush_block_rans() dispatched when level >= 6 - Decompressor: decompressor_rans() dispatched for method 10 - CLI: levels 6-9 map to rANS Fast/Normal/Tight/Ultra - COMPRESS records store method=10 for rANS files/cdir - End-to-end round-trip verified (create/list/extract/verify) Levels 2-5 (Huffman) remain the default for backward compatibility with the original UC2 Pro.	2026-03-29 19:26:40 -04:00
Eremey Valetov	db94be6043	Add rANS entropy coder for near-optimal compression New library (uc2_rans.h / uc2_rans.c) — table-based range Asymmetric Numeral Systems (rANS) entropy coder: - 32-bit state with 12-bit probability precision - Supports up to 344 symbols (matching UC2's LZ77 alphabet) - Frequency table normalization with minimum-frequency guarantee - Reverse-order encoding with automatic renormalization - Fast O(1) decoding via cumulative frequency lookup table Performance: <5% overhead vs Shannon entropy on tested distributions. Single-symbol streams compress to ~4 bytes (near-zero information). Skewed distributions (90% one symbol) achieve sub-bit-per-symbol rates. 6 unit tests: - Table construction with frequency normalization - Round-trip: uniform, skewed, 344-symbol alphabet, single-symbol - Comparison vs Shannon entropy (verified <5% overhead)	2026-03-29 18:33:32 -04:00
Eremey Valetov	7b1833a94c	Add SimHash near-duplicate detection and delta compression Completes Phase 3 (Modernized Master-Block Deduplication). SimHash (uc2_simhash.h): 64-bit locality-sensitive fingerprint using 4-byte shingles. Similar files produce fingerprints with small Hamming distance. Detects patched executables (16 bytes changed in 8KB: dist<=8), slightly edited documents, and minor file revisions. 6 unit tests. Delta compression (uc2_delta.h): binary diff with COPY (from source) and INSERT (new data) instructions. Hash-based source matching for fast encoding. 16KB file with 96 patched bytes: >50% delta size savings. Full round-trip verified for identical, different, patched, appended, and empty inputs. 6 unit tests. All Phase 3 items now complete: - [x] Content-fingerprint grouping (FNV-1a) - [x] Custom master-block generation - [x] MASMETA cdir records - [x] SuperMaster-compressed masters - [x] CDC with Gear rolling hash - [x] Merkle DAG content addressing - [x] Cross-archive block store - [x] Near-duplicate detection (SimHash) - [x] Delta compression	2026-03-29 18:05:59 -04:00
Eremey Valetov	5107b659bc	Add cross-archive block store for content-addressable dedup New library (uc2_blockstore.h / uc2_blockstore.c) for Phase 3: - Content-addressable chunk storage indexed by 64-bit hash - Two-level directory layout (hash prefix subdirectories) - Ingest with automatic dedup (existing chunks are skipped) - Read-back for chunk reconstruction - Dedup statistics (blocks stored, bytes saved) 6 unit tests: - Open/close, single file ingest - Identical data: second ingest stores 0 new chunks - Read-back: chunk content verified byte-for-byte - Cross-archive dedup: shared 32KB block detected between two different "archives" (ingested sequentially) - Has/not-has queries	2026-03-29 17:49:19 -04:00
Eremey Valetov	72669a01bb	Add Merkle DAG for content-addressable deduplication New library (uc2_merkle.h / uc2_merkle.c) for Phase 3: - 64-bit FNV-1a content hashing for chunk addressing - Merkle tree: file -> list of chunk hashes -> root hash - Structural similarity comparison and shared chunk counting - Root hash changes on any content change (integrity) - Single-byte change affects only 1-2 chunks (locality) 8 unit tests including partial overlap and change resilience.	2026-03-29 17:43:39 -04:00
Eremey Valetov	92e1b85cea	Add content-defined chunking (CDC) library with Gear rolling hash New library (uc2_cdc.h / uc2_cdc.c) for Phase 3 deduplication: - Gear rolling hash: O(1) per-byte update, uniform distribution, content-aware boundary detection via mask-based matching - Configurable chunker: min/max/target chunk sizes (default avg 8KB), streaming API with reset support - FNV-1a content hash for chunk dedup addressing - 256-entry random lookup table for Gear hash distribution 8 unit tests covering: - Hash determinism and collision avoidance - Complete data coverage (no bytes lost) - Min/max chunk size enforcement - Content-defined boundary alignment across shifted data - Cross-file dedup detection (shared 256KB block found between two files with different unique prefixes/suffixes)	2026-03-29 17:07:01 -04:00
Eremey Valetov	b042b4b48b	Enable custom Huffman trees for large blocks (37% better compression) Use the default tree for the first block when ibuf < 256 entries, matching the original's bFlag logic (ULTRACMP.CPP:1105). For larger blocks, generate custom Huffman trees from actual symbol frequencies. Compression improvement on text data (textfile.txt, 1719 bytes): Before (default-only): 1688 bytes compressed After (custom trees): 1066 bytes compressed (37% smaller) All tests pass including the bidirectional DOSBox-X round-trip.	2026-03-29 16:06:39 -04:00
Eremey Valetov	75a5ea541e	Confirm custom Huffman trees still incompatible with original nuke1 Tested custom trees with the custom master fix in place — the original still hangs. The tree incompatibility is a separate issue from the SuperMaster path hang (both are real problems, both are now understood). Custom trees give ~40% better compression for text data (1066 vs 1688 bytes for textfile.txt) but are incompatible with nuke1's assumptions. Default tree retained for backward compatibility. Updated roadmap to separate the backward compat (done) from the tree optimization (remaining).	2026-03-29 15:36:48 -04:00
Eremey Valetov	c731bd75c2	Confirm default tree required; custom trees break single-file extraction Testing revealed that custom Huffman trees from our treegen cause the original UC2 Pro to hang even for single-file archives. The original's ASM decompressor kernel (nuke1) has undocumented assumptions about tree shapes that our treegen doesn't match. Default tree is the only working option for backward compatibility. Multi-file extraction remains a separate open issue (hangs even with default tree, while listing works).	2026-03-29 12:05:54 -04:00
Eremey Valetov	c736b19bae	Fix single-file backward compatibility with original UC2 Pro Root cause: the original UC2 Pro expects csize=0 in the cdir COMPRESS record (it ignores the field entirely). UC2 v3 was writing the actual compressed size, which confused the original's archive reader. Additional changes: - Use default Huffman tree for all blocks (ensures tree encoding compat) - Write method=compression_level in cdir COMPRESS (was hardcoded to 1) - Add tests/scripts/bitdump.py for bit-level bitstream analysis Single-file UC2 v3 archives are now fully readable by the original UC2 Pro (listing and extraction verified in DOSBox-X). Multi-file archives still hang — the cdir bitstream decodes correctly in our Python analyzer but fails in the original's ASM decompressor kernel. Investigation continues; the bitdump.py tool enables targeted comparison.	2026-03-29 09:58:36 -04:00
Eremey Valetov	be7085c4d3	Rewrite Huffman tree generation to match original UC2 Pro Port the original TreeGen/RepairLengths/CodeGen algorithms faithfully from TREEGEN.CPP for bitstream compatibility with the 1992 UC2 Pro: - treegen() now accepts max_code_bits parameter (13 for main trees, 7 for tree-encoding meta-tree) - Heap uses >= for child comparison (prefer right child on ties), matching original Reheap() - BuildCodeTree uses extract-one-then-combine pattern - RepairLengths uses sorted linked lists with cascading space-fill - Single/zero symbol cases assign length 1 to two symbols - tree_enc RLE: trigger at run > 6 (not >= 6), max 20 per chunk, single RepeatCode per run - First block uses default tree (tree-changed=0) matching original behavior for small blocks Full backward compatibility with original UC2 Pro archives (Direction 2) is maintained. Forward compatibility (UC2 v3 -> original, Direction 1) remains in progress — the original still hangs, likely due to residual bitstream-level differences in the ASM decompressor kernel.	2026-03-29 06:25:21 -04:00
Eremey Valetov	a30c8cf694	Add archive creation with SuperMaster compression CLI: uc2 -w [-L level] archive.uc2 files... Creates UC2 archives with long filename tags and the built-in 49KB SuperMaster dictionary for improved compression via LZ77 prefix matching. Library: uc2_compress_ex() accepts master data to pre-fill the sliding window and hash chains. uc2_get_supermaster() decompresses the embedded super.bin. uc2_compress() unchanged (backward compatible, NoMaster). Tests: 5 SuperMaster roundtrip tests, CLI create/extract CTest script.	2026-03-12 02:04:13 -04:00
Eremey Valetov	9525a81e11	Add original UC2 compression engine with LZ77+Huffman coding Implement a compressor that produces bitstreams compatible with the existing Bobrowski decompressor. The engine uses LZ77 sliding-window match finding with hash chains, Huffman entropy coding, and delta-coded tree serialization matching the original UC2 format exactly. New files: - lib/src/compress.c: LZ77+Huffman compressor (~950 lines) - lib/src/uc2_internal.h: shared constants, types, checksums - lib/src/uc2_tables.c: vval/ivval delta tables, default Huffman tree - tests/src/test_roundtrip.c: compress→archive→decompress→verify tests Key details: - 4 compression levels (Fast/Normal/Tight/Ultra) with tunable search - Lazy evaluation for better match selection at higher levels - Delta-coded Huffman tree serialization with RLE - Fletcher/XOR checksum computation - Round-trip test covers 8 patterns × 4 levels (32 test cases) Fixed 28 errors in the hand-computed ivval inverse delta table (rows 9-13) that caused the decompressor to reconstruct wrong Huffman trees from compressor output.	2026-03-12 00:47:19 -04:00
Eremey Valetov	9bb8153cef	UC2 v3.0.0-alpha.1: cross-platform revival of UltraCompressor II Decompression MVP based on Jan Bobrowski's portable unuc2/libunuc2. CMake build system targeting Linux (GCC/Clang) with MSVC fallback. Includes original UC2 source by Nico de Vries and unuc2-0.6 for reference.	2026-02-24 13:32:45 -05:00
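
Several of the hardening commits in this log (62a90af101, ad923d7ea0, b86309542d) turn on bounds arithmetic that is easy to get wrong. The following is a minimal sketch of those three patterns, not the repository's code: the names span_fits, range_len_clamped, and offset_to_u32 and the struct byte_range layout are hypothetical, and offset_to_u32 returns an error code where the commit describes tell32() as reporting the format limit and exiting.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a parser range covering [ptr, end). */
struct byte_range {
    const uint8_t *ptr;
    const uint8_t *end;
};

/* Returns 1 if [off, off + clen) lies inside a buffer of len bytes.
 * The naive test `off + clen > len` can wrap around; this form cannot,
 * because len - off is evaluated only once off <= len is established. */
static int span_fits(size_t off, size_t clen, size_t len)
{
    if (off > len || clen > len - off)
        return 0;
    return 1;
}

/* Length of a range, reporting 0 (an empty range) instead of a huge
 * wrapped value if end ever precedes ptr. */
static size_t range_len_clamped(const struct byte_range *r)
{
    if (r->end < r->ptr)
        return 0;
    return (size_t)(r->end - r->ptr);
}

/* Converts a file position to a 32-bit container offset.
 * Returns 0 on success, -1 if pos is negative (e.g. an ftell error)
 * or does not fit in 32 bits. */
static int offset_to_u32(long long pos, uint32_t *out)
{
    if (pos < 0 || (unsigned long long)pos > UINT32_MAX)
        return -1;
    *out = (uint32_t)pos;
    return 0;
}
```

A caller would check span_fits(off, clen, len) before touching buf + off, and treat a zero from range_len_clamped as "nothing left to parse" rather than as an error in its own right.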

31 Commits
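
Commit 7825eb47b2 notes that the dedup map's capacity must be a power of two for its mask-based linear probe to terminate, and that dedup_map_init now rounds the capacity up. A minimal sketch of that idea follows, assuming an open-addressing table in which key 0 marks an empty slot; round_up_pow2 and probe_slot are hypothetical names, not the repository's functions. With a capacity such as 50, the mask 49 has zero bits in the middle, so (i + 1) & mask can only ever reach a subset of slots, which is presumably how the probe came to cycle over occupied slots.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds n up to the next power of two (1 for n == 0).
 * Returns 0 if the result would not fit in size_t. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    if (n > (SIZE_MAX >> 1) + 1)
        return 0; /* not representable */
    while (p < n)
        p <<= 1;
    return p;
}

/* Mask-based linear probe. cap must be a power of two so that
 * (i + 1) & (cap - 1) steps through every slot exactly once.
 * Returns the index holding key, or the first empty slot on the
 * probe path, or cap if the table is full and key is absent. */
static size_t probe_slot(const uint64_t *keys, size_t cap, uint64_t key)
{
    size_t mask = cap - 1;
    size_t i = (size_t)key & mask;
    size_t n;

    for (n = 0; n < cap; n++) {
        if (keys[i] == 0 || keys[i] == key)
            return i;
        i = (i + 1) & mask;
    }
    return cap;
}
```

Bounding the loop by cap is a separate safety net: even with a valid power-of-two capacity, a completely full table would otherwise never yield an empty slot.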