UC2 Roadmap
Phase 1: Decompression MVP (DONE)
- Port Bobrowski's libunuc2 decompression engine
- CLI tool with list/extract/test/pipe modes
- CMake build system (Linux, MSVC fallback for super.bin)
- Win32 compat layer carried over
- Tagged v3.0.0-alpha.1
Phase 2: Original Compression Engine (DONE)
- Port LZ77+Huffman compressor from ULTRACMP.CPP, TREEGEN.CPP, TREEENC.CPP
- Write as the inverse of the decompressor (Bobrowski's code is the spec)
- Compression levels 2=Fast, 3=Normal, 4=Tight, 5=Ultra
- CLI create mode (uc2 -w), compression level flag (-L)
- SuperMaster dictionary support (built-in 49 KB dictionary)
- Round-trip testing: 37 unit tests + CLI integration tests
- Round-trip testing vs original uc2pro.exe in DOSBox (Direction: original creates -> UC2 v3 extracts -- verified. Reverse direction is a known limitation: the original UC2 Pro cannot read UC2 v3 archives due to compression bitstream differences.)
- Backward compatibility with original UC2 Pro (listing + extraction verified for multi-file archives in both directions in automated DOSBox-X test).
- Custom Huffman tree optimization: use default tree for first small block (< 256 ibuf entries), custom trees for larger blocks. Matches the original's bFlag logic. 37% compression improvement on text data while maintaining backward compat.
- UC2 personality: status messages continuing the original's tradition ("Everything went OK", compression level names, "Fast, reliable and superior compression"). Suppressed by -q.
Phase 3: Modernized Master-Block Deduplication
UC2's signature feature from 1992, ahead of its time. Modernize into something no mainstream archiver offers.
- Content-fingerprint file grouping (FNV-1a hash of first 4096 bytes)
- Custom master-block generation from largest file in each group
- MASMETA central directory records with full metadata
- Masters compressed with SuperMaster, files compressed with custom master
- CLI integration test validating master deduplication round-trip
- Content-defined chunking (CDC) with Gear rolling hash (uc2_cdc.h): chunker library + integration into archive creation. Files sharing content at ANY position (not just identical prefixes) are now grouped for master-block dedup.
- Merkle DAG of deduplicated blocks (uc2_merkle.h): content-addressable chunk trees with 64-bit FNV-1a hashes, structural similarity comparison, single-byte-change resilience. 8 unit tests including partial overlap detection.
- Cross-archive dedup via shared block store (uc2_blockstore.h): content-addressable chunk storage with two-level directory layout, dedup statistics, read-back verification. 6 unit tests including cross-archive dedup scenario.
- Near-duplicate detection via SimHash (uc2_simhash.h): 64-bit locality-sensitive fingerprint with Hamming distance, detects patched executables (16 changed bytes in 8KB: dist <= 8). 6 unit tests.
- Delta compression (uc2_delta.h): binary diff with COPY/INSERT instructions, hash-based source matching. 96-byte patch in 16KB file -> >50% size savings. 6 unit tests including round-trip.
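The chunking and hashing items above can be illustrated with a minimal Python sketch. It is not the code in uc2_cdc.h: the Gear table seed, the size limits, and the names fnv1a64, gear_chunks and shared_chunk_hashes are assumptions made for illustration. Only the FNV-1a constants are the standard published ones. The sketch shows why content-defined boundaries let two files share chunks even when the shared content sits at different offsets.

```python
import random

FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime
MASK64 = (1 << 64) - 1


def fnv1a64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV64_PRIME) & MASK64
    return h


# Gear table: 256 pseudo-random 64-bit values. The seed is arbitrary (an
# assumption of this sketch); it only has to be the same for every run.
_rng = random.Random(0x554332)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def gear_chunks(data: bytes, min_size=256, avg_bits=10, max_size=4096):
    """Cut where the top avg_bits bits of the rolling Gear hash are all zero."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Each step shifts older bytes out, so after 64 bytes the hash
        # depends only on the last 64 bytes of content.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        at_boundary = size >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunk_hashes(a: bytes, b: bytes) -> set:
    """Chunk hashes present in both inputs (candidates for dedup)."""
    return {fnv1a64(c) for c in gear_chunks(a)} & {fnv1a64(c) for c in gear_chunks(b)}
```

Prepending a few bytes to a file shifts every fixed-size block, but the Gear boundaries depend only on nearby content, so most chunks after the insertion point keep their hashes.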
Phase 4: Modern Compression Backends
Pluggable algorithms behind new method IDs; original Method 4 kept for backward compatibility.
- rANS entropy coder (uc2_rans.h) integrated into archive format as method 10. Levels 6-9 use rANS (vs 2-5 Huffman). 32-bit table-based rANS, <5% overhead vs Shannon entropy. End-to-end round-trip verified (create/list/extract/verify).
- zstd-inspired dictionary compression (uc2_dict.h): formal dictionary metadata with content-hash IDs, integrity checksums, serialization format, and cross-archive sharing via block store. 6 unit tests including round-trip and corruption detection.
- LZ4 ultra-fast mode (uc2_lz4.h): single-probe hash table, O(1) match finding, 4-byte minimum match, variable-length literal/match token encoding. 6 unit tests including text, binary, all-same, incompressible, and small inputs.
- Content-aware preprocessing (uc2_preprocess.h): BCJ (E8/E9 x86 address normalization), BWT (Burrows-Wheeler for text), delta filter (byte-wise with configurable stride), automatic content detection (text/x86/structured/binary). 11 unit tests.
- Built-in benchmark mode (uc2 -B files...): tests all 8 Huffman/rANS levels plus LZ4, reports compressed size, ratio, and timing.
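The byte-wise delta filter with configurable stride mentioned in the preprocessing item can be sketched as follows. This is an illustration of the general technique, not the code in uc2_preprocess.h; the function names delta_encode and delta_decode are assumptions.

```python
def delta_encode(data: bytes, stride: int = 1) -> bytes:
    """Replace each byte with its difference from the byte `stride` positions back."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    out = bytearray(data)
    for i in range(stride, len(data)):
        # Read from the untouched input so earlier writes do not leak in.
        out[i] = (data[i] - data[i - stride]) & 0xFF
    return bytes(out)


def delta_decode(data: bytes, stride: int = 1) -> bytes:
    """Inverse of delta_encode: running sum per stride lane, modulo 256."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    out = bytearray(data)
    for i in range(stride, len(out)):
        # out[i - stride] is already decoded at this point.
        out[i] = (out[i] + out[i - stride]) & 0xFF
    return bytes(out)
```

With the stride set to the record width (2 for 16-bit samples, 4 for 32-bit values), slowly changing structured data turns into long runs of small, repeated byte values, which the entropy coder behind the filter handles far better than the raw input.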
Phase 5: Quantum-Resistant Encryption
No mainstream archiver offers post-quantum encryption.
- CRYSTALS-Kyber (standardized by NIST as ML-KEM, FIPS 203) for key encapsulation, pure C (PQClean project, public domain)
- AES-256-GCM for authenticated payload encryption
- Hybrid mode: classical ECDH + Kyber for transition period
- Passphrase-based key derivation via Argon2
- Per-file selective encryption within archives
- Plausible deniability: multi-archive-in-one with separate passwords. Each password decrypts a different archive layer. Under hostile pressure, revealing one password gives access to a decoy layer while the real archive remains hidden and indistinguishable from random padding. (Inspired by VeraCrypt hidden volumes.)
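The roadmap does not say how the hybrid mode would combine the classical ECDH secret with the Kyber secret. One common pattern, shown here purely as an assumption, is to concatenate both shared secrets and run them through HKDF-SHA256 (RFC 5869), so the payload key stays safe as long as either exchange remains unbroken. The names hkdf_sha256 and derive_hybrid_key and the info label are hypothetical.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF-SHA256")
    # Extract step. RFC 5869: an absent salt is treated as HASH_LEN zero bytes.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand step: T(n) = HMAC(PRK, T(n-1) || info || n).
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_hybrid_key(ecdh_secret: bytes, kyber_secret: bytes, salt: bytes) -> bytes:
    """Hypothetical combiner: both shared secrets feed one 256-bit payload key."""
    info = b"uc2 hybrid kem sketch"  # illustrative label, not a defined constant
    return hkdf_sha256(ecdh_secret + kyber_secret, salt, info, 32)
```

The resulting 32-byte key is the right size for AES-256-GCM. A real design would also bind the ciphertexts and public keys into the derivation. That is left out here to keep the sketch short.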
Phase 6: DOS / FreeDOS / Retro-Computing
- DJGPP cross-compilation toolchain: cmake/djgpp.cmake builds uc2.exe against the prebuilt DJGPP gcc 7.2 / 12.2 from andrewwutw/build-djgpp. Output is a 32-bit DPMI DOS executable (MZ + COFF + go32 stub). See cmake/README-djgpp.md for the one-time setup (CPATH unset is required on hosts that export it).
- DOSBox-X smoke test: tests/scripts/dos_smoke.sh runs uc2 -h and uc2 -l <archive> under DOSBox-X via the flatpak; verifies the cross-compiled binary actually loads under a real DPMI host. Real vintage hardware test still pending.
- Method 80 (Turbo) support
- Multi-volume archive spanning across physical media (floppies)
- Self-extracting archives per platform (DOS COM/EXE, Linux ELF, Windows PE)
- ANSI art progress display, CP850 codepage handling
- Position as the archiver for retrocomputing preservation: disk images, ROM collections, BBS archive redistribution
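The CP850 codepage handling item comes down to decoding DOS-era filename bytes before they reach a UTF-8 filesystem. A minimal sketch, using Python's built-in cp850 codec. The helper name dos_name_to_utf8 and the NUL-termination handling are assumptions, not the project's actual routine.

```python
def dos_name_to_utf8(raw: bytes) -> bytes:
    """Convert a CP850-encoded, possibly NUL-terminated DOS name to UTF-8."""
    # Assumption of this sketch: anything after the first NUL is padding.
    name = raw.split(b"\x00", 1)[0]
    return name.decode("cp850").encode("utf-8")
```

For example, the CP850 bytes 0x81 and 0xE1 are the letters ü and ß, so a name stored as `Gr\x81\xe1e.TXT` comes out as `Grüße.TXT`.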
Phase 7: Cryptographic Integrity & Timestamping
- BLAKE3 content hashing (uc2_blake3.h): pure C implementation, 256-bit digests, incremental and one-shot API, constant-time comparison, tree hashing structure. 7 unit tests including avalanche, incremental-vs-oneshot, and single-byte updates.
- SHA-256 (uc2_sha256.h): pure-C FIPS 180-4 implementation, one-shot and incremental API. 6 unit tests against published test vectors (empty, "abc", 56-byte, 1M 'a', byte-by-byte incremental, every-split-point boundary).
- OpenTimestamps integration (uc2_ots.h): pure-C parser, serializer, and walker for the standard .ots proof format. Append-only sidecar trailer (magic-bracketed, reverse-scan-safe) stores the proof verbatim and preserves backward compatibility with the original UC2 Pro reader. Walker supports the calendar-path subset (APPEND, PREPEND, SHA256); proofs with other crypto ops are accepted as structurally valid but flagged for ots verify follow-up. CLI: --ots-attach, --ots-extract, --ots-info; uc2 -t recomputes archive SHA-256 and verifies the leaf and walk. Strict-canonical-varint parser, 64-bit overflow check, depth-bounded recursion, varbytes cap. 17 unit tests.
- OTS upgrade: fetch the upgraded proof from the calendar after the Bitcoin attestation has been minted (~1-6h), replace the pending-only trailer with the Bitcoin block-header attestation.
- Useful for legal/forensic archiving, software provenance, digital preservation
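The calendar-path walk described in the OpenTimestamps item applies a sequence of operations to the archive digest until it reaches the value the calendar committed to. The sketch below models only the three operations named above. The tuple-based op representation, the UnsupportedOp exception and the function name are assumptions for illustration and do not reflect the binary .ots encoding or the uc2_ots.h API.

```python
import hashlib


class UnsupportedOp(Exception):
    """Raised for ops outside the APPEND / PREPEND / SHA256 subset."""


def walk_calendar_path(leaf: bytes, ops) -> bytes:
    """Apply (op_name, argument) pairs to the leaf digest, in order."""
    msg = leaf
    for op, arg in ops:
        if op == "append":
            msg = msg + arg            # APPEND: message || argument
        elif op == "prepend":
            msg = arg + msg            # PREPEND: argument || message
        elif op == "sha256":
            msg = hashlib.sha256(msg).digest()  # unary op, argument ignored
        else:
            # Mirrors the roadmap's behaviour of flagging such proofs
            # for follow-up with another verifier.
            raise UnsupportedOp(op)
    return msg
```

A verifier would compare the returned bytes with the commitment carried by the attestation at the end of the path. The leaf itself is the SHA-256 of the archive, recomputed at test time.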
Phase 8: Decentralized & Cloud Integration
- IPFS pinning: uc2 --ipfs-pin archive.uc2 to publish, uc2 --ipfs-get <CID> to retrieve
- Content-addressable dedup maps directly to IPFS CIDs; master blocks become sharable across users ("swarm dedup")
- Cloud archiving backend: uc2 --s3 s3://bucket/path for streaming compress-to-cloud with dedup-aware incremental uploads
- Filecoin/Sia for decentralized paid storage (optional)
Phase 9: Zero-Knowledge Proofs (Experimental)
ZK proofs extend the Merkle DAG and encryption layers with privacy-preserving verification. Most valuable for decentralized and compliance scenarios; heavyweight, so implemented as an optional module.
- Prove archive integrity without revealing contents -- ZK proof that the archive's Merkle root matches claimed file hashes, without exposing the tree structure. Enables auditing of IPFS-shared encrypted archives.
- Selective disclosure from encrypted archives -- prove a specific file (by hash) exists in an encrypted archive without decrypting anything else. Useful for collaborative encrypted team archives.
- Verifiable deduplication -- ZK proof that master-block dedup was performed correctly across archives without revealing block contents. Builds trust in distributed dedup without data leaks.
- Compliance proofs -- prove properties ("archive created before date Y", "archive does not contain file with hash H") without revealing contents. For regulatory/legal use cases.
- Implementation: Halo2 or Bulletproofs (no trusted setup) via Rust-to-C wrapper or WASM bridge; compile-time optional module. STARKs preferred over SNARKs for quantum resistance alignment with Phase 5.
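The statement behind selective disclosure ("this file hash is a leaf under that Merkle root") can be shown without any ZK machinery. The sketch below is a plain Merkle inclusion proof, not a zero-knowledge one: the verifier still sees the leaf and the sibling hashes. It assumes SHA-256 with domain-separation prefixes and duplicates the last node on odd levels. uc2_merkle.h uses 64-bit FNV-1a and may be structured differently. All function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(leaves, index):
    """Return (root, proof); proof is a list of (sibling_hash, sibling_is_left)."""
    if not leaves:
        raise ValueError("need at least one leaf")
    # 0x00 / 0x01 prefixes keep leaf hashes and inner-node hashes distinct.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_inclusion(leaf: bytes, proof, root: bytes) -> bool:
    """Recompute the path from the leaf upward and compare with the root."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            node = _h(b"\x01" + sibling + node)
        else:
            node = _h(b"\x01" + node + sibling)
    return node == root
```

A ZK variant would prove knowledge of a valid path like this one while keeping the siblings, and optionally the leaf, hidden.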
ZK Feasibility Notes
ZK adds genuine value for privacy-focused decentralized archiving (Phases 7--8) but is heavyweight for a CLI tool. Pairing-based SNARKs require pairing-friendly curves, and Halo2 and Bulletproofs rest on discrete-log assumptions, so none of them are quantum-resistant; STARKs rely only on hash functions, so they align with the post-quantum direction and need no trusted setup. Proof generation is slow (seconds to minutes for complex circuits), so this is an opt-in feature, not on the critical path. Prototype in a fork first.
Phase 10: Ecosystem Integrations
libarchive plugin
Highest-leverage integration. Adding UC2 read/write support to libarchive
makes .uc2 a first-class format for bsdtar, cmake, pkg(8),
file-roller, Ark, and dozens of other tools across the Linux ecosystem.
- [-] libarchive read handler (decompression/listing): milestones 1-3 shipped -- bid() recognises UC2 magic; read_header() slurps the archive, walks uc2_read_cdir, yields each entry mapped onto archive_entry; read_data() drives uc2_extract through a buffering write callback and yields the result via libarchive's pull API. Memory scales with archive size in v1. Remaining: master-block dependency tracking (M4), seekable adapter (deferred), bsdtar round-trip test (M7), upstream PR (M8).
- libarchive write handler (compression, once Phase 2 is done)
Streaming dedup ingestion
Position UC2 as a deduplicating storage layer that other tools pipe into. No other CLI archiver offers this.
rsync -a /data/ | uc2 --ingest repo.uc2 # dedup on receive
tar cf - /project | uc2 --ingest backup.uc2 # dedup tar stream
cp -a /snapshot/ | uc2 --ingest backup.uc2 # incremental dedup
- uc2 --ingest mode v1: stdin -> CDC -> sidecar blockstore at <archive>.blocks/ -> chunk-hash manifest. uc2 --ingest-restore reverses the round-trip. Tested: small/multichunk round-trip, idempotent dedup on repeat ingest, empty stream, bad-magic rejection. Now legacy: writer defaults to v2.
- uc2 --ingest v2 (default): self-contained archive with the chunk pool embedded inside the archive file itself. No sidecar directory. Manifest entries carry absolute file offsets; duplicate hashes share an offset (intra-call dedup). Cross-archive dedup is not preserved -- the trade-off is the single-file UX. v1 archives still readable for restore.
- uc2 --ingest v3: integrate with master-block archive layout so output is a real UC2 v3 archive consumable by uc2 -x / -l
- Tar-entry preservation: parse tar boundaries inside --ingest so individual files are recoverable as archive entries
- Incremental snapshots: uc2 snapshot /path repo.uc2 (borg/restic-style deduplicating backups without filesystem support)
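The v2 layout described above (one embedded chunk pool, manifest entries that point at offsets, duplicate chunks sharing an offset) can be sketched in a few lines. This is an in-memory illustration under stated assumptions: chunks arrive already split by some chunker, SHA-256 stands in for whatever chunk hash the real format uses, offsets are relative to the pool rather than absolute file offsets, and the names ingest and restore are hypothetical.

```python
import hashlib


def ingest(chunks):
    """Pack chunks into one pool. Returns (pool, manifest of (offset, length))."""
    pool = bytearray()
    offset_by_digest = {}  # chunk digest -> offset of its first copy in the pool
    manifest = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest not in offset_by_digest:
            # First sighting: append the bytes and remember where they live.
            offset_by_digest[digest] = len(pool)
            pool += chunk
        # Repeats only add a manifest entry pointing at the stored copy.
        manifest.append((offset_by_digest[digest], len(chunk)))
    return bytes(pool), manifest


def restore(pool: bytes, manifest) -> bytes:
    """Rebuild the original stream by following the manifest in order."""
    return b"".join(pool[offset:offset + length] for offset, length in manifest)
```

Because the digest table lives only for the duration of one call, dedup is limited to that call. This is the single-file trade-off noted for v2.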
Foreign archive format support
Read (and optionally write) other archive formats, enabling UC2 as a universal archive tool and migration path for legacy collections.
- ZIP read/write (deflate, store; the universal baseline format)
- RAR read (v4/v5; for extraction from existing collections)
- TGZ/tar.gz read/write (tar + gzip; Unix ecosystem staple)
- ISO 9660 read (CD/DVD images; retro-computing preservation)
File manager plugins
Bobrowski already shipped prototypes; update for UC2 v3.
- Midnight Commander VFS plugin (update misc/mc.ext and misc/uuc2)
- Total Commander WCX plugin (update misc/unuc2-wcx.c)
Phase 11: Advanced Features
- Archive-as-filesystem: FUSE mount for .uc2 on Linux (read-only, decompress-on-the-fly with master-block caching)
- Compression tournaments / community challenges
- Neural/learned compression preprocessor (modern platforms only, not DOS -- optional compile-time module)
- Jupyter kernel for interactive archive exploration and compression research (Python, building on foxkernel experience):
  - Rich HTML tables for archive listings with compression ratios
  - Interactive dedup graph visualization (master-block DAG: which files share blocks, space savings)
  - Inline benchmark charts comparing methods/levels (ratio vs speed)
  - Version diff visualization between archive snapshots
  - Huffman tree / ANS state table visualization for algorithm development
Testing Strategy
- Create reference UC2 archives using original uc2pro.exe in DOSBox
- Unit tests: magic detection, Fletcher checksum, CP850->UTF-8
- Integration: extract test archives, compare SHA-256 against manifest
- Phase 2: round-trip (new compress -> old extract in DOSBox, and vice versa)
- Phase 3+: dedup correctness, cross-archive block sharing
- Phase 5: encryption round-trip, key derivation vectors
- Phase 9: ZK proof soundness and completeness
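The integration step "compare SHA-256 against manifest" can be sketched as below. The manifest layout (one `<hex digest>  <name>` pair per line, as written by sha256sum) and the helper names are assumptions. The real test scripts may store expected digests differently.

```python
import hashlib


def parse_manifest(text: str) -> dict:
    """Parse sha256sum-style lines into {name: lowercase hex digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        digest, name = line.split(None, 1)
        expected[name] = digest.lower()
    return expected


def find_mismatches(extracted: dict, manifest_text: str) -> list:
    """Names whose extracted bytes are missing or hash differently."""
    bad = []
    for name, digest in sorted(parse_manifest(manifest_text).items()):
        data = extracted.get(name)
        if data is None or hashlib.sha256(data).hexdigest() != digest:
            bad.append(name)
    return bad
```

An empty result means every listed file was extracted bit-for-bit. Here `extracted` maps names to file contents. A test harness would fill it by reading the extraction directory.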
Maintenance Log
- 2026-06-11: Fixed the rANS (L6-9) extraction crash and >64KB silent corruption (git-bug d747658, closed): master COMPRESS records now carry the real method (10 at L6-9); the rANS decoder consumes the EOB pair instead of desyncing the bit cursor; bits_feed handles short reads without overrunning its buffer; compressor chunk loads and rANS output flushing respect the 64KB circular-window edge. Found debugging extraction on sdf.org (NetBSD 10) but reproducible everywhere. New regression test: cli_bigfile. Follow-up filed: bf73896 (ftell offsets >4GB truncate silently; P2).
- 2026-06-13: DOS build now has CI coverage (DJGPP v3.4 toolchain, sha-pinned; builds uc2.exe via cmake/djgpp.cmake; git-bug 9379647). Consolidated the two DJGPP toolchain files onto djgpp.cmake and removed the redundant djgpp-toolchain.cmake.
- 2026-06-13: Damaged-archive decode hardening (git-bug f049d6d): decompress_block match-length overflow guard (runtime check replacing an NDEBUG assert), decompress_cdir end-bounding, and a CLI handle/FILE leak fix on the cdir-error path. A prefix-sweep fuzzer drove the fixes; a residual rare cdir-parser OOB it surfaces is tracked for a systematic hardening + fuzzing pass (git-bug 69e8e52).
- 2026-06-13: Security task-qa + fixes. A libFuzzer harness (tests/fuzz/) found a heap overflow in the damaged-cdir parse path (fixed, 69e8e52); also fixed Zip-Slip extraction, decoder bounds (tree/LZ/delta), and allocation-overflow guards. v3.0.0-alpha.3 tagged. Residual decompression-bomb DoS tracked (b8f933c).
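For readers unfamiliar with the Zip-Slip class of bug mentioned in the last entry: an archive entry named something like `../../etc/cron.d/x` escapes the extraction directory unless the resolved path is checked. The sketch below shows the general shape of such a guard. It is not the fix that was applied to the C code, and the function name safe_extract_path is hypothetical.

```python
import os


def safe_extract_path(dest_dir: str, entry_name: str) -> str:
    """Resolve entry_name under dest_dir, refusing anything that escapes it."""
    # DOS-era archives may use backslashes as separators.
    name = entry_name.replace("\\", "/")
    dest_root = os.path.realpath(dest_dir)
    # An absolute entry name makes join() discard dest_root; the check
    # below catches that case as well as ".." traversal.
    target = os.path.realpath(os.path.join(dest_root, name))
    if os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError("entry escapes extraction directory: %r" % entry_name)
    return target
```

The important detail is comparing fully resolved paths rather than scanning the name for `..`, since symlinks and mixed separators can hide a traversal from a simple string check.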