WeasyPrint is a smart solution helping web developers to create PDF
documents. It turns simple HTML pages into gorgeous statistical
reports, invoices, tickets...
From a technical point of view, WeasyPrint is a visual rendering engine
for HTML and CSS that can export to PDF. It aims to support web
standards for printing.
It is based on various libraries but not on a full rendering engine
like WebKit or Gecko. The CSS layout engine is written in Python,
designed for pagination, and meant to be easy to hack on.
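
As a minimal sketch of the HTML-to-PDF workflow described above: the
helper names invoice_html and write_invoice_pdf are hypothetical, and
the only WeasyPrint call assumed is HTML(string=...).write_pdf(...).
The import is done lazily so the HTML helper can be used on its own.

```python
from html import escape


def invoice_html(customer, items):
    """Build a small HTML invoice; items is a list of (label, price) pairs."""
    rows = "".join(
        f"<tr><td>{escape(label)}</td><td>{price:.2f}</td></tr>"
        for label, price in items
    )
    total = sum(price for _, price in items)
    return (
        # Paged-media CSS: page size and margins are set with @page.
        "<html><head><style>@page { size: A4; margin: 2cm }</style></head>"
        f"<body><h1>Invoice for {escape(customer)}</h1>"
        f"<table>{rows}<tr><td>Total</td><td>{total:.2f}</td></tr></table>"
        "</body></html>"
    )


def write_invoice_pdf(path, customer, items):
    """Render the invoice to a PDF file (requires WeasyPrint)."""
    # Imported here so invoice_html() works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=invoice_html(customer, items)).write_pdf(path)
```

Calling write_invoice_pdf("invoice.pdf", "Example Ltd", [("Widget", 2.5)])
would write a one-page PDF, assuming WeasyPrint is installed.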
www: https://weasyprint.org/
ok sthen@ semarie@
fontTools is a library for manipulating fonts, written in Python. The
project includes the TTX tool, which can convert TrueType and OpenType
fonts to and from an XML text format, which is also called TTX. It
supports TrueType, OpenType, AFM and to an extent Type 1 and some
Mac-specific formats.
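
The font-to-XML conversion can also be driven from Python. The sketch
below is illustrative only: ttx_path and dump_to_ttx are hypothetical
helper names, and it assumes fontTools' TTFont class and its saveXML
method are available.

```python
from pathlib import Path


def ttx_path(font_path):
    """Return the path of the TTX dump that sits next to the font file."""
    return Path(font_path).with_suffix(".ttx")


def dump_to_ttx(font_path):
    """Write an XML (TTX) dump of a TrueType/OpenType font (requires fontTools)."""
    # Imported here so ttx_path() works without fontTools installed.
    from fontTools.ttLib import TTFont

    out = ttx_path(font_path)
    font = TTFont(font_path)
    font.saveXML(str(out))
    return out
```

On the command line, running the ttx tool on a font file does the same
job.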
ok sthen@ semarie@
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to
be searched or copy+pasted.
- Generates a searchable PDF/A file from a regular PDF
- Places OCR text accurately below the image to ease copy / paste
- Keeps the exact resolution of the original embedded images
- When possible, inserts OCR information as a "lossless" operation
  without disrupting any other content
- Optimizes PDF images, often producing files smaller than the input file
- If requested, deskews and/or cleans the image before performing OCR
- Validates input and output files
- Distributes work across all available CPU cores
- Uses Tesseract OCR engine to recognize more than 100 languages
  (use "pkg_info -Q tesseract" to locate language packs to install)
- Keeps your private data private
- Scales properly to handle files with thousands of pages
- Battle-tested on millions of PDFs
ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
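
Since the program is scriptable, the invocation above can be assembled
from Python. This is a sketch under stated assumptions: ocrmypdf_argv
and run_ocr are hypothetical helper names, and only the flags listed
above are used.

```python
import shutil
import subprocess


def ocrmypdf_argv(src, dst, languages=("eng", "fra"), title=None, jobs=None):
    """Build the argument list for the ocrmypdf invocation shown above."""
    argv = ["ocrmypdf", "-l", "+".join(languages), "--rotate-pages", "--deskew"]
    if title is not None:
        argv += ["--title", title]
    if jobs is not None:
        argv += ["--jobs", str(jobs)]
    argv += ["--output-type", "pdfa", src, dst]
    return argv


def run_ocr(src, dst, **options):
    """Run ocrmypdf, failing early if the program is not on PATH."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(ocrmypdf_argv(src, dst, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with
values such as the title.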
(No update to a newer poppler, because it requires C++17, which grows
tentacles and breaks ports depending on poppler and exiv2 because the
latter still uses auto_ptr)
files were dropped (mostly entry_points.txt) or .egg-info files changed
to directories. Small patches were needed where some other build systems
were calling Python tools to install due to changes in setuptools.
Messy patching needed for games/0ad which bundles a spidermonkey tar of
a specific version and patches it using files in its own distribution.
Been through a bulk on i386, plus I tested a few things separately on
amd64 where fallout from the recent qscintilla update has broken some ports
on !LP64 which was blocking them on i386.