Text::CSV provides facilities for the composition and decomposition of
comma-separated values. An instance of the Text::CSV class can combine
fields into a CSV string and parse a CSV string into fields.
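The combine/parse round trip described above can be illustrated in Python
with the standard csv module. This is only a sketch of the concept and not
Text::CSV's Perl API; the helper names combine and parse are hypothetical
and exist only in this example.

```python
import csv
import io


def combine(fields):
    """Join a list of fields into one CSV line, quoting where needed."""
    buf = io.StringIO()
    # An empty lineterminator keeps the result a bare line without a newline.
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()


def parse(line):
    """Split one CSV line back into its list of fields."""
    return next(csv.reader([line]))


original = ["a", "b,c", 'say "hi"']
line = combine(original)     # fields containing commas or quotes get quoted
round_trip = parse(line)     # parsing recovers the original fields
```

Fields that contain the delimiter or a quote character are the interesting
case: they have to be quoted on the way out and unquoted on the way back in.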
from eric olsen
A Pure-Python library built as a PDF toolkit. It is capable of:
* extracting document information (title, author, ...),
* splitting documents page by page,
* merging documents page by page,
* cropping pages,
* merging multiple pages into a single page,
* encrypting and decrypting PDF files.
from Daniel Dickman
entity management like textproc/sp (which it can supersede) or
textproc/lq-sp.
Provides onsgmls/osgmlnorm/ospam/spcat/ospent/osx binaries which were
previously in openjade-1.3 package, so @conflict with it.
ok ajacoutot@ jasper@
NLTK, the Natural Language Toolkit, is a suite of open source Python
modules, data sets and tutorials supporting research and development in
Natural Language Processing.
A substantial amount of documentation about how to use NLTK, including a
textbook and API documentation, is available from the NLTK website:
http://www.nltk.org/
This module analyses English text in either a string or file. Totals are
then calculated for the number of characters, words, sentences, blank
and non blank (text) lines and paragraphs.
Three common readability statistics are also derived: the Fog, Flesch
and Kincaid indices.
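As a rough illustration of the three indices, the sketch below uses the
commonly published formulas for Gunning Fog, Flesch reading ease and
Flesch-Kincaid grade level. The syllable counter is a crude vowel-group
heuristic assumed for this example; how the module itself counts words,
sentences and syllables is not stated here and may differ.

```python
import re


def count_syllables(word):
    """Crude estimate: one syllable per run of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text):
    """Return the Fog, Flesch and Kincaid scores for non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" words are taken to be those with three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "fog": 0.4 * (words_per_sentence + 100.0 * complex_words / len(words)),
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


scores = readability("The cat sat on the mat. It was happy.")
```

Short sentences built from short words push Flesch up and push Fog and
Kincaid down, which is the direction all three indices are meant to move for
easier text.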
CVE-2009-3603
CVE-2009-3604
CVE-2009-3605
CVE-2009-3606
CVE-2009-3608
CVE-2009-3609
Official patch from xpdf developers integrated into build.
OK kili@
Prawn: Fast, Nimble PDF Generation For Ruby
Prawn takes the pain out of generating beautiful printable documents,
while still remaining fast, tiny and nimble. It is also named after a
majestic sea creature, and that has to count for something.
It supports UTF-8, image embedding, flexible table drawing and has a
simplified content positioning.
provided by the C preprocessor to be used with any file type. filepp supports
the usual C preprocessor keywords and more.
WWW: http://www.cabaret.demon.co.uk/filepp/
ok kili@
The Template-GD distribution provides a number of Template Toolkit
plugin modules to interface with Lincoln Stein's GD modules. These in
turn provide an interface to Thomas Boutell's GD graphics library.
from Stephan A. Rickauer (MAINTAINER), with a few minor tweaks by me
Treetop is a Ruby-based DSL for text parsing and interpretation.
It facilitates an extension of the object-oriented paradigm called
syntax-oriented programming.
ok bernd@
Namazu is a full-text indexer/search engine intended for easy use.
Not only does it work as a small or medium scale Web search engine,
but also as a personal search system for email or other files.
It provides a CGI interface for web searches, and a command-line
search tool. Third-party frontends are available such as namazu.el
and Wanderlust on Emacs and Tknamazu on X Window System.
Filters enable namazu to index various formats of files. Some are
standalone (e.g. Mail/News); others require external dependencies.
Libtextcat is a library with functions that implement the classification
technique described in Cavnar & Trenkle, "N-Gram-Based Text
Categorization". It was primarily developed for language guessing, a
task on which it is known to perform with near-perfect accuracy.
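The technique can be sketched in a few lines of Python: build a ranked
n-gram profile for each language and for the document, then pick the
language whose profile has the smallest "out-of-place" rank distance. This
is an illustrative sketch of the Cavnar & Trenkle idea and not libtextcat's
C API; the function names, the n-gram length limit and the profile size are
assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the n-grams (length 1..max_n) ranked by descending frequency."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"          # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))


def guess_language(text, samples):
    """Return the key of the sample text whose profile is closest to text."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(s) for lang, s in samples.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der schnelle braune fuchs springt ueber den faulen hund",
}
guess = guess_language("the dog and the fox", samples)
```

Real use needs far larger training samples than the two toy sentences above;
they are only there to make the example self-contained.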
Based on the FreeBSD port.
This is a prerequisite for pinot.
Hunspell is a spell checker and morphological analyzer library and
program designed for languages with rich morphology and complex word
compounding or character encoding.
Note that this is not to be considered as an aspell replacement just
yet. We install no hunspell dictionaries for now but use the ones from
Mozilla.
Reworked from an original port by Edd Barrett (maintainer).
Tested by sthen@ in a bulk, thanks!
ok sthen@
This module parses a query string into a data structure to be handled
by external search engines. For examples of such engines, see
File::Tabular and Search::Indexer.
The query string can contain simple terms, "exact phrases", field
names and comparison operators, '+/-' prefixes, parentheses,
and boolean connectors.
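To give a feel for the kind of data structure such a parser produces, here
is a deliberately small Python sketch. It is not the module's API and covers
only terms, quoted phrases, field names and '+/-' prefixes; comparison
operators, parentheses and boolean connectors are left out. The function
name and the shape of the result are assumptions made for this example.

```python
import shlex


def parse_query(query):
    """Group query terms by prefix: '+' required, '-' excluded, '' optional.

    Each entry is a (field, value) pair; field is None when no field name
    was given.
    """
    parsed = {"+": [], "-": [], "": []}
    for token in shlex.split(query):       # shlex keeps "quoted phrases" whole
        prefix = ""
        if len(token) > 1 and token[0] in "+-":
            prefix, token = token[0], token[1:]
        field, sep, value = token.partition(":")
        if not sep:                         # no "field:" part present
            field, value = None, token
        if value:
            parsed[prefix].append((field, value))
    return parsed


result = parse_query('+perl -python title:"text csv" parser')
```

A search backend can then walk the three lists and translate each pair into
its own query language.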
from Ian Mcwilliam (MAINTAINER)
Catfish is a handy file searching tool for Linux and Unix. Basically it
is a frontend for different search engines (daemons) which provides a
unified interface. The interface is intentionally lightweight and
simple, using only GTK+2. You can configure it to your needs by using
several command line options.
ok ajacoutot@
Meld is a visual diff and merge tool. You can compare two or three files
and edit them in place (diffs update dynamically). You can compare two
or three folders and launch file comparisons. You can browse and view a
working copy from popular version control systems such as CVS,
Subversion, Bazaar-ng and Mercurial if the corresponding commands are
installed.
ok ajacoutot@ wcmaier@
disabled for now. "i'm stunned by the quality and that it doesn't
choke on a recent document[0] where xpdf had issues with" simon@
(who also helped tracking down the key bindings, thanks!).
Fitz is a project to create a new and modern graphics library.
At the core of Fitz is the display tree: a scene graph of vector
graphics, images and text making up the contents of a page.
The standard components of Fitz are:
* Base runtime (thin memory and error handling layer)
* Streams and filters (standard postscript, pdf and tiff filters)
* World model (display trees and resources)
* Drawing (draw the tree to a bitmap raster)
MuPDF is a PDF parser that reads PDF files and creates Fitz trees.
MuPDF also has an API to modify internal objects in the PDF files
and write PDF files. For instance, it is possible to use the MuPDF
library to encrypt existing PDF files, or to rearrange the pages.
pdftool is a command-line demo of this functionality; it is a portable
PDF Swiss Army knife for fixing broken PDF files, changing permissions,
merging and extracting pages, and examining the internal object
structure of a PDF file.
The mupdf binary (aka pdfview) is a bare bones PDF viewer.
fcbanner is a variant on the banner program that uses fontconfig
and freetype to draw its characters. Thus, it can easily draw using
various fonts - any font you can get in Gnome or Mozilla for example
- and handles non-ASCII characters if they are present in the font.
See http://rhn.redhat.com/errata/RHSA-2009-0430.html for details.
Also, fix license marker, update plists and simplify the pkgname
(dropping the pl, which seems to confuse bsd.port.mk's update
target).
ok naddy@, who had almost the same diff
odt2txt is a simple converter from OpenDocument Text to plain text. It's
a command-line tool which extracts the text out of OpenDocument Texts
produced by OpenOffice.org, StarOffice, KOffice and others.
This is David Loren Parsons's implementation of John Gruber's Markdown
text-to-HTML language. There's not much here that differentiates it from
any of the existing Markdown implementations except that it's written in
C instead of one of the vast flock of scripting languages that are
fighting it out for the Perl crown.
Markdown provides a library that gives you formatting functions suitable
for marking down entire documents or lines of text, a command-line
program that you can use to mark down documents interactively or from a
script, and a tiny suite of example programs that show how to fully
utilize the markdown library.
Sphinx is a tool that makes it easy to create intelligent and
beautiful documentation for Python projects (or other documents
consisting of multiple reStructuredText sources), by Georg Brandl.
It was originally created to translate the new Python documentation,
but has now been cleaned up in the hope that it will be useful to
many other projects.
input, testing, ok fgs@
- explicitly add build_depends on rarian where gnome-doc-utils is also a
build dependency as it does not itself run_depends on rarian anymore
This was the 2nd and hopefully last pass of rarian/scrollkeeper cleaning.
discussed with jasper@
* patch from gentoo so that we don't need to run_depends on gnugetopt
* remove obsolete + add missing configure switches
* fix print_usage() output
* fix typo in substituted file
* add @sample /var/db/rarian/
* /var/log/rarian.log* is not supposed to exist
Regexp::Assemble takes an arbitrary number of regular expressions and
assembles them into a single regular expression (or RE) that matches all
that the individual REs match.
As a result, instead of having a large list of expressions to loop over,
a target string only needs to be tested against one expression. This is
interesting when you have several thousand patterns to deal with.
Serious effort is made to produce the smallest pattern possible.
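The benefit of a single assembled pattern can be shown with a small Python
sketch that merges literal strings through a prefix trie. This is a much
simpler stand-in and not Regexp::Assemble's algorithm: it accepts only
literal strings rather than arbitrary regular expressions, and the function
name assemble is hypothetical.

```python
import re


def assemble(words):
    """Build one regex source string matching exactly the given literals."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}                       # end-of-word marker

    def to_regex(node):
        ends_here = "" in node
        alts = [re.escape(ch) + to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
        if not alts:
            return ""
        if len(alts) == 1 and not ends_here:
            body = alts[0]
        else:
            body = "(?:" + "|".join(alts) + ")"
        # If a word ends at this node, everything after it is optional.
        return body + "?" if ends_here else body

    return to_regex(trie)


pattern = re.compile(assemble(["cat", "car", "cart", "dog"]))
```

Shared prefixes such as "ca" are emitted once, so a target string is scanned
against one compact pattern instead of being tested against each literal in
turn.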
HTMLEntities is a simple library to facilitate encoding and decoding
of named (&yacute; and so on) or numerical (&#123; or &#x12a;)
entities in HTML and XHTML documents.
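For readers unfamiliar with the two entity styles, the following Python
sketch shows the same encode/decode round trip using the standard html
module. It illustrates the concept only and is not the HTMLEntities Ruby
API.

```python
import html

# Decoding: one named entity, one decimal and one hexadecimal reference.
decoded = html.unescape("&yacute; &#123; &#x12a;")

# Encoding: characters that are special in HTML become entities.
source = "<a href='x'>T&C</a>"
encoded = html.escape(source)

# Decoding the encoded text gives back the original string.
restored = html.unescape(encoded)
```

Note that html.escape only encodes the markup-significant characters (&, <,
> and quotes); it does not turn letters such as an accented y into named
entities.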