The crawl utility starts a depth-first traversal of the web at the
specified URLs. It stores all JPEG images that match the configured
constraints. Crawl is fairly fast and allows for graceful termination.
After terminating crawl, it is possible to restart it at exactly the
same spot where it was terminated. Crawl keeps a persistent database
that allows multiple crawls without revisiting sites.

The main reason for writing crawl was the lack of simple open-source
web crawlers. Crawl is only a few thousand lines of code and fairly
easy to debug and customize.

Features

+ Saves encountered JPEG images
+ Image selection based on regular expressions and size constraints
+ Resume previous crawl after graceful termination
+ Persistent database of visited URLs
+ Very small and efficient code
+ Supports robots.txt
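The core idea behind crawl can be sketched as a depth-first traversal with a persistent visited set: because every URL is recorded before it is expanded, a later run handed the same set simply skips finished work and resumes where the previous run stopped. The sketch below is a hypothetical illustration in Python, not crawl's actual implementation; the `fetch` callback, the in-memory toy site, and the size/pattern parameters are all assumptions made for the example.

```python
import re

def crawl(start_urls, fetch, visited, min_size=0, max_size=10**7,
          pattern=r"\.jpe?g$"):
    """Depth-first crawl. `fetch(url)` returns (links, image_bytes_or_None).

    `visited` is passed in by the caller; persisting it between runs is
    what makes resuming a terminated crawl possible.
    """
    saved = {}
    rx = re.compile(pattern, re.IGNORECASE)
    stack = list(start_urls)
    while stack:                      # explicit stack gives depth-first order
        url = stack.pop()
        if url in visited:            # never revisit across runs
            continue
        visited.add(url)
        links, image = fetch(url)
        # Keep the image only if the URL matches the pattern and the
        # size falls inside the configured bounds.
        if image is not None and rx.search(url) \
                and min_size <= len(image) <= max_size:
            saved[url] = image        # the real tool would write to disk
        stack.extend(links)
    return saved

# Toy in-memory "web" standing in for real HTTP fetches.
site = {
    "http://a/":      (["http://a/x.jpg", "http://a/b/"], None),
    "http://a/b/":    (["http://a/x.jpg"], None),
    "http://a/x.jpg": ([], b"\xff\xd8" + b"\x00" * 100),
}
visited = set()
images = crawl(["http://a/"], lambda u: site[u], visited)
```

Running the same call again with the surviving `visited` set returns immediately, which mirrors how a persistent database lets multiple crawls avoid revisiting sites.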