crawl-0.1b import - provos@ ok

The crawl utility starts a depth-first traversal of the web at the
specified URLs.  It stores all JPEG images that match the configured
constraints. Crawl is fairly fast and allows for graceful termination. 
After terminating crawl, it is possible to restart it at exactly the
same spot where it was terminated. Crawl keeps a persistent database
that allows multiple crawls without revisiting sites.

The main reason for writing crawl was the lack of simple open source
web crawlers. Crawl is only a few thousand lines of code and fairly
easy to debug and customize. 

Features

+ Saves encountered JPEG images 
+ Image selection based on regular expressions and size constraints
+ Resume previous crawl after graceful termination 
+ Persistent database of visited URLs 
+ Very small and efficient code 
+ Supports robots.txt
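
Once installed, crawl is started by pointing it at one or more seed URLs on
the command line. A minimal invocation sketch (the URL is only a placeholder;
see crawl(1) for the available options and image constraints):

  # start a depth-first crawl at a single seed URL; JPEG images that
  # match the configured constraints are saved, and the persistent
  # database of visited URLs lets an interrupted crawl be resumed
  $ crawl http://www.example.com/
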
Author:  obecian  2001-09-09 21:57:12 +00:00
Commit:  4b01747aa5 (parent d5f20fb692)
6 changed files with 92 additions and 0 deletions

net/crawl/Makefile (new file, 26 lines)

@@ -0,0 +1,26 @@
# $OpenBSD: Makefile,v 1.1.1.1 2001/09/09 21:57:12 obecian Exp $

COMMENT= "small and efficient HTTP crawler"

DISTNAME= crawl-0.1b
CATEGORIES= net
NEED_VERSION= 1.406

HOMEPAGE= http://www.monkey.org/~provos/crawl/

MAINTAINER= Mark Grimes <obecian@openbsd.org>

PERMIT_PACKAGE_CDROM= Yes
PERMIT_PACKAGE_FTP= Yes
PERMIT_DISTFILES_CDROM= Yes
PERMIT_DISTFILES_FTP= Yes

MASTER_SITES= http://www.monkey.org/~provos/

BUILD_DEPENDS= ${LOCALBASE}/lib/libevent.a::devel/libevent

CONFIGURE_STYLE= autoconf
WRKDIST= ${WRKDIR}/crawl

.include <bsd.port.mk>
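
With this Makefile in place the port builds like any other OpenBSD port. A
typical sequence, assuming the standard ports tree layout:

  $ cd /usr/ports/net/crawl
  $ make install        # fetch, verify, patch, build and install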

net/crawl/files/md5 (new file, 3 lines)

@@ -0,0 +1,3 @@
MD5 (crawl-0.1b.tar.gz) = 3809c8b13fb5d629a799a1972185500a
RMD160 (crawl-0.1b.tar.gz) = 6bbf94508728632cd604317e03a25f50aa9930bb
SHA1 (crawl-0.1b.tar.gz) = 033e8dcbe4eddfa2b721c77ca1496261787c4790
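
These checksums are verified against the downloaded distfile before the
build. Assuming the usual bsd.port.mk targets, they can be checked or
regenerated as follows:

  $ cd /usr/ports/net/crawl
  $ make checksum       # verify the fetched distfile against files/md5
  $ make makesum        # regenerate files/md5 after a distfile change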

(new file, 28 lines)

@@ -0,0 +1,28 @@
--- Makefile.in.orig Sun Jul 22 17:24:41 2001
+++ Makefile.in Sun Jul 22 17:26:44 2001
@@ -3,12 +3,13 @@
srcdir = @srcdir@
VPATH = @srcdir@
-install_prefix =
prefix = @prefix@
exec_prefix = @exec_prefix@
bindir = @bindir@
mandir = @mandir@
+DESTDIR =
+
CC = @CC@
CFLAGS = -Wall @CFLAGS@ @USRINCLUDE@ -I$(srcdir) \
-I$(srcdir)/missing @DBINC@ @EVENTINC@
@@ -29,8 +30,8 @@ crawl: $(OBJS)
$(CC) $(CFLAGS) $(INCS) -o $@ $(OBJS) $(LIBS)
install: all
- $(INSTALL_PROG) -m 755 crawl $(install_prefix)$(bindir)
- $(INSTALL_DATA) crawl.1 $(install_prefix)$(mandir)/man1
+ $(INSTALL_PROG) -m 755 crawl $(DESTDIR)$(bindir)
+ $(INSTALL_DATA) crawl.1 $(DESTDIR)$(mandir)/man1
clean:
rm -f crawl *~ *.core *.db $(OBJS)
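
The patch above replaces the non-standard install_prefix variable with
DESTDIR, so the install target can stage files into a temporary root rather
than the live system. Roughly what the ports fake stage then does (a sketch;
WRKINST is the staging directory chosen by bsd.port.mk):

  # stage the program and manual page under WRKINST; the binary
  # package is built from this staged tree, not from / itself
  $ make install DESTDIR=${WRKINST}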

(new file, 11 lines)

@@ -0,0 +1,11 @@
--- configure.orig Thu Jul 19 12:21:50 2001
+++ configure Thu Jul 19 12:22:01 2001
@@ -9,7 +9,7 @@
# Defaults:
ac_help=
-ac_default_prefix=/usr/local
+ac_default_prefix=${LOCALBASE}
# Any additions from configure.in:
ac_help="$ac_help
--with-libevent=DIR use libevent build directory"

net/crawl/pkg/DESCR (new file, 21 lines)

@@ -0,0 +1,21 @@
The crawl utility starts a depth-first traversal of the web at the
specified URLs.  It stores all JPEG images that match the configured
constraints. Crawl is fairly fast and allows for graceful termination.
After terminating crawl, it is possible to restart it at exactly the
same spot where it was terminated. Crawl keeps a persistent database
that allows multiple crawls without revisiting sites.

The main reason for writing crawl was the lack of simple open source
web crawlers. Crawl is only a few thousand lines of code and fairly
easy to debug and customize.

Features

+ Saves encountered JPEG images
+ Image selection based on regular expressions and size constraints
+ Resume previous crawl after graceful termination
+ Persistent database of visited URLs
+ Very small and efficient code
+ Supports robots.txt

WWW: ${HOMEPAGE}

net/crawl/pkg/PLIST (new file, 3 lines)

@@ -0,0 +1,3 @@
@comment $OpenBSD: PLIST,v 1.1.1.1 2001/09/09 21:57:12 obecian Exp $
bin/crawl
man/man1/crawl.1