Web crawling

Description

Since text data from the World Wide Web are the source for our corpora, Web pages need to be downloaded first.

For crawling, we use Heritrix, the web crawler developed by the Internet Archive.

Earlier, we used generic web crawlers such as HTTrack. In both cases, all that is needed as a starting point is a list of the domains to be crawled.

Input

A list of domains in a plain text file.
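As an illustration of how such a domain list might be prepared before it is handed to a crawler, the following is a minimal Python sketch. It assumes one domain per line, treats lines starting with "#" as comments, and defaults to the "http" scheme when none is given; the function name load_seed_urls and the file name domains.txt are hypothetical and not part of Heritrix or HTTrack.

```python
from urllib.parse import urlsplit


def load_seed_urls(lines, scheme="http"):
    """Turn raw lines from a domain list into de-duplicated seed URLs.

    Assumptions (not from the original text): one domain per line,
    '#' starts a comment line, and 'http' is used when no scheme is given.
    """
    seeds = []
    seen = set()
    for raw in lines:
        entry = raw.strip()
        # Skip blank lines and comment lines.
        if not entry or entry.startswith("#"):
            continue
        # Add a scheme so that urlsplit can find the host part.
        if "://" not in entry:
            entry = f"{scheme}://{entry}"
        parts = urlsplit(entry)
        host = parts.netloc.lower()
        # Skip entries without a host and hosts already seen.
        if not host or host in seen:
            continue
        seen.add(host)
        # Keep only scheme and host; the crawl starts at the site root.
        seeds.append(f"{parts.scheme}://{host}/")
    return seeds
```

A possible use would be to open the list with open("domains.txt", encoding="utf-8") and pass the file object to load_seed_urls, then paste the resulting URLs into the crawler's seed configuration.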

Output

Heritrix: WARC files containing HTML files and metadata.

HTTrack: HTML files of the specified domains in local folders.
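To check what an HTTrack run has produced, the local folders can be inspected with a few lines of Python. The sketch below assumes that the mirror directory contains one subfolder per crawled host and that HTML pages end in ".html" or ".htm"; the function name count_html_files is hypothetical, and the folder layout should be verified against your own HTTrack output.

```python
from pathlib import Path

HTML_SUFFIXES = {".html", ".htm"}


def count_html_files(mirror_root):
    """Count HTML files per top-level folder of a local mirror.

    Assumption: each top-level folder corresponds to one crawled host.
    Folders without any HTML files are left out of the result.
    """
    counts = {}
    for host_dir in sorted(Path(mirror_root).iterdir()):
        if not host_dir.is_dir():
            continue
        # Walk the folder recursively and count files with an HTML suffix.
        n = sum(
            1
            for path in host_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in HTML_SUFFIXES
        )
        if n:
            counts[host_dir.name] = n
    return counts
```

The returned dictionary maps folder names to file counts, which gives a quick indication of whether every domain in the input list was actually downloaded.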

Parameters

For more information about Heritrix and its parameters, see the Heritrix documentation.

To learn how to run HTTrack, see the official website of the HTTrack project.

Download

HTTrack: http://www.httrack.com/

Heritrix: http://crawler.archive.org/downloads.html