Anemone

An easy-to-use Ruby web spider framework

A Domain-Specific Language for Ruby Web Spiders

The core concept of Anemone is that it should be simple for Ruby developers to create programs to spider the web; they shouldn't have to take care of any of the dirty work, like extracting links, queueing pages to spider, checking whether a page has been visited before, normalizing URLs, or following redirects.

Anemone provides a DSL for specifying what to do with the pages of a site, and takes care of all the spidering for you.

The main verbs you can use are shown in the sketch below and demonstrated in the sections that follow.
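
As a rough sketch of several of these verbs used together (the method names come from Anemone's public API; the URL, patterns, and output are placeholders):

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Skip links whose URLs match any of these patterns
  anemone.skip_links_like(/\.pdf$/, /\/login/)

  # Run a block on every page that gets crawled
  anemone.on_every_page { |page| puts page.url }

  # Run a block only on pages whose URLs match a pattern
  anemone.on_pages_like(/\/articles\//) { |page| puts "article: #{page.url}" }

  # Choose which of a page's links get followed
  anemone.focus_crawl { |page| page.links.first(10) }

  # Run a block once the entire crawl has finished
  anemone.after_crawl { puts "Crawl finished" }
end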

How Anemone Crawls a Website

Provide Anemone with one or more URLs within a domain to start at, and it will visit every page it can find within that domain. It finds new pages by looking for <a> HTML tags and extracting the URL contained in the HREF attribute. Anemone will only crawl pages from the same domain as the start URL, and can optionally obey the Robots Exclusion Protocol (i.e. robots.txt).

Anemone is multi-threaded, and by default will spawn 4 threads which each run an instance of the Tentacle class. The Core of Anemone runs in its own thread, and inserts newly discovered links into a shared queue. Each Tentacle pulls a URL out of the queue, fetches the page using Net::HTTP, finds all the links within the page, and sends the result back to the Core.
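
Both of these behaviors can be tuned through the options hash passed to Anemone.crawl. A minimal sketch, assuming the :obey_robots_txt and :threads options (the start URLs are placeholders):

# Start from two URLs on the same domain, obey robots.txt,
# and use 8 Tentacle threads instead of the default 4
urls = ["http://www.example.com/", "http://www.example.com/blog/"]
Anemone.crawl(urls, :obey_robots_txt => true, :threads => 8) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end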

Interacting with Pages

Any URL that Anemone finds in an <a> tag's HREF attribute is considered a page and is represented by an instance of the Page class. A page might not be an HTML document; it could be an RSS feed, an image, etc.

Each Page object stores several pieces of information about the page it represents, including its URL, the HTTP response headers and status code, the response body, a parsed Nokogiri document (doc), and the links found on the page (links).

For your convenience, each Page also has a data attribute, which is an OpenStruct for storing any data you wish to associate with a page, such as its title or meta description.
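
For instance, a sketch that records each page's <title> on its data attribute (the title extraction assumes HTML pages parsed into page.doc):

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    # page.doc is only set for HTML pages; store the <title> text on page.data
    title = page.doc.at('title') if page.doc
    page.data.title = title.inner_html if title
  end
end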

Here is an example of using Anemone to print a sorted list of all page titles on a site:

Anemone.crawl("http://www.example.com/") do |anemone|
  titles = []
  # Collect the <title> of every page; the rescue covers non-HTML pages and missing titles
  anemone.on_every_page { |page| titles.push page.doc.at('title').inner_html rescue nil }
  # Once the crawl is done, print the titles in alphabetical order
  anemone.after_crawl { puts titles.compact.sort }
end

Page Storage Engines

By default Anemone stores all the pages it crawls in an in-memory hash. This is fine for small sites, and is very fast. How many pages it can handle will depend on the amount of RAM in your machine, but after a certain point you will want to start storing the pages on disk during the crawl.

There are several options you can use for persistent disk storage of pages during an Anemone crawl, including TokyoCabinet, MongoDB, and Redis.

To use one of these storage engines, first make sure you have the corresponding gem installed (i.e. redis, mongo, or tokyocabinet) as well as the library or database server you want to use. Then, when you start a crawl, set the storage option like so:

Anemone.crawl("http://www.example.com/") do |anemone|
  # Store crawled pages in a local MongoDB server instead of the in-memory hash
  anemone.storage = Anemone::Storage.MongoDB
end

Each storage engine has different options you can set, but by default they will connect to a local server or create a default data file.
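
For example, assuming the TokyoCabinet engine accepts a data file path (check Anemone::Storage in your installed version for the exact signatures), a crawl could be pointed at a specific file:

Anemone.crawl("http://www.example.com/") do |anemone|
  # 'example_crawl.tch' is a hypothetical name for the TokyoCabinet data file
  anemone.storage = Anemone::Storage.TokyoCabinet('example_crawl.tch')
end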

Note: Every storage engine will clear out existing Anemone data before beginning a new crawl.

Focusing the Crawl

If you need fine-grained control over which links Anemone follows on each page, look no further than the focus_crawl method. Simply pass this method a block that selects the links to follow for a given page. For example, to follow only the first 100 links on each page:

Anemone.crawl("http://www.example.com/") do |anemone|
  # Follow at most the first 100 links found on each page
  anemone.focus_crawl { |page| page.links.first(100) }
end
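
The block can apply whatever selection logic you need. For instance, a sketch that follows only links pointing into a hypothetical /blog/ section of the site:

Anemone.crawl("http://www.example.com/") do |anemone|
  # page.links is an array of URI objects; keep only those under /blog/
  anemone.focus_crawl do |page|
    page.links.select { |link| link.path.to_s.start_with?('/blog/') }
  end
end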