Anemone
An easy-to-use Ruby web spider framework
Contents
- A DSL for Ruby Web Spiders
- How Anemone Crawls a Website
- Interacting with Pages
- Page Storage Engines
- Focusing the Crawl
A Domain-Specific Language for Ruby Web Spiders
The core concept of Anemone is that it should be simple for Ruby developers to write programs that spider the web. They shouldn't have to handle the dirty work themselves: extracting links, queueing pages to crawl, checking whether a page has already been visited, normalizing URLs, or following redirects.
Anemone provides a DSL for specifying what to do with the pages of a site, and takes care of all the spidering for you.
Here are the verbs you can use, and what they do:
- after_crawl - run a block on the PageHash (a data-structure of all the crawled pages) after the crawl is finished
- focus_crawl - use a block to select which links to follow on each page (more information below)
- on_every_page - run a block on each page as it is encountered
- on_pages_like - given one or more RegEx patterns, run a block on every page with a matching URL
- skip_links_like - given one or more RegEx patterns, do not follow any link that matches
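The pattern-based verbs above accept one or more regular expressions. As a rough illustration of what "a matching URL" means, here is a minimal sketch in plain Ruby. The helper matches_any? and the sample patterns are hypothetical and are not part of Anemone; the sketch only shows how several patterns can be tested against a URL at once.

```ruby
# Hypothetical helper (not part of Anemone): true if the URL matches
# at least one of the given patterns.
def matches_any?(url, *patterns)
  # Regexp.union combines the patterns and keeps each one's own flags.
  Regexp.union(patterns.flatten).match?(url.to_s)
end

# Example patterns of the kind one might hand to skip_links_like.
skip_patterns = [/\.pdf\z/i, %r{/private/}]

urls = [
  "http://www.example.com/about",
  "http://www.example.com/files/report.PDF",
  "http://www.example.com/private/admin"
]

# Keep only the URLs that no skip pattern matches.
followed = urls.reject { |url| matches_any?(url, skip_patterns) }
puts followed
```

Here only the /about URL survives, because the other two each match one of the skip patterns.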
How Anemone Crawls a Website
Provide Anemone with one or more URLs within a domain to start at, and it will visit every page it can find within that domain. It finds new pages by looking for <a> HTML tags and extracting the URL contained in the HREF attribute. Anemone will only crawl pages from the same domain as the start URL, and can optionally obey the Robots Exclusion Protocol (i.e. robots.txt).
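The link discovery described above can be approximated in a few lines of standard-library Ruby. This is only a sketch under stated assumptions: the helper names absolutize and same_host_links are hypothetical, a regular expression stands in for a real HTML parser, and "same domain" is simplified to an exact host comparison, which may differ from what Anemone actually does.

```ruby
require "uri"

# Hypothetical helper: resolve an href against the page URL, or return nil
# if the href cannot be parsed as a URI.
def absolutize(base, href)
  URI.join(base, href)
rescue URI::Error
  nil
end

# Hypothetical helper: absolute URLs of all <a href> links on the same host.
# The regex is a simplification; a real crawler would use an HTML parser.
def same_host_links(html, page_url)
  base = URI(page_url)
  hrefs = html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/i).flatten

  links = hrefs.map { |href| absolutize(base, href) }.compact
  links = links.select { |uri| uri.host == base.host }

  # Drop fragments so /page and /page#top count as the same page.
  links.each { |uri| uri.fragment = nil }
  links.map(&:to_s).uniq
end

html = <<~HTML
  <a href="/about">About</a>
  <a href="http://other.example.org/">Elsewhere</a>
  <a href="contact.html#form">Contact</a>
  <a href="mailto:me@example.com">Mail</a>
HTML

puts same_host_links(html, "http://www.example.com/docs/index.html")
```

The off-site link and the mailto: link are dropped, and the two remaining links are resolved to absolute URLs on www.example.com.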
Anemone is multi-threaded, and by default will spawn 4 threads which each run an instance of the Tentacle class. The Core of Anemone runs in its own thread, and inserts newly discovered links into a shared queue. Each Tentacle pulls a URL out of the queue, fetches the page using Net::HTTP, finds all the links within the page, and sends the result back to the Core.
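To make the queue-based design above more concrete, here is a toy model of it using Ruby's built-in Thread and Queue classes. It is not Anemone's implementation: FAKE_SITE replaces real HTTP fetches, toy_crawl and the :stop sentinel are hypothetical names, and the coordinating logic runs in the calling thread rather than in a thread of its own.

```ruby
# A made-up four-page site: each path maps to the links found on that page.
FAKE_SITE = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => []
}.freeze

def toy_crawl(start, workers: 4)
  link_queue = Queue.new # coordinator -> workers: URLs to fetch
  page_queue = Queue.new # workers -> coordinator: [url, links] results

  # Each worker pulls a URL, "fetches" it, and reports the links it found.
  threads = Array.new(workers) do
    Thread.new do
      while (url = link_queue.pop) != :stop
        page_queue << [url, FAKE_SITE.fetch(url, [])]
      end
    end
  end

  visited = { start => true }
  pages = {}
  link_queue << start
  pending = 1 # URLs queued but not yet reported back

  while pending > 0
    url, links = page_queue.pop
    pending -= 1
    pages[url] = links

    # Queue every link that has not been seen before.
    links.each do |link|
      next if visited[link]
      visited[link] = true
      link_queue << link
      pending += 1
    end
  end

  # Tell every worker to exit, then wait for them.
  workers.times { link_queue << :stop }
  threads.each(&:join)
  pages
end

p toy_crawl("/").keys.sort
```

Only the coordinator touches the visited set, so no locking is needed beyond what Queue already provides.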
Interacting with Pages
Any URL that Anemone finds in an <a> tag's HREF attribute is considered a page, and represented by an instance of the Page class. A page might not be an HTML document; it could be an RSS feed, an image, etc.
Each Page object stores several pieces of information:
- url - The URL of the page
- aliases - Other URLs that redirected to this page, or the Page that this one redirects to
- headers - The full HTTP response headers
- code - The HTTP response code (e.g. 200, 301, 404)
- body - The raw HTTP response body
- doc - A Nokogiri::HTML::Document of the page body (if applicable)
- links - An Array of all the URLs found on the page that point to the same domain
For your convenience, each Page also has a data attribute, which is an OpenStruct for storing any data you wish to associate with a Page, such as title, meta-description, etc.
Here is an example of using Anemone to print a sorted list of all page titles on a site:
Anemone.crawl("http://www.example.com/") do |anemone|
  titles = []
  # Pages with no <title>, or with no parsed document, are skipped by the rescue
  anemone.on_every_page { |page| titles.push page.doc.at('title').inner_html rescue nil }
  anemone.after_crawl { puts titles.compact.sort }
end
Page Storage Engines
By default Anemone stores all the pages it crawls in an in-memory hash. This is fine for small sites, and is very fast. How many pages it can handle will depend on the amount of RAM in your machine, but after a certain point you will want to start storing the pages on disk during the crawl.
There are several options you can use for persistent disk storage of pages during an Anemone crawl:
- Redis (>= 2.0.0)
- MongoDB
- TokyoCabinet
- PStore
To use one of these storage engines, first make sure you have the corresponding gem installed (i.e. redis, mongo, tokyocabinet) as well as the library or database server you want to use. Then when you start a crawl, set the storage option like so:
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
end
Each storage engine has different options you can set, but by default they will connect to a local server or create a default data file.
Note: Every storage engine will clear out existing Anemone data before beginning a new crawl.
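For a feel of what disk-backed page storage looks like, the sketch below uses Ruby's standard PStore library directly. It does not go through Anemone's storage adapters: the helpers store_pages and load_page and the record layout (a hash with code and title) are assumptions made for illustration only.

```ruby
require "pstore"
require "tmpdir"

# Hypothetical helper: write a { url => record } hash to a PStore file.
def store_pages(path, pages)
  store = PStore.new(path)
  store.transaction do
    pages.each { |url, record| store[url] = record }
  end
  store
end

# Hypothetical helper: read one record back in a read-only transaction.
def load_page(store, url)
  store.transaction(true) { store[url] }
end

# Use a temporary directory so the example leaves no files behind.
Dir.mktmpdir do |dir|
  pages = { "http://www.example.com/" => { code: 200, title: "Home" } }
  store = store_pages(File.join(dir, "pages.pstore"), pages)
  p load_page(store, "http://www.example.com/")
end
```

Because the records live in a file rather than in a Ruby hash, the number of pages is limited by disk space instead of RAM, at the cost of slower reads and writes.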
Focusing the Crawl
If you need fine-grained control over which links Anemone follows on each page, use the focus_crawl method. Pass it a block that selects the links to follow for a given page. For example, to follow only the first 100 links on each page:
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.focus_crawl { |page| page.links.first(100) }
end
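The selection logic inside a focus_crawl block is ordinary Ruby operating on an array of links, so it can be tried out without running a crawl. The sketch below assumes the links are URI objects and uses a hypothetical helper, links_to_follow, to combine a path filter with a limit. It also shows why the choice of range matters when taking "the first 100": an inclusive range 0..100 covers 101 elements.

```ruby
require "uri"

# Hypothetical selector: the kind of logic a focus_crawl block could contain.
# Keeps links whose path starts with path_prefix, then takes at most `limit`.
def links_to_follow(links, limit: 100, path_prefix: "/")
  links.select { |uri| uri.path.start_with?(path_prefix) }.first(limit)
end

# 150 made-up links standing in for the links found on one page.
sample_links = (1..150).map do |i|
  section = i.even? ? "blog" : "shop"
  URI("http://www.example.com/#{section}/#{i}")
end

puts links_to_follow(sample_links).size                        # at most 100 links
puts links_to_follow(sample_links, path_prefix: "/blog/").size # only /blog/ links

# Inclusive vs. exclusive ranges when slicing:
puts sample_links.slice(0..100).size  # 101 elements (indexes 0 through 100)
puts sample_links.slice(0...100).size # 100 elements (indexes 0 through 99)
```

Whatever array the block returns is the set of links followed from that page, so filters and limits can be chained freely.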