An easy-to-use Ruby web spider framework
What is it?
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.
The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.
- 02/17/2011 - Version 0.6.0 released. Added support for proxies, HTTP Basic Auth, and HTTP read timeout. Fixed a bug with double-encoding links and with erroring on a read timeout.
- 09/01/2010 - GitHub Issue Tracker - The Anemone project issue tracker has moved from Lighthouse to GitHub Issues.
- 09/01/2010 - Version 0.5.0 released. Added Redis and MongoDB page storage engines, and skip_query_strings option.
Where do I get it?
$ gem install anemone
You can also browse the code on GitHub.
How do I use it?
You can use Anemone to write tasks to gather useful statistics on your websites. Just point Anemone at a URL, and it will crawl every page in that domain. You can also tell Anemone to skip pages that match certain regular expressions. Using blocks, you tell Anemone what code to run on every page, or after it's done crawling.
For example, to print the URL of every page on a site:
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
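The skipping and after-crawl hooks described above can be combined in the same way. The following is a minimal sketch, not an official recipe: the skip patterns, the helper names tally_status_codes and crawl_and_report, and the choice of statistic (a count of pages per HTTP status code) are assumptions made for illustration. It relies on Anemone's skip_links_like, on_every_page, and after_crawl methods and on the page's code attribute.

```ruby
# Hypothetical patterns for URLs the crawl should not follow.
# Adjust them for the site being crawled.
SKIP_PATTERNS = [%r{/login}, /\.pdf\z/i].freeze

# Count how often each HTTP status code occurs.
# Plain Ruby, so it can be used and tested without the gem.
def tally_status_codes(codes)
  counts = Hash.new(0)
  codes.each { |code| counts[code] += 1 }
  counts
end

# Crawl start_url, skip links matching SKIP_PATTERNS, and print
# a per-status-code page count once the crawl has finished.
def crawl_and_report(start_url)
  # Loaded lazily so the helper above works without the gem installed.
  require 'anemone'

  codes = []
  Anemone.crawl(start_url) do |anemone|
    # Links matching any of these patterns are not followed.
    anemone.skip_links_like(*SKIP_PATTERNS)

    # Runs once per fetched page; only the status code is kept.
    anemone.on_every_page { |page| codes << page.code }

    # Runs once, after the whole crawl is complete.
    anemone.after_crawl do |_pages|
      tally_status_codes(codes).each do |code, count|
        puts "#{code}: #{count}"
      end
    end
  end
end

# Only crawl when run as a script with a URL argument.
crawl_and_report(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Saved under a hypothetical name such as status_report.rb, this would be run as "ruby status_report.rb http://www.example.com/". Only the status codes are collected rather than whole page objects, which keeps memory use small on large sites.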
Anemone also comes with a command-line frontend for several web-spider tasks. Just run 'anemone' on the command line to see them. The source for these example programs is in the lib/anemone/cli directory of the project.
Who wrote it?
Anemone is free to use under the terms of the MIT License.