I’ve been building a project lately, working with Python and networking tools. My homemade crawler ran into a litany of problems: from fetching the page to parsing out the links, something kept breaking. Luckily, for my problem, as for many others that we try to tackle as software developers and aspiring developers, there’s a library for that.
I chose to forgo home-brewed solutions that don’t work in favor of something tried and true: Scrapy. It’s all over the internet, used by a ton of companies and a ton of other projects. It’s open source and has a great community surrounding it, which is always the sign of a great library. Google it and you’ll see many Stack Overflow questions and answers (including mine) pertaining to Scrapy.
The goal was simple: take an input and run it through a crawler script. I chose to work with individual crawl scripts instead of full-blown crawlers. For my situation that was just far easier to handle, as it is going to be integrated into a larger tool set. Below you can see my code:
# Python 2 with an older Scrapy release (scrapy.contrib and SgmlLinkExtractor are deprecated)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class InputSpider(CrawlSpider):
    name = "Input"

    def __init__(self, url=None, *args, **kwargs):
        super(InputSpider, self).__init__(*args, **kwargs)
        # url comes from the -a url=... argument on the command line
        self.allowed_domains = [url]
        self.start_urls = ["http://" + url]

    # Follow every link on every page and hand each response to parse_item
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        # Append the URL of each crawled page to the output file
        filename = "output.txt"
        with open(filename, 'ab') as f:
            f.write(response.url + "\n")
        print "Crawling site..."
        print "See output.txt for links"
I run the spider using the following command:
scrapy runspider -a url=quotes.toscrape.com crawler.py
Crawling is pretty simple: you create the crawling class by subclassing CrawlSpider. I wanted to set it up so that users can put in the URL they want to crawl. To do that we had to write our own __init__ function that accepts a url keyword argument alongside the usual args and kwargs; Scrapy passes every -a name=value option from the terminal to the spider’s __init__ as a keyword argument. From there we build start_urls from the url argument. For those who have never used Scrapy, start_urls is where the spider starts crawling. You might also have noticed that we set allowed_domains to the same value. We do this so that the only URLs that are crawled are the URLs from the indicated domain.
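To make the allowed_domains idea a bit more concrete, here is a rough sketch of the kind of check it implies, using only the standard library. This is not Scrapy’s actual code; build_start_url and is_allowed are hypothetical helper names I made up for illustration, and the subdomain handling is my approximation of how Scrapy treats allowed domains.

```python
from urllib.parse import urlparse


def build_start_url(domain):
    # Mirrors the spider's __init__: prefix the bare domain with a scheme.
    return "http://" + domain


def is_allowed(url, allowed_domains):
    # Approximation of the offsite check: a URL passes if its host is one of
    # the allowed domains or a subdomain of one of them.
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["quotes.toscrape.com"]
print(build_start_url("quotes.toscrape.com"))
print(is_allowed("http://quotes.toscrape.com/page/2/", allowed))
print(is_allowed("https://www.goodreads.com/quotes", allowed))
```

With a check like this in place, a link that points off the site is simply never requested, which is what keeps the crawl from wandering across the whole internet.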
We set the rules for our crawler so that we can extract our URLs and put them somewhere. Here I am using the SgmlLinkExtractor. This is something of note, as it is deprecated in the current version of Scrapy. I use it because I already knew it and it works for what I need. In the future I’ll be switching it out for the LxmlLinkExtractor. The reason it’s been phased out is that the Sgml extractor uses sgmllib.SGMLParser to do its work. That library has been deprecated since Python 2.6 and does not exist in Python 3. So if you want to follow this example, please look into LxmlLinkExtractor.
That said, this rule actually does something cool for us. If you noticed, one of the arguments is follow=True. It tells the spider to keep extracting links from every page it reaches, so it follows them to the next page and crawls the links there too. That makes it easy to crawl through the whole site. We also have callback='parse_item' set, which sends each crawled page’s response to our parse function towards the bottom, where the page’s URL is recorded in our output file.
That’s it. It’s dead simple. I just wanted something easy that I could integrate into my tool set, and we accomplished this in under thirty lines. I suggest you read through the Scrapy docs because they are useful. Thanks for reading!
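If follow=True feels a bit magical, the toy sketch below shows the idea on a made-up, in-memory site with no network involved. The crawl function and the site dictionary are hypothetical and only model the behavior loosely: with follow on, links keep being extracted from every page reached; with it off, only the links on the start page are visited.

```python
from collections import deque


def crawl(start, links_on_page, follow=True):
    # links_on_page maps a page to the links found on it (a stand-in for a
    # real link extractor). Returns pages in the order the callback sees them.
    seen = {start}
    queue = deque([(start, True)])  # (page, whether to extract its links)
    recorded = []
    while queue:
        page, extract = queue.popleft()
        if not extract:
            continue
        for link in links_on_page.get(page, []):
            if link in seen:
                continue  # skip pages that were already scheduled
            seen.add(link)
            recorded.append(link)  # stands in for callback='parse_item'
            queue.append((link, follow))
    return recorded


site = {
    "/": ["/page/2", "/about"],
    "/page/2": ["/page/3", "/"],
    "/page/3": [],
}
print(crawl("/", site, follow=True))   # ['/page/2', '/about', '/page/3']
print(crawl("/", site, follow=False))  # ['/page/2', '/about']
```

The only difference between the two runs is whether pages found through a link get their own links extracted, which is why the second run never reaches /page/3.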