Crawling Simply with Scrapy

I’ve been building a project lately, working with Python and networking tools. My homemade crawler ran into a litany of issues: from fetching the page to parsing out the links, something kept breaking. Luckily, for my problem and for many others we tackle as software developers and aspiring developers, there’s a library for that.


I chose to forgo homebrew solutions that don’t work in favor of something tried and true: Scrapy. It’s all over the internet, used by a ton of companies and a ton of other projects. It’s open source and has a great community surrounding it, which is always the sign of a great library. Google it and you’ll see many Stack Overflow questions and answers (including mine) pertaining to Scrapy.


The goal was simple: take an input and run it through a crawler script. I chose to work with individual crawl scripts instead of full-blown crawlers. For my situation that was far easier to handle, as it is going to be integrated into a larger tool set. Below you can see my code:

# Note: SgmlLinkExtractor is deprecated and only runs on Python 2.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class InputSpider(CrawlSpider):
    name = "Input"

    # Follow every link the extractor finds and hand each page to parse_item.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item'),
    ]

    def __init__(self, url=None, *args, **kwargs):
        super(InputSpider, self).__init__(*args, **kwargs)
        # url comes from the command line: -a url=quotes.toscrape.com
        self.allowed_domains = [url]
        self.start_urls = ["http://" + url]

    def parse_item(self, response):
        # Append the URL of each crawled page to the output file.
        with open("output.txt", "a") as f:
            f.write(response.url + "\n")


print("Crawling site...")
print("See output.txt for links")

I run the spider using the following command:

scrapy runspider -a url=quotes.toscrape.com crawler.py

Crawling is pretty simple: you create your spider class by subclassing CrawlSpider. I wanted to set it up so that users can put in the URL they want to crawl. To do that we define our own __init__ method that accepts a url keyword argument, which Scrapy passes in from the -a url=... option when we run the script from the terminal. From there we set start_urls from the url argument. For those who have never used Scrapy, start_urls is where the spider starts crawling. You might also have noticed that we set allowed_domains to the same value. We do this so that the only URLs crawled are the ones from the indicated domain.
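One wrinkle with this setup is that the script expects a bare domain, so passing a full URL would produce a broken start URL. Here is a minimal sketch of a helper that accepts either form, using only the standard library. The name split_target is hypothetical and not part of the original script.

```python
from urllib.parse import urlparse


def split_target(url):
    """Return (domain, start_url) for a bare domain or a full URL."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    parsed = urlparse(url)
    # hostname is the lowercased host without any port, suitable for allowed_domains.
    return parsed.hostname, url


print(split_target("quotes.toscrape.com"))
print(split_target("https://quotes.toscrape.com/page/2/"))
```

Inside __init__, the two return values could then be assigned to allowed_domains and start_urls respectively.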


We set the rules for our crawler so that we can extract our URLs and put them somewhere. Here I am using the SgmlLinkExtractor. This is worth noting, as it is deprecated in the current version of Scrapy. I use it because I already knew it and it works for what I need. In the future I’ll be switching it out for the LxmlLinkExtractor. The reason it’s been phased out is that the Sgml extractor relies on sgmllib.SGMLParser to do its work. That library was deprecated as of Python 2.6 and removed in Python 3. So if you want to follow this example, please look into LxmlLinkExtractor.


That said, this rule actually does something cool for us. You may have noticed that one of the arguments is follow=True. That tells the spider to follow the links it extracts to the next page and then crawl the links on those pages too, which makes it easy to crawl through the whole site. We also set callback='parse_item', so each crawled page is passed to our parse_item method towards the bottom, which records the page’s URL in our output file.
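To make the idea behind follow=True concrete, here is a toy sketch over a hypothetical in-memory site. It is not how Scrapy schedules requests, and it glosses over details such as request ordering and how the start page is handled. It only shows the general pattern: visit a page, record it, then queue up any same-site links that have not been seen yet.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": [],
}


def crawl(start, allowed_prefix):
    """Visit every reachable page under allowed_prefix once, in discovery order."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)  # stands in for the parse_item callback
        for link in SITE.get(page, []):
            # Skip off-site links and pages that are already queued or visited.
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("http://example.com/", "http://example.com/"))
```

The off-site link to other.org is never visited, which is roughly the role allowed_domains plays in the spider.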


That’s it, it’s dead simple. I just wanted something easy that I could integrate into my tool set. We accomplished this in under thirty lines. I suggest you read through the Scrapy docs because they are useful. Thanks for reading!

We’ve Decided to go With Someone With More Experience

I’m sure anyone who’s ever looked for a job has been through this. You go through your day and then you get the email from the company you interviewed with recently. Your heart stops and, depending on how much you wanted this job, that moment before opening the email seems to last for days. As the panic starts to set in, you open the email and the only line that you really read says “thank you for your interest, however we’ve decided to go with someone with more experience.” You take a breath and reread the email, this time grasping the full content of the message. Unfortunately, the damage to your morale is done.


Now, this is all much less dramatic than described above, but that small twinge in the stomach is still present no matter how many times you’ve been through it. Coming from a self-taught background, this may be something you hear often. Even coming from a traditional four-year college background you’ll run into this. Oftentimes it conjures up the scene from Batman Begins when Bruce Wayne, played by Christian Bale, gets his mansion burned down. As Bruce lies there forlorn with his spirit broken, Alfred, played by the great Michael Caine, repeats something Thomas Wayne always told Bruce: “Why do we fall? So we can learn to pick ourselves back up.” My life isn’t nearly as interesting or heroic, nor is being turned down for a job; as stated above, it happens often. However, the message is just as relevant, especially when it comes to trying to break into software development.


Unfortunately, the usual idea of a software guru conjures up images of geniuses who have been coding since they were four years old, have the etiquette of an annoyed Vulcan, and are part cyborg. Those people exist, and I don’t begrudge them the hard work behind that “genius” moniker. By the time they reach working age they often have a huge portfolio of work they can show to potential employers. The rest of us, who pivot at some point in our lives and decide to go into technology, have a different experience.


When you have no professional experience it’s common to be overlooked for a job. There are ways in which we can distinguish ourselves for our next interview. A quick Google search of “how to get experience in software development” will yield plenty of threads explaining how to overcome that obstacle. I am completely in favor of the approaches that search turns up. It’s something I have been doing for some time. Recently my personal projects have been ramping up and I have been helping to create networking tools during my free time. As a result my coding skills have improved and I’m finding myself more and more comfortable using new tools and new languages.


I love doing this. I love working with people of a similar mind and working with code. This is how you keep going. You focus on your projects and on your product: you. You are your greatest product, and investing in it is exactly what you should be doing with your personal projects. That means branching out, trying new languages, trying new approaches, building something you would love to use, or maybe writing.


On the subject of applying for jobs: the best advice I’ve seen came from Eli The Computer Guy. You should apply for every job you want. That’s what I’ve been doing. Every two weeks or so I go through a curated list of job postings, look for jobs I might be even a little qualified for, and apply. I target companies I’ve read reviews about or whose products I use personally. I also expand my search into different cities around the country I live in. Simply put, sometimes you have to go where the jobs are. Dragnetting, that is, applying for every job out there, is not something I would suggest anyone do, as you end up interviewing at ten places you really didn’t want to work for and fending off tenacious recruiters.


This article says nothing new. You work hard, and sometimes that hard work pays off in the form of breaking into an industry you weren’t previously a part of. Sometimes it doesn’t. That experience is a rallying cry to keep going. It shouldn’t even be called a failure, as oftentimes not enough things aligned for you to be selected for that job. You keep going, you keep learning; persistence is the enemy of failure.


