
Crawling Simply with Scrapy

I’ve been building a project lately, working with Python and networking tools, and my homemade crawler ran into a litany of problems: from fetching the page to parsing out the links, I kept hitting walls. Luckily for me, and for many of the problems we tackle as software developers and aspiring developers, there’s a library for that.

I chose to forgo homebrew solutions that don’t work in favor of something tried and true: Scrapy. It’s all over the internet, used by a ton of companies and a ton of other projects. That’s always the sign of a great library: it’s open source and has a great community surrounding it. Google it and you’ll see plenty of Stack Overflow questions and answers (including mine) pertaining to Scrapy.

The goal was simple: take an input and run it through a crawler script. I chose to work with individual crawl scripts instead of a full-blown crawler project; for my situation it was just far easier to handle, as it is going to be integrated into a larger tool set. Below you can see my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class InputSpider(CrawlSpider):
    name = "Input"

    def __init__(self, url=None, *args, **kwargs):
        super(InputSpider, self).__init__(*args, **kwargs)
        # The -a url=... argument from the command line arrives here.
        self.allowed_domains = [url]
        self.start_urls = ["http://" + url]

    # Follow every link found on a page and hand each response to parse_item.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        # Append the URL of each crawled page to the output file.
        with open("output.txt", 'a') as f:
            f.write(response.url + "\n")


print "Crawling site..."
print "See output.txt for links"

I run the spider using the following command:


scrapy runspider -a url=quotes.toscrape.com crawler.py

Crawling is pretty simple: you create the crawler class by subclassing CrawlSpider. I wanted to set it up so that users can pass in the URL they want to crawl. To do that we had to write our own __init__ function and set up the args and kwargs to look for the url argument we supply when we run the script from the terminal. From there we set start_urls to that argument. For those who have never used Scrapy, start_urls is where the spider starts crawling. You might also have noticed that we set allowed_domains to the same thing. We do this so that the only URLs crawled are URLs from the indicated domain.
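
Just to illustrate what that constructor ends up doing (this snippet isn’t part of the spider file, it’s only a sketch of the same assignments):

spider = InputSpider(url="quotes.toscrape.com")
print spider.allowed_domains  # ['quotes.toscrape.com']
print spider.start_urls       # ['http://quotes.toscrape.com']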

 

We set the rules for our crawler so that we can extract our URLs and put them somewhere. Here I am using the SgmlLinkExtractor. This is something of note, as it is deprecated in the current version of Scrapy. I use it because I kind of knew it and it works for what I need. In the future I’ll be switching this out for the LxmlLinkExtractor. The reason Sgml has been phased out is that it uses sgmllib.SGMLParser to do its work, and that library is deprecated as of Python 2.6 and is incompatible with Python 3. So if you want to follow this example, please look into LxmlLinkExtractor.
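
For reference, here is a rough sketch of the same rule on a newer Scrapy (1.0 or later), where the lxml-based extractor is exposed as LinkExtractor; I haven’t moved my own script over yet, so treat this as a starting point rather than tested code:

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = [
    Rule(LinkExtractor(allow=()), follow=True, callback='parse_item')
]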

 

That said, this rule actually does something cool for us. If you noticed, one of the arguments is follow=True; that is used to follow all the links we crawl to the next page and crawl the links on those pages as well, which makes it easy to crawl through a whole site. We also have callback='parse_item' set, which uses our parse_item function towards the bottom to record the URL of each crawled response in our output file.
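
As an aside, the empty allow=() matches every link. If you only wanted to follow a subset, you could pass a regular expression pattern instead; the pattern below is just a hypothetical example:

Rule(SgmlLinkExtractor(allow=('/page/', )), follow=True, callback='parse_item')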

 

That’s it; it’s dead simple. I just wanted something easy that I could integrate into my tool set, and we accomplished it in under thirty lines. I suggest you read through the Scrapy docs, because they are useful. Thanks for reading!


Balancing Your Way Through a DDoS Attack

Last time, after I screwed up my DevOps interview, we talked in depth about using load balancers to distribute traffic to a web app. This time I would like to look a little deeper at some of the issues surrounding load balancers, specifically how we can use them to mitigate DDoS attacks. There are many ways to breach a web app and cause havoc. One of the more popular ways to make web admins’ lives harder is the distributed denial-of-service attack. We are going to explore some of the finer points of conducting one of these attacks, and then we will look at the effects of such attacks on our differently configured load balancers.

How Not to Screw Up Your DevOps Interview

I screwed up a tiny bit. I went into an interview unprepared. What’s worse is that I was unprepared for the part I should have known all about: the technical portion.


Networking in Python: Hexadecimal to Decimal Converter in Python

This is the first in a series about networking, using Python and Python libraries to accomplish a myriad of networking goals. I’m currently working on building some networking tools, so I want to share some of the tools I’m making.

My First Tango with Blender and 3D Animation

Creating cool animations using Blender

I recently had a cool IRC discussion with someone about developing games using Unity. I lamented that I have no experience in making my own 3D assets, to which he told me that he knows how to create all of those. I was dazzled. Since I had read about how companies like Disney put together whole data centers to make their movies, I had always thought you would need some expensive rig, or a particular set of skills that would make me a nightmare for certain people. He suggested I check out a few of the tools that are used for making 3D models. Specifically, I zeroed in on the Blender Project, an open-source 3D animation program.


Testing in Python Part II: PyTest

After a shallow dive into the world of unit testing in Python, I took to the internet comment sections for guidance. I know what you’re thinking: it’s an exercise in poor judgment, keyboard-warrior syndrome, and people telling me to go fuck myself. However, every once in a while you come across someone who knows what they are talking about, and you can get some pretty cool information from them. In this exercise someone suggested (on top of definitely upgrading to Python 3) that I switch to Pytest for writing my test scripts.
