Crawling Simply with Scrapy

I’ve been building a project lately, working with Python and networking tools. My homemade crawler ran into a litany of issues, from fetching the page to parsing out the links. Luckily, for this and for many other problems we try to tackle as software developers and aspiring developers, there’s a library for that.

 

I chose to forgo homebrew solutions that don’t work in favor of something tried and true: Scrapy. It is all over the internet, used by a ton of companies and a ton of other projects. It’s open source and has a great community surrounding it, which is always the sign of a great library. Google it and you’ll see many Stack Overflow questions and answers (including mine) pertaining to Scrapy.

 

The goal was simple: take an input and run it through a crawler script. I chose to work with individual crawl scripts instead of full-blown crawlers. For my situation that was just far easier to handle, as the script is going to be integrated into a larger tool set. Below you can see my code:

# Python 2 only: SgmlLinkExtractor depends on sgmllib.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class InputSpider(CrawlSpider):
    name = "Input"

    # Follow every extracted link and hand each response to parse_item.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def __init__(self, url=None, *args, **kwargs):
        super(InputSpider, self).__init__(*args, **kwargs)
        # url comes from "-a url=..." on the command line.
        self.allowed_domains = [url]
        self.start_urls = ["http://" + url]

    def parse_item(self, response):
        # Append the URL of each crawled page to the output file.
        with open("output.txt", "a") as f:
            f.write(response.url + "\n")


print("Crawling site...")
print("See output.txt for links")

I run the spider using the following command:


scrapy runspider -a url=quotes.toscrape.com crawler.py

Crawling is pretty simple: you create the spider class by subclassing CrawlSpider. I wanted to set it up so that users can put in the URL they want to crawl. To do that we had to write our own __init__ method that accepts a url argument alongside the usual args and kwargs; Scrapy fills it in from the -a url=... option we pass when we run the script from the terminal. From there we set start_urls to the URL argument. For those who have never used Scrapy, start_urls is where the spider starts crawling. You might also have noticed that we set allowed_domains from the same argument. We do this so that the only URLs that are crawled are the URLs from the indicated domain.
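One fragile spot in this setup is that the spider assumes the url argument is a bare domain. Passing a full address would put a scheme inside allowed_domains and a doubled http:// in start_urls. Below is a minimal sketch of one way to guard against that, written for Python 3 with only the standard library. The helper name split_target is hypothetical and is not part of Scrapy.

```python
from urllib.parse import urlparse


def split_target(url):
    """Return (domain, start_url) for a bare domain or a full URL."""
    url = url.strip()
    # urlparse only fills in the host when a scheme is present, so add one
    # for inputs such as "quotes.toscrape.com".
    if "://" not in url:
        url = "http://" + url
    # hostname is the lower-cased host without port or credentials.
    domain = urlparse(url).hostname
    if not domain:
        raise ValueError("could not find a domain in %r" % url)
    return domain, url


print(split_target("quotes.toscrape.com"))
print(split_target("https://quotes.toscrape.com/page/2/"))
```

Inside __init__, the first value would feed allowed_domains and the second would feed start_urls.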

 

We set the rules for our crawler so that we can extract our URLs and put them somewhere. Here I am using the SgmlLinkExtractor. This is something of note, as it is deprecated in the current version of Scrapy. I use it because I kind of knew it and it works for what I need. In the future I’ll be switching it out for the LxmlLinkExtractor. The reason it’s been phased out is that it relies on sgmllib.SGMLParser to do its work. That library has been deprecated since Python 2.6 and does not exist in Python 3. So if you want to follow this example, please look into LxmlLinkExtractor.

 

That said, this rule actually does something cool for us. You may have noticed that one of the arguments is follow=True. It tells the spider to follow the links it extracts to the next page and then crawl the links on those pages too, which makes it easy to crawl through the whole site. We also set callback='parse_item', which calls our parse_item method towards the bottom on each crawled page and records its URL in our output file.
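To make the effect of follow=True concrete, here is a toy model in plain Python 3. It is not how Scrapy is implemented internally; the SITE dictionary and the crawl function are made up for illustration, standing in for real pages and for the link extractor.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/", "/contact"],
    "/contact": [],
}


def crawl(start, follow=True):
    """Return pages in the order they are fetched, beginning at start."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue:
        page = queue.popleft()
        fetched.append(page)
        # Without follow, only links found on the start page are fetched.
        if not follow and page != start:
            continue
        for link in SITE.get(page, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return fetched


print(crawl("/", follow=False))  # stops one hop from the start page
print(crawl("/", follow=True))   # reaches every page linked from the site
```

With follow=False the crawl stops at the pages linked from the start page; with follow=True it keeps going until there are no unseen links left, which is why one rule is enough to cover a whole site.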

 

That’s it, it’s dead simple. I just wanted something easy that I could integrate into my tool set. We accomplished this in under thirty lines. I suggest you read through the Scrapy docs because they are useful. Thanks for reading!

We’ve Decided to go With Someone With More Experience

I’m sure anyone who’s ever looked for a job has been through this. You go through your day and then you get the email from the company you interviewed with recently. Your heart stops and, depending on how much you wanted this job, that moment before opening the email seems to last for days. As that panic starts to set in, you open the email and the only line that you really read says “thank you for your interest, however we’ve decided to go with someone with more experience.” You take a breath and reread the email, grasping the full content of the message. Unfortunately, the damage to your morale is done.

 

Now, this is all much less dramatic than described above; however, that small twinge in the stomach is still present, no matter how many times you’ve been through it. Coming from a self-taught background, this may be something you hear often. Even coming from a traditional four-year college background you’ll run into this. Oftentimes it conjures up the scene from Batman Begins when Bruce Wayne, played by Christian Bale, gets his mansion burned down. As Bruce lies there forlorn with his spirit broken, Alfred, played by the great Michael Caine, repeats something Thomas Wayne always told Bruce: “Why do we fall? So we can learn to pick ourselves back up.” My life isn’t nearly as interesting or heroic, and neither is being turned down for a job; as stated above, it happens often. However, the message is just as relevant, especially when it comes to trying to break into software development.

 

Unfortunately, the usual idea of a software guru conjures up images of geniuses who have been coding since they were four years old, have the etiquette of an annoyed Vulcan, and are part cyborg. Those people exist, and I don’t begrudge them the hard work behind the moniker “genius.” By the time they reach working age they often have a huge portfolio of work they can show to potential employers. The rest of us, who pivot at some point in our lives and decide to go into technology, have a different experience.

 

When you have no professional experience it’s common to be overlooked for a job. There are ways in which we can distinguish ourselves for our next interview. A quick Google search of “how to get experience in software development” will yield plenty of threads explaining how to overcome that obstacle. I am completely in favor of the approaches that search turns up. It’s something I have been doing for some time. Recently my personal projects have been ramping up and I have been helping to create networking tools during my free time. As a result my coding skills have improved and I’m finding myself more and more comfortable using new tools and new languages.

 

I love doing this. I love working with people of a similar mind and working with code. This is how you keep going. You focus on your projects and on your product: you. You are your greatest product, and investing in it is exactly what you should be doing with your personal projects. That means branching out, trying new languages, trying new approaches, building something you would love to use, or maybe writing.

 

A note about applying for jobs: the best advice I’ve seen came from Eli The Computer Guy. You should apply for every job you want. That’s what I’ve been doing. Every two weeks or so I go through a curated list of job postings, look for jobs I might be even a little qualified for, and apply. I target companies whose reviews I’ve read or whose products I use personally, and apply for jobs that way. I also expand my search into different cities around different parts of the country I live in. Simply put, sometimes you have to go where the jobs are. Dragnetting and applying for every job out there is not something I would suggest anyone do, as you end up interviewing at ten places you really didn’t want to work for and fending off tenacious recruiters.

 

This article says nothing new. You work hard and sometimes that hard work pays off in the form of breaking into an industry you weren’t previously a part of. Sometimes it doesn’t. That experience is a rallying cry to keep going. It shouldn’t even be called a failure, as oftentimes not enough things aligned for you to be selected for that job. You keep going, you keep learning. Persistence is the enemy of failure.

Balancing Your Way Through a DDoS Attack

Last time, after I screwed up my DevOps interview, we talked in depth about using load balancers to distribute traffic to a web app. This time I would like to look a little deeper into some of the issues surrounding load balancers, specifically how we can use them to mitigate DDoS attacks. There are many ways to breach a web app and cause havoc. One of the more popular ways to make web admins’ lives harder is the distributed denial of service attack. We are going to explore some of the finer points of conducting one of these attacks, then we will explore the effects of such attacks on our differently configured load balancers.

How Not to Screw Up Your DevOps Interview

I screwed up a tiny bit. I went into an interview unprepared. What’s worse is that I was unprepared for the part I should have known all about: the technical portion.

Networking in Python: Hexadecimal to Decimal Converter in Python

This is the first in a series about networking using Python and Python libraries to accomplish a myriad of networking goals. I’m currently working to build some networking tools so I want to share some of the tools I’m making.

My First Tango with Blender and 3D Animation

Creating cool animations using Blender

I recently had a cool IRC discussion with someone about developing games using Unity. I had lamented that I have no experience in making my own 3D assets, to which he replied that he knows how to create all of those. I was dazzled. Since I had read about how companies like Disney put together whole data centers to make their movies, I had always thought you would need some expensive rig, or a particular set of skills that made me a nightmare for certain people. It was suggested that I check out a few tools that are used for making 3D models. Specifically I zeroed in on The Blender Project, an open source 3D animation suite.

Testing in Python Part II: PyTest

After a shallow dive into the world of unit testing in Python I took to the internet comments for guidance. I know what you’re thinking: it’s an exercise in poor thinking, keyboard warrior syndrome, and people telling me to go fuck myself. However, every once in a while you come across someone who knows what they are talking about and you can get some pretty cool information from them. In this exercise someone suggested to me (on top of definitely upgrading to Python 3) that I switch to using Pytest to write my test scripts.

Unit Testing in Python

When I started reading more into and practicing unit testing in Python my first question was: “How did I not know about this, you jagweeds?” The truth is that unit testing is an incredibly important part of developing software. I consider myself at a level where I have moved beyond learning the basics and have graduated to actually putting together projects, building new sites, and learning to make tools to automate my life. Now it is up to me to learn about and make unit tests for my projects. I’d like to pass along what I have learned to others.

Creating a Simple Spam Filter in PHP

I started with a post giving an in-depth look into creating and utilizing objects in PHP. That post will follow soon enough; I want to make sure that I am fully prepared to teach all I can about objects in PHP. What I ended up with here is a super simple spam filter for email. The form is easy: you enter your name, email address, password (hidden), and the domain of the person sending you an email. It’s a simple app for when you receive an email from someone and you’re not sure whether or not they’re spamming you.

Creating Easy Fluid Layouts with Flexbox

I see a lot of designers and developers jumping ship to frameworks, and that is commendable because it makes life easier and can sometimes be better than handwritten code. Dinosaurs like myself like to write code. Now, CSS isn’t technically code, so my horse is not that high. What I really mean is that when you talk about technology, hands-on experience is essential, and that means the stuff out front as well. Even though frameworks are dominating, developers and designers should be able to write code. It’s easy to just use Bootstrap, and I do, but I love using Flexbox as one of my layout tools. It makes me less stressed out and less of a swear word sayer.
