Scrapy The Tool

As part of my job, I have to scrape some websites to give our sales team data on the market. Until now they were doing it manually, which is tedious and consumes a lot of their productive time. After some searching through different tools and frameworks, I came across a framework named Scrapy. Here I am going to share how to set up and use Scrapy.

Scrapy is a free and open-source web-crawling framework written in Python, used to extract data from websites without much hassle. It has very nice documentation, which you can check out here.

Steps to Install Scrapy

sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
pip3 install Scrapy

Steps to Create New Project

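To create a Scrapy project, run this command in your terminal:

scrapy startproject <project name>

The generated project contains a scrapy.cfg file and a package directory with items.py, middlewares.py, pipelines.py, settings.py and a spiders/ directory.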

Now create a Python file in the spiders/ directory and paste the code below.

#!/usr/bin/env python3
import scrapy

class RedditSpider(scrapy.Spider):
    # Name of the spider; it must be unique within the project.
    name = "reddit"
    # List of URLs to crawl.
    start_urls = ['https://www.reddit.com/']

    # Called with the response downloaded for each URL above.
    def parse(self, response):
        # CSS selector for the anchor tags that contain the post titles.
        top_posts = response.css("a.SQnoC3ObvgnGjWt90zD9Z")
        for post in top_posts:
            self.log(post.css('::text').extract_first())

To start scraping, run:

scrapy crawl reddit

Here we are scraping the Reddit home page and logging the title of each post. The output of the above code will look like this.

  • Trump Organization ‘Sold Property to Shell Company Linked to Maduro Regime,’ Says Report
  • Blind people of Reddit, what do you find sexually attractive?
  • A “caravan” of Americans is crossing the Canadian border to get affordable medical care
  • A “caravan” of Americans is crossing the Canadian border to get affordable medical care
  • [Post Game Thread] The Houston Rockets defeat the Golden State Warriors, 112-108, behind Harden’s 38 points to level the series 2-2, despite the continued brilliance of Kevin Durant
  • 18, my friend here is failing biology and thinks she’s unroastable. Go for it guys, and go hard
  • If you strike me down, I shall become more powerful than you can possibly imagine. [BOTW]
  • ELI5: Why are all economies expected to “grow”? Why is an equilibrium bad?
    ….

The best part of Scrapy is that you can easily experiment with any website before creating a project, using the Scrapy shell:

scrapy shell 'https://www.reddit.com/'

You can then try different CSS selectors on the response object. There is a lot more you can do with Scrapy, such as saving the results in JSON or CSV format and even integrating with a Django project. I might show that in the next post; till then, goodbye.

Cheers
