Scrapy – Feed exports

Scrapy is a fast, high-level web crawling and scraping framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for many purposes, from data mining to monitoring and automated testing.

This article is divided into 2 sections:

  1. Creating a simple web crawler to scrape the book details from a Web Scraping Sandbox website (http://books.toscrape.com/)
  2. Exploring how Scrapy Feed exports can be used to store the scraped data in files of various formats.

Creating a Simple Web Crawler

We are going to create a web crawler to scrape all the book details (URL, Title, Price) from a Web Scraping Sandbox website.

1. Install the required packages – run the following command from the terminal:

pip install scrapy

2. Create a Scrapy project – run the following commands from the terminal:

scrapy startproject booklist
cd booklist 
scrapy genspider book http://books.toscrape.com/

Here,

  • Project Name:  “booklist”
  • Spider Name: “book”
  • Domain to be Scraped: “http://books.toscrape.com/” 

Directory Structure:

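After these commands, the project layout typically looks like the tree below (the exact files can vary slightly between Scrapy versions):

booklist/
    scrapy.cfg            # deploy configuration file
    booklist/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in step 3)
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            book.py       # the spider generated by "scrapy genspider" (edited in step 4)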

3. Create an Item – replace the contents of the "booklist\items.py" file with the code below.

We define each Item scraped from the website as an object with the following 3 fields:

  • url
  • title
  • price

Python
# booklist\items.py
  
# Define here the models for your scraped items
from scrapy.item import Item, Field
  
class BooklistItem(Item):
    url = Field()
    title = Field()
    price = Field()
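
Scrapy Items support dictionary-style access, so you can sanity-check the class from a Python shell inside the project directory. The snippet below is a minimal sketch with made-up field values:

Python

# hypothetical interactive check, run from the project directory
from booklist.items import BooklistItem

item = BooklistItem(url="catalogue/some-book_1/index.html",
                    title="Some Book", price="£10.00")
print(item["title"])   # fields are read with dict-style access
print(dict(item))      # an Item can also be converted to a plain dict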


4. Define the parse function – replace the contents of the "booklist\spiders\book.py" file with the following code.

The response from the crawler is parsed to extract the book details (i.e. URL, Title, Price), as shown in the code below.

Python
# booklist\spiders\book.py
  
import scrapy
from booklist.items import BooklistItem
  
class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
  
    def parse(self, response):
        # Each book on the listing page sits inside an <article class="product_pod"> element
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                # the href is relative to the listing page
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
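
The spider above only parses the first listing page. To crawl the rest of the catalogue as well, a common pattern is to follow the site's "next" pagination link at the end of parse. The snippet below is a minimal sketch; it assumes the pagination link matches li.next > a, which is how books.toscrape.com structures its "next" button.

Python

        # Optional: append at the end of the parse method to follow pagination
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page is not None:
            # response.follow resolves the relative href against the current page URL
            yield response.follow(next_page, callback=self.parse)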


5. Run the spider using the following command:

scrapy crawl book

Output:

The crawl log lists one scraped item per book on the page, each with its url, title and price.
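
By default, scrapy crawl only writes the scraped items to the log. Scrapy's feed exports (the topic of the article's second section) let you save them to a file straight from the command line, with the export format inferred from the file extension. For example:

scrapy crawl book -o books.json

Note that -o appends to the file if it already exists; newer Scrapy versions also provide -O, which overwrites the file instead.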

