Scrapy – Feed exports

Scrapy is a fast, high-level web crawling and scraping framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for many purposes, from data mining to monitoring and automated testing.

This article is divided into 2 sections:

  1. Creating a simple web crawler to scrape the book details from a Web Scraping Sandbox website (http://books.toscrape.com/)
  2. Exploring how Scrapy Feed exports can be used to store the scraped data in files of various formats.

Creating a Simple Web Crawler

We are going to create a web crawler to scrape all the book details (URL, Title, Price) from a Web Scraping Sandbox website.

1. Install the required packages – run the following command from the terminal:

pip install scrapy

2. Create a Scrapy project – run the following commands from the terminal:

scrapy startproject booklist
cd booklist 
scrapy genspider book http://books.toscrape.com/

Here,

  • Project Name:  “booklist”
  • Spider Name: “book”
  • Domain to be Scraped: “http://books.toscrape.com/” 

Directory Structure:

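After these commands, the project layout typically looks like the tree below (the exact files can vary slightly between Scrapy versions):

booklist/
    scrapy.cfg            # deploy configuration file
    booklist/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in step 3)
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            book.py       # the spider generated by "scrapy genspider" (edited in step 4)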

3. Create an Item – replace the contents of the "booklist\items.py" file with the code below.

We define each Item scraped from the website as an object with the following 3 fields:

  • url
  • title
  • price

Python
# booklist\items.py
  
# Define here the models for your scraped items
from scrapy.item import Item, Field
  
class BooklistItem(Item):
    url = Field()
    title = Field()
    price = Field()
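
Scrapy Items support dictionary-style access, so you can sanity-check the class from a Python shell inside the project directory. The snippet below is a minimal sketch with made-up field values:

Python

# hypothetical interactive check, run from the project directory
from booklist.items import BooklistItem

item = BooklistItem(url="catalogue/some-book_1/index.html",
                    title="Some Book", price="£10.00")
print(item["title"])   # fields are read with dict-style access
print(dict(item))      # an Item can also be converted to a plain dict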


4. Define the parse function – replace the contents of the "booklist\spiders\book.py" file with the following code.

The response from the crawler is parsed to extract the book details (i.e. URL, Title, Price), as shown in the code below.

Python
# booklist\spiders\book.py
  
import scrapy
from booklist.items import BooklistItem
  
class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
  
    def parse(self, response):
        # Each book on the listing page sits inside an <article class="product_pod"> element
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                # the href is relative to the listing page
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
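
The spider above only parses the first listing page. To crawl the rest of the catalogue as well, a common pattern is to follow the site's "next" pagination link at the end of parse. The snippet below is a minimal sketch; it assumes the pagination link matches li.next > a, which is how books.toscrape.com structures its "next" button.

Python

        # Optional: append at the end of the parse method to follow pagination
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page is not None:
            # response.follow resolves the relative href against the current page URL
            yield response.follow(next_page, callback=self.parse)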


5. Run the spider using the following command:

scrapy crawl book

Output:

The crawl log lists one scraped item per book on the page, each with its url, title and price.
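
By default, scrapy crawl only writes the scraped items to the log. Scrapy's feed exports (the topic of the article's second section) let you save them to a file straight from the command line, with the export format inferred from the file extension. For example:

scrapy crawl book -o books.json

Note that -o appends to the file if it already exists; newer Scrapy versions also provide -O, which overwrites the file instead.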

