Creating a Simple Web Crawler
We are going to create a web crawler to scrape the book details (URL, Title, Price) from a Web Scraping Sandbox website.
1. Install the required package – run the following command from the terminal
pip install scrapy
2. Create a Scrapy project – run the following commands from the terminal
scrapy startproject booklist
cd booklist
scrapy genspider book http://books.toscrape.com/
Here,
- Project Name: “booklist”
- Spider Name: “book”
- Domain to be Scraped: “http://books.toscrape.com/”
Directory Structure:
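The `scrapy startproject` and `scrapy genspider` commands generate the standard Scrapy scaffold. A sketch of the resulting layout (the exact files may vary slightly between Scrapy versions):

```
booklist/
    scrapy.cfg            # deploy configuration file
    booklist/             # project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            book.py       # spider generated by "scrapy genspider"
```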
3. Create an Item – replace the contents of the “booklist\items.py” file with the code below
We define each Item scraped from the website as an object with the following 3 fields:
- url
- title
- price
```python
# booklist\items.py
# Define here the models for your scraped items

from scrapy.item import Item, Field


class BooklistItem(Item):
    url = Field()
    title = Field()
    price = Field()
```
4. Define the Parse function – Add the following code to “booklist\spiders\book.py”
The response from the crawler is parsed to extract the book details (URL, Title, Price), as shown in the code below.
```python
# booklist\spiders\book.py
import scrapy

from booklist.items import BooklistItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on the page is wrapped in an <article class="product_pod">
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
```
5. Run the spider using the following command:
scrapy crawl book
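The scraped items can also be written straight to a file using Scrapy's feed exports from the command line (the `-O` overwrite flag requires Scrapy 2.1 or later; older versions only support `-o`, which appends):

```shell
# Overwrite books.json with the scraped items (Scrapy >= 2.1)
scrapy crawl book -O books.json

# The export format is inferred from the file extension
scrapy crawl book -O books.csv
scrapy crawl book -O books.jl
```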
Output: the scraped items (URL, Title, Price) are printed as part of the crawl log.
Scrapy – Feed exports
Scrapy is a fast, high-level web crawling and scraping framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for many purposes, from data mining to monitoring and automated testing.
This article is divided into 2 sections:
- Creating a Simple web crawler to scrape the details from a Web Scraping Sandbox website (http://books.toscrape.com/)
- Exploring how Scrapy Feed exports can be used to store the scraped data to export files in various formats.
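Feed exports can also be configured declaratively via the `FEEDS` setting in `settings.py` (a sketch; the file names below are illustrative):

```python
# booklist\settings.py (fragment)
FEEDS = {
    'books.json': {'format': 'json', 'overwrite': True},
    'books.csv': {'format': 'csv'},
}
```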