Data Extraction Using Scrapy Items
We will scrape the book title and book price from the Women’s Fiction webpage. Scrapy allows the use of selectors to write the extraction code; they can be written using CSS or XPath expressions, which traverse the HTML page to reach the desired data. The main objective of scraping is to obtain structured data from unstructured sources. Usually, Scrapy spiders yield data as Python dictionary objects. This approach works well for a small amount of data, but as the data grows, so does the complexity. It may also be desirable to process the data before storing it in any file format. This is where Scrapy Items come in handy: they allow the data to be processed using Item Loaders. Let us write a Scrapy Item for the book title and price, along with the XPath expressions for both.
In the ‘items.py’ file, mention the attributes we need to scrape. We define them as follows:
Python3
# Define here the models for your scraped item
import scrapy


# Item class name for the book title and price
class GfgItemloadersItem(scrapy.Item):
    # Scrape Book price
    price = scrapy.Field()
    # Scrape Book Title
    title = scrapy.Field()
- Please note that Field() provides a way to define all field metadata in one location. It does not add any extra attributes.
- XPath expressions allow us to traverse the webpage and extract the data. Right-click on one of the books and select the ‘Inspect’ option. This shows its HTML attributes in the browser. All the books on the webpage are contained in the same <article> HTML tag, with the class attribute ‘product_pod’. It can be seen as below –
- Hence, we can iterate through the <article> tags with the class attribute ‘product_pod’ to extract all book titles and prices on the webpage. The XPath expression for this is books = response.xpath('//*[@class="product_pod"]'). It returns all the book HTML tags belonging to the class “product_pod”. The ‘*’ operator matches any tag with that class. We can now have a loop that navigates to each and every book on the page.
- Inside the loop, we need to get the book title, so right-click on a title and choose ‘Inspect’. It is contained in an <a> tag inside the <h3> header tag. We will fetch the “title” attribute of the <a> tag. The XPath expression for this is books.xpath('.//h3/a/@title').extract(). The leading dot indicates that we are extracting relative to the ‘books’ object; the expression traverses through the header and then the <a> tag to get the title of the book.
- Similarly, to get the price of the book, right-click on it and choose ‘Inspect’ to see its HTML attributes. All the price elements belong to the <div> tag with the class attribute “product_price”. The actual price is mentioned inside a paragraph tag within that <div> element. Hence, the XPath expression to get the text of the price is books.xpath('.//*[@class="product_price"]/p/text()').extract_first(). The extract_first() method returns the first matching price value.
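The XPath logic above can be tried out offline on a small inline HTML fragment. The snippet below uses Python’s standard-library ElementTree, whose XPath support is only a subset of Scrapy’s, so attribute values and text are read with .get() and .text instead of @title and text(). The fragment is a made-up stand-in that mirrors the structure of the page, not actual site data.

```python
# Sketch of the XPath traversal using only the standard library.
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure of the books page
html = """
<div>
  <article class="product_pod">
    <h3><a title="Example Book One">Example Bo...</a></h3>
    <div class="product_price"><p>£10.00</p></div>
  </article>
  <article class="product_pod">
    <h3><a title="Example Book Two">Example Bo...</a></h3>
    <div class="product_price"><p>£12.50</p></div>
  </article>
</div>
"""

root = ET.fromstring(html)
results = []
# Equivalent of response.xpath('//*[@class="product_pod"]')
for book in root.iterfind('.//article[@class="product_pod"]'):
    # Equivalent of books.xpath('.//h3/a/@title').extract_first()
    title = book.find('.//h3/a').get('title')
    # Equivalent of books.xpath('.//*[@class="product_price"]/p/text()').extract_first()
    price = book.find('.//div[@class="product_price"]/p').text
    results.append({'title': title, 'price': price})

print(results)
```

Note how the relative path (the leading dot) scopes each lookup to the current book element, exactly as the dot operator does in the Scrapy expressions.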
We will create an object of the above Item class in the spider and yield it. The spider code file will look as follows:
Python3
# Import Scrapy library
import scrapy

# Import Item class
from ..items import GfgItemloadersItem


# Spider class name
class GfgLoadbookdataSpider(scrapy.Spider):
    # Name of the spider
    name = 'gfg_loadbookdata'

    # The domain to be scraped
    allowed_domains = [
        'books.toscrape.com/catalogue/category/books/womens-fiction_9'
    ]

    # The URL to be scraped
    start_urls = [
        'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/'
    ]

    # Default parse callback method
    def parse(self, response):
        # Create an object of Item class
        item = GfgItemloadersItem()

        # Loop through all books
        for books in response.xpath('//*[@class="product_pod"]'):
            # XPath expression for the book price
            price = books.xpath(
                './/*[@class="product_price"]/p/text()').extract_first()
            # Place price value in item key
            item['price'] = price

            # XPath expression for the book title (the 'title' attribute
            # of the <a> tag holds the full title)
            title = books.xpath('.//h3/a/@title').extract()
            # Place title value in item key
            item['title'] = title

            # Yield the item
            yield item
- We execute the above code using the scrapy “crawl” command, whose syntax is scrapy crawl spider_name, at the terminal –
scrapy crawl gfg_loadbookdata -o not_parsed_data.json
The data is exported to the “not_parsed_data.json” file, which can be seen as below:
Now, suppose we want to process the scraped data before yielding and storing it in any file format; then we can use Item Loaders.
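Before moving on, the idea behind such processing can be sketched in plain Python. This is only a conceptual illustration, not Scrapy’s actual Item Loader API: each field gets a pipeline of processor functions that clean the raw scraped value before it is stored, and all names and values below are hypothetical examples.

```python
# Conceptual sketch of per-field processing, NOT Scrapy's ItemLoader API.

def strip_currency(value):
    # Hypothetical processor: drop a leading currency symbol
    return value.lstrip('£$')

def to_float(value):
    # Hypothetical processor: convert the cleaned price to a number
    return float(value)

# Per-field processor pipelines (assumed example configuration)
processors = {
    'price': [strip_currency, to_float],
    'title': [str.strip],
}

def load_item(raw):
    # Apply each field's processors to its raw value, in order
    item = {}
    for field, value in raw.items():
        for processor in processors.get(field, []):
            value = processor(value)
        item[field] = value
    return item

print(load_item({'price': '£10.00', 'title': '  Example Title  '}))
```

Scrapy’s Item Loaders implement this pattern with input and output processors attached to each Item field, which is what the next section covers.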
Scrapy – Item Loaders
In this section, we are going to discuss Item Loaders in Scrapy. Scrapy extracts data using spiders that crawl through the website, and the obtained data can be structured as Scrapy Items. Item Loaders play a significant role in parsing the data before populating the Item fields.