Data Extraction Using Scrapy Items

We will scrape the Book Title and Book Price from the Women’s Fiction webpage. Scrapy allows the use of selectors to write the extraction code; these can be written as CSS or XPath expressions, which traverse the HTML page to reach the desired data. The main objective of scraping is to get structured data from unstructured sources. Usually, Scrapy spiders yield their data as Python dictionary objects. This approach is fine for a small amount of data, but as the data grows, so does the complexity, and we may also want to process the data before storing it in any file format. This is where Scrapy Items come in handy: they allow the data to be processed using Item Loaders. Let us write a Scrapy Item for the Book Title and Price, and the XPath expressions for the same.
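
Before moving to Items, here is roughly what the plain-dictionary approach mentioned above looks like. This is a minimal sketch, not code from the project; it uses the same XPath expressions that are derived step by step later in this section:

Python3

# Illustrative sketch: a parse() callback that yields plain
# Python dictionaries instead of Scrapy Items
def parse(self, response):
    for books in response.xpath('//*[@class="product_pod"]'):
        yield {
            'title': books.xpath('.//h3/a/@title').extract_first(),
            'price': books.xpath(
                './/*[@class="product_price"]/p/text()').extract_first(),
        }

This works, but nothing constrains the dictionary keys, and there is no hook for cleaning the values before export; Items and Item Loaders address both problems.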

In the ‘items.py’ file, we mention the attributes we need to scrape.

We define them as follows:

Python3
# Define here the models for your scraped item
import scrapy
 
# Item class name for the book title and price
class GfgItemloadersItem(scrapy.Item):
   
    # Scrape Book price
    price = scrapy.Field()
     
    # Scrape Book Title
    title = scrapy.Field()
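
Because the Item class declares its fields up front, it already provides a safety net that a plain dictionary lacks: assigning to an undeclared field raises a KeyError. A small illustrative session (the 'publisher' key is deliberately not a declared field):

Python3

item = GfgItemloadersItem()
item['price'] = '£28.07'        # fine: 'price' is a declared field
item['title'] = 'Little Women'  # fine: 'title' is a declared field

# Raises KeyError: GfgItemloadersItem does not support field: publisher
item['publisher'] = 'Penguin'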


  • Please note that Field() provides a way to define all field metadata in one location. It does not add any extra attributes.
  • XPath expressions allow us to traverse the webpage and extract the data. Right-click on one of the books and select the ‘Inspect’ option. This should show its HTML attributes in the browser. All the books on the webpage are contained in the same <article> HTML tag, with the class attribute ‘product_pod’. It can be seen as below –

All books belong to the same ‘class’ attribute ‘product_pod’

  • Hence, we can iterate through the <article> tags with this class attribute to extract all Book Titles and Prices on the webpage. The XPath expression for this will be books = response.xpath('//*[@class="product_pod"]'). This returns all the book HTML tags belonging to the class attribute "product_pod". The '*' operator indicates all tags belonging to the class 'product_pod'. Hence, we can now have a loop that navigates to each and every Book on the page.
  • Inside the loop, we need to get the Book Title. Right-click on the title and choose ‘Inspect’. It is included in an <a> tag inside the header <h3> tag. We will fetch the "title" attribute of the <a> tag. The XPath expression for this would be books.xpath('.//h3/a/@title').extract(). The dot operator indicates that we are now using the 'books' object to extract data from it, relative to each matched <article> tag. This syntax traverses through the header <h3> and then the <a> tag to get the title of the book.
  • Similarly, to get the Price of the book, right-click on it and choose ‘Inspect’ to see its HTML attributes. All the price elements belong to the <div> tag with the class attribute "product_price". The actual price is mentioned inside a paragraph tag present inside that <div> element. Hence, the XPath expression to get the actual text of the Price would be books.xpath('.//*[@class="product_price"]/p/text()').extract_first(). The extract_first() method returns the first matched price value. Both expressions can be verified in isolation, as shown in the sketch below.
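
The following is an illustrative, self-contained way to try both expressions, using scrapy.Selector on an abridged HTML fragment that mirrors the structure described above (the title and price values are made up for the example):

Python3

from scrapy.selector import Selector

# Abridged HTML fragment mirroring the webpage structure
html = '''
<article class="product_pod">
  <h3><a title="Little Women">Little Wo...</a></h3>
  <div class="product_price"><p class="price_color">£28.07</p></div>
</article>
'''

sel = Selector(text=html)
for books in sel.xpath('//*[@class="product_pod"]'):
    # The leading dot keeps the search relative to the 'books' selector
    print(books.xpath('.//h3/a/@title').extract())
    print(books.xpath('.//*[@class="product_price"]/p/text()').extract_first())

# Prints:
# ['Little Women']
# £28.07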

We will create an object of the above Item class in the spider and yield the same. The spider code file will look as follows:

Python3
# Import Scrapy library
import scrapy
 
# Import Item class
from ..items import GfgItemloadersItem
 
# Spider class name
class GfgLoadbookdataSpider(scrapy.Spider):
   
    # Name of the spider
    name = 'gfg_loadbookdata'
     
    # The domain to be scraped (domains only; URL paths do not belong here)
    allowed_domains = ['books.toscrape.com']
     
    # The URL to be scraped
    start_urls = [
        'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/']
     
    # Default parse callback method
    def parse(self, response):
       
        # loop through all books
        for books in response.xpath('//*[@class="product_pod"]'):

            # Create a fresh object of the Item class for each book,
            # so that yielded items do not share state
            item = GfgItemloadersItem()
           
            # XPath expression for the book price
            price = books.xpath(
                './/*[@class="product_price"]/p/text()').extract_first()
             
            # place price value in item key
            item['price'] = price
             
            # XPath expression for the book title: the 'title'
            # attribute of the <a> tag holds the full title text
            title = books.xpath('.//h3/a/@title').extract()
             
            # place title value in item key
            item['title'] = title
             
            # yield the item
            yield item


  • We now execute the above code using the scrapy crawl command, whose syntax is scrapy crawl spider_name, at the terminal –
scrapy crawl gfg_loadbookdata -o not_parsed_data.json

The data is exported to the "not_parsed_data.json" file, which can be seen below:

The items yielded when data is not parsed
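
For reference, the exported records have roughly the following shape. The values below are illustrative, not actual site output; note that title comes out as a list, because extract() returns all matches, while price is a single string thanks to extract_first():

[
    {"price": "£28.07", "title": ["Little Women"]},
    {"price": "£23.11", "title": ["The Bell Jar"]}
]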

Now, suppose we want to process the scraped data before yielding and storing it in any file format; this is where we can use Item Loaders.
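
As a preview of where this leads, the sketch below attaches processors to the same two fields through an Item Loader, so the pound sign is stripped from the price and single values are unwrapped from their lists automatically. This is a minimal sketch, assuming a recent Scrapy version where the built-in processors live in the itemloaders.processors module (older releases exposed them as scrapy.loader.processors):

Python3

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

# An Item Loader with input/output processors for our two fields
class BookLoader(ItemLoader):
    # Return the first extracted value instead of a list
    default_output_processor = TakeFirst()
    # Strip the pound sign from the price before it is stored
    price_in = MapCompose(lambda v: v.replace('£', ''))

# Inside the spider, the parse() loop would then become:
#     for books in response.xpath('//*[@class="product_pod"]'):
#         loader = BookLoader(item=GfgItemloadersItem(), selector=books)
#         loader.add_xpath('title', './/h3/a/@title')
#         loader.add_xpath('price', './/*[@class="product_price"]/p/text()')
#         yield loader.load_item()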

