Implementing Item Loaders to Parse Data

Now that we have a general understanding of Item Loaders, let us implement the above concepts in our example:

  • In the spider file ‘gfg_loadbookdata.py’, we use the ItemLoader class from the scrapy.loader module. The import statement is “from scrapy.loader import ItemLoader”.
  • In the parse method, which is the default callback method of the spider, we are already looping through all the books.
  • Inside the loop, create an object of the ItemLoader class, with the following arguments:
    • Pass an instance of GfgItemloadersItem as the item argument.
    • Pass ‘books’, the selector for the current book in the loop, as the selector argument.
    • So the code will look like this: “loader = ItemLoader(item=GfgItemloadersItem(), selector=books)”
  • Use the Item Loader method add_xpath(), passing the item field name and the XPath expression.
  • For the ‘price’ field, write its XPath in the add_xpath() method. The syntax will be “loader.add_xpath('price', './/*[@class="product_price"]/p/text()')”. Here, we select the text of the price by navigating to the price tag and then fetching its text using text().
  • For the ‘title’ field, write its XPath expression in the add_xpath() method. The syntax will be “loader.add_xpath('title', './/h3/a/@title')”. Here, we fetch the value of the ‘title’ attribute of the <a> tag.
  • Yield the loaded item by calling the load_item() method of the loader.
  • Now, let us make changes in the ‘items.py’ file. Every Item field defined here can have an input processor and an output processor. When data is received, the input processor acts on it, as defined by the function. The results are collected in an internal list, which is passed to the output processor when the item is populated using the load_item() method. Currently, price and title are defined as scrapy.Field().
  • For the book price values, we need to remove the ‘£’ sign. Here, we assign the built-in MapCompose() processor as the input_processor. Its first argument is the remove_tags function, which removes any tags present in the selected value. The second argument is our custom function, remove_pound_sign(), which replaces the ‘£’ sign with an empty string. The output_processor for the price field will be TakeFirst(), the built-in processor that returns the first non-null, non-empty value from the collected values. Hence, the syntax for the price Item field will be price = scrapy.Field(input_processor=MapCompose(remove_tags, remove_pound_sign), output_processor=TakeFirst()).
  • The functions used for the price are remove_tags and remove_pound_sign. The remove_tags() function is imported from the w3lib.html module. It removes all the tags present in the scraped value. The remove_pound_sign() function is our custom function that accepts the price value of every book, replaces the ‘£’ sign with an empty string, and strips the surrounding whitespace. Python's built-in string replace() method is used for the replacement.
  • Similarly, for the book title, we will replace ‘&’ with ‘AND’ by assigning appropriate input and output processors. The input_processor will be MapCompose(), whose first argument is the remove_tags function, which removes all the tags, and whose second argument is replace_and_sign(), our custom function that replaces ‘&’ with ‘AND’. The output_processor will be TakeFirst(), which returns the first non-null, non-empty value from the collected values. Hence, the book title field will be title = scrapy.Field(input_processor=MapCompose(remove_tags, replace_and_sign), output_processor=TakeFirst()).
  • The functions used for the title are remove_tags and replace_and_sign. The remove_tags() function is imported from the w3lib.html module. It removes all the tags present in the scraped value. The replace_and_sign() function is our custom function that accepts the title of every book and replaces each ‘&’ in it with ‘ AND ’. Python's built-in string replace() method is used for the replacement.
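
The flow described in the steps above (the input processor runs when data is added, and the output processor runs on load_item()) can be sketched in plain Python. This is only a rough imitation for illustration, not Scrapy's actual implementation: strip_tags, map_compose, take_first and MiniLoader are hypothetical stand-ins for remove_tags, MapCompose, TakeFirst and ItemLoader, and the sample strings are made up.

```python
import re


def strip_tags(value):
    # Simplified stand-in for w3lib.html.remove_tags (assumption):
    # drop anything that looks like an HTML tag
    return re.sub(r'<[^>]+>', '', value)


def remove_pound_sign(value):
    # Same custom function as in the article
    return value.replace('£', '').strip()


def map_compose(*functions):
    # Simplified stand-in for MapCompose: pass every value through
    # each function in order, dropping None results
    def process(values):
        for function in functions:
            values = [function(v) for v in values]
            values = [v for v in values if v is not None]
        return values
    return process


def take_first(values):
    # Simplified stand-in for TakeFirst: first value that is
    # neither None nor an empty string
    for value in values:
        if value is not None and value != '':
            return value
    return None


class MiniLoader:
    """Hypothetical, stripped-down imitation of a loader for one field."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self.collected = []

    def add_value(self, values):
        # The input processor runs as soon as data is added
        self.collected.extend(self.input_processor(values))

    def load_item(self):
        # The output processor runs on the collected list
        return self.output_processor(self.collected)


price_loader = MiniLoader(map_compose(strip_tags, remove_pound_sign), take_first)
price_loader.add_value(['£51.77', '\n    ', ' In stock '])
price = price_loader.load_item()

print(price_loader.collected)  # ['51.77', '', 'In stock']
print(price)                   # 51.77
```

The sample input has three values to show why an output processor such as TakeFirst() is useful: the collected list can hold several entries, and only the first non-empty one ends up in the field.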

The final code for our ‘items.py’ file will look as shown below:

Python3




# Define here the models for your scraped items
 
# import Scrapy library
import scrapy
 
# import the built-in processors
from itemloaders.processors import TakeFirst, MapCompose
 
# import remove_tags method to remove all tags present
# in the response
from w3lib.html import remove_tags
 
# custom method to replace '&' with 'AND'
# in book title
def replace_and_sign(value):
     
    # python replace method to replace '&' operator
    # with 'AND'
    return value.replace('&', ' AND ')
 
# custom method to remove the pound currency sign from
# book price
def remove_pound_sign(value):
   
    # for pound press Alt + 0163
    # python replace method to replace '£' with a blank
    return value.replace('£', '').strip()
 
# Item class to define all the Item fields - book title
# and price
class GfgItemloadersItem(scrapy.Item):
   
    # Assign the input and output processor for book price field
    price = scrapy.Field(input_processor=MapCompose(
        remove_tags, remove_pound_sign), output_processor=TakeFirst())
     
    # Assign the input and output processor for book title field
    title = scrapy.Field(input_processor=MapCompose(
        remove_tags, replace_and_sign), output_processor=TakeFirst())
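
As a quick sanity check, the two custom functions can be called directly, outside Scrapy. The snippet below repeats their definitions so that it runs on its own; the sample strings are made up for illustration, and the compact_title step is an optional extra that is not part of the article's code.

```python
def replace_and_sign(value):
    # Replace '&' with ' AND ' (note the surrounding spaces)
    return value.replace('&', ' AND ')


def remove_pound_sign(value):
    # Remove the pound sign and strip surrounding whitespace
    return value.replace('£', '').strip()


clean_price = remove_pound_sign('£51.77')
clean_title = replace_and_sign('Salt & Pepper')

# Optional: collapse the doubled spaces left around ' AND '
compact_title = ' '.join(clean_title.split())

print(clean_price)          # 51.77
print(repr(clean_title))    # 'Salt  AND  Pepper'
print(repr(compact_title))  # 'Salt AND Pepper'
```

Because the replacement string already carries a space on each side, a title that has spaces around ‘&’ ends up with two spaces on either side of ‘AND’. If that matters for your output, a whitespace-collapsing step like the one above could be added to the input processor.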


The final spider file code will look as follows:

Python3




# Import the required Scrapy library
import scrapy
 
# Import the Item Loader library
from scrapy.loader import ItemLoader
 
# Import the items class from 'items.py' file
from ..items import GfgItemloadersItem
 
# Spider class having Item loader
class GfgLoadbookdataSpider(scrapy.Spider):
    # Name of the spider
    name = 'gfg_loadbookdata'
     
    # The domain to be scraped (domain name only, without a path)
    allowed_domains = ['books.toscrape.com']
     
    # The webpage to be scraped
    start_urls = [
        'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/']
     
    # Default callback method used by the spider
    # Data in the response will be processed here
    def parse(self, response):
       
        # Loop through all the books using XPath expression
        for books in response.xpath('//*[@class="product_pod"]'):
 
            # Define Item Loader object,
            # by passing item and selector attribute
            loader = ItemLoader(item=GfgItemloadersItem(), selector=books)
             
            # Item loader method add_xpath(),for price,
            # mention the field name and xpath expression
            loader.add_xpath('price', './/*[@class="product_price"]/p/text()')
 
            # Item loader method add_xpath(),
            # for title, mention the field name
            # and xpath expression
            loader.add_xpath('title', './/h3/a/@title')
 
            # use the load_item method of
            # loader to populate the parsed items
            yield loader.load_item()
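
Before running the spider, it can help to check the element paths against a small piece of markup. The sketch below uses only Python's standard library, whose ElementTree module supports a limited subset of XPath: it has no text() or @title steps, so .text and .get('title') are used as rough equivalents. The markup is a made-up, simplified sample that assumes a structure similar to the target page; in a real project, Scrapy's own selectors would be the tool to use.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample that assumes a structure similar
# to one book entry on the target page
SAMPLE = """
<ol>
  <li>
    <article class="product_pod">
      <h3><a href="#" title="A Sample Title">A Sample ...</a></h3>
      <div class="product_price">
        <p class="price_color">£10.00</p>
      </div>
    </article>
  </li>
</ol>
"""

root = ET.fromstring(SAMPLE)
rows = []

# Same element paths as in the spider, minus the final
# text() and @title steps
for book in root.findall('.//*[@class="product_pod"]'):
    title = book.find('.//h3/a').get('title')
    price = book.find('.//*[@class="product_price"]/p').text
    rows.append((title, price))

print(rows)  # [('A Sample Title', '£10.00')]
```

The price still carries the ‘£’ sign at this stage, because the processors defined in ‘items.py’ only run when the value passes through the Item Loader.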


We can run the spider and save the data to a JSON file using the ‘scrapy crawl’ command, whose syntax is scrapy crawl spider_name, as follows:

scrapy crawl gfg_loadbookdata -o parsed_bookdata.json

The above command scrapes and parses the data, which means the pound sign will be removed and the ‘&’ operator will be replaced with ‘AND’. The parsed_bookdata.json file is created as follows:

The parsed JSON output file using Item Loaders
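
One way to confirm that the parsing worked is to load the exported records and look for leftover ‘£’ or ‘&’ characters. The sketch below assumes the JSON feed is a list of objects with ‘price’ and ‘title’ keys, which matches the fields defined above; sample_output is made-up data in that shape, and find_unparsed is a hypothetical helper name.

```python
import json

# Made-up sample in the shape of the exported feed:
# a JSON list with one object per book
sample_output = '''
[
  {"price": "10.00", "title": "A Sample Title"},
  {"price": "12.50", "title": "Salt  AND  Pepper"}
]
'''


def find_unparsed(records):
    """Return the records that still contain '£' in the price or '&' in the title."""
    return [
        record for record in records
        if '£' in record.get('price', '') or '&' in record.get('title', '')
    ]


records = json.loads(sample_output)
leftovers = find_unparsed(records)

print(len(records), len(leftovers))  # 2 0
```

To run the same check on the real output, the records could be read with json.load() from parsed_bookdata.json instead of the sample string.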



