Implementing Item Loaders to Parse Data

Declaring Custom Item Loader Processors

Now that we have a general understanding of Item Loaders, let us implement the above concepts in our example:

In the spider file ‘gfg_loadbookdata.py’, we use the ItemLoader class from the scrapy.loader module. The import statement is “from scrapy.loader import ItemLoader”.
In the parse method, which is the default callback method of the spider, we are already looping through all the books.
Inside the loop, create an object of the ItemLoader class with the following arguments:
- Pass an instance of the item class, GfgItemloadersItem(), as the item argument.
- Pass the loop variable ‘books’ as the selector argument.
- So the code will look like this: “loader = ItemLoader(item=GfgItemloadersItem(), selector=books)”
Use the Item Loader method add_xpath(), passing it the item field name and an XPath expression.
For the ‘price’ field, write its XPath in the add_xpath() method. The syntax will be “loader.add_xpath('price', './/*[@class="product_price"]/p/text()')”. Here, we navigate to the <p> tags inside the element with the class ‘product_price’ and fetch their text using text().
For the ‘title’ field, write its XPath expression in the add_xpath() method. The syntax will be “loader.add_xpath('title', './/h3/a/@title')”. Here, we fetch the value of the ‘title’ attribute of the <a> tag.
Finally, yield the populated item by calling the load_item() method of the loader.
Now, let us make changes in the ‘items.py’ file. Every Item field defined here can have an input processor and an output processor. When data is received, the input processor acts on each extracted value, as defined by its function, and the results are collected in an internal list. When load_item() is called, that list is passed to the output processor, and the result populates the field. Currently, price and title are defined as plain scrapy.Field().
For the book price values, we need to replace the ‘£’ sign with a blank. Here, we assign the built-in MapCompose() processor as the input_processor. Its first parameter is the remove_tags function, which removes any HTML tags present in the selected value. The second parameter is our custom function, remove_pound_sign(), which replaces the ‘£’ sign with a blank. The output_processor for the price field will be TakeFirst(), a built-in processor that returns the first non-null, non-empty value from the collected list. Hence, the syntax for the price Item field will be price = scrapy.Field(input_processor=MapCompose(remove_tags, remove_pound_sign), output_processor=TakeFirst()).
The functions used for price are remove_tags and remove_pound_sign. The remove_tags() function is imported from the w3lib.html module. It removes all the tags present in the scraped value. The remove_pound_sign() function is our custom function that accepts the ‘price’ value of every book, replaces the ‘£’ sign in it with a blank, and strips surrounding whitespace. Python's built-in string replace() method is used for the replacement.
Similarly, for the book title, we will replace ‘&’ with ‘AND’ by assigning appropriate input and output processors. The input_processor will be MapCompose(), whose first parameter is the remove_tags function, which removes all the tags, and whose second parameter is replace_and_sign(), our custom function that replaces ‘&’ with ‘AND’. The output_processor will be TakeFirst(), which returns the first non-null, non-empty value from the collected list. Hence, the book title field will be title = scrapy.Field(input_processor=MapCompose(remove_tags, replace_and_sign), output_processor=TakeFirst()).
The functions used for title are remove_tags and replace_and_sign. The remove_tags function is imported from the w3lib.html module. It removes all the tags present in the scraped value. The replace_and_sign() function is our custom function that accepts the title of every book and replaces each ‘&’ in it with ‘ AND ’ (padded with spaces). Python's built-in string replace() method is used for the replacement.
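To make the order of operations concrete, here is a minimal sketch of how an input processor and an output processor chain together. It uses only the standard library: map_compose_sketch, take_first_sketch and remove_tags_sketch are hypothetical, simplified stand-ins written for this illustration, not the real MapCompose, TakeFirst or remove_tags implementations, and the raw values are made-up examples.

```python
import re


def remove_tags_sketch(value):
    # Rough stand-in for w3lib.html.remove_tags (assumption: a simple
    # regex is enough for this illustration).
    return re.sub(r'<[^>]+>', '', value)


def remove_pound_sign(value):
    # Same custom function as in items.py
    return value.replace('£', '').strip()


def replace_and_sign(value):
    # Same custom function as in items.py
    return value.replace('&', ' AND ')


def map_compose_sketch(*functions):
    # Simplified idea behind MapCompose: apply each function, in order,
    # to every value in the list, dropping None results.
    def process(values):
        for func in functions:
            values = [r for r in (func(v) for v in values) if r is not None]
        return values
    return process


def take_first_sketch(values):
    # Simplified idea behind TakeFirst: return the first value that is
    # neither None nor an empty string.
    for value in values:
        if value is not None and value != '':
            return value
    return None


# Hypothetical raw values, as an XPath query might return them
raw_prices = ['<p>£51.77</p>', '\n    ']
raw_titles = ['Q&A Stories']

price_in = map_compose_sketch(remove_tags_sketch, remove_pound_sign)
title_in = map_compose_sketch(remove_tags_sketch, replace_and_sign)

# Input processor runs on every value, output processor picks one
price = take_first_sketch(price_in(raw_prices))
title = take_first_sketch(title_in(raw_titles))

print(price)
print(title)
```

The whitespace-only price entry becomes an empty string after stripping, which is why the output processor skips it and keeps the first real value.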

The final code for our ‘items.py’ file will look as shown below:

Python3

# Define here the models for your scraped items
 
# import Scrapy library
import scrapy
 
# import the built-in Item Loader processors
from itemloaders.processors import TakeFirst, MapCompose
 
# import remove_tags method to remove all tags present 
# in the response
from w3lib.html import remove_tags
 
# custom method to replace '&' with 'AND'
# in book title
def replace_and_sign(value):
     
    # python replace method to replace '&' operator
    # with 'AND'
    return value.replace('&', ' AND ')
 
# custom method to remove the pound currency sign from
# book price
def remove_pound_sign(value):
   
    # for pound press Alt + 0163
    # python replace method to replace '£' with a blank
    return value.replace('£', '').strip()
 
# Item class to define all the Item fields - book title
# and price
class GfgItemloadersItem(scrapy.Item):
   
    # Assign the input and output processor for book price field
    price = scrapy.Field(input_processor=MapCompose(
        remove_tags, remove_pound_sign), output_processor=TakeFirst())
     
    # Assign the input and output processor for book title field
    title = scrapy.Field(input_processor=MapCompose(
        remove_tags, replace_and_sign), output_processor=TakeFirst())

The final spider file code will look as follows:

Python3

# Import the required Scrapy library
import scrapy
 
# Import the Item Loader library
from scrapy.loader import ItemLoader
 
# Import the items class from 'items.py' file
from ..items import GfgItemloadersItem
 
# Spider class having Item loader
class GfgLoadbookdataSpider(scrapy.Spider):
    # Name of the spider
    name = 'gfg_loadbookdata'
     
    # The domain to be scraped (domain name only, without a path)
    allowed_domains = ['books.toscrape.com']
     
    # The webpage to be scraped
    start_urls = [
        'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/']
     
    # Default callback method used by the spider
    # Data in the response will be processed here
    def parse(self, response):
       
        # Loop through all the books using XPath expression
        for books in response.xpath('//*[@class="product_pod"]'):
 
            # Define Item Loader object,
            # by passing item and selector attribute
            loader = ItemLoader(item=GfgItemloadersItem(), selector=books)
             
            # Item loader method add_xpath() for price:
            # mention the field name and XPath expression
            loader.add_xpath('price', './/*[@class="product_price"]/p/text()')
 
            # Item loader method add_xpath(),
            # for title, mention the field name
            # and xpath expression
            loader.add_xpath('title', './/h3/a/@title')
 
            # use the load_item method of
            # loader to populate the parsed items
            yield loader.load_item()
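To see what the two expressions in the spider select before any processor runs, the following sketch approximates them with the standard library's xml.etree.ElementTree. This is an assumption-laden illustration: ElementTree supports only a subset of XPath, so text() and @title are replaced here by .text and .get('title'), and the HTML fragment is a hypothetical, simplified stand-in for one book entry, not markup copied from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like one book entry
SAMPLE = """
<ol>
  <li>
    <article class="product_pod">
      <h3><a href="#" title="Sample &amp; Example">Sample ...</a></h3>
      <div class="product_price">
        <p class="price_color">£12.34</p>
      </div>
    </article>
  </li>
</ol>
"""

root = ET.fromstring(SAMPLE)

rows = []
# Same idea as the loop over '//*[@class="product_pod"]' in the spider
for pod in root.findall('.//*[@class="product_pod"]'):
    # Counterpart of './/*[@class="product_price"]/p/text()'
    price = pod.find('.//*[@class="product_price"]/p').text
    # Counterpart of './/h3/a/@title'
    title = pod.find('.//h3/a').get('title')
    rows.append((price, title))

print(rows)
```

The raw values still contain the ‘£’ sign and the ‘&’ character; removing them is the job of the input processors declared in ‘items.py’.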

We can run the spider and save the data to a JSON file with the ‘scrapy crawl’ command, using the syntax scrapy crawl spider_name -o file_name:

scrapy crawl gfg_loadbookdata -o parsed_bookdata.json

The above command scrapes the data and parses it, which means the pound sign is removed from the prices and the ‘&’ character in the titles is replaced with ‘AND’. The parsed_bookdata.json file is created as follows:

The parsed JSON output file using Item Loaders
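As a quick sanity check on the exported file, a short script can confirm that no price still carries a ‘£’ sign and no title still contains ‘&’. The sketch below is an illustration only: find_unparsed is a hypothetical helper, and the two records are made-up sample data in the same shape as the export, not actual scraped results.

```python
import json

# Made-up records in the same shape as the exported items
SAMPLE_OUTPUT = '''[
  {"price": "12.34", "title": "Sample AND Example"},
  {"price": "56.78", "title": "Another Title"}
]'''


def find_unparsed(records):
    # Return the records whose price still has a pound sign
    # or whose title still has an ampersand.
    return [
        record for record in records
        if '£' in record.get('price', '') or '&' in record.get('title', '')
    ]


records = json.loads(SAMPLE_OUTPUT)
problems = find_unparsed(records)
print(len(records), 'records,', len(problems), 'unparsed')
```

To check a real run, the sample string could be swapped for the file contents, for example with json.load() on parsed_bookdata.json opened with encoding='utf-8'.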

Scrapy – Item Loaders

In this article, we are going to discuss Item Loaders in Scrapy.

Scrapy is used for extracting data using spiders that crawl through websites. The obtained data can also be processed in the form of Scrapy Items. Item Loaders play a significant role in parsing the data before the Item fields are populated.
