Data Extraction Using Scrapy Items
We will scrape the book title and book price from the Women’s Fiction webpage. Scrapy allows the use of selectors to write the extraction code; they can be written using CSS or XPath expressions, which traverse the HTML page to reach the desired data. The main objective of scraping is to obtain structured data from unstructured sources. Usually, Scrapy spiders yield data as Python dictionary objects. This approach works well for a small amount of data, but as the data grows, so does the complexity. It may also be desirable to process the data before storing it in any file format. This is where Scrapy Items come in handy: they allow the data to be processed using Item Loaders. Let us write a Scrapy Item for the book title and price, along with the XPath expressions for both.
In the ‘items.py’ file, mention the attributes we need to scrape. We define them as follows:
Python3
# Define here the models for your scraped item
import scrapy


# Item class name for the book title and price
class GfgItemloadersItem(scrapy.Item):
    # Scrape Book price
    price = scrapy.Field()
    # Scrape Book Title
    title = scrapy.Field()
- Please note that Field() provides a way to define all field metadata in one location. It does not add any extra attributes.
- XPath expressions allow us to traverse the webpage and extract the data. Right-click on one of the books and select the ‘Inspect’ option. This shows its HTML attributes in the browser. All the books on the webpage are contained in the same <article> HTML tag, with the class attribute ‘product_pod’. It can be seen as below –
- Hence, we can iterate through the <article> tags with the class attribute ‘product_pod’ to extract all book titles and prices on the webpage. The XPath expression for this is books = response.xpath('//*[@class="product_pod"]'). It returns all the book HTML tags belonging to the class “product_pod”. The ‘*’ operator matches any tag with that class. We can now have a loop that navigates to each and every book on the page.
- Inside the loop, we need to get the book title, so right-click on a title and choose ‘Inspect’. It is contained in an <a> tag inside the <h3> header tag. We will fetch the “title” attribute of the <a> tag. The XPath expression for this is books.xpath('.//h3/a/@title').extract(). The leading dot indicates that we are extracting relative to the ‘books’ object; the expression traverses through the header and then the <a> tag to get the title of the book.
- Similarly, to get the price of the book, right-click on it and choose ‘Inspect’ to see its HTML attributes. All the price elements belong to the <div> tag with the class attribute “product_price”. The actual price is mentioned inside a paragraph tag within that <div> element. Hence, the XPath expression to get the text of the price is books.xpath('.//*[@class="product_price"]/p/text()').extract_first(). The extract_first() method returns the first matching price value.
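The XPath logic above can be tried out offline on a small inline HTML fragment. The snippet below uses Python’s standard-library ElementTree, whose XPath support is only a subset of Scrapy’s, so attribute values and text are read with .get() and .text instead of @title and text(). The fragment is a made-up stand-in that mirrors the structure of the page, not actual site data.

```python
# Sketch of the XPath traversal using only the standard library.
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure of the books page
html = """
<div>
  <article class="product_pod">
    <h3><a title="Example Book One">Example Bo...</a></h3>
    <div class="product_price"><p>£10.00</p></div>
  </article>
  <article class="product_pod">
    <h3><a title="Example Book Two">Example Bo...</a></h3>
    <div class="product_price"><p>£12.50</p></div>
  </article>
</div>
"""

root = ET.fromstring(html)
results = []
# Equivalent of response.xpath('//*[@class="product_pod"]')
for book in root.iterfind('.//article[@class="product_pod"]'):
    # Equivalent of books.xpath('.//h3/a/@title').extract_first()
    title = book.find('.//h3/a').get('title')
    # Equivalent of books.xpath('.//*[@class="product_price"]/p/text()').extract_first()
    price = book.find('.//div[@class="product_price"]/p').text
    results.append({'title': title, 'price': price})

print(results)
```

Note how the relative path (the leading dot) scopes each lookup to the current book element, exactly as the dot operator does in the Scrapy expressions.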
We will create an object of the above Item class in the spider and yield it. The spider code file will look as follows:
Python3
# Import Scrapy library
import scrapy

# Import Item class
from ..items import GfgItemloadersItem


# Spider class name
class GfgLoadbookdataSpider(scrapy.Spider):
    # Name of the spider
    name = 'gfg_loadbookdata'

    # The domain to be scraped
    allowed_domains = [
        'books.toscrape.com/catalogue/category/books/womens-fiction_9'
    ]

    # The URL to be scraped
    start_urls = [
        'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/'
    ]

    # Default parse callback method
    def parse(self, response):
        # Create an object of Item class
        item = GfgItemloadersItem()

        # Loop through all books
        for books in response.xpath('//*[@class="product_pod"]'):
            # XPath expression for the book price
            price = books.xpath(
                './/*[@class="product_price"]/p/text()').extract_first()
            # Place price value in item key
            item['price'] = price

            # XPath expression for the book title (the 'title' attribute
            # of the <a> tag holds the full title)
            title = books.xpath('.//h3/a/@title').extract()
            # Place title value in item key
            item['title'] = title

            # Yield the item
            yield item
- We execute the above code using the scrapy “crawl” command, whose syntax is scrapy crawl spider_name, at the terminal –
scrapy crawl gfg_loadbookdata -o not_parsed_data.json
The data is exported to the “not_parsed_data.json” file, which can be seen as below:
Now, suppose we want to process the scraped data before yielding and storing it in any file format; then we can use Item Loaders.
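Before moving on, the idea behind such processing can be sketched in plain Python. This is only a conceptual illustration, not Scrapy’s actual Item Loader API: each field gets a pipeline of processor functions that clean the raw scraped value before it is stored, and all names and values below are hypothetical examples.

```python
# Conceptual sketch of per-field processing, NOT Scrapy's ItemLoader API.

def strip_currency(value):
    # Hypothetical processor: drop a leading currency symbol
    return value.lstrip('£$')

def to_float(value):
    # Hypothetical processor: convert the cleaned price to a number
    return float(value)

# Per-field processor pipelines (assumed example configuration)
processors = {
    'price': [strip_currency, to_float],
    'title': [str.strip],
}

def load_item(raw):
    # Apply each field's processors to its raw value, in order
    item = {}
    for field, value in raw.items():
        for processor in processors.get(field, []):
            value = processor(value)
        item[field] = value
    return item

print(load_item({'price': '£10.00', 'title': '  Example Title  '}))
```

Scrapy’s Item Loaders implement this pattern with input and output processors attached to each Item field, which is what the next section covers.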
Scrapy – Item Loaders
In this section, we are going to discuss Item Loaders in Scrapy. Scrapy extracts data using spiders that crawl through the website, and the obtained data can be structured as Scrapy Items. Item Loaders play a significant role in parsing the data before populating the Item fields.