Reusing and Extending Item Loaders
Maintenance becomes difficult as a project grows and the number of spiders written for data scraping increases. The parsing rules may also differ from one spider to the next. To simplify the maintenance of parsing rules, Item Loaders support regular Python inheritance, which lets you handle the differences within a group of spiders. Let us look at an example where extending a loader is beneficial.
Suppose an eCommerce book website prefixes its book author names with an asterisk (“*”). To remove the “*” from the final scraped author names, we can reuse and extend the existing loader class ‘BookLoader’ as follows:
Python3
# Import the MapCompose built-in processor
from itemloaders.processors import MapCompose

# Import the existing BookLoader, the Item Loader used for scraping book data
from myproject.ItemLoaders import BookLoader


# Custom function to remove the '*'
def strip_asterisk(x):
    return x.strip('*')


# Extend and reuse the existing BookLoader class
class SiteSpecificLoader(BookLoader):
    # Input processors are declared as <field>_in, so this assumes that
    # BookLoader defines authorname_in. Asterisks are stripped first, then
    # the parent's input processor for the same field is applied.
    authorname_in = MapCompose(strip_asterisk, BookLoader.authorname_in)
In the above code, BookLoader is the parent class of SiteSpecificLoader. By reusing the existing loader, we add only the “*”-stripping functionality in the new loader class.
Scrapy – Item Loaders
In this article, we are going to discuss Item Loaders in Scrapy.
Scrapy is used for extracting data using spiders that crawl through a website. The obtained data can then be processed in the form of Scrapy Items. Item Loaders play a significant role in parsing the data before populating the Item fields.