Link Extractors using Scrapy
Example 1:
Let us fetch all the links from the webpage https://quotes.toscrape.com/ and store the output in a JSON file named “quotes.json”:
Python3
# scrapy_link_extractor.py
import scrapy
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(scrapy.Spider):
    name = "QuoteSpider"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract every link on the page with the default settings.
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        for link in links:
            yield {"url": link.url, "text": link.text}
To run the above code, use the following command:
scrapy runspider scrapy_link_extractor.py -o quotes.json
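Output: “quotes.json” contains a JSON array with one object per extracted link, each holding the link's “url” and “text”.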
Example 2:
This time, let us fetch all the links from the page https://www.w3wiki.org/email-id-extractor-project-from-sites-in-scrapy-python/ .
Let us create the instance of the “LinkExtractor” class in the constructor of our spider and also yield the “nofollow” attribute of each link object. Let us also set the “unique” parameter of “LinkExtractor” to “True” (which is also its default value) so that duplicate links are filtered out.
Python3
import scrapy
from scrapy.linkextractors import LinkExtractor


class w3wikiSpider(scrapy.Spider):
    name = "w3wikiSpider"
    start_urls = [
        "https://www.w3wiki.org/email-id-extractor-project-from-sites-in-scrapy-python/"
    ]

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        # Create the extractor once and reuse it for every response.
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        links = self.link_extractor.extract_links(response)

        for link in links:
            yield {"nofollow": link.nofollow, "url": link.url, "text": link.text}
To run the above code, save it as scrapy_link_extractor.py and run the following command in the terminal:
scrapy runspider scrapy_link_extractor.py -o w3wiki.json
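Output: “w3wiki.json” contains a JSON array with one object per unique link, each holding the keys “nofollow”, “url” and “text”.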
Scrapy – Link Extractors
In this article, we are going to learn about link extractors in Scrapy. “LinkExtractor” is a class provided by Scrapy to extract links from the response we get while fetching a web page. Link extractors are easy to use, as the examples in this article show.