Link Extractors using Scrapy

Example 1:

Let us fetch all the links from the webpage https://quotes.toscrape.com/ and store the output in a JSON file named “quotes.json”:

Python3

# scrapy_link_extractor.py
import scrapy
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(scrapy.Spider):
    name = "QuoteSpider"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # With no arguments, the extractor uses its default settings.
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        # Each Link object exposes the absolute URL and the anchor text.
        for link in links:
            yield {"url": link.url, "text": link.text}

To run the spider, use the following command:

scrapy runspider scrapy_link_extractor.py -o quotes.json

Output:

[Image: output of Scrapy link extractor example 1]
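Once the spider has run, the JSON feed can be inspected with plain Python. The sketch below is only one possible way to do it: the helper name count_links_by_host and the sample records are made up for illustration (they merely mirror the "url"/"text" shape yielded by the spider), and in practice the list would be loaded from quotes.json instead.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def count_links_by_host(items):
    """Count scraped links per host name (hypothetical helper)."""
    return Counter(urlparse(item["url"]).netloc for item in items)


# Made-up sample in the same shape as the spider's output.
# In practice, load the real feed instead:
#     with open("quotes.json", encoding="utf-8") as f:
#         items = json.load(f)
sample = json.loads("""
[
  {"url": "https://quotes.toscrape.com/login", "text": "Login"},
  {"url": "https://quotes.toscrape.com/page/2/", "text": "Next"},
  {"url": "https://www.goodreads.com/quotes", "text": "GoodReads.com"}
]
""")

counts = count_links_by_host(sample)
for host, total in counts.most_common():
    print(host, total)
```

Note that in recent Scrapy releases -o appends to an existing file, which can leave a JSON feed invalid after a second run; -O overwrites the file instead.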

Example 2:

This time, let us fetch all the links from the page https://www.w3wiki.org/email-id-extractor-project-from-sites-in-scrapy-python/.

Here we create the “LinkExtractor” instance in the constructor of our spider and also yield the “nofollow” attribute of each link object. We also pass “unique=True” to “LinkExtractor” so that duplicate links are returned only once (this is also the default in current Scrapy releases, so passing it simply makes the intent explicit).

Python3

import scrapy
from scrapy.linkextractors import LinkExtractor


class W3wikiSpider(scrapy.Spider):
    name = "w3wikiSpider"
    start_urls = [
        "https://www.w3wiki.org/email-id-extractor-project-from-sites-in-scrapy-python/"
    ]

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

        # Build the extractor once and reuse it for every response.
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        links = self.link_extractor.extract_links(response)

        # "nofollow" is True when the anchor carries rel="nofollow".
        for link in links:
            yield {"nofollow": link.nofollow, "url": link.url, "text": link.text}

To run the spider, use the following command in the terminal:

scrapy runspider scrapy_link_extractor.py -o w3wiki.json

Output:

[Image: output of Scrapy link extractor example 2]

Scrapy – Link Extractors

In this article, we are going to learn about link extractors in Scrapy. “LinkExtractor” is a class provided by Scrapy to extract links from the response received when fetching a web page. Link extractors are easy to use, as the examples in this article show.
