Link Extractors using Scrapy

Example 1:

Let us fetch all the links from the webpage https://quotes.toscrape.com/ and store the output in a JSON file named “quotes.json”:

Python3




# scrapy_link_extractor.py
import scrapy
from scrapy.linkextractors import LinkExtractor
  
  
class QuoteSpider(scrapy.Spider):
    name = "QuoteSpider"
    start_urls = ["https://quotes.toscrape.com/"]
  
    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)
  
        for link in links:
            yield {"url": link.url, "text": link.text}


To run the above code, use the following command in the terminal:

scrapy runspider scrapy_link_extractor.py -o quotes.json

Output:

[Image: output of Scrapy link extractor example 1]
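
Once the run finishes, “quotes.json” should hold a JSON array of objects with the “url” and “text” keys yielded by parse(). The following is a minimal sketch for inspecting that file with the standard library only. The helper names load_links and unique_urls are hypothetical, and the script assumes the file sits in the current directory:

```python
import json
from pathlib import Path


def load_links(path):
    """Return the list of link records stored in a Scrapy JSON feed file."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def unique_urls(records):
    """Collect URLs in first-seen order, skipping duplicates."""
    seen = set()
    urls = []
    for record in records:
        url = record["url"]
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    feed = Path("quotes.json")  # assumed location of the spider's output
    if feed.exists():
        for url in unique_urls(load_links(feed)):
            print(url)
```

Running it after the spider prints one URL per line, which is a quick way to check that the extraction worked.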

Example 2:

This time, let us fetch all the links from the page https://www.w3wiki.org/email-id-extractor-project-from-sites-in-scrapy-python/.

Here we create the instance of the “LinkExtractor” class in the constructor of our spider and also yield the “nofollow” attribute of each link object. We also set the “unique” parameter of “LinkExtractor” to “True” so that each link is returned only once. “True” is already the default value, so passing it here simply makes the behavior explicit.

Python3




import scrapy
from scrapy.linkextractors import LinkExtractor
  
  
class w3wikiSpider(scrapy.Spider):
    name = "w3wikiSpider"
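    # Adjacent string literals are joined, so the URL contains no stray
    # whitespace (a backslash continuation inside the string would add some).
    start_urls = [
        "https://www.w3wiki.org/"
        "email-id-extractor-project-from-sites-in-scrapy-python/"
    ]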
  
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
  
        self.link_extractor = LinkExtractor(unique=True)
  
    def parse(self, response):
        links = self.link_extractor.extract_links(response)
  
        for link in links:
            yield {"nofollow": link.nofollow, "url": link.url, "text": link.text}


To run the above code, use the following command in the terminal:

scrapy runspider scrapy_link_extractor.py -o w3wiki.json

Output:

[Image: output of Scrapy link extractor example 2]



