Implementation of pagination by XPath in Python and Selenium
We are scraping article links and titles from the w3wiki website and applying pagination. As a result, we’ll have a set of article links and titles.
Step 1: Firstly we will import all the required modules used for pagination.
Python
# Importing required modules
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
from bs4 import BeautifulSoup
Let us understand the usage of imported files and how they are useful in our problem statement.
webdriver: Selenium WebDriver is a tool used to automate testing of web-based applications and verify that they perform as expected. It enables the execution of cross-browser tests.
By: Using the By class, we can locate elements within a document. It provides different strategies to find an element in the document via the parser: CLASS_NAME, CSS_SELECTOR, ID, LINK_TEXT, NAME, PARTIAL_LINK_TEXT, TAG_NAME, and XPATH.
Keys: Using the Keys class imported from selenium.webdriver.common.keys, we can send special keys, similar to how we enter keys with our keyboards.
sleep: Using the Python time sleep function, we can pause the execution of a program for a specified time in seconds.
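As a quick illustration of sleep, which is what lets us space out page loads so the site is not hammered with requests, here is a minimal, Selenium-independent sketch:

```python
from time import sleep, monotonic

# Pause execution for roughly 0.1 seconds and measure the delay.
start = monotonic()
sleep(0.1)
elapsed = monotonic() - start
print(round(elapsed, 1))
```

In the scraper below, a call like this can be placed between successive driver.get() calls.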
Step 2: Initialise the driver and access the website where you need to apply paging or pagination:
Python
# Download the Chrome driver and enter its path inside the string.
PATH = ""
driver = webdriver.Chrome(PATH)
driver.get("https://www.w3wiki.org/category/programming-language/python/")
Step 3: The XML Path Language is also known as XPath. With path expressions, it is possible to select nodes or node sets in XML documents. The expression looks like a file-system path in Windows. To copy the XPath, inspect the webpage, select the link, then right-click on the tag and choose Copy XPath, as shown in the image below.
The copied XPath may be based on an id; we can replace it with the class instead and enter the relevant class that is available.
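To make the XPath idea concrete, here is a small sketch using Python’s built-in xml.etree.ElementTree, which supports only a limited XPath subset (Selenium supports full XPath 1.0). The markup is a hypothetical fragment shaped like the article list we target later, not the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the nested article-list markup.
html = """
<div class="articles-list">
  <div><div><div><a href="/a1" title="Article 1">Article 1</a></div></div></div>
  <div><div><div><a href="/a2" title="Article 2">Article 2</a></div></div></div>
</div>
"""
root = ET.fromstring(html)

# ".//a" selects every anchor below the container, analogous to the
# trailing "/a" step in the Selenium XPath used later.
links = root.findall(".//a")
for a in links:
    print(a.get("href"), a.get("title"))
```

The same idea, with the full expression `//*[@class="articles-list"]/div/div/div/a`, is what we hand to Selenium in Step 5.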
Step 4: Finding the correct XPath is an important task. If we have to iterate over different pages, we have to check manually which div changes between them. If we want to scrape a single element, we can directly copy and paste the path.
Python
# Add the path of the Chrome driver,
# replacing it with the string below.
PATH = "C:/Users/dvgrg/OneDrive/Desktop/dexterio internship/workflow/Current/version 1.0/fabric data/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.w3wiki.org/category/programming-language/python/")
Step 5: In this step, we first initialize the variable page with 1 and declare two lists to store links and titles; these two lists are stored in a dictionary named “result”. We run a while loop to fetch each page by URL, and inside it a for loop driven by the XPath, appending links and titles to their respective lists and storing them in the result dictionary. The loop terminates after it has iterated up to the 9th page.
Python
page = 1

# we'll append links in the list called link
link = []

# we'll append titles in the list called title
title = []

# we'll append results into the dictionary
result = {"link": [], "title": []}

while page:
    print(page)
    url = "https://www.w3wiki.org/category/programming-language/python/page/{}".format(page)
    driver.get(url)
    for element in driver.find_elements(By.XPATH, '//*[@class="articles-list"]/div/div/div/a'):
        # for the link we are using href
        li = element.get_attribute('href')
        # for the title we are using title
        ti = element.get_attribute('title')
        link.append(li)
        title.append(ti)
        # we can try printing link and title
        print(link, title)
        result['link'].append(li)
        result['title'].append(ti)
    page += 1
    # Give any limit where you need to stop the loop
    if page == 9:
        break
Complete code:
Python3
# Importing required modules
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
from bs4 import BeautifulSoup

# Add the path of your Chrome driver here.
PATH = "C:/Users/dvgrg/OneDrive/Desktop/dexterio internship/workflow/Current/version 1.0/fabric data/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.w3wiki.org/category/programming-language/python/")

page = 1
# we'll append links in the list called link
link = []
# we'll append titles in the list called title
title = []
# we'll append results into the dictionary
result = {"link": [], "title": []}

while page:
    print(page)
    url = "https://www.w3wiki.org/category/programming-language/python/page/{}".format(page)
    driver.get(url)
    for element in driver.find_elements(By.XPATH, '//*[@class="articles-list"]/div/div/div/a'):
        # for the link we are using href
        li = element.get_attribute('href')
        # for the title we are using title
        ti = element.get_attribute('title')
        link.append(li)
        title.append(ti)
        # we can try printing link and title
        print(link, title)
        result['link'].append(li)
        result['title'].append(ti)
    page += 1
    # Give any limit where you need to stop the loop
    if page == 9:
        break
Output:
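Once the run finishes, the result dictionary holds two parallel lists. A small post-processing sketch (with hypothetical sample values standing in for real scraped data) shows how the links and titles can be paired and de-duplicated:

```python
# Hypothetical sample mimicking the scraper's result dictionary.
result = {
    "link": ["/a1", "/a2", "/a1"],
    "title": ["Article 1", "Article 2", "Article 1"],
}

# Pair each link with its title and drop duplicates, preserving order.
seen = set()
articles = []
for li, ti in zip(result["link"], result["title"]):
    if li not in seen:
        seen.add(li)
        articles.append((li, ti))
print(articles)  # → [('/a1', 'Article 1'), ('/a2', 'Article 2')]
```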
Pagination – XPath for a crawler in Python
In this article, we are going to learn about pagination using XPath for a crawler in Python.
This article is about learning how to extract information from websites where the information is spread across multiple pages. To move through all the pages via API calls, we use the concept of paging, which helps us navigate through them.
We’ll use Selenium and BeautifulSoup for this. Selenium is a free, open-source framework used for testing; it consists of Selenium WebDriver (which we will use) and Selenium Grid. A Selenium script can be written in many languages, such as Java, Python, C#, etc. BeautifulSoup (a Python library) is used for extracting data from HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. Execute the commands below to install Selenium and BeautifulSoup in your Python environment.
pip install selenium
pip install beautifulsoup4
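To show what “navigating the parse tree” means without any third-party dependency, here is a minimal sketch using Python’s built-in html.parser instead of BeautifulSoup; it collects anchor attributes from a made-up HTML fragment, the same kind of extraction we do later with Selenium:

```python
from html.parser import HTMLParser

# Collect the attributes of every <a> tag encountered while parsing.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))

parser = LinkCollector()
parser.feed('<div><a href="/post-1" title="Post 1">Post 1</a></div>')
print(parser.links)  # → [{'href': '/post-1', 'title': 'Post 1'}]
```

BeautifulSoup wraps this kind of parsing in a far more convenient API (e.g. searching by tag name or class), which is why we install it.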
Prerequisite:
- Selenium
- beautifulSoup
- Basics of Python
- Chrome driver(Download chrome driver for selenium)
What is pagination?
Pagination is the process of dividing a document into separate pages, i.e., a number of different pages that each hold data. These pages usually have their own URLs, typically with only minor differences between them. So we have to visit these URLs one by one and scrape the data each page contains, up to the last page. There are generally two cases where pagination is needed: first, where pages have a next-page button, and second, where there is no next button but an infinite scroll that loads new content.
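For the first case, where successive pages share a URL pattern, the page URLs can be generated directly; a minimal sketch, assuming the w3wiki category URL pattern used later in this article:

```python
# Successive pages differ only in the trailing page number,
# so the URLs can be generated with a format string.
base = "https://www.w3wiki.org/category/programming-language/python/page/{}"
urls = [base.format(page) for page in range(1, 4)]
print(urls)
```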
So now let’s see with the help of an example, how to implement pagination by XPath in Python.