Implementation of pagination by XPath in Python and Selenium
We are scraping article links and titles from the w3wiki website and applying pagination. As a result, we’ll have a set of article links and titles.
Step 1: Firstly we will import all the required modules used for pagination.
Python
# Importing required modules
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
from bs4 import BeautifulSoup
Let us understand the usage of imported files and how they are useful in our problem statement.
webdriver: Selenium WebDriver is a tool used to automate testing of web-based applications and verify that they perform as expected. It enables the execution of cross-browser tests.
By: Using the By class, we can locate elements within a document. It provides different strategies to find an element in the document via the parser: CLASS_NAME, CSS_SELECTOR, ID, LINK_TEXT, NAME, PARTIAL_LINK_TEXT, TAG_NAME, and XPATH.
Keys: Using the Keys class imported from selenium.webdriver.common.keys, we can send special keys, similar to how we enter keys with our keyboards.
sleep: Using the Python time sleep function, we can pause the execution of a program for a specified time in seconds.
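As a quick illustration of sleep, which is what lets us space out page loads so the site is not hammered with requests, here is a minimal, Selenium-independent sketch:

```python
from time import sleep, monotonic

# Pause execution for roughly 0.1 seconds and measure the delay.
start = monotonic()
sleep(0.1)
elapsed = monotonic() - start
print(round(elapsed, 1))
```

In the scraper below, a call like this can be placed between successive driver.get() calls.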
Step 2: Initialise the driver and access the website where you need to apply paging or pagination:
Python
# Download the Chrome driver and enter its path inside the string.
PATH = ""
driver = webdriver.Chrome(PATH)
driver.get("https://www.w3wiki.org/category/programming-language/python/")
Step 3: The XML Path Language is also known as XPath. With path expressions, it is possible to select nodes or node sets in XML documents. The expression looks like a file-system path in Windows. To copy the XPath, inspect the webpage, select the link, then right-click on the tag and choose Copy XPath, as shown in the image below.
The copied XPath may be based on an id; we can replace it with the class instead and enter the relevant class that is available.
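To make the XPath idea concrete, here is a small sketch using Python’s built-in xml.etree.ElementTree, which supports only a limited XPath subset (Selenium supports full XPath 1.0). The markup is a hypothetical fragment shaped like the article list we target later, not the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the nested article-list markup.
html = """
<div class="articles-list">
  <div><div><div><a href="/a1" title="Article 1">Article 1</a></div></div></div>
  <div><div><div><a href="/a2" title="Article 2">Article 2</a></div></div></div>
</div>
"""
root = ET.fromstring(html)

# ".//a" selects every anchor below the container, analogous to the
# trailing "/a" step in the Selenium XPath used later.
links = root.findall(".//a")
for a in links:
    print(a.get("href"), a.get("title"))
```

The same idea, with the full expression `//*[@class="articles-list"]/div/div/div/a`, is what we hand to Selenium in Step 5.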
Step 4: Finding the correct XPath is an important task. If we have to iterate over different pages, we have to check manually which div changes between them. If we want to scrape a single element, we can directly copy and paste the path.
Python
# Add the path of the Chrome driver,
# replacing it with the string below.
PATH = "C:/Users/dvgrg/OneDrive/Desktop/dexterio internship/workflow/Current/version 1.0/fabric data/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.w3wiki.org/category/programming-language/python/")
Step 5: In this step, we first initialize the variable page with 1 and declare two lists to store links and titles; these two lists are stored in a dictionary named “result”. We run a while loop to fetch each page by URL, and inside it a for loop driven by the XPath, appending links and titles to their respective lists and storing them in the result dictionary. The loop terminates after it has iterated up to the 9th page.
Python
page = 1

# we'll append links in the list called link
link = []

# we'll append titles in the list called title
title = []

# we'll append results into the dictionary
result = {"link": [], "title": []}

while page:
    print(page)
    url = "https://www.w3wiki.org/category/programming-language/python/page/{}".format(page)
    driver.get(url)
    for element in driver.find_elements(By.XPATH, '//*[@class="articles-list"]/div/div/div/a'):
        # for the link we are using href
        li = element.get_attribute('href')
        # for the title we are using title
        ti = element.get_attribute('title')
        link.append(li)
        title.append(ti)
        # we can try printing link and title
        print(link, title)
        result['link'].append(li)
        result['title'].append(ti)
    page += 1
    # Give any limit where you need to stop the loop
    if page == 9:
        break
Complete code:
Python3
# Importing required modules
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
from bs4 import BeautifulSoup

# Add the path of your Chrome driver here.
PATH = "C:/Users/dvgrg/OneDrive/Desktop/dexterio internship/workflow/Current/version 1.0/fabric data/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.w3wiki.org/category/programming-language/python/")

page = 1
# we'll append links in the list called link
link = []
# we'll append titles in the list called title
title = []
# we'll append results into the dictionary
result = {"link": [], "title": []}

while page:
    print(page)
    url = "https://www.w3wiki.org/category/programming-language/python/page/{}".format(page)
    driver.get(url)
    for element in driver.find_elements(By.XPATH, '//*[@class="articles-list"]/div/div/div/a'):
        # for the link we are using href
        li = element.get_attribute('href')
        # for the title we are using title
        ti = element.get_attribute('title')
        link.append(li)
        title.append(ti)
        # we can try printing link and title
        print(link, title)
        result['link'].append(li)
        result['title'].append(ti)
    page += 1
    # Give any limit where you need to stop the loop
    if page == 9:
        break
Output:
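Once the run finishes, the result dictionary holds two parallel lists. A small post-processing sketch (with hypothetical sample values standing in for real scraped data) shows how the links and titles can be paired and de-duplicated:

```python
# Hypothetical sample mimicking the scraper's result dictionary.
result = {
    "link": ["/a1", "/a2", "/a1"],
    "title": ["Article 1", "Article 2", "Article 1"],
}

# Pair each link with its title and drop duplicates, preserving order.
seen = set()
articles = []
for li, ti in zip(result["link"], result["title"]):
    if li not in seen:
        seen.add(li)
        articles.append((li, ti))
print(articles)  # → [('/a1', 'Article 1'), ('/a2', 'Article 2')]
```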
Pagination – XPath for a crawler in Python
In this article, we are going to learn about pagination using XPath for a crawler in Python.
This article is about learning how to extract information from websites where the information is spread across multiple pages. To move through all the pages via API calls, we use the concept of paging, which helps us navigate through them.
We’ll use Selenium and BeautifulSoup for this. Selenium is a free, open-source framework used for testing; it consists of Selenium WebDriver (which we will use) and Selenium Grid. A Selenium script can be written in many languages, such as Java, Python, C#, etc. BeautifulSoup (a Python library) is used for extracting data from HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. Execute the commands below to install Selenium and BeautifulSoup in your Python environment.
pip install selenium
pip install beautifulsoup4
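To show what “navigating the parse tree” means without any third-party dependency, here is a minimal sketch using Python’s built-in html.parser instead of BeautifulSoup; it collects anchor attributes from a made-up HTML fragment, the same kind of extraction we do later with Selenium:

```python
from html.parser import HTMLParser

# Collect the attributes of every <a> tag encountered while parsing.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))

parser = LinkCollector()
parser.feed('<div><a href="/post-1" title="Post 1">Post 1</a></div>')
print(parser.links)  # → [{'href': '/post-1', 'title': 'Post 1'}]
```

BeautifulSoup wraps this kind of parsing in a far more convenient API (e.g. searching by tag name or class), which is why we install it.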
Prerequisite:
- Selenium
- beautifulSoup
- Basics of Python
- Chrome driver(Download chrome driver for selenium)
What is pagination?
Pagination is the process of dividing a document into separate pages, i.e., a number of different pages that each hold data. These pages usually have their own URLs, typically with only minor differences between them. So we have to visit these URLs one by one and scrape the data each page contains, up to the last page. There are generally two cases where pagination is needed: first, where pages have a next-page button, and second, where there is no next button but an infinite scroll that loads new content.
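For the first case, where successive pages share a URL pattern, the page URLs can be generated directly; a minimal sketch, assuming the w3wiki category URL pattern used later in this article:

```python
# Successive pages differ only in the trailing page number,
# so the URLs can be generated with a format string.
base = "https://www.w3wiki.org/category/programming-language/python/page/{}"
urls = [base.format(page) for page in range(1, 4)]
print(urls)
```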
So now let’s see with the help of an example, how to implement pagination by XPath in Python.