How to use Selenium In Python


To automate a browser with Selenium, we first need to install ChromeDriver. Our task is to create a bot that continuously scrapes the Google News website and displays all the headlines every 10 minutes.

Stepwise implementation:

Step 1: First we will import some required modules.

Python3

# These are the imports to be made 
import time 
from selenium import webdriver 
from datetime import datetime 

Step 2: The next step is to open the required website.

Python3

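# Service wraps the chromedriver path (Selenium 4 style)
from selenium.webdriver.chrome.service import Service

# path of the chromedriver we have just downloaded
PATH = r"D:\chromedriver"
driver = webdriver.Chrome(service=Service(PATH))  # to open the browser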
  
# url of google news website 
url = 'https://news.google.com/topstories?hl=en-IN&gl=IN&ceid=IN:en'
  
# to open the url in the browser 
driver.get(url)   


Step 3: Extract the news title from the webpage. To extract a specific part of the page, we need its XPath, which can be found by right-clicking the required element and selecting Inspect from the context menu.

After clicking Inspect, the browser's developer tools window appears. From there, copy the element's full XPath.

Note: Inspecting does not always land on the exact element you want (it depends on the structure of the website), so you may have to browse the HTML for a while to find it. Once you have it, copy that path and paste it into your code. After running the lines below, the title of the first headline is printed on your terminal.

Python3

from selenium.webdriver.common.by import By

# XPath you just copied (split across two adjacent string literals)
news_path = ('/html/body/c-wiz/div/div[2]/div[2]/'
             'div/main/c-wiz/div[1]/div[3]/div/div/article/h3/a')

# to get that element
link = driver.find_element(By.XPATH, news_path)

# to read the text from that element
print(link.text)

Output:

‘Attack on Afghan territory’: Taliban on US airstrike that killed 2 ISIS-K men
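To see how the bracketed indices in a full XPath pick out one element among siblings, here is a minimal sketch. It does not use Selenium: it assumes a small, made-up page and the limited XPath support in Python's standard `xml.etree.ElementTree`, so the paths are written relative to the root `<html>` element rather than starting with `/html`.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div><p>navigation</p></div>
    <div>
      <article><h3><a href="#one">First headline</a></h3></article>
    </div>
    <div>
      <article><h3><a href="#two">Second headline</a></h3></article>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# div[2] means "the second <div> child of <body>"; indices start at 1.
first = root.find("body/div[2]/article/h3/a")
second = root.find("body/div[3]/article/h3/a")

print(first.text)
print(second.text)
```

Changing only the index in one step of the path moves the selection from one headline to the next.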

Step 4: Now, the target is to get the XPaths of all the headlines present.

One way is to copy the XPath of every headline (Google News shows about 6 headlines at a time) and fetch each of them, but that approach does not scale when there is a large number of items to scrape. A more elegant way is to find the pattern in the XPaths of the titles, which makes the task easier and more efficient. Below are the XPaths of all the headlines on the website; let's figure out the pattern.

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[3]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[4]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[5]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[6]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[7]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[8]/div/div/article/h3/a
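One way to check the pattern without reading the paths by eye is to split each XPath on `/` and look for the steps that differ. The sketch below is only an illustration; the helper name `varying_positions` is hypothetical and not part of Selenium.

```python
# The six XPaths listed above, copied verbatim.
xpaths = [
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[3]/div/div/article/h3/a",
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[4]/div/div/article/h3/a",
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[5]/div/div/article/h3/a",
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[6]/div/div/article/h3/a",
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[7]/div/div/article/h3/a",
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[8]/div/div/article/h3/a",
]


def varying_positions(paths):
    """Return the indices of the '/'-separated steps that differ between paths."""
    split_paths = [p.split("/") for p in paths]
    # zip(*split_paths) groups the i-th step of every path together.
    return [i for i, steps in enumerate(zip(*split_paths)) if len(set(steps)) > 1]


positions = varying_positions(xpaths)
changing_steps = [p.split("/")[positions[0]] for p in xpaths]

print(positions)
print(changing_steps)
```

Exactly one step differs between the paths, and its index runs from 3 to 8.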

Comparing these XPaths, only one step changes: the index of the div that follows div[1], which runs from 3 to 8. Based on this pattern, we can generate the XPaths of all the headlines and fetch each title through its XPath. The following code extracts them all:

Python3

from selenium.webdriver.common.by import By

# an f-string fills in the changing div index;
# adjacent string literals are joined without stray whitespace
c = 1
for x in range(3, 9):
    print(f"Heading {c}: ")
    c += 1
    curr_path = ('/html/body/c-wiz/div/div[2]/div[2]/div/main'
                 f'/c-wiz/div[1]/div[{x}]/div/div/article/h3/a')
    title = driver.find_element(By.XPATH, curr_path)
    print(title.text)


Now the code is almost complete. The last requirement is that it fetches the headlines every 10 minutes, so we run a while loop that sleeps for 10 minutes after printing all the headlines.
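The scrape-then-sleep pattern can be sketched on its own, independent of Selenium. In this hedged example the helper `run_periodically` and its `rounds` and `sleep` parameters are hypothetical names added for illustration: `rounds` bounds the loop so the demo terminates, and the injectable `sleep` lets it run without actually waiting 10 minutes.

```python
import time


def run_periodically(task, interval_seconds, rounds, sleep=time.sleep):
    """Call task() `rounds` times, pausing interval_seconds between calls."""
    results = []
    for i in range(rounds):
        results.append(task())
        # No need to pause after the final round.
        if i < rounds - 1:
            sleep(interval_seconds)
    return results


# Demo: a stand-in task and a fake sleep that only records the pauses.
pauses = []
outputs = run_periodically(lambda: "scraped", 600, rounds=3, sleep=pauses.append)

print(outputs)
print(pauses)
```

A real bot would pass a function that scrapes and prints the headlines as `task` and keep the default `time.sleep`.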

Below is the full implementation:

Python3

import time
from datetime import datetime

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

PATH = r"D:\chromedriver"

driver = webdriver.Chrome(service=Service(PATH))

url = 'https://news.google.com/topstories?hl=en-IN&gl=IN&ceid=IN:en'

while True:
    # reload the page so that each pass sees the latest headlines
    driver.get(url)

    # local time of this scrape (the IST label assumes the
    # machine's clock is set to Indian Standard Time)
    current_time = datetime.now().strftime("%H:%M:%S")
    print(f'At time : {current_time} IST')
    c = 1

    for x in range(3, 9):
        curr_path = ('/html/body/c-wiz/div/div[2]/div[2]/'
                     f'div/main/c-wiz/div[1]/div[{x}]/div/div/article/h3/a')

        # Exception handling to handle unexpected changes
        # in the structure of the website
        try:
            title = driver.find_element(By.XPATH, curr_path)
        except NoSuchElementException:
            continue
        print(f"Heading {c}: ")
        c += 1
        print(title.text)

    # to stop the running of code for 10 mins
    time.sleep(600)
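The skip-on-failure logic in the inner loop can be tried without a browser. The following sketch assumes a hypothetical `collect_titles` helper and a fake lookup backed by a dictionary. It catches `LookupError` where Selenium code would catch its own "element not found" exception, and the short `div[3]`-style keys stand in for full XPaths.

```python
def collect_titles(find_text, xpaths):
    """Return the text for every path that can be found, skipping the rest."""
    titles = []
    for path in xpaths:
        try:
            titles.append(find_text(path))
        except LookupError:
            # The element is missing (for example, the page layout changed).
            continue
    return titles


# A fake page: only two of the six expected positions hold a headline.
fake_page = {"div[3]": "Headline A", "div[5]": "Headline B"}


def fake_find(path):
    # A missing key raises KeyError, which is a subclass of LookupError.
    return fake_page[path]


found = collect_titles(fake_find, [f"div[{x}]" for x in range(3, 9)])
for number, title in enumerate(found, start=1):
    print(f"Heading {number}: {title}")
```

Because missing positions are skipped before the counter advances, the printed heading numbers stay consecutive even when some elements are absent.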


How to Build Web scraping bot in Python

In this article, we are going to see how to build a web scraping bot in Python.

Web scraping is the process of extracting data from websites. A bot is a piece of code that automates a task. A web scraping bot, therefore, is a program that automatically scrapes a website for data based on our requirements.
