How to Scrape Google Ngrams?
To scrape Google Ngram, we will use Python's requests and urllib libraries.
Now, we will create a function that extracts the data from the Google Ngram website. Read the comments in the code to follow along.
Python3
import requests
import urllib.parse


def runQuery(query, start_year=1850, end_year=1860, corpus=26, smoothing=0):
    # converting a regular string to the standard URL format
    # eg: "geeks for,geeks" will convert to "geeks%20for%2Cgeeks"
    query = urllib.parse.quote(query)

    # creating the URL
    url = ('https://books.google.com/ngrams/json?content=' + query +
           '&year_start=' + str(start_year) +
           '&year_end=' + str(end_year) +
           '&corpus=' + str(corpus) +
           '&smoothing=' + str(smoothing))

    # requesting data from the above url
    response = requests.get(url)

    # extracting the json data from the response we got
    output = response.json()

    # creating a list to store the ngram data
    return_data = []

    if len(output) == 0:
        # if no data returned from site, return the following statement
        return "No data available for this Ngram."
    else:
        # if data returned from site, store the data in return_data list
        for num in range(len(output)):
            # getting the ngram name and its timeseries data
            return_data.append((output[num]['ngram'], output[num]['timeseries']))

    return return_data
The function runQuery takes the query string as its only required argument; the remaining arguments have default values. By default, the year range is 1850 to 1860, the corpus is 26 (i.e. English language), and the smoothing is 0. The function URL-encodes the query string and builds the Google Ngram URL from it and the other arguments. It then requests that URL and parses the JSON response. If the response is empty, it returns a message string; otherwise it returns a list of (ngram, timeseries) tuples, one per phrase.
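The URL-building step described above can also be sketched with urllib.parse.urlencode instead of manual string concatenation. This is a minimal sketch, not the article's original method: the helper name build_ngram_url and the constant NGRAM_ENDPOINT are hypothetical names introduced here for illustration, and the parameter names are the ones used in runQuery. It makes no network request, so it can be used to inspect the URL before calling the site.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the runQuery function above.
NGRAM_ENDPOINT = "https://books.google.com/ngrams/json"


def build_ngram_url(query, start_year=1850, end_year=1860, corpus=26, smoothing=0):
    """Hypothetical helper: build the same kind of URL that runQuery requests."""
    params = {
        "content": query,
        "year_start": start_year,
        "year_end": end_year,
        "corpus": corpus,
        "smoothing": smoothing,
    }
    # quote_via=quote encodes spaces as %20 (as runQuery does)
    # instead of urlencode's default "+".
    return NGRAM_ENDPOINT + "?" + urlencode(params, quote_via=quote)


print(build_ngram_url("geeks for,geeks"))
```

Letting urlencode assemble the query string keeps the encoding of every parameter in one place, which may be easier to maintain than concatenating each piece by hand.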
Now, let us use the runQuery function to find out the popularity of “Albert Einstein”.
Python3
query = "Albert Einstein"
print(runQuery(query))
Output:
[('Albert Einstein', [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09,
1.014315520464492e-09, 6.44787723214079e-10, 0.0, 7.01216085197131e-10, 0.0, 0.0])]
We can even enter multiple phrases in the same query by separating each phrase with commas.
Python3
query = "Albert Einstein,Isaac Newton"
print(runQuery(query))
Output:
[('Albert Einstein', [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09,
1.014315520464492e-09, 6.44787723214079e-10, 0.0, 7.01216085197131e-10,
0.0, 0.0]), ('Isaac Newton', [1.568728407619346e-06, 1.135979687205690e-06,
1.140318772741011e-06, 1.102130454455618e-06, 1.34806168716750e-06,
2.039112359852879e-06, 1.356955749542976e-06, 1.121004174819972e-06,
1.223622120960499e-06, 1.18965874662535e-06, 1.077695060303085e-06])]
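The returned timeseries lists carry no year labels. The outputs above contain 11 values for the range 1850 to 1860, so it seems reasonable to assume one value per year, starting at start_year. Under that assumption, a small helper can pair each value with its year. The function name series_by_year is hypothetical, and the sample data is copied from the "Albert Einstein" output shown earlier rather than fetched from the site.

```python
def series_by_year(results, start_year=1850):
    """Hypothetical helper: map each ngram to a {year: frequency} dictionary.

    Assumes each timeseries holds one value per year, beginning at start_year.
    """
    by_year = {}
    for ngram, timeseries in results:
        by_year[ngram] = {
            start_year + offset: value
            for offset, value in enumerate(timeseries)
        }
    return by_year


# Sample data copied from the runQuery("Albert Einstein") output above.
sample = [("Albert Einstein", [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09,
                               1.014315520464492e-09, 6.44787723214079e-10, 0.0,
                               7.01216085197131e-10, 0.0, 0.0])]

einstein = series_by_year(sample)["Albert Einstein"]

# Print only the years in which the phrase appears at all.
for year, frequency in einstein.items():
    if frequency > 0:
        print(year, frequency)
```

Note that runQuery returns a plain string when no data is available, so a caller would need to check for a list before passing the result to a helper like this one.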
Scrape Google Ngram Viewer using Python
In this article, we will learn how to scrape Google Ngram using Python. Google Ngram, also called Google Books Ngram Viewer, is an online search engine that charts the frequencies of any set of search strings.