Use-case:
I want to automate scraping and storing Google search results for a set of keywords and queries using Python. One possible use case is finding news, articles, or contextual information about topical issues in various locations or communities. Another is marketing: for instance, finding the social media links and contact details of target-group individuals with certain working roles in particular communities or locations via Google search. Of course, many more use cases built on Google search are possible.
The video tutorial complementing this blog post:
Here I go through the code in steps:
Step 1: initial settings
# required package and initial settings for reading a data file from google drive in google colab
from google.colab import auth
auth.authenticate_user()
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
# other required packages, depending on your needs
import pandas as pd
import re
from tqdm import tqdm
!pip3 install --upgrade ecommercetools
from ecommercetools import seo
Step 2: importing keyword data for queries
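A minimal sketch of this step, assuming the keywords live in a Google Sheet that the authorized Colab account can open. The sheet title and the three column names (role, community, topic) are assumptions for illustration, not fixed by this tutorial:

```python
import pandas as pd

# In Colab, after the gspread authorization from Step 1, the sheet could
# be loaded like this (sheet title 'keywords' is an assumption):
#
#   worksheet = gc.open('keywords').sheet1
#   keywords = pd.DataFrame(worksheet.get_all_records())
#
# For illustration outside Colab, the same three-column structure as a
# local DataFrame (column names are hypothetical):
keywords = pd.DataFrame({
    'role': ['community manager', 'youth worker'],
    'community': ['Helsinki', 'Tampere'],
    'topic': ['housing', 'education'],
})
print(keywords.shape)
```

Either way, the result is a DataFrame with one keyword combination per row, which the next step turns into query strings.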
Step 3: creating a query list from the keyword data
Query keywords for Google search follow a structure. In this example the keyword data has three columns. We combine them in two ways, producing two alternative query lists to use in Google search, as follows:
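The two combinations above can be sketched like this, assuming the three keyword columns are named role, community, and topic (hypothetical names standing in for whatever the sheet actually uses):

```python
import pandas as pd

# same three-column keyword structure as in Step 2 (column names assumed)
keywords = pd.DataFrame({
    'role': ['community manager', 'youth worker'],
    'community': ['Helsinki', 'Tampere'],
    'topic': ['housing', 'education'],
})

# Option 1: combine the first two columns
Querylist1 = (keywords['role'] + ' ' + keywords['community']).to_list()

# Option 2: combine all three columns for a narrower search
Querylist2 = (keywords['role'] + ' ' + keywords['community']
              + ' ' + keywords['topic']).to_list()

print(Querylist1[0])  # 'community manager Helsinki'
```

Querylist1 gives broader results; Querylist2 narrows each query with the topic column.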
Step 4: scraping Google search results using the seo module from ecommercetools in Python
targets = []
# step 4.1: specify how many pages of Google search results to fetch
pages_num = 5
# step 4.2: specify which types of Google search results to keep,
# given this list of keywords; use [''] to keep all results
conditions = ['wikipedia', 'twitter', 'linkedin', 'facebook', 'news']
# step 4.3: the main loop
for query in tqdm(Querylist1):
    try:
        res = seo.get_serps(query, pages=pages_num)
    except Exception:
        continue  # skip queries that fail
    for c in conditions:
        matches = res[res['link'].str.contains(c, case=False)]
        titles = matches['title'].to_list()
        links = matches['link'].to_list()
        if len(links) == 0:
            continue  # no result matched this condition
        y = titles[0] if titles else ''
        z = links[0]
        targets.append([y, z, c, query])
Step 5: store the results
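A minimal sketch for this step: wrap the collected targets list in a DataFrame and save it to CSV. The column names and the output filename are assumptions; in Colab you could instead write the DataFrame back to a Google Sheet with gspread:

```python
import pandas as pd

# sample row with the same shape as the targets built in Step 4:
# [title, link, condition, query]
targets = [
    ['Example title', 'https://twitter.com/example', 'twitter',
     'community manager Helsinki'],
]

# column names are our own labels; filename is an assumption
results = pd.DataFrame(targets, columns=['title', 'link', 'condition', 'query'])
results.to_csv('google_search_results.csv', index=False)
```

From here the CSV can be downloaded from Colab or copied to Google Drive for further analysis.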
Related links: