Use-case:
News articles can be overwhelming and hard to structure, especially when we are dealing with complex or multi-dimensional domains or communities that require us to stay on top of the latest news and developments, and it is just too much to handle manually or with traditional methods. You could be interested in storing and organizing summaries of news articles from specific sources on a schedule for a variety of reasons. One case I have seen personally was a meeting with a client where we wanted to keep up to date with the latest news and pain points of that client's community, so we could offer relevant, evidence-based product or update suggestions for the next phase of the project that would impress them. Or you may want to be proactive and search for the hottest topics in the news, so you can build further research on top of them for possible product development, marketing, funding, or grant writing.
In this tutorial, I provide the simple code I have used to scrape and organize summaries of news articles from a set of news websites that offer an RSS feed. You can think of it as an RSS feed scraping project in Python using the BeautifulSoup package with its XML parsing feature.
A video tutorial for this blog is available here:
Step1: Creating a list of news websites – RSS feed pages
This is one example of a news website I was looking at:
Somewhere on their website you can find RSS feed information, and news websites can have multiple RSS feeds, like this example:
The RSS feed URL for the second link in the picture above, for instance, is “https://globalnews.ca/bc/feed/”.
So, I ended up with a list of URLs for RSS feeds, plus some metadata recording where each RSS feed came from in the first place (source: Global News and subcategory: Global BC). My input data is a table like this:
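For illustration only (the exact input file is not shown here; the column names rss, source, and source sub category are taken from the code in Step4), one row of that table might look like:
- rss: https://globalnews.ca/bc/feed/
- source: Global News
- source sub category: Global BC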
Step2: Importing packages and input data
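Here is a minimal sketch of this step. The file name rss_sources.csv is an assumption for illustration; the column names rss, source, and source sub category are the ones used by the code in Step4.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# read the table of RSS feed urls and their metadata
# (file name is an assumption; point it to your own input table)
data = pd.read_csv('rss_sources.csv')
print(data.head())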
Step3: Overview of scraped content for one news website and one news article
# request one RSS feed page
r = requests.get('https://globalnews.ca/bc/feed/')
# parse the returned content as xml
soup = BeautifulSoup(r.content, features='xml')
# each <item> tag corresponds to one news article
articles = soup.findAll('item')
The object “soup” in the code above holds the XML-structured content of the news RSS feed. It contains a lot of information and is rather unstructured for our purpose.
So, next we need to figure out how to identify each news article separately and summarize the information from it.
It turns out that in RSS feeds, the tag “item” marks each news article. So, we use that information and store all the news articles in a list called “articles”.
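As a quick check, a small sketch using the objects created above (and assuming the feed includes the standard RSS tags) lets you see how many items were found and what the first one looks like:
# number of news articles found in this feed
print(len(articles))
# raw xml of the first article, plus a couple of its fields
print(articles[0])
print(articles[0].find('title').text)
print(articles[0].find('pubDate').text)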
Step4: The main body – getting the results for all RSS feeds
Now we can use all this information to dig deeper and record the final summarized, structured news data the way we want.
Within each article, separated by “item”, the news article’s link, title, description, and publication date are among the key pieces of information I wanted to store, which is done below.
- Here is one example of the raw data for a news article in RSS, and how we can identify the desired information from it, as done in the code:
article_list = []
for i, x in enumerate(data['rss']):
    # scraping function
    r = requests.get(x)
    soup = BeautifulSoup(r.content, features='xml')
    articles = soup.findAll('item')
    #
    for a in articles:
        try:
            title = a.find('title').text
        except Exception:
            title = ''
        try:
            link = a.find('link').text
        except Exception:
            link = ''
        try:
            published = a.find('pubDate').text
        except Exception:
            published = ''
        try:
            desc = a.find('description').text
        except Exception:
            desc = ''
        try:
            catg = a.find('category').text
        except Exception:
            catg = ''
        try:
            id = a.find('guid').text
        except Exception:
            id = ''
        # create an "article" object with the data
        # from each "item"
        article = {
            'title': title,
            'link': link,
            'published': published,
            'description': desc,
            'category': catg,
            'source': data['source'].tolist()[i],
            'source sub-category': data['source sub category'].tolist()[i],
            'id': id
        }
        # append my "article_list" with each "article" object
        article_list.append(article)
    #
    df = pd.DataFrame.from_dict(article_list)
    print(data['source'].tolist()[i] + ' - ' + data['source sub category'].tolist()[i])
    time.sleep(2)
    df.to_csv('news_rss.csv', encoding='utf-8-sig', index=False)
    time.sleep(5)
The final results are stored in a CSV file in this case.
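If you want to double-check the output, a small sketch with pandas is enough:
# load the saved results back and take a quick look
results = pd.read_csv('news_rss.csv')
print(results.shape)
print(results.head())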
Related Links
- Code solution script: available as a Jupyter notebook, along with all other files used here, in a GitHub repo: https://github.com/winswithdata/news-rss-scraping
Check out these related tutorials at your convenience:
- For python related tutorials, see this playlist of video tutorials: https://www.youtube.com/playlist?list=PL_b86y-oyLzAQOf0W7MbCXYxEkv27gPN6
- For web-scraping tutorials, see this playlist of video tutorials: https://www.youtube.com/watch?v=RLnPN4HE-Qs&list=PL_b86y-oyLzDp2EBX-k2bjIBW9SgH3sxT