Introduction

XML

XML stands for eXtensible Markup Language. It was designed to store and transport data in a form that is readable by both humans and machines. Accordingly, the design goals of XML emphasize simplicity, generality, and usability across the Internet.

In this tutorial, the XML file to be parsed is an RSS feed.

RSS

Rich Site Summary (RSS), also known as Really Simple Syndication, is a family of web feed formats used to publish frequently updated information such as blog entries, news headlines, audio, and video. An RSS document is plain text in XML format.

  • The RSS format itself is comparatively simple for both automated programs and human readers to read.
  • The RSS processed in this tutorial is the top news stories feed of a well-known news website. Our objective is to process this RSS feed, transform the XML into another format, and store it for later use.
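To make the structure concrete, here is a minimal sketch of what an RSS 2.0 document looks like and how its top-level layout can be inspected. The feed content below is made up for illustration (the names and example.com URLs are assumptions) and is not the feed used in this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS 2.0 document (not a real feed)
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Top stories</description>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
      <description>Summary of the first story</description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/2</link>
      <description>Summary of the second story</description>
    </item>
  </channel>
</rss>"""

# fromstring() parses XML data and returns the root element (<rss>)
root = ET.fromstring(SAMPLE_RSS.encode("utf-8"))
channel = root.find("channel")

print(root.tag)                      # rss
print(channel.findtext("title"))     # Example News
print(len(channel.findall("item")))  # 2
```

The root element is rss, it contains a single channel, and each news story is an item inside that channel.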

Implementation

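import csv
import requests
import xml.etree.ElementTree as ET


def parseXML(xmlfile):
    # Parse the XML file into an element tree and get its root element
    tree = ET.parse(xmlfile)
    root = tree.getroot()

    # Collect each news item as a dictionary of tag -> text
    items = []
    for item in root.findall('./channel/item'):
        e = {}
        for child in item:
            # child.text is None for empty elements; store '' instead
            e[child.tag] = child.text or ''
        items.append(e)

    return items


def loadRSS():
    # Placeholder: replace with the URL of the RSS feed
    url = 'myFile.xml'

    # Fetch the feed and save the raw bytes to a local XML file
    resp = requests.get(url)
    with open('writeFile.xml', 'wb') as f:
        f.write(resp.content)


def savetoCSV(items, filename):
    # Columns to write to the CSV file
    fields = ['id', 'title', 'description', 'link']

    # newline='' lets the csv module control line endings;
    # extrasaction='ignore' skips item keys that are not listed in fields
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(items)


def main():
    loadRSS()
    newsitems = parseXML('writeFile.xml')
    savetoCSV(newsitems, 'myCSV.csv')


if __name__ == "__main__":
    main()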

The code above will:

  • Load the RSS feed from the specified URL and save it as an XML file.
  • Parse the XML file and store the news as a list of dictionaries, where each dictionary represents a single news item.
  • Save the news items to a CSV file.

Let us try to understand the code in pieces:

Loading and saving RSS feed

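def loadRSS():
    # Placeholder: replace with the URL of the RSS feed
    url = 'myFile.xml'
    # Send the HTTP request and write the response body to a local file
    resp = requests.get(url)
    with open('writeFile.xml', 'wb') as f:
        f.write(resp.content)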

Here, we first send an HTTP request to the RSS feed's URL, which returns an HTTP response object. The response's content now holds the XML file data, and we save it as writeFile.xml in our local directory.
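The same fetch-and-save step can be sketched with only the standard library. This is an illustrative variant, not the tutorial's method: it swaps requests.get for urllib.request.urlopen, and the function name fetch_to_file and the file names are assumptions. The demo reads from a local file:// URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_file(url, destination, timeout=10):
    """Download url and write the raw bytes to destination."""
    # urlopen raises URLError/HTTPError on failure, so a failed
    # request does not silently produce an empty or error-page file
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    Path(destination).write_bytes(data)
    return len(data)


# Demo without network access: read a local file through a file:// URL
workdir = Path(tempfile.mkdtemp())
source = workdir / "feed.xml"
source.write_bytes(b"<rss><channel></channel></rss>")

copied = fetch_to_file(source.as_uri(), workdir / "copy.xml")
print(copied)  # number of bytes written
```

With a real feed, the first argument would be the feed's http(s) URL instead of the file:// URL used here.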

Parsing XML

To parse the XML file, we created the parseXML() function. XML is an inherently hierarchical data format, and the most natural way to represent it is as a tree.

Here, we use the xml.etree.ElementTree module (imported as ET). It provides two classes for this purpose: Element and ElementTree, where Element represents a single node in the tree and ElementTree represents the entire XML document as a tree. Reading and writing to/from files, as well as interactions with the entire document, are typically done at the ElementTree level. Interactions with a single XML element and its children take place at the Element level.
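The split between the two levels can be illustrated with a short sketch. The tag names and the text "Hello" are made up for this example; an in-memory buffer stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Element level: build a small tree node by node
channel = ET.Element("channel")
item = ET.SubElement(channel, "item")
title = ET.SubElement(item, "title")
title.text = "Hello"

# ElementTree level: wrap the root element to write the whole document
tree = ET.ElementTree(channel)
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)

# Parse the serialized bytes back; parse() returns an ElementTree
buffer.seek(0)
reparsed = ET.parse(buffer)
reparsed_root = reparsed.getroot()  # back at the Element level

print(type(reparsed).__name__)                # ElementTree
print(reparsed_root.tag)                      # channel
print(reparsed_root.find("item/title").text)  # Hello
```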

Ok, so let’s go through the parseXML() function now:

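def parseXML(xmlfile):
    # Build the element tree from the file and get its root element
    tree = ET.parse(xmlfile)
    root = tree.getroot()
    items = []
    # Each <item> under <channel> is one news story
    for item in root.findall('./channel/item'):
        e = {}
        for child in item:
            # Map each child tag to its text ('' if the element is empty)
            e[child.tag] = child.text or ''
        items.append(e)
    return items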
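The path './channel/item' selects every item element that is a direct child of channel, which is itself a direct child of the root. A minimal sketch of how the loop behaves on a small, made-up feed fragment follows; the sample data is an assumption, and it includes an empty element, whose text attribute is None.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment; the second item has an empty <description/>
SAMPLE = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>A</title><link>https://example.com/a</link><description>About A</description></item>
  <item><title>B</title><link>https://example.com/b</link><description/></item>
</channel></rss>"""

root = ET.fromstring(SAMPLE)

items = []
for item in root.findall("./channel/item"):
    # child.text is None for empty elements, so fall back to ''
    items.append({child.tag: (child.text or "") for child in item})

print(len(items))  # 2
print(items[1])    # {'title': 'B', 'link': 'https://example.com/b', 'description': ''}
```

The channel's own title is not picked up, because only the children of each item are visited.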

Saving Data to a CSV File

Now all that is left is to save the list of news items to a CSV file with the savetoCSV() function, so that it can be easily used or modified in the future. The function uses csv.DictWriter, which writes each dictionary in the list as one row of the CSV file.
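One detail worth knowing about csv.DictWriter: by default it raises a ValueError when a dictionary contains a key that is not in fieldnames, and real feed items often carry extra tags. The sketch below, on made-up items, shows one way to handle that with extrasaction='ignore'; an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical parsed items; the first has an extra key, the second a missing one
items = [
    {"title": "A", "link": "https://example.com/a", "description": "About A", "pubDate": "Mon"},
    {"title": "B", "link": "https://example.com/b"},
]
fields = ["title", "description", "link"]

# StringIO stands in for a file opened with open(filename, 'w', newline='')
out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore", restval="")
writer.writeheader()
writer.writerows(items)

print(out.getvalue())
```

Keys outside fields (pubDate here) are skipped, and fields missing from an item are written as empty cells.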

The data from the hierarchical XML file has now been transformed into a simple CSV file that stores all news stories as a table. This also makes it easier to extend the data set later.

Additionally, the list of dictionaries is JSON-like and can be used directly in applications. This approach is a good option for extracting data from websites that do not offer a public API but do provide RSS feeds.
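As a brief illustration of the JSON-like point, a list of dictionaries such as the one returned by parseXML() can be serialized with the standard json module. The items below are made up for the example.

```python
import json

# Hypothetical parsed items, shaped like the output of parseXML()
items = [
    {"title": "A", "link": "https://example.com/a", "description": "About A"},
    {"title": "B", "link": "https://example.com/b", "description": ""},
]

# Serialize to JSON text; ensure_ascii=False keeps non-ASCII characters readable
text = json.dumps(items, indent=2, ensure_ascii=False)
print(text)

# The round trip gives back an equal list of dictionaries
restored = json.loads(text)
```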