Web Scraping

A small snippet of HTML will help illustrate the main methods for parsing data with Beautiful Soup. This snippet is much simpler than a typical modern website, but it will suffice for the purposes of this walkthrough.

Uses of Web Scraping

  1. Sentiment Analysis on Social Media
  2. Market Analysis, Lead Generation in the Marketing Domain, Online Price Comparison in the eCommerce Domain
  3. Collecting Training and Testing Data for Machine Learning Applications

Scrape the Python Site

Step 1: Inspect Your Data Source

Investigate the Website

Before using Python to scrape the web, you need to become familiar with the target website. That should be the first step in any web scraping project you undertake. To extract the information that is relevant to you, you will need to comprehend the site structure.

Decipher the Information in URLs

A URL can encode a large amount of data. Your web scraping adventure will be much simpler if you first learn how URLs work and what they are made of.

Query parameters appear at the end of a URL. If you go to Indeed and search for "software developer" in "Australia" using their search bar, you'll notice that the URL adjusts to include these terms as query parameters:

https://au.indeed.com/jobs?q=software+developer&l=Australia

Query parameters are made up of three parts:

  • Start: A question mark (?) marks the beginning of the query parameters.
  • Information: The bits of data that make up a query parameter are encoded in key-value pairs, with related keys and values joined together by an equals sign (key=value).
  • Separator: An ampersand symbol separates multiple query parameters in a URL (&).
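The three parts above can be checked directly with Python's standard-library urllib.parse module. This is a quick sketch for illustration, separate from the scraper itself:

```python
from urllib.parse import urlparse, parse_qs

url = "https://au.indeed.com/jobs?q=software+developer&l=Australia"

# Everything after the "?" is the query string; "&" separates the pairs.
query = urlparse(url).query   # 'q=software+developer&l=Australia'

# parse_qs splits the key=value pairs into a dict (and decodes "+" as a space)
params = parse_qs(query)

print(params)  # {'q': ['software developer'], 'l': ['Australia']}
```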

Inspect the Site Using Developer Tools

  • Developer tools can help you understand the structure of a website.
  • Developer tools are included with all modern web browsers. In this section, you'll learn how to use Chrome's developer tools. The procedure will be similar to that of other modern web browsers.
  • To better understand your source, developer tools allow you to interactively explore the site's document object model (DOM). Select the Elements tab in developer tools to explore your page's DOM. A structure with clickable HTML elements will be visible.

Step 2: Scrape HTML Content From a Page

The HTML you'll come across will occasionally be perplexing. Fortunately, the HTML of this job board includes descriptive class names for the elements that you're looking for:

  • The job posting's title is contained in the element with class="title is-5".
  • The name of the company offering the position is in the element with class="subtitle is-6 company".
  • The element with class="location" specifies where you will be working.
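To sketch how those class names map to Beautiful Soup lookups, consider this made-up HTML fragment modelled on a single job card. The job title, company name, and location values here are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical job card; only the class names match the description above.
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Example Corp</h3>
  <p class="location">Sydney, Australia</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches the class attribute exactly as written in the HTML
title = soup.find("h2", class_="title is-5").text.strip()
company = soup.find("h3", class_="subtitle is-6 company").text.strip()
location = soup.find("p", class_="location").text.strip()

print(title, "|", company, "|", location)
```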

Step 3: Parse HTML Code With Beautiful Soup

  • Every element on an HTML web page can have an id attribute. As the name implies, the id attribute makes the element on the page uniquely identifiable. You can start parsing your page by selecting an element by its ID.
  • Return to the developer tools and locate the HTML object that contains all of the job postings. Right-click a section of the page and choose Inspect, or hover over the HTML in the Elements tab, to see which element it corresponds to.
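A minimal sketch of selecting an element by its ID: the id value "ResultsContainer" and the HTML below are assumptions for illustration; substitute whatever id your developer tools reveal on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one container div holds all the job cards.
html = """
<div id="ResultsContainer">
  <div class="card">First job</div>
  <div class="card">Second job</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find(id=...) narrows the search to the single element with that id
results = soup.find(id="ResultsContainer")

# further searches can then be scoped to that container
cards = results.find_all("div", class_="card")

print(len(cards))  # 2
```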

Create code to retrieve the content of the elements you've chosen.

>> Begin by installing the required modules/packages.

pip install pandas requests beautifulsoup4

>> Import essential libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

>>Define a scraper function

def scraper():

>> Write a for loop inside the scraper function to iterate over the number of pages you want to scrape. Here, reviews are scraped from five pages.

for i in range(1, 6):

>> Create an empty list to hold each page's reviews (pages 1 to 5)

pagewise_reviews = []

>> Construct the URL by appending the page number as a query parameter

query_parameter = "?page=" + str(i)
url = base_url + query_parameter

>> Using requests, send an HTTP request to the URL and save the response.

response = requests.get(url)

>> Create a soup object and parse the HTML page.

soup = bs(response.content, 'html.parser')

>> Find all the div elements with the class name "rvw-bd" and save them in a variable.

rev_div = soup.find_all("div", attrs={"class": "rvw-bd"})

>> Loop through every element of rev_div and append only the review text to the pagewise_reviews list.

for j in range(len(rev_div)):
    # find the p tag to fetch only the review text
    pagewise_reviews.append(rev_div[j].find("p").text)

>> Add each page's reviews to a single list called "all_pages_reviews".

for k in range(len(pagewise_reviews)):
  all_pages_reviews.append(pagewise_reviews[k])

>> Return the final list of reviews at the end of the function.

return all_pages_reviews

>> Call the scraper() function and store the output in a variable named 'reviews'.

# Driver code
reviews = scraper()

Step 4: Store the data in the required format

i = range(1, len(reviews)+1)
reviews_df = pd.DataFrame({'review':reviews}, index=i)

Now let us take a look at our dataset:

print(reviews_df)

Entire Python Code

# !pip install pandas requests beautifulsoup4
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

base_url = "https://www.consumeraffairs.com/food/dominos.html"
all_pages_reviews = []

def scraper():
    for i in range(1, 6):  # fetching reviews from five pages
        pagewise_reviews = []
        query_parameter = "?page=" + str(i)
        url = base_url + query_parameter
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        rev_div = soup.find_all("div", attrs={"class": "rvw-bd"})

        for j in range(len(rev_div)):
            # find the p tag to fetch only the review text
            pagewise_reviews.append(rev_div[j].find("p").text)

        for k in range(len(pagewise_reviews)):
            all_pages_reviews.append(pagewise_reviews[k])

    return all_pages_reviews

# Driver code
reviews = scraper()
i = range(1, len(reviews) + 1)
reviews_df = pd.DataFrame({'review': reviews}, index=i)
reviews_df.to_csv('reviews.txt', sep='\t')
