Build your own dataset: web scraping & scheduling.

Daryl Felix
4 min read · Dec 12, 2020

The golden oil for a data scientist or machine learning programmer is the dataset. Without data, analysis and modeling are not possible.

Photo by Zbynek Burival on Unsplash

Topics of this story

  • ways to collect a dataset
  • web scraping example
  • scheduling

When we start a new data analysis and/or machine learning modeling project, there are several tasks to do before diving into coding.

Collecting the data is one of the most important ones. There are multiple options for getting a dataset (after scoping the project). We can find the dataset on the internet: the web offers thousands of links where data can be downloaded (for free or for a fee) and loaded into your project. The dataset can also be created manually (I play golf and collect my stats by hand in a notebook, then fill in a text file).

In this post I would like to cover a third option: web scraping.

Wikipedia summarizes the topic as “extracting data from websites”.

I will explain how to build a web scraping program to collect data about the Coronavirus and schedule the notebook so it has fresh data every day.

Web Scraping example

The notebook defines a function to collect data from https://www.worldometers.info/coronavirus/. Having a quick look at the HTML source code, we can isolate the section where the table is created and identify the structure of the table for each country’s detailed statistics [‘Total Cases’, ‘New Cases’, …, ‘Population’].

Let’s import a few libraries

import bs4
import pandas as pd
import requests
import time
from datetime import datetime
import threading
import numpy as np

The bs4 and threading libraries are the focus of this story. Beautiful Soup offers methods to crawl the HTML source code and isolate tags to collect data. The collected data is simply added to a dataframe, which is saved after concatenation with the history (I load the original file into a dataframe, perform the web scraping into a temporary dataframe, and finally merge both dataframes and save the result under the original name).

Read the URL and return content

def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return bs4.BeautifulSoup(page.text, "html.parser")
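
As a quick sanity check (not part of the original notebook), the function can be called directly and the page title printed:

soup = get_page_contents('https://www.worldometers.info/coronavirus/')
print(soup.title.text)  # prints the content of the page's <title> tag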

Scraping function & threading

The code to scrape the content is defined in a function, which is called by using the threading.Timer method. I start by creating the function scrapping_url() and setting up some constants.

def scrapping_url():
    data = []
    url = 'https://www.worldometers.info/coronavirus/'
    root_url = url

Then comes the magic line that schedules the function to run again and again.

    threading.Timer(30.0, scrapping_url).start()

The 30.0 tells the timer to run the scrapping_url function again 30 seconds later (good for testing), but in real life I set this parameter to 36000 to collect data every 10 hours. During my analysis I handle duplicate rows per day (I did that to be sure to have consistent data and not be too impacted by the timezone).
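
To make the pattern clear, here is a minimal standalone sketch of a self-rescheduling function (the task() name and the 30-second interval are just for illustration and are not part of the notebook):

import threading
from datetime import datetime

def task():
    # re-arm the timer first, so the next run is scheduled even if the work below fails
    threading.Timer(30.0, task).start()
    print(f"running at {datetime.now()}")  # the real work goes here

task()  # the first call starts the cycle; every run schedules the next one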

Dealing with the page content


    # still inside scrapping_url(): parse the page and locate the countries table
    soup = get_page_contents(url)
    entries = soup.findAll('div', class_='maincounter-number')
    rat = [e.find('span', class_='').text for e in entries]
    table = soup.find('table',
                      attrs={'class': 'table table-bordered table-hover main_table_countries'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')

rows contains the table rows. With a simple for row in rows: loop, it is easy to extract the data for each row (cell by cell) and store the result in a list.

    # I don't want to get data from the 1st row of the table
    dont_do_1_row = False
    for row in rows:
        if dont_do_1_row:
            cols = row.find_all('td')
            d = []
            for c in cols:
                v = c.text.strip()
                d.append(v)
            # convert numeric strings such as '1,234' to integers
            temp = []
            for a in d:
                if a != '':
                    try:
                        i = int(a.replace(',', ''))
                    except ValueError:
                        i = a
                else:
                    i = a
                temp.append(i)
            data.append(temp)
        dont_do_1_row = True

At the end of the loop, the content of the list is added to a list of lists, which is then transformed into a temporary dataframe and finally merged into the global file.

    colname = ['Rank', 'Location', 'Cases', 'New Cases',
               'Deaths', 'New Deaths', 'Recovered',
               'Actives', 'Serious', 'Case-1M',
               'Deaths-1M', 'Tests', 'Tests-1M',
               'Population', 'Continent',
               'dummy1', 'dummy2', 'dummy3', 'dummy4']
    df = pd.DataFrame(data, columns=colname)
    df['DateTime'] = datetime.now()
    # read the data collected so far (the file may not exist yet)
    try:
        covid_global = pd.read_excel('covid_data.xlsx', index_col=0)
    except Exception:
        covid_global = None
    # append the last scraped row(s)
    # (in newer pandas versions, pd.concat([covid_global, df]) replaces DataFrame.append)
    try:
        covid_fusion = covid_global.append(df,
            ignore_index=True, verify_integrity=False, sort=None)
    except Exception:
        covid_fusion = df.copy()
    # save the file with the raw data scraped from the website
    covid_fusion.to_excel('covid_data.xlsx')
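
As mentioned earlier, I keep only one row per country and per day during the analysis. A minimal sketch of that de-duplication, assuming the Location and DateTime columns produced above, could look like this (run at analysis time, outside the scraping function):

analysis = pd.read_excel('covid_data.xlsx', index_col=0)
analysis['Date'] = pd.to_datetime(analysis['DateTime']).dt.date
# keep the most recent scrape for each country and calendar day
analysis = (analysis.sort_values('DateTime')
                    .drop_duplicates(subset=['Location', 'Date'], keep='last'))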

Call the function and the job is done!

scrapping_url()
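
One practical note, not part of the original notebook: threading.Timer returns a timer object, and keeping a reference to the most recent one makes it possible to stop the cycle. A small variation, sketched here with a hypothetical scrapping_url_cancellable() wrapper:

timer = None

def scrapping_url_cancellable():
    global timer
    timer = threading.Timer(36000.0, scrapping_url_cancellable)  # every 10 hours
    timer.start()
    # ... the scraping code shown above goes here ...

scrapping_url_cancellable()
# later, timer.cancel() prevents the next scheduled run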

Conclusion

  • the magic line is the call to threading.Timer
  • the program works as long as the structure of the website doesn’t change
  • some websites can be very difficult to parse, and those that are not pure HTML are out of scope
  • I imagine that in a modeling project, fresh data can be used to validate model accuracy and trigger some fine-tuning


Daryl Felix

Passionate programmer with more than 30 years of experience, from the first Cobol programs to Python.