Natalia Fedorova, Regional Coordinator

19 January 2021

Analyzing COVID-19 Data with Python

Python is not just a popular back-end development language; it is also the backbone of a key Data Science tool. Pandas, the go-to library for data specialists, is built on top of Python. Much like the vanilla language, pandas is pretty easy to get started with.

Is Pandas Simply Excel But with More Steps?

Yes and no. A lot of the basic functionality, such as sorting or Find and Replace, is available in Excel. The main advantages of pandas are a broader set of options for handling data and for visualizing it. Personally, I find Excel charts clunky, and they arguably favor looks over insights. With pandas, you can get a great combination of both.

Performance is another area where Excel is lacking. One of the popular Machine Learning competitions on Kaggle, fare prediction for NYC taxis, provides participants with data covering about 55 million taxi rides. Excel caps out at 1,048,576 rows per sheet, a small fraction of that dataset. If you can’t load the data, you can’t process it.

Today’s project will not lean toward either extreme. We’re working with under 100 rows, but there are enough columns and discrepancies in the data to show you what makes pandas the superior tool.

Getting Started

When it comes to pandas, the most beginner-friendly tool is Jupyter Notebook. It is an interactive web environment for running Python code (and other languages) right in the browser, which makes it a natural fit for libraries like pandas. Another great aspect of Jupyter is that it is open source, which means anyone can build an online solution on top of it. We will be using Colaboratory, Google's cloud spin on Jupyter.

In this project, we’re taking a look at the daily COVID-19 reports that summarize new cases and contain cumulative data for the entire pandemic. I’ll be using the December 17 report, although you can get a more recent one from Johns Hopkins University’s GitHub repository. Unless the file naming convention changes, you’ll be able to simply replace the date in the code below. If that fails, find the necessary file in the repository, click on it, press the Raw button, and keep the URL around for later use.

Now, let’s go to Google Colab and create a new notebook. It should look something like this.

Let’s fetch pandas, the data processing library for Python. Google Colab comes with a solid number of pre-installed libraries, including pandas. All we need to do is type import pandas as pd and then hit the run button to the left of the cell.
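
For reference, that first cell is just a single line:

import pandas as pd #load pandas under its conventional alias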

Reading the Data

Now, let’s introduce the data. Colab can pull smaller files (under 25MB) directly from GitHub, so you won’t even have to store the file locally. We will now introduce two variables: report_url will point Colab to the daily report, and df will store the data frame created with the pd.read_csv function.

report_url='https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/12-17-2020.csv' #change the date here to pull a more recent report
df=pd.read_csv(report_url)

Great, you’ve just imported the dataset. Nothing happened, though. That’s fine: nothing is displayed unless you request it. Let’s see the first 10 rows of the dataframe with the head method.

df.head(10)

Here are the field descriptions from JHU. Don’t spend too much time on them: most of the fields are either self-explanatory or irrelevant for our purposes.

  • Province_State - The name of the State within the USA.
  • Country_Region - The name of the Country (US).
  • Last_Update - The most recent date the file was pushed.
  • Lat - Latitude.
  • Long_ - Longitude.
  • Confirmed - Aggregated case count for the state.
  • Deaths - Aggregated death toll for the state.
  • Recovered - Aggregated Recovered case count for the state.
  • Active - Aggregated confirmed cases that have not been resolved (Active cases = total cases - total recovered - total deaths).
  • FIPS - Federal Information Processing Standards code that uniquely identifies counties within the USA.
  • Incident_Rate - cases per 100,000 persons.
  • Total_Test_Results - Total number of people who have been tested.
  • People_Hospitalized - Total number of people hospitalized. (Nullified on Aug 31 as most states won’t report cumulative hospitalization data)
  • Case_Fatality_Ratio - Number recorded deaths * 100/ Number confirmed cases.
  • UID - Unique Identifier for each row entry.
  • ISO3 - Officially assigned country code identifiers.
  • Testing_Rate - Total test results per 100,000 persons. The "total test results" are equal to "Total test results (Positive + Negative)" from COVID Tracking Project.
  • Hospitalization_Rate - US Hospitalization Rate (%): = Total number hospitalized / Number cases. The "Total number hospitalized" is the "Hospitalized – Cumulative" count from COVID Tracking Project. The "hospitalization rate" and "Total number hospitalized" are only presented for those states which provide cumulative hospital data. (Nullified on Aug 31 as most states won’t report cumulative hospitalization data)

You can see that a lot of columns do not help in answering our questions, especially the discontinued hospitalization data. Latitude and Longitude do not matter when looking at clearly defined US states. FIPS and UID are not particularly useful either. Removing them would condense the dataframe. In real-life scenarios, when working with hundreds of thousands of rows, deleting redundant or irrelevant columns noticeably improves performance for both developers and end users.

Some rows will be skewing the data significantly. The data frame includes US dependent territories as well as the Diamond Princess ship. The data from there can be incomplete, heavily affected by the low sample size, or both. Besides, our area of interest for today is the 50 states (and DC) only.

Finally, because of the reporting inconsistencies that you might have heard about elsewhere, some states provide only partial data. This project sticks to data that is available for all states; otherwise, there would be a few extra challenges to solve before trying to extract insights. A quick way to spot the gaps is shown below.
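
If you are curious, an optional one-liner counts the missing values in each column. It is only a sanity check and not required for the rest of the project.

df.isna().sum() #number of missing values per column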

Data Cleaning & Transformation

First things first, the excessive columns. Let’s get rid of Latitude, Longitude, FIPS, and UID, as well as People_Hospitalized and Hospitalization_Rate. The Country_Region column is also removed to avoid confusion later. We’ll be using the drop method. Whenever relevant, we’ll stick with the default inplace=False and assign the result back to df, because the inplace=True alternative brings more problems than it solves. Among other things, it does not work well when chaining several methods in the same line, which we will be doing later.

df=df.drop(columns=["Lat", "Long_", "FIPS", "UID", "People_Hospitalized", "Hospitalization_Rate", "Country_Region"], inplace=False)
df.head() #not specifying the number prints five rows

Great, our dataframe is now much neater (and takes a bit less time to load). Now, let’s continue by removing the inhabited territories that are not part of what’s colloquially referred to as the US. The ISO 3166 standard has already done the legwork for us here: although under US jurisdiction, these territories have their own ISO3 country codes.

A convenient way to remove rows by column value is the query method. In our case, it keeps only the rows that have USA in the country code column. Mind the syntax: the whole query is wrapped in a single set of ' quotes, and since the value we’re matching is not numeric, it needs its own quotation marks inside (double quotes here). We’re also chaining reset_index to remove the gaps in the index left by the removed rows. It gets the drop=True argument since we do not need a new column that stores the original indexes.

df=df.query('ISO3 == "USA"', inplace=False).reset_index(drop=True) #removing insular US territories
df

Now, we still have a couple more rows than the 51 that 50 states plus DC would have given us. Could it be that something slipped through? Yes, the Diamond Princess and Grand Princess ships are still in the dataframe. For the sake of variety, we will use a third approach to remove the remaining rows: the loc indexer. The usage is similar to query; on smaller dataframes, boolean indexing with loc tends to be a bit faster, since query carries some expression-parsing overhead.

Sticking to the logic from the previous step, we may as well keep all rows that do not have “Diamond Princess” or “Grand Princess” in the Province_State column. We will be once again resetting indexes to keep things tidy.

df=df.loc[(df['Province_State'] != "Grand Princess") & (df['Province_State'] != "Diamond Princess")].reset_index(drop=True) #removing ships
df.head(10) #Diamond Princess no longer shows up

Speaking of tidy, the dataframe (as expected) is a mix of whole and decimal numbers. The column types, however, do not all make sense. Recovered, Active, and Total_Test_Results are stored as decimal numbers even though they shouldn’t be: the first two columns count people, and you can’t quite administer half a test either. The seemingly nonsensical typing probably comes down to legacy support: until recently, you could not have integers and NaN (indicating missing data) in the same column of a pandas dataframe. Colab ships a recent enough version of pandas, with the nullable Int64 type, so our project does not suffer from this limitation.

Let’s convert those columns to whole numbers, which in Python terms means going from floating-point to integer values.

#changing floats into integers
df['Recovered']=df['Recovered'].astype("Int64")
df['Active']=df['Active'].astype("Int64")
df['Total_Test_Results']=df['Total_Test_Results'].astype("Int64")
df.head()
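
As a side note, the same conversion can be done in a single call by passing a column-to-type dictionary to astype; the result is identical, so it is purely a matter of taste.

df=df.astype({'Recovered': 'Int64', 'Active': 'Int64', 'Total_Test_Results': 'Int64'}) #one-call alternative to the three lines above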

That’s it for this stage. We now have a dataframe without irrelevant rows and columns. The figures are tidy without unwarranted decimals in columns that feature whole numbers. Now onto enriching data.

Enriching Data

One of the COVID era’s key metrics is the percent positive rate. These days, taking tests is becoming routine for business trips, emergency family visits, or checking whether you contracted the disease from a contact. The share of those tests that come back positive is a solid indicator of how likely one is to catch COVID from a random person. The metric is utilized by state governments: New York schools switch to remote learning at a 3% percent positive rate, while Iowa draws the line at 15%.

Percent positive rate is defined as the ratio of positive tests to all tests (multiplied by 100 to get a percentage). Let’s make a new column that reflects just that. You can use basic operators instead of functions for, well, basic operations. We will format the values as actual percentages later in the project, as displaying them that way now would require an awkward workaround down the line.

df['Percent_Positive_Rate']=(df['Confirmed']/df['Total_Test_Results'])
df.head(5)

The US is a federation, and more and more aspects of everyday life are regulated by the states themselves. In 2020, the effectiveness of the COVID response by Democratic and Republican governors was a key point of domestic political rhetoric. Our initial dataframe lacks information on which party each state’s governor belongs to, but we can add that ourselves. Let’s make a list of states with governors from the Democratic party (we only need one list for this task, so let’s make the shorter one).

Dem_States=['California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Hawaii', 'Illinois', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Michigan', 'Minnesota', 'Nevada', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'Oregon', 'Pennsylvania', 'Rhode Island', 'Virginia', 'Washington', 'Wisconsin'] #creating a list of states with Democratic governors
Dem_States

Now, let’s add a new column that denotes the governor’s party for each state (and DC). This requires importing the numpy library, which is a pillar for any serious math work in Python. The logic here is simple: if the state (i.e. the Province_State value) has a Democratic party governor (i.e. the value is present in the Dem_States list), our new column (Governor_Party) will say Democratic. Otherwise, it will say Republican.

import numpy as np
df['Governor_Party']=np.where(df['Province_State'].isin(Dem_States), 'Democratic', 'Republican')
df.head(5)

Time for analysis.

Data Analysis

Let me show you a few basic methods before we move to the main event. First, we can use the sort_values method to, well, do just that. We will be using the ascending=False argument to start from the highest number, and na_position='first' to put states without the relevant info at the top of the dataframe. The latter is the more useful option here: whenever a figure looks suspiciously high, remember that some states do not provide certain info at all. Instead of continuously re-sorting the same dataframe, I suggest making a new dataframe for each metric based on the original one.

Let’s work on Percent_Positive_Rate. It’s time to use the apply method to format the column values as percentages. In the "{:.2%}" format specifier, the % tells Python to multiply the value by 100 and print it with a percent sign, while .2 sets the number of decimal places (rounded) in the printed results.

df_ppr=df.sort_values(by='Percent_Positive_Rate', ascending=False, na_position='first') #new dataframe sorted by PPR
df_ppr['Percent_Positive_Rate']=df_ppr['Percent_Positive_Rate'].apply("{:.2%}".format) #converting decimals into percentages
df_ppr.head()

Other columns will be easier. Let’s take a look at tests per 100,000 people. 

df_tr=df.sort_values(by='Testing_Rate', ascending=False, na_position='first') #new dataframe sorted by testing
df_tr.head(20)

A good number of states have more than 100,000 tests per 100,000 people (an exact count is shown below). Does it mean that everyone has had at least one test? Not really: people get tested for a variety of reasons, and people with positive tests as well as confirmed contacts may be tested again. You can also see that, if we’re looking at the top 20 by testing per 100,000, the mix of Republican and Democratic states is pretty even. This is no surprise, as the original plan to develop tests under a single federal protocol led to significant test shortages across the country.
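
If you would rather have an exact count than eyeball the top 20, a quick optional one-liner does it (states with a missing Testing_Rate are simply not counted).

(df['Testing_Rate'] > 100000).sum() #how many states report more than 100,000 tests per 100,000 people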

Now, it would also be interesting to summarize some values. We can use the describe method to check the minimum, mean, standard deviation, and maximum values for each numeric column, as well as percentiles. Let’s use the original dataframe for that.

df.describe() #printing table with key numbers about the dataframe

Ignoring the count row (it shows the number of values), we can see that the US states are far from polar opposites when it comes to COVID response outcomes. The median (see the 50% row) fatality of 1.54% is close to the mean of 1.70%, and it’s more of the same with cases per 100,000 people (5,440 and 5,546) as well as testing (a median of 69,317 against a mean of 73,776).
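
If you only want the medians and means quoted above without scanning the whole describe table, you can pull them directly; this is an optional shortcut rather than a required step.

df[['Case_Fatality_Ratio', 'Incident_Rate', 'Testing_Rate']].agg(['median', 'mean']) #just the two summary rows we discussed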

Finally, the outgoing president of the United States famously said that more testing means more positive results, which means more COVID cases. He also suggested that less testing would mean fewer cases. Let’s see for ourselves with a correlation matrix. We start by creating a dataframe that holds only the information relevant for this exercise.

df_for_correlation=df[["Incident_Rate", "Total_Test_Results", "Case_Fatality_Ratio", "Testing_Rate", "Percent_Positive_Rate"]].copy() #smaller dataframe only with data relevant for correlation
df_for_correlation.head()

Next, let’s build a neat correlation matrix with the corr method and style it. The values range from -1 to 1: the closer a figure is to either end, the more strongly one metric increases or decreases when the other does. A negative number means the two metrics move in opposite directions.

df_for_correlation.corr().style.background_gradient(cmap="coolwarm")


The intersection of Percent_Positive_Rate and Testing_Rate gives us a correlation value of -0.63. It means that the more tests per 100,000 people a state runs, the lower the share of positive tests it gets. In other words, ramping up testing does not give you a proportional increase in confirmed cases (unless you were hardly testing at all). Consequently, reducing the number of tests would not give you a 1:1 reduction in positive cases. This is common sense, but common sense goes well with actual data.
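
If you only need that single figure rather than the whole colored matrix, you can extract it directly; this optional line returns the same number discussed above.

df_for_correlation.corr().loc['Percent_Positive_Rate', 'Testing_Rate'] #correlation between testing volume and the share of positive tests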

Final Words

This project is a simplified version of what a Data Scientist would do to investigate similar questions. A full-time specialist would put effort into removing outliers (such as states with sub-1% case fatality), potentially try to restore missing data (aggregating Recovered figures without California and Missouri is somewhat moot), and look deeper into the various metrics for Democratic and Republican states separately (a small taste of that is sketched below). We did, however, get quite a few insights from the data in a more efficient way than Excel allows. It’s wonderful when you can get and transform data to see things for yourself.
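
As a starting point for that party-by-party comparison, here is a hedged sketch using the dataframe we already built; it averages a few metrics per governor party and is meant as an invitation to dig further, not a conclusion.

df.groupby('Governor_Party')[['Case_Fatality_Ratio', 'Incident_Rate', 'Testing_Rate', 'Percent_Positive_Rate']].mean() #average metrics for Democratic vs Republican states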

One way to advance your Data Science skills is to study Python with us. Our real-time classes and LMS materials will set you up for back-end jobs and provide you with data foundations. The next group starts in March.