Advanced Data Visualization on a Covid19 database

First post of this blog!

It is about data visualization: we will explore how to make dynamic charts (see below).

Figure 1 – Evolution of active cases
Figure 2 – Evolution of death cases

We will extract the data we need from a Covid19 database, available in a GitHub repository managed by a Johns Hopkins University team.

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data

For each country and each day, this database stores the cumulative number of confirmed cases, deaths, and recoveries linked to Covid19.

The main purpose here is not the data quality (which can be questionable!), but tips for displaying the data dynamically, to complement static charts.

Step 1 – Download data

First we import the main libraries we will be using:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from matplotlib.offsetbox import AnchoredText
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from IPython.display import HTML

We can load the files:

# URL links

url_confirmed = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
url_recovered = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"
url_death = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

# Download csv

# on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
data_confirmed = pd.read_csv(url_confirmed, on_bad_lines="skip")
data_recovered = pd.read_csv(url_recovered, on_bad_lines="skip")
data_death = pd.read_csv(url_death, on_bad_lines="skip")

Step 2 – Transform data

The three dataframes share the same layout, which looks like this:
(sample)

    Province/State  Country/Region  Lat       Long     1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  1/28/20  1/29/20
0   Afghanistan     Afghanistan     33        65       0        0        0        0        0        0        0        0
1   Albania         Albania         41.1533   20.1683  0        0        0        0        0        0        0        0
2   Algeria         Algeria         28.0339   1.6596   0        0        0        0        0        0        0        0
3   Andorra         Andorra         42.5063   1.5218   0        0        0        0        0        0        0        0
4   Angola          Angola          -11.2027  17.8739  0        0        0        0        0        0        0        0

We now reshape the data into a friendlier layout, with one row per date and one column per country. The example below uses the recovery dataframe; the same logic gives data_confirmed_hist and data_death_hist:

all_countries = data_recovered['Country/Region'].unique() # list all countries
all_index = pd.to_datetime(data_recovered.columns[4:]) # list all dates
data_recovered_hist = pd.DataFrame(columns=all_countries, index=all_index) # create empty DF

# iterate through countries to compute aggregated timeseries
for country in all_countries:
    country_rows = data_recovered[data_recovered['Country/Region'] == country]
    # sum the provinces of the country; .values skips index alignment between
    # the string date labels and the DatetimeIndex of data_recovered_hist
    data_recovered_hist[country] = country_rows.iloc[:, 4:].astype(float).sum().values

Now we have something like this:
(sample)

                     Afghanistan  Albania  Algeria  Andorra  Angola
2020-01-22 00:00:00  0            0        0        0        0
2020-01-23 00:00:00  0            0        0        0        0
2020-01-24 00:00:00  0            0        0        0        0
2020-01-25 00:00:00  0            0        0        0        0
2020-01-26 00:00:00  0            0        0        0        0
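
The per-country aggregation can also be written without an explicit loop. The sketch below is one possible alternative, not the code used in this post: it runs on a small toy dataframe whose province rows and counts are invented for illustration, and the names toy and toy_hist are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the JHU layout (values are invented for illustration)
toy = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "France"],
    "Lat": [30.97, 40.18, 46.23],
    "Long": [112.27, 116.41, 2.21],
    "1/22/20": [28, 0, 0],
    "1/23/20": [30, 1, 2],
})

# Everything after the first four columns is a date column
date_cols = list(toy.columns[4:])

# Sum all provinces of a country, then transpose so dates become the rows
toy_hist = toy.groupby("Country/Region")[date_cols].sum().astype(float).T
toy_hist.index = pd.to_datetime(toy_hist.index, format="%m/%d/%y")

print(toy_hist)
```

With groupby the columns come out sorted alphabetically rather than in file order, which does not matter for the charts built later.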

We can compute active cases (confirmed cases minus deaths and recoveries):

data_active_hist = data_confirmed_hist - data_death_hist - data_recovered_hist
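
One detail worth knowing about this subtraction: pandas aligns the frames on both dates and country names, so a country missing from one of the files would end up as NaN. The sketch below illustrates this on invented numbers (the frames confirmed, deaths and recovered are hypothetical stand-ins for the three history dataframes) and shows one possible workaround, assuming a missing country should count as zero.

```python
import pandas as pd

dates = pd.to_datetime(["2020-03-01", "2020-03-02"])

# Invented numbers; country "B" is missing from the recovered frame on purpose
confirmed = pd.DataFrame({"A": [10.0, 15.0], "B": [4.0, 6.0]}, index=dates)
deaths = pd.DataFrame({"A": [1.0, 2.0], "B": [0.0, 1.0]}, index=dates)
recovered = pd.DataFrame({"A": [3.0, 5.0]}, index=dates)

# Plain subtraction aligns on dates and country names; unmatched labels give NaN
active = confirmed - deaths - recovered

# Possible workaround: treat a country missing from a frame as zero
active_filled = confirmed.sub(deaths, fill_value=0).sub(recovered, fill_value=0)

print(active)
print(active_filled)
```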

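We then gather the coordinates of each country in a new dataframe. When a country is split into several provinces, we keep the coordinates of the province with the most deaths on the latest date: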

data_coord = pd.DataFrame(index=data_active_hist.columns,columns=['latitude','longitude'])
for cnty in data_coord.index:
    # keep the row (province) with the most deaths on the latest date
    top_row = data_death[data_death['Country/Region'] == cnty].sort_values(data_death.columns[-1],ascending=False).iloc[0]
    data_coord.loc[cnty] = [top_row['Lat'], top_row['Long']]
data_coord = data_coord.astype(float) # the empty DF starts with object dtype

(sample)

             latitude  longitude
Afghanistan  33        65
Albania      41.1533   20.1683
Algeria      28.0339   1.6596
Andorra      42.5063   1.5218
Angola       -11.2027  17.8739
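
The same lookup can be sketched without a loop by sorting once and keeping the first row of each country. This is an alternative under the same assumption (the province with the most deaths on the latest date gives the coordinates); the toy dataframe and the names toy_death and toy_coord are invented for illustration.

```python
import pandas as pd

# Toy frame mimicking the deaths file (values are invented for illustration)
toy_death = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "France"],
    "Lat": [30.97, 40.18, 46.23],
    "Long": [112.27, 116.41, 2.21],
    "1/22/20": [17, 0, 0],
    "1/23/20": [17, 1, 0],
})

last_date = toy_death.columns[-1]

toy_coord = (
    toy_death.sort_values(last_date, ascending=False)  # rows with the most deaths first
    .drop_duplicates("Country/Region")                 # keep the first row per country
    .set_index("Country/Region")[["Lat", "Long"]]
    .rename(columns={"Lat": "latitude", "Long": "longitude"})
)

print(toy_coord)
```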

Step 3 – Generate charts

Figure 1 – The world map

We are going to use the Python library Cartopy, which handles geographic charts: https://scitools.org.uk/cartopy/docs/latest/index.html

With this library we can have a world map, with coast lines or country borders already plotted. We only have to provide some latitude and longitude data to add custom markers.

Here we want to plot the evolution of “declared” active cases. To avoid a messy map, we mark with red dots only the countries that have at least 1000 active cases.
The size of each red dot depends on the number of active cases, as in the sample table below:

Size of node  Lower bound  Upper bound
0             0            1000
4             1000         5000
8             5000         10000
We define an animate() function that marks, for each date, every country with a dot sized according to the “size” mapping. This function is called for each frame once we create our animation.FuncAnimation().

The code is below:


time_window = len(all_index) #Number of frames

fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.Robinson())
    
ax.set_global()
ax.stock_img()
ax.coastlines()
ax.add_feature(cfeature.BORDERS)
text = AnchoredText(format(data_active_hist.iloc[0].name.strftime('%Y/%m/%d')), loc=4, prop={'size': 12}, frameon=True)
ax.add_artist(text)
scat = []

def animate(i):
    global scat
    for cont in ax.containers:
        cont.remove()
    merged_data = data_coord.merge(pd.Series(data_active_hist.iloc[i], name="active"), left_index=True, right_index=True)
    conditions = [(merged_data['active'] < 1000), 
              (merged_data['active'] >= 1000) & (merged_data['active'] < 5000), 
              (merged_data['active'] >= 5000) & (merged_data['active'] < 10000), 
              (merged_data['active'] >= 10000) & (merged_data['active'] < 20000), 
              (merged_data['active'] >= 20000) & (merged_data['active'] < 40000), 
              (merged_data['active'] >= 40000) & (merged_data['active'] < 80000), 
              (merged_data['active'] >= 80000) & (merged_data['active'] < 160000), 
              (merged_data['active'] >= 160000)]
    choices = [0, 4, 8, 16, 32, 64, 128, 256]
    merged_data['marker_size'] = np.select(conditions, choices, default=0)

    # scatter plot
    if scat:
        scat.remove()
    scat = ax.scatter(merged_data['longitude'], merged_data['latitude'], marker='o', c='red', s=merged_data['marker_size'], transform=ccrs.Geodetic())
    text.txt.set_text(format(data_active_hist.iloc[i].name.strftime('%Y/%m/%d'))) # change date in the legend
    return text,

anim=animation.FuncAnimation(fig, animate, repeat=False, 
                             blit=False, frames=time_window, interval=300)

plt.close(fig)

# we can either save it as a file (the mp4 format requires ffmpeg to be installed)
anim.save('covid_active_dynamic_map.mp4')

# or display it in a notebook using HTML
#HTML(anim.to_jshtml())

Figure 1 – Evolution of active cases

Figure 2 – The horizontal bar plot

We will draw a simple horizontal bar plot of the 10 most affected countries, ranked by total deaths “officially” linked to Covid19.

The idea here is to assign each country a unique color, for better visualization. Otherwise, a country’s color would change every time its ranking changes.
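
To illustrate the idea, here is a minimal sketch of a fixed country-to-color mapping built with a dictionary. The country list, the seed and the names color_of, frame_1 and frame_2 are assumptions made for this example; the code of this post stores the colors in a pandas Series instead.

```python
import numpy as np

# Hypothetical country list; in the post this would be all_countries
countries = ["China", "Italy", "Spain", "France"]

# A seeded generator makes the colors reproducible from one run to the next
rng = np.random.default_rng(seed=42)
color_of = {name: tuple(rng.random(3)) for name in countries}

# The ranking changes from frame to frame, but each lookup returns the same color
frame_1 = ["China", "Italy"]
frame_2 = ["Italy", "China", "Spain"]
colors_1 = [color_of[name] for name in frame_1]
colors_2 = [color_of[name] for name in frame_2]

print(colors_1[0] == colors_2[1])
```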

As for Figure 1, we use the animation library: for each frame we look up the 10 most affected countries, then plot them.

Remark: we add a while loop that checks whether we have fewer than 10 countries (at the beginning of the crisis, there were no deaths except in China). If so, we pad with empty data to keep the bar plot at a constant size.
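
The selection and padding step can be sketched as a small standalone function. The helper name top_n_padded and the totals below are invented for illustration; the logic follows the description above, under the assumption that padding rows use an empty label and a zero value.

```python
import pandas as pd

def top_n_padded(totals, n=10):
    """Return the n largest positive values in ascending order, padded to length n."""
    top = totals[totals > 0].sort_values(ascending=False).head(n)
    top = top.sort_values(ascending=True)  # barh draws the first row at the bottom
    n_missing = n - len(top)
    if n_missing > 0:
        # Empty labels with zero width keep the number of bars constant
        padding = pd.Series(0.0, index=[""] * n_missing)
        top = pd.concat([top, padding])
    return top

# Invented totals for illustration
totals = pd.Series({"China": 17.0, "Italy": 0.0, "Spain": 3.0})
result = top_n_padded(totals, n=4)
print(result)
```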

color_countries = [] # we attribute a unique colour for each country
for cnty in all_countries:
    r = np.random.rand()
    b = np.random.rand()
    g = np.random.rand()
    color_countries.append([r,g,b])
color_countries = pd.Series(color_countries)

time_window = len(all_index) #Number of frames

fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot()

def init_animate():
    ax.clear()
    
def animate(i):
    for bar in list(ax.containers): # iterate over a copy, since remove() mutates ax.containers
        bar.remove()
    spot_vector = data_death_hist.iloc[i]
    spot_vector = spot_vector[spot_vector>0].sort_values(ascending=False).head(10)
    spot_vector = spot_vector.sort_values(ascending=True)
    # look up each country's fixed color by its position in all_countries
    this_colors = color_countries[[list(all_countries).index(name) for name in spot_vector.index]]
    
    # pad with empty bars (Series.append was removed in pandas 2.0, so we use pd.concat)
    while len(spot_vector) < 10:
        spot_vector = pd.concat([spot_vector, pd.Series(index=[""],data=0)])
        this_colors = pd.concat([this_colors, pd.Series(index=[""],data=[[0,0,0]])])
    yticks = np.arange(len(spot_vector))    
    

    ax.barh(y=yticks, width=spot_vector, color=this_colors, tick_label=spot_vector.index)
    ax.set_title(all_index[i].strftime('%Y/%m/%d'))

anim=animation.FuncAnimation(fig, animate, repeat=False,init_func=init_animate, 
                             blit=False, frames=time_window, interval=300)

plt.close(fig)

# we can either save it as a file
anim.save('covid_death_dynamic_barplot.gif', writer='pillow')

# or display it in a notebook using HTML
#HTML(anim.to_jshtml())

Figure 2 – Evolution of death cases

That’s it!

These were two examples of dynamic charts, which complement static charts for good data visualization.