COVID-19 Open-Data
COVID-19 Open-Data attempts to assemble the largest COVID-19 epidemiological database, along with a powerful set of expansive covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more.
The details are in the project's GitHub repository.
It's easy to insert this data into ClickHouse...
The following commands were executed on a Production instance of ClickHouse Cloud. You can easily run them on a local install as well.
- Let's see what the data looks like:
The CSV file has 10 columns:
- Now let's view some of the rows:
Notice that the `url` function easily reads data from a CSV file:
- We will create a table now that we know what the data looks like:
- The following command inserts the entire dataset into the `covid19` table:
- The insert goes pretty quickly. Let's see how many rows were inserted:
- Let's see how many total cases of Covid-19 were recorded:
- You will notice the data has a lot of 0's for some dates, either weekends or days when numbers were not reported. We can use a window function to smooth out the daily averages of new cases:
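To make the smoothing idea concrete outside of ClickHouse, here is a minimal Python sketch of a per-location sliding-window average. The column names (`location_key`, `date`, `new_confirmed`), the window of 2 rows before and 2 rows after, and the sample numbers are all assumptions made for illustration, not values taken from the dataset or from the query used here.

```python
from collections import defaultdict


def smoothed_new_cases(rows, preceding=2, following=2):
    """Average new cases over a sliding window of rows, per location.

    rows: iterable of (location_key, date, new_confirmed) tuples
          (hypothetical column names; dates are ISO strings).
    Returns a list of (location_key, date, new_confirmed, window_average).
    """
    by_location = defaultdict(list)
    for location_key, date, new_confirmed in rows:
        by_location[location_key].append((date, new_confirmed))

    result = []
    for location_key in sorted(by_location):
        # Order by date within each location (ISO dates sort as strings).
        series = sorted(by_location[location_key])
        for i, (date, new_confirmed) in enumerate(series):
            # The window is clipped at the edges of the partition.
            window = series[max(0, i - preceding): i + following + 1]
            average = sum(value for _, value in window) / len(window)
            result.append((location_key, date, new_confirmed, average))
    return result


# Made-up sample: zeros stand in for days with no report.
sample = [
    ("US_DC", "2020-03-09", 10),
    ("US_DC", "2020-03-10", 0),
    ("US_DC", "2020-03-11", 20),
    ("US_DC", "2020-03-12", 0),
    ("US_DC", "2020-03-13", 30),
]
smoothed = smoothed_new_cases(sample)
```

In this sketch the zero days pick up the average of their neighbors, which is the effect the window function is meant to produce.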
- This query determines the latest values for each location. We can't use `max(date)` because not all countries reported every day, so we grab the last row using `ROW_NUMBER`:
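The reasoning can be sketched in a few lines of Python: number the rows of each location from newest to oldest and keep row number 1. The column names (`location_key`, `date`, `cumulative_confirmed`), the `FR` key, and the sample rows are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def latest_row_per_location(rows):
    """Return the most recent row for each location.

    rows: iterable of (location_key, date, cumulative_confirmed) tuples
          (hypothetical column names; dates are ISO strings).
    """
    by_location = defaultdict(list)
    for row in rows:
        by_location[row[0]].append(row)

    latest = {}
    for location_key, group in by_location.items():
        # Number the rows from newest to oldest, then keep row number 1.
        ranked = sorted(group, key=lambda r: r[1], reverse=True)
        numbered = enumerate(ranked, start=1)
        latest[location_key] = next(
            row for row_number, row in numbered if row_number == 1
        )
    return latest


# Made-up sample: the two locations stop reporting on different days.
sample = [
    ("US_DC", "2022-09-13", 100),
    ("US_DC", "2022-09-14", 105),
    ("FR", "2022-09-11", 900),
    ("FR", "2022-09-12", 950),
]
latest = latest_row_per_location(sample)

# Filtering on the single global maximum date would lose the FR rows.
global_max_date = max(row[1] for row in sample)
on_global_max = [row for row in sample if row[1] == global_max_date]
```

Here `latest` holds one row per location, while `on_global_max` contains only the `US_DC` row, which is why a plain maximum over the date is not enough.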
- We can use `lagInFrame` to determine the `LAG` of new cases each day. In this query we filter by the `US_DC` location:
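As a rough picture of what a lag computes, this Python sketch pairs each day's value with the value from the previous row. The helper name `with_lag`, the default of 0 for the first row, and the daily counts are assumptions of the sketch, not output from the dataset.

```python
def with_lag(values, offset=1, default=0):
    """Pair each value with the value `offset` rows earlier, or `default`."""
    return [
        (value, values[i - offset] if i >= offset else default)
        for i, value in enumerate(values)
    ]


# Made-up daily new-case counts for one location, already ordered by date.
new_cases = [5, 8, 8, 3]
lagged = with_lag(new_cases)
```

Each pair is (new cases, previous day's new cases), so the difference between the two gives the day-over-day change.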
The response looks like:
- This query calculates the percentage of change in new cases each day, and includes a simple `increase` or `decrease` column in the result set:
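The arithmetic behind such a query is small enough to sketch in Python. The function name `daily_change`, the `no change` label, the choice to return `None` when the previous day is zero, and the sample counts are all assumptions made for this illustration.

```python
def daily_change(values):
    """For each day after the first, return
    (new_cases, previous_cases, percent_change, trend)."""
    rows = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            # Percent change is undefined when the previous day is zero.
            percent = None
        else:
            percent = round((current - previous) / previous * 100, 2)

        if current > previous:
            trend = "increase"
        elif current < previous:
            trend = "decrease"
        else:
            trend = "no change"
        rows.append((current, previous, percent, trend))
    return rows


# Made-up daily new-case counts for one location, already ordered by date.
changes = daily_change([10, 15, 15, 12])
```

Going from 10 to 15 cases is a 50% increase, and going from 15 to 12 is a 20% decrease, so the trend column is simply the sign of the difference.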
The results look like:
As mentioned in the GitHub repo, the dataset is no longer updated as of September 15, 2022.