A personal account of my experience - the thought process, challenges, and solution approach is documented in this post. The idea is to express that data analysis/science can be tricky, and not everyone gets it right the first time. It takes multiple iterations to come up with something decent while navigating the challenges faced.
The objective of this exercise is to analyze the COVID-19 cases and report the findings using MS Excel.
The data for the same can be downloaded from Tableau Website.
The data source has approximately 114,000 rows of COVID-19 cases all over the world. The following data is provided - the date, country, region, the case status (active, confirmed, recovered, deaths) information, and the location details.
I have analyzed the data with Microsoft Excel and have provided the insights.
Because, at this point, I know only Excel (40%, I presume). But, I thought that it was enough to do this exercise. Speaking rationally, the choice of the tool doesn’t matter. What matters is the analysis!
I took stock of all the Excel stuff I knew and whether it would be sufficient to do my analysis!? I didn’t know yet.
The following features of Excel may be used to make analysis easier was my guess.
In retrospect, I was grossly wrong on my Excel skills, and I had to do some learning along the way to achieve what I wanted.
At the start of the analysis, I did not have a definitive list of what metrics I would be reporting on - as I just had a raw dataset.
The possible things that came up rationally are to report on specific vital metrics that are easy to communicate like
Time series data is the complex one here. Why complex? Anything with dates’ - my understanding is that it can be complicated.
Geological location - well, I don’t have much experience in plotting data on maps. But, apart from that, the rest is easy to report.
Will the analysis be interesting? Will I capture the essence of the dataset? Not sure yet!
My understanding is that data analysis does not have a definitive ending. What metrics to report on depends on the amount of time we have at hand and what parameters the business will be interested in.
If we think that way, the top three in the above list do not add much business value plus point 5 is cosmetic. Time series analysis is the one the business will be much interested in as they can see the trend and can make decisions if such diseases spread next time. I presume China did this as they played along with the SARS playbook, which they drafted when it hit China in 2003.
I didn’t overthink at this point as to what metrics to report unless I look into the data and understand it. I had to get the ball rolling as overthinking can lead to a thought-loop, and I will end up where I have started. So, I started with the first step - the data clean-up and understanding process.
The first thing to notice was that the data needed a clean-up. The first column date had two different formats when imported into Excel. I don’t know why this happens in Excel.
Not a good start! Ugh!
Upon Googling, I found that this is a known problem in Excel (for large datasets?).
The first challenge - I have to bring it to a consistent format to present a meaningful analysis.
I had to see further which of the two formats needed a clean-up. Upon applying the filter, I got to see that
mm/dd/yyyy was not correctly picked up, and it required clean-up.
Understanding the data is critical - what information is captured in the spreadsheet and how to interpret it!?
First of all, the date column was not sorted. I had to sort on the date - Column A to set the date in ascending order. This would give me the ability to see how the total cases were captured. Is it incremental or cumulative cases reported per day?
Then, I did a quick pivot table analysis to check the cases of a few countries, one being Afghanistan.
As per my analysis with a quick pivot table, Afghanistan suggested a total of 418 cases.
To understand if the above numbers were correct, I cross-verified with the coronavirus numbers published by John Hopkins University’s Corona Dashboard
What I found was that the total cases for Afghanistan were only 40! (see image below - as of date March 23, 2020)
The data-set which I downloaded was also accurate, but my interpretation of the data was wrong.
The cases were captured by date, and the latest date across any country was the cumulative total on that date.
When I did a pivot table and computed by “Sum of total cases”, I got a grossly wrong number. Instead, the total cases have to be set to “Max of total cases” in the pivot table to get the correct number.
I did that, and I could see that the data was in-line with the data provided in the John Hopkins University website. But there was one more problem!
When I took the “Max of total cases,” it would not work for countries like China, the US, France, etc., as they have multiple districts and province numbers reported separately in each row.
What do I mean? For countries like America, it had multiple rows - one for each State in the US like Arizona, Washington, New York, etc. For countries like India, Japan, etc., there was just one row for the entire country.
So, if I take the China data and compute the data based on “Max of total cases,” it would only report numbers from the Wuhan province as it had the most confirmed cases. For most countries like India, Japan, etc., I had to go with the “Max of total cases” route and other countries like China, the US, etc. I had to compute based on “Sum of total cases”.
This data pre-processing took hours as I didn’t expect the pattern of data to change based on countries. In the end, I found it and manually tweaked the data. It was a tedious process, but I had to do and correct it.
Lesson learned: Always double-check the data.
Concerning data, one more clean up was necessary. The total number of cases were broken down into categories like “Confirmed”, “Active”, “Recovered” and “Deaths” and all these numbers were reported in “Rows”. [See image below]
What I need was - the data to be represented in columns (like the image below) so that aggregating the numbers would be more comfortable.
To get it in this format, I had to do a pivot table on the raw data, and for each country, I had to translate data in this format. I had to break this task into two days as the dataset had 168 countries, and each country had approximately 80 rows of data.
Once I had this data cleaned and prepared, I was ready to launch the analysis.
Give me six hours to cut a tree, and I will spend the first four to sharpen the axe
― Abraham Lincoln
Anybody can cut a tree when they have enough time. But, the challenge is to cut a tree in six hours, which not everyone can do. The approach has to be different.
My thought process at this point was - how can I efficiently analyze the data!? The dataset is huge; 110K rows, which I had aggregated to 9700+ rows for 160+ countries.
In short, I thought, how can I sharpen the axe so that I can cut down the tree in minimal time. I have a week to complete the analysis.
I can take the traditional route and make one worksheet for each country, and each sheet can have $n$ number of pivot tables and graphs, which can be analyzed further.
I started with this approach and whenever I saved the file, it took almost 10 seconds to save. The reason behind is the master data, cleaned up data was all in one file. I had to move the master data to another sheet to save some time plus the size of the file itself. But that didn’t help much.
Since I had multiple sheets and multiple graphs, I had to scramble across sheets to find the required data, and in the process, I got mad. Yeah! Too much data can make you crazy - at times. There has to be a better way!
What if I can create one master sheet and have it as a base template? When I change a country from a drop-down list, the dataset has to change, and the graph should refresh automatically. Sounded better!
Well, there was a problem! I didn’t know how to do it. I had never worked with Excel dynamic ranges or drop-down lists. I began my research and learned a few topics and watched some videos. (References at the end of this article.)
And, when you want something, all the universe conspires in helping you to achieve it.
― Paulo Coelho
One thing leads to another, that thing leads to another, and finally, I learned the following functions and features in Excel that helped me get it done.
All these topics were learned as and when required in preparing the holy grail - the Excel dashboard.
I prepared a time-series chart in a separate sheet to show the confirmed cases, active and recovered cases over time. Also, plotted the map for the specific country (Excel makes it very easy), and added another time series to show the recovery percentage and death percentage. I added another chart to show the next ten countries in the list and how the selected country is doing with respect to its counterparts, list countries based on their recovery and death percentage - who is doing good? And who is lagging?
And yes! The final thing - Add a list of countries and make all the above charts dynamic. Woohoo!
The first iteration is the one below!
Clicking each country refreshed the graph, highlighted the specific country in the Recovery % graph and the Death % graph (highlighted in Cyan and Red, respectively, in the above image).
I played around with it and saw what analysis I could get from it. Well! The fundamental analysis was there - what I needed!
In three words, I can sum up everything I have learned about life: it goes on!
― Robert Frost
Like the quote said, life moved on, and a curious question popped up. I wanted to narrow down to a particular period to see how many people were affected in a specific period.
This would allow me to see how pre-lockdown and post-lockdown trends vary and answer whether the lockdown was effective or not!?
The current dashboard cannot answer such questions as it didn’t have a provision to narrow down a time-period say e.g., From February 21 to March 07. This sort of ability can help me break down the data into different time-periods like weeks, months, and analyze.
I knew the solution would be to give some sort of date range as input - a start date and end date.
I had to introduce a couple of combo boxes, and based on the from date and to date, the results should also filter and give me numbers for that week.
I chose to keep the current graph but added one more chart to the bottom to show the period’s trend and numbers. With some combo boxes,
OFFSET functions, and the Name Manager feature, I was able to do it.
The second iteration of the dashboard looked like the one below.
I changed the layout to accommodate more graphs.
1 is the From date,
2 is the To date. Clicking each of these will change the chart
5 accordingly. Countries list
3 moved to the top.
6 is the numbers on the From date and To date, respectively.
8 represents Death % and Recovery %.
Not bad at all! I took it for a test, and everything seems to be in place.
To improve is to change. To be perfect is to change often.
― Sir Winston Churchill
Again, I stumbled upon a thought. I am not doing this for myself. I am doing this for everyone who does not have the time to clean the data and do the analysis.
Can I add a visual-cue to the graph to indicate the From date and To date so that the users know which window they are looking into? So, I decided to introduce a couple of vertical lines into the graph to show the dates.
Also, I wanted to add a line to indicate when the first infected case was reported? This would give me an indication as to when the infection started and how much time the countries had to make decisions before the disease grew exponentially.
I tested it, and it worked fine. Now, I can find out when the infection started and narrow it down to the period when the first death occurred. I can then find (from Google) when the lockdown period was announced and describe the trends before and after - and how much lead-time the countries had before going on lockdown and what they could have done differently to minimize the infection.
Here is a gif that tells you how Excel-based COVID19 Dashboard looks like in action.
So, I was happy with the result - not only because of what I could churn out, but also I learned so much in the process. It was time-consuming but worth it!
Now you may ask where is my analysis!?
Here is the generic analysis - the one that businesses would expect - one-liners mostly. A detailed report would make a separate post.
This exercise was time-consuming yet, rewarding. On the learning front, some key take-aways
Well, you can download the COVID19 Excel dashboard (link below) and can come up with your analysis. That way, this is open-ended and allows you to explore more and come up with analysis. In short, this is my version of the sharpened axe you can take and go for cutting the tree!
If you have any feedback on how this can be improved, let me know. Thanks for reading!
Be socially responsible, practice social-distancing, and stay safe!
The Excel dashboard discussed in this post can be downloaded from the following link.
Thanks for reading! If you enjoyed reading this article, consider subscribing to our newsletter below. We will notify you on the latest content published on datasciencia.