Currently in development, launching early 2021.
(DEPARTAMENTO, PROVINCIA & DISTRITO)
|Department (Admin 1) is specified in 100% of cases. 5.203% of entries for Province (Admin 2) and District (Admin 3) are ‘In investigation’|
|Date of Confirmation|
|Type of Test (METODOX)||> 100%|
In the following case study we take a deep dive into COVID-19 line-list data from Peru, one of the >130 countries included in the Global.health platform. The case study covers the provenance of the data, the transformations applied to fit the Global.health schema, and the key characteristics and limitations of the data. Although it addresses only one country, the design of Global.health lets users quickly ask similar questions about any country included in the platform, and we will discuss how users can conduct such an investigation on their own. Stay tuned for more of these data deep dives.
What is the provenance of the data?
The National Institute of Health and National Center for Epidemiology, Prevention and Control of Diseases (MINSA) of Peru collects and shares individual-level case data for COVID-19 through their official government website under an Open Data Commons Attribution License. Their platform was launched on May 18th, 2020, and data are usually updated daily. The metadata provided for each patient have been largely consistent since the first reported case on March 29th, 2020.
Where can I find the original data and how is the data transformed?
Raw data can be downloaded from the link here, and details of the parser that transforms the data to our standard schema can be found here. Specifically, we add centroids (latitude and longitude) using a manual lookup table which is provided by the INEI (Instituto Nacional de Estadística e Informática). We ingest this database once per day and check for any updates to previously ingested cases going back one month.
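The one-month look-back described above can be illustrated with a minimal sketch. The function name `in_refresh_window` and the choice of 30 days to stand for "one month" are assumptions made for illustration; this is not the actual parser code.

```python
from datetime import date, timedelta

# Assumption: "one month" is approximated as 30 days.
LOOKBACK_DAYS = 30


def in_refresh_window(confirmation_date: date, today: date) -> bool:
    """Return True if a case is recent enough to be re-checked for updates."""
    window_start = today - timedelta(days=LOOKBACK_DAYS)
    return window_start <= confirmation_date <= today


# Example: on a May 26th, 2021 ingestion, a case confirmed on May 10th
# would be re-checked, while one confirmed on March 1st would not.
ingestion_day = date(2021, 5, 26)
recent = in_refresh_window(date(2021, 5, 10), ingestion_day)
old = in_refresh_window(date(2021, 3, 1), ingestion_day)
```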
How complete is the data compared to aggregated data sources?
It appears that the individual-level case data provided by the Peruvian Ministry of Health are almost complete when compared to official data provided to the World Health Organization (WHO). For example, on May 26th, 2021 the dataset included 1,925,289 records, which is >99% of the cases reported by WHO on that day (1,926,923). However, we note that details on the patient’s outcome (e.g., recovered) are not included in the line list.
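As a quick check of the completeness figure, the two counts quoted above for May 26th, 2021 can be compared directly. The variable names are illustrative only.

```python
# Record counts quoted in the text for May 26th, 2021.
line_list_records = 1_925_289
who_reported_cases = 1_926_923

# Share of WHO-reported cases that appear in the line list.
completeness = line_list_records / who_reported_cases
# Absolute gap between the two sources.
missing = who_reported_cases - line_list_records

print(f"Completeness: {completeness:.2%}")
print(f"Records missing relative to WHO: {missing:,}")
```

The gap is 1,634 records, which is consistent with the ">99%" statement.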
Key characteristics and limitations of this database:
The line list dataset from Peru provides 8 metadata fields for each patient. These include ‘Universally unique identifiers (UUID)’, which unfortunately change through time (i.e., with every update of the dataset there are new UUIDs which do not correspond to those of previous days). Geographic metadata are provided and include the Department, Province & District (DEPARTAMENTO, PROVINCIA & DISTRITO) where the case was reported. The only date provided is the date of confirmation of SARS-CoV-2 infection, together with the type of test used (PCR, LFT or serological; no other information is given). Other metadata include the exact age and sex (male, female) of the patient (see table above). These fields are consistently filled out (>99% of the time), making them very useful for downstream analyses.
There were some reporting changes through time. For instance, as of June 2020 the database stopped including positive tests identified by IPRESSs (Instituciones Prestadoras de Salud – Health Service Provider Institutions), which were run for private companies to enable workers to return to work.
Further, there were some small geocoding issues, which we detail below. The geographic metadata of each case (Department, Province & District) are often provided in a non-standard way in this database, such that they do not always exactly match the official place name.
For example, place names are abbreviated:
“SAN FRANCISCO DE ASIS DE YARUSYACAN, PASCO, PASCO” is abbreviated as
“SAN FCO DE ASIS DE YARUSYACAN, PASCO, PASCO”
By inspecting commonly occurring place name mismatches, we augmented our lookup table with a number of misspelled or abbreviated place names, to ensure that more cases could be correctly geocoded and therefore added to our database.
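The idea of augmenting a lookup table with known variants can be sketched as follows. This is a minimal illustration and not the actual parser: the names `CENTROIDS`, `ALIASES`, `normalize` and `geocode` are hypothetical, and the coordinates are placeholders rather than real INEI centroids.

```python
from typing import Optional, Tuple

PlaceKey = Tuple[str, str, str]  # (district, province, department)

# Hypothetical lookup table. The coordinates are placeholders,
# not real INEI centroids.
CENTROIDS = {
    ("SAN FRANCISCO DE ASIS DE YARUSYACAN", "PASCO", "PASCO"): (0.0, 0.0),
}

# Known abbreviated or misspelled names, mapped to the official name.
ALIASES = {
    ("SAN FCO DE ASIS DE YARUSYACAN", "PASCO", "PASCO"):
        ("SAN FRANCISCO DE ASIS DE YARUSYACAN", "PASCO", "PASCO"),
}


def normalize(name: str) -> str:
    """Upper-case a place name and collapse repeated whitespace."""
    return " ".join(name.upper().split())


def geocode(district: str, province: str,
            department: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude), or None if the lookup fails."""
    key: PlaceKey = (normalize(district), normalize(province),
                     normalize(department))
    # Replace a known variant with its official spelling before the lookup.
    key = ALIASES.get(key, key)
    return CENTROIDS.get(key)
```

With this layout, each newly discovered variant only requires a new entry in the alias table; records whose names still do not match return `None` and can be counted as geocoding failures.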
During one recent ingestion (May 26th, 2021), 1,913,719 cases were ingested out of a total of 1,925,289. We identified one error where age validation failed (age = -79). Of the remaining 11,320 cases that did not ingest, 2,023 did not contain information about ‘FECHA_RESULTADO’ and 9,297 failed the geocoding lookup and were not added to the database. We also note that there are 23 cases with age >120 which were excluded.
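The kinds of checks described above (missing confirmation date, implausible age) could look roughly like the sketch below. It is written under stated assumptions and is not the actual ingestion code: the age column name `EDAD` and the `YYYYMMDD` date format are assumptions, and `validate_record` is a hypothetical helper. The upper age bound of 120 follows the exclusion mentioned in the text.

```python
from datetime import datetime
from typing import Optional

MAX_AGE = 120  # cases with age > 120 are excluded, per the text


def validate_record(record: dict) -> Optional[str]:
    """Return a rejection reason, or None if the record passes all checks."""
    # Assumption: the confirmation date is stored as a YYYYMMDD string.
    date_text = (record.get("FECHA_RESULTADO") or "").strip()
    if not date_text:
        return "missing FECHA_RESULTADO"
    try:
        datetime.strptime(date_text, "%Y%m%d")
    except ValueError:
        return "unparseable FECHA_RESULTADO"

    # Assumption: the age column is named EDAD.
    try:
        age = int(record.get("EDAD", ""))
    except (TypeError, ValueError):
        return "unparseable age"
    if age < 0 or age > MAX_AGE:
        return "age out of range"
    return None


# Example: a record with age = -79 is rejected by the age check.
reason = validate_record({"FECHA_RESULTADO": "20210526", "EDAD": "-79"})
```

Returning a reason string rather than a boolean makes it straightforward to tally how many records were rejected for each cause.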
June spike in cases:
Between May 23rd and June 9th, an additional 51,315 cases were added to the database, creating a spike in the daily case rate. A small subset of these had Date Confirmed (FECHA RESULTADO) in 2020 (indicated with black crosses in the Figure below). This spike in daily cases coincides with reports of a large increase in COVID-19 related deaths counted in Peru, due to a revision in the case definition of COVID-19 related deaths (see also: https://www.bmj.com/content/373/bmj.n1442).
Figure 1: Daily COVID-19 cases in Peru as reported by the Peruvian Ministry of Health (MINSA), whose data are updated daily. Red bars and black crosses show daily cases reported in the June 9th, 2021 dataset that were not reported in the dataset downloaded on May 27th, 2021.
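Because the UUIDs change with every release, two snapshots cannot be matched record by record. One possible way to find the newly added cases, sketched here as an assumption rather than as the method used for the figure, is to compare the number of cases per confirmation date in each snapshot. The function name `new_cases_by_date` and the ISO date strings are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def new_cases_by_date(old_dates: Iterable[str],
                      new_dates: Iterable[str]) -> Dict[str, int]:
    """Count, per confirmation date, the cases present only in the newer snapshot."""
    old_counts = Counter(old_dates)
    new_counts = Counter(new_dates)
    # Counter subtraction keeps only dates where the newer count is larger.
    added = new_counts - old_counts
    # ISO-formatted dates sort chronologically as plain strings.
    return dict(sorted(added.items()))


# Toy example with made-up dates: one back-dated case and two recent ones.
older = ["2021-05-20", "2021-05-21"]
newer = ["2020-08-01", "2021-05-20", "2021-05-21", "2021-05-21", "2021-06-01"]
added = new_cases_by_date(older, newer)
```

In this toy example, the back-dated 2020 entry plays the role of the black crosses in the figure, and the later dates correspond to the red bars.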
How to filter, view, and download this data: