Data Deep Dive

Brazil Covid-19 Line List

Data up to July 26th, 2021
Completeness *$
DT_NOTIFIC (Date of confirmation): 100%
DT_SIN_PRI (Date of symptom onset): 100%
CS_RACA (Ethnicity): 98.01%
HISTO_VGM (Travel history): 100%
NU_IDADE_N (Age): 100%
HOSPITAL (Hospitalization): 97.72%
UTI (ICU admission): 87.62%

*as of 26/07/2021;
$line list includes 5.5% of cases reported for Brazil by the World Health Organization; completeness metrics were computed on the subset of data representing confirmed COVID-19 cases.

In this case study we take a deep dive into COVID-19 line-list data from Brazil, one of the more than 130 countries included in the platform. The case study covers the provenance of the data, the transformations applied to fit the schema, and the key characteristics and limitations of the data. Although it addresses only one country, the design of the platform lets users quickly ask similar questions about any country it includes. We also discuss how users can conduct such an investigation on their own. Stay tuned for more of these data deep dives. Here are other examples for Peru and Colombia.

In general, COVID-19 case reporting in Brazil is complex, with multiple sources existing at multiple administrative levels (state, country). In addition to federal-level datasets such as SRAG, many states have their own COVID-19 data reporting systems, such as the Portal de Informações sobre o Combate à Covid-19 of the State of Acre. Here we describe the national Severe Acute Respiratory Syndrome (SRAG) surveillance system. Among other uses, this dataset has supported a study monitoring the severe second wave of infections in Manaus caused by the SARS-CoV-2 lineage P.1 (the Gamma Variant of Concern) and has been used to map the spatio-temporal distribution of onset-to-hospitalisation intervals across Brazilian states.

What is the provenance of the data?

The Ministry of Health of the Government of Brazil (Ministério da Saúde), through the Health Surveillance Secretariat (SVS), has been carrying out surveillance of Severe Acute Respiratory Syndrome (SRAG) in Brazil since the swine flu (H1N1) pandemic of 2009. In 2020, surveillance of COVID-19 was incorporated into this network. The SRAG database is openly available through a website as a series of downloadable datasets. The relevant datasets, which include COVID-19 cases for 2020 and 2021, can be found here. Updates are published weekly on Wednesdays but may exceptionally occur on other days. The first recorded confirmed COVID-19 case in the dataset is from February 22nd, 2020. A data dictionary and further information can be found here; note that some fields listed in the data dictionary are not present in the dataset.

Where can I find the original data and how is the data transformed?

Raw data can be downloaded here and here for 2020 and 2021 cases, respectively, and details of the parser that transforms the data to our standard schema can be found here. The Brazilian SRAG dataset records all cases of individuals with severe respiratory infections, including patients with SARS-CoV-2 and influenza. To subset the data to COVID-19 cases only, we use the final case classification as reported in the dataset (‘CLASSI_FIN’). A case is considered to be COVID-19 if the entry is ‘5’, corresponding to an individual infected with SARS-CoV-2.
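The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual parser: the sample rows are made up, and the semicolon-delimited layout is an assumption about the raw file format. Only the field names ‘CLASSI_FIN’ and ‘DT_NOTIFIC’ and the value ‘5’ come from the text.

```python
import csv
import io

# Made-up sample rows; the semicolon delimiter is an assumption about the
# raw file layout. Only the column names are real SRAG field names.
SAMPLE = """DT_NOTIFIC;CLASSI_FIN
05/01/2021;5
06/01/2021;1
07/01/2021;
08/01/2021;5
"""

COVID19_CLASSIFICATION = "5"  # final classification code for SARS-CoV-2


def filter_covid_cases(rows):
    """Keep only rows whose final classification marks a COVID-19 case."""
    return [
        row
        for row in rows
        if (row.get("CLASSI_FIN") or "").strip() == COVID19_CLASSIFICATION
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
covid_cases = filter_covid_cases(rows)
print(len(rows), "rows,", len(covid_cases), "COVID-19 cases")
```

Rows with a blank or different classification are dropped, so influenza and unclassified cases never reach the later transformation steps.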

We geocode cases within Brazil by adding municipality-level centroids (latitude and longitude) through a manual lookup table, which is provided by the Instituto Brasileiro de Geografia e Estatística. For country-level information included in travel histories, a lookup table maps ISO-2 country codes to the longitude and latitude of country centroids, obtained from Google Maps. We delete and re-ingest the whole dataset weekly on Sundays to ensure that all cases are updated and no duplicates exist. Note, however, that this means the data may be up to one week out of date.
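As a rough sketch of the municipality lookup, the geocoding step amounts to a dictionary lookup keyed on the municipality code. The table contents, the output key names, and the function name below are hypothetical placeholders for illustration; they are not values from the IBGE table or names from our parser.

```python
# Hypothetical lookup table: municipality code -> (name, latitude, longitude).
# Codes and coordinates are placeholders, not real IBGE values.
MUNICIPALITY_CENTROIDS = {
    "000001": ("Example Town A", -10.0, -50.0),
    "000002": ("Example Town B", -20.0, -45.0),
}


def geocode_case(row, lookup=MUNICIPALITY_CENTROIDS):
    """Return a location dict for the row's municipality code, or None."""
    code = (row.get("CO_MUN_NOT") or "").strip()
    match = lookup.get(code)
    if match is None:
        # Unknown or missing code: leave the case without coordinates
        # rather than guessing a location.
        return None
    name, latitude, longitude = match
    return {"municipality": name, "latitude": latitude, "longitude": longitude}


print(geocode_case({"CO_MUN_NOT": "000001"}))
print(geocode_case({"CO_MUN_NOT": "999999"}))
```

Keying on the code rather than the free-text name means spelling variants of a municipality name cannot produce a mismatch.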

How complete is the data compared to aggregated data sources?

It is important to recognize that this dataset is not intended to track all cases of SARS-CoV-2 reported in Brazil, but only those with more severe outcomes requiring hospitalization. Consequently, on July 26th 2021 the dataset included 1,090,429 confirmed COVID-19 cases, compared to 19,880,273 in the World Health Organization’s tally, or approximately 5.5% of the total.
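The coverage figure can be reproduced directly from the two counts quoted above. The variable names are ours, chosen for illustration.

```python
# Counts as of July 26th 2021, taken from the text above.
srag_confirmed = 1_090_429   # confirmed COVID-19 cases in the SRAG line list
who_reported = 19_880_273    # cases reported for Brazil by the WHO

coverage_pct = 100 * srag_confirmed / who_reported
print(f"Line list covers {coverage_pct:.1f}% of WHO-reported cases")
```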

Key characteristics and limitations of this database:

The SRAG line list dataset from Brazil provides 154 metadata fields for each patient. No unique ID is provided per patient. The date that the case notification form to SRAG was filled out (‘DT_NOTIFIC’) is used as the date of confirmation of the patient in our database, and is provided 100% of the time. 
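A completeness metric like the one reported for ‘DT_NOTIFIC’ can be sketched as the share of confirmed cases with a non-empty value in a field. How the published figures treat unknown or placeholder values is not stated here, so counting only blank entries as missing is an assumption, and the sample rows are made up.

```python
def completeness(rows, field, missing=("",)):
    """Percentage of rows whose value for `field` is not a missing marker."""
    if not rows:
        return 0.0
    filled = sum(
        1 for row in rows if (row.get(field) or "").strip() not in missing
    )
    return 100 * filled / len(rows)


# Made-up rows for illustration; only the field names come from the dataset.
sample = [
    {"DT_NOTIFIC": "05/01/2021", "UTI": "1"},
    {"DT_NOTIFIC": "06/01/2021", "UTI": ""},
    {"DT_NOTIFIC": "07/01/2021", "UTI": "2"},
    {"DT_NOTIFIC": "08/01/2021", "UTI": "1"},
]

for field in ("DT_NOTIFIC", "UTI"):
    print(field, f"{completeness(sample, field):.2f}%")
```

The `missing` parameter lets an analyst also treat a dataset-specific "unknown" code as missing, if the data dictionary defines one for the field in question.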

Geographic data is provided to the municipality level, both as a name (‘ID_MUNICIP’) and as a code (‘CO_MUN_NOT’). We use the code to look up the correct latitude and longitude and the municipality name of each patient, which avoids possible issues with spelling. State-level information is also provided as a code (‘SG_UF_NOT’), which we map to the respective state name. International travel within the 14 days before symptom onset is recorded in the data (‘HISTO_VGM’). Patient outcome (‘EVOLUCAO’) is indicated as a recovery or death, with a distinction made for death from other causes; when a death is from other causes, this is added to the notes field of the G.h schema. The data contains extensive information on patient demographics, reporting age (‘NU_IDADE_N’), sex (‘CS_SEXO’), and ethnicity (‘CS_RACA’), as well as the presence of 10 comorbidities and 12 symptoms.
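The outcome handling described above might look roughly like the following. The numeric codes for ‘EVOLUCAO’ are an assumption and should be checked against the SRAG data dictionary; the output keys and the function name are hypothetical and are not the actual G.h schema.

```python
# Assumed EVOLUCAO codes (verify against the SRAG data dictionary):
# code -> (outcome, note to append when relevant).
OUTCOME_CODES = {
    "1": ("Recovered", None),
    "2": ("Death", None),
    "3": ("Death", "Death from other causes"),
}


def convert_outcome(row):
    """Map the raw outcome code to an outcome label plus an optional note."""
    code = (row.get("EVOLUCAO") or "").strip()
    # Unknown or blank codes yield no outcome instead of a guessed one.
    outcome, note = OUTCOME_CODES.get(code, (None, None))
    return {"outcome": outcome, "notes": note}


print(convert_outcome({"EVOLUCAO": "3"}))
print(convert_outcome({"EVOLUCAO": ""}))
```

Keeping the "other causes" distinction in a notes value preserves it for analysts without requiring a dedicated field in the target schema.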

Since the SRAG database is very rich and detailed compared to data from most countries, not all of the fields provided in the original data fit our global schema. Interesting information such as whether the patient had a flu vaccine during the previous vaccination campaign (‘VACINA’) or the results of a possible chest X-ray (‘RAIOX_RES’) does not yet have a dedicated field in our schema but can be obtained from the original data. In addition, recent changes to the dataset include reporting on whether the patient has been vaccinated against COVID-19 (‘VACINA_COV’), along with details such as the dates of the 1st and 2nd doses (‘DOSE_1_COV’ and ‘DOSE_2_COV’, respectively) and the lot numbers of both doses (‘LOTE_1_COV’ and ‘LOTE_2_COV’, respectively). Future iterations of the parser may partly capture this data by including it in the notes field.

Importantly, the Brazilian SRAG dataset records all individuals with syndromes consistent with severe respiratory infections rather than all COVID-19 cases. Although the total dataset as of 26 July 2021 included 2,497,568 individuals, only a subset of these corresponds to confirmed COVID-19 cases, which are filtered as described above. In addition, because this dataset records only the cases with more severe outcomes requiring hospitalization, it is a biased subset of the true total for users interested in analyzing features of all cases. The database also contains some Brazilian state-specific line lists (e.g. for the State of São Paulo) which, while more complete with respect to all infections detected in that state, lack the richness in clinical metadata. As a result, we recommend that users filter the database by source depending on what their analyses require.

How to filter, view, and download this data:

To access the most up-to-date data described above, please follow this link. You can also access a visualisation of these data on our Map application.

Signature & Contact

Anya Lindström Battle

Data Scientist,
University of Oxford
on behalf of the team
