Skip to contents

The covid R package includes a dataset containing COVID-19 epidemiological data from Brazil. In this introductory vignette, we will highlight some of the characteristics of this dataset that make it useful for reporting COVID-19 statistics in Brazil.

The covidBR package

The original data comes from the Brazilian COVID-19 Portal (Portal do COVID-19), and can be found at https://covid.saude.gov.br. Daily, the Ministry of Health (Ministério da Saúde) updates and publicly disseminates the database. In this context, the covidBR R package was built to extract data from the Brazilian COVID-19 Portal.

This package contains a dataset called covidBR. Furthermore, it provides raw data, downloaded from the COVID-19 Portal, in CSV format files (access here).

Installation

You can install the development version of covidBR from GitHub with:

# Development version
# install.packages("devtools")
devtools::install_github("arianacabral/covidBR")

Load packages

Now, let’s load the covidBR and the libraries we will use in this vignette.

Meet the covidBR dataset

The covidBR dataset contains 17 variables for each Brazilian region, municipality, and state.

You can read more about the variables by typing ?covidBR::covidBR.

You can view what variables are available using the str() function.

# Display the dataset's structure
str(covidBR)
#> Classes 'data.table' and 'data.frame':   511329 obs. of  17 variables:
#>  $ regiao                 : chr  "Brasil" "Brasil" "Brasil" "Brasil" ...
#>  $ estado                 : chr  "" "" "" "" ...
#>  $ municipio              : chr  "" "" "" "" ...
#>  $ coduf                  : int  76 76 76 76 76 76 76 76 76 76 ...
#>  $ codmun                 : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ cod_regiao_saude       : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ nome_regiao_saude      : chr  "" "" "" "" ...
#>  $ data                   : IDate, format: "2022-08-06" "2022-08-07" ...
#>  $ semana_epi             : int  31 32 32 32 32 32 32 32 33 33 ...
#>  $ populacao_tcu2019      : int  210147125 210147125 210147125 210147125 210147125 210147125 210147125 210147125 210147125 210147125 ...
#>  $ casos_acumulado        : num  34011173 34018371 34035780 34066000 34096935 ...
#>  $ casos_novos            : int  16703 7198 17409 30220 30935 27644 23552 17726 4429 7954 ...
#>  $ obitos_acumulado       : int  679939 679996 680166 680531 680786 681006 681253 681400 681437 681557 ...
#>  $ obitos_novos           : int  181 57 170 365 255 220 247 147 37 120 ...
#>  $ recuperadosnovos       : int  32691603 32731706 32790294 32854341 32901273 32927762 32945953 32966689 32993386 33033317 ...
#>  $ em_acompanhamento_novos: int  639631 606669 565320 531128 514876 515811 520925 517768 495463 463366 ...
#>  $ interior_metropolitana : int  NA NA NA NA NA NA NA NA NA NA ...
#>  - attr(*, ".internal.selfref")=<externalptr>

You can also view the data in a tabular form.

covidBR %>%
  dplyr::filter(data >= max(data),
                municipio == "") %>%
  dplyr::select(-c(municipio, 
                   coduf,
                   codmun, 
                   cod_regiao_saude,
                   nome_regiao_saude,
                   semana_epi,
                   populacao_tcu2019,
                   interior_metropolitana,
                   em_acompanhamento_novos,
                   recuperadosnovos)) %>%
  dplyr::arrange(regiao, estado) %>% 
  head(n = 15L) %>% 
  knitr::kable()
regiao estado data casos_acumulado casos_novos obitos_acumulado obitos_novos
Brasil 2022-11-04 34849063 2755 688332 65
Centro-Oeste DF 2022-11-04 843162 42 11832 0
Centro-Oeste GO 2022-11-04 1727729 258 27581 1
Centro-Oeste GO 2022-11-04 0 0 0 0
Centro-Oeste MS 2022-11-04 582378 15 10845 0
Centro-Oeste MT 2022-11-04 832626 114 14957 0
Centro-Oeste MT 2022-11-04 0 0 0 0
Nordeste AL 2022-11-04 321582 6 7128 0
Nordeste AL 2022-11-04 0 0 0 0
Nordeste BA 2022-11-04 1704438 107 30795 2
Nordeste BA 2022-11-04 20872 5 356 0
Nordeste CE 2022-11-04 1386713 67 28007 0
Nordeste CE 2022-11-04 18155 0 0 0
Nordeste MA 2022-11-04 474755 33 10997 0
Nordeste MA 2022-11-04 0 0 0 0

Note that the raw data contains duplicate information!

Tidy data for more efficient data science

When data is duplicated, it is necessary to organize data to help us work in an efficient, reproducible, and collaborative way.

The example below shows one way to organize the data and remove duplicate elements, but you can represent this same data in several ways.

# Get covidBR dataset
tidycovidBR <- covidBR

# Retain all rows with the most recent notification date for each Brazilian state.
tidycovidBR <- tidycovidBR %>% 
  dplyr::filter(data >= max(data),
                municipio == "") %>%
  dplyr::arrange(regiao, estado)  

# Pivot the offending columns into a new pair of variables
tidycovidBR <- tidycovidBR %>% 
  tidyr::pivot_longer(
  cols = c(populacao_tcu2019:interior_metropolitana,
           codmun),
  names_to  = "variavel",
  values_to = "valor")

# Select the interesting columns
tidycovidBR <- tidycovidBR %>% 
  dplyr::select(-c(cod_regiao_saude, nome_regiao_saude))

# Aggregate the duplicate columns
tidycovidBR <- tidycovidBR %>% 
  dplyr::group_by(variavel,
                  data,
                  regiao,
                  estado,
                  municipio,
                  semana_epi,
                  coduf) %>% 
  dplyr::summarise_at(vars(valor),
                      list(valor = max), 
                      na.rm = TRUE)

# View the data in the original structure
tidycovidBR %>% 
  tidyr::pivot_wider(names_from = variavel, 
                     values_from = valor) %>%
  dplyr::ungroup() %>%  
  dplyr::select(c(regiao,
                  estado,
                  data,
                  casos_acumulado,
                  casos_novos,
                  obitos_acumulado,
                  obitos_novos)) %>% 
  dplyr::arrange(regiao, estado, data) %>% 
  head(n = 15L) %>% 
  knitr::kable()
regiao estado data casos_acumulado casos_novos obitos_acumulado obitos_novos
Brasil 2022-11-04 34849063 2755 688332 65
Centro-Oeste DF 2022-11-04 843162 42 11832 0
Centro-Oeste GO 2022-11-04 1727729 258 27581 1
Centro-Oeste MS 2022-11-04 582378 15 10845 0
Centro-Oeste MT 2022-11-04 832626 114 14957 0
Nordeste AL 2022-11-04 321582 6 7128 0
Nordeste BA 2022-11-04 1704438 107 30795 2
Nordeste CE 2022-11-04 1386713 67 28007 0
Nordeste MA 2022-11-04 474755 33 10997 0
Nordeste PB 2022-11-04 654535 14 10406 0
Nordeste PE 2022-11-04 1066026 335 22413 2
Nordeste PI 2022-11-04 404893 0 7959 0
Nordeste RN 2022-11-04 557573 12 8483 0
Nordeste SE 2022-11-04 342915 0 6442 0
Norte AC 2022-11-04 149885 2 2029 0

Plot data

You can now use ggplot2 to explore the data and create visually appealing plots. See an example.

tidycovidBR  %>% 
  dplyr::ungroup() %>% 
  dplyr::filter(variavel != "codmun" &
                  variavel != "interior_metropolitana" &
                  variavel != "interior_metropolitana" & 
                  variavel != "populacao_tcu2019" &
                  variavel != "recuperadosnovos", 
                regiao != "Brasil"
  ) %>%  
  dplyr::mutate_at("variavel", 
                   stringr::str_replace_all,
                   pattern = "_",
                   replacement = " ")  %>% 
  ggplot(aes(x = estado, y = valor, fill = estado)) + 
  geom_bar(stat = "identity") +
  facet_wrap(~ variavel, scales = "free_y", ncol = 1) + 
  labs(x = "", y = "") + 
  theme_minimal() + 
  theme(legend.position = "none")