In this practical, you will read in the taxation.csv data set and analyze it descriptively using the dplyr package.

0 - Preliminaries

Run the code line below to load the tidyverse. To do this, you can either place the cursor on the line and then hit Cmd-Enter (Mac) or Ctrl-Enter (Windows). Alternatively you can click on the green arrow in the top-right corner of the code chunk. If you have not previously installed the tidyverse, run install.packages("tidyverse") first and then the line below.

library(tidyverse)

Use the code below to read in the taxation.csv data set and store it as an object called basel. The messages printed in the console tell you how R has inferred the types of each of the columns. Should be all good.

basel <- read_csv('1_Data/taxation.csv')

Take a look at basel by sending it to console. Note how the tibble shows types for each of the columns. Also note how some cases and variables are abbreviated.

basel

## # A tibble: 357 × 10
##     year quarter          quarter_no     N income_mean income_median income_gini
##    <dbl> <chr>                 <dbl> <dbl>       <dbl>         <dbl>       <dbl>
##  1  2001 Altstadt Grossb…          1  1673       87776         51819       0.593
##  2  2001 Vorstädte                 2  3204       84109         49914       0.577
##  3  2001 Am Ring                   3  6579       62582         49426       0.467
##  4  2001 Breite                    4  5433       52039         47227       0.358
##  5  2001 St. Alban                 5  6179       89956         58112       0.54 
##  6  2001 Gundeldingen              6 11224       51229         46265       0.387
##  7  2001 Bruderholz                7  5090       96124         64512       0.52 
##  8  2001 Bachletten                8  8157       70348         56258       0.444
##  9  2001 Gotthelf                  9  4256       59049         47960       0.435
## 10  2001 Iselin                   10  9853       49631         45530       0.371
## # ℹ 347 more rows
## # ℹ 3 more variables: wealth_mean <dbl>, wealth_median <dbl>, wealth_gini <dbl>

1 - Summarize

Let’s analyze the data. Complete the code below to compute the average Gini coefficient for income (income_gini) for Basel across all years. Simply replace XX with the appropriate variable name.

basel %>% 
  summarize(avg_gini_income = mean(income_gini))

## # A tibble: 1 × 1
##   avg_gini_income
##             <dbl>
## 1           0.473

Now add another statement within summarize to also calculate the average Gini for wealth. What do you make of the results? Think back to the distributions and corresponding Gini values from the intro presentation.

basel %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini))

## # A tibble: 1 × 2
##   avg_gini_income avg_gini_wealth
##             <dbl>           <dbl>
## 1           0.473           0.882

2 - Filter

Now look at the average Gini coefficients for just one quarter: “Bettingen”.

basel %>% 
  filter(quarter == "Bettingen") %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini))

## # A tibble: 1 × 2
##   avg_gini_income avg_gini_wealth
##             <dbl>           <dbl>
## 1           0.542           0.854

Now try "Gundeldingen" instead. How do the results compare?

basel %>% 
  filter(quarter == "Gundeldingen") %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini))

## # A tibble: 1 × 2
##   avg_gini_income avg_gini_wealth
##             <dbl>           <dbl>
## 1           0.411           0.861

3 - Grouping

Wouldn’t it be great if one did not have to select each quarter one by one using filter? The group_by andsummarize idiom saves you from exactly that. Complete the code below to calculate the average Gini coefficients for every quarter.

basel %>% 
  group_by(quarter) %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini))

## # A tibble: 21 × 3
##    quarter             avg_gini_income avg_gini_wealth
##    <chr>                         <dbl>           <dbl>
##  1 Altstadt Grossbasel           0.617           0.931
##  2 Altstadt Kleinbasel           0.496           0.914
##  3 Am Ring                       0.502           0.876
##  4 Bachletten                    0.464           0.849
##  5 Bettingen                     0.542           0.854
##  6 Breite                        0.391           0.835
##  7 Bruderholz                    0.587           0.940
##  8 Clara                         0.439           0.863
##  9 Gotthelf                      0.457           0.936
## 10 Gundeldingen                  0.411           0.861
## # ℹ 11 more rows

R has printed only 10 of the 21 quarters due to a default setting for the tibble print. Overwrite this setting by telling R exactly how many rows you would like to see.

basel %>% 
  group_by(quarter) %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini)) %>% 
  print(n = 21)

## # A tibble: 21 × 3
##    quarter             avg_gini_income avg_gini_wealth
##    <chr>                         <dbl>           <dbl>
##  1 Altstadt Grossbasel           0.617           0.931
##  2 Altstadt Kleinbasel           0.496           0.914
##  3 Am Ring                       0.502           0.876
##  4 Bachletten                    0.464           0.849
##  5 Bettingen                     0.542           0.854
##  6 Breite                        0.391           0.835
##  7 Bruderholz                    0.587           0.940
##  8 Clara                         0.439           0.863
##  9 Gotthelf                      0.457           0.936
## 10 Gundeldingen                  0.411           0.861
## 11 Hirzbrunnen                   0.391           0.833
## 12 Iselin                        0.395           0.839
## 13 Kleinhüningen                 0.394           0.857
## 14 Klybeck                       0.401           0.878
## 15 Matthäus                      0.471           0.891
## 16 Riehen                        0.503           0.871
## 17 Rosental                      0.426           0.886
## 18 St. Alban                     0.540           0.896
## 19 St. Johann                    0.446           0.877
## 20 Vorstädte                     0.593           0.952
## 21 Wettstein                     0.457           0.880

Looks like there are some differences between the quarters. How about arranging the results such that the quarters are sorted according to avg_gini_income. Which quarter has the highest Gini coefficient, which the lowest? How different are the two?

basel %>% 
  group_by(quarter) %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini)) %>% 
  arrange(avg_gini_income) %>% 
  print(n = 21)

## # A tibble: 21 × 3
##    quarter             avg_gini_income avg_gini_wealth
##    <chr>                         <dbl>           <dbl>
##  1 Breite                        0.391           0.835
##  2 Hirzbrunnen                   0.391           0.833
##  3 Kleinhüningen                 0.394           0.857
##  4 Iselin                        0.395           0.839
##  5 Klybeck                       0.401           0.878
##  6 Gundeldingen                  0.411           0.861
##  7 Rosental                      0.426           0.886
##  8 Clara                         0.439           0.863
##  9 St. Johann                    0.446           0.877
## 10 Gotthelf                      0.457           0.936
## 11 Wettstein                     0.457           0.880
## 12 Bachletten                    0.464           0.849
## 13 Matthäus                      0.471           0.891
## 14 Altstadt Kleinbasel           0.496           0.914
## 15 Am Ring                       0.502           0.876
## 16 Riehen                        0.503           0.871
## 17 St. Alban                     0.540           0.896
## 18 Bettingen                     0.542           0.854
## 19 Bruderholz                    0.587           0.940
## 20 Vorstädte                     0.593           0.952
## 21 Altstadt Grossbasel           0.617           0.931

4 - Pivoting

Finally, transpose the output of the previous chunk to obtain the results in the long format. The first argument specifies the vector of to be transposed columns. The next two specify the column names of the two variables in the long format carrying the previous column names and column values, respectively.

basel %>% 
  group_by(quarter) %>% 
  summarize(avg_gini_income = mean(income_gini),
            avg_gini_wealth = mean(wealth_gini)) %>% 
  pivot_longer(c(avg_gini_income, avg_gini_wealth), 
               names_to = "variable", values_to = "value")

## # A tibble: 42 × 3
##    quarter             variable        value
##    <chr>               <chr>           <dbl>
##  1 Altstadt Grossbasel avg_gini_income 0.617
##  2 Altstadt Grossbasel avg_gini_wealth 0.931
##  3 Altstadt Kleinbasel avg_gini_income 0.496
##  4 Altstadt Kleinbasel avg_gini_wealth 0.914
##  5 Am Ring             avg_gini_income 0.502
##  6 Am Ring             avg_gini_wealth 0.876
##  7 Bachletten          avg_gini_income 0.464
##  8 Bachletten          avg_gini_wealth 0.849
##  9 Bettingen           avg_gini_income 0.542
## 10 Bettingen           avg_gini_wealth 0.854
## # ℹ 32 more rows

5 - Project work

Try to use what you have learned in this section to read in and process the data for your data visualization project.

Intro 2 R solutions

0 - Preliminaries

1 - Summarize

2 - Filter

3 - Grouping

4 - Pivoting

5 - Project work