Cleaning the Credit Card Defaults Dataset

Jose M Sallan 2023-05-15 11 min read

In this post, I will present how have I prepared for analysis the default of credit card clients Data Set available from the UCI Machine Learning Repository. I will be using the tidyverse for data handling, the readxl package to read the Excel file, the janitor package for data cleaning, and skimr for dataset summary. The kableExtra package will be used to present tabular results:

library(tidyverse)
library(readxl)
library(janitor)
library(skimr)
library(kableExtra)

This dataset was used in the Yeh & Lien (2009) paper of comparison of data mining techniques. The description of dataset variables presented in the UCI Machine Learning repository comes from that paper:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

X2: Gender (1 = male; 2 = female).

X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4: Marital status (1 = married; 2 = single; 3 = others).

X5: Age (year).

X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.

X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

The dataset is delivered as an Excel file. The first row corresponds to the variables of the original paper, and the second row is a renaming of the variables presented above. In the Excel there is an additional first column of ID observations, so that variable X1 is in the second column of the file, and so on. As I don’t want to modify the original file, I will skip the first row of the file and keep the renamed variables.

ld <- read_excel("default of credit card clients.xls", skip = 1)

I will save the resulting data frame as a RDS object. So the storage required goes from 5.8 MB to 1.2 MB:

saveRDS(ld, file = "ld.RDS")

I will load the dataset from the RDS by doing:

cc_defaults <- readRDS("ld.RDS")

Let’s examine the dataset with the skim summary function:

skim(cc_defaults)
Table 1: Data summary
Name cc_defaults
Number of rows 30000
Number of columns 25
_______________________
Column type frequency:
numeric 25
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 15000.50 8660.40 1 7500.75 15000.5 22500.25 30000 ▇▇▇▇▇
LIMIT_BAL 0 1 167484.32 129747.66 10000 50000.00 140000.0 240000.00 1000000 ▇▃▁▁▁
SEX 0 1 1.60 0.49 1 1.00 2.0 2.00 2 ▅▁▁▁▇
EDUCATION 0 1 1.85 0.79 0 1.00 2.0 2.00 6 ▆▇▃▁▁
MARRIAGE 0 1 1.55 0.52 0 1.00 2.0 2.00 3 ▁▇▁▇▁
AGE 0 1 35.49 9.22 21 28.00 34.0 41.00 79 ▇▇▂▁▁
PAY_0 0 1 -0.02 1.12 -2 -1.00 0.0 0.00 8 ▇▂▁▁▁
PAY_2 0 1 -0.13 1.20 -2 -1.00 0.0 0.00 8 ▇▁▁▁▁
PAY_3 0 1 -0.17 1.20 -2 -1.00 0.0 0.00 8 ▇▁▁▁▁
PAY_4 0 1 -0.22 1.17 -2 -1.00 0.0 0.00 8 ▇▁▁▁▁
PAY_5 0 1 -0.27 1.13 -2 -1.00 0.0 0.00 8 ▇▁▁▁▁
PAY_6 0 1 -0.29 1.15 -2 -1.00 0.0 0.00 8 ▇▁▁▁▁
BILL_AMT1 0 1 51223.33 73635.86 -165580 3558.75 22381.5 67091.00 964511 ▇▃▁▁▁
BILL_AMT2 0 1 49179.08 71173.77 -69777 2984.75 21200.0 64006.25 983931 ▇▁▁▁▁
BILL_AMT3 0 1 47013.15 69349.39 -157264 2666.25 20088.5 60164.75 1664089 ▇▁▁▁▁
BILL_AMT4 0 1 43262.95 64332.86 -170000 2326.75 19052.0 54506.00 891586 ▇▃▁▁▁
BILL_AMT5 0 1 40311.40 60797.16 -81334 1763.00 18104.5 50190.50 927171 ▇▁▁▁▁
BILL_AMT6 0 1 38871.76 59554.11 -339603 1256.00 17071.0 49198.25 961664 ▁▇▁▁▁
PAY_AMT1 0 1 5663.58 16563.28 0 1000.00 2100.0 5006.00 873552 ▇▁▁▁▁
PAY_AMT2 0 1 5921.16 23040.87 0 833.00 2009.0 5000.00 1684259 ▇▁▁▁▁
PAY_AMT3 0 1 5225.68 17606.96 0 390.00 1800.0 4505.00 896040 ▇▁▁▁▁
PAY_AMT4 0 1 4826.08 15666.16 0 296.00 1500.0 4013.25 621000 ▇▁▁▁▁
PAY_AMT5 0 1 4799.39 15278.31 0 252.50 1500.0 4031.50 426529 ▇▁▁▁▁
PAY_AMT6 0 1 5215.50 17777.47 0 117.75 1500.0 4000.00 528666 ▇▁▁▁▁
default payment next month 0 1 0.22 0.42 0 0.00 0.0 0.00 1 ▇▁▁▁▂

In the summary, we can see that all variables are numerical. This can be a requirement of the software used by in the Yeh & Lien (2009) paper, but using these variables as is can be misleading, as they are categorical variables. In R we can do better than that, so I will modify the dataset following four steps:

  • Clean the names of the dataset with janitor::clean_names.
  • Transform some features into factors with meaningful factor levels.
  • Rename the target variable, and transform it into a factor with the positive case as first level.
  • Rename some of the features to facilitate further analysis.

Clean Names of Dataset

The original names of the dataset are:

colnames(cc_defaults)
##  [1] "ID"                         "LIMIT_BAL"                 
##  [3] "SEX"                        "EDUCATION"                 
##  [5] "MARRIAGE"                   "AGE"                       
##  [7] "PAY_0"                      "PAY_2"                     
##  [9] "PAY_3"                      "PAY_4"                     
## [11] "PAY_5"                      "PAY_6"                     
## [13] "BILL_AMT1"                  "BILL_AMT2"                 
## [15] "BILL_AMT3"                  "BILL_AMT4"                 
## [17] "BILL_AMT5"                  "BILL_AMT6"                 
## [19] "PAY_AMT1"                   "PAY_AMT2"                  
## [21] "PAY_AMT3"                   "PAY_AMT4"                  
## [23] "PAY_AMT5"                   "PAY_AMT6"                  
## [25] "default payment next month"

Let’s clean the names of the dataset doing:

cc_defaults <- clean_names(cc_defaults)

The dataset names are now:

colnames(cc_defaults)
##  [1] "id"                         "limit_bal"                 
##  [3] "sex"                        "education"                 
##  [5] "marriage"                   "age"                       
##  [7] "pay_0"                      "pay_2"                     
##  [9] "pay_3"                      "pay_4"                     
## [11] "pay_5"                      "pay_6"                     
## [13] "bill_amt1"                  "bill_amt2"                 
## [15] "bill_amt3"                  "bill_amt4"                 
## [17] "bill_amt5"                  "bill_amt6"                 
## [19] "pay_amt1"                   "pay_amt2"                  
## [21] "pay_amt3"                   "pay_amt4"                  
## [23] "pay_amt5"                   "pay_amt6"                  
## [25] "default_payment_next_month"

Features Transformation

In the description of the dataset we see that some variables like sex, education and marriage are categorical, although they are coded as numerical. To see why is that misleading, let’s consider sex, which is equal to 1 for males and 2 for females. This labelling is arbitrary, so instead of describing a property larger for women than for men, is splitting the dataset into two different categories. We need to turn these variables into factors, and rename the levels of these factors adequately. This can be done with forcats::fct_recode.

cc_defaults <- cc_defaults |>
  mutate(sex = factor(sex)) |>
  mutate(sex = fct_recode(sex, male = "1", female = "2"))

The transformation of education is more complex. From the summary, we observe that goes from 0 to 6, but it is defined from 1 to 4. Let’s count how many observations have for each level using a janitor::tabyl function:

cc_defaults |>
  tabyl(education) |>
  adorn_pct_formatting() |>
  kbl() |>
  kable_styling(full_width = FALSE)
education n percent
0 14 0.0%
1 10585 35.3%
2 14030 46.8%
3 4917 16.4%
4 123 0.4%
5 280 0.9%
6 51 0.2%

The proportion of odd values of the variable is small, so it is safer to group them into an unknown level:

cc_defaults <- cc_defaults |>
  mutate(education = factor(education)) |>
  mutate(education = fct_recode(education, graduate = "1", university = "2", high_school = "3", others = "4", unknown = "0", unknown = "5", unknown = "6"))

The resulting levels of the factor are:

levels(cc_defaults$education)
## [1] "unknown"     "graduate"    "university"  "high_school" "others"

And the new counting of levels is:

cc_defaults |>
  tabyl(education) |>
  adorn_pct_formatting() |>
  kbl() |>
  kable_styling(full_width = FALSE)
education n percent
unknown 345 1.1%
graduate 10585 35.3%
university 14030 46.8%
high_school 4917 16.4%
others 123 0.4%

Something similar happens with the ’marriage` variable, which has more levels than the ones defined in Yeh & Lien (2009):

cc_defaults |>
  tabyl(marriage) |>
  adorn_pct_formatting() |>
  kbl() |>
  kable_styling(full_width = FALSE)
marriage n percent
0 54 0.2%
1 13659 45.5%
2 15964 53.2%
3 323 1.1%

When transforming marriage into a factor, I will be coding the zero level as unknown:

cc_defaults <- cc_defaults |>
  mutate(marriage = factor(marriage)) |>
  mutate(marriage = fct_recode(marriage, unknown = "0", unknown = "3", married = "1", single = "2"))

Transforming the Target Variable

The target variable default_payment_next_month must be transformed into a factor, with the positive case as first level. I will use dplyr::rename to give the variable a shorter name and fct_relevel to reorder the levels of variables.

cc_defaults <- cc_defaults |>
  rename("default" = default_payment_next_month) |>
  mutate(default = factor(default)) |>
  mutate(default = fct_relevel(default, "1")) |>
  mutate(default = fct_recode(default, yes = "1", "no" = "0"))

Counting the observations of each level, we observe that the dataset is imbalanced, as there are much more observations with no default:

cc_defaults |>
  tabyl(default) |>
  adorn_pct_formatting() |>
  kbl() |>
  kable_styling(full_width = FALSE)
default n percent
yes 6636 22.1%
no 23364 77.9%

Rename Features

The family of pay_*, bill_amt* and pay_amt* variables are labelled inconsistently, and the names do not provide information about the month they are describing. We are trying to predict if a customer will default in October by examining her/his behavior in the previous six months, April to September. So I will relabel the variables accordingly using dplyr::rename:

cc_defaults <- cc_defaults |>
  rename("pay_sep" = pay_0,
         "pay_aug" = pay_2,
         "pay_jul" = pay_3,
         "pay_jun" = pay_4,
         "pay_may" = pay_5,
         "pay_apr" = pay_6,
         "bill_amt_sep" = bill_amt1,
         "bill_amt_aug" = bill_amt2,
         "bill_amt_jul" = bill_amt3,
         "bill_amt_jun" = bill_amt4,
         "bill_amt_may" = bill_amt5,
         "bill_amt_apr" = bill_amt6,
         "pay_amt_sep" = pay_amt1,
         "pay_amt_aug" = pay_amt2,
         "pay_amt_jul" = pay_amt3,
         "pay_amt_jun" = pay_amt4,
         "pay_amt_may" = pay_amt5,
         "pay_amt_apr" = pay_amt6)

Significance of pay Variables

The original meaning of the pay_apr:pay_sep variables is:

X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Nevertheless, we observe that there are other variable values around. Let’s see for instance pay_sep:

cc_defaults |>
  tabyl(pay_sep) |>
  adorn_pct_formatting() |>
  kbl() |>
  kable_styling(full_width = FALSE)
pay_sep n percent
-2 2759 9.2%
-1 5686 19.0%
0 14737 49.1%
1 3688 12.3%
2 2667 8.9%
3 322 1.1%
4 76 0.3%
5 26 0.1%
6 11 0.0%
7 9 0.0%
8 19 0.1%

This fact has caught the attention of other users. In a Kaggle discussion forum, we are informed that the authors of the paper provided further clarification about the meaning of this variable:

-2: No consumption; -1: Paid in full; 0: The use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

This means that the pay_* variables have a different meaning when they are positive than when they are zero or negative. Further analysis must take this into account.

The Importance of Data Cleaning

Data cleaning is a relevant tasks in data analysis. As quoted in the janitor website:

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

The skimr::skim function helps to take a first look of a dataset with an exhaustive summary report. The janitor package helps to simplify this cumbersome task in the context of R, reproducing with janitor::tabyl some of the results of counting the number of elements of each level using Excel dynamic tables. The use of a programming language like R helps to make this data cleaning task reproducible.

References

Session Info

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 21.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4 skimr_2.1.5      janitor_2.2.0    readxl_1.4.2    
##  [5] lubridate_1.9.2  forcats_1.0.0    stringr_1.5.0    dplyr_1.1.2     
##  [9] purrr_1.0.1      readr_2.1.4      tidyr_1.3.0      tibble_3.2.1    
## [13] ggplot2_3.4.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.5        utf8_1.2.3        generics_0.1.3    xml2_1.3.3       
##  [5] blogdown_1.16     stringi_1.7.12    hms_1.1.3         digest_0.6.31    
##  [9] magrittr_2.0.3    evaluate_0.20     grid_4.3.0        timechange_0.2.0 
## [13] bookdown_0.33     fastmap_1.1.1     cellranger_1.1.0  jsonlite_1.8.4   
## [17] httr_1.4.5        rvest_1.0.3       fansi_1.0.4       viridisLite_0.4.1
## [21] scales_1.2.1      jquerylib_0.1.4   cli_3.6.1         rlang_1.1.0      
## [25] munsell_0.5.0     base64enc_0.1-3   repr_1.1.6        withr_2.5.0      
## [29] cachem_1.0.7      yaml_2.3.7        tools_4.3.0       tzdb_0.3.0       
## [33] colorspace_2.1-0  webshot_0.5.4     vctrs_0.6.2       R6_2.5.1         
## [37] lifecycle_1.0.3   snakecase_0.11.0  pkgconfig_2.0.3   pillar_1.9.0     
## [41] bslib_0.4.2       gtable_0.3.3      glue_1.6.2        systemfonts_1.0.4
## [45] highr_0.10        xfun_0.39         tidyselect_1.2.0  rstudioapi_0.14  
## [49] knitr_1.42        htmltools_0.5.5   svglite_2.1.1     rmarkdown_2.21   
## [53] compiler_4.3.0