A Syllabus for Tabular Data Analytics

Jose M Sallan 2024-02-19 6 min read

In contemporary education in business, engineering and computer science, there is growing interest in providing students with data analytics capabilities. Today we have multiple sources of data: urban activity from open data portals, transactions registered in commercial Internet platforms, and factory activity captured by Internet of Things devices are salient examples of this data abundance. Data may appear as images, text, sound or tabular registers, among other forms. We have developed effective tools to deal with each type of data, such as convolutional neural networks for image processing or recurrent neural networks for speech or text. In this post, I want to focus on tabular data, where variables are organized in columns and observations in rows. In particular, I aim to convey my present view on how we can introduce students to processing tabular data effectively.

In my opinion, the tidyverse offers a good framework for data analytics, providing a set of R packages that cover the specific needs of the data analyst. That is why I will present this framework through tidyverse packages. But obviously other approaches are possible: data.table is a good solution for handling very large datasets, and the popular Python language has its own tools to perform these tasks.

Here is my particular framework to introduce data analytics of tabular data.

Unit 1: Introduction to R

The first step of the course must be to introduce students to the working environment, in my case R and RStudio. This can be challenging for most of them, as they are used to graphical user interfaces rather than code-based workflows. I start by presenting R and RStudio. Then I move to R data structures: vectors, lists and data frames. This includes presenting base R operations and the dollar notation. Next, I present the main R data types: numeric, logical, character and factor. Finally, I introduce functions and the extension of R through packages.
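As a minimal illustration of these structures (the names and values below are made up for the example), a first session can be built around snippets like:

```r
# A vector: a sequence of elements of the same type
heights <- c(1.71, 1.80, 1.65)

# A list: elements can have different types and lengths
student <- list(name = "Anna", grades = c(7.5, 8.0, 9.2))
student$grades          # dollar notation to access a list element

# A data frame: a list of vectors of equal length, i.e. a table
grades <- data.frame(name = c("Anna", "Marc", "Laia"),
                     grade = c(7.5, 6.0, 9.2))
grades$grade            # dollar notation to access a column
mean(grades$grade)      # a base R function applied to a column
```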

Unit 2: Handling Tabular Data

The most common tool for handling tabular data is the spreadsheet, with Excel, Google Sheets and LibreOffice as the most common implementations. I motivate students to move from Excel sheets to data frames living in the working environment. Then, we learn how to use dplyr verbs for some basic tabular data operations: filtering rows, selecting columns, arranging (ordering) by column value, creating new columns or modifying existing ones, and reordering and renaming columns.
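A sketch of what this first contact with dplyr can look like, using the built-in mtcars dataset so that it runs out of the box:

```r
library(dplyr)

mtcars |>
  filter(cyl == 4) |>                 # filter rows
  select(mpg, wt, cyl) |>             # select columns
  arrange(desc(mpg)) |>               # arrange by column value
  mutate(wt_kg = wt * 453.6) |>       # create a new column (wt is in 1000 lbs)
  rename(miles_per_gallon = mpg) |>   # rename a column
  relocate(cyl)                       # reorder columns
```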

Unit 3: Visualization Tools

Once we know how to handle data, we can move to visualizing it. It is time to present the ggplot2 package to visualize tabular data through the grammar of graphics. At this stage, we focus on the possibilities of producing graphs to explore data, rather than on sophisticated charts for presentations.
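Exploration-mode plots can be as bare as the following sketch, again with the built-in mtcars dataset:

```r
library(ggplot2)

# A quick exploratory scatter plot: no titles, themes or careful labelling yet
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# A quick look at a distribution
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)
```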

Unit 4: Pivoting Tables

In the context of tabular data, pivot has two meanings. The first is about producing summary tables of values across levels of categorical variables, which is accomplished in spreadsheets through pivot tables. The second is about moving between wide tables and long tables. The latter are often instances of tidy data, the preferred way of storing data in tables for processing with computers. Here I introduce group_by() and summarise() from dplyr to obtain summary tables. Then, I present some basic tidyr verbs: pivot_longer(), pivot_wider() and separate(), to move from wide to long tables and vice versa. These concepts will be useful later if students are introduced to structured databases in platforms like MySQL.
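Both meanings can be illustrated with a short, self-contained sketch (the small wide table is invented for the example):

```r
library(dplyr)
library(tidyr)

# A summary table: mean consumption for each number of cylinders
mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))

# A small wide table, pivoted to long format and back
wide <- tibble(country = c("ES", "FR"),
               y2022 = c(10, 12),
               y2023 = c(11, 14))

long <- wide |>
  pivot_longer(-country, names_to = "year", values_to = "value")

long |>
  pivot_wider(names_from = "year", values_from = "value")
```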

Unit 5: Joining Tables

The introduction of tidy data allows us to explore relational databases, as one of the principles of tidy data is that each type of observational unit forms a table. So it is likely that our data will be a set of tables, related to each other through relational variables. This is a crucial concept for students, especially if they need to become familiar with relational databases like MySQL. In this unit, I introduce the several types of dplyr joins.
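A minimal sketch of the main join types, with two invented tables linked through a relational variable:

```r
library(dplyr)

# Two related tables, linked by the relational variable student_id
students <- tibble(student_id = 1:3,
                   name = c("Anna", "Marc", "Laia"))
grades <- tibble(student_id = c(1L, 1L, 2L, 4L),
                 grade = c(7.5, 8.0, 6.0, 9.2))

left_join(students, grades, by = "student_id")   # keep all students
inner_join(students, grades, by = "student_id")  # keep only matching rows
anti_join(students, grades, by = "student_id")   # students without grades
```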

Unit 6: Reading and Cleaning Data

The most tedious and time-consuming activity of a data analysis project is reading and cleaning data. We need to retrieve data from a file on our computer or through an Internet connection, and then transform it into a correct and consistent dataset. This is the place to introduce the tidyverse resources to read data: readr for csv and text files, readxl for Excel files, and dbplyr for SQL databases. Once the data is loaded, the resources of janitor and the data handling packages introduced in previous units can help in cleaning the dataset.
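A typical read-then-clean sketch could look like the following; the file name and the amount column are hypothetical placeholders for whatever dataset the students work with:

```r
library(readr)
library(janitor)
library(dplyr)

# Hypothetical csv file, for illustration only
raw <- read_csv("sales_2023.csv")

clean <- raw |>
  clean_names() |>          # consistent snake_case column names
  distinct() |>             # drop duplicated rows
  filter(!is.na(amount))    # remove rows with missing values in a key column
```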

Unit 7: Producing Visualizations for Presentation

The ggplot2 package allows us to work in two “modes”: an exploration mode, where we want to produce fast and rough charts to explore the dataset, and a presentation mode, to generate charts for presentations. In this unit, I introduce the most useful graphs for presentations, such as horizontal bar charts, scatter plots and line plots. This is also a good moment to present some utilities of the forcats package, like reordering a factor by the values of a second variable with fct_reorder() or collapsing uncommon values into one level with fct_lump(). Other miscellaneous issues can be introduced here, like chart annotation or dealing with power-law distributions through logarithmic scales.
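A presentation-mode sketch combining these ideas, using the starwars dataset shipped with dplyr:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Horizontal bar chart of the most frequent eye colors,
# lumping rare values into "Other" and ordering bars by frequency
starwars |>
  mutate(eye_color = fct_lump(eye_color, n = 5)) |>
  count(eye_color) |>
  mutate(eye_color = fct_reorder(eye_color, n)) |>
  ggplot(aes(x = n, y = eye_color)) +
  geom_col() +
  labs(title = "Most frequent eye colors of Star Wars characters",
       x = "number of characters", y = NULL)
```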

Unit 8: Working with Strings

Although most data transformation and modelling deal with numerical variables, we frequently need to work with large amounts of text. Here it is time to introduce the stringr functions to perform operations on text data, like splitting, searching and replacing, among others. These functions work with regular expressions, an abstruse field where large language models can be of great help. In addition, the glue package allows generating string literals that include the values of variables in the environment. Text analysis is a world of its own; a good introduction is the Silge & Robinson (2024) online book.
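A short sketch of stringr and glue in action, with a small invented vector of addresses:

```r
library(stringr)
library(glue)

emails <- c("anna@upc.edu", "marc@gmail.com", "laia@upc.edu")

# Searching, replacing and splitting with regular expressions
str_detect(emails, "upc\\.edu$")   # which addresses end with upc.edu?
str_replace(emails, "@.*$", "")    # keep only the user name
str_split(emails, "@")             # split into user and domain

# Building string literals with glue
n_upc <- sum(str_detect(emails, "upc\\.edu$"))
glue("There are {n_upc} addresses from upc.edu")
```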

Unit 9: Working with Dates and Times

Times and dates appear pervasively in data analysis, and have their specific data types in R and Python. In R, date-times are stored in the POSIXct format, internally represented as the number of seconds since the epoch, 00:00:00 UTC on 1 January 1970. In this unit, we can introduce students to working with dates and times in base R, and present the lubridate and hms packages, which make working with dates and times easier.
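A minimal sketch of lubridate and hms, parsing dates and date-times and doing some simple arithmetic:

```r
library(lubridate)

# Parsing dates and date-times from strings
d <- ymd("2024-02-19")
dt <- ymd_hms("2024-02-19 10:30:00", tz = "UTC")

# Extracting components and doing arithmetic
year(d)
wday(d, label = TRUE)
dt + hours(3)
as.numeric(dt)            # seconds since 1970-01-01 00:00:00 UTC

# A time-of-day value with the hms package
hms::as_hms("10:30:00")
```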

Capstone Project

A nice way to evaluate students on the contents above is to have them complete a capstone project of exploratory data analysis, where they can deploy the knowledge acquired in the previous units.

Extensions

Depending on the background of the students, several extensions for these contents are possible:

  • Statistical analysis: presenting statistical inference, parametric tests, linear regression and factor analysis. Here it can be useful to present the broom package to obtain tidy outputs of statistical models (see the sketch after this list).
  • Predictive modelling: here we can present models of unsupervised learning, like clustering or association rules, and workflows of supervised learning with the tidymodels framework.
  • Spatial analysis: R allows analyzing, visualizing and modelling spatial data with open source software. A nice introduction to that topic is the Lovelace et al. (2024) book.
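As a sketch of what broom adds to the statistical extension, a linear regression on mtcars with tidy outputs:

```r
library(broom)

# A linear regression, with its results returned as tidy data frames
model <- lm(mpg ~ wt + cyl, data = mtcars)

tidy(model)      # coefficients, standard errors and p-values
glance(model)    # model-level statistics (R squared, AIC, ...)
augment(model)   # original data with fitted values and residuals
```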

References