Data Management

One of the most mature tools for handling data in Python is the NumPy library, which adds support for large, multi-dimensional arrays and matrices, as well as functions for performing vectorized mathematical operations on these arrays. NumPy is still widely used for numerical computations, but has mostly been superseded by other libraries for analysing general tabular data (data frames). One of these is Pandas, which is built on top of NumPy, but adds a lot of functionality that makes working with tabular data more convenient. Pandas is one of the most popular tools for data management in Python, but there are more modern alternatives that can provide additional benefits, such as Polars, PyArrow, Dask, DuckDB, and others. Finally, there is a compatibility layer called Narwhals that supports using most modern data libraries with a single syntax (a subset of the Polars API).

Here we will focus on using the Polars library, as it is very fast, has an intuitive API, allows for out-of-core processing (doesn’t need to read in all the data at once), and makes parallelisation easy. As an added bonus, it is easy to go from Polars to Narwhals, if we ever need to work with another data library.
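As a taste of that portability, here is a minimal sketch of a function written against the Narwhals API instead of directly against Polars (the helper keep_adults is just an illustration, not part of either library):

import narwhals as nw

def keep_adults(df_native):
    # Wrap the native frame (Polars, Pandas, PyArrow, ...) in a Narwhals frame
    df = nw.from_native(df_native)
    # Filter with Polars-like syntax and hand back the same type we were given
    return df.filter(nw.col("anchor_age") > 18).to_native()

The same function can now be called with a Polars DataFrame, a Pandas DataFrame, or a frame from any other backend Narwhals supports.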

import polars as pl # The polars module
from polars import col # Function to manipulate DataFrame columns

The main data structures in Polars are series and DataFrames.

A series is a one-dimensional data structure where all elements have the same data type.

pl.Series("numbers", [1, 2, 3, 4, 5])

Extending the series is the DataFrame, a two-dimensional data structure that contains uniquely named series (the columns).

pl.DataFrame(
    {
        "numbers": [1, 2, 3, 4, 5],
        "letters": ["A", "B", "C", "D", "E"],
    }
)

By having the data in a DataFrame, we can use Polars’s functions to select columns, filter rows, create new variables, or perform any other data manipulation we might need.

Polars supports many different data types (see their documentation):

  • Numeric: signed integers, unsigned integers, floating point numbers, and decimals
  • Nested: lists, structs, and arrays
  • Temporal: dates, datetimes, times, and time deltas
  • Others: strings, binary data, booleans, categoricals, enums, and objects

Missing data is denoted by a special null value for all data types.
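For example, Python’s None becomes null when creating a series, and we can request a specific data type explicitly:

pl.Series("numbers", [1, None, 3], dtype=pl.Int32)  # the None is stored as null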

Unless we are creating our own data, we need to import it from somewhere. Data can be stored in many forms: files in different formats, databases, or cloud storage. There are functions to read data from all of these sources (read more here), but we will focus on reading CSV and, a little later, Parquet files. The reading functions take many parameters; among the most useful are columns (the columns to select) and, specifically for CSV, null_values (values to interpret as missing), try_parse_dates (automatically parse dates), separator, and encoding.

data = pl.read_csv(
    "data.csv",
    columns=["id", "date", "value"],
    null_values=["MISSING", "---", "999"],
    try_parse_dates=True,
    separator=";",
    encoding="utf8",
)

data = pl.read_parquet("data.parquet", columns=["id", "date", "value"])
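The out-of-core processing mentioned earlier is available through the lazy API: the scan_* counterparts of the reading functions build up a query without touching the file, and the data is only processed when we call collect. A minimal sketch, reusing the placeholder file and column names from above:

data = (
    pl.scan_parquet("data.parquet")  # builds a lazy query; nothing is read yet
    .filter(col("value") > 0)
    .select("id", "value")
    .collect()  # the query runs here, reading only the needed rows and columns
)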

For the rest of the examples, we’ll use the MIMIC-IV Clinical Database Demo to demonstrate how to work with data. We can import the patients DataFrame with (change the path as applicable):

patients = pl.read_csv("../../data/hosp/patients.csv.gz")

Let’s look at how we can inspect the data in different ways.

Shows the first/last n rows of the DataFrame (default is 5 rows).

patients.head()
patients.tail(10)

Gives some descriptive statistics for the columns; mostly useful for numerical columns.

patients.describe()

Counts the number of occurrences of each category in categorical columns.

patients["gender"].value_counts()

Filters the rows based on one or more criteria.

patients.filter(col("gender").eq("M"))

Sorts the rows based on one or more criteria.

patients.sort("anchor_age")

Keep only unique observations, optionally within a subset of the columns.

patients.unique("gender", keep="first")

Select columns based on their names.

patients.select("gender", "dod")
patients["gender", "dod"]

Indexing the DataFrame like a dictionary ([...]) gives the same result, but select is more flexible: it can also evaluate expressions to create new columns.

patients.select(is_adult=col("anchor_age").gt(18))

Using regular expressions (a mini language for matching patterns in text), we can select columns based on some pattern. For example, all columns starting with “anchor_”.

patients.select(col("^anchor_.*$"))

Create new columns and append them to the data.

patients.with_columns(is_adult=col("anchor_age").gt(18))

Rename columns.

patients.rename({"gender": "sex"})

Vectorized mathematical operations on columns.

Add and subtract columns.

patients.select(
    "anchor_age",
    "anchor_year",
    age_plus_5=col("anchor_age").add(5),
    age_minus_5=col("anchor_age").sub(5),
    age_minus_year=col("anchor_age").sub("anchor_year"),
    # Which is the same as the more verbose, but more flexible:
    # age_minus_year=col("anchor_age").sub(col("anchor_year"))
)

Multiply and divide columns.

patients.select(
    "anchor_age",
    age_mul_5=col("anchor_age").mul(5),
    age_div_5=col("anchor_age").truediv(5),
)

Compare values between columns.

Equal and not equal to.

patients.select(
    "gender",
    is_male=col("gender").eq("M"),
    is_not_male=col("gender").ne("M"),
)

Greater than and greater than or equal to.

patients.select(
    "anchor_age",
    age_ge_21=col("anchor_age").ge(21),
    age_gt_21=col("anchor_age").gt(21),
)

Less than and less than or equal to.

patients.select(
    "anchor_age",
    age_le_72=col("anchor_age").le(72),
    age_lt_72=col("anchor_age").lt(72),
)

Check if values are between some other values.

patients.select(
    "anchor_age",
    age_21_72=col("anchor_age").is_between(21, 72, closed="both"),
)

Check if values are present in a list of values.

patients.select(
    "anchor_year",
    year_2125_2134_2155=col("anchor_year").is_in([2125, 2134, 2155]),
)

Check for missing values.

patients.select(
    "dod",
    dead=col("dod").is_not_null(),  # a recorded date of death means the patient died
    not_dead=col("dod").is_null(),
)

Logical inverse, True <-> False.

patients.select(
    "gender",
    is_not_male=col("gender").eq("M").not_(),
)

Aggregate data over rows.

Get the min, max, and mode for a column.

patients.select(
    n=pl.len(),
    age_min=col("anchor_age").min(),
    age_max=col("anchor_age").max(),
    gender_mode=col("gender").mode(),
)

Get the mean, median, and any quantile for a column.

patients.select(
    age_mean=col("anchor_age").mean(),
    age_median=col("anchor_age").median(),
    age_q25=col("anchor_age").quantile(0.25),
    age_q50=col("anchor_age").quantile(0.50),
    age_q75=col("anchor_age").quantile(0.75),
)

Get the standard deviation and calculate the interquartile range for a column.

patients.select(
    age_std=col("anchor_age").std(),
    age_iqr=col("anchor_age").quantile(0.75).sub(col("anchor_age").quantile(0.25)),
)

Using the horizontal version of aggregations (e.g. pl.sum_horizontal) allows for aggregations over the columns instead of the rows (i.e. aggregation within each row).

patients.select(
    "anchor_age",
    "anchor_year",
    sum_horizontal=pl.sum_horizontal("anchor_age", "anchor_year"),
    mean_horizontal=pl.mean_horizontal("anchor_age", "anchor_year"),
)

Group data based on one or more columns and perform calculations within these groups.

After we have grouped the data based on some criteria we can aggregate the data within groups.

patients.group_by("gender").agg(
    n=pl.len(),
    age_mean=col("anchor_age").mean(),
    age_std=col("anchor_age").std(),
    age_median=col("anchor_age").median(),
    age_q25=col("anchor_age").quantile(0.25),
    age_q75=col("anchor_age").quantile(0.75),
)

Join (sometimes called merge) one DataFrame with another.

The join function allows for many different types of joins (inner, left, right, etc.).

admissions = pl.read_csv(
    "../../data/hosp/admissions.csv.gz", columns=["subject_id", "insurance"]
)
patients.select("subject_id", "gender", "anchor_age").join(
    admissions, on="subject_id", how="left", validate="1:m"
)
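Changing the how parameter changes which rows are kept; for example, an inner join keeps only the patients that have at least one matching admission:

patients.select("subject_id", "gender", "anchor_age").join(
    admissions, on="subject_id", how="inner"
)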

Polars has special functions for working with strings.
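Most of these live in the str namespace on expressions. A small sketch (the output column names are just illustrative):

patients.select(
    gender_lower=col("gender").str.to_lowercase(),  # "M" -> "m"
    gender_len=col("gender").str.len_chars(),  # number of characters
)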

Create a string that contains values from one or more columns.

patients.select(
    str_column=pl.format(
        "Subject {} is {} and {} years old.",
        col("subject_id"),
        col("gender").replace({"F": "female", "M": "male"}),
        col("anchor_age"),
    )
)

This can be useful for creating baseline characteristics tables:

patients.select(
    age_mean_std=pl.format(
        "{} ({})",
        col("anchor_age").mean().round(1),
        col("anchor_age").std().round(1),
    ),
    male_n_prop=pl.format(
        "{} ({}%)",
        col("gender").eq("M").sum(),
        col("gender").eq("M").mean().mul(100).round(1),
    ),
)

Chaining is useful for performing several related operations in a single block of code, which can increase readability and code flow.

(
    patients.with_columns(
        male=col("gender").eq("M"),
        is_working_age=col("anchor_age").is_between(18, 65),
    )
    .drop("gender")
    .filter(col("anchor_year").gt(2140), "is_working_age")
    .group_by("male")
    .agg(pl.len(), alive=col("dod").is_null().sum())  # patients with no date of death
    .with_columns(proportion=col("alive").truediv("len"))
)

If you prefer another tool like Pandas, that’s great! Feel free to use whatever you like; it shouldn’t be too hard to translate the examples on this site into their Pandas equivalents.