Data Management

One of the most mature tools for handling data in Python is the NumPy library, which adds support for large, multi-dimensional arrays and matrices, as well as functions for performing vectorized mathematical operations on these arrays. NumPy is still widely used for numerical computations, but has mostly been superseded by other libraries for analysing general tabular data (data frames). One of these is Pandas, which is built on top of NumPy, but adds a lot of functionality that makes working with tabular data more convenient. Pandas is one of the most popular tools for data management in Python, but there are more modern alternatives that can provide additional benefits, such as Polars, PyArrow, Dask, DuckDB, and others. Finally, there is a compatibility layer called Narwhals that supports using most modern data libraries with a single syntax (a subset of the Polars API).

Here we will focus on using the Polars library, as it is very fast, has an intuitive API, allows for out-of-core processing (doesn’t need to read in all the data at once), and makes parallelisation easy. As an added bonus, it is easy to go from Polars to Narwhals, if we ever need to work with another data library.
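As a taste of that portability, here is a minimal sketch of a function written against the Narwhals API instead of directly against Polars (the helper keep_adults is just an illustration, not part of either library):

import narwhals as nw

def keep_adults(df_native):
    # Wrap the native frame (Polars, Pandas, PyArrow, ...) in a Narwhals frame
    df = nw.from_native(df_native)
    # Filter with Polars-like syntax and hand back the same type we were given
    return df.filter(nw.col("anchor_age") > 18).to_native()

The same function can now be called with a Polars DataFrame, a Pandas DataFrame, or a frame from any other backend Narwhals supports.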

import polars as pl # The polars module
from polars import col # Function to manipulate DataFrame columns

The main data structures in Polars are series and DataFrames.

A series is a one-dimensional data structure where all elements have the same data type.

pl.Series("numbers", [1, 2, 3, 4, 5])

Extending the series is the DataFrame, a two-dimensional data structure that contains uniquely named series (the columns).

pl.DataFrame(
    {
        "numbers": [1, 2, 3, 4, 5],
        "letters": ["A", "B", "C", "D", "E"],
    }
)

By having the data in a DataFrame, we can use Polars’s functions to select columns, filter rows, create new variables, or perform any other data manipulation we might need.

Polars supports many different data types (see their documentation):

  • Numeric: signed integers, unsigned integers, floating point numbers, and decimals
  • Nested: lists, structs, and arrays
  • Temporal: dates, datetimes, times, and time deltas
  • Others: strings, binary data, booleans, categoricals, enums, and objects

Missing data is denoted by a special null value for all data types.
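For example, Python’s None becomes null when creating a series, and we can request a specific data type explicitly:

pl.Series("numbers", [1, None, 3], dtype=pl.Int32)  # the None is stored as null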

Unless we are creating our own data, we need to import it from somewhere. Data can be stored in many forms: files in different formats, databases, or cloud storage. There are functions to read data from all of these sources (read more here), but we will focus on reading CSV and, a little later, Parquet files. The reading functions take many parameters; among the most useful are columns (the columns to select) and, specifically for CSV, null_values (values to interpret as missing), try_parse_dates (automatically parse dates), separator, and encoding.

data = pl.read_csv(
    "data.csv",
    columns=["id", "date", "value"],
    null_values=["MISSING", "---", "999"],
    try_parse_dates=True,
    separator=";",
    encoding="utf8",
)

data = pl.read_parquet("data.parquet", columns=["id", "date", "value"])
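The out-of-core processing mentioned earlier is available through the lazy API: the scan_* counterparts of the reading functions build up a query without touching the file, and the data is only processed when we call collect. A minimal sketch, reusing the placeholder file and column names from above:

data = (
    pl.scan_parquet("data.parquet")  # builds a lazy query; nothing is read yet
    .filter(col("value") > 0)
    .select("id", "value")
    .collect()  # the query runs here, reading only the needed rows and columns
)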

For the rest of the examples, we’ll use the MIMIC-IV Clinical Database Demo to demonstrate how to work with data. We can import the patients DataFrame with (change the path as applicable):

patients = pl.read_csv("../../data/hosp/patients.csv.gz")

Let’s look at how we can inspect the data in different ways.

Shows the first/last n rows of the DataFrame (default is 5 rows).

patients.head()
patients.tail(10)

Gives some descriptive statistics for the columns; mostly useful for numerical columns.

patients.describe()

Counts the number of occurrences of each category in categorical columns.

patients["gender"].value_counts()

Filters the rows based on one or more criteria.

patients.filter(col("gender").eq("M"))

Sorts the rows based on one or more criteria.

patients.sort("anchor_age")

Keep only unique observations, optionally within a subset of the columns.

patients.unique("gender", keep="first")

Select columns based on their names.

patients.select("gender", "dod")
patients["gender", "dod"]

Indexing the DataFrame like a dictionary ([...]) gives the same result, but select is more flexible: it can also evaluate expressions to create new columns.

patients.select(is_adult=col("anchor_age").gt(18))

Using regular expressions (a mini language for matching patterns in text), we can select columns based on some pattern. For example, all columns starting with “anchor_”.

patients.select(col("^anchor_.*$"))

Create new columns and append them to the data.

patients.with_columns(is_adult=col("anchor_age").gt(18))

Rename columns.

patients.rename({"gender": "sex"})

Vectorized mathematical operations on columns.

Add and subtract columns.

patients.select(
    "anchor_age",
    "anchor_year",
    age_plus_5=col("anchor_age").add(5),
    age_minus_5=col("anchor_age").sub(5),
    age_minus_year=col("anchor_age").sub("anchor_year"),
    # Which is the same as the more verbose, but more flexible:
    # age_minus_year=col("anchor_age").sub(col("anchor_year"))
)

Multiply and divide columns.

patients.select(
    "anchor_age",
    age_mul_5=col("anchor_age").mul(5),
    age_div_5=col("anchor_age").truediv(5),
)

Compare values between columns.

Equal and not equal to.

patients.select(
    "gender",
    is_male=col("gender").eq("M"),
    is_not_male=col("gender").ne("M"),
)

Greater than and greater than or equal to.

patients.select(
    "anchor_age",
    age_ge_21=col("anchor_age").ge(21),
    age_gt_21=col("anchor_age").gt(21),
)

Less than and less than or equal to.

patients.select(
    "anchor_age",
    age_le_72=col("anchor_age").le(72),
    age_lt_72=col("anchor_age").lt(72),
)

Check if values are between some other values.

patients.select(
    "anchor_age",
    age_21_72=col("anchor_age").is_between(21, 72, closed="both"),
)

Check if values are present in a list of values.

patients.select(
    "anchor_year",
    year_2125_2134_2155=col("anchor_year").is_in([2125, 2134, 2155]),
)

Check for missing values.

patients.select(
    "dod",
    dead=col("dod").is_not_null(),  # a recorded date of death means the patient died
    not_dead=col("dod").is_null(),
)

Logical inverse, True <-> False.

patients.select(
    "gender",
    is_not_male=col("gender").eq("M").not_(),
)

Aggregate data over rows.

Get the min, max, and mode for a column.

patients.select(
    n=pl.len(),
    age_min=col("anchor_age").min(),
    age_max=col("anchor_age").max(),
    gender_mode=col("gender").mode(),
)

Get the mean, median, and any quantile for a column.

patients.select(
    age_mean=col("anchor_age").mean(),
    age_median=col("anchor_age").median(),
    age_q25=col("anchor_age").quantile(0.25),
    age_q50=col("anchor_age").quantile(0.50),
    age_q75=col("anchor_age").quantile(0.75),
)

Get the standard deviation and calculate the interquartile range for a column.

patients.select(
    age_std=col("anchor_age").std(),
    age_iqr=col("anchor_age").quantile(0.75).sub(col("anchor_age").quantile(0.25)),
)

Using the horizontal version of aggregations (e.g. pl.sum_horizontal) allows for aggregations over the columns instead of the rows (i.e. aggregation within each row).

patients.select(
    "anchor_age",
    "anchor_year",
    sum_horizontal=pl.sum_horizontal("anchor_age", "anchor_year"),
    mean_horizontal=pl.mean_horizontal("anchor_age", "anchor_year"),
)

Group data based on one or more columns and perform calculations within these groups.

After we have grouped the data based on some criteria we can aggregate the data within groups.

patients.group_by("gender").agg(
    n=pl.len(),
    age_mean=col("anchor_age").mean(),
    age_std=col("anchor_age").std(),
    age_median=col("anchor_age").median(),
    age_q25=col("anchor_age").quantile(0.25),
    age_q75=col("anchor_age").quantile(0.75),
)

Join (sometimes called merge) one DataFrame with another.

The join function allows for many different types of joins (inner, left, right, etc.).

admissions = pl.read_csv(
    "../../data/hosp/admissions.csv.gz", columns=["subject_id", "insurance"]
)
patients.select("subject_id", "gender", "anchor_age").join(
    admissions, on="subject_id", how="left", validate="1:m"
)
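Changing the how parameter changes which rows are kept; for example, an inner join keeps only the patients that have at least one matching admission:

patients.select("subject_id", "gender", "anchor_age").join(
    admissions, on="subject_id", how="inner"
)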

Polars has special functions for working with strings.
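Most of these live in the str namespace on expressions. A small sketch (the output column names are just illustrative):

patients.select(
    gender_lower=col("gender").str.to_lowercase(),  # "M" -> "m"
    gender_len=col("gender").str.len_chars(),  # number of characters
)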

Create a string that contains values from one or more columns.

patients.select(
    str_column=pl.format(
        "Subject {} is {} and {} years old.",
        col("subject_id"),
        col("gender").replace({"F": "female", "M": "male"}),
        col("anchor_age"),
    )
)

This can be useful for creating baseline characteristics tables:

patients.select(
    age_mean_std=pl.format(
        "{} ({})",
        col("anchor_age").mean().round(1),
        col("anchor_age").std().round(1),
    ),
    male_n_prop=pl.format(
        "{} ({}%)",
        col("gender").eq("M").sum(),
        col("gender").eq("M").mean().mul(100).round(1),
    ),
)

Chaining is useful for performing several related operations in a single block of code, which can increase readability and code flow.

(
    patients.with_columns(
        male=col("gender").eq("M"),
        is_working_age=col("anchor_age").is_between(18, 65),
    )
    .drop("gender")
    .filter(col("anchor_year").gt(2140), "is_working_age")
    .group_by("male")
    .agg(pl.len(), alive=col("dod").is_null().sum())  # patients with no date of death
    .with_columns(proportion=col("alive").truediv("len"))
)

If you prefer another tool like Pandas, that’s great! Feel free to use whatever you like; it shouldn’t be too hard to translate the examples on this site into their Pandas equivalents.