Data Management
One of the most mature tools for handling data in Python is the NumPy
library, which adds support for large, multi-dimensional arrays and matrices, as well as functions for performing vectorized mathematical operations on these arrays.
NumPy
is still widely used for numerical computations, but has mostly been superseded by other libraries for analysing general tabular data (data frames).
One of these is Pandas
, which is built on top of NumPy
, but adds a lot of functionality that makes working with tabular data more convenient.
Pandas
is one of the most popular tools for data management in Python, but there are more modern alternatives that can provide additional benefits, such as Polars
, PyArrow
, Dask
, DuckDB
, and others.
Finally, there is a compatibility layer called Narwhals
that supports using most modern data libraries with a single syntax (a subset of the Polars API).
Polars pl
Section titled “Polars ”Here we will focus on using the Polars
library, as it is very fast, has an intuitive API, allows for out-of-core processing (doesn’t need to read in all the data at once), and makes parallelisation easy.
As an added bonus, it is easy to go from Polars
to Narwhals
, if we ever need to work with another data library.
import polars as pl # The polars modulefrom polars import col # Function to manipulate DataFrame columns
Series
/ DataFrame
Section titled “Series / DataFrame”The main data structures in Polars
are series and DataFrames.
A series is a one-dimensional data structure where all elements have the same data type.
pl.Series("numbers", [1, 2, 3, 4, 5])
shape: (5,)Series: 'numbers' [i64][ 1 2 3 4 5]
Extending the series is the DataFrame, a two-dimensional data structure that contains uniquely named series (the columns).
pl.DataFrame( { "numbers": [1, 2, 3, 4, 5], "letters": ["A", "B", "C", "D", "E"], })
shape: (5, 2)┌─────────┬─────────┐│ numbers ┆ letters ││ --- ┆ --- ││ i64 ┆ str │╞═════════╪═════════╡│ 1 ┆ A ││ 2 ┆ B ││ 3 ┆ C ││ 4 ┆ D ││ 5 ┆ E │└─────────┴─────────┘
By having the data in a DataFrame, we can use Polars
’s functions to select columns, filter rows, create new variables, or perform any other data manipulation we might need.
Data Types
Section titled “Data Types”Polars supports a many different data types (see their documentation):
- Numeric: signed integers, unsigned integers, floating point numbers, and decimals
- Nested: lists, structs, and arrays
- Temporal: dates, datetimes, times, and time deltas
- Others: strings, binary data, booleans, categoricals, enums, and objects
Missing data is denoted by a special null
value for all data types.
Importing Data
Section titled “Importing Data”Unless we are creating our own data, we need to import it from somewhere.
Data can be stored in many forms: files in different formats, databases, cloud storage.
There are functions to read the data for all of these forms (read more here), but we will focus on reading data from CSV
and, a little later, Parquet
files.
The functions to read data take many parameters, with the most useful being columns
(the columns to select), and specifically for CSV null_values
(values to interpret as missing), try_parse_dates
(automatically parse dates), separator
, and file encoding
.
read_csv
Section titled “read_csv”data = pl.read_csv( "data.csv", columns=["id", "date", "value"], null_values=["MISSING", "---", "999"], try_parse_dates=True, separator=";", encoding="utf8",)
read_parquet
Section titled “read_parquet”data = pl.read_parquet("data.parquet", columns=["id", "date", "value"])
Example data
Section titled “Example data”For the rest examples, we’ll use the MIMIC-IV Clinical Database Demo to demonstrate how to work with data. We can import the patients DataFrame with (change the path as applicable):
patients = pl.read_csv("../../data/hosp/patients.csv.gz")
Inspecting the Data
Section titled “Inspecting the Data”Let’s see at how we can inspect the data in different ways.
head
/ tail
Section titled “head / tail”Shows the first/last n
rows of the DataFrame (default is 5 rows).
patients.head()patients.tail(10)
shape: (5, 6)┌────────────┬────────┬────────────┬─────────────┬───────────────────┬────────────┐│ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str │╞════════════╪════════╪════════════╪═════════════╪═══════════════════╪════════════╡│ 10014729 ┆ F ┆ 21 ┆ 2125 ┆ 2011 - 2013 ┆ null ││ 10003400 ┆ F ┆ 72 ┆ 2134 ┆ 2011 - 2013 ┆ 2137-09-02 ││ 10002428 ┆ F ┆ 80 ┆ 2155 ┆ 2011 - 2013 ┆ null ││ 10032725 ┆ F ┆ 38 ┆ 2143 ┆ 2011 - 2013 ┆ 2143-03-30 ││ 10027445 ┆ F ┆ 48 ┆ 2142 ┆ 2011 - 2013 ┆ 2146-02-09 │└────────────┴────────┴────────────┴─────────────┴───────────────────┴────────────┘
shape: (10, 6)┌────────────┬────────┬────────────┬─────────────┬───────────────────┬────────────┐│ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str │╞════════════╪════════╪════════════╪═════════════╪═══════════════════╪════════════╡│ 10005817 ┆ M ┆ 66 ┆ 2132 ┆ 2014 - 2016 ┆ 2135-01-19 ││ 10009049 ┆ M ┆ 56 ┆ 2174 ┆ 2014 - 2016 ┆ null ││ 10015931 ┆ M ┆ 87 ┆ 2176 ┆ 2014 - 2016 ┆ 2177-03-29 ││ 10006053 ┆ M ┆ 52 ┆ 2111 ┆ 2014 - 2016 ┆ 2111-11-15 ││ 10012552 ┆ M ┆ 78 ┆ 2140 ┆ 2014 - 2016 ┆ null ││ 10004733 ┆ M ┆ 51 ┆ 2174 ┆ 2014 - 2016 ┆ null ││ 10021118 ┆ M ┆ 62 ┆ 2161 ┆ 2014 - 2016 ┆ null ││ 10018501 ┆ M ┆ 83 ┆ 2141 ┆ 2014 - 2016 ┆ null ││ 10007058 ┆ M ┆ 48 ┆ 2167 ┆ 2014 - 2016 ┆ null ││ 10038992 ┆ M ┆ 70 ┆ 2185 ┆ 2014 - 2016 ┆ null │└────────────┴────────┴────────────┴─────────────┴───────────────────┴────────────┘
describe
Section titled “describe”Gives some descriptive statistics for the columns; mostly useful for numerical columns.
patients.describe()
shape: (9, 7)┌────────────┬──────────────┬────────┬────────────┬─────────────┬───────────────────┬────────────┐│ statistic ┆ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ str ┆ f64 ┆ str ┆ f64 ┆ f64 ┆ str ┆ str │╞════════════╪══════════════╪════════╪════════════╪═════════════╪═══════════════════╪════════════╡│ count ┆ 100.0 ┆ 100 ┆ 100.0 ┆ 100.0 ┆ 100 ┆ 31 ││ null_count ┆ 0.0 ┆ 0 ┆ 0.0 ┆ 0.0 ┆ 0 ┆ 69 ││ mean ┆ 1.0019e7 ┆ null ┆ 61.75 ┆ 2148.68 ┆ null ┆ null ││ std ┆ 11151.154261 ┆ null ┆ 16.16979 ┆ 25.629008 ┆ null ┆ null ││ min ┆ 1.0000032e7 ┆ F ┆ 21.0 ┆ 2110.0 ┆ 2011 - 2013 ┆ 2111-11-15 ││ 25% ┆ 1.0009035e7 ┆ null ┆ 52.0 ┆ 2129.0 ┆ null ┆ null ││ 50% ┆ 1.0019172e7 ┆ null ┆ 63.0 ┆ 2146.0 ┆ null ┆ null ││ 75% ┆ 1.0025463e7 ┆ null ┆ 72.0 ┆ 2172.0 ┆ null ┆ null ││ max ┆ 1.0040025e7 ┆ M ┆ 91.0 ┆ 2201.0 ┆ 2014 - 2016 ┆ 2201-12-24 │└────────────┴──────────────┴────────┴────────────┴─────────────┴───────────────────┴────────────┘
value_counts
Section titled “value_counts”Counts the number of occurrences of each category in categorical columns.
patients["gender"].value_counts()
shape: (2, 2)┌────────┬───────┐│ gender ┆ count ││ --- ┆ --- ││ str ┆ u32 │╞════════╪═══════╡│ M ┆ 57 ││ F ┆ 43 │└────────┴───────┘
filter
Section titled “filter”Filters the rows based on one or more criteria.
patients.filter(col("gender").eq("M"))
shape: (57, 6)┌────────────┬────────┬────────────┬─────────────┬───────────────────┬────────────┐│ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str │╞════════════╪════════╪════════════╪═════════════╪═══════════════════╪════════════╡│ 10022281 ┆ M ┆ 84 ┆ 2125 ┆ 2011 - 2013 ┆ null ││ 10035631 ┆ M ┆ 63 ┆ 2112 ┆ 2011 - 2013 ┆ 2116-03-12 ││ 10024043 ┆ M ┆ 67 ┆ 2117 ┆ 2011 - 2013 ┆ 2117-06-26 ││ 10025612 ┆ M ┆ 82 ┆ 2125 ┆ 2011 - 2013 ┆ null ││ 10003046 ┆ M ┆ 64 ┆ 2154 ┆ 2011 - 2013 ┆ null ││ … ┆ … ┆ … ┆ … ┆ … ┆ … ││ 10004733 ┆ M ┆ 51 ┆ 2174 ┆ 2014 - 2016 ┆ null ││ 10021118 ┆ M ┆ 62 ┆ 2161 ┆ 2014 - 2016 ┆ null ││ 10018501 ┆ M ┆ 83 ┆ 2141 ┆ 2014 - 2016 ┆ null ││ 10007058 ┆ M ┆ 48 ┆ 2167 ┆ 2014 - 2016 ┆ null ││ 10038992 ┆ M ┆ 70 ┆ 2185 ┆ 2014 - 2016 ┆ null │└────────────┴────────┴────────────┴─────────────┴───────────────────┴────────────┘
Sorts the rows based on one or more criteria.
patients.sort("anchor_age")
shape: (100, 6)┌────────────┬────────┬────────────┬─────────────┬───────────────────┬────────────┐│ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str │╞════════════╪════════╪════════════╪═════════════╪═══════════════════╪════════════╡│ 10014729 ┆ F ┆ 21 ┆ 2125 ┆ 2011 - 2013 ┆ null ││ 10008454 ┆ F ┆ 26 ┆ 2110 ┆ 2011 - 2013 ┆ null ││ 10010867 ┆ F ┆ 28 ┆ 2147 ┆ 2014 - 2016 ┆ null ││ 10009035 ┆ M ┆ 28 ┆ 2161 ┆ 2011 - 2013 ┆ null ││ 10023239 ┆ F ┆ 29 ┆ 2137 ┆ 2014 - 2016 ┆ null ││ … ┆ … ┆ … ┆ … ┆ … ┆ … ││ 10036156 ┆ F ┆ 88 ┆ 2157 ┆ 2014 - 2016 ┆ null ││ 10010471 ┆ F ┆ 89 ┆ 2155 ┆ 2014 - 2016 ┆ 2155-12-07 ││ 10020640 ┆ F ┆ 91 ┆ 2153 ┆ 2011 - 2013 ┆ 2154-02-04 ││ 10012853 ┆ F ┆ 91 ┆ 2175 ┆ 2014 - 2016 ┆ null ││ 10018845 ┆ M ┆ 91 ┆ 2184 ┆ 2014 - 2016 ┆ 2184-11-22 │└────────────┴────────┴────────────┴─────────────┴───────────────────┴────────────┘
unique
Section titled “unique”Keep only unique observations, optionally within a subset of the columns.
patients.unique("gender", keep="first")
shape: (2, 6)┌────────────┬────────┬────────────┬─────────────┬───────────────────┬──────┐│ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str │╞════════════╪════════╪════════════╪═════════════╪═══════════════════╪══════╡│ 10014729 ┆ F ┆ 21 ┆ 2125 ┆ 2011 - 2013 ┆ null ││ 10022281 ┆ M ┆ 84 ┆ 2125 ┆ 2011 - 2013 ┆ null │└────────────┴────────┴────────────┴─────────────┴───────────────────┴──────┘
Columns
Section titled “Columns”select
Section titled “select”Select columns based on their names.
patients.select("gender", "dod")patients["gender", "dod"]
shape: (100, 2)┌────────┬────────────┐│ gender ┆ dod ││ --- ┆ --- ││ str ┆ str │╞════════╪════════════╡│ F ┆ null ││ F ┆ 2137-09-02 ││ F ┆ null ││ F ┆ 2143-03-30 ││ F ┆ 2146-02-09 ││ … ┆ … ││ M ┆ null ││ M ┆ null ││ M ┆ null ││ M ┆ null ││ M ┆ null │└────────┴────────────┘
Indexing the DataFrame like a dictionary [...]
will give the same result,
patients.select(is_adult=col("anchor_age").gt(18))
shape: (100, 1)┌──────────┐│ is_adult ││ --- ││ bool │╞══════════╡│ true ││ true ││ true ││ true ││ true ││ … ││ true ││ true ││ true ││ true ││ true │└──────────┘
Using regular expressions (a mini language for matching patterns in text), we can select columns based on some pattern. For example, all columns starting with “anchor_”.
patients.select(col("^anchor_.*$"))
shape: (100, 3)┌────────────┬─────────────┬───────────────────┐│ anchor_age ┆ anchor_year ┆ anchor_year_group ││ --- ┆ --- ┆ --- ││ i64 ┆ i64 ┆ str │╞════════════╪═════════════╪═══════════════════╡│ 21 ┆ 2125 ┆ 2011 - 2013 ││ 72 ┆ 2134 ┆ 2011 - 2013 ││ 80 ┆ 2155 ┆ 2011 - 2013 ││ 38 ┆ 2143 ┆ 2011 - 2013 ││ 48 ┆ 2142 ┆ 2011 - 2013 ││ … ┆ … ┆ … ││ 51 ┆ 2174 ┆ 2014 - 2016 ││ 62 ┆ 2161 ┆ 2014 - 2016 ││ 83 ┆ 2141 ┆ 2014 - 2016 ││ 48 ┆ 2167 ┆ 2014 - 2016 ││ 70 ┆ 2185 ┆ 2014 - 2016 │└────────────┴─────────────┴───────────────────┘
with_columns
Section titled “with_columns”Create new columns and append them to the data.
patients.with_columns(is_adult=col("anchor_age").gt(18))
shape: (100, 7)┌────────────┬────────┬────────────┬─────────────┬───────────────────┬────────────┬──────────┐│ subject_id ┆ gender ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ┆ is_adult ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str ┆ bool │╞════════════╪════════╪════════════╪═════════════╪═══════════════════╪════════════╪══════════╡│ 10014729 ┆ F ┆ 21 ┆ 2125 ┆ 2011 - 2013 ┆ null ┆ true ││ 10003400 ┆ F ┆ 72 ┆ 2134 ┆ 2011 - 2013 ┆ 2137-09-02 ┆ true ││ 10002428 ┆ F ┆ 80 ┆ 2155 ┆ 2011 - 2013 ┆ null ┆ true ││ 10032725 ┆ F ┆ 38 ┆ 2143 ┆ 2011 - 2013 ┆ 2143-03-30 ┆ true ││ 10027445 ┆ F ┆ 48 ┆ 2142 ┆ 2011 - 2013 ┆ 2146-02-09 ┆ true ││ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ││ 10004733 ┆ M ┆ 51 ┆ 2174 ┆ 2014 - 2016 ┆ null ┆ true ││ 10021118 ┆ M ┆ 62 ┆ 2161 ┆ 2014 - 2016 ┆ null ┆ true ││ 10018501 ┆ M ┆ 83 ┆ 2141 ┆ 2014 - 2016 ┆ null ┆ true ││ 10007058 ┆ M ┆ 48 ┆ 2167 ┆ 2014 - 2016 ┆ null ┆ true ││ 10038992 ┆ M ┆ 70 ┆ 2185 ┆ 2014 - 2016 ┆ null ┆ true │└────────────┴────────┴────────────┴─────────────┴───────────────────┴────────────┴──────────┘
rename
Section titled “rename”Rename columns.
patients.rename({"gender": "sex"})
shape: (100, 6)┌────────────┬─────┬────────────┬─────────────┬───────────────────┬────────────┐│ subject_id ┆ sex ┆ anchor_age ┆ anchor_year ┆ anchor_year_group ┆ dod ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ str │╞════════════╪═════╪════════════╪═════════════╪═══════════════════╪════════════╡│ 10014729 ┆ F ┆ 21 ┆ 2125 ┆ 2011 - 2013 ┆ null ││ 10003400 ┆ F ┆ 72 ┆ 2134 ┆ 2011 - 2013 ┆ 2137-09-02 ││ 10002428 ┆ F ┆ 80 ┆ 2155 ┆ 2011 - 2013 ┆ null ││ 10032725 ┆ F ┆ 38 ┆ 2143 ┆ 2011 - 2013 ┆ 2143-03-30 ││ 10027445 ┆ F ┆ 48 ┆ 2142 ┆ 2011 - 2013 ┆ 2146-02-09 ││ … ┆ … ┆ … ┆ … ┆ … ┆ … ││ 10004733 ┆ M ┆ 51 ┆ 2174 ┆ 2014 - 2016 ┆ null ││ 10021118 ┆ M ┆ 62 ┆ 2161 ┆ 2014 - 2016 ┆ null ││ 10018501 ┆ M ┆ 83 ┆ 2141 ┆ 2014 - 2016 ┆ null ││ 10007058 ┆ M ┆ 48 ┆ 2167 ┆ 2014 - 2016 ┆ null ││ 10038992 ┆ M ┆ 70 ┆ 2185 ┆ 2014 - 2016 ┆ null │└────────────┴─────┴────────────┴─────────────┴───────────────────┴────────────┘
Operations
Section titled “Operations”Vectorized mathematical operations on columns.
add
/ sub
Section titled “add / sub”Add and subtract columns.
patients.select( "anchor_age", "anchor_year", age_plus_5=col("anchor_age").add(5), age_minus_5=col("anchor_age").sub(5), year_minus_age=col("anchor_age").sub("anchor_year"), # Which is the same as the more verbose, but more flexible: # year_minus_age=col("anchor_age").sub(col("anchor_year")))
shape: (100, 5)┌────────────┬─────────────┬────────────┬─────────────┬────────────────┐│ anchor_age ┆ anchor_year ┆ age_plus_5 ┆ age_minus_5 ┆ year_minus_age ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │╞════════════╪═════════════╪════════════╪═════════════╪════════════════╡│ 21 ┆ 2125 ┆ 26 ┆ 16 ┆ -2104 ││ 72 ┆ 2134 ┆ 77 ┆ 67 ┆ -2062 ││ 80 ┆ 2155 ┆ 85 ┆ 75 ┆ -2075 ││ 38 ┆ 2143 ┆ 43 ┆ 33 ┆ -2105 ││ 48 ┆ 2142 ┆ 53 ┆ 43 ┆ -2094 ││ … ┆ … ┆ … ┆ … ┆ … ││ 51 ┆ 2174 ┆ 56 ┆ 46 ┆ -2123 ││ 62 ┆ 2161 ┆ 67 ┆ 57 ┆ -2099 ││ 83 ┆ 2141 ┆ 88 ┆ 78 ┆ -2058 ││ 48 ┆ 2167 ┆ 53 ┆ 43 ┆ -2119 ││ 70 ┆ 2185 ┆ 75 ┆ 65 ┆ -2115 │└────────────┴─────────────┴────────────┴─────────────┴────────────────┘
mul
/ true_div
Section titled “mul / true_div”Multiply and divide columns.
patients.select( "anchor_age", age_mul_5=col("anchor_age").mul(5), age_div_5=col("anchor_age").truediv(5),)
shape: (100, 3)┌────────────┬───────────┬───────────┐│ anchor_age ┆ age_mul_5 ┆ age_div_5 ││ --- ┆ --- ┆ --- ││ i64 ┆ i64 ┆ f64 │╞════════════╪═══════════╪═══════════╡│ 21 ┆ 105 ┆ 4.2 ││ 72 ┆ 360 ┆ 14.4 ││ 80 ┆ 400 ┆ 16.0 ││ 38 ┆ 190 ┆ 7.6 ││ 48 ┆ 240 ┆ 9.6 ││ … ┆ … ┆ … ││ 51 ┆ 255 ┆ 10.2 ││ 62 ┆ 310 ┆ 12.4 ││ 83 ┆ 415 ┆ 16.6 ││ 48 ┆ 240 ┆ 9.6 ││ 70 ┆ 350 ┆ 14.0 │└────────────┴───────────┴───────────┘
Comparisons
Section titled “Comparisons”Compare values between columns.
eq
/ ne
Section titled “eq / ne”Equal and not equal to.
patients.select( "gender", is_male=col("gender").eq("M"), is_not_male=col("gender").ne("M"),)
shape: (100, 3)┌────────┬─────────┬─────────────┐│ gender ┆ is_male ┆ is_not_male ││ --- ┆ --- ┆ --- ││ str ┆ bool ┆ bool │╞════════╪═════════╪═════════════╡│ F ┆ false ┆ true ││ F ┆ false ┆ true ││ F ┆ false ┆ true ││ F ┆ false ┆ true ││ F ┆ false ┆ true ││ … ┆ … ┆ … ││ M ┆ true ┆ false ││ M ┆ true ┆ false ││ M ┆ true ┆ false ││ M ┆ true ┆ false ││ M ┆ true ┆ false │└────────┴─────────┴─────────────┘
gt
/ ge
Section titled “gt / ge”Greater than and greater than or equal to.
patients.select( "anchor_age", age_ge_21=col("anchor_age").ge(21), age_gt_21=col("anchor_age").gt(21),)
shape: (100, 3)┌────────────┬───────────┬───────────┐│ anchor_age ┆ age_ge_21 ┆ age_gt_21 ││ --- ┆ --- ┆ --- ││ i64 ┆ bool ┆ bool │╞════════════╪═══════════╪═══════════╡│ 21 ┆ true ┆ false ││ 72 ┆ true ┆ true ││ 80 ┆ true ┆ true ││ 38 ┆ true ┆ true ││ 48 ┆ true ┆ true ││ … ┆ … ┆ … ││ 51 ┆ true ┆ true ││ 62 ┆ true ┆ true ││ 83 ┆ true ┆ true ││ 48 ┆ true ┆ true ││ 70 ┆ true ┆ true │└────────────┴───────────┴───────────┘
lt
/ le
Section titled “lt / le”Less than and less than or equal to.
patients.select( "anchor_age", age_le_72=col("anchor_age").le(72), age_lt_72=col("anchor_age").lt(72),)
shape: (100, 3)┌────────────┬───────────┬───────────┐│ anchor_age ┆ age_le_72 ┆ age_lt_72 ││ --- ┆ --- ┆ --- ││ i64 ┆ bool ┆ bool │╞════════════╪═══════════╪═══════════╡│ 21 ┆ true ┆ true ││ 72 ┆ true ┆ false ││ 80 ┆ false ┆ false ││ 38 ┆ true ┆ true ││ 48 ┆ true ┆ true ││ … ┆ … ┆ … ││ 51 ┆ true ┆ true ││ 62 ┆ true ┆ true ││ 83 ┆ false ┆ false ││ 48 ┆ true ┆ true ││ 70 ┆ true ┆ true │└────────────┴───────────┴───────────┘
is_between
Section titled “is_between”Check if values are between some other values.
patients.select( "anchor_age", age_21_72=col("anchor_age").is_between(21, 72, closed="both"),)
shape: (100, 2)┌────────────┬───────────┐│ anchor_age ┆ age_21_72 ││ --- ┆ --- ││ i64 ┆ bool │╞════════════╪═══════════╡│ 21 ┆ true ││ 72 ┆ true ││ 80 ┆ false ││ 38 ┆ true ││ 48 ┆ true ││ … ┆ … ││ 51 ┆ true ││ 62 ┆ true ││ 83 ┆ false ││ 48 ┆ true ││ 70 ┆ true │└────────────┴───────────┘
Check if values are present in a list of values.
patients.select( "anchor_year", year_2125_2134_2155=col("anchor_year").is_in([2125, 2134, 2155]),)
shape: (100, 2)┌─────────────┬─────────────────────┐│ anchor_year ┆ year_2125_2134_2155 ││ --- ┆ --- ││ i64 ┆ bool │╞═════════════╪═════════════════════╡│ 2125 ┆ true ││ 2134 ┆ true ││ 2155 ┆ true ││ 2143 ┆ false ││ 2142 ┆ false ││ … ┆ … ││ 2174 ┆ false ││ 2161 ┆ false ││ 2141 ┆ false ││ 2167 ┆ false ││ 2185 ┆ false │└─────────────┴─────────────────────┘
is_null
/ is_not_null
Section titled “is_null / is_not_null”Check for missing values.
patients.select( "dod", dead=col("dod").is_null(), not_dead=col("dod").is_not_null(),)
shape: (100, 3)┌────────────┬───────┬──────────┐│ dod ┆ dead ┆ not_dead ││ --- ┆ --- ┆ --- ││ str ┆ bool ┆ bool │╞════════════╪═══════╪══════════╡│ null ┆ true ┆ false ││ 2137-09-02 ┆ false ┆ true ││ null ┆ true ┆ false ││ 2143-03-30 ┆ false ┆ true ││ 2146-02-09 ┆ false ┆ true ││ … ┆ … ┆ … ││ null ┆ true ┆ false ││ null ┆ true ┆ false ││ null ┆ true ┆ false ││ null ┆ true ┆ false ││ null ┆ true ┆ false │└────────────┴───────┴──────────┘
Logical inverse, True <-> False
.
patients.select( "gender", is_not_male=col("gender").eq("M").not_(),)
shape: (100, 2)┌────────┬─────────────┐│ gender ┆ is_not_male ││ --- ┆ --- ││ str ┆ bool │╞════════╪═════════════╡│ F ┆ true ││ F ┆ true ││ F ┆ true ││ F ┆ true ││ F ┆ true ││ … ┆ … ││ M ┆ false ││ M ┆ false ││ M ┆ false ││ M ┆ false ││ M ┆ false │└────────┴─────────────┘
Aggregations
Section titled “Aggregations”Aggregate data over rows.
min
/ max
/ mode
Section titled “min / max / mode”Get the min, max, and mode for a column.
patients.select( n=pl.len(), age_min=col("anchor_age").min(), age_max=col("anchor_age").max(), gender_mode=col("gender").mode(),)
shape: (1, 4)┌─────┬─────────┬─────────┬─────────────┐│ n ┆ age_min ┆ age_max ┆ gender_mode ││ --- ┆ --- ┆ --- ┆ --- ││ u32 ┆ i64 ┆ i64 ┆ str │╞═════╪═════════╪═════════╪═════════════╡│ 100 ┆ 21 ┆ 91 ┆ M │└─────┴─────────┴─────────┴─────────────┘
mean
/ median
/ quantile
Section titled “mean / median / quantile”Get the mean, median, and any quantile for a column.
patients.select( age_mean=col("anchor_age").mean(), age_median=col("anchor_age").median(), age_q25=col("anchor_age").quantile(0.25), age_q50=col("anchor_age").quantile(0.50), age_q75=col("anchor_age").quantile(0.75),)
shape: (1, 5)┌──────────┬────────────┬─────────┬─────────┬─────────┐│ age_mean ┆ age_median ┆ age_q25 ┆ age_q50 ┆ age_q75 ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │╞══════════╪════════════╪═════════╪═════════╪═════════╡│ 61.75 ┆ 63.0 ┆ 52.0 ┆ 63.0 ┆ 72.0 │└──────────┴────────────┴─────────┴─────────┴─────────┘
std
/ IQR
Section titled “std / IQR”Get the standard deviation and calculate the interquartile range for a column.
patients.select( age_std=col("anchor_age").std(), age_iqr=col("anchor_age").quantile(0.75).sub(col("anchor_age").quantile(0.25)),)
shape: (1, 2)┌──────────┬─────────┐│ age_std ┆ age_iqr ││ --- ┆ --- ││ f64 ┆ f64 │╞══════════╪═════════╡│ 16.16979 ┆ 20.0 │└──────────┴─────────┘
..._horizontal
Section titled “..._horizontal”Using the horizontal version of aggregations (eg. pl.sum_horizontal
) allows for aggregations over the columns instead of the rows (ie. aggregation within each row).
patients.select( "anchor_age", "anchor_year", sum_horizontal=pl.sum_horizontal("anchor_age", "anchor_year"), mean_horizontal=pl.mean_horizontal("anchor_age", "anchor_year"),)
shape: (100, 4)┌────────────┬─────────────┬────────────────┬─────────────────┐│ anchor_age ┆ anchor_year ┆ sum_horizontal ┆ mean_horizontal ││ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ i64 ┆ i64 ┆ f64 │╞════════════╪═════════════╪════════════════╪═════════════════╡│ 21 ┆ 2125 ┆ 2146 ┆ 1073.0 ││ 72 ┆ 2134 ┆ 2206 ┆ 1103.0 ││ 80 ┆ 2155 ┆ 2235 ┆ 1117.5 ││ 38 ┆ 2143 ┆ 2181 ┆ 1090.5 ││ 48 ┆ 2142 ┆ 2190 ┆ 1095.0 ││ … ┆ … ┆ … ┆ … ││ 51 ┆ 2174 ┆ 2225 ┆ 1112.5 ││ 62 ┆ 2161 ┆ 2223 ┆ 1111.5 ││ 83 ┆ 2141 ┆ 2224 ┆ 1112.0 ││ 48 ┆ 2167 ┆ 2215 ┆ 1107.5 ││ 70 ┆ 2185 ┆ 2255 ┆ 1127.5 │└────────────┴─────────────┴────────────────┴─────────────────┘
Groups
Section titled “Groups”Group data based on one or more columns perform calculations within these groups.
group_by
/ agg
Section titled “group_by / agg”After we have grouped the data based on some criteria we can aggregate the data within groups.
patients.group_by("gender").agg( n=pl.len(), age_mean=col("anchor_age").mean(), age_std=col("anchor_age").std(), age_median=col("anchor_age").median(), age_q25=col("anchor_age").quantile(0.25), age_q75=col("anchor_age").quantile(0.75),)
shape: (2, 7)┌────────┬─────┬───────────┬───────────┬────────────┬─────────┬─────────┐│ gender ┆ n ┆ mean_age ┆ std_age ┆ std_median ┆ std_q25 ┆ std_q75 ││ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ││ str ┆ u32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │╞════════╪═════╪═══════════╪═══════════╪════════════╪═════════╪═════════╡│ F ┆ 43 ┆ 60.813953 ┆ 18.074259 ┆ 63.0 ┆ 50.0 ┆ 74.0 ││ M ┆ 57 ┆ 62.45614 ┆ 14.698239 ┆ 64.0 ┆ 52.0 ┆ 70.0 │└────────┴─────┴───────────┴───────────┴────────────┴─────────┴─────────┘
Joining / Merging Data
Section titled “Joining / Merging Data”Join (sometimes called merge) one DataFrame with another.
The join
function allows for many different times of joins (inner, left, right, etc.).
admissions = pl.read_csv( "../../data/hosp/admissions.csv.gz", columns=["subject_id", "insurance"])
patients.select("subject_id", "gender", "anchor_age").join( admissions, on="subject_id", how="left", validate="1:m")
shape: (275, 4)┌────────────┬────────┬────────────┬───────────┐│ subject_id ┆ gender ┆ anchor_age ┆ insurance ││ --- ┆ --- ┆ --- ┆ --- ││ i64 ┆ str ┆ i64 ┆ str │╞════════════╪════════╪════════════╪═══════════╡│ 10014729 ┆ F ┆ 21 ┆ Other ││ 10014729 ┆ F ┆ 21 ┆ Other ││ 10003400 ┆ F ┆ 72 ┆ Medicare ││ 10003400 ┆ F ┆ 72 ┆ Medicare ││ 10003400 ┆ F ┆ 72 ┆ Medicare ││ … ┆ … ┆ … ┆ … ││ 10021118 ┆ M ┆ 62 ┆ Other ││ 10018501 ┆ M ┆ 83 ┆ Medicare ││ 10007058 ┆ M ┆ 48 ┆ Other ││ 10038992 ┆ M ┆ 70 ┆ Medicare ││ 10038992 ┆ M ┆ 70 ┆ Medicare │└────────────┴────────┴────────────┴───────────┘
Working with Strings
Section titled “Working with Strings”Polars has special functions for working with strings.
format
Section titled “format”Create a string that contains values from one or more columns.
patients.select( str_column=pl.format( "Subject {} is {} and {} years old.", col("subject_id"), col("gender").replace({"F": "female", "M": "male"}), col("anchor_age"), )).write_clipboard()
shape: (100, 1)┌─────────────────────────────────────────────┐│ str_column ││ --- ││ str │╞═════════════════════════════════════════════╡│ Subject 10014729 is female and 21 years old.││ Subject 10003400 is female and 72 years old.││ Subject 10002428 is female and 80 years old.││ Subject 10032725 is female and 38 years old.││ Subject 10027445 is female and 48 years old.││ … ││ Subject 10004733 is male and 51 years old. ││ Subject 10021118 is male and 62 years old. ││ Subject 10018501 is male and 83 years old. ││ Subject 10007058 is male and 48 years old. ││ Subject 10038992 is male and 70 years old. │└─────────────────────────────────────────────┘
This can be useful for creating baseline characteristics tables:
patients.select( age_mean_std=pl.format( "{} ({})", col("anchor_age").mean().round(1), col("anchor_age").std().round(1), ), male_n_prop=pl.format( "{} ({}%)", col("gender").eq("M").sum(), col("gender").eq("M").mean().mul(100).round(1), ),)
shape: (1, 2)┌──────────────┬─────────────┐│ age_mean_std ┆ male_n_prop ││ --- ┆ --- ││ str ┆ str │╞══════════════╪═════════════╡│ 61.8 (16.2) ┆ 57 (57.0%) │└──────────────┴─────────────┘
Chaining
Section titled “Chaining”Chaining is useful for performing several related operations in a single block of code, which can increase readability and code flow.
( patients.with_columns( male=col("gender").eq("M"), is_working_age=col("anchor_age").is_between(18, 65), ) .drop("gender") .filter(col("anchor_year").gt(2140), "is_working_age") .group_by("male") .agg(pl.len(), col("dod").is_null().sum()) .with_columns(proportion=col("dod").truediv("len")))
shape: (2, 4)┌───────┬─────┬─────┬────────────┐│ male ┆ len ┆ dod ┆ proportion ││ --- ┆ --- ┆ --- ┆ --- ││ bool ┆ u32 ┆ u32 ┆ f64 │╞═══════╪═════╪═════╪════════════╡│ false ┆ 13 ┆ 7 ┆ 0.538462 ││ true ┆ 24 ┆ 18 ┆ 0.75 │└───────┴─────┴─────┴────────────┘
But I like Pandas
!
Section titled “But I like Pandas!”That’s great!
Feel free to use whatever tool you like and it shouldn’t be too hard to translate the examples on this site into the Pandas
equivalent.