Skip to content

Data Management

Copy the code from the exercise below and try out the basics of data management with Polars. Create new cells with # %% as necessary.


# %%
# Import packages polars as pl and col from polars
# %%
# Create a simple Series from a list
# %%
# Create a simple DataFrame from a dictionary
# %%
# Import the hosp/patients.csv.gz data using read_csv (change path as applicable)
# %%
# Inspect the data
# - head / tail
# - describe
# - value_counts
# %%
# Manipulate rows
# - filter
# - sort
# - unique
# %%
# Manipulate columns
# - select
# - with_columns
# - rename
# %%
# Use different operations
# add / sub
# mul / truediv
# %%
# Make comparisons
# - eq / ne
# - gt / ge
# - lt / le
# - is_between
# - is_in
# - is_null / is_not_null
# - not_
# %%
# Use aggregations
# min / max / mode
# mean / median / quantile
# std / IQR
# %%
# Use horizontal aggregations
# - sum_horizontal
# - mean_horizontal
# %%
# Group by some variable
# - group_by / agg
# %%
# Join/merge data
# - join
# %%
# Work with strings
# - format
# %%
# Try to chain many different operations together
# %%
# Patient Admissions
# Tables: hosp/patients, hosp/admissions
# %%
# 1. Import packages
# %%
# 2. Load the data
# %%
# 3. Examine data types
# %%
# 4. Join tables on subject_id to create patient-admission dataset
# %%
# 5. Check for missing data
# %%
# 6. Cross-tabulate gender and marital status using groupby and pivot display it nicely
# %%
# 7. Get age at admission and calculate the mean and standard deviation by sex/gender
# Example: Patient with anchor_year=2150, anchor_age=60, and admission_year 2153,
# then: age_at_admission = 60 + (2153 - 2150) = 63
# %%
# 8. Get the length of stay (dischtime - admittime) in days and calculate the mean and
# standard deviation by insurance type
# %%
# 1. Import packages
# %%
# 2. Load the data
# %%
# 1. Merge ICU stays with hospital admissions
# %%
# 2. Calculate ICU admission proportion per admission type
# %%
# Laboratory Data
# Tables: hosp/labevents, hosp/d_labitems
# %%
# 1. Import packages
# %%
# 2. Load the data
# %%
# 3. Join with d_labitems to get human-readable lab names
# %%
# 4. Check what the most common tests were
# %%
# 5. Check the proportion of abnormal lab values (flag variable) per category
# %%
# 6. Calculate the mean and standard deviation per year for all Glucose values
# %%
# Medication Data
# Tables: hosp/prescriptions, hosp/emar, hosp/emar_detail
# 1. Standardize medication names (handle variations in naming)
# 2. Calculate medication duration from prescriptions
# 3. Join prescriptions with actual administration records (emar)
# 4. Identify compliance: prescribed vs administered medications
# %%
# Study Population
# hosp/diagnoses_icd, hosp/d_icd_diagnoses, hosp/labevents, icu.chartevents
# 1. Identify sepsis patients using ICD codes (ICD-10: A40-A41)
# - Patient meets criteria if ANY diagnosis matches
# 2. Add laboratory criteria (elevated WBC, lactate)
# - WBC > 12,000 OR < 4,000 cells/μL
# - Lactate >= 2.0 mmol/L
# - Calculate using worst values within 24h of admission
# - Patient meets criteria if >=1 lab is abnormal
# 3. Include vital sign criteria (temperature, heart rate)
# - Temperature (itemid: 223761): > 38°C OR < 36°C
# - Heart Rate (itemid: 220045): > 90 bpm
# - Respiratory Rate (itemid: 220210): > 20 breaths/min
# - Calculate using worst values within 48h of admission
# - Patient meets criteria if >=2 vital signs are abnormal
# 4. Export final cohort with inclusion/exclusion documentation
# - Include: Patients meeting ICD codes AND (>=1 abnormal labs OR >=2 vital signs)
# - Exclude: age < 18, length_of_stay < 24 hours
# - Summarize: total screened, excluded by each criterion, and included at each step

See what else you can find in the data. 🔎