Examination

This examination consists of 4 questions (1 point each) and 4 tasks (24 points each), for a total of 100 points. To pass, you need a score of 65 points or more.

Answer the questions and solve all tasks in the same Python file named python_course_examination.py. Provide answers to the questions as comments at the top of the file. Since grading is done anonymously, please do not write your name anywhere in the file. When you are done, upload the file to Canvas. Structure the file as follows:

# Questions
# 1. ...
# 2. ...
# 3. ...
# 4. ...

# %%
# Task 1

...

# %%
# Task 2

...

# %%
# Task 3

...

# %%
# Task 4

...

The deadline is Sunday 5/10 24:00, but the exam should not take more than a few hours to get through. Since the code has to be able to run on our computers for us to grade it, please make sure to read the data directly from the URLs provided in the tasks (i.e. put the web address instead of the local file path when reading the data).

Questions

What is the purpose of using virtual environments when using Python for your projects?
What is the command for starting a new project with uv?
What is the command for installing a package with uv (e.g. polars)?
What are the main packages we have learnt about in this course for:
- Data management?
- Visualization?
- Statistical modelling?

Task 1 - Python Basics

Write a simple Python program with one function that goes through a list of exam scores and counts how many students passed, and a second function that calculates the mean score. Finally, print a nicely formatted message containing the results from both functions.

Create a function called analyze_scores(scores, passing_score) that takes two parameters:
- scores: a list of student scores
- passing_score: the minimum score required to pass
The function should:
- Use a for loop (or list comprehension) to iterate through the list of scores
- Count how many students passed (score >= passing_score)
- Return the count
Create a function called get_mean_score(scores) that takes one parameter:
- scores: a list of student scores
The function should:
- Calculate the mean score using the built-in sum() and len() functions
- Return the mean
Test your functions with the following data:
- student_scores = [85, 92, 78, 65, 88, 73, 95, 82, 70, 68]
- passing_score = 75 (only for analyze_scores)
Use an f-string to print the results in the format:
- “k out of n students passed the exam (p%), mean score was m”
- k is the count of students passing
- n is the total number of students
- p is the proportion of students passing in percent with one decimal
- m is the mean score with one decimal
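The steps above could be sketched roughly as follows. This is one possible layout, not the only acceptable one. The variable name message is an assumption for illustration, and the short names k, n, p and m simply mirror the placeholders in the requested format.

```python
def analyze_scores(scores, passing_score):
    """Count how many scores are greater than or equal to passing_score."""
    passed = 0
    for score in scores:
        if score >= passing_score:
            passed += 1
    return passed


def get_mean_score(scores):
    """Return the arithmetic mean of a non-empty list of scores."""
    return sum(scores) / len(scores)


student_scores = [85, 92, 78, 65, 88, 73, 95, 82, 70, 68]
passing_score = 75

k = analyze_scores(student_scores, passing_score)  # number of students passing
n = len(student_scores)                            # total number of students
p = 100 * k / n                                    # share passing, in percent
m = get_mean_score(student_scores)                 # mean score

# ":.1f" formats a number with exactly one decimal.
message = f"{k} out of {n} students passed the exam ({p:.1f}%), mean score was {m:.1f}"
print(message)
```

The format specifier after the colon inside the braces controls the rounding, so the underlying values do not need to be rounded before printing.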

Task 2 - Data Management

Import the data in the admissions and diagnoses_icd CSV files using the URLs:
- admissions: https://pythondatascience.dev/data/hosp/admissions.csv.gz
- diagnoses_icd: https://pythondatascience.dev/data/hosp/diagnoses_icd.csv.gz
Filter the diagnoses in diagnoses_icd to keep only ICD version 10
Create a new column in admissions for the duration of the hospital stay in days
Join the two data sets on hadm_id, choosing an appropriate join type and the correct validation
Identify the 10 most common ICD codes in the joined data
Filter the joined data to keep only the 10 most common ICD codes identified
Calculate the mean and standard deviation for the duration of the hospital stay, stratified (grouped) by ICD code

Display the stratified mean and standard deviation nicely rounded to one decimal:

┌──────────┬────────────┐
│ icd_code ┆ mean_std   │
╞══════════╪════════════╡
│ code     ┆ xxx (xxx)  │
│ ...      ┆ ...        │
└──────────┴────────────┘

Task 3 - Visualization

Import the data in the patients CSV file using the URL:
- https://pythondatascience.dev/data/hosp/patients.csv.gz
Join the patients data with the admissions data from the previous task (including the new duration of stay column)
Get the year of admission and calculate the age at admission as: age_at_admission = anchor_age + (admission_year - anchor_year)
Plot a scatter plot with age at admission on the x-axis and duration of stay on the y-axis
Change the color and alpha of the markers
Set the x- and y-axis limits and labels
Give the plot a title
Set the location of the major x- and y-ticks
Add a grid underneath the plot
Save the figure as figure-1.svg

Task 4 - Modelling / Analysis

Using the data from the previous task:
Estimate the association between age at admission (independent variable/exposure) and duration of stay (dependent variable/outcome) using ordinary least squares (OLS) regression with the statsmodels formula interface
Show the summary of the fitted model
Plot the regression line over the scatter plot from the previous task. You can use:
- x = np.linspace(10, 100, 2) from the numpy package, for the x values of the line
- The linear combination of the coefficients (from the fitted model) and x, for y values of the line
Save the updated figure as figure-2.svg
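The modelling steps could be sketched along these lines, assuming statsmodels, pandas, numpy and matplotlib. The data here are simulated purely for illustration and do not come from the course files. The column names stay_days and age_at_admission and the variable names are assumptions of this sketch.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated toy data (hypothetical), only to make the sketch runnable.
rng = np.random.default_rng(0)
age = rng.uniform(18, 95, size=50)
stay = 1.0 + 0.05 * age + rng.normal(0, 1, size=50)
toy = pd.DataFrame({"age_at_admission": age, "stay_days": stay})

# Formula interface: outcome on the left of "~", exposure on the right.
model = smf.ols("stay_days ~ age_at_admission", data=toy).fit()
print(model.summary())

# Two x values are enough to draw a straight line.
x = np.linspace(10, 100, 2)
# Linear combination of the fitted coefficients and x.
y = model.params["Intercept"] + model.params["age_at_admission"] * x

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy["age_at_admission"], toy["stay_days"], color="tab:blue", alpha=0.4)
ax.plot(x, y, color="tab:red")  # regression line on top of the scatter
ax.set_xlabel("Age at admission (years)")
ax.set_ylabel("Duration of stay (days)")

fig.savefig("figure-2.svg")
```

In the exam itself the line would be added to the axes of the existing scatter plot from Task 3 rather than to a fresh figure. A new figure is created here only so that the sketch is self-contained.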