Best Practices

When using Python for academic research and data science, following established best practices ensures reproducible results and maintainable code. Here are some key guidelines to follow.

Structure your research projects with clear organization. Create a dedicated directory for each project with subdirectories:

projects
├── project-1
│   ├── .git
│   ├── docs
│   ├── results
│   ├── src
│   │   └── project_1
│   │       ├── __init__.py
│   │       └── my_code.py
│   ├── scripts
│   ├── tests
│   ├── .python-version
│   ├── pyproject.toml
│   └── README.md
├── project-2
├── project-3
└── ...

Use version control (Git) for each project and back up the repository to a Git server (GitHub, GitLab, or a bare repository in a remote folder).

Use uv (or another tool) to manage your dependencies in virtual environments.

Have a clear README.md and document your code thoroughly. Include docstrings for functions:

def process_dataset(df, threshold=0.05):
    """
    Clean and preprocess the input dataset.

    Parameters:
        df (pandas.DataFrame): Input dataset
        threshold (float): Significance threshold for outlier removal

    Returns:
        pandas.DataFrame: Processed dataset
    """

Type hints improve code readability, let static type checkers (such as mypy) catch potential errors early, and provide better IDE support:

def my_function(a: int, b: float) -> str:
    return f"{a} + {b} = {a + b}"
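
As a slightly larger illustration of how hints help beyond readability, here is a minimal sketch. The function `mean_score` and its parameters are hypothetical names chosen for this example. A static type checker would be expected to flag the commented-out call without running the code, because `digits` is annotated as `int`.

```python
from collections.abc import Sequence


def mean_score(scores: Sequence[float], digits: int = 2) -> float:
    """Return the arithmetic mean of scores, rounded to `digits` places."""
    if not scores:
        raise ValueError("scores must not be empty")
    return round(sum(scores) / len(scores), digits)


print(mean_score([0.5, 0.75, 1.0]))

# A type checker should report this call, since "2" is a str, not an int.
# At runtime it would fail with a TypeError inside round().
# mean_score([0.5, 0.75], digits="2")
```

Annotating the input as `Sequence[float]` rather than `list[float]` keeps the function usable with tuples and other sequence types.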

Use an automatic code formatter and linter, like ruff.

  • Document in what order the code needs to run
  • Consider using a command-runner tool (make/just) or a separate Python file to run all code in the correct order from a single entry point
  • Set random seeds in any code using random number generation
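
The last two points can be combined in one small file. The following is a minimal sketch, not a prescribed layout: the step functions `simulate_data` and `summarize`, the constant `SEED`, and its value are all hypothetical. It assumes each analysis step can be called as a function, and it uses only the standard-library `random` module.

```python
import random

SEED = 12345  # arbitrary fixed value (an assumption); record it with your results


def simulate_data(rng: random.Random, n: int = 5) -> list[float]:
    """Step 1: draw n values from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def summarize(data: list[float]) -> float:
    """Step 2: reduce the data to a single summary value (the mean)."""
    return sum(data) / len(data)


def main(seed: int = SEED) -> float:
    """Single entry point: runs every step in the documented order."""
    rng = random.Random(seed)  # dedicated, seeded generator instead of global state
    data = simulate_data(rng)
    return summarize(data)


if __name__ == "__main__":
    print(main())
```

Passing a seeded generator into each step, rather than relying on the global random state, means that rerunning `main()` with the same seed gives the same result. If the project uses NumPy, `numpy.random.default_rng(seed)` plays the same role.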