Best Practices

When using Python for academic research and data science, following established best practices ensures reproducible results and maintainable code. Here are some key guidelines to follow.

Structure your project

Give each research project its own directory, with separate subdirectories for source code, scripts, tests, documentation, and results:

projects
├── project-1
│   ├── .git
│   ├── docs
│   ├── results
│   ├── src
│   │   └── project_1
│   │       ├── __init__.py
│   │       └── my_code.py
│   ├── scripts
│   ├── tests
│   ├── .python-version
│   ├── pyproject.toml
│   └── README.md
├── project-2
├── project-3
└── ...
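
The layout above can also be created programmatically. The following is a minimal sketch, not a standard tool: `create_project_skeleton` is a hypothetical helper, and it assumes the import package is named after the project with hyphens replaced by underscores, since Python package names must be valid identifiers.

```python
from pathlib import Path


def create_project_skeleton(root: Path, name: str) -> Path:
    """Create an empty project layout under ``root`` and return its path."""
    project = root / name
    # Import packages must be valid identifiers, so hyphens become underscores.
    package = project / "src" / name.replace("-", "_")

    # Create the package directory and the top-level subdirectories.
    for directory in (
        package,
        project / "docs",
        project / "results",
        project / "scripts",
        project / "tests",
    ):
        directory.mkdir(parents=True, exist_ok=True)

    # touch() creates empty placeholder files; existing content is not changed.
    for file in (
        package / "__init__.py",
        project / "README.md",
        project / "pyproject.toml",
    ):
        file.touch()

    return project
```

Calling `create_project_skeleton(Path("projects"), "project-1")` would produce the directories and placeholder files shown in the tree. The `.git` directory is not created here; it appears when you run `git init` in the project.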

Use version control

Use version control (Git) for each project and back up the repository to a Git server (GitHub/GitLab or a bare repository in a remote folder).

Use virtual environments

Use uv (or another tool) to manage your dependencies in virtual environments.
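
It is easy to run a script with the wrong interpreter by accident. As a small sketch, the check below reports whether the current interpreter belongs to a virtual environment. `in_virtual_environment` is a hypothetical helper name; the comparison of `sys.prefix` and `sys.base_prefix` is the approach described in the standard library's `venv` documentation.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the Python installation it was built from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"Interpreter: {sys.executable}")
    print(f"Virtual environment active: {in_virtual_environment()}")
```

Running this at the start of an analysis script makes it obvious in the logs which interpreter produced the results.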

Have clear documentation

Have a clear README.md and document your code thoroughly. Include docstrings for functions:

def process_dataset(df, threshold=0.05):
    """
    Clean and preprocess the input dataset.

    Parameters:
        df (pandas.DataFrame): Input dataset
        threshold (float): Significance threshold for outlier removal

    Returns:
        pandas.DataFrame: Processed dataset
    """

Use type hints

Type hints improve code readability, help static type checkers catch potential errors early, and provide better IDE support.

def my_function(a: int, b: float) -> str:
    return f"{a} + {b} = {a + b}"
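
To illustrate what "catching errors early" means in practice, here is a short sketch using a hypothetical `scale` function. It assumes Python 3.9 or later for the `list[float]` syntax. Note that Python itself does not enforce type hints at runtime; the benefit comes from a static type checker or an IDE reading them before the code runs.

```python
from typing import get_type_hints


def scale(values: list[float], factor: float = 1.0) -> list[float]:
    """Multiply every value by ``factor``."""
    return [value * factor for value in values]


# Type hints are stored as metadata on the function and can be inspected.
print(get_type_hints(scale))

# A static type checker would flag the call below, because "2" is a str and
# not a float. Without a checker, the mistake only surfaces as a TypeError
# once the multiplication inside the function runs.
# scale([1.0, 2.0], factor="2")
```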

Use tools to write better code

Use an automatic code formatter and linter, like ruff.

Write reproducible code

Document the order in which the code needs to run.
Consider using a command-runner tool (make/just) or a separate Python file to run all code in the correct order from a single entry point.
Set random seeds in any code that uses random number generation.
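
The guidelines above can be combined in one entry-point script. The sketch below is only an illustration: `simulate` and `summarize` are hypothetical stand-ins for real analysis steps, and the seed value is arbitrary. It uses only the standard library's `random` module.

```python
import random


def simulate(n: int, rng: random.Random) -> list[float]:
    """Step 1 (hypothetical): draw n pseudo-random samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def summarize(samples: list[float]) -> float:
    """Step 2 (hypothetical): reduce the samples to their mean."""
    return sum(samples) / len(samples)


def main(seed: int = 42) -> float:
    """Run every step in order, from one entry point, with a fixed seed."""
    # A seeded generator passed explicitly to each step keeps the run
    # reproducible without relying on hidden global state.
    rng = random.Random(seed)
    samples = simulate(100, rng)
    return summarize(samples)


if __name__ == "__main__":
    print(main())
```

Running the script twice with the same seed yields the same result. Libraries with their own random number generators need to be seeded separately, for example through `numpy.random.default_rng(seed)` when using NumPy.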