Best Practices
When using Python for academic research and data science, following established best practices helps keep results reproducible and code maintainable. Here are some key guidelines to follow.
Structure your project
Structure your research projects with clear organization. Create a dedicated directory for each project with subdirectories:
    projects
    ├── project-1
    │   ├── .git
    │   ├── docs
    │   ├── results
    │   ├── src
    │   │   └── project-1
    │   │       ├── __init__.py
    │   │       └── my_code.py
    │   ├── scripts
    │   ├── tests
    │   ├── .python-version
    │   ├── pyproject.toml
    │   └── README.md
    ├── project-2
    ├── project-3
    └── ...
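If you want to script this layout rather than create it by hand, the following is a minimal sketch using only the standard library. The helper name `create_project` is hypothetical, the files are created empty, and `.git` is left for `git init` to create. Note that the project file is written as `pyproject.toml`, without a leading dot, since that is the name packaging tools look for.

```python
from pathlib import Path


def create_project(root: Path, name: str) -> Path:
    """Create the directory skeleton for one project (hypothetical helper)."""
    project = root / name
    package = project / "src" / name

    # Directories from the layout; .git is left to `git init`.
    directories = (
        project / "docs",
        project / "results",
        package,
        project / "scripts",
        project / "tests",
    )
    for directory in directories:
        directory.mkdir(parents=True, exist_ok=True)

    # Empty placeholder files; their real contents come from your tools.
    files = (
        package / "__init__.py",
        package / "my_code.py",
        project / ".python-version",
        project / "pyproject.toml",
        project / "README.md",
    )
    for file in files:
        file.touch(exist_ok=True)

    return project
```

Calling `create_project(Path("projects"), "project-1")` would create the `project-1` tree under `projects`. Tools such as uv can generate a similar skeleton for you, so treat this only as an illustration of the structure.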
Use version control
Use version control (Git) for each project and back up the repository to a Git server (GitHub/GitLab or a bare repository in a remote folder).
Use virtual environments
Use uv (or another tool) to manage your dependencies in virtual environments.
Have clear documentation
Have a clear README.md and document your code thoroughly. Include docstrings for functions:
    def process_dataset(df, threshold=0.05):
        """
        Clean and preprocess the input dataset.

        Parameters:
            df (pandas.DataFrame): Input dataset
            threshold (float): Significance threshold for outlier removal

        Returns:
            pandas.DataFrame: Processed dataset
        """
Use type hints
Type hints improve code readability, catch potential errors early, and provide better IDE support.
    def my_function(a: int, b: float) -> str:
        return f"{a} + {b} = {a + b}"
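Type hints also work for containers and optional arguments. The sketch below is an illustration only: the function `mean_above` and its parameters are hypothetical and not part of any library.

```python
from __future__ import annotations

from statistics import mean
from typing import Optional


def mean_above(values: list[float], threshold: Optional[float] = None) -> float:
    """Return the mean of values, optionally keeping only those above threshold."""
    # With no threshold, every value is kept.
    kept = values if threshold is None else [v for v in values if v > threshold]
    if not kept:
        raise ValueError("no values left after filtering")
    return mean(kept)


print(mean_above([1.0, 2.0, 3.0], threshold=1.5))  # prints 2.5
```

Python does not enforce these annotations at runtime. A static type checker such as mypy, or the one built into your IDE, reads them and can flag a call like `mean_above("abc")` before the code runs.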
Use tools to write better code
Use an automatic code formatter and linter, such as ruff.
Write reproducible code
- Document the order in which the code needs to run
- Consider using a command-runner tool (make/just) or a separate Python file to run all code in the correct order from a single entry point
- Set random seeds in any code using random number generation
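The last two points can be combined in a single Python file. The following is a minimal sketch, assuming a two-step analysis. The step functions `simulate_data` and `summarize`, and the `SEED` value, are hypothetical placeholders for your own code.

```python
from __future__ import annotations

import random

SEED = 42  # hypothetical fixed seed; any constant works


def simulate_data(n: int) -> list[float]:
    """Step 1: draw n pseudo-random samples (stands in for data generation)."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def summarize(samples: list[float]) -> float:
    """Step 2: reduce the samples to a single summary statistic."""
    return sum(samples) / len(samples)


def main() -> float:
    """Run every step in order from one entry point."""
    random.seed(SEED)  # seed first so every run draws the same numbers
    samples = simulate_data(100)
    return summarize(samples)


if __name__ == "__main__":
    print(f"mean = {main():.4f}")
```

Because the seed is set at the start of `main`, repeated runs give the same result. This only seeds Python's built-in `random` module. NumPy and other libraries keep their own random state, which would need to be seeded separately.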