Best Practices

Doing data analysis with Python is fairly straightforward, but there are some best practices to keep in mind.

Virtual Environments

Virtual environments separate the Python interpreter and installed packages of one project from those of other projects and from the default Python that ships with some operating systems. This matters because new versions of Python and of packages can introduce breaking changes, and upgrading them for one project can break another if the environments are not kept isolated.

Virtual environments give each project its own development environment, with its own pinned Python and package versions. Common tools for managing virtual environments are pipenv and conda; conda and its faster reimplementation mamba are widely used in data science.

One reason conda is popular in data science is that many of the packages used there are written partly in C or other languages that must be compiled for the specific platform they run on. Compiling these yourself is especially inconvenient on Windows, and conda solves this by providing a large library of prebuilt binaries ready for use.
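
A quick way to confirm that you are actually working inside the environment you think you are is to inspect the running interpreter from Python itself. The snippet below is a minimal sketch: the sys.prefix/sys.base_prefix comparison detects venv-style environments, and conda sets the CONDA_DEFAULT_ENV variable when an environment is activated.

import os
import sys

# Path of the interpreter that is running this code.
print("Interpreter:", sys.executable)

# In a venv/virtualenv-style environment, sys.prefix points at the
# environment while sys.base_prefix points at the base installation.
if sys.prefix != sys.base_prefix:
    print("Running inside a virtual environment:", sys.prefix)

# Conda sets CONDA_DEFAULT_ENV to the name of the active environment.
conda_env = os.environ.get("CONDA_DEFAULT_ENV")
if conda_env is not None:
    print("Active conda environment:", conda_env)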

Containers

Containers are the next step up from virtual environments: even more of the development environment, including the operating system and system libraries, is sealed off and made reproducible. The most popular tool for working with containers is Docker.

Read more here: https://towardsdatascience.com/docker-for-data-science-7927fb3c70d3

Code Style

Use an automatic code formatter, like black.
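
As a rough illustration of what this buys you, the sketch below shows a small function before and after formatting. Running black on the file (for example black analysis.py, where the filename is just a placeholder) rewrites it in place; the exact output can vary slightly between black versions.

# Before formatting: inconsistent spacing, quotes, and layout.
import pandas as pd
def load( path,sep = ';' ):
    return pd.read_csv(path,sep = sep )

# After running black, the same code looks roughly like this:
import pandas as pd


def load(path, sep=";"):
    return pd.read_csv(path, sep=sep)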

Static Typing

Although not required, static type annotations can reduce errors and help with debugging:

x: int = 1
y: int = 2

def my_function(a: int, b: int) -> int:
    return a + b

z: int = my_function(x, y)
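
Note that Python does not enforce these annotations at runtime; to actually catch type errors, run a static type checker such as mypy over the code (for example mypy script.py, where the filename is again just a placeholder).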