Best Practices
Doing data analysis with Python is pretty straight forward, but there are some best practices to keep in mind.
Virtual Environments
Virtual environments are a way to separate the Python executable and packages between projects and the default version available on some operating systems. This is important as there might be breaking changes between versions of Python and packages, and upgrading these for one project might break another if not handled properly.
Virtual environments create a separate development environment for each project, with its own specified Python and package versions.
Common tools for working with virtual environments are pipenv
and conda
, where conda
and its optimized relative mamba
are widely used in data science.
conda
is more popular for data science is that many of the Python packages used here are actually written in C or some language that needs to be compiled for the specific platform that is used.
This is especially inconvenient on Windows and conda
solves this by having a big library of prebuilt binaries ready for use.Containers
Containers are the next step up from virtual environments in that even more of the development environment is sealed of and reproducible.
The most popular way to work with containers is through Docker
.
Read more here: https://towardsdatascience.com/docker-for-data-science-7927fb3c70d3
Code Style
Use an automatic code formatter, like black
.
Static Typing
Although not required, static typing can reduce errors and help with debugging.
x: int = 1y: int = 2def my_function(a: int, b: int) -> int: return a + bz: int = my_function(x, y)