Introduction

Python is a high-level, general-purpose programming language that is used in many different contexts. It has seen increasing use in “data science” and is today rivalling R, STATA, SAS, and the like in popularity. It is especially common in the fields of machine learning and bioinformatics.

Python is an interpreted language, in contrast to compiled languages like Java or C. This means that there is no separate step where the coder needs to compile (i.e. build) their program before it can run; the Python interpreter executes the code directly.
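
As a minimal sketch of what this means in practice, the function below (its name `run_steps` and the placeholder `undefined_name` are made up for this illustration) shows that statements are executed in order, and that a mistake on a later line only surfaces when the interpreter actually reaches that line, rather than in a build step beforehand:

```python
def run_steps():
    """Run two steps in order and record what happened at each one."""
    results = []

    # Step 1 executes normally as soon as the interpreter reaches it.
    results.append("step 1 ran")

    # Step 2 refers to a name that was never defined. No build step
    # catches this in advance; the error appears only at run time.
    try:
        undefined_name  # hypothetical name, deliberately left undefined
    except NameError as exc:
        results.append(f"step 2 failed at run time: {type(exc).__name__}")

    return results


if __name__ == "__main__":
    for line in run_steps():
        print(line)
```

The first step completes before the problem in the second step is ever noticed, which is typical of interpreted execution.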

Language / Interpreter / Editor

Many people without a programming background, coming from more focused statistical tools such as SAS, STATA, and even RStudio, have a hard time distinguishing between the programming language, the interpreter, and the editor. These are distinct entities, but they are often packaged together in statistical software and integrated development environments (IDEs), blurring the lines between them.

A programming language is a notation used to write a computer program, and it is often formalized in a specification document. This is the human-readable text that we write to instruct the computer to do things. Examples are Python, R, SAS, STATA, C/C++, Fortran, Haskell, etc.
An interpreter is a piece of software built to implement a programming language specification by translating the human-readable code into machine-readable code that can be executed (understood and acted on by the computer) one block at a time. Programming languages not using an interpreter generally need to be compiled (translated in their entirety) before they can be used.
An editor is a piece of software that we use to write our code and save it as files. Since code is just text, this can be any text editor (e.g. MS Notepad). However, editors specifically designed for coding provide additional features to make coding easier and/or more efficient, such as syntax highlighting, linting, and keyword suggestions. If the software has a lot of extra functionality apart from just the code editor, such as debugging and source control integration, it is often called an integrated development environment (IDE).

We write Python (the programming language) in a text/code editor (or IDE) and we run the code by sending it to the Python interpreter that gives us back the results. In VSCode (an IDE), this is made simple by the integration of the code editor and a connection to the interpreter (among other things), so that we can write and run our Python code within one unified environment.
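
The separation between the three can be made concrete with a small sketch. Here the "editor" step is nothing more than saving text to a file, and the "interpreter" step is handing that file to the Python executable. The helper name `run_source` and the file name `hello.py` are assumptions made for this illustration; an IDE does something comparable behind the scenes, though not necessarily in this way.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_source(source: str) -> str:
    """Save source text to a .py file and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        # "Editor" step: code is just text written to a file.
        script = Path(tmp) / "hello.py"
        script.write_text(source, encoding="utf-8")

        # "Interpreter" step: pass the file to the Python executable
        # and collect whatever the program prints.
        completed = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            check=True,
        )
    return completed.stdout


if __name__ == "__main__":
    print(run_source("print(1 + 2)\n"), end="")
```

The language is what appears inside the string, the file could have been written in any text editor, and `sys.executable` points at the interpreter that does the actual work.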

Why Programming Matters in Research

In research, there is a pervasive notion that programming is just a means to an end; a hurdle to get over, located somewhere between the hypothesis and the results. However, writing good code is not just about making things work; it is about making research reproducible, scalable, and trustworthy.

When we are conducting data science-type research, the code is our experimental setup. Just as a physicist will carefully test and calibrate their lab equipment and follow rigorous protocols, we as data scientists need to write clean, well-documented code; bad code leads to bad science.

Science requires reproducibility, and for us that means that others should be able to run our code and get the same results. This requires more than just working code. Good programming practices like version control, modular design, testing, and comprehensive documentation are not bureaucratic overhead, but the foundation of credible research.

Understanding programming deeply also means knowing how to leverage your tools to greatest effect. The difference between amateur and professional code can be the difference between waiting days for results versus getting them in minutes or seconds. This enables us to iterate faster, test more hypotheses, and ultimately generate more value.

The best data scientists are professionals who understand that the code is the research. Therefore, we should treat our code with the same respect as we would any other scientific instrument.

For more on this topic, check out this YouTube video by Richard McElreath: Science as Amateur Software Development (2023 edition)

Why Python?

Python has become a dominant language for data science by combining accessibility with great versatility. Python reads almost like English, letting beginners start meaningful work immediately. Through its versatility, Python can handle the entire data pipeline, from data management to statistical analysis to machine learning and everything in between, while other more specialized tools are often more limited in scope. Being free and open-source with a massive community spanning academia and industry, Python offers more resources, tutorials, and learning opportunities than most other alternatives.

However, Python’s generalist nature can sometimes be a limitation, and not all statistical methods that exist in other software have been implemented in Python. In most situations, Python’s versatility outweighs its limitations, and when something is not implemented, it is easy to take a slight detour into R or another tool for that specific task.

As with all research tools, Python is only as strong as the functionality implemented in its libraries, and we owe the creators of these libraries huge thanks.

Python for Data Science

Since Python is a general-purpose programming language, we need to leverage the power of several Python libraries to efficiently use it for data-science/research tasks. The core packages in the data science stack are:

NumPy - General scientific computing and linear algebra
SciPy - Optimization, integration, interpolation, statistics, and more
Pandas - Working with tabular data (older, well established)
Polars - Working with tabular data (newer)
Matplotlib - Plotting and making figures
Statsmodels - Estimation of statistical models and tests
Lifelines - Time-to-event and survival analysis
scikit-learn - Predictive data analysis