Python

Python is a high-level, general-purpose programming language used in many different contexts. It has seen increasing use in data science and today rivals R, STATA, SAS, and similar tools in popularity. It is especially common in machine learning and bioinformatics.

Python is an interpreted language, in contrast to compiled languages such as Java or C. This means that there is no separate step in which the programmer has to compile the program before running it: the Python interpreter executes the code directly.

Language / Interpreter / Editor

Many people without a programming background, coming from more focused statistical tools such as SAS, STATA, and even RStudio (developed by Posit), have a hard time distinguishing between the programming language, the interpreter, and the editor. These are distinct entities, but they are often packaged together in statistical software and integrated development environments (IDEs), which blurs the lines between them.

  • Programming Language is the notation used to write a computer program, and it is often formalized in a specification document. This is the human-readable text that we write to instruct the computer to do things. Examples are Python, R, SAS, STATA, C(++), Fortran, Haskell, etc.
  • Interpreter is a piece of software that implements a programming language specification by translating the human-readable code into machine-readable code that can be executed. Programming languages that do not use an interpreter generally need to be compiled.
  • Editor is the software that we use to write our code files. Since code files are just text, this can be any text editor (e.g. MS Notepad). Editors designed specifically for coding, however, provide additional features such as syntax highlighting and keyword suggestions. If the software has a lot of extra functionality apart from the code editor, such as debugging and source control integration, it is called an integrated development environment.
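
The translation step described for the interpreter can be made visible with a minimal sketch. It assumes the standard CPython interpreter, where the "machine-readable code" is bytecode for Python's own virtual machine rather than native processor instructions. The variable names and the "<example>" label are only illustrative.

```python
import dis

# Human-readable code is just text, here held in an ordinary string.
source_text = "total = 2 + 3"

# compile() translates the text into a code object containing bytecode.
code_object = compile(source_text, filename="<example>", mode="exec")

# List the low-level instructions the interpreter will execute.
# The exact instruction names differ between Python versions.
instruction_names = [instr.opname for instr in dis.get_instructions(code_object)]
print(instruction_names)

# exec() asks the interpreter to run the code object in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])  # 5
```

In everyday use none of this is written by hand. The interpreter performs the same translation automatically whenever a Python file is run.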

We write Python, the programming language, in a text editor, and we run the code by sending it to the Python interpreter. In VSCode (an IDE), this is made simple by the integration of the code editor and a connection to the interpreter (among other things), so that we can write and run our Python code without leaving the IDE.
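
As a rough illustration of this separation, the sketch below plays both roles in one script: it saves some text to a file, as an editor would, and then hands that file to the Python interpreter, roughly what an IDE does behind its run button. The file name hello.py and the function name run_with_interpreter are made up for the example, and the details of how any particular IDE invokes the interpreter are not shown here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The "program" is nothing more than text, as it would be typed in any editor.
source_text = 'print("Hello from the interpreter")\n'


def run_with_interpreter(code: str) -> str:
    """Save the code to a .py file and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as folder:
        script = Path(folder) / "hello.py"  # hypothetical file name
        script.write_text(code, encoding="utf-8")
        # sys.executable is the path of the interpreter running this script.
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout


output = run_with_interpreter(source_text)
print(output, end="")
```

From a terminal, the equivalent is typing python followed by the path to the file (the command may be python3 or py, depending on the installation).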

Python for Data Science

Since Python is a general-purpose programming language, we need to leverage the power of several Python packages to efficiently use it for data-science tasks. The core packages in the data science stack are:

  • NumPy - For general scientific computing and linear algebra
  • Pandas - For manipulation of DataFrames
  • Matplotlib - For plotting and making figures

See more in the guide to the data science stack and the package reference section.
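
A small sketch of two of the packages listed above gives a feel for how they are used. It assumes NumPy and pandas are installed. The numbers and column names are invented for illustration, and Matplotlib is left out so that the example runs without a display.

```python
import numpy as np
import pandas as pd

# NumPy: array-based numerics and linear algebra.
matrix = np.array([[2.0, 0.0], [0.0, 4.0]])
vector = np.array([1.0, 1.0])
solution = np.linalg.solve(matrix, vector)  # solves matrix @ x = vector
print(solution)

# pandas: labelled, tabular data in a DataFrame.
frame = pd.DataFrame(
    {
        "group": ["a", "a", "b"],  # made-up example data
        "value": [1.0, 3.0, 5.0],
    }
)
group_means = frame.groupby("group")["value"].mean()
print(group_means)
```

The import aliases np and pd are community conventions rather than requirements.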

Other Statistical Software

Comparisons to other tools used for data science are provided in the Pandas Documentation.

R

Programming language for statistical computing and graphics. Rich package ecosystem to extend functionality.

STATA

A statistical software suite developed by StataCorp.

SAS

Originally an abbreviation of Statistical Analysis System. A statistical software suite developed by the SAS Institute.

SQL

Structured Query Language. A domain-specific language designed for managing relational data.
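
SQL is often used from within Python rather than as an alternative to it. The sketch below assumes the sqlite3 module from Python's standard library and an in-memory database. The table name measurements and its contents are made up for the example.

```python
import sqlite3

# Create a throwaway relational database that lives only in memory.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurements (subject TEXT, value REAL)")
connection.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],  # made-up example rows
)

# The query itself is SQL. Python only sends it and collects the result.
rows = connection.execute(
    "SELECT subject, AVG(value) FROM measurements "
    "GROUP BY subject ORDER BY subject"
).fetchall()
connection.close()

print(rows)  # [('a', 2.0), ('b', 4.0)]
```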

Spreadsheets

Microsoft Excel, Google Sheets, LibreOffice Calc, Apple Numbers, etc.