Data Management

The main tool for dealing with tabular data in Python is with the Pandas DataFrame.

import pandas as pddf = pd.DataFrame(    dict(        x=[1, 2, 3, 4, 5],        y=[10, 20, 30, 40, 50],    ))

Which creates a data structure like this.

xy
0110
1220
2330
3440
4550

Every DataFrame has a number of columns (in this example x, y), and rows (0-4).

Data from Files

Pandas can read data from many file formats (although, some might need extra packages to be installed).

df_csv = pd.read_csv("path/to/data.csv")df_excel = pd.read_excel("path/to/data.excel")df_sas = pd.read_sas("path/to/data.sas7bdat")df_stata = pd.read_stata("path/to/data.dta")df_parquet = pd.read_parquet("path/to/data.parquet")

General Information

There are a few methods that let us get more information about a DataFrame or a Series (column).

df.info()df.describe()df.max()df.min()df.mean()df.std()df.median()df.quantile()df.head()df.tail()

Indexing

A DataFrame can be indexed on both columns and rows.

# Indexing by column labelsdf["x"]  # Returns the x columnsdf["y"]  # Returns the y columns# Indexing by column and row labelsdf.loc[2, "x"]  # Returns the 3 in cell with labels [2, "x"]# Indexing by column and row positionsdf.iloc[3, 1]  # Returns the 40 in cell with position [3, 1]

New Columns

df.assign(    a = 1,  # Same value for all rows    b = lambda x: x["x"] ** 2  # Values based on other column)

Sorting

df.sort_values("x")df.sort_index()

Merges and Joins

pd.merge(df_1, df_2, on="shared_key_col", how="inner")# "inner", "outer", "left", "right"df_1.join(df_2, on="shared_key_col")  # Equal to left merge

Group Aggregations

df.groupby("group_col").max()  # Maximum values withon groups# Different aggregations for different columnsdf.groupby("group_col").agg(    {        "x": ["min", "max"],        "y": ["sum"],    })

Counting Values

df["column"].value_counts()df["column"].value_counts(normalize=True)

Plotting

df["column"].hist()df["column"].plot.bar()