Data Management

The main tool for dealing with tabular data in Python is with the Pandas DataFrame.

import pandas as pd

df = pd.DataFrame(
    dict(
        x=[1, 2, 3, 4, 5],
        y=[10, 20, 30, 40, 50],
    )
)

Which creates a data structure like this.

xy
0110
1220
2330
3440
4550

Every DataFrame has a number of columns (in this example x, y), and rows (0-4).

Data from Files

Pandas can read data from many file formats (although, some might need extra packages to be installed).

df_csv = pd.read_csv("path/to/data.csv")
df_excel = pd.read_excel("path/to/data.excel")
df_sas = pd.read_sas("path/to/data.sas7bdat")
df_stata = pd.read_stata("path/to/data.dta")
df_parquet = pd.read_parquet("path/to/data.parquet")

General Information

There are a few methods that let us get more information about a DataFrame or a Series (column).

df.info()
df.describe()

df.max()
df.min()

df.mean()
df.std()

df.median()
df.quantile()

df.head()
df.tail()

Indexing

A DataFrame can be indexed on both columns and rows.

# Indexing by column labels
df["x"]  # Returns the x columns
df["y"]  # Returns the y columns

# Indexing by column and row labels
df.loc[2, "x"]  # Returns the 3 in cell with labels [2, "x"]

# Indexing by column and row positions
df.iloc[3, 1]  # Returns the 40 in cell with position [3, 1]

New Columns

df.assign(
    a = 1,  # Same value for all rows
    b = lambda x: x["x"] ** 2  # Values based on other column
)

Sorting

df.sort_values("x")
df.sort_index()

Merges and Joins

pd.merge(df_1, df_2, on="shared_key_col", how="inner")
# "inner", "outer", "left", "right"

df_1.join(df_2, on="shared_key_col")  # Equal to left merge

Group Aggregations

df.groupby("group_col").max()  # Maximum values withon groups

# Different aggregations for different columns
df.groupby("group_col").agg(
    {
        "x": ["min", "max"],
        "y": ["sum"],
    }
)

Counting Values

df["column"].value_counts()
df["column"].value_counts(normalize=True)

Plotting

df["column"].hist()
df["column"].plot.bar()