Data Management
The main tool for dealing with tabular data in Python is with the Pandas DataFrame
.
import pandas as pd
df = pd.DataFrame(
dict(
x=[1, 2, 3, 4, 5],
y=[10, 20, 30, 40, 50],
)
)
Which creates a data structure like this.
x | y | |
---|---|---|
0 | 1 | 10 |
1 | 2 | 20 |
2 | 3 | 30 |
3 | 4 | 40 |
4 | 5 | 50 |
Every DataFrame
has a number of columns (in this example x, y), and rows (0-4).
Data from Files
Pandas can read data from many file formats (although, some might need extra packages to be installed).
df_csv = pd.read_csv("path/to/data.csv")
df_excel = pd.read_excel("path/to/data.excel")
df_sas = pd.read_sas("path/to/data.sas7bdat")
df_stata = pd.read_stata("path/to/data.dta")
df_parquet = pd.read_parquet("path/to/data.parquet")
General Information
There are a few methods that let us get more information about a DataFrame or a Series (column).
df.info()
df.describe()
df.max()
df.min()
df.mean()
df.std()
df.median()
df.quantile()
df.head()
df.tail()
Indexing
A DataFrame
can be indexed on both columns and rows.
# Indexing by column labels
df["x"] # Returns the x columns
df["y"] # Returns the y columns
# Indexing by column and row labels
df.loc[2, "x"] # Returns the 3 in cell with labels [2, "x"]
# Indexing by column and row positions
df.iloc[3, 1] # Returns the 40 in cell with position [3, 1]
New Columns
df.assign(
a = 1, # Same value for all rows
b = lambda x: x["x"] ** 2 # Values based on other column
)
Sorting
df.sort_values("x")
df.sort_index()
Merges and Joins
pd.merge(df_1, df_2, on="shared_key_col", how="inner")
# "inner", "outer", "left", "right"
df_1.join(df_2, on="shared_key_col") # Equal to left merge
Group Aggregations
df.groupby("group_col").max() # Maximum values withon groups
# Different aggregations for different columns
df.groupby("group_col").agg(
{
"x": ["min", "max"],
"y": ["sum"],
}
)
Counting Values
df["column"].value_counts()
df["column"].value_counts(normalize=True)
Plotting
df["column"].hist()
df["column"].plot.bar()