Data Management
The main tool for dealing with tabular data in Python is with the Pandas DataFrame
.
import pandas as pddf = pd.DataFrame( dict( x=[1, 2, 3, 4, 5], y=[10, 20, 30, 40, 50], ))
Which creates a data structure like this.
x | y | |
---|---|---|
0 | 1 | 10 |
1 | 2 | 20 |
2 | 3 | 30 |
3 | 4 | 40 |
4 | 5 | 50 |
Every DataFrame
has a number of columns (in this example x, y), and rows (0-4).
Data from Files
Pandas can read data from many file formats (although, some might need extra packages to be installed).
df_csv = pd.read_csv("path/to/data.csv")df_excel = pd.read_excel("path/to/data.excel")df_sas = pd.read_sas("path/to/data.sas7bdat")df_stata = pd.read_stata("path/to/data.dta")df_parquet = pd.read_parquet("path/to/data.parquet")
General Information
There are a few methods that let us get more information about a DataFrame or a Series (column).
df.info()df.describe()df.max()df.min()df.mean()df.std()df.median()df.quantile()df.head()df.tail()
Indexing
A DataFrame
can be indexed on both columns and rows.
# Indexing by column labelsdf["x"] # Returns the x columnsdf["y"] # Returns the y columns# Indexing by column and row labelsdf.loc[2, "x"] # Returns the 3 in cell with labels [2, "x"]# Indexing by column and row positionsdf.iloc[3, 1] # Returns the 40 in cell with position [3, 1]
New Columns
df.assign( a = 1, # Same value for all rows b = lambda x: x["x"] ** 2 # Values based on other column)
Sorting
df.sort_values("x")df.sort_index()
Merges and Joins
pd.merge(df_1, df_2, on="shared_key_col", how="inner")# "inner", "outer", "left", "right"df_1.join(df_2, on="shared_key_col") # Equal to left merge
Group Aggregations
df.groupby("group_col").max() # Maximum values withon groups# Different aggregations for different columnsdf.groupby("group_col").agg( { "x": ["min", "max"], "y": ["sum"], })
Counting Values
df["column"].value_counts()df["column"].value_counts(normalize=True)
Plotting
df["column"].hist()df["column"].plot.bar()