Introduction to Python - Introduction to Pandas
Large sections of this were adopted from Analyzing structured data with Pandas by Steve Phelps.
Pandas Overview
- Pandas is object-oriented.
- We create data frames by constructing instances of different classes.
- The two most important classes are:
DataFrame
Series
- Pandas follows the Java convention of starting the name of classes with an upper-case letter, whereas instances are all lower-case.
- The pandas module is usually imported with the alias
pd
.
import pandas as pd
Pandas Series
- One-dimensional array
- Series can be like array, with standard integer index starting at 0
- Series can be dictionary like, with defined index
data = [1,2,3,4,5] #This creates a list
my_series = pd.Series(data) #array-like pandas series, index created automatically
my_series2 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e']) #dict like, index specified
print(my_series, '\n', my_series2)
my_series2['a']
Plotting a Series
-
We can plot a series by invoking the
plot()
method on an instance of aSeries
object. -
The x-axis will autimatically be labelled with the series index.
-
This is the first time we are invoking a
magic
command. Read more about them here.
%matplotlib inline
my_series.plot()
Creating a Series from a dict
d = {'a' : 0., 'b' : 1., 'c' : 2.}
my_series = pd.Series(d)
my_series
Indexing/Slicing a Series with []
or . notation
- Series can be accessed using the same syntax as arrays and dicts.
- We use the labels in the index to access each element.
- We can also use the label like an attribute
my_series.b
- We can specify a range with
my_series[['b', 'c']]
#Notice the different ways that the parts of the series are specified.
print( my_series['b'],'\n', my_series.b, '\n', my_series[['b', 'c']])
Functions on Series
- We can perform calculations using the entire series similar to numpy.
- Methods are called from within
np.Series
, for example np.Series.add - See a variety of series functions here
#These are just a variety of examples of operations on series.
starter = {'a' : 0., 'b' : 1., 'c' : 2.}
a = pd.Series(starter)
print('Print the entire array a: \n', a)
b1=10*a
print('Mulitiply by 10 directly:\n', b1)
b2=a.multiply(10)
print('Mulitiply by 10 using the function:\n', b2)
c1=a+b1
print('Add a and b together directly:\n', c1)
c2=pd.Series.add(a,b1) #Note we are calling the method of the series class. Numpy used us np.add
print('Add a and b together with a function:\n', c2)
suma=pd.Series.sum(a) #Note we are calling the method of the series class. Numpy used us np.add
print('sum all of a:\n', suma)
f=a**2 #This squares the value.
print('square a:\n', f)
x = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
y = pd.Series({'a' : 3., 'b' : 4., 'c' : 5.})
z = x+y
print('Add 2 series together:\n', z)
Time Series
- Time series models link specific times with rows.
dates = pd.date_range('1/1/2000', periods=5)
dates
time_series = pd.Series(data, index=dates)
time_series
Plot Time Series
- With a data and a value, the plot command can be used to provide quick visibility in the form of a line graph.
ax = time_series.plot()
type(time_series)
time_series
DataFrames
-
The
pandas
module provides a powerful data-structure called a data frame. - It is similar, but not identical to:
- a table in a relational database,
- an Excel spreadsheet,
- a dataframe in R.
-
A data frame has multiple columns, each of which can hold a different type of value.
- Like a series, it has an index which provides a label for each and every row.
Creating a DataFrame from Outside Data
- Data frames can be read and written to/from:
- database queries, database tables
- CSV files
- json files
- etc.
- Beware that data frames are memory resident;
- If you read a large amount of data your PC might crash
- With big data, typically you would read a subset or summary of the data via e.g. a select statement.
Creating a DataFrame from Python Data Structures
- Data frames can be constructed from other data structures in memory:
- dict of arrays,
- dict of lists,
- dict of dict
- dict of Series
- 2-dimensional array
- a single Series
- another DataFrame
Example: Creating a DataFrame from Multiple Series
- Pandas codes missing values as
NaN
rather thanNone
- Series should have matching keys for each matching row.
d = {
'x' :
pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'y' :
pd.Series([4., 6., 7.], index=['a', 'c', 'd']),
'z' :
pd.Series([0.2, 0.3, 0.4], index=[ 'b', 'c', 'd'])
}
df = pd.DataFrame(d)
print (df)
type(d)
Plotting DataFrames
-
When plotting a data frame, each column is plotted as its own series on the same graph.
-
The column names are used to label each series.
-
The row names (index) is used to label the x-axis.
ax = df.plot()
df
Functions and DataFrames
- We can do calculations and functions with dataframes just like series.
- Functions will typically return a dataframe or a series, depending.
- To make a copy, don’t set two dataframes equal us the
copy
method:df2= df.copy()
#Info
nulls=df.isnull()
print(nulls, "\n", type(nulls))
nullsum=nulls.sum()
print("\nNull sum for each column \n", nullsum, "\n", type(nullsum))
print("\nWe can slice these results to get the answer for x \n", nullsum.x)
type(nullsum.x)
df2= df.copy()
print(df, '\n', df2)
df2=df ** 2 #This squares all values.
print(df2)
Summary statistics
- To quickly obtain summary statistics on numerical values use the
describe
method. - You will get a warning if there are missing values.
- The result is itself a DataFrame, that we can slice
dfstats.y['mean']
.
dfstats=df.describe()
dfstats
#type(dfstats)
#END HERE.....
xmean = dfstats.x['mean'] #This is the X mean
ystd = dfstats['y']['std'] #This is the Y standardard deviation
print(xmean,'\n',ystd)
Data Types
- Each will have an inferred data type.
print(df.dtypes)
Accessing the Row and Column Labels
- The row labels (index) can be accessed through
df.index
. - The column labels can be accessed through
df.columns
.
df.index
df.columns
print(df.describe())
Loading Files with Pandas
- We used Pandas in an earlier notebook to load the iris data file.
- Whenver you have a dataset with a variety of fields of various types, loading it into Pandas is a good strategy.
- You can load data from Azure, from a local file, or from a url.
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/iris.csv
# Pulling from a local file
frame2 = pd.read_csv('iris.csv')
frame2
Large Dataframes - Head and Tail
- Many times you just want a sampling of the available data
- The
head()
command can view the start of a data frame. - The
tail()
command can be used to show the end of a data frame.
frame2.head()
frame2.tail()
# Pulling from a url. Notice that this is the raw version of the file.
frame3 = pd.read_csv("https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/iris.csv")
frame3.head()
Indexing/Slicing Rows of DataFrames
- Simple ways of selecting all rows and colu (
df[:]
) - Rows can be accessed via a key or a integer corresponding to the row number.
- Omitting a value generally means all values before or after an item.
- When we retrieve a single or mulitiple rows, the result is a Dataframe.
- Several ways, either directly, with
iloc
, or withloc
. (See Examples). - Read more here
#This is going to create some sample data that we can work with for our analysis.
import pandas as pd
import numpy as np
#Create a dataframe from a random numpy array
#http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html
df = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['a', 'b', 'c', 'd'] )
print (df)
df2 = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['e', 'f', 'g', 'h'] )
print (df2)
- Simple ways of selecting colum(s)
frame[[colname(s)]]
. - Columns can have one (
df['x']
) or multiple (df[['x', 'y']]
) columns. - When specifying one column, one can use simplified dot notation
df.x
. - When we include multiple columns the slice that result is a DataFrame.
- When we retrieve a single column, the result is a Series.
- When we retrieve mulitiple column, the result is a Dataframe.
#Here we can see that there is a similar structure to R, with selecting the desired columns by passing a list.
print (df[['c', 'd']]) #All rows, column c, d
print (df[[ 'c', 'd']]) #All rows, column c, d
print (df.iloc[:,[0,2,3]]) #All rows, column a,c,d
print (df.iloc[:,0:2]) #All rows, column a-b
print (df.iloc[:,[0,2,3]]) #All rows, column 0,2,3
print (df.loc[:,'a':'b']) #All rows, column a-b
print (df.loc[:,['a','c','d']]) #All rows, columns a, c, d
Dropping Columns from Dataframes
- Done using the
drop
syntax. - Drop Documentation
#Here, we can remove columns specifically from a dataframe using the drop method.
df2 = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['e', 'f', 'g', 'h'] )
print (df2)
df2.drop(['e','f'], inplace=True, axis=1)
print (df2)
Selecting Rows
- Similarly, we also might want to select out rows, and we can utilize the same syntax.
- iloc
## Selecting rows
print (df[0:3]) #Select rows 1-3
print (df.iloc[0:3,:]) #Select rows 1-3
print (df.iloc[0:3,]) #Select rows 1-3
print (df.iloc[0:3]) #Select rows 1-3
print (df.iloc[[1,2,4]]) #Select rows 1, 2, and 4
Intro to Filters (Logical Indexing)
- Filters are the selection of rows based on criteria.
- We can select based on specific criteria.
- These criteria can be connected together.
- Most of the time we won’t specfically assign selection critia to a list.
# At the foundation of the filter is a boolean array based on some type of condition.
print(df)
df['a'] >= 20
#notice how the logical statement is inside the dataframe specification. This creates an intermediate boolean array.
df[df['a'] >= 20]
#This is an alternate method where we first set the boolean array.
included=df['a'] >= 20
df[included]
#We can now generate a vector based on a critera and then use this for selection
select = df['a']>=20
print (select,type(select))
print (df.loc[select,'a']) #Notice by including only one variable we are selecting rows and all columns.
select2 = (df['a']>20) & (df['c'] < 30) #More complex criteria
print (select2)
print (df.loc[select2,['a','c']])
#Here we are creating a new variable based on the value of another variable.
df['aboveavg']=0 # We first set the default to 0.
df.loc[df['a']>=20,'aboveavg']=1 #We then change all rows where a is >=20 to 1.
print(df['aboveavg'])
Joining Dataframes
- Often you need to combine dataframe,
- either matching columns for the smae rows (column bind)
- Add rows for the same columns (row bind)
Stacking Dataframes Vertically
- Adds rows vertially with the
concat
function - The index is not automatically reset
- In R referred to as a row bind.
#This first generates 2 dataframes.
df = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['a', 'b', 'c', 'd'] )
df2 = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['a', 'b', 'c', 'd'] )
#This will stack the 2 dataframes vertically on top of one another
dfbyrow=pd.concat([df, df2]) #This is equivalent to a rowbind in R.
print (dfbyrow)
# View how the index here from df has been reset and incremented while in the earlier example the index was kept.
addition = df.append(df2)
print(addition )
addition2 = df.append(df, ignore_index=True)
print(addition2 )
Inner/Outer Joins Dataframes
- Adds rows vertially with the
concat
function - In R referred to as a column bind.
- Can do the equivalent of an inner and outer join.
#Merging additional columns also uses the concat function
#This is equavalent to an inner join in SQL.
df = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['a', 'b', 'c', 'd'] )
df4 = pd.DataFrame((20+np.random.randn(10, 4)*5), columns=['e', 'f', 'g', 'h'] )
dfbycolumns = pd.concat([df, df4], axis=1, join='inner')
dfbycolumns
#Here we are generating a small dataframe to be used in merging so you can see the differences in specifying inner & outer,
shortdf=df[0:5]
dfbycolumns = pd.concat([df, shortdf], axis=1, join='inner')
dfbycolumns
#Here, the outer does the equivalent of a left outer join for this dataset.
shortdf=df[0:5]
dfbycolumns = pd.concat([df, shortdf], axis=1, join='outer')
dfbycolumns
Credits
Large sections of this were adopted from Analysing structured data with Pandas by Steve Phelps. Thanks Steve!
Copyright AnalyticsDojo 2016 This work is licensed under the Creative Commons Attribution 4.0 International license agreement.