Python Pandas Quick Introduction

We will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, but for coders in particular. We can divide pandas in such features:

  • Introduction to Pandas
  • Series
  • DataFrames
  • Missing Data
  • GroupBy
  • Merging,Joining,and Concatenating
  • Operations
  • Data Input and Output

Series

The first main data type we will learn about for pandas is the Series data type. Let’s import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object.

As in the other structures of python axis = 0 is row, axis=1 is column

Though it is unnecessary to mention it to get the habit I wrote the import statements:

import numpy as np
import pandas as pd

Creating a Series

You can convert a list, numpy array, or dictionary to a Series, we need just pass the structure our data as a parameter to Series such as data={list, dict, array}:

labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using Lists

pd.Series(data=my_list)
0    10
1    20
2    30
dtype: int64
pd.Series(data=my_list,index=labels)
a    10
b    20
c    30
dtype: int64
pd.Series(my_list,labels)
a    10
b    20
c    30
dtype: int64

NumPy Arrays

pd.Series(arr)
0    10
1    20
2    30
dtype: int64
pd.Series(arr,labels)
a    10
b    20
c    30
dtype: int64

Dictionary

We can also use dictionaries as a label-data holder.

pd.Series(d)
a    10
b    20
c    30
dtype: int64

Data in a Series

A pandas Series can hold a variety of object types:

pd.Series(data=labels)
0    a
1    b
2    c
dtype: object

Even functions

pd.Series([sum,print,len])
0    
1    
2    
dtype: object

Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).

Let’s see some examples of how to grab information from a Series. Let us create two sereis, ser1 and ser2:

ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
ser1['USA']
1

Operations are then also done based off of index:

ser1 + ser2
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64

DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. DataFrames are a fundemantal tool for pandas so it will be very helpful to understand and get familiar with them.

np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),
                             columns='W X Y Z'.split())
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Selection and Indexing

df['W']
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

Pass a list of column names

df[['W','Z']]
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509

Or you can directly get column name with dot notaiton;  however this is not recommended in big applications.

df.W
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

DataFrame Columns are just Series

type(df['W']) --> pandas.core.series.Series

Creating a new column:

df['new'] = df['W'] + df['Y']
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762

Removing Columns

df.drop('new',axis=1)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
Not inplace unless specified!(this does not affect the original dataframe)
df
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
df.drop('new',axis=1,inplace=True)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Can also drop rows this way:

df.drop('E',axis=0)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057

Selecting Rows

df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label

df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Selecting subset of rows and columns

df.loc['B','Y']
-0.84807698340363147
df.loc[['A','B'],['W','Y']]
W Y
A 2.706850 0.907969
B 0.651118 -0.848077

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df>0
W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
df[df>0]
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]['Y']
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757

For two conditions you can use | and & with parenthesis:

df[(df['W']>0) & (df['Y'] > 1)]
W X Y Z
E 0.190794 1.978757 2.605967 0.683509

More Index Details

Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!

W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Reset to default 0,1…n index

df.reset_index()
index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States')
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States',inplace=True)
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509

Multi-Index and Index Hierarchy

Let us go over how to work with Multi-Index, first we’ll create a quick example of what a Multi-Indexed DataFrame would look like:

outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
A B
G1 1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
G2 1 -1.549671 0.435253
2 1.259904 -0.447898
3 0.266207 0.412580

Now let’s show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:

df.loc['G1']
A B
1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
df.loc['G1'].loc[1]
A    0.153661
B    0.167638
Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names = ['Group','Num']
A B
Group Num
G1 1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
G2 1 -1.549671 0.435253
2 1.259904 -0.447898
3 0.266207 0.412580
df.xs('G1')
A B
Num
1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
df.xs(['G1',1])
A    0.153661
B    0.167638
Name: (G1, 1), dtype: float64
df.xs(1,level='Num')
A B
Group
G1 0.153661 0.167638
G2 -1.549671 0.435253

Handling Missing Data

Let’s show a few convenient methods to deal with Missing Data in pandas:

df = pd.DataFrame({'A':[1,2,np.nan],
                   'B':[5,np.nan,np.nan],
                   'C':[1,2,3]})
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.dropna() # drop rows that has any NaN value
A B C
0 1.0 5.0 1
df.dropna(axis=1)
C
0 1
1 2
2 3
df.dropna(thresh=2)
A B C
0 1.0 5.0 1
1 2.0 NaN 2
df.fillna(value='FILL VALUE')
A B C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

Groupby

The groupby method allows you to group rows of data together and call aggregate functions

Create dataframe

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
        'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
        'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350

Now you can use the .groupby() method to group rows together based off of a column name. For instance let’s group based off of Company. This will create a DataFrameGroupBy object:

df.groupby('Company')

You can save this object as a new variable:

by_comp = df.groupby("Company")

And then call aggregate methods off the object:

by_comp.mean()
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
df.groupby('Company').mean()
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0

More examples of aggregate methods:

by_comp.std()
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
by_comp.min()
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
by_comp.max()
Person Sales
Company
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340
by_comp.count()
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
by_comp.describe()
Sales
Company
FB count 2.000000
mean 296.500000
std 75.660426
min 243.000000
25% 269.750000
50% 296.500000
75% 323.250000
max 350.000000
GOOG count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
MSFT count 2.000000
mean 232.000000
std 152.735065
min 124.000000
25% 178.000000
50% 232.000000
75% 286.000000
max 340.000000
by_comp.describe().transpose()
Company FB GOOG MSFT
count mean std min 25% 50% 75% max count mean 75% max count mean std min 25% 50% 75% max
Sales 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0 2.0 160.0 180.0 200.0 2.0 232.0 152.735065 124.0 178.0 232.0 286.0 340.0

1 rows × 24 columns

by_comp.describe().transpose()['GOOG']
count mean std min 25% 50% 75% max
Sales 2.0 160.0 56.568542 120.0 140.0 160.0 180.0 200.0

Merging, Joining, and Concatenating

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.


Example DataFrames

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
df2
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
df3
A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:

pd.concat([df1,df2,df3])
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
pd.concat([df1,df2,df3],axis=1)
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11

Example DataFrames

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})

 

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
right
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3

Merging

The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:

pd.merge(left,right,how='inner',on='key')
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3

Or to show a more complicated example:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
pd.merge(left, right, how='outer', on=['key1', 'key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
pd.merge(left, right, how='right', on=['key1', 'key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
pd.merge(left, right, how='left', on=['key1', 'key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN

Joining

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
left.join(right, how='outer')
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D

Operations

There are lots of operations with pandas that will be really useful to you, but don’t fall into any distinct category. Let’s show them here in this lecture:

df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Info on Unique Values

df['col2'].unique()
array([444, 555, 666])
df['col2'].nunique()
3
df['col2'].value_counts()
444    2
555    1
666    1
Name: col2, dtype: int64

Selecting Data

Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
col1 col2 col3
3 4 444 xyz

Applying Functions

def times2(x):
    return x*2
df['col1'].apply(times2)
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
df['col3'].apply(len)
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
df['col1'].sum()
10

Permanently Removing a Column

del df['col1']
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Get column and index names:

df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame:

col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
df.sort_values(by='col2') #inplace=False by default
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi

Find Null Values or Check for Null Values

df.isnull()
col2 col3
0 False False
1 False False
2 False False
3 False False
 Drop rows with NaN Values
df.dropna()
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Filling in NaN values with something else:

df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
df.fillna('FILL')
col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz
data = {'A':['foo','foo','foo','bar','bar','bar'], 
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN

Data Input and Output

This notebook is the reference code for getting input and output, pandas can read a variety of file types using its pd.read_ methods. Let’s take a look at the most common data types:

CSV

CSV Input

df = pd.read_csv('example')
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

CSV Output

df.to_csv('example',index=False)

Excel

Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or images, having images or macros may cause this read_excel method to crash.

Excel Input

pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1')
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

Excel Output

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

You may need to install htmllib5,lxml, and BeautifulSoup4. In your terminal/command prompt run:

conda install lxml
conda install html5lib
conda install BeautifulSoup4

Then restart Jupyter Notebook. (or use pip install if you aren’t using the Anaconda Distribution)

Pandas can read table tabs off of html. For example:

HTML Input

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 First CornerStone Bank King of Prussia PA 35312 First-Citizens Bank & Trust Company May 6, 2016 July 12, 2016 none NaN NaN
1 Trust Company Bank Memphis TN 9956 The Bank of Fayette County April 29, 2016 August 4, 2016 none NaN NaN
2 North Milwaukee State Bank Milwaukee WI 20364 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016 none NaN NaN
3 Hometown National Bank Longview WA 35156 Twin City Bank October 2, 2015 April 13, 2016 none NaN NaN
4 The Bank of Georgia Peachtree City GA 35259 Fidelity Bank October 2, 2015 April 13, 2016 none NaN NaN
Reklamlar

Bir Cevap Yazın

Aşağıya bilgilerinizi girin veya oturum açmak için bir simgeye tıklayın:

WordPress.com Logosu

WordPress.com hesabınızı kullanarak yorum yapıyorsunuz. Çıkış  Yap / Değiştir )

Twitter resmi

Twitter hesabınızı kullanarak yorum yapıyorsunuz. Çıkış  Yap / Değiştir )

Facebook fotoğrafı

Facebook hesabınızı kullanarak yorum yapıyorsunuz. Çıkış  Yap / Değiştir )

Google+ fotoğrafı

Google+ hesabınızı kullanarak yorum yapıyorsunuz. Çıkış  Yap / Değiştir )

Connecting to %s