pandas学习(一)

导入 pandas、numpy、matplotlib

1
2
3
4
5
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import matplotlib.pyplot as plt

创造对象

Series 是一个值的序列,它只有一个列,以及索引。下面的例子中,就用默认的整数索引

1
2
3
4
5
6
7
8
9
10
11
In [4]: s = pd.Series([1,3,5,np.nan,6,8])

In [5]: s
Out[5]:
0 1
1 3
2 5
3 NaN
4 6
5 8
dtype: float64

DataFrame 是有多个列的数据表,每个列拥有一个 label,当然,DataFrame 也有索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
#index 行名,columns,列名。都需要提供一个列表对象

In [9]: df
Out[9]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988

如果参数是一个 dict,每个 dict 的 value 会被转化成一个 Series

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
In [10]: df2 = pd.DataFrame({ 'A' : 1.,
....: 'B' : pd.Timestamp('20130102'),
....: 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
....: 'D' : np.array([3] * 4,dtype='int32'),
....: 'E' : pd.Categorical(["test","train","test","train"]),
....: 'F' : 'foo' })
....:

In [11]: df2
Out[11]:
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo

每列的格式用 dtypes 查看

1
2
3
4
5
6
7
8
9
In [12]: df2.dtypes
Out[12]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object

你可以认为,DataFrame 是由 Series 组成的

1
2
3
4
5
6
7
In [13]: df2.A
Out[13]:
0 1
1 1
2 1
3 1
Name: A, dtype: float64

查看数据

用 head 和 tail 查看顶端和底端的几列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
In [14]: df.head()
Out[14]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401

In [15]: df.tail(3)
Out[15]:
A B C D
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988

实际上,DataFrame 内部用 numpy 格式存储数据。你也可以单独查看 index 和 columns

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
[ 1.2121, -0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949, 1.0718],
[ 0.7216, -0.7068, -1.0396, 0.2719],
[-0.425 , 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784, 0.525 ]])

describe() 显示数据的概要。

1
2
3
4
5
6
7
8
9
10
11
In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.073711 -0.431125 -0.687758 -0.233103
std 0.843157 0.922818 0.779887 0.973118
min -0.861849 -2.104569 -1.509059 -1.135632
25% -0.611510 -0.600794 -1.368714 -1.076610
50% 0.022070 -0.228039 -0.767252 -0.386188
75% 0.658444 0.041933 -0.034326 0.461706
max 1.212112 0.567020 0.276232 1.071804

和 numpy 一样,可以方便的得到转置

1
2
3
4
5
6
7
In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690
B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648
C -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427
D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988

对 axis 按照 index 排序(axis=1 是指第二个维度,即:列,axis=0 是指第一个维度,即:行)

1
2
3
4
5
6
7
8
9
In [21]: df.sort_index(axis=1, ascending=False) #ascending=False 默认为True,升序
Out[21]:
D C B A
2013-01-01 -1.135632 -1.509059 -0.282863 0.469112
2013-01-02 -1.044236 0.119209 -0.173215 1.212112
2013-01-03 1.071804 -0.494929 -2.104569 -0.861849
2013-01-04 0.271860 -1.039575 -0.706771 0.721555
2013-01-05 -1.087401 0.276232 0.567020 -0.424972
2013-01-06 0.524988 -1.478427 0.113648 -0.673690

按值排序

1
2
3
4
5
6
7
8
9
In [22]: df.sort_values(by='B') #也可以是by=['A','B'],按2列排序
Out[22]:
A B C D
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
2013-01-05 -0.424972 0.567020 0.276232 -1.087401

选择

注意,以下这些对交互式环境很友好,但是作为 production code 请用优化过的 .at, .iat, .loc, .iloc.ix

获取行/列

从 DataFrame 选择一个列,就得到了 Series

1
2
3
4
5
6
7
8
9
In [23]: df['A']
Out[23]:
2013-01-01 0.469112
2013-01-02 1.212112
2013-01-03 -0.861849
2013-01-04 0.721555
2013-01-05 -0.424972
2013-01-06 -0.673690
Freq: D, Name: A, dtype: float64

和 numpy 类似,这里也能用 []选择行

1
2
3
4
5
6
7
8
9
10
11
12
13
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804

In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860

通过 label 选择

还可以多选

1
2
3
4
5
6
7
8
9
In [27]: df.loc[:,['A','B']] #所有行,AB列
Out[27]:
A B
2013-01-01 0.469112 -0.282863
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020
2013-01-06 -0.673690 0.113648

注意那个冒号,用法和 MATLAB 或 NumPy 是一样的!所以也可以这样

1
2
3
4
5
6
In [28]: df.loc['20130102':'20130104',['A','B']] 
Out[28]:
A B
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
1
2
3
4
5
In [29]: df.loc['20130102',['A','B']]
Out[29]:
A 1.212112
B -0.173215
Name: 2013-01-02 00:00:00, dtype: float64

如果对所有的维度都写了标量,不就是选出一个元素吗?

如果对所有的维度都写了标量,不就是选出一个元素吗?

1
2
In [30]: df.loc[dates[0],'A']
Out[30]: 0.46911229990718628

这种情况通常用 at ,速度更快

1
2
In [31]: df.at[dates[0],'A']
Out[31]: 0.46911229990718628

通过整数下标选择

这个就和数组类似啦,直接看例子。选出第3行:(注意有第0行)

1
2
3
4
5
6
7
In [32]: df.iloc[3]
Out[32]:
A 0.721555
B -0.706771
C -1.039575
D 0.271860
Name: 2013-01-04 00:00:00, dtype: float64

选出3~4行,0~1列:

1
2
3
4
5
In [33]: df.iloc[3:5,0:2] #注意 3:5 是第3行到第4行,尾数在python中是不选择的(有第0行)
Out[33]:
A B
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020

也能用 list 选择

1
2
3
4
5
6
In [34]: df.iloc[[1,2,4],[0,2]]
Out[34]:
A C
2013-01-02 1.212112 0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972 0.276232

对应单个元素

1
2
3
4
In [37]: df.iloc[1,1]
Out[37]: -0.17321464905330858
In [38]: df.iat[1,1]
Out[38]: -0.17321464905330858

总结:df.icol 是按下标选择,df.col是按标签选择。

通过布尔值下标

基本用法

1
2
3
4
5
6
In [39]: df[df.A > 0]
Out[39]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860

多条件

1
2
3
4
5
In [39]: df[(df.A > 0)&(df.A <1)]
Out[39]:
A B         C         D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-04 0.721555 -0.706771 -1.039575 0.271860

不满足条件的填充为: NaN

1
2
3
4
5
6
7
8
9
In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988

isin() 函数:是否在集合中

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one','two','three','four','three']

In [43]: df2
Out[43]:
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three

In [44]: df2[df2['E'].isin(['two','four'])]
Out[44]:
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four

contains()函数:是否包含特定字符

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
In [44]:tips
Out[44]:
total_bill tip sex smoker day time size

0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

In [45]: tipsoutput = tips[tips.sex.str.contains('Fe*')]
Out[45]:
total_bill tip sex smoker day time size

0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

删除数据

丢弃某条轴上的一个或多个项很简单,只要有一个索引数组或列表即可。由于需要执行一些数据整理和集合逻辑,所以drop方法返回的是一个在指定轴上删除了指定值的新对象:

1
2
3
4
5
6
7
8
9
10
11
In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])

In [111]: data
Out[111]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

用标签序列调用drop会从行标签(axis 0)删除值:

1
2
3
4
5
In [112]: data.drop(['Colorado', 'Ohio'])
Out[112]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15

通过传递axis=1或axis=’columns’可以删除列的值:

1
2
3
4
5
6
7
In [113]: data.drop('two', axis=1)
Out[113]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15

读取、写入数据

CSV

写入

1
In [136]: df.to_csv('foo.csv')

读取

1
2
3
4
5
6
7
8
In [137]: pd.read_csv('foo.csv')
In [137]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=None)
# 自定义表头,当name设定的时候,header必须显式None,因为默认为0
foo bar baz
0 a b c
1 1 2 3
2 4 5 6
3 7 8 9

table

读取

1
2
data = pd.read_table('example.txt',sep='\t',header=0) 
# 指定分隔符和表头,默认header=0,把第一行作为表头。分割符默认制表符'/t','/s+',可以匹配任何空格。

excel

写入

1
In [140]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

读取

1
2
In [141]: pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
#index_col 行名,na_values 缺失值的形式