Sorting pandas DataFrame and Series objects (sort_index and sort_values)

A pandas DataFrame can be sorted in two ways: by its index labels, or by the values in one or more specified columns. These correspond to the sort_index and sort_values functions respectively.

1. Sorting by index label

1.1 Sorting by row index label
1.2 Sorting by column index label

2. Sorting by value
3. Sorting algorithms

Sorting by index label

Sorting by row labels or by column names is done with the sort_index() method. Its signature is:

def sort_index(axis=0, level=None, ascending=True, inplace=False, kind="quicksort", 
        na_position="last", sort_remaining=True, ignore_index=False,)

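axis: the axis to sort along. 0 sorts by the row index labels, 1 sorts by the column labels. The default, as in the signature above, is 0.
level: the index level(s) to sort on when the index is multi-level. The default is None, which sorts on the whole index.
ascending: whether to sort in ascending order. The default is True.
inplace: whether to modify the original object. The default is False, which leaves the original untouched and returns a new sorted object.
kind: the sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}. The default is 'quicksort'.
na_position: where missing values (NaN) are placed, either 'first' or 'last'. The default is 'last'.
sort_remaining: if True, and the index is multi-level and a level is given, the other levels are also sorted (in order) after sorting by the specified level.
ignore_index: if True, the original labels are discarded and the result is relabelled 0, 1, ..., n - 1. The default is False. This parameter was added in version 1.0.0.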
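The level, sort_remaining and ignore_index parameters are easiest to see on a multi-level index. The following is a minimal sketch; the two-level index and the name demo are made up for illustration and are not part of the examples in this article.

```python
import pandas as pd

# Hypothetical two-level index, invented for illustration only.
idx = pd.MultiIndex.from_tuples(
    [("b", 2), ("a", 2), ("b", 1), ("a", 1)], names=["outer", "inner"]
)
demo = pd.DataFrame({"val": [10, 20, 30, 40]}, index=idx)

# Sort on the "inner" level first; sort_remaining=True (the default)
# then orders ties by the remaining level, "outer".
by_inner = demo.sort_index(level="inner")
print(by_inner)

# ignore_index=True drops the sorted labels and relabels the rows 0..n-1.
relabelled = demo.sort_index(ignore_index=True)
print(relabelled)
```

In by_inner the rows with inner == 1 come first, and within each inner value the outer labels are in ascending order.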

The example below starts from a 10-row, 3-column DataFrame whose row and column labels are out of order:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 3), index=[1, 6, 5, 2, 9, 7, 3, 8, 0, 4], columns=['col2', 'col3', 'col1'])
print(df)

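Running the code prints something like the following (the values come from np.random.randn, so they differ on every run):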

       col2      col3      col1
1  1.334649  0.041156 -1.914002
6 -0.581180 -0.020164 -0.162772
5  1.066173 -0.566280 -1.109046
2  0.645073 -0.740513 -2.237406
9  0.766184 -0.409689 -0.164043
7  0.202764  0.691322  0.377901
3  0.470536 -1.321069 -0.650337
8  0.501809  0.467187  1.276347
0 -0.791858 -0.737015 -0.398602
4 -2.048735  1.281895 -0.649708

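Sorting by row index label

Sorting by row labels corresponds to axis=0. To make the sort order explicit, pass the ascending parameter, setting it to False for descending order. The inplace parameter controls whether the original object is modified.

The example below sorts the row index labels in descending order and modifies the original object: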

df.sort_index(axis=0, ascending=False, inplace=True)
print(df)

This prints:

       col2      col3      col1
9  0.766184 -0.409689 -0.164043
8  0.501809  0.467187  1.276347
7  0.202764  0.691322  0.377901
6 -0.581180 -0.020164 -0.162772
5  1.066173 -0.566280 -1.109046
4 -2.048735  1.281895 -0.649708
3  0.470536 -1.321069 -0.650337
2  0.645073 -0.740513 -2.237406
1  1.334649  0.041156 -1.914002
0 -0.791858 -0.737015 -0.398602
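The difference between inplace=False and inplace=True is worth checking directly. The sketch below uses a small made-up DataFrame named small (an assumption for illustration, separate from df above) to show that the default returns a new object, while inplace=True modifies the object and returns None.

```python
import pandas as pd

# Small hypothetical frame, used only to illustrate inplace.
small = pd.DataFrame({"x": [1, 2, 3]}, index=[2, 0, 1])

# Default (inplace=False): a new sorted object is returned.
sorted_copy = small.sort_index()
original_order = small.index.tolist()  # the original is still unsorted

# inplace=True: the object itself is reordered and the call returns None.
result = small.sort_index(inplace=True)

print(original_order)        # order of small before the in-place sort
print(sorted_copy.index.tolist())
print(result)
```

Because the in-place call returns None, writing small = small.sort_index(inplace=True) would replace the DataFrame with None, which is a common mistake.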

Sorting by column index label

To sort by column labels, set the axis parameter to 1. Append the following to the code above:

df.sort_index(axis=1, ascending=False, inplace=True)
print(df)

This prints:

       col3      col2      col1
9 -0.409689  0.766184 -0.164043
8  0.467187  0.501809  1.276347
7  0.691322  0.202764  0.377901
6 -0.020164 -0.581180 -0.162772
5 -0.566280  1.066173 -1.109046
4  1.281895 -2.048735 -0.649708
3 -1.321069  0.470536 -0.650337
2 -0.740513  0.645073 -2.237406
1  0.041156  1.334649 -1.914002
0 -0.737015 -0.791858 -0.398602
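Since the example above uses random values, a deterministic variant can make the column reordering easier to follow. This is a small sketch with a made-up DataFrame named table; the column names match the article, but the values are chosen only for illustration.

```python
import pandas as pd

# Fixed values so the column movement is easy to trace.
table = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["col2", "col3", "col1"])

# axis=1 sorts the column labels; the data moves with its column.
ascending_cols = table.sort_index(axis=1)
descending_cols = table.sort_index(axis=1, ascending=False)

print(ascending_cols)
print(descending_cols)
```

Only the order of the columns changes; each value stays attached to its original column label and row.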

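Sorting by value

Besides sorting by row or column labels, the operation used most often in practice is sorting by the values of specified columns with sort_values(). Unlike sort_index(), it takes a required by argument naming the columns (or index labels) to sort on, and it has no level or sort_remaining parameters.

def sort_values(by, axis=0, ascending=True, inplace=False, kind="quicksort", 
        na_position="last", ignore_index=False,)

The by parameter specifies the column names or labels to sort on. It accepts either a single str for one column or a list of str for several columns; with a list, rows are ordered by the first column, and later columns break ties.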

df = pd.DataFrame([[2, 4, 1, 5], [3, 1, 4, 5], [5, 1, 4, 3], [5, 1, 6, 2]], columns=['b', 'a', 'd', 'c'])
print(df)

The constructed DataFrame prints as:

   b  a  d  c
0  2  4  1  5
1  3  1  4  5
2  5  1  4  3
3  5  1  6  2

Sort in ascending order by column a and then column b, modifying the original object:

df.sort_values(by=['a', 'b'], inplace=True)
print(df)

The result prints as:

   b  a  d  c
1  3  1  4  5
2  5  1  4  3
3  5  1  6  2
0  2  4  1  5
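The ascending and na_position parameters also apply to sort_values(), and ascending may be a list with one entry per column in by. The sketch below assumes a made-up DataFrame named scores that contains a missing value; it is not taken from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in column "a".
scores = pd.DataFrame({"a": [1, 4, 1, np.nan], "b": [5, 2, 3, 9]})

# Column "a" ascending, ties broken by column "b" descending,
# and rows with NaN in a sort column placed first.
mixed = scores.sort_values(
    by=["a", "b"], ascending=[True, False], na_position="first"
)
print(mixed)
```

The row with NaN in a comes first, followed by the two a == 1 rows in descending b order, and finally the a == 4 row.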

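Sorting algorithms

Both sort_index() and sort_values() take a kind parameter that selects the sorting algorithm. The options are {'quicksort', 'mergesort', 'heapsort'}, meaning quicksort, two-way merge sort and heapsort. Of these three, only merge sort is stable, that is, it keeps rows with equal keys in their original relative order. For a DataFrame, kind is only applied when sorting on a single column or label.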
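Stability is easiest to see with duplicate keys. The following is a minimal sketch using a made-up DataFrame named people: with kind="mergesort", rows that share the same group value keep the order they had before sorting.

```python
import pandas as pd

# Hypothetical data: two rows per group, in a known starting order.
people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"], "group": [2, 1, 2, 1]}
)

# A stable sort keeps Bob before Dee and Ann before Cid,
# because that is their relative order in the input.
stable = people.sort_values(by="group", kind="mergesort")
print(stable)
```

With an unstable algorithm the relative order of rows inside each group is not guaranteed, so a stable kind is the safer choice when sorting in several passes.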
