Using awk to count how many times each value in a given column occurs (an aggregation), sorted by value in ascending order

How can awk be used to count how many times each value occurs in a given column, and then sort the result by that value?

Recommended approach

First aggregate the counts with awk, then sort the result with sort. For example:

Given the following file:

[demo@Linux rank_13]$ cat train_sort.libsvm
0 qid 1 0:246.0 1:1.0 2:43.0 3:0.029358 4:440.0 5:17.0 6:9.05037 7:2.0 8:1.0 9:13.0 10:1.0 12:0.0 13:17.344337 15:1.0 16:17.0 17:2.0 18:1.0 20:1.0 21:6.0 22:0.0 23:0.029358 24:15.0 25:1.0 26:0.105263 27:0.0 29:0.0 30:17.0 31:11.0 32:16.0 33:1.0 34:0.032879 35:0.03736 37:0.023256 39:2.0 40:1.0 41:1.0 42:1.0 44:0.117647 46:0.0 47:0.0 48:19.0 50:0.032879 51:0.0 52:0.117647 53:0.133333 54:2.0 56:0.133333 59:17.0 61:1.0 64:0.0 65:2.0 67:1.0 68:0.0 116:1.0 119:1.0 120:1.0 122:1.0 125:15.0 143:2.0 144:0.0 145:0.0 146:0.0 147:2.0 148:150.58685 185:0.032879 186:0.032879 187:2.0 188:0.10845941 189:15.456458 190:6.0 191:19.0 231:0.117647 232:1.0 236:0.0 237:978.0 238:0.0 239:0.117647 240:0.105263 242:1.0 243:0.03736 285:1.0 1156:1.0 11850:1.0 11852:1.0
0 qid 1 0:97.0 1:1.0 2:12.0 3:0.0 4:237.0 5:2.0 7:0.0 8:1.0 9:19.0 10:1.0 12:3.5 13:14.581183 14:1.0 15:1.0 16:2.0 17:0.0 18:1.0 20:1.0 21:2.0 22:0.0 23:0.0 24:0.0 25:1.0 26:0.0 27:0.0 29:0.0 30:0.0 31:218.0 32:66.0 33:1.0 34:0.0 35:0.01 37:3.583333 39:0.0 40:1.0 41:1.0 42:1.0 44:0.0 46:0.0 47:0.0 48:2.0 50:0.01 51:0.0 52:0.01 53:0.01 54:0.0 56:0.01 59:0.0 61:1.0 62:1.0 64:0.0 65:0.0 67:1.0 68:0.0 116:1.0 119:1.0 120:1.0 122:1.0 125:0.0 143:0.0 144:42.0 145:0.0 146:0.0 147:0.0 148:127.19119 185:0.01 186:0.0 187:0.0 188:0.08285783 189:16.477295 190:4.0 191:2.0 231:0.01 232:1.0 234:1.0 236:0.0 237:384.0 238:0.0 239:0.0 240:0.0 241:1.0 242:43.0 243:0.01
0 qid 2 0:135.0 1:1.0 2:1009.0 3:0.025679 4:15707.0 5:6.0 6:8.1113205 7:1.0 9:11.0 10:1.0 12:0.0 13:8.23656 16:6.0 17:0.0 18:1.0 20:1.0 21:0.0 22:0.0 23:0.025679 24:1.0 26:0.142857 27:0.0 29:0.0 30:1.0 31:16.0 32:14.0 34:0.030053 35:0.0 37:0.003964 39:1.0 41:1.0 42:2.0 44:0.166667 46:0.0 47:0.0 48:7.0 50:0.0 51:0.0 52:0.0 53:0.0 54:1.0 55:3.007975 56:0.0 59:1.0 61:1.0 64:0.0 65:0.0 68:0.0 72:1.0 84:1.0 99:1.0 120:1.0 122:1.0 125:1.0 143:0.0 144:0.0 145:0.0 146:0.0 147:1.0 148:74.17553 180:1.0 185:0.0 186:0.030053 187:0.0 188:0.099465005 189:4.371778 190:22.0 191:7.0 209:1.0 213:1.0 214:1.0 215:1.0 231:0.0 236:0.0 237:6991.0 238:0.0 239:0.166667 240:0.142857 242:4.0 243:0.0 274:1.0 285:1.0 287:1.0 8436:1.0 28769:1.0 28770:1.0
0 qid 2 0:163.0 1:1.0 2:15.0 3:0.036223 4:228.0 5:4.0 6:5.558958 7:1.0 8:1.0 9:27.0 10:1.0 12:0.0 13:5.742188 16:4.0 17:0.0 18:1.0 20:1.0 21:1.0 22:0.0 23:0.036223 24:1.0 26:0.2 27:0.0 29:0.0 30:1.0 31:1.0 32:1.0 34:0.045586 35:0.0 37:0.066667 39:1.0 41:1.0 42:2.0 44:0.25 46:0.0 47:0.0 48:5.0 50:0.0 51:0.0 52:0.0 53:0.0 54:1.0 56:0.0 59:1.0 61:1.0 64:0.0 65:0.0 68:0.0 72:1.0 84:1.0 99:1.0 120:1.0 122:1.0 125:1.0 143:0.0 144:0.0 145:0.0 146:0.0 147:1.0 148:51.712105 180:1.0 185:0.0 186:0.045586 187:0.0 188:0.07833605 190:1.0 191:5.0 209:1.0 213:1.0 214:1.0 215:1.0 231:0.0 236:0.0 237:234.0 238:0.0 239:0.25 240:0.2 242:1.0 243:0.0 274:1.0 285:1.0 287:1.0 7679:1.0 23900:1.0 23901:1.0
0 qid 2 0:205.0 1:1.0 2:53.0 3:0.02222 4:570.0 5:21.0 6:7.957614 7:2.0 8:1.0 9:10.0 10:1.0 12:0.0 13:8.08174 16:21.0 17:0.0 18:1.0 20:1.0 21:1.0 22:0.0 23:0.02222 24:3.0 26:0.08 27:0.0 29:0.0 30:4.0 31:1.0 32:1.0 34:0.026518 35:0.0 37:0.018868 39:2.0 41:1.0 42:2.0 44:0.095238 46:0.0 47:0.0 48:25.0 50:0.0 51:0.0 52:0.0 53:0.0 54:2.0 55:3.6075416 56:0.0 59:4.0 61:1.0 64:0.0 65:0.0 68:0.0 72:1.0 84:1.0 99:1.0 120:1.0 122:1.0 125:3.0 143:0.0 144:0.0 145:0.0 146:0.0 147:2.0 148:72.77203 180:1.0 185:0.0 186:0.026518 187:0.0 188:0.10550207 189:4.452115 190:5.0 191:25.0 209:1.0 213:1.0 214:1.0 215:1.0 231:0.0 236:0.0 237:698.0 238:0.0 239:0.095238 240:0.08 242:1.0 243:0.0 274:1.0 285:1.0 287:1.0 7852:1.0 24463:1.0 24464:1.0

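Count how many times each number following qid occurs (it is the third whitespace-separated field on each line), then sort the counts by that number in ascending order: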

[demo@Linux rank_13]$ awk '{a[$3]++}END{for(i in a)print i"\t"a[i]}' train_sort.libsvm | sort -t $'\t' -k1,1n

The output is:

1	2
2	3
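
The same idea can be wrapped in a small reusable function. The sketch below is only an illustration: the function name count_by_column and the shortened sample file are assumptions made here, not part of the original example. It also shows why the sort step is needed: awk does not guarantee any particular order when iterating over an array with for (v in count), so the ascending order has to come from sort.

```shell
# Hypothetical helper (name is an assumption): count how often each value
# appears in column $1 of file $2, then sort numerically by that value.
count_by_column() {
    col="$1"
    file="$2"
    # count[] is keyed by the column value; the END block prints value<TAB>count.
    awk -v col="$col" '
        { count[$col]++ }
        END { for (v in count) printf "%s\t%d\n", v, count[v] }
    ' "$file" |
        # Iteration order of "for (v in count)" is unspecified, so sort explicitly
        # on the first tab-separated field, numerically.
        sort -t "$(printf '\t')" -k1,1n
}

# Shortened stand-in for train_sort.libsvm with the same leading layout:
# label, the literal "qid", the query id, then features.
sample="$(mktemp)"
printf '%s\n' \
    '0 qid 1 0:246.0 1:1.0' \
    '0 qid 1 0:97.0 1:1.0' \
    '0 qid 2 0:135.0 1:1.0' \
    '0 qid 2 0:163.0 1:1.0' \
    '0 qid 2 0:205.0 1:1.0' > "$sample"

count_by_column 3 "$sample"
```

Passing the column number with -v keeps the awk program itself unchanged when a different column needs to be counted, for example count_by_column 1 "$sample" to count the labels in the first column.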