数字识别
比赛说明
- MNIST("修改后的国家标准与技术研究所")是计算机视觉的事实上的 "hello world" 数据集。自1999年发布以来,手写图像的经典数据集已成为基准分类算法的基础。随着新机器学习技术的出现,MNIST 仍然是研究人员和学习者的可靠资源。
- 在本次比赛中,您的目标是正确 识别数以万计手写图像的数字 。我们策划了一套教程式的内核,涵盖从回归到神经网络的一切。我们鼓励您尝试使用不同的算法来学习第一手什么是有效的,以及技术如何比较。
注意:项目规范
成员角色
角色 | 用户 | 内容 | 代码 |
---|---|---|---|
负责: knn | 诺木人 | knn项目文档 | knn项目代码 |
负责: svm | 马小穆 | svm项目文档 | svm项目代码 |
负责: 随机森林 | 平淡的天 | 随机森林项目文档 | 随机森林项目代码 |
负责: 神经网络 | 平淡的天 | 神经网络项目文档 | 神经网络项目代码 |
负责: cnn | == | cnn项目文档 | cnn项目代码 |
数字识别 第一期(2018-04-18)
组长 | 组员 | 组员 | 组员 | 组员 | 组员 | 组员 |
---|---|---|---|---|---|---|
限定心态 | strengthen | VS53MV | 不会修电脑 | 远心 | 小耀哥_0011 | 丨 |
数字识别 第二期(2018-04-21)
组长 | 组员 | 组员 | 组员 | 组员 | 组员 |
---|---|---|---|---|---|
凌少skier | Blue | Max | 考拉 | Happyorg | 过客 |
数字识别 第三期(2018-05-03)
负责人 | 组员 | 组员 |
---|---|---|
技术负责人-诺木人 辅助负责人-平淡的心 辅助负责人-张凯 活动发起人-片刻 |
ifeng draw Faith ggggggggo 嘿!漆漆 kickfilper Lucien Chen L~Q~W 琴剑蓝天 時間dāń漠 歲寒✅已认证 給力小青年 星尘 |
瑛瑛wang 有一个人很酷 静水流深 ♡稳稳的幸福 Verestràsz vslyu :) 菠菜 QQ小冰 浅紫色 R ROOT |
数字识别 第四期(2018-05-08)
负责人 | 组员 | 组员 | 组员 |
---|---|---|---|
技术负责人-诺木人 辅助负责人-BrianCai 辅助负责人-嘿!漆漆 |
兰博归来 柳生 ZARD Forever 你别理我我没网 666 黄蛟 冬冬 荼蘼 烁今 |
简雨 B0lt1st nickine dying in the sun 王琪琪 常想一二 以朱代墨 Mang0 TonyZERO |
冰花小子 阿铮 zh哲 小菜鸡 电酱prpr 琉璃火 张假飞 HAN Shuai 有人@我 天儿 Jaybo |
开发流程
- 分类问题:0~9 数字
- 常用算法:knn、决策树、朴素贝叶斯、Logistic回归、SVM、集成方法(随机森林和 AdaBoost)
步骤:
一. 数据分析
1. 下载并加载数据
2. 总体预览数据:了解每列数据的含义,数据的格式等
3. 数据初步分析,使用统计学与绘图: 由于特征没有特殊的含义,不需要过多的细致分析
二. 特征工程
1.根据业务,常识,以及第二步的数据分析构造特征工程.
2.将特征转换为模型可以辨别的类型(如处理缺失值,处理文本进行等)
三. 模型选择
1.根据目标函数确定学习类型,是无监督学习还是监督学习,是分类问题还是回归问题等.
2.比较各个模型的分数,然后取效果较好的模型作为基础模型.
四. 模型融合
跳过,这个项目的重点是让大家都了解这个kaggle比赛怎么和算法更好的融合在一起。
五. 修改特征和模型参数
此处不做过多分析,主要是优化各个算法的参数。
* KNN => k值
* SVM => 惩罚系数,核函数
* RF => 树个数,树深度,叶子数
* PCA => 特征数 or 信息熵
一. 数据分析
数据下载和加载
数据集下载地址:https://www.kaggle.com/c/digit-recognizer/data
import os
import csv
import datetime
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
data_dir = '/opt/data/kaggle/getting-started/digit-recognizer/'
# 加载数据
def opencsv():
# 使用 pandas 打开
data = pd.read_csv(os.path.join(data_dir, 'input/train.csv'))
data1 = pd.read_csv(os.path.join(data_dir, 'input/test.csv'))
train_data = data.values[0:, 1:] # 读入全部训练数据, [行,列]
train_label = data.values[0:, 0] # 读取列表的第一列
test_data = data1.values[0:, 0:] # 测试全部测试个数据
return train_data, train_label, test_data
# 加载数据
trainData, trainLabel, testData = opencsv()
总体预览数据(目标变量+数据特征)
- label: 目标变量(分类标签)
- pixel0~pixel783: 数据特征(分类属性),特征之间没有特别的业务联系(所以没必要进行统计分析了)
label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,pixel11,pixel12,pixel13,pixel14,pixel15,pixel16,pixel17,pixel18,pixel19,pixel20,pixel21,pixel22,pixel23,pixel24,pixel25,pixel26,pixel27,pixel28,pixel29,pixel30,pixel31,pixel32,pixel33,pixel34,pixel35,pixel36,pixel37,pixel38,pixel39,pixel40,pixel41,pixel42,pixel43,pixel44,pixel45,pixel46,pixel47,pixel48,pixel49,pixel50,pixel51,pixel52,pixel53,pixel54,pixel55,pixel56,pixel57,pixel58,pixel59,pixel60,pixel61,pixel62,pixel63,pixel64,pixel65,pixel66,pixel67,pixel68,pixel69,pixel70,pixel71,pixel72,pixel73,pixel74,pixel75,pixel76,pixel77,pixel78,pixel79,pixel80,pixel81,pixel82,pixel83,pixel84,pixel85,pixel86,pixel87,pixel88,pixel89,pixel90,pixel91,pixel92,pixel93,pixel94,pixel95,pixel96,pixel97,pixel98,pixel99,pixel100,pixel101,pixel102,pixel103,pixel104,pixel105,pixel106,pixel107,pixel108,pixel109,pixel110,pixel111,pixel112,pixel113,pixel114,pixel115,pixel116,pixel117,pixel118,pixel119,pixel120,pixel121,pixel122,pixel123,pixel124,pixel125,pixel126,pixel127,pixel128,pixel129,pixel130,pixel131,pixel132,pixel133,pixel134,pixel135,pixel136,pixel137,pixel138,pixel139,pixel140,pixel141,pixel142,pixel143,pixel144,pixel145,pixel146,pixel147,pixel148,pixel149,pixel150,pixel151,pixel152,pixel153,pixel154,pixel155,pixel156,pixel157,pixel158,pixel159,pixel160,pixel161,pixel162,pixel163,pixel164,pixel165,pixel166,pixel167,pixel168,pixel169,pixel170,pixel171,pixel172,pixel173,pixel174,pixel175,pixel176,pixel177,pixel178,pixel179,pixel180,pixel181,pixel182,pixel183,pixel184,pixel185,pixel186,pixel187,pixel188,pixel189,pixel190,pixel191,pixel192,pixel193,pixel194,pixel195,pixel196,pixel197,pixel198,pixel199,pixel200,pixel201,pixel202,pixel203,pixel204,pixel205,pixel206,pixel207,pixel208,pixel209,pixel210,pixel211,pixel212,pixel213,pixel214,pixel215,pixel216,pixel217,pixel218,pixel219,pixel220,pixel221,pixel222,pixel223,pixel224,pixel225,pixel226,pixel227,pixel228,pixel229,pixel230,pixel231,pixel232,pixel233,pixel234,pixel235,pixel236,pixel237,pixel238,pixel239,pixel240,pixel241,pixel242,pixel243,pixel244,pixel245,pixel246,pixel247,pixel248,pixel249,pixel250,pixel251,pixel252,pixel253,pixel254,pixel255,pixel256,pixel257,pixel258,pixel259,pixel260,pixel261,pixel262,pixel263,pixel264,pixel265,pixel266,pixel267,pixel268,pixel269,pixel270,pixel271,pixel272,pixel273,pixel274,pixel275,pixel276,pixel277,pixel278,pixel279,pixel280,pixel281,pixel282,pixel283,pixel284,pixel285,pixel286,pixel287,pixel288,pixel289,pixel290,pixel291,pixel292,pixel293,pixel294,pixel295,pixel296,pixel297,pixel298,pixel299,pixel300,pixel301,pixel302,pixel303,pixel304,pixel305,pixel306,pixel307,pixel308,pixel309,pixel310,pixel311,pixel312,pixel313,pixel314,pixel315,pixel316,pixel317,pixel318,pixel319,pixel320,pixel321,pixel322,pixel323,pixel324,pixel325,pixel326,pixel327,pixel328,pixel329,pixel330,pixel331,pixel332,pixel333,pixel334,pixel335,pixel336,pixel337,pixel338,pixel339,pixel340,pixel341,pixel342,pixel343,pixel344,pixel345,pixel346,pixel347,pixel348,pixel349,pixel350,pixel351,pixel352,pixel353,pixel354,pixel355,pixel356,pixel357,pixel358,pixel359,pixel360,pixel361,pixel362,pixel363,pixel364,pixel365,pixel366,pixel367,pixel368,pixel369,pixel370,pixel371,pixel372,pixel373,pixel374,pixel375,pixel376,pixel377,pixel378,pixel379,pixel380,pixel381,pixel382,pixel383,pixel384,pixel385,pixel386,pixel387,pixel388,pixel389,pixel390,pixel391,pixel392,pixel393,pixel394,pixel395,pixel396,pixel397,pixel398,pixel399,pixel400,pixel401,pixel402,pixel403,pixel404,pixel405,pixel406,pixel407,pixel408,pixel409,pixel410,pixel411,pixel412,pixel413,pixel414,pixel415,pixel416,pixel417,pixel418,pixel419,pixel420,pixel421,pixel422,pixel423,pixel424,pixel425,pixel426,pixel427,pixel428,pixel429,pixel430,pixel431,pixel432,pixel433,pixel434,pixel435,pixel436,pixel437,pixel438,pixel439,pixel440,pixel441,pixel442,pixel443,pixel444,pixel445,pixel446,pixel447,pixel448,pixel449,pixel450,pixel451,pixel452,pixel453,pixel454,pixel455,pixel456,pixel457,pixel458,pixel459,pixel460,pixel461,pixel462,pixel463,pixel464,pixel465,pixel466,pixel467,pixel468,pixel469,pixel470,pixel471,pixel472,pixel473,pixel474,pixel475,pixel476,pixel477,pixel478,pixel479,pixel480,pixel481,pixel482,pixel483,pixel484,pixel485,pixel486,pixel487,pixel488,pixel489,pixel490,pixel491,pixel492,pixel493,pixel494,pixel495,pixel496,pixel497,pixel498,pixel499,pixel500,pixel501,pixel502,pixel503,pixel504,pixel505,pixel506,pixel507,pixel508,pixel509,pixel510,pixel511,pixel512,pixel513,pixel514,pixel515,pixel516,pixel517,pixel518,pixel519,pixel520,pixel521,pixel522,pixel523,pixel524,pixel525,pixel526,pixel527,pixel528,pixel529,pixel530,pixel531,pixel532,pixel533,pixel534,pixel535,pixel536,pixel537,pixel538,pixel539,pixel540,pixel541,pixel542,pixel543,pixel544,pixel545,pixel546,pixel547,pixel548,pixel549,pixel550,pixel551,pixel552,pixel553,pixel554,pixel555,pixel556,pixel557,pixel558,pixel559,pixel560,pixel561,pixel562,pixel563,pixel564,pixel565,pixel566,pixel567,pixel568,pixel569,pixel570,pixel571,pixel572,pixel573,pixel574,pixel575,pixel576,pixel577,pixel578,pixel579,pixel580,pixel581,pixel582,pixel583,pixel584,pixel585,pixel586,pixel587,pixel588,pixel589,pixel590,pixel591,pixel592,pixel593,pixel594,pixel595,pixel596,pixel597,pixel598,pixel599,pixel600,pixel601,pixel602,pixel603,pixel604,pixel605,pixel606,pixel607,pixel608,pixel609,pixel610,pixel611,pixel612,pixel613,pixel614,pixel615,pixel616,pixel617,pixel618,pixel619,pixel620,pixel621,pixel622,pixel623,pixel624,pixel625,pixel626,pixel627,pixel628,pixel629,pixel630,pixel631,pixel632,pixel633,pixel634,pixel635,pixel636,pixel637,pixel638,pixel639,pixel640,pixel641,pixel642,pixel643,pixel644,pixel645,pixel646,pixel647,pixel648,pixel649,pixel650,pixel651,pixel652,pixel653,pixel654,pixel655,pixel656,pixel657,pixel658,pixel659,pixel660,pixel661,pixel662,pixel663,pixel664,pixel665,pixel666,pixel667,pixel668,pixel669,pixel670,pixel671,pixel672,pixel673,pixel674,pixel675,pixel676,pixel677,pixel678,pixel679,pixel680,pixel681,pixel682,pixel683,pixel684,pixel685,pixel686,pixel687,pixel688,pixel689,pixel690,pixel691,pixel692,pixel693,pixel694,pixel695,pixel696,pixel697,pixel698,pixel699,pixel700,pixel701,pixel702,pixel703,pixel704,pixel705,pixel706,pixel707,pixel708,pixel709,pixel710,pixel711,pixel712,pixel713,pixel714,pixel715,pixel716,pixel717,pixel718,pixel719,pixel720,pixel721,pixel722,pixel723,pixel724,pixel725,pixel726,pixel727,pixel728,pixel729,pixel730,pixel731,pixel732,pixel733,pixel734,pixel735,pixel736,pixel737,pixel738,pixel739,pixel740,pixel741,pixel742,pixel743,pixel744,pixel745,pixel746,pixel747,pixel748,pixel749,pixel750,pixel751,pixel752,pixel753,pixel754,pixel755,pixel756,pixel757,pixel758,pixel759,pixel760,pixel761,pixel762,pixel763,pixel764,pixel765,pixel766,pixel767,pixel768,pixel769,pixel770,pixel771,pixel772,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,141,139,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,106,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,185,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,146,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,156,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,185,255,255,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,185,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,185,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63,254,254,62,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,220,179,6,0,0,0,0,0,0,0,0,9,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,28,247,17,0,0,0,0,0,0,0,0,27,202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,242,155,0,0,0,0,0,0,0,0,27,254,63,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,160,207,6,0,0,0,0,0,0,0,27,254,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,127,254,21,0,0,0,0,0,0,0,20,239,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,77,254,21,0,0,0,0,0,0,0,0,195,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,70,254,21,0,0,0,0,0,0,0,0,195,142,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,56,251,21,0,0,0,0,0,0,0,0,195,227,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,222,153,5,0,0,0,0,0,0,0,120,240,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,67,251,40,0,0,0,0,0,0,0,94,255,69,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,234,184,0,0,0,0,0,0,0,19,245,69,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,234,169,0,0,0,0,0,0,0,3,199,182,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,154,205,4,0,0,26,72,128,203,208,254,254,131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,61,254,129,113,186,245,251,189,75,56,136,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,216,233,233,159,104,52,0,0,0,38,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,206,106,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,186,159,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,209,101,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,25,130,155,254,254,254,157,30,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,103,253,253,253,253,253,253,253,253,114,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,208,253,253,253,253,253,253,253,253,253,253,107,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,253,253,253,253,253,253,253,253,253,253,253,215,101,3,0,0,0,0,0,0,0,0,0,0,0,0,23,210,253,253,253,248,161,222,222,246,253,253,253,253,253,39,0,0,0,0,0,0,0,0,0,0,0,0,136,253,253,253,229,77,0,0,0,70,218,253,253,253,253,215,91,0,0,0,0,0,0,0,0,0,0,5,214,253,253,253,195,0,0,0,0,0,104,224,253,253,253,253,215,29,0,0,0,0,0,0,0,0,0,116,253,253,253,247,75,0,0,0,0,0,0,26,200,253,253,253,253,216,4,0,0,0,0,0,0,0,0,254,253,253,253,195,0,0,0,0,0,0,0,0,26,200,253,253,253,253,5,0,0,0,0,0,0,0,0,254,253,253,253,99,0,0,0,0,0,0,0,0,0,25,231,253,253,253,36,0,0,0,0,0,0,0,0,254,253,253,253,99,0,0,0,0,0,0,0,0,0,0,223,253,253,253,129,0,0,0,0,0,0,0,0,254,253,253,253,99,0,0,0,0,0,0,0,0,0,0,127,253,253,253,129,0,0,0,0,0,0,0,0,254,253,253,253,99,0,0,0,0,0,0,0,0,0,0,139,253,253,253,90,0,0,0,0,0,0,0,0,254,253,253,253,99,0,0,0,0,0,0,0,0,0,78,248,253,253,253,5,0,0,0,0,0,0,0,0,254,253,253,253,216,34,0,0,0,0,0,0,0,33,152,253,253,253,107,1,0,0,0,0,0,0,0,0,206,253,253,253,253,140,0,0,0,0,0,30,139,234,253,253,253,154,2,0,0,0,0,0,0,0,0,0,16,205,253,253,253,250,208,106,106,106,200,237,253,253,253,253,209,22,0,0,0,0,0,0,0,0,0,0,0,82,253,253,253,253,253,253,253,253,253,253,253,253,253,209,22,0,0,0,0,0,0,0,0,0,0,0,0,1,91,253,253,253,253,253,253,253,253,253,253,213,90,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,18,129,208,253,253,253,253,159,129,90,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
二. 特征工程(特征之间关系不大,此步骤跳过)
- 当然可以做一些归一化的操作
- 做降维:因为有些列一直都为了0,信息熵几乎为0,没有使用的必要
# 数据预处理-降维 PCA主成成分分析
def dRPCA(x_train, x_test, COMPONENT_NUM):
print('dimensionality reduction...')
trainData = np.array(x_train)
testData = np.array(x_test)
'''
使用说明:https://www.cnblogs.com/pinard/p/6243025.html
n_components>=1
n_components=NUM 设置占特征数量比
0 < n_components < 1
n_components=0.99 设置阈值总方差占比
'''
pca = PCA(n_components=COMPONENT_NUM, whiten=False)
pca.fit(trainData) # Fit the model with X
pcaTrainData = pca.transform(trainData) # Fit the model with X and 在X上完成降维.
pcaTestData = pca.transform(testData) # Fit the model with X and 在X上完成降维.
# pca 方差大小、方差占比、特征数量
# print("方差大小:\n", pca.explained_variance_, "方差占比:\n", pca.explained_variance_ratio_)
print("特征数量: %s" % pca.n_components_)
print("总方差占比: %s" % sum(pca.explained_variance_ratio_))
return pcaTrainData, pcaTestData
# 降维处理
trainDataPCA, testDataPCA = dRPCA(trainData, testData, 0.8)
三. 模型选择
- 根据目标函数确定学习类型,是无监督学习还是监督学习,是分类问题还是回归问题等.
-
比较各个模型的分数,然后取效果较好的模型作为基础模型.
-
分类问题:0~9 数字
- 常用算法:knn、决策树、朴素贝叶斯、Logistic回归、SVM、集成方法(随机森林和 AdaBoost)
knn
def trainModel(trainData, trainLabel):
clf = KNeighborsClassifier() # default:k = 5,defined by yourself:KNeighborsClassifier(n_neighbors=10)
clf.fit(trainData, np.ravel(trainLabel)) # ravel Return a contiguous flattened array.
return clf
# 模型训练
clf = trainModel(trainDataPCA, trainLabel)
# 结果预测
testLabel = clf.predict(testDataPCA)
svm
# 训练模型
def trainModel(trainData, trainLabel):
print('Train SVM...')
clf = SVC(C=4, kernel='rbf')
clf.fit(trainData, trainLabel) # 训练SVM
return clf
# 模型训练
clf = trainModel(trainDataPCA, y_train)
# 结果预测
testLabel = clf.predict(pcaTestData)
RF - Random Forest
# 训练模型
def trainModel(X_train, y_train):
print('Train RF...')
clf = RandomForestClassifier(
n_estimators=10,
max_depth=10,
min_samples_split=2,
min_samples_leaf=1,
random_state=34)
clf.fit(X_train, y_train) # 训练rf
return clf
# 模型训练
clf = trainModel(trainDataPCA, y_train)
# 结果预测
testLabel = clf.predict(pcaTestData)
结果导出
def saveResult(result, csvName):
with open(csvName, 'wb') as myFile:
myWriter = csv.writer(myFile)
myWriter.writerow(["ImageId", "Label"])
index = 0
for i in result:
tmp = []
index = index+1
tmp.append(index)
# tmp.append(i)
tmp.append(int(i))
myWriter.writerow(tmp)
# 结果的输出
saveResult(testLabel, '/opt/data/kaggle/getting-started/digit-recognizer/output/Result_xxx.csv')
四. 模型融合
跳过,这个项目的重点是让大家都了解这个kaggle比赛怎么和算法更好的融合在一起。
五. 修改特征和模型参数
此处不做过多分析,主要是优化各个算法的参数。