Skills Training

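[b](1) What is a support vector machine?[/b]
[b](2) Briefly describe the hard-margin support vector machine.[/b] [br]
[b](3) Briefly describe the nonlinear support vector machine.[/b]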
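[b](4) Hands-on practice with the support vector machine (SVM) big data algorithm.[/b][br][br] ① Purpose.[br][br] The assignment is meant to help students understand what the hard-margin, soft-margin, and nonlinear support vector machines mean as algorithms and where each applies; to see how four kernel functions (Linear Kernel, Polynomial Kernel, Gaussian Kernel, and Sigmoid Kernel) affect the behavior of the learner; and to compare their similarities and differences, thereby deepening their understanding of how the SVMs in the Orange platform implement classification.[br][br] ② Preparation.[br][br] [color=#0000ff][b][url=https://orangedatamining.com/download/]Download Orange3[/url][icon]/images/ggb/toolbar/mode_zoomin.png[/icon][/b][/color] and install it.[br] [br] The [color=#0000ff][b][url=https://pan.baidu.com/disk/main#/index?category=all&path=%2FSVM%2FSVM%E6%93%8D%E4%BD%9C%E5%AE%9E%E8%B7%B5]source data[/url][icon]/images/ggb/toolbar/mode_zoomin.png[/icon][/b][/color] consists of three files: adult-data.txt (training set), adult-test.txt (test set), and adult-attribute.txt (data source and attribute descriptions).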
[br]| This data was extracted from the census bureau database found at [br][br]| http://www.census.gov/ftp/pub/DES/www/welcome.html [br][br]| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). [br][br]| 48842 instances, mix of continuous and discrete (train=32561, test=16281)[br][br]| 45222 if instances with unknown values are removed (train=30162, test=15060)[br][br]| Duplicate or conflicting instances : 6 [br][br]| Class probabilities for adult.all file [br][br]| Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)[br][br]| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)[br][br]| [br][br]| Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). [br][br]| [br][br]| Prediction task is to determine whether a person makes over 50K a year. [br][br]| [br][br]| C4.5 : 84.46+-0.30 [br][br]| Naive-Bayes : 83.88+-0.30 [br][br]| NBTree : 85.90+-0.28 [br][br]| [br][br]| Following algorithms were later run with the following error rates, all after removal of unknowns and using the original train/test split. All these numbers are straight runs using MLC++ with default values. 
[br][br]| [br][br]| # Algorithm Error [br][br]| 1 C4.5 15.54 [br][br]| 2 C4.5-auto 14.46 [br][br]| 3 C4.5 rules 14.94 [br][br]| 4 Voted ID3 (0.6) 15.64 [br][br]| 5 Voted ID3 (0.8) 16.47 [br][br]| 6 T2 16.84 [br][br]| 7 1R 19.54 [br][br]| 8 NBTree 14.10 [br][br]| 9 CN2 16.00 [br][br]| 10 HOODG 14.82 [br][br]| 11 FSS Naive Bayes 14.05 [br][br]| 12 IDTM (Decision table) 14.46 [br][br]| 13 Naive-Bayes 16.12 [br][br]| 14 Nearest-neighbor (1) 21.42 [br][br]| 15 Nearest-neighbor (3) 20.35 [br][br]| 16 OC1 15.04 [br][br]| [br][br]| Description of fnlwgt (final weight): [br][br]| The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. [br][br]| These are: [br][br]| 1. A single cell estimate of the population 16+ for each state. [br][br]| 2. Controls for Hispanic Origin by age and sex. [br][br]| 3. Controls by Race, age and sex. [br][br]age: continuous. [br][br]workclass: Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked. [br][br]fnlwgt: continuous. [br][br]education: Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool. [br][br]education-num: continuous. [br][br]marital-status: Married-civ-spouse,Divorced,Never-married,Separated,Widowed,Married-spouse-absent,Married-AF-spouse. [br][br]occupation: Tech-support,Craft-repair,Other-service,Sales,Exec-managerial,Prof-specialty,Handlers-cleaners,Machine-op-inspct,Adm-clerical,Farming-fishing,Transport-moving,Priv-house-serv,Protective-serv,Armed-Forces. [br][br]relationship: Wife,Own-child,Husband,Not-in-family,Other-relative,Unmarried. [br][br]race: White,Asian-Pac-Islander,Amer-Indian-Eskimo,Other,Black.[br][br]sex: Female,Male. [br][br]capital-gain: continuous. [br][br]capital-loss: continuous. 
[br][br]hours-per-week: continuous. [br][br]native-country: United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands.
a. Data source analysis.[br][br] First identify where the data comes from: it is hosted on an open machine learning repository website.[br][br] b. Analysis of the data attributes and data configuration.[br][br] Data configuration:[br][br] Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). [br][br] 48842 instances, mix of continuous and discrete (train=32561, test=16281)[br][br] 45222 if instances with unknown values are removed (train=30162, test=15060)[br][br] Class probabilities for adult.all file [br][br] Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)[br][br] Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
[br]Attributes (15 columns in total: the 14 listed below plus the income class label):[br][br]age: continuous. [br][br]workclass: Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked. [br][br]fnlwgt: continuous. [br][br]education: Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool. [br][br]education-num: continuous. [br][br]marital-status: Married-civ-spouse,Divorced,Never-married,Separated,Widowed,Married-spouse-absent,Married-AF-spouse. [br][br]occupation: Tech-support,Craft-repair,Other-service,Sales,Exec-managerial,Prof-specialty,Handlers-cleaners,Machine-op-inspct,Adm-clerical,Farming-fishing,Transport-moving,Priv-house-serv,Protective-serv,Armed-Forces. [br][br]relationship: Wife,Own-child,Husband,Not-in-family,Other-relative,Unmarried. [br][br]race: White,Asian-Pac-Islander,Amer-Indian-Eskimo,Other,Black. [br][br]sex: Female,Male. [br][br]capital-gain: continuous. [br][br]capital-loss: continuous. [br][br]hours-per-week: continuous. [br][br]native-country: United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands.
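The attribute file reports 45222 instances once records with unknown values are removed. The sketch below shows one way to apply that filter outside Excel, using only the Python standard library. It rests on several assumptions that the material does not state: that the txt files are comma-separated with a space after each comma, that unknown values are written as "?", and that the class label is the last field. The column name `income`, the function name `read_clean_rows`, and the two sample records are made up for illustration and are not taken from the data files.

```python
import csv
import io

# Column names follow the attribute list above; "income" is an assumed
# name for the class label, which the attribute file leaves unnamed.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def read_clean_rows(text_stream):
    """Return records as dicts, skipping any record with an unknown value."""
    reader = csv.reader(text_stream, skipinitialspace=True)
    rows = []
    for record in reader:
        # Skip blank lines and lines that do not have exactly 15 fields.
        if len(record) != len(COLUMNS):
            continue
        record = [field.strip() for field in record]
        # Assumed convention: unknown values are marked with "?".
        if "?" in record:
            continue
        rows.append(dict(zip(COLUMNS, record)))
    return rows


# Two made-up records in the assumed file layout (not real data rows).
SAMPLE = (
    "39, Private, 120000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "52, ?, 98000, HS-grad, 9, Married-civ-spouse, ?, Husband, White, "
    "Male, 0, 0, 45, United-States, >50K\n"
)

if __name__ == "__main__":
    for row in read_clean_rows(io.StringIO(SAMPLE)):
        print(row["age"], row["workclass"], row["income"])
```

For a real file, the same function can be called on `open("adult-data.txt", newline="")`. The number of rows it returns can then be compared with the train=30162 figure quoted above as a sanity check.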
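③ Assignment content.[br][br] The assignment has three parts:[br][br] ● Data cleaning and conversion;[br][br] ● Hands-on work on the Orange platform;[br][br] ● Writing the analysis report.[br][br] a. Data cleaning and conversion.[br][br] The downloaded data usually comes as txt files. A txt file is plain text with no formatting, so it is hard to read at a glance and inconvenient to work with in Orange; it therefore has to be converted and preprocessed.[br][br] ● Open the txt files in Excel.[br][br] Requirement: create two Excel datasets, one for the training set and one for the test set; the file names are up to you.[br][br] ● Preprocess the data.[br][br] Requirement: add a header row, then use filtering to batch-delete every record that contains the "?" character.[br][br] b. Hands-on work on the Orange platform.[br][br] The overall requirement is to build one learner for each of the four kernel functions and compare how well the learners perform. The workflow should be complete, logically clear, and produce reasonable output. The key points are:[br][br] ▶ Set up a learner for each of the four kernel functions;[br][br] ▶ Deploy the training set and the test set sensibly;[br][br] ▶ Adjust the penalty term and the other parameters to tune the learners;[br][br] ▶ Attach each dataset to the correct end of its connection so that no errors are reported;[br][br] ▶ Use the visualization widgets to display the support vectors.[br][center][img]https://s21.ax1x.com/2025/02/20/pEQ3JfA.png[/img][/center][br][br] c. Write the data analysis report.[br] [br] [color=#0000ff][b][url=https://pan.baidu.com/disk/main#/index?category=all&path=%2F%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E6%8A%A5%E5%91%8A%E6%A8%A1%E6%9D%BF]Download the resources above[/url][icon]/images/ggb/toolbar/mode_zoomin.png[/icon][/b][/color].[br]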
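To make the comparison of the four kernel functions more concrete, here is a minimal pure-Python sketch of the kernels themselves. The formulas follow the forms commonly documented for the Orange SVM widget and for scikit-learn. The parameter names `g` (gamma), `c` (constant term), and `d` (degree) and their default values are illustrative assumptions, not settings prescribed by this assignment. The sketch only evaluates the kernels on a pair of vectors and does not train a classifier.

```python
import math


def dot(x, y):
    """Inner product of two equal-length numeric sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(x, y) = x . y
    return dot(x, y)


def polynomial_kernel(x, y, g=1.0, c=0.0, d=3):
    # K(x, y) = (g * x . y + c) ** d
    return (g * dot(x, y) + c) ** d


def gaussian_kernel(x, y, g=1.0):
    # K(x, y) = exp(-g * ||x - y||^2), also called the RBF kernel
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)


def sigmoid_kernel(x, y, g=1.0, c=0.0):
    # K(x, y) = tanh(g * x . y + c)
    return math.tanh(g * dot(x, y) + c)


if __name__ == "__main__":
    # Two arbitrary example vectors, chosen only for illustration.
    x, y = [1.0, 2.0], [2.0, 0.5]
    kernels = [
        ("Linear", linear_kernel),
        ("Polynomial", polynomial_kernel),
        ("Gaussian", gaussian_kernel),
        ("Sigmoid", sigmoid_kernel),
    ]
    for name, kernel in kernels:
        print(name, kernel(x, y))
```

One property worth noticing when tuning: the Gaussian kernel of a point with itself is always 1 whatever the value of `g`, and it decays toward 0 as two points move apart. The linear and polynomial kernels, by contrast, grow with the magnitude of the inputs, which is one reason feature scaling tends to matter for them.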