機器學習之決策樹(ID3)算法與Python實現-CDA數據分析師官網

熱線電話：13121318867

登錄

首頁精彩閱讀機器學習之決策樹(ID3)算法與Python實現

機器學習之決策樹(ID3)算法與Python實現

2017-07-23

收藏

機器學習之決策樹(ID3)算法與Python實現

機器學習中，決策樹是一個預測模型；他代表的是對象屬性與對象值之間的一種映射關系。樹中每個節點表示某個對象，而每個分叉路徑則代表的某個可能的屬性值，而每個葉結點則對應從根節點到該葉節點所經歷的路徑所表示的對象的值。決策樹僅有單一輸出，若欲有復數輸出，可以建立獨立的決策樹以處理不同輸出。數據挖掘中決策樹是一種經常要用到的技術，可以用于分析數據，同樣也可以用來作預測。

一、決策樹與ID3概述1.決策樹

決策樹，其結構和樹非常相似，因此得其名決策樹。決策樹具有樹形的結構，其中每個內部節點表示一個屬性上的測試，每個分支代表一個測試輸出，每個葉節點代表一種類別。

例如：

按照豆腐腦的冷熱、甜咸和是否含有大蒜構建決策樹，對其屬性的測試，在最終的葉節點決定該豆腐腦吃還是不吃。

分類樹（決策樹）是一種十分常用的將決策樹應用于分類的機器學習方法。他是一種監管學習，所謂監管學習就是給定一堆樣本，每個樣本都有一組屬性（特征）和一個類別（分類信息/目標），這些類別是事先確定的，那么通過學習得到一個分類器，這個分類器能夠對新出現的對象給出正確的分類。
其原理在于，每個決策樹都表述了一種樹型結構，它由它的分支來對該類型的對象依靠屬性進行分類。每個決策樹可以依靠對源數據庫的分割進行數據測試。這個過程可以遞歸式的對樹進行修剪。當不能再進行分割或一個單獨的類可以被應用于某一分支時，遞歸過程就完成了。

機器學習中，決策樹是一個預測模型；他代表的是對象屬性與對象值之間的一種映射關系。樹中每個節點表示某個對象，而每個分叉路徑則代表的某個可能的屬性值，而每個葉結點則對應從根節點到該葉節點所經歷的路徑所表示的對象的值。決策樹僅有單一輸出，若欲有復數輸出，可以建立獨立的決策樹以處理不同輸出。數據挖掘中決策樹是一種經常要用到的技術，可以用于分析數據，同樣也可以用來作預測。從數據產生決策樹的機器學習技術叫做決策樹學習, 通俗說就是決策樹。

目前常用的決策樹算法有ID3算法、改進的C4.5算法和CART算法。

決策樹的特點

1.多層次的決策樹形式易于理解；

2.只適用于標稱型數據，對連續性數據處理得不好；

2、ID3算法

ID3算法最早是由羅斯昆（J. Ross Quinlan）于1975年在悉尼大學提出的一種分類預測算法，算法以信息論為基礎，其核心是“信息熵”。ID3算法通過計算每個屬性的信息增益，認為信息增益高的是好屬性，每次劃分選取信息增益最高的屬性為劃分標準，重復這個過程，直至生成一個能完美分類訓練樣例的決策樹。

信息熵（Entropy）:

,其中p(xi)是選擇i的概率。
熵越高，表示混合的數據越多。
信息增益（Information Gain）:

T是劃分之后的分支集合，p(t)是該分支集合在原本的父集合中出現的概率，H(t)是該子集合的信息熵。
3.ID3算法與決策樹的流程
（1）數據準備：需要對數值型數據進行離散化
（2）ID3算法構建決策樹：
如果數據集類別完全相同，則停止劃分
否則，繼續劃分決策樹：
計算信息熵和信息增益來選擇最好的數據集劃分方法；
劃分數據集
創建分支節點：
對每個分支進行判定是否類別相同，如果相同停止劃分，不同按照上述方法進行劃分。
二、Python算法實現

創建 trees.py文件，在其中創建構建決策樹的函數。
首先構建一組測試數據：

0. 構造函數createDataSet：

def createDataSet():
    dataSet=[[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
    labels=['no surfacing','flippers']
    return dataSet,labels

在Python控制臺測試構造函數

#測試下構造的數據
import trees
myDat,labels = trees.createDataSet()
myDat
Out[4]: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels

Out[5]: ['no surfacing', 'flippers']

2.1 計算信息熵

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet) #nrows
    #為所有的分類類目創建字典
    labelCounts ={}
    for featVec in dataSet:
        currentLable=featVec[-1] #取得最后一列數據
        if currentLable not in labelCounts.keys():
            labelCounts[currentLable]=0
        labelCounts[currentLable]+=1
    #計算香農熵
    shannonEnt=0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

利用構造的數據測試calcShannonEnt：

#Python console
In [6]: trees.calcShannonEnt(myDat)
...:
Out[6]: 0.9709505944546686

2.2 按照最大信息增益劃分數據集

#定義按照某個特征進行劃分的函數splitDataSet
#輸入三個變量（待劃分的數據集，特征，分類值）
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value :
            reduceFeatVec=featVec[:axis]
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet #返回不含劃分特征的子集

#定義按照最大信息增益劃分數據的函數
def chooseBestFeatureToSplit(dataSet):
    numFeature=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)#香農熵
    bestInforGain=0
    bestFeature=-1
    for i in range(numFeature):
        featList=[number[i] for number in dataSet] #得到某個特征下所有值（某列）
        uniqualVals=set(featList) #set無重復的屬性特征值
        newEntropy=0
        for value in uniqualVals:
            subDataSet=splitDataSet(dataSet,i,value)
            prob=len(subDataSet)/float(len(dataSet)) #即p(t)
            newEntropy+=prob*calcShannonEnt(subDataSet)#對各子集香農熵求和
        infoGain=baseEntropy-newEntropy #計算信息增益
        #最大信息增益
        if (infoGain>bestInforGain):
            bestInforGain=infoGain
            bestFeature=i
    return bestFeature #返回特征值

在控制臺中測試這兩個函數：

#測試按照特征劃分數據集的函數
In [8]: from imp import reload
In [9]: reload(trees)
Out[9]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>
In [10]: myDat,labels=trees.createDataSet()
    ...:
In [11]: trees.splitDataSet(myDat,0,0)
    ...:
Out[11]: [[1, 'no'], [1, 'no']]
In [12]: trees.splitDataSet(myDat,0,1)
    ...:
Out[12]: [[1, 'yes'], [1, 'yes'], [0, 'no']]

#測試chooseBestFeatureToSplit函數
In [13]: reload(trees)
    ...:
Out[13]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>

In [14]: trees.chooseBestFeatureToSplit(myDat)
    ...:

Out[14]: 0

2.3 創建決策樹構造函數createTree

import operater
#投票表決代碼
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():classCount[vote]=0
        classCount[vote]+=1
    sortedClassCount=sorted(classCount.items,key=operator.itemgetter(1),reversed=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    #類別相同，停止劃分
    if classList.count(classList[-1])==len(classList):
        return classList[-1]
    #長度為1，返回出現次數最多的類別
    if len(classList[0])==1:
        return majorityCnt(classList)
    #按照信息增益最高選取分類特征屬性
    bestFeat=chooseBestFeatureToSplit(dataSet)#返回分類的特征序號
    bestFeatLable=labels[bestFeat] #該特征的label
    myTree={bestFeatLable:{}} #構建樹的字典
    del(labels[bestFeat]) #從labels的list中刪除該label
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    for value in uniqueVals:
        subLables=labels[:] #子集合
        #構建數據的子集合，并進行遞歸
        myTree[bestFeatLable][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLables)
    return myTree

以之前構造的測試數據為例，對決策樹構造函數進行測試，在python控制臺進行輸入：

#決策樹構造函數測試
In [15]: reload(trees)
...:
Out[15]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>

In [16]: myTree=trees.createTree(myDat,labels)
...:

In [17]: myTree
Out[17]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

可以看到，最后生成的決策樹myTree是一個多層嵌套的字典。

2.4 決策樹運用于分類

運用決策樹進行分類，首先構建一個決策樹分類函數：

#輸入三個變量（決策樹，屬性特征標簽，測試的數據）
def classify(inputTree,featLables,testVec):
    firstStr=list(inputTree.keys())[0] #獲取樹的第一個特征屬性
    secondDict=inputTree[firstStr] #樹的分支，子集合Dict
    featIndex=featLables.index(firstStr) #獲取決策樹第一層在featLables中的位置
    for key in secondDict.keys():
        if testVec[featIndex]==key:
            if type(secondDict[key]).__name__=='dict':
                classLabel=classify(secondDict[key],featLables,testVec)
            else:classLabel=secondDict[key]
    return classLabel

對決策樹分類函數進行測試：

In [29]: reload(trees)
    ...:
Out[29]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>

In [30]: myDat,labels=trees.createDataSet()
    ...:

In [31]: labels
    ...:
Out[31]: ['no surfacing', 'flippers']

In [32]: myTree=treeplotter.retrieveTree(0)
    ...:

In [33]: myTree
    ...:
Out[33]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

In [34]: trees.classify(myTree,labels,[1,0])
    ...:
Out[34]: 'no'

In [35]: trees.classify(myTree,labels,[1,1])
    ...:

Out[35]: 'yes'

2.5 決策樹的存儲

如果每次都需要訓練樣本集來構建決策樹，費時費力，特別是數據很大的時候，每次重新構建決策樹浪費時間。因此可以將已經創建的決策樹（如字典形式）保存在硬盤上，需要使用的時候直接讀取就好。
（1）存儲函數

    1def storeTree(inputTree,filename):
    import pickle
    fw=open(filename,'wb') #pickle默認方式是二進制，需要制定'wb'
    pickle.dump(inputTree,fw)
    fw.close()

（2）讀取函數

def grabTree(filename):
    import pickle
    fr=open(filename,'rb')#需要制定'rb'，以byte形式讀取
    return pickle.load(fr)

對這兩個函數進行測試（Python console）：

In [36]: myTree
Out[36]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

In [37]: trees.storeTree(myTree,'classifierStorage.txt')

In [38]: trees.grabTree('classifierStorage.txt')
Out[38]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

在工作目錄下存在一個名為’classifierStorage.txt’的txt文檔，該文檔保存了myTree的決策樹信息，需要使用的時候直接調出使用。

三、使用Matplotlib繪制決策樹

import matplotlib.pyplot as plt

from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei'] #否則中文無法正常顯示

decisionNode=dict(boxstyle='sawtooth',fc='0.8') #決策點樣式
leafNode=dict(boxstyle='round4',fc='0.8')　#葉節點樣式
arrow_args=dict(arrowstyle='<-') #箭頭樣式

def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords='axes fraction',
                            xytext=centerPt,textcoords='axes fraction',
                            va='center',ha='center',bbox=nodeType,arrowprops=arrow_args)

def createPlot():
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    createPlot.ax1=plt.subplot(111,frameon=False)
    plotNode('決策節點',(0.5,0.1),(0.1,0.5),decisionNode)
    plotNode('葉節點',(0.8,0.1),(0.3,0.8),leafNode)
    plt.show()

#測試
#獲取葉節點數量（廣度）
def getNumLeafs(myTree):
    numLeafs=0
    firstStr=list(myTree.keys())[0]#'dict_keys' object does not support indexing
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs+=getNumLeafs(secondDict[key])
        else:numLeafs+=1
    return numLeafs

#獲取樹的深度的函數（深度）
def getTreeDepth(myTree):
    maxDepth=0
    firstStr=list(myTree.keys())[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth=1+getTreeDepth(secondDict[key])
        else: thisDepth=1
        if thisDepth > maxDepth:
            maxDepth=thisDepth
    return maxDepth
#定義一個預先創建樹的函數
def retrieveTree(i):
    listOfTrees=[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                 {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head':{0:'no', 1: 'yes'}},1:'no'}}}}
                 ]
    return listOfTrees[i]

#定義在父子節點之間填充文本信息的函數
def plotMidText(cntrPt,parentPt,txtString):
    xMid=(parentPt[0]-cntrPt[0])/2+cntrPt[0]
    yMid=(parentPt[1]-cntrPt[1])/2+cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)

#定義樹繪制的函數
def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    depth=getTreeDepth(myTree)
    firstStr=list(myTree.keys())[0]
    cntrPt=(plotTree.xOff+(1.0+float(numLeafs))/2/plotTree.totalW,plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    plotTree.yOff=plotTree.yOff -1/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff=plotTree.xOff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff=plotTree.yOff+1/plotTree.totalD

#定義主函數，來調用其它函數
def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.xOff=-0.5/plotTree.totalW;plotTree.yOff=1.0;
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

對繪制決策樹圖的函數進行測試（控制臺）：

In [26]: reload(treeplotter)
    ...:
Out[26]: <module 'treeplotter' from 'G:\\Workspaces\\MachineLearning\\treeplotter.py'>

In [27]: myTree=treeplotter.retrieveTree(0)
    ...:

In [28]: treeplotter.createPlot(myTree)
    ...:

得到決策樹圖：

四、實例（使用決策樹預測隱形眼鏡類型）

隱形眼鏡的數據集包含了患者的四個屬性age，prescript，stigmatic，tearRate，利用這些數據構建決策樹，并通過Matplotlib繪制出決策樹的樹狀圖。
附lenses.txt數據：

young   myope   no reduced no lenses
young   myope   no normal soft
young   myope   yes reduced no lenses
young   myope   yes normal hard
young   hyper   no reduced no lenses
young   hyper   no normal soft
young   hyper   yes reduced no lenses
young   hyper   yes normal hard
pre myope   no reduced no lenses
pre myope   no normal soft
pre myope   yes reduced no lenses
pre myope   yes normal hard
pre hyper   no reduced no lenses
pre hyper   no normal soft
pre hyper   yes reduced no lenses
pre hyper   yes normal no lenses
presbyopic myope   no reduced no lenses
presbyopic myope   no normal no lenses
presbyopic myope   yes reduced no lenses
presbyopic myope   yes normal hard
presbyopic hyper   no reduced no lenses
presbyopic hyper   no normal soft
presbyopic hyper   yes reduced no lenses
presbyopic hyper   yes normal no lenses

In [40]: fr=open('machinelearninginaction/Ch03/lenses.txt')

In [41]: lenses=[inst.strip().split('\t') for inst in fr.readlines()]

In [42]: lensesLabels=['age','prescript','astigmatic','tearRate']

In [43]: lensesTree=trees.createTree(lenses,lensesLabels)

In [44]: lensesTree
Out[44]:
{'tearRate': {'normal': {'astigmatic': {'no': {'age': {'pre': 'soft',
      'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}},
      'young': 'soft'}},
    'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses',
        'presbyopic': 'no lenses',
        'young': 'hard'}},
      'myope': 'hard'}}}},
'reduced': 'no lenses'}}
In [45]: treeplotter.createPlot(lensesTree)

得到圖

CDA數據分析師考試相關入口一覽（建議收藏）：

? 想報名CDA認證考試，點擊>>> “CDA報名” 了解CDA考試詳情；

? 想學習CDA考試教材，點擊>>> “CDA教材” 了解CDA考試詳情；

? 想加入CDA考試題庫，點擊>>> “CDA題庫” 了解CDA考試詳情；

? 想了解CDA考試含金量，點擊>>> “CDA含金量” 了解CDA考試詳情；

決策樹特征 D3 機器學習數據挖掘 python matplotlib

數據分析咨詢請掃描二維碼

若不方便掃碼，搜微信號：CDAshujufenxi

上一篇圖論在大數據分析中的作用！

下一篇CDA認證再升一檔！與國家共同推進大數據人才培養標準教育事業！

數據分析師考試動態

考試介紹
考試大綱
考試內容
考試地點

CDA報考指南

報考流程
考試時間
報名費用
聯系我們

數據分析學習

數據分析師資訊

更多

Copyright © 2015-2021, www.ruiqisteel.com All Rights Reserved. CDA數據分析師(北京國富如荷網絡科技有限公司) 版權所有京ICP備11001960號-9

京公網安備 11010802034615號經營許可證編號：京B2-20210330

聯系電話：13321103290 (微信同號)

OK

免費資料
免費試聽
訂制課程
職業規劃
認證考試

客服在線

日韩人妻系列无码专区视频,先锋高清无码,无码免费视欧非,国精产品一区一区三区无码

客服在線

立即咨詢

免密碼登錄

提交首次登錄驗證后自動注冊