Determining the number of clusters for K-means clustering in R - CDA Data Analyst official site


2020-05-20


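Method 1:

K-means algorithm (K-means cluster analysis)
In the sum of squared errors plot below, the x-axis position of the bend (elbow) gives a suitable number of clusters for k-means.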
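
For reference, the sum of squared errors on the y-axis of that plot is the within-cluster sum of squares (the `wss` of the code that follows). A minimal statement of it, where the symbols $C_j$ (the j-th cluster), $\mu_j$ (its centroid) and $k$ (the number of clusters) are notation introduced here rather than taken from the article:

```latex
\mathrm{WSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The best achievable WSS never increases as k grows, so the goal is not its minimum but the k after which further decreases become small.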

> n = 100
> g = 6
> set.seed(g)
> # simulate g groups of points; group i is drawn around the mean runif(1)*i^2
> d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
> plot(d)

> mydata <- d
> # for a single cluster, the within-cluster sum of squares equals the total sum of squares
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:15)
+ wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
> ### wss is the within-cluster sum of squares
> plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

The plot above shows that this method suggests 4 as a reasonable number of clusters.
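
Because `kmeans` starts from randomly chosen centers, a single run per k can land in a poor local optimum and make the curve bumpy. The following is a hedged sketch of the same elbow computation with several random starts per k. The `nstart = 25` setting and the use of `tot.withinss` are additions for illustration, not part of the original transcript, and the elbow it suggests may differ from the one reported above.

```r
# Sketch only: same simulated data as the transcript, but each k is fitted
# from 25 random starts (an assumption added here) and the best fit is kept.
n <- 100
g <- 6
set.seed(g)
d <- data.frame(
  x = unlist(lapply(1:g, function(i) rnorm(n / g, runif(1) * i^2))),
  y = unlist(lapply(1:g, function(i) rnorm(n / g, runif(1) * i^2)))
)

k_max <- 15
wss <- numeric(k_max)
# k = 1: the within-cluster sum of squares is the total sum of squares
wss[1] <- (nrow(d) - 1) * sum(apply(d, 2, var))
for (k in 2:k_max) {
  # tot.withinss is the sum of the per-cluster withinss values
  wss[k] <- kmeans(d, centers = k, nstart = 25)$tot.withinss
}

plot(1:k_max, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

Keeping the best of several starts makes the curve much less dependent on one unlucky initialization, at the cost of running `kmeans` more often.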
Method 2:

K-medoids clustering algorithm (K-medoids)
Use the pamk function from the fpc package to estimate the number of clusters:

> library(cluster)
Warning message:
package ‘cluster’ was built under R version 3.2.3
> library(fpc)
> pamk.best <- pamk(d)
> cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
number of clusters estimated by optimum average silhouette width: 4
> plot(pam(d, pamk.best$nc))

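The silhouette value measures how cohesive an object is with its own cluster and how well separated it is from the other clusters. It ranges from -1 to 1; the larger the value, the better the object matches its own cluster and the worse it matches neighboring clusters.
So, from the silhouette plot above, this method gives 4 as a reasonable number of clusters.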
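
The silhouette value described above has a standard definition, sketched here for reference. The symbols $a(i)$ (mean distance from object i to the other members of its own cluster) and $b(i)$ (smallest mean distance from i to the members of any other cluster) are notation introduced here, not taken from the article:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Averaging $s(i)$ over all objects gives the average silhouette width that the `cat` message in the transcript refers to; `pamk` reports the number of clusters for which this average is largest.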
Method 3:
Based on the Calinski criterion
> require(vegan)
Loading required package: vegan
Loading required package: permute
Loading required package: lattice
This is vegan 2.4-0
Warning messages:
1: package ‘vegan’ was built under R version 3.2.5
2: package ‘permute’ was built under R version 3.2.5
3: package ‘lattice’ was built under R version 3.2.3
> fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
> plot(fit, sortg = TRUE, grpmts.plot = TRUE)
> # row 2 of fit$results holds the criterion; the partitions start at 1 group, so the column index equals the number of groups
> calinski.best <- as.numeric(which.max(fit$results[2,]))
> cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
Calinski criterion optimal number of clusters: 5

From the plot above, the Calinski criterion gives 5 clusters.
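
To make the criterion concrete, here is a hedged sketch that computes the Calinski-Harabasz index by hand from base-R `kmeans` output: the between-cluster sum of squares per degree of freedom divided by the within-cluster sum of squares per degree of freedom. The helper `ch_index` and the toy matrix `x` are hypothetical names introduced for illustration; this is not how vegan implements `cascadeKM`, and it does not use the article's data.

```r
# Hypothetical helper (not part of vegan): Calinski-Harabasz index for k >= 2,
#   CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)]
ch_index <- function(x, k, nstart = 25) {
  fit <- kmeans(x, centers = k, nstart = nstart)
  n <- nrow(x)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

set.seed(1)
# Illustrative toy data: three well-separated groups of 30 points each
x <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

ks <- 2:8                      # the index is undefined for k = 1
ch <- sapply(ks, function(k) ch_index(x, k))
best_k <- ks[which.max(ch)]
cat("Calinski-Harabasz optimal number of clusters:", best_k, "\n")
```

The index divides by k - 1, so it cannot be evaluated for a single cluster; that is why the sketch starts at k = 2.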
Method 4:

Model-based clustering, using the mclust package:

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.1
Type 'citation("mclust")' for citing this R package in publications.
Warning message:
package ‘mclust’ was built under R version 3.2.4
> d_clust <- Mclust(as.matrix(d), G=1:20)
> m.best <- dim(d_clust$z)[2]
> cat("model-based optimal number of clusters:", m.best, "\n")
model-based optimal number of clusters: 4
> plot(d_clust)
Model-based clustering plots:

1: BIC
2: classification
3: uncertainty
4: density
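
For context on the first of the plots listed above: `Mclust` fits Gaussian mixture models over the requested range of G (and over several covariance structures) and keeps the model with the best BIC. A sketch of the criterion in the sign convention mclust uses, where larger is better; the symbols $\hat{L}$ (maximized likelihood), $m$ (number of estimated parameters) and $n$ (number of observations) are notation introduced here:

```latex
\mathrm{BIC} = 2 \log \hat{L} - m \log n
```

The number of mixture components in the selected model is what `m.best` reports in the transcript.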

Method 5:

Clustering based on the AP (affinity propagation) algorithm

> library(apcluster)

Attaching package: ‘apcluster’

The following object is masked from ‘package:stats’:

    heatmap

Warning message:
package ‘apcluster’ was built under R version 3.2.5
> d.apclus <- apcluster(negDistMat(r=2), d)
> cat("affinity propagation optimal number of clusters:", length(d.apclus@clusters), "\n")
affinity propagation optimal number of clusters: 4
> # 4 is the resulting number of clusters
> heatmap(d.apclus)
> plot(d.apclus, d)
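
The call `negDistMat(r=2)` in the transcript above defines the similarity that affinity propagation works on: the negative squared Euclidean distance between two points. In the notation below, $s(i, j)$ and $x_i$ are symbols introduced here for illustration:

```latex
s(i, j) = -\lVert x_i - x_j \rVert^2, \qquad i \ne j
```

Unlike k-means, affinity propagation is not given a number of clusters. The count it returns depends on the input preference (the self-similarity $s(i, i)$), which `apcluster` sets to the median of the input similarities by default, so the 4 reported above reflects that default.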
