
Perform a k-means clustering analysis with stats::kmeans(), but return a k_means object that can embed the original data alongside the analysis, which makes a richer set of methods available.

k_means(
  x,
  k,
  centers = k,
  iter.max = 10L,
  nstart = 1L,
  algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
  trace = FALSE,
  keep.data = TRUE
)

profile_k(x, fun = kmeans, method = "wss", k.max = NULL, ...)

# S3 method for kmeans
augment(x, data, ...)

# S3 method for k_means
predict(object, ...)

# S3 method for k_means
plot(
  x,
  y,
  data = x$data,
  choices = 1L:2L,
  col = NULL,
  c.shape = 8,
  c.size = 3,
  ...
)

# S3 method for k_means
autoplot(
  object,
  data = object$data,
  choices = 1L:2L,
  alpha = 1,
  c.shape = 8,
  c.size = 3,
  theme = NULL,
  use.chart = FALSE,
  ...
)

# S3 method for k_means
chart(data, ..., type = NULL, env = parent.frame())

Arguments

x

A data frame or a matrix with numeric data

k

The number of clusters to create, or a set of initial cluster centers. If a number, a random set of initial centers is computed first.

centers

Same as k (centers is a synonym of k).

iter.max

Maximum number of iterations (10 by default)

nstart

If k is a number, how many random sets should be chosen?

algorithm

The algorithm to use. May be abbreviated. See stats::kmeans() for more details about available algorithms.

trace

Logical or integer. Should the process be traced? Higher values produce more tracing information.

keep.data

Should the data be kept in the object? If TRUE (the default), a richer set of methods can be applied to the resulting object, but it takes more memory. Use FALSE to save RAM.

fun

The kmeans clustering function to use, kmeans() by default.

method

The method used in profile_k(): "wss" (the default, total within-cluster sum of squares), "silhouette" (average silhouette width) or "gap_stat" (gap statistic).

k.max

Maximum number of clusters to consider (at least two). If not provided, a reasonable default is calculated.

...

Other arguments transmitted to factoextra::fviz_nbclust().

data

The original data frame

object

The k_means object

y

Not used

choices

The axes (variables) to plot (first and second by default)

col

Color to use

c.shape

The shape to represent cluster centers

c.size

The size of the shape representing cluster centers

alpha

Semi-transparency to apply to points

theme

The ggplot theme to apply to the plot

use.chart

If TRUE use chart(), otherwise, use ggplot().

type

Not used here

env

Not used here
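
As a rough illustration of what the "wss" criterion measures, the sketch below computes the total within-cluster sum of squares for k = 1 to 6 by calling stats::kmeans() directly. It is not the code that profile_k() runs (according to the ... argument above, that work involves factoextra::fviz_nbclust()); the names iris_num, k_values and wss, the seed and the range of k are chosen here purely for illustration.

```r
# Base R only: total within-cluster sum of squares for several values of k
iris_num <- datasets::iris[, -5]  # numeric columns only
set.seed(42)                      # k-means starts from random centers
k_values <- 1:6                   # illustrative range, not a package default

wss <- vapply(k_values, function(k) {
  # nstart = 20L keeps the best of 20 random starts for each k
  stats::kmeans(iris_num, centers = k, nstart = 20L)$tot.withinss
}, numeric(1))

# The "elbow" where the curve flattens is a hint for a reasonable k
plot(k_values, wss, type = "b",
  xlab = "Number of clusters k",
  ylab = "Total within-cluster sum of squares")
```

The curve always decreases as k grows, so what matters is the point where adding a cluster stops reducing the sum of squares substantially, not the minimum.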

Value

k_means() creates an object of classes k_means and kmeans. profile_k() is used for its side effect of creating a plot that should help to choose the best value for k.
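
Because the result also has class kmeans, it is reasonable to assume (this is not stated on this page) that it carries the usual components of a stats::kmeans() fit. The sketch below inspects such a fit using base R only; the name km is illustrative, and the optional data component suggested by the data = x$data default above is not shown.

```r
# A plain stats::kmeans() fit, to show the components a kmeans object holds
km <- stats::kmeans(datasets::iris[, -5], centers = 3, nstart = 20L)

class(km)    # the base class that k_means objects also inherit from
names(km)    # components stored in the fit
km$centers   # one row of coordinates per cluster center
km$size      # number of observations in each cluster
head(km$cluster)  # cluster membership of the first observations

# Total sum of squares splits into within-cluster and between-cluster parts
all.equal(km$totss, km$tot.withinss + km$betweenss)
```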

Examples

data(iris, package = "datasets")
iris_num <- iris[, -5] # Only numerical variables
library(chart)

# The k profile is to be taken only as a (useful) indication!
profile_k(iris_num) # 2, maybe 3 clusters

iris_k2 <- k_means(iris_num, k = 2)
chart(iris_k2)


iris_k3 <- k_means(iris_num, k = 3, nstart = 20L) # Several random starts
chart(iris_k3)


# Get clusters and compare with Species
iris3 <- augment(iris_k3, iris) # Use predict() to just get clusters
head(iris3)
#> # A tibble: 6 × 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .cluster
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <fct>   
#> 1          5.1         3.5          1.4         0.2 setosa  3       
#> 2          4.9         3            1.4         0.2 setosa  3       
#> 3          4.7         3.2          1.3         0.2 setosa  3       
#> 4          4.6         3.1          1.5         0.2 setosa  3       
#> 5          5           3.6          1.4         0.2 setosa  3       
#> 6          5.4         3.9          1.7         0.4 setosa  3       
table(iris3$.cluster, iris3$Species) # setosa OK, the others are mixed a bit
#>    
#>     setosa versicolor virginica
#>   1      0         48        14
#>   2      0          2        36
#>   3     50          0         0
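
For readers without the package installed, the same comparison can be approximated in base R. This is only a sketch of the idea, not the augment() method itself: the names km3, iris_clustered and tab and the seed are illustrative, and the result is a plain data frame rather than a tibble.

```r
# Base R approximation of the augment() + table() steps above
iris <- datasets::iris
set.seed(1)  # illustrative seed; k-means starts from random centers
km3 <- stats::kmeans(iris[, -5], centers = 3, nstart = 20L)

# Append cluster membership as a factor column named .cluster
iris_clustered <- cbind(iris, .cluster = factor(km3$cluster))

# Cross-tabulate clusters against the known species
tab <- table(iris_clustered$.cluster, iris_clustered$Species)
tab
```

Cluster numbers are arbitrary labels, so the rows of the table may appear in a different order from one run to the next even when the partition itself is the same.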