How to resolve the algorithm K-means++ clustering step by step in the Kotlin programming language

Problem Statement
Step by Step Solution
Sourcecode

Problem Statement

The task is to implement the K-means++ algorithm. Produce a function which takes two arguments: the number of clusters K, and the dataset to classify. K is a positive integer and the dataset is a list of points in the Cartesian plane. The output is a list of clusters (related sets of points, according to the algorithm). For extra credit (in order): Extra credit is only awarded if the examples given demonstrate the feature in question. To earn credit for 1. and 2., visualize 6 clusters of 30,000 points in ℝ2. It is not necessary to provide visualization for spaces higher than ℝ2 but the examples should demonstrate features 3. and 4. if the solution has them.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm K-means++ clustering step by step in the Kotlin programming language

This code implements the K-means++ algorithm, which is a clustering algorithm that aims to find K clusters in a set of data points. The code is written in Kotlin programming language.

The algorithm starts by selecting K initial cluster centers. These centers are chosen randomly, but with a probability that is proportional to the distance between the center and the other data points. This ensures that the initial centers are spread out across the data set.

Once the initial centers have been chosen, the algorithm assigns each data point to the closest center. The data points are then used to update the cluster centers by computing the average position of all the data points in each cluster.

This process is repeated until the cluster centers no longer change, or until a maximum number of iterations has been reached.

The code uses a variety of helper functions to perform these tasks. The dist2 function computes the squared distance between two points. The nearest function finds the closest cluster center to a given data point. The kpp function selects the initial cluster centers using the K-means++ algorithm. The lloyd function implements the Lloyd's algorithm, which is a variant of the K-means algorithm that uses a centroid-based approach to update the cluster centers.

The printEps function prints the results of the clustering algorithm in the PostScript format. This allows the results to be visualized using a PostScript viewer.

The main function generates a set of data points and then runs the K-means++ algorithm on the data set. The results are printed to the standard output in the PostScript format.

Here is a more detailed explanation of the code:

The Point data class represents a point in a 2-dimensional space. It has three properties: x, y, and group. The x and y properties store the coordinates of the point, and the group property stores the cluster that the point belongs to.
The genXY function generates a list of points in a 2-dimensional space. The points are generated randomly, but with a probability that is proportional to the distance between the point and the center of the space. This ensures that the points are spread out across the space.
The dist2 function computes the squared distance between two points.
The nearest function finds the closest cluster center to a given data point. It does this by computing the squared distance between the data point and each cluster center, and then selecting the cluster center with the smallest squared distance.
The kpp function selects the initial cluster centers using the K-means++ algorithm. It does this by selecting the first cluster center randomly, and then selecting the subsequent cluster centers with a probability that is proportional to the distance between the cluster center and the other data points.
The lloyd function implements the Lloyd's algorithm, which is a variant of the K-means algorithm that uses a centroid-based approach to update the cluster centers. It does this by computing the average position of all the data points in each cluster, and then setting the cluster center to the average position.
The printEps function prints the results of the clustering algorithm in the PostScript format. This allows the results to be visualized using a PostScript viewer.
The main function generates a set of data points and then runs the K-means++ algorithm on the data set. The results are printed to the standard output in the PostScript format.

Source code in the kotlin programming language

// version 1.2.21

import java.util.Random
import kotlin.math.*

data class Point(var x: Double, var y: Double, var group: Int)

typealias LPoint = List<Point>
typealias MLPoint = MutableList<Point>

val origin get() = Point(0.0, 0.0, 0)
val r = Random()
val hugeVal = Double.POSITIVE_INFINITY

const val RAND_MAX = Int.MAX_VALUE
const val PTS = 100_000
const val K = 11
const val W = 400
const val H = 400

fun rand() = r.nextInt(RAND_MAX)

fun randf(m: Double) = m * rand() / (RAND_MAX - 1)

fun genXY(count: Int, radius: Double): LPoint {
    val pts = List(count) { origin }

    /* note: this is not a uniform 2-d distribution */
    for (i in 0 until count) {
        val ang = randf(2.0 * PI)
        val r = randf(radius)
        pts[i].x = r * cos(ang)
        pts[i].y = r * sin(ang)
    }
    return pts
}

fun dist2(a: Point, b: Point): Double {
    val x = a.x - b.x
    val y = a.y - b.y
    return x * x + y * y
}

fun nearest(pt: Point, cent: LPoint, nCluster: Int): Pair<Int, Double> {
    var minD = hugeVal
    var minI = pt.group
    for (i in 0 until nCluster) {
        val d = dist2(cent[i], pt)
        if (minD > d) {
            minD = d
            minI = i
        }
    }
    return minI to minD
}

fun kpp(pts: LPoint, len: Int, cent: MLPoint) {
    val nCent = cent.size
    val d = DoubleArray(len)
    cent[0] = pts[rand() % len].copy()
    for (nCluster in 1 until nCent) {
        var sum = 0.0
        for (j in 0 until len) {
            d[j] = nearest(pts[j], cent, nCluster).second
            sum += d[j]
        }
        sum = randf(sum)
        for (j in 0 until len) {
            sum -= d[j]
            if (sum > 0.0) continue
            cent[nCluster] = pts[j].copy()
            break
        }
    }
    for (j in 0 until len) pts[j].group = nearest(pts[j], cent, nCent).first
}

fun lloyd(pts: LPoint, len: Int, nCluster: Int): LPoint {
    val cent = MutableList(nCluster) { origin }
    kpp(pts, len, cent)
    do {
        /* group element for centroids are used as counters */
        for (i in 0 until nCluster) {
            with (cent[i]) { x = 0.0; y = 0.0; group = 0 }
        }
        for (j in 0 until len) {
            val p = pts[j]
            val c = cent[p.group]
            with (c) { group++; x += p.x; y += p.y }
        }
        for (i in 0 until nCluster) {
            val c = cent[i]
            c.x /= c.group
            c.y /= c.group
        }
        var changed = 0

        /* find closest centroid of each point */
        for (j in 0 until len) {
            val p = pts[j]
            val minI = nearest(p, cent, nCluster).first
            if (minI != p.group) {
                changed++
                p.group = minI
            }
        }
    }
    while (changed > (len shr 10))  /* stop when 99.9% of points are good */

    for (i in 0 until nCluster) cent[i].group = i
    return cent
}

fun printEps(pts: LPoint, len: Int, cent: LPoint, nCluster: Int) {
    val colors = DoubleArray(nCluster * 3)
    for (i in 0 until nCluster) {
        colors[3 * i + 0] = (3 * (i + 1) % 11) / 11.0
        colors[3 * i + 1] = (7 * i % 11) / 11.0
        colors[3 * i + 2] = (9 * i % 11) / 11.0
    }
    var minX = hugeVal
    var minY = hugeVal
    var maxX = -hugeVal
    var maxY = -hugeVal
    for (j in 0 until len) {
        val p = pts[j]
        if (maxX < p.x) maxX = p.x
        if (minX > p.x) minX = p.x
        if (maxY < p.y) maxY = p.y
        if (minY > p.y) minY = p.y
    }
    val scale = minOf(W / (maxX - minX), H / (maxY - minY))
    val cx = (maxX + minX) / 2.0
    val cy = (maxY + minY) / 2.0

    print("%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %${W + 10} ${H + 10}\n")
    print("/l {rlineto} def /m {rmoveto} def\n")
    print("/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n")
    print("/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath ")
    print("	gsave 1 setgray fill grestore gsave 3 setlinewidth")
    print(" 1 setgray stroke grestore 0 setgray stroke }def\n")
    val f1 = "%g %g %g setrgbcolor"
    val f2 = "%.3f %.3f c"
    val f3 = "\n0 setgray %g %g s"
    for (i in 0 until nCluster) {
        val c = cent[i]
        println(f1.format(colors[3 * i], colors[3 * i + 1], colors[3 * i + 2]))
        for (j in 0 until len) {
            val p = pts[j]
            if (p.group != i) continue
            println(f2.format((p.x - cx) * scale + W / 2, (p.y - cy) * scale + H / 2))
        }
        println(f3.format((c.x - cx) * scale + W / 2, (c.y - cy) * scale + H / 2))
    }
    print("\n%%%%EOF")
}

fun main(args: Array<String>) {
    val v = genXY(PTS, 10.0)
    val c = lloyd(v, PTS, K)
    printEps(v, PTS, c, K)
}

You may also check:How to resolve the algorithm Discordian date step by step in the F# programming language
You may also check:How to resolve the algorithm Base64 decode data step by step in the SenseTalk programming language
You may also check:How to resolve the algorithm Search a list step by step in the Smalltalk programming language
You may also check:How to resolve the algorithm String length step by step in the Modula-3 programming language
You may also check:How to resolve the algorithm Middle three digits step by step in the Oforth programming language

How to resolve the algorithm K-means++ clustering step by step in the Kotlin programming language

Table of Contents

Problem Statement

Step by Step solution about How to resolve the algorithm K-means++ clustering step by step in the Kotlin programming language

Source code in the kotlin programming language