K-means Clustering Algorithm: Ordering Objects to Select Centroids based on Textual Criticism Process
Existing methods of selecting initial centroids in k-means clustering mostly rely on random selection or on distance functions.
In this study, a new method is proposed for choosing the initial centroids, thereby enhancing the generic k-means clustering algorithm. The method is a computational technique from textual criticism for automatically identifying a base text among several textual documents. It consists of successively selecting a base text and collating it with the remaining documents, also known as witnesses. Each completed collation cycle stores the selected base text and its closest witness, along with a weighted score of their similarities and dissimilarities. The scores are weighted according to the effect that each type of textual alteration has on the process of reconstructing the original document.
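The collation cycle described above can be illustrated with a minimal Python sketch. It is not the study's Java implementation, and several parts are assumptions made only for illustration: the alteration weights in `WEIGHTS` are hypothetical values, the score counts only weighted dissimilarities at word level, and the closest witness of one cycle is taken as the base text of the next.

```python
from difflib import SequenceMatcher

# Hypothetical weights per alteration type (assumption; not taken from the study).
WEIGHTS = {"replace": 1.0, "delete": 0.75, "insert": 0.5}


def weighted_distance(base, witness):
    """Weighted count of word-level alterations between a base text and a witness."""
    a, b = base.split(), witness.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    score = 0.0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Weight each altered span by its type and by its length in words.
            score += WEIGHTS[tag] * max(i2 - i1, j2 - j1)
    return score


def collate(docs):
    """Run collation cycles over a dict of name -> text.

    Returns one (base, closest_witness, score) tuple per cycle.
    Assumption: the closest witness becomes the next base text.
    """
    remaining = dict(docs)
    base = next(iter(remaining))  # start from the first document
    cycles = []
    while len(remaining) > 1:
        base_text = remaining.pop(base)
        closest = min(
            remaining,
            key=lambda name: weighted_distance(base_text, remaining[name]),
        )
        score = weighted_distance(base_text, remaining[closest])
        cycles.append((base, closest, score))
        base = closest
    return cycles


docs = {
    "A": "the quick brown fox",
    "B": "the quick brown dog",
    "C": "a slow green turtle swims",
}
cycles = collate(docs)
# Chain the stored pairs into a list ordered by proximity.
ordered = [cycles[0][0]] + [closest for _, closest, _ in cycles]
```

Under these assumptions, chaining the stored pairs yields a list in which neighbouring documents are textually close to one another.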
At the end of the collation process, a list of documents arranged by degree of proximity is generated. Depending on the number of desired clusters, documents at equal intervals in this list can be selected as initial centroids.
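As a rough illustration of the selection step, the following Python sketch picks k documents at equal intervals from an already ordered list. The function name `pick_initial_centroids` is hypothetical, and taking the midpoint of each of the k equal-width segments is an assumption, since the exact positions are not specified here.

```python
def pick_initial_centroids(ordered_docs, k):
    """Pick k items at equal intervals from a list ordered by proximity.

    Assumption: the list is split into k equal-width segments and the
    middle item of each segment is taken as an initial centroid.
    """
    n = len(ordered_docs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of documents")
    # Integer arithmetic avoids floating-point rounding at segment midpoints.
    indices = [((2 * i + 1) * n) // (2 * k) for i in range(k)]
    return [ordered_docs[i] for i in indices]


# Example: 10 ordered documents, 3 desired clusters.
ordered_docs = [f"doc{i}" for i in range(10)]
centroids = pick_initial_centroids(ordered_docs, 3)
```

The vectors of the selected documents could then be passed to any k-means implementation that accepts explicit starting centres, in place of randomly drawn ones.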
The proposed method was implemented in Java and tested by clustering the Quranic chapters. The vector space representation of the chapters was generated in PHP, and the clustering phase was carried out in R.
The results obtained in this study are encouraging and show that the proposed method is superior to methods based on random selection of initial centroids.