neoml.Clustering
The neoml module provides several methods for clustering data.
K-means
The K-means method is one of the most popular clustering algorithms.
On each step, the center of mass of each cluster is calculated, and then the vectors are reassigned to the cluster with the nearest center. The algorithm stops at the step where the in-cluster distance no longer changes.
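The iteration described above can be sketched in plain NumPy. This is only an illustration of the textbook Lloyd scheme, not NeoML's implementation: the function name `lloyd_kmeans`, the random choice of initial centers, and the stopping test (no sample changes cluster, which implies the in-cluster distance has stopped changing) are assumptions made for the example.

```python
import numpy as np


def lloyd_kmeans(X, cluster_count, max_iteration_count=100, seed=3306):
    """Illustrative Lloyd iteration; not NeoML's implementation."""
    rng = np.random.default_rng(seed)
    # Pick distinct samples as the initial centers (an assumption of this sketch).
    centers = X[rng.choice(len(X), size=cluster_count, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iteration_count):
        # Squared Euclidean distance from every sample to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no sample changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of its assigned samples.
        for k in range(cluster_count):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels, centers


# Two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, centers = lloyd_kmeans(data, cluster_count=2)
```

On data this clearly separated, each blob ends up in its own cluster regardless of which samples are drawn as initial centers.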
Class description

class neoml.Clustering.KMeans(max_iteration_count, cluster_count, algo='lloyd', init='default', distance='euclid', thread_count=1, run_count=1, seed=3306)
K-means clustering.
 Parameters
max_iteration_count (int) – the maximum number of algorithm iterations.
cluster_count (int) – the number of clusters.
algo (str, {'elkan', 'lloyd'}, default='lloyd') – the algorithm used during clustering.
init (str, {'k++', 'default'}, default='default') – the algorithm used for selecting initial centers.
distance (str, {'euclid', 'machalanobis', 'cosine'}, default='euclid') – the distance function.
thread_count (int, > 0, default=1) – the number of threads.
run_count (int, > 0, default=1) – the number of runs; the result is the best of the runs (based on inertia).
seed (int, default=3306) – the initial seed for the random number generator.
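To make the "best of the runs (based on inertia)" rule concrete, here is a small sketch. It assumes inertia means the sum of squared Euclidean distances from each sample to its assigned center, which is the usual definition; how NeoML computes it exactly (sample weights, other distance functions) is not stated here. The helper `inertia` and the two hand-written candidate results are hypothetical.

```python
import numpy as np


def inertia(X, labels, centers):
    """Sum of squared Euclidean distances from each sample to its assigned center."""
    diffs = X - centers[labels]
    return float((diffs ** 2).sum())


X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Two candidate results, as if produced by two runs with different seeds.
run_a = (np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]]))
run_b = (np.array([0, 1, 0, 1]), np.array([[5.0, 0.0], [5.0, 2.0]]))

# Keep the run with the lowest inertia.
best_labels, best_centers = min((run_a, run_b), key=lambda r: inertia(X, *r))
```

Here `run_a` groups the points that are actually close together, so its inertia is much lower and it is the one selected.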

clusterize(X, weight=None)
Performs clustering of the given data.
 Parameters
X (array-like or sparse matrix of shape (n_samples, n_features)) – the input sample. Internally, it will be converted to dtype=np.float32, and if a sparse matrix is provided, to a scipy.sparse.csr_matrix.
weight (array-like of shape (n_samples,)) – sample weights. If None, then samples are equally weighted. None by default.
 Returns
clusters – array of integers with cluster indices for each object of X;
centers – cluster centers;
vars – cluster variances.
 Return type
tuple(clusters, centers, vars)
clusters – numpy.ndarray(numpy.int32) of shape (n_samples,)
centers – numpy.ndarray(numpy.float32) of shape (cluster_count, n_features)
vars – numpy.ndarray(numpy.float32) of shape (cluster_count, n_features)
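Example
import numpy as np
import neoml

# Random sample: 1000 points with 5 features
data = np.random.rand(1000, 5)
# max_iteration_count has no default; 100 is an arbitrary illustrative value
kmeans = neoml.Clustering.KMeans(max_iteration_count=100, cluster_count=4,
                                 init='k++', algo='elkan')
labels, centers, disps = kmeans.clusterize(data)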
ISODATA
The ISODATA clustering algorithm is based on the geometrical proximity of the data points. The clustering result depends greatly on the initial settings.
See Ball, Geoffrey H., Hall, David J. Isodata: a method of data analysis and pattern classification. (1965)
Class description

class neoml.Clustering.IsoData(init_cluster_count, max_cluster_count, min_cluster_size, max_iteration_count, min_cluster_distance, max_cluster_diameter, mean_diameter_coef)
IsoData clustering. A heuristic algorithm based on geometrical proximity of the data points.
 Parameters
init_cluster_count (int) – the number of initial clusters. The initial cluster centers are randomly selected from the input data.
max_cluster_count (int) – the maximum number of clusters.
min_cluster_size (int) – the minimum cluster size.
max_iteration_count (int) – the maximum number of algorithm iterations.
min_cluster_distance (float) – the minimum distance between the clusters. Whenever two clusters are closer than this, they are merged.
max_cluster_diameter (float) – the maximum cluster diameter. Whenever a cluster is larger than this, it may be split.
mean_diameter_coef (float) – indicates how much the cluster diameter may exceed the mean diameter across all the clusters. If a cluster diameter is larger than the mean diameter multiplied by this value, it may be split.
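The merge and split thresholds above can be read as simple tests on the current clusters. The sketch below is one possible interpretation, not NeoML's code: it assumes the distance between clusters is the Euclidean distance between their centers, takes per-cluster diameters as given (how a diameter is measured is not specified here), and treats the two split conditions as alternatives. The helper names are hypothetical.

```python
import numpy as np


def merge_candidates(centers, min_cluster_distance):
    """Pairs of clusters whose centers are closer than min_cluster_distance."""
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < min_cluster_distance:
                pairs.append((i, j))
    return pairs


def split_candidates(diameters, max_cluster_diameter, mean_diameter_coef):
    """Indices of clusters that exceed either diameter limit."""
    diameters = np.asarray(diameters, dtype=float)
    too_large = diameters > max_cluster_diameter
    above_mean = diameters > diameters.mean() * mean_diameter_coef
    return np.flatnonzero(too_large | above_mean).tolist()


centers = np.array([[0.0, 0.0], [0.5, 0.0], [8.0, 0.0]])
diameters = [1.0, 1.0, 7.0]  # hypothetical per-cluster diameters

to_merge = merge_candidates(centers, min_cluster_distance=1.0)
to_split = split_candidates(diameters, max_cluster_diameter=10.0, mean_diameter_coef=2.0)
```

In this toy state the first two clusters are merge candidates, and the third is a split candidate because its diameter exceeds twice the mean diameter even though it is below `max_cluster_diameter`.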

clusterize(X, weight=None)
Performs clustering of the given data.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it will be converted to dtype=np.float32, and if a sparse matrix is provided, to a scipy.sparse.csr_matrix.
weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.
 Returns
clusters – cluster indices for each object of X;
centers – cluster centers;
vars – cluster variances.
 Return type
tuple(clusters, centers, vars)
clusters – numpy.ndarray(numpy.int32) of shape (n_samples,)
centers – numpy.ndarray(numpy.float32) of shape (n_clusters, n_features), where n_clusters is the number of clusters found
vars – numpy.ndarray(numpy.float32) of shape (n_clusters, n_features)
Example
import numpy as np
import neoml

# Random sample: 1000 points with 5 features
data = np.random.rand(1000, 5)
# min_cluster_size has no default; 1 is an arbitrary illustrative value
isodata = neoml.Clustering.IsoData(init_cluster_count=2, max_cluster_count=10,
                                   min_cluster_size=1, max_iteration_count=100,
                                   min_cluster_distance=1., max_cluster_diameter=10.,
                                   mean_diameter_coef=1.)
labels, centers, disps = isodata.clusterize(data)
Hierarchical clustering
The library provides a “naive” implementation of upward (agglomerative) hierarchical clustering. The initial state has a cluster for every element. On each step, the two closest clusters are merged. Once the target number of clusters is reached, or all clusters are too far from each other to be merged, the process ends.
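The procedure above is short enough to write out directly. The following is a minimal sketch under stated assumptions, not NeoML's implementation: it uses Euclidean distance between cluster centroids, recomputes everything on each step, and the function name `naive_hierarchical` is made up for the example.

```python
import numpy as np


def naive_hierarchical(X, max_cluster_distance, min_cluster_count):
    """Illustrative bottom-up clustering with centroid linkage; not NeoML's code."""
    # Start with one cluster per element.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > min_cluster_count:
        centers = np.array([X[c].mean(axis=0) for c in clusters])
        # Find the closest pair of cluster centers.
        best, best_pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centers[i] - centers[j])
                if d < best:
                    best, best_pair = d, (i, j)
        # Stop if even the closest pair is too far apart to merge.
        if best > max_cluster_distance:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = np.empty(len(X), dtype=np.int32)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
labels = naive_hierarchical(points, max_cluster_distance=1.0, min_cluster_count=2)
```

The run stops with three clusters rather than the requested minimum of two, because the remaining clusters are farther apart than `max_cluster_distance`.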
Class description

class neoml.Clustering.Hierarchical(max_cluster_distance, min_cluster_count, distance='euclid', linkage='centroid')
Hierarchical clustering. First, it creates a cluster per element, then merges the closest clusters on each step until the final result is reached.
 Parameters
max_cluster_distance (float) – the maximum distance between two clusters that still may be merged.
min_cluster_count (int) – the minimum number of clusters in the result.
distance (str, {'euclid', 'machalanobis', 'cosine'}, default='euclid') – the distance function.
linkage (str, {'centroid', 'single', 'average', 'complete', 'ward'}, default='centroid') – the approach used for distance calculation between clusters
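For orientation, the linkage names correspond to well-known ways of measuring the distance between two clusters. The sketch below uses the textbook definitions with Euclidean distance; NeoML's exact formulas (for example, whether squared distances are used) may differ, and `linkage_distance` is a hypothetical helper.

```python
import numpy as np


def linkage_distance(A, B, linkage="centroid"):
    """Distance between clusters A and B (arrays of shape (n, n_features))."""
    # All pairwise Euclidean distances between members of A and members of B.
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if linkage == "single":
        return float(pairwise.min())
    if linkage == "complete":
        return float(pairwise.max())
    if linkage == "average":
        return float(pairwise.mean())
    center_gap = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    if linkage == "centroid":
        return float(center_gap)
    if linkage == "ward":
        # Increase in within-cluster sum of squares caused by merging A and B.
        return float(len(A) * len(B) / (len(A) + len(B)) * center_gap ** 2)
    raise ValueError(f"unknown linkage: {linkage}")


A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[4.0, 0.0], [8.0, 0.0]])
distances = {name: linkage_distance(A, B, name)
             for name in ("single", "complete", "average", "centroid", "ward")}
```

Single linkage looks only at the closest pair of members and complete linkage at the farthest pair, so the same two clusters can be "near" under one setting and "far" under another.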

clusterize(X, weight=None)
Performs clustering of the given data.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it will be converted to dtype=np.float32, and if a sparse matrix is provided, to a scipy.sparse.csr_matrix.
weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.
 Returns
clusters – cluster indices for each object of X;
centers – cluster centers;
vars – cluster variances.
 Return type
tuple(clusters, centers, vars)
clusters – numpy.ndarray(numpy.int32) of shape (n_samples,)
centers – numpy.ndarray(numpy.float32) of shape (n_clusters, n_features), where n_clusters is the number of clusters found
vars – numpy.ndarray(numpy.float32) of shape (n_clusters, n_features)
Example
import numpy as np
import neoml

# Random sample: 1000 points with 5 features
data = np.random.rand(1000, 5)
hierarchical = neoml.Clustering.Hierarchical(max_cluster_distance=2., min_cluster_count=2,
                                             distance='euclid')
labels, centers, disps = hierarchical.clusterize(data)
First come clustering
A simple clustering algorithm that makes only one pass through the data set. Each new vector is added to the nearest cluster or, if all the clusters are too far away, a new cluster is created for this vector. At the end, the clusters that are too small are destroyed and their vectors redistributed.
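A single pass of this kind can be sketched as follows. This is an illustration under simplifying assumptions, not NeoML's implementation: it uses plain Euclidean distance to the running cluster mean, ignores variances, assumes at least one cluster survives the size filter, and the function name `first_come` is hypothetical.

```python
import numpy as np


def first_come(X, threshold, min_cluster_size_ratio=0.05, max_cluster_count=100):
    """Illustrative single-pass clustering; not NeoML's implementation."""
    sums, counts, labels = [], [], []
    for x in X:
        # Distance from x to the running mean of every existing cluster.
        dists = [np.linalg.norm(x - s / n) for s, n in zip(sums, counts)]
        nearest = int(np.argmin(dists)) if dists else -1
        if not dists or (dists[nearest] > threshold and len(sums) < max_cluster_count):
            # Every cluster is too far: start a new one with this vector.
            sums.append(x.astype(float))
            counts.append(1)
            labels.append(len(sums) - 1)
        else:
            sums[nearest] += x
            counts[nearest] += 1
            labels.append(nearest)
    # Destroy clusters that are too small (assumes at least one cluster is kept).
    min_size = min_cluster_size_ratio * len(X)
    keep = [k for k, n in enumerate(counts) if n >= min_size]
    kept_centers = np.array([sums[k] / counts[k] for k in keep])
    remap = {old: new for new, old in enumerate(keep)}
    result = np.empty(len(X), dtype=np.int32)
    for i, old in enumerate(labels):
        if old in remap:
            result[i] = remap[old]
        else:
            # Vector from a destroyed cluster: move it to the nearest kept center.
            result[i] = np.linalg.norm(kept_centers - X[i], axis=1).argmin()
    return result, kept_centers


points = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [20.0, 20.0]])
labels, centers = first_come(points, threshold=1.0, min_cluster_size_ratio=0.3)
```

The outlier at (20, 20) first gets its own cluster, which is then destroyed for being too small, and the point is reassigned to the nearest surviving cluster.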
Class description

class neoml.Clustering.FirstCome(min_vector_count=4, default_variance=1.0, threshold=0.0, min_cluster_size_ratio=0.05, max_cluster_count=100, distance='euclid')
First come clustering creates a new cluster for each new vector that is far enough from the clusters already existing.
 Parameters
min_vector_count (int, > 0, default=4) – the smallest number of vectors in a cluster to consider that the variance is valid.
default_variance (float, default=1.0) – the default variance (for when the number of vectors is smaller than min_vector_count).
threshold (float, default=0.0) – the distance threshold for creating a new cluster.
min_cluster_size_ratio (float, default=0.05) – the minimum ratio of the number of elements in a cluster to the total number of vectors.
max_cluster_count (int, default=100) – the maximum number of clusters to prevent algorithm divergence in case of great differences in data.
distance (str, {'euclid', 'machalanobis', 'cosine'}, default='euclid') – the distance function to measure cluster size.

clusterize(X, weight=None)
Performs clustering of the given data.
 Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it will be converted to dtype=np.float32, and if a sparse matrix is provided, to a scipy.sparse.csr_matrix.
weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.
 Returns
clusters – cluster indices for each object of X;
centers – cluster centers;
vars – cluster variances.
 Return type
tuple(clusters, centers, vars)
clusters – numpy.ndarray(numpy.int32) of shape (n_samples,)
centers – numpy.ndarray(numpy.float32) of shape (n_clusters, n_features), where n_clusters is the number of clusters found
vars – numpy.ndarray(numpy.float32) of shape (n_clusters, n_features)
Example
import numpy as np
import neoml

# Random sample: 1000 points with 5 features
data = np.random.rand(1000, 5)
first_come = neoml.Clustering.FirstCome(min_vector_count=5, default_variance=2.,
                                        threshold=0., min_cluster_size_ratio=0.1,
                                        max_cluster_count=25, distance='euclid')
labels, centers, disps = first_come.clusterize(data)