neoml.Clustering

The neoml module provides several methods for clustering data.

K-means

K-means is one of the most widely used clustering algorithms.

On each step, the center of mass of each cluster is calculated, and then each vector is reassigned to the cluster with the nearest center. The algorithm stops at the step on which the in-cluster distances no longer change.
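The iteration described above can be sketched in plain NumPy. This is only an illustration of the general Lloyd scheme, not NeoML's implementation: the function name lloyd_kmeans and its init_centers argument are hypothetical, the distance is assumed to be Euclidean, and the stop condition is approximated by checking that no vector changed its cluster.

```python
import numpy as np

def lloyd_kmeans(data, init_centers, max_iteration_count=100):
    """Illustrative Lloyd iteration: reassign vectors, recompute centers, stop when stable."""
    centers = np.array(init_centers, dtype=float)
    labels = np.full(len(data), -1)
    for _ in range(max_iteration_count):
        # Squared Euclidean distance from every vector to every center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no vector changed its cluster, so the centers will not move either
        labels = new_labels
        for k in range(len(centers)):
            members = data[labels == k]
            if len(members):  # keep the previous center if a cluster becomes empty
                centers[k] = members.mean(axis=0)
    return labels, centers

# Two well-separated blobs; one starting center is taken from each blob.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, centers = lloyd_kmeans(data, init_centers=data[[0, -1]])
```

With these starting centers the first 50 vectors end up in cluster 0 and the remaining 50 in cluster 1. A poor choice of starting centers can leave the iteration in a worse local optimum, which is presumably why the class offers the init and run_count parameters.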

Class description

class neoml.Clustering.KMeans(max_iteration_count, cluster_count, algo='lloyd', init='default', distance='euclid', thread_count=1, run_count=1, seed=3306)

K-means clustering.

Parameters
  • max_iteration_count (int) – the maximum number of algorithm iterations.

  • cluster_count (int) – the number of clusters.

  • algo (str, {'elkan', 'lloyd'}, default='lloyd') – the algorithm used during clustering.

  • init (str, {'k++', 'default'}, default='default') – the algorithm used for selecting initial centers.

  • distance (str, {'euclid', 'machalanobis', 'cosine'}, default='euclid') – the distance function.

  • thread_count (int, > 0, default=1) – the number of threads.

  • run_count (int, > 0, default=1) – the number of runs; the result is the best of the runs (based on inertia).

  • seed (int, default=3306) – the initial seed for the random number generator.
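To illustrate how the best of several runs could be picked by inertia, here is a small sketch. It assumes the usual definition of inertia (the sum of squared Euclidean distances from each sample to its assigned center); the exact quantity NeoML computes may differ, and the helper name inertia and the two hand-made "runs" are hypothetical.

```python
import numpy as np

def inertia(data, labels, centers):
    """Sum of squared Euclidean distances from each sample to its assigned center."""
    diffs = data - centers[labels]
    return float((diffs ** 2).sum())

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Two hypothetical runs that ended in different partitions: (labels, centers).
run_a = (np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]]))
run_b = (np.array([0, 1, 0, 1]), np.array([[5.0, 0.0], [5.0, 2.0]]))

scores = [inertia(data, labels, centers) for labels, centers in (run_a, run_b)]
best_run = int(np.argmin(scores))  # index of the run with the lowest inertia
```

Here run_a groups the points by their first coordinate and gets a much lower inertia than run_b, so it would be the run to keep.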

clusterize(X, weight=None)

Performs clustering of the given data.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it is converted to dtype=np.float32; a sparse matrix is converted to scipy.sparse.csr_matrix.

  • weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.
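As an intuition for what sample weights do, the sketch below compares an unweighted and a weighted center of the same two points. It assumes that weights act as multiplicities when a center is averaged, which is a common convention but is not confirmed here for NeoML.

```python
import numpy as np

data = np.array([[0.0, 0.0], [4.0, 0.0]])
weight = np.array([1.0, 3.0])

# Unweighted center: both samples count equally (the weight=None case).
plain_center = data.mean(axis=0)
# Weighted center: the second sample pulls the center three times as hard.
weighted_center = np.average(data, axis=0, weights=weight)
```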

Returns

  • clusters - array of integers with cluster indices for each object of X;

  • centers - cluster centers;

  • vars - cluster variances.

Return type

  • tuple(clusters, centers, vars)

  • clusters - numpy.ndarray(numpy.int32) of shape (n_samples,)

  • centers - numpy.ndarray(numpy.float32) of shape (cluster_count, n_features)

  • vars - numpy.ndarray(numpy.float32) of shape (cluster_count, n_features)
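The returned arrays are ordinary NumPy arrays, so they can be combined with standard indexing. The sketch below does not call NeoML; it uses small hand-made stand-ins with the documented dtypes and shapes to show how cluster sizes, cluster members and per-sample distances to the assigned center might be derived.

```python
import numpy as np

# Stand-ins with the documented dtypes and shapes
# (n_samples=6, cluster_count=2, n_features=2).
data = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]], dtype=np.float32)
clusters = np.array([0, 0, 0, 1, 1, 1], dtype=np.int32)
centers = np.array([[1 / 3, 1 / 3], [26 / 3, 26 / 3]], dtype=np.float32)

# Number of samples assigned to each cluster.
sizes = np.bincount(clusters, minlength=len(centers))

# Samples that belong to cluster 1.
members_of_1 = data[clusters == 1]

# Euclidean distance from every sample to the center of its own cluster.
dist_to_center = np.linalg.norm(data - centers[clusters], axis=1)
```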

Example

import numpy as np
import neoml

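# 1000 random samples with 5 features each
data = np.random.rand(1000, 5)
# max_iteration_count has no default, so it must be passed explicitly
kmeans = neoml.Clustering.KMeans(max_iteration_count=100, cluster_count=4, init='k++', algo='elkan')
labels, centers, disps = kmeans.clusterize(data)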

ISODATA

The ISODATA clustering algorithm is based on the geometrical proximity of the data points. The clustering result depends heavily on the initial settings.

See Ball, Geoffrey H., Hall, David J. Isodata: a method of data analysis and pattern classification. (1965)

Class description

class neoml.Clustering.IsoData(init_cluster_count, max_cluster_count, min_cluster_size, max_iteration_count, min_cluster_distance, max_cluster_diameter, mean_diameter_coef)

IsoData clustering. A heuristic algorithm based on geometrical proximity of the data points.

Parameters
  • init_cluster_count (int) – the number of initial clusters. The initial cluster centers are randomly selected from the input data.

  • max_cluster_count (int) – the maximum number of clusters.

  • min_cluster_size (int) – the minimum cluster size.

  • max_iteration_count (int) – the maximum number of algorithm iterations.

  • min_cluster_distance (float) – the minimum distance between the clusters. Whenever two clusters are closer they are merged.

  • max_cluster_diameter (float) – the maximum cluster diameter. Whenever a cluster is larger it may be split.

  • mean_diameter_coef (float) – indicates how much the cluster diameter may exceed the mean diameter across all the clusters. If a cluster diameter is larger than the mean diameter multiplied by this value it may be split.
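The merge and split conditions described by the last three parameters can be sketched as simple checks. This is an illustration under assumptions: the helper name isodata_checks is hypothetical, cluster diameters are taken as given numbers (how NeoML measures a diameter is not specified here), center distances are Euclidean, and the two split conditions are reported separately because the text does not say how they are combined.

```python
import numpy as np

def isodata_checks(centers, diameters, min_cluster_distance,
                   max_cluster_diameter, mean_diameter_coef):
    """Return merge candidates and the two split-candidate flags per cluster."""
    # Pairs of clusters whose centers are closer than min_cluster_distance.
    merge_pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < min_cluster_distance:
                merge_pairs.append((i, j))
    diameters = np.asarray(diameters, dtype=float)
    # Clusters larger than the absolute limit.
    over_absolute = diameters > max_cluster_diameter
    # Clusters larger than the mean diameter scaled by mean_diameter_coef.
    over_relative = diameters > diameters.mean() * mean_diameter_coef
    return merge_pairs, over_absolute, over_relative

centers = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0]])
diameters = [1.0, 1.0, 7.0]
merge_pairs, over_abs, over_rel = isodata_checks(
    centers, diameters,
    min_cluster_distance=1.0, max_cluster_diameter=5.0, mean_diameter_coef=1.5)
```

In this toy input, clusters 0 and 1 are merge candidates because their centers are 0.5 apart, and cluster 2 exceeds both the absolute limit (5.0) and the relative limit (mean diameter 3.0 times 1.5).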

clusterize(X, weight=None)

Performs clustering of the given data.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it is converted to dtype=np.float32; a sparse matrix is converted to scipy.sparse.csr_matrix.

  • weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.

Returns

  • clusters - cluster indices for each object of X;

  • centers - cluster centers;

  • vars - cluster variances.

Return type

  • tuple(clusters, centers, vars)

  • clusters - numpy.ndarray(numpy.int32) of shape (n_samples,)

  • centers - numpy.ndarray(numpy.float32) of shape (n_clusters, n_features), where n_clusters is the number of clusters in the result

  • vars - numpy.ndarray(numpy.float32) of shape (n_clusters, n_features)

Example

import numpy as np
import neoml

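# 1000 random samples with 5 features each
data = np.random.rand(1000, 5)
# min_cluster_size has no default in the signature above; 10 is an arbitrary illustrative value
isodata = neoml.Clustering.IsoData(init_cluster_count=2, max_cluster_count=10,
                                   min_cluster_size=10, max_iteration_count=100,
                                   min_cluster_distance=1., max_cluster_diameter=10.,
                                   mean_diameter_coef=1.)
labels, centers, disps = isodata.clusterize(data)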

Hierarchical clustering

The library provides a “naive” implementation of upward (agglomerative) hierarchical clustering. The initial state has a cluster for every element. On each step, the two closest clusters are merged. Once the target number of clusters is reached, or all clusters are too far from each other to be merged, the process ends.
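The procedure above can be sketched in a few lines of NumPy. This is a simplified illustration, not the library's code: the function name naive_agglomerative is hypothetical, the distance between clusters is assumed to be the Euclidean distance between their centroids, and no attempt is made to be efficient.

```python
import numpy as np

def naive_agglomerative(data, max_cluster_distance, min_cluster_count):
    """Merge the two closest clusters until a stop condition is met."""
    # Start with one cluster per element (stored as lists of sample indices).
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > min_cluster_count:
        centroids = np.array([data[c].mean(axis=0) for c in clusters])
        # Find the closest pair of clusters.
        best, best_pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best:
                    best, best_pair = d, (i, j)
        if best > max_cluster_distance:
            break  # every remaining pair is too far apart to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Convert the index lists into one label per sample.
    labels = np.empty(len(data), dtype=np.int32)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

data = np.array([[0.0], [0.2], [0.4], [5.0], [5.1], [20.0]])
labels = naive_agglomerative(data, max_cluster_distance=1.0, min_cluster_count=1)
```

On this one-dimensional toy input the merging stops with three clusters, because the remaining centroids are more than max_cluster_distance apart even though min_cluster_count has not been reached.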

Class description

class neoml.Clustering.Hierarchical(max_cluster_distance, min_cluster_count, distance='euclid', linkage='centroid')

Hierarchical clustering. First, it creates a cluster per element, then merges the two closest clusters on each step until the final clustering is reached.

Parameters
  • max_cluster_distance (float) – the maximum distance between two clusters that still may be merged.

  • min_cluster_count (int) – the minimum number of clusters in the result.

  • distance (str, {'euclid', 'machalanobis', 'cosine'}, default='euclid') – the distance function.

  • linkage (str, {'centroid', 'single', 'average', 'complete', 'ward'}, default='centroid') – the approach used to calculate the distance between clusters.
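The linkage options differ in how the distance between two clusters is derived from their points. The sketch below uses the textbook definitions with Euclidean distance for four of them; the helper name linkage_distance is hypothetical, NeoML's exact formulas may differ, and 'ward' is left out because its definition varies between implementations.

```python
import numpy as np

def linkage_distance(a, b, linkage):
    """Textbook distance between point sets a and b for a given linkage."""
    # Pairwise Euclidean distances between every point of a and every point of b.
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if linkage == 'single':      # closest pair of points
        return float(pairwise.min())
    if linkage == 'complete':    # farthest pair of points
        return float(pairwise.max())
    if linkage == 'average':     # mean over all pairs
        return float(pairwise.mean())
    if linkage == 'centroid':    # distance between the cluster means
        return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))
    raise ValueError(f'unsupported linkage: {linkage}')

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [5.0, 0.0]])
results = {name: linkage_distance(a, b, name)
           for name in ('single', 'complete', 'average', 'centroid')}
```

For these two clusters the single linkage gives 2.0, the complete linkage 5.0, and both the average and centroid linkages 3.5.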

clusterize(X, weight=None)

Performs clustering of the given data.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it is converted to dtype=np.float32; a sparse matrix is converted to scipy.sparse.csr_matrix.

  • weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.

Returns

  • clusters - cluster indices for each object of X;

  • centers - cluster centers;

  • vars - cluster variances.

Return type

  • tuple(clusters, centers, vars)

  • clusters - numpy.ndarray(numpy.int32) of shape (n_samples,)

  • centers - numpy.ndarray(numpy.float32) of shape (n_clusters, n_features), where n_clusters is the number of clusters in the result

  • vars - numpy.ndarray(numpy.float32) of shape (n_clusters, n_features)

Example

import numpy as np
import neoml

# 1000 random samples with 5 features each
data = np.random.rand(1000, 5)
hierarchical = neoml.Clustering.Hierarchical(max_cluster_distance=2., min_cluster_count=2,
                                             distance='euclid')
labels, centers, disps = hierarchical.clusterize(data)

First come clustering

A simple clustering algorithm that makes a single pass through the data set. Each new vector is added to the nearest cluster or, if all the clusters are too far away, a new cluster is created for this vector. At the end, the clusters that are too small are destroyed and their vectors are redistributed.
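The single pass can be sketched as follows. This is a simplified illustration rather than NeoML's implementation: the function name first_come is hypothetical, the distance is the plain Euclidean distance to the running mean of each cluster (the variance-based handling controlled by min_vector_count and default_variance is ignored), and the final removal of small clusters is omitted.

```python
import numpy as np

def first_come(data, threshold, max_cluster_count=100):
    """One pass: join the nearest cluster or open a new one if all are too far."""
    centers, counts = [], []
    labels = np.empty(len(data), dtype=np.int32)
    for idx, vector in enumerate(data):
        if centers:
            dists = [np.linalg.norm(vector - c) for c in centers]
            nearest = int(np.argmin(dists))
        too_far = bool(centers) and dists[nearest] > threshold
        if not centers or (too_far and len(centers) < max_cluster_count):
            # Open a new cluster seeded with this vector.
            centers.append(vector.astype(float))
            counts.append(1)
            labels[idx] = len(centers) - 1
        else:
            # Join the nearest cluster and update its running mean.
            counts[nearest] += 1
            centers[nearest] += (vector - centers[nearest]) / counts[nearest]
            labels[idx] = nearest
    return labels, np.array(centers)

data = np.array([[0.0], [0.5], [10.0], [0.2], [10.4]])
labels, centers = first_come(data, threshold=2.0)
```

Because the data is visited only once, the result depends on the order of the vectors: a different ordering of the same samples can open clusters in different places.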

Class description

class neoml.Clustering.FirstCome(min_vector_count=4, default_variance=1.0, threshold=0.0, min_cluster_size_ratio=0.05, max_cluster_count=100, distance='euclid')

First come clustering creates a new cluster for each new vector that is far enough from the clusters already existing.

Parameters
  • min_vector_count (int, > 0, default=4) – the smallest number of vectors in a cluster to consider that the variance is valid.

  • default_variance (float, default=1.0) – the default variance (for when the number of vectors is smaller than min_vector_count).

  • threshold (float, default=0.0) – the distance threshold for creating a new cluster.

  • min_cluster_size_ratio (float, default=0.05) – the minimum ratio of the number of elements in a cluster to the total number of vectors.

  • max_cluster_count (int, default=100) – the maximum number of clusters; this prevents the algorithm from diverging when the data varies greatly.

  • distance (str, {'euclid', 'machalanobis', 'cosine'}, default='euclid') – the distance function to measure cluster size.

clusterize(X, weight=None)

Performs clustering of the given data.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, it is converted to dtype=np.float32; a sparse matrix is converted to scipy.sparse.csr_matrix.

  • weight (array-like of shape (n_samples,) or None, default=None) – sample weights. If None, then samples are equally weighted.

Returns

  • clusters - cluster indices for each object of X;

  • centers - cluster centers;

  • vars - cluster variances.

Return type

  • tuple(clusters, centers, vars)

  • clusters - numpy.ndarray(numpy.int32) of shape (n_samples,)

  • centers - numpy.ndarray(numpy.float32) of shape (n_clusters, n_features), where n_clusters is the number of clusters in the result

  • vars - numpy.ndarray(numpy.float32) of shape (n_clusters, n_features)

Example

import numpy as np
import neoml

# 1000 random samples with 5 features each
data = np.random.rand(1000, 5)
first_come = neoml.Clustering.FirstCome(min_vector_count=5, default_variance=2.,
                                        threshold=0., min_cluster_size_ratio=0.1,
                                        max_cluster_count=25, distance='euclid')
labels, centers, disps = first_come.clusterize(data)