neoml.ClassificationRegression

The neoml module provides various methods for solving classification and regression problems.

Each of these algorithms accepts the training data and input data in sparse and dense formats.
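
For example, a minimal sketch (with made-up toy data) of passing the same training set either as a dense NumPy array or as a SciPy csr_matrix to the GradientBoostClassifier described below:

    import numpy as np
    from scipy.sparse import csr_matrix
    import neoml

    X_dense = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]], dtype=np.float32)
    X_sparse = csr_matrix(X_dense)  # the same data in sparse format
    y = [0, 1, 1, 0]

    # Either representation can be passed to train()
    model_dense = neoml.GradientBoost.GradientBoostClassifier(iteration_count=10).train(X_dense, y)
    model_sparse = neoml.GradientBoost.GradientBoostClassifier(iteration_count=10).train(X_sparse, y)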

Gradient tree boosting

Gradient boosting method creates an ensemble of decision trees using random subsets of features and input data. The algorithm only accepts continuous features. If your data contains discrete features, you will need to transform them into continuous ones (for example, using binarization).
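
For instance, a discrete feature with the values {0, 1, 2} can be binarized into three continuous 0/1 columns (one-hot encoding); a minimal NumPy sketch with made-up data:

    import numpy as np

    # A discrete feature column with three possible values
    color = np.array([0, 2, 1, 0])
    # Binarize it into three continuous columns that gradient boosting can consume
    color_binarized = np.eye(3, dtype=np.float32)[color]
    # array([[1., 0., 0.],
    #        [0., 0., 1.],
    #        [0., 1., 0.],
    #        [1., 0., 0.]], dtype=float32)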

Classification

class neoml.GradientBoost.GradientBoostClassifier(*args: Any, **kwargs: Any)

Gradient boosting for classification. Gradient boosting method creates an ensemble of decision trees using random subsets of features and input data.

Parameters:
  • loss (str, {'exponential', 'binomial', 'squared_hinge', 'l2'}, default='binomial') – the loss function to be optimized. 'binomial' refers to deviance (= logistic regression) for classification with probabilistic outputs. 'exponential' is similar to the AdaBoost algorithm.

  • iteration_count (int, default=100) – the maximum number of iterations (that is, the number of trees in the ensemble).

  • learning_rate (float, default=0.1) – the multiplier for each classifier tree. There is a trade-off between learning_rate and iteration_count.

  • subsample (float, [0..1], default=1.0) – the fraction of input data that is used for building one tree.

  • subfeature (float, [0..1], default=1.0) – the fraction of features that is used for building one tree.

  • random_seed (int, default=0) – the random generator seed number.

  • max_depth (int, default=10) – the maximum depth of a tree in ensemble.

  • max_node_count (int, default=-1) – the maximum number of nodes in a tree. -1 means no limitation.

  • l1_reg (float, default=0.0) – the L1 regularization factor.

  • l2_reg (float, default=1.0) – the L2 regularization factor.

  • prune (float, default=0.0) – the value of criterion difference when the nodes should be merged. The 0 default value means never merge nodes.

  • thread_count (int, default=1) – the number of processing threads to be used while training the model.

  • builder_type (str, {'full', 'hist', 'multi_full'}, default='full') – the type of tree builder used. 'full' means all feature values are used for splitting nodes. 'hist' means the steps of a histogram created from feature values will be used for splitting nodes. 'multi_full' means the 'full' mode with multi-class trees.

  • max_bins (int, default=32) – the largest possible histogram size to be used in hist mode.

  • min_subtree_weight (float, default=0.0) – the minimum subtree weight. The 0 default value means no lower limit.

train(X, Y, weight=None)

Trains the gradient boosting model for classification.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the training sample. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct class labels (int) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then samples are equally weighted.

Returns:

the trained classification model.

Return type:

neoml.GradientBoost.GradientBoostClassificationModel

class neoml.GradientBoost.GradientBoostClassificationModel(value)

Gradient boosting classification model.

classify(X)

Gets the classification results for the input sample.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input vectors, put into a matrix. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

Returns:

the predictions of class probability for each input vector.

Return type:

generator of ndarray of shape (n_samples, n_classes)

store(path)

Saves the model at the given location.

Parameters:

path (str) – the full path to where the model should be saved.
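
A minimal end-to-end sketch of training, classifying with, and storing a gradient boosting classifier (the toy data, parameter values, and file name are made up):

    import numpy as np
    import neoml

    X = np.array([[0.1, 0.9], [0.2, 0.8], [0.15, 0.7],
                  [0.9, 0.1], [0.8, 0.2], [0.7, 0.25]], dtype=np.float32)
    y = np.array([0, 0, 0, 1, 1, 1])

    boosting = neoml.GradientBoost.GradientBoostClassifier(
        loss='binomial', iteration_count=50, learning_rate=0.3, max_depth=3)
    model = boosting.train(X, y)

    probs = model.classify(X)      # class probability predictions for each input vector
    model.store('boosting.model')  # serialize the trained model to the given path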

Regression

class neoml.GradientBoost.GradientBoostRegressor(*args: Any, **kwargs: Any)

Gradient boosting for regression. Gradient boosting method creates an ensemble of decision trees using random subsets of features and input data.

Parameters:
  • loss (str, {'l2'}, default='l2') – the loss function to be optimized. The quadratic loss L2 is the only one supported.

  • iteration_count (int, default=100) – the maximum number of iterations (that is, the number of trees in the ensemble).

  • learning_rate (float, default=0.1) – the multiplier for each tree. There is a trade-off between learning_rate and iteration_count.

  • subsample (float, [0..1], default=1.0) – the fraction of input data that is used for building one tree.

  • subfeature (float, [0..1], default=1.0) – the fraction of features that is used for building one tree.

  • random_seed (int, default=0) – the random generator seed number.

  • max_depth (int, default=10) – the maximum depth of a tree in ensemble.

  • max_node_count (int, default=-1) – the maximum number of nodes in a tree. -1 means no limitation.

  • l1_reg (float, default=0.0) – the L1 regularization factor.

  • l2_reg (float, default=1.0) – the L2 regularization factor.

  • prune (float, default=0.0) – the value of criterion difference when the nodes should be merged. The 0 default value means never merge nodes.

  • thread_count (int, default=1) – the number of processing threads to be used while training the model.

  • builder_type (str, {'full', 'hist'}, default='full') – the type of tree builder used. full means all feature values are used for splitting nodes. hist means the steps of a histogram created from feature values will be used for splitting nodes.

  • max_bins (int, default=32) – the largest possible histogram size to be used in hist mode.

  • min_subtree_weight (float, default=0.0) – the minimum subtree weight. The 0 default value means no lower limit.

train(X, Y, weight=None)

Trains the gradient boosting model for regression.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the training sample. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct function values (float) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then samples are equally weighted.

Returns:

the trained regression model.

Return type:

neoml.GradientBoost.GradientBoostRegressionModel

class neoml.GradientBoost.GradientBoostRegressionModel(value)

Gradient boosting regression model.

predict(X)

Predicts the value of the function.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input vectors. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

Returns:

the predictions of the function value on each input vector.

Return type:

generator of ndarray of shape (n_samples)

store(path)

Saves the model at the given location.

Parameters:

path (str) – the full path to where the model should be saved.
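
A minimal sketch of training a gradient boosting regressor and predicting on the training vectors (the toy data and parameter values are made up):

    import numpy as np
    import neoml

    # The target is roughly 2 * x1 + x2
    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]], dtype=np.float32)
    y = np.array([1.0, 2.0, 3.0, 5.0], dtype=np.float32)

    regressor = neoml.GradientBoost.GradientBoostRegressor(
        iteration_count=50, learning_rate=0.3, max_depth=3)
    model = regressor.train(X, y)
    predictions = model.predict(X)  # predicted function values, one per input vector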

Linear

A linear classifier finds a hyperplane that divides the feature space in half.

Classification

class neoml.Linear.LinearClassifier(*args: Any, **kwargs: Any)

Linear binary classifier.

Parameters:
  • loss (str, {'binomial', 'squared_hinge', 'smoothed_hinge'}, default='binomial') – the loss function to be optimized. binomial refers to deviance (= logistic regression) for classification with probabilistic outputs.

  • max_iteration_count (int, default=1000) – the maximum number of iterations.

  • error_weight (float, default=1.0) – the error weight relative to the regularization coefficient.

  • sigmoid (array of 2 float, default=(0.0, 0.0)) – the predefined sigmoid function coefficients.

  • tolerance (float, default=-1.0) – the stop criterion. -1 means the stop criterion will be calculated automatically from the number of vectors in each class in the training sample.

  • normalizeError (bool, default=False) – specifies if the error should be normalized.

  • l1_reg (float, default=0.0) – the L1 regularization coefficient. If 0, L2 regularization will be used instead.

  • thread_count (int, default=1) – the number of threads to be used while training the model.

  • multiclass_mode (str, ['one_vs_all', 'one_vs_one'], default='one_vs_all') – determines how to handle multi-class classification.

train(X, Y, weight=None)

Trains the linear classification model.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the training sample. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct class labels (int) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then samples are equally weighted.

Returns:

the trained classification model.

Return type:

neoml.Linear.LinearClassificationModel

class neoml.Linear.LinearClassificationModel(internal)

Linear binary classification model.

classify(X)

Gets the classification results for the input sample.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, the values will be converted to dtype=np.float32, and a sparse matrix will be converted to a sparse csr_matrix.

Returns:

predictions of the input samples.

Return type:

generator of ndarray of shape (n_samples, n_classes)
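
A minimal sketch of training a linear classifier on linearly separable toy data (the data and parameter values are made up):

    import numpy as np
    import neoml

    X = np.array([[-1.0, -1.0], [-0.5, -1.5], [1.0, 1.0], [1.5, 0.5]], dtype=np.float32)
    y = np.array([0, 0, 1, 1])

    linear = neoml.Linear.LinearClassifier(loss='binomial')
    model = linear.train(X, y)
    probs = model.classify(X)  # per-class probability predictions for each input vector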

Regression

class neoml.Linear.LinearRegressor(*args: Any, **kwargs: Any)

Linear regressor.

Parameters:
  • loss (str, {'l2'}, default='l2') – the loss function to be optimized. The quadratic loss L2 is the only one supported.

  • max_iteration_count (int, default=1000) – the maximum number of iterations.

  • error_weight (float, default=1.0) – the error weight relative to the regularization coefficient.

  • sigmoid (array of 2 float, default=(0.0, 0.0)) – the predefined sigmoid function coefficients.

  • tolerance (float, default=-1.0) – the stop criterion. -1 means the stop criterion will be calculated automatically from the number of vectors in each class in the training sample.

  • normalizeError (bool, default=False) – specifies if the error should be normalized.

  • l1_reg (float, default=0.0) – the L1 regularization coefficient. If 0, L2 regularization will be used instead.

  • thread_count (int, default=1) – the number of threads to be used while training the model.

train(X, Y, weight=None)

Trains the linear regression model.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the training sample. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct function values (float) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then samples are equally weighted.

Returns:

the trained regression model.

Return type:

neoml.Linear.LinearRegressionModel

class neoml.Linear.LinearRegressionModel(internal)

Linear regression model.

predict(X)

Predicts the value of the function.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input vectors. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

Returns:

the predictions of the function value on each input vector.

Return type:

generator of ndarray of shape (n_samples)
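
A minimal sketch of fitting a linear regressor to noisy one-dimensional data (the data is made up):

    import numpy as np
    import neoml

    # The target is approximately 2 * x
    X = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)
    y = np.array([2.1, 3.9, 6.2, 8.0], dtype=np.float32)

    regressor = neoml.Linear.LinearRegressor()
    model = regressor.train(X, y)
    predictions = model.predict(X)  # predicted function values, one per input vector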

Support-vector machine

Support-vector machine (SVM) translates the input data into vectors in a high-dimensional space and searches for a maximum-margin dividing hyperplane.

class neoml.SVM.SvmClassifier(*args: Any, **kwargs: Any)

Support-vector machine (SVM) classifier.

Parameters:
  • kernel (str, {'linear', 'poly', 'rbf', 'sigmoid'}, default='linear') – the kernel function to be used.

  • max_iteration_count (int, default=1000) – the maximum number of iterations.

  • error_weight (float, default=1.0) – the error weight relative to the regularization function.

  • degree (int, default=1) – the degree for the 'poly' kernel.

  • gamma (float, default=1.0) – the kernel coefficient for poly, rbf, sigmoid.

  • coeff0 (float, default=1.0) – the kernel free term for poly, sigmoid.

  • tolerance (float, default=0.1) – the algorithm precision.

  • thread_count (int, default=1) – The number of processing threads to be used while training the model.

  • multiclass_mode (str, ['one_vs_all', 'one_vs_one'], default='one_vs_all') – determines how to handle multi-class classification.

train(X, Y, weight=None)

Trains the SVM classification model.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the training sample. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct class labels (int) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then samples are equally weighted.

Returns:

the trained classification model.

Return type:

neoml.SVM.SvmClassificationModel

class neoml.SVM.SvmClassificationModel(internal)

Support-vector machine (SVM) classification model.

classify(X)

Gets the classification results for the input sample.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input sample. Internally, the values will be converted to dtype=np.float32, and a sparse matrix will be converted to a sparse csr_matrix.

Returns:

predictions of the input samples.

Return type:

generator of ndarray of shape (n_samples, n_classes)
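
A minimal sketch of training an SVM with a non-linear kernel on XOR-like data that no linear separator can handle (the data and parameter values are made up):

    import numpy as np
    import neoml

    X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]], dtype=np.float32)
    y = np.array([0, 0, 1, 1])

    svm = neoml.SVM.SvmClassifier(kernel='rbf', gamma=2.0, error_weight=10.0)
    model = svm.train(X, y)
    probs = model.classify(X)  # per-class probability predictions for each input vector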

Decision tree

Decision tree is a classification method that compares the object's features with a set of threshold values; each comparison determines which child node to move to next. Once a leaf node is reached, the object is assigned to the class this node represents.

class neoml.DecisionTree.DecisionTreeClassifier(*args: Any, **kwargs: Any)

Decision tree classifier.

Parameters:
  • criterion (str, {'gini', 'information_gain'}, default='gini') – the type of criterion to be used for subtree splitting.

  • min_subset_size (int, default=1) – the minimum number of vectors corresponding to a node subtree.

  • min_subset_part (float, [0..1], default=0.0) – the minimum weight of the vectors in a subtree relative to the parent node weight.

  • min_split_size (int, default=1) – the minimum number of vectors in a node subtree when it may be divided further.

  • max_tree_depth (int, default=32) – the maximum depth of the tree.

  • max_node_count (int, default=4096) – the maximum number of nodes in the tree.

  • const_threshold (float, [0..1], default=0.99) – if the ratio of same class elements in the subset is greater than this value, a constant node will be created.

  • random_selected_feature_count (int, default=-1) – no more than this number of randomly selected features will be used for each node. -1 means use all features every time.

  • available_memory (int, default=1024*1024*1024) – the memory limit for the algorithm (default is 1 gigabyte).

  • multiclass_mode (str, ['single_tree', 'one_vs_all', 'one_vs_one'], default='single_tree') – determines how to handle multi-class classification.

train(X, Y, weight=None)

Trains the decision tree.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the training sample. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct class labels (int) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then samples are equally weighted.

Returns:

the trained classification model.

Return type:

neoml.DecisionTree.DecisionTreeClassificationModel

class neoml.DecisionTree.DecisionTreeClassificationModel(internal)

Decision tree classification model.

classify(X)

Gets the classification results for the input sample.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input vectors, put into a matrix. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

Returns:

the predictions of class probability for each input vector.

Return type:

generator of ndarray of shape (n_samples, n_classes)
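
A minimal sketch of training a decision tree classifier (the data and parameter values are made up):

    import numpy as np
    import neoml

    X = np.array([[0.0, 5.0], [0.5, 4.0], [3.0, 1.0], [3.5, 0.5]], dtype=np.float32)
    y = np.array([0, 0, 1, 1])

    tree = neoml.DecisionTree.DecisionTreeClassifier(criterion='gini', max_tree_depth=8)
    model = tree.train(X, y)
    probs = model.classify(X)  # per-class probability predictions for each input vector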

Cross-validation

This method performs cross-validation for any of these classifiers.

CrossValidation.cross_validation_score(classifier, X, Y, weight=None, score='accuracy', parts=5, stratified=False)

Performs cross-validation of the given classifier on a set of data. The input sample is divided into the specified number of parts, then each of them in turn serves as the testing set while all the others are taken for the training set. Can calculate either accuracy or F-measure.

Parameters:
  • classifier (object) – the classifier to be tested.

  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – the input vectors, put into a matrix. The values will be converted to dtype=np.float32. If a sparse matrix is passed in, it will be converted to a sparse csr_matrix.

  • Y (array-like of shape (n_samples,)) – correct class labels (int) for the training set vectors.

  • weight (array-like of shape (n_samples,), default=None) – sample weights. If None, then all vectors are equally weighted.

  • score (str, {'accuracy', 'f1'}, default='accuracy') – the metric that should be calculated.

  • parts (int, default=5) – the number of parts into which the input sample should be divided.

  • stratified (bool, default=False) – specifies if the input set should be divided so that the ratio of classes in each part is (almost) the same as in the input data.

Returns:

the calculated metrics.

Return type:

array-like of shape (parts,)
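
A minimal sketch of 5-fold stratified cross-validation of a linear classifier (the random data is made up, and the function is assumed to be reachable as neoml.CrossValidation.cross_validation_score, matching the signature above):

    import numpy as np
    import neoml

    X = np.random.rand(100, 5).astype(np.float32)
    y = (X[:, 0] > 0.5).astype(int)  # labels derived from the first feature

    classifier = neoml.Linear.LinearClassifier(loss='binomial')
    scores = neoml.CrossValidation.cross_validation_score(
        classifier, X, y, score='accuracy', parts=5, stratified=True)
    # scores holds one metric value per fold, i.e. an array-like of shape (parts,)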