Copyright © 2017-2021 ABBYY Production LLC

[1]:
#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Linear classifier


In this tutorial, we’ll use a NeoML linear classifier to process the 20newsgroups dataset. We’ll look for the best parameter configuration by trying out every combination over a fixed parameter grid. NeoML also provides a cross-validation function, which we will use to evaluate each configuration’s performance.

The tutorial includes the following steps: downloading the dataset, looking for optimal parameters, and evaluating the best model.

Download the dataset

Note: This section doesn’t have any NeoML-specific code. It just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.

The vectorized 20newsgroups dataset can be downloaded with scikit-learn, already divided into training and testing subsets.

[2]:
from sklearn.datasets import fetch_20newsgroups_vectorized

train_data = fetch_20newsgroups_vectorized(subset='train')
test_data = fetch_20newsgroups_vectorized(subset='test')

Look for optimal parameters

We’ll take a brute-force approach and just check all possible combinations of parameters over a fixed parameter grid.

For each combination, we’ll use the neoml.CrossValidation.cross_validation_score method to evaluate the classifier performance on the training set.
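To make the idea of cross-validation concrete, here is a minimal plain-Python sketch of a generic k-fold split. It is only an illustration: the helper names `k_fold_indices` and `cross_validation_splits` are hypothetical, and how NeoML actually partitions the data (ordering, shuffling, stratification) is not shown here.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, nearly equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    folds = k_fold_indices(n_samples, n_folds)
    for i, validation in enumerate(folds):
        # Every fold except the i-th one is used for training
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(cross_validation_splits(n_samples=10, n_folds=5))
for train, validation in splits:
    print('train:', train, 'validation:', validation)
```

Each sample lands in the validation part exactly once, so averaging the per-fold scores uses the whole training set for evaluation without ever scoring a model on data it was trained on.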

Once the optimal parameter combination is found, we’ll train the classifier with these parameters on the whole training set.

[3]:
import neoml
import itertools

def grid_search(init_classifier, X, y, param_grid, n_folds=5):
    """Searches for the optimal parameters in the grid.
    Returns the trained model and the optimal parameters.
    """
    best_params = {}

    if param_grid:  # avoid the corner case when param_grid is empty
        param_names, param_values_lists = zip(*param_grid.items())
        best_acc = -1.
        for param_values in itertools.product(*param_values_lists):
            params = dict(zip(param_names, param_values))
            classifier = init_classifier(**params)
            acc = neoml.CrossValidation.cross_validation_score(classifier, X, y, parts=n_folds).mean()
            if acc > best_acc:
                best_acc = acc
                best_params = params

    # Train the classifier on the whole training set with the best params
    # and return the trained model
    best_classifier = init_classifier(**best_params)
    return best_classifier.train(X, y), best_params
[4]:
%%time

param_grid = {
    'loss': ['binomial', 'squared_hinge', 'smoothed_hinge'],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4]  # this is only for training, the cross-validation itself is single-threaded
}

# It will take some time...
# IMPORTANT: we're using only the training subset here
model, params = grid_search(neoml.Linear.LinearClassifier, train_data.data,
                            train_data.target, param_grid)
Wall time: 7min 1s
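To put the running time in perspective, the short sketch below counts how many configurations the grid above expands to. It reuses the same `itertools.product` pattern as `grid_search`; the count of cross-validation fits assumes that one model is trained per fold, which is how k-fold cross-validation usually works.

```python
import itertools

# Same grid as in the cell above
param_grid = {
    'loss': ['binomial', 'squared_hinge', 'smoothed_hinge'],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4],
}
n_folds = 5  # default value of the n_folds argument of grid_search

# Expand the grid into a list of keyword-argument dictionaries
param_names, param_values_lists = zip(*param_grid.items())
combinations = [dict(zip(param_names, values))
                for values in itertools.product(*param_values_lists)]

n_combinations = len(combinations)
# Assumption: cross-validation trains one model per fold
n_cv_fits = n_combinations * n_folds

print(n_combinations)   # 21 = 3 * 7 * 1
print(n_cv_fits)        # 105
print(combinations[0])  # {'loss': 'binomial', 'l1_reg': 0.0, 'thread_count': 4}
```

The number of configurations is the product of the list lengths, so every extra value added to the grid multiplies the search time accordingly.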

Let’s see which parameter set gave the best accuracy.

[5]:
print('Best params: ', params)
Best params:  {'loss': 'smoothed_hinge', 'l1_reg': 1e-06, 'thread_count': 4}

Evaluate the best model

Now we can run the trained model on the test subset.

[6]:
probs = model.classify(test_data.data)

print(type(probs))
print(probs.shape)
print(probs.dtype)
<class 'numpy.ndarray'>
(7532, 20)
float64

As you can see, for each object the model returns a probability distribution over classes.
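As a small illustration of what such an output means, the sketch below uses a made-up probability table for 3 objects and 4 classes (the numbers in `toy_probs` are hypothetical, not model output). It is written with plain lists so that it runs without the dataset; with the real `probs` array, `np.argmax(probs, axis=1)` does the same job.

```python
import math

# Hypothetical output for 3 objects and 4 classes (made-up numbers)
toy_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.25, 0.10, 0.40],
]

# Each row is a probability distribution, so it should sum to 1
for row in toy_probs:
    assert math.isclose(sum(row), 1.0)

# The predicted class is the index of the largest probability in the row
toy_pred = [max(range(len(row)), key=row.__getitem__) for row in toy_probs]
print(toy_pred)  # [0, 2, 3]
```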

Let’s also calculate the accuracy of the model on the test subset.

[7]:
import numpy as np

y_pred = np.argmax(probs, axis=1)
# Fraction of test objects whose predicted class matches the true class
accuracy = np.mean(y_pred == test_data.target)
print(f'Test accuracy: {accuracy:.4f}')
Test accuracy: 0.8236