Copyright © 2017-2021 ABBYY Production LLC
[1]:
#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Linear classifier
In this tutorial, we’ll use a NeoML linear classifier to classify documents from the 20newsgroups dataset. We’ll look for the best parameter configuration by trying out every combination over a fixed parameter grid. NeoML also provides a cross-validation function, which we will use to evaluate each configuration’s performance.
The tutorial includes the following steps: downloading the dataset, looking for optimal parameters, and evaluating the best model.
Download the dataset
Note: This section doesn’t have any NeoML-specific code. It just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.
The vectorized 20newsgroups dataset can be downloaded from scikit-learn, already split into training and testing subsets.
[2]:
from sklearn.datasets import fetch_20newsgroups_vectorized
train_data = fetch_20newsgroups_vectorized(subset='train')
test_data = fetch_20newsgroups_vectorized(subset='test')
Look for optimal parameters
We’ll take a brute-force approach and just check all possible combinations of parameters over a fixed parameter grid.
For each combination, we’ll use the neoml.CrossValidation.cross_validation_score method to evaluate the classifier performance on the training set.
Once the optimal parameter combination is found, we’ll train the classifier with these parameters on the whole training set.
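To make the evaluation step more concrete, here is a minimal sketch of generic k-fold cross-validation written with NumPy. It is not NeoML’s implementation, and the way NeoML splits the data may differ. The helpers `k_fold_accuracy`, `fit_majority`, and `predict_majority`, as well as the toy data, are hypothetical names introduced only for this illustration.

```python
import numpy as np

def k_fold_accuracy(fit, predict, X, y, n_folds=5, seed=0):
    """Generic k-fold cross-validation; returns one accuracy per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))         # shuffle the sample indices once
    folds = np.array_split(indices, n_folds)  # n_folds nearly equal parts
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[val_idx]) == y[val_idx]))
    return np.array(scores)

# Hypothetical stand-in classifier: always predicts the most frequent training label
def fit_majority(X, y):
    return np.bincount(y).argmax()

def predict_majority(model, X):
    return np.full(len(X), model)

X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
fold_scores = k_fold_accuracy(fit_majority, predict_majority, X_toy, y_toy, n_folds=5)
print(fold_scores, fold_scores.mean())
```

Averaging the per-fold scores gives a single number per configuration, which is what the search compares when it picks the best parameters.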
[3]:
import neoml
import itertools
def grid_search(init_classifier, X, y, param_grid, n_folds=5):
    """Searches for the optimal parameters in the grid
    Returns the trained model and the optimal parameters
    """
    best_params = {}
    if param_grid:  # avoid the corner case when param_grid is empty
        param_names, param_values_lists = zip(*param_grid.items())
        best_acc = -1.
        for param_values in itertools.product(*param_values_lists):
            params = dict(zip(param_names, param_values))
            classifier = init_classifier(**params)
            acc = neoml.CrossValidation.cross_validation_score(classifier, X, y, parts=n_folds).mean()
            if acc > best_acc:
                best_acc = acc
                best_params = params
    # Train the classifier on the whole training set with the best params
    # and return the trained model
    best_classifier = init_classifier(**best_params)
    return best_classifier.train(X, y), best_params
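The loop in grid_search relies on `itertools.product` to expand the grid into individual parameter dictionaries. The sketch below shows that expansion in isolation on a small, hypothetical grid (`toy_grid` is not the grid used in this tutorial), without training anything.

```python
import itertools

# Hypothetical two-parameter grid, smaller than the one used in the tutorial
toy_grid = {
    'loss': ['binomial', 'squared_hinge'],
    'l1_reg': [0.0, 1e-3],
}

# Split the grid into parallel tuples of names and value lists
param_names, param_values_lists = zip(*toy_grid.items())

# Each element of the Cartesian product becomes one keyword-argument dict
combinations = [dict(zip(param_names, values))
                for values in itertools.product(*param_values_lists)]

for params in combinations:
    print(params)
```

The last parameter in the grid varies fastest, so all `l1_reg` values are tried for one `loss` before moving on to the next.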
[4]:
%%time
param_grid = {
    'loss': ['binomial', 'squared_hinge', 'smoothed_hinge'],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4]  # this is only for training, the cross-validation itself is single-threaded
}
# It will take some time...
# IMPORTANT: we're using only the training subset here
model, params = grid_search(neoml.Linear.LinearClassifier, train_data.data,
                            train_data.target, param_grid)
Wall time: 7min 1s
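A rough count of the work involved helps explain the running time. The sketch below repeats the grid so that it is self-contained, and it assumes that cross-validation trains one model per fold for each combination, which is the usual k-fold behavior but is not confirmed here for NeoML.

```python
import math

# Same grid as in the cell above, repeated to keep the sketch self-contained
grid = {
    'loss': ['binomial', 'squared_hinge', 'smoothed_hinge'],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4],
}
n_folds = 5  # default value of n_folds in grid_search

# Number of parameter combinations in the grid
n_combinations = math.prod(len(values) for values in grid.values())

# Assumption: one training run per fold per combination, plus the final fit
n_training_runs = n_combinations * n_folds + 1
print(n_combinations, n_training_runs)
```

Under that assumption, every value added to the grid multiplies the number of training runs, so the brute-force approach is practical only for small grids like this one.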
Let’s see which parameter set gave the best accuracy.
[5]:
print('Best params: ', params)
Best params: {'loss': 'smoothed_hinge', 'l1_reg': 1e-06, 'thread_count': 4}
Evaluate the best model
Now we can run the trained model on the test subset.
[6]:
probs = model.classify(test_data.data)
print(type(probs))
print(probs.shape)
print(probs.dtype)
<class 'numpy.ndarray'>
(7532, 20)
float64
As you can see, for each object the model returns a probability distribution over classes.
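If each row is a probability distribution, it should sum to 1, and the most likely class is the index of the row’s largest value. The sketch below illustrates this on a small hypothetical array (`toy_probs` is made up and is not the model output).

```python
import numpy as np

# Hypothetical output for 3 objects and 4 classes
toy_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.25, 0.10, 0.40],
])

row_sums = toy_probs.sum(axis=1)      # should be 1 for every object
predicted = toy_probs.argmax(axis=1)  # index of the most likely class
confidence = toy_probs.max(axis=1)    # probability assigned to that class

print(np.allclose(row_sums, 1.0))
print(predicted)
print(confidence)
```

The same check can be applied to `probs` to confirm that the model output behaves as described.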
Let’s also calculate the accuracy of the model on the test subset.
[7]:
import numpy as np
y_pred = np.argmax(probs, axis=1)
correct = sum(1 for true_class, pred_class in zip(test_data.target, y_pred)
              if true_class == pred_class)
print(f'Test accuracy: {float(correct)/len(y_pred):.4f}')
Test accuracy: 0.8236
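The same accuracy can also be computed without an explicit loop. The sketch below compares the counting approach used above with a vectorized NumPy expression on hypothetical label arrays (`y_true_toy` and `y_pred_toy` are made up for illustration).

```python
import numpy as np

# Hypothetical true and predicted labels for 5 objects
y_true_toy = np.array([0, 2, 1, 3, 2])
y_pred_toy = np.array([0, 2, 2, 3, 1])

# Counting approach, as in the cell above
correct_toy = sum(1 for true_class, pred_class in zip(y_true_toy, y_pred_toy)
                  if true_class == pred_class)
accuracy_loop = correct_toy / len(y_pred_toy)

# Vectorized approach: the mean of a boolean array is the fraction of True values
accuracy_vectorized = np.mean(y_true_toy == y_pred_toy)

print(f'{accuracy_loop:.4f} {accuracy_vectorized:.4f}')
```

Both expressions give the same value; the vectorized form is shorter and usually faster on large arrays.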