[1]:
#@title
# Copyright © 2017-2021 ABBYY Production LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Gradient tree boosting classifier
Download the tutorial as a Jupyter notebook
In this tutorial, we’ll use a NeoML gradient boosting classifier to process the 20newsgroups dataset. We’ll compare different modes for building the decision trees, looking at the time it takes to train each one and the accuracy of its performance on the testing set.
The tutorial includes the following steps:
Download the dataset
Compare different builder types
Download the dataset
Note: This section doesn’t have any NeoML-specific code. It just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.
The 20newsgroups vectorized dataset can be downloaded from scikit-learn, already divided into training and testing subsets.
[2]:
from sklearn.datasets import fetch_20newsgroups_vectorized
train_data = fetch_20newsgroups_vectorized(subset='train')
test_data = fetch_20newsgroups_vectorized(subset='test')
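To see what the classifier will receive, it helps to know that `fetch_20newsgroups_vectorized` returns its features as a SciPy CSR sparse matrix of tf-idf values and its labels as a NumPy array of class ids. A minimal sketch of that structure, using a tiny hand-made matrix so nothing has to be downloaded:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for train_data.data / train_data.target: rows are documents,
# columns are vocabulary terms, values are tf-idf weights.
X = csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.2, 0.0, 0.3],
]))
y = np.array([0, 1])  # one class label per document

print(X.shape)  # (2, 3) -- (documents, vocabulary terms)
print(X.nnz)    # 3 non-zero entries; most of the matrix is zeros
print(y.shape)  # (2,)
```

The real training subset is far larger (roughly 11,000 documents over a vocabulary of about 130,000 terms), which is why the sparse representation matters.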
Compare different builder types
To compare different boosting builder types, we first need to:
Prepare an error function
Set up the other boosting parameters, the same for each builder
[3]:
import numpy as np

def accuracy(model, X, y):
    """Returns the accuracy of the model on the given data"""
    correct = sum(1 for label, probs in zip(y, model.classify(X))
                  if label == np.argmax(probs))
    return float(correct) / len(y)

# These arguments will be used for every builder_type
shared_kwargs = {
    'loss' : 'binomial',
    'iteration_count' : 100,
    'learning_rate' : 0.1,
    'subsample' : 1.,
    'subfeature' : 0.25,
    'random_seed' : 1234,
    'max_depth' : 6,
    'max_node_count' : -1,
    'l1_reg' : 0.,
    'l2_reg' : 1.,
    'prune' : 0.,
    'thread_count' : 1,
}
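The `accuracy` helper counts a prediction as correct when the true label matches the argmax of the probability row returned for that sample. A minimal illustration with hard-coded probabilities, so no trained model is needed:

```python
import numpy as np

# Toy stand-in for model.classify(X): one probability row per sample.
probs = np.array([
    [0.7, 0.2, 0.1],   # argmax -> class 0
    [0.1, 0.8, 0.1],   # argmax -> class 1
    [0.3, 0.4, 0.3],   # argmax -> class 1
])
labels = np.array([0, 1, 2])  # true classes; the last sample is misclassified

correct = sum(1 for label, p in zip(labels, probs) if label == np.argmax(p))
print(correct / len(labels))  # 2 of 3 correct -> 0.666...
```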
Now we’ll compare training speed and accuracy of different decision tree builders.
NeoML has several builder types for gradient boosting:
full - the classic algorithm. If the dataset has multiple classes, it uses the one-versus-all approach.
hist - uses histograms of feature values when deciding how to split nodes.
multi_full - the classic algorithm with one modification: for multiple classes it stores multiple values in the leaf nodes of a single tree ensemble, instead of building multiple one-versus-all ensembles.
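To make the one-versus-all idea concrete: for K classes, K binary scorers are trained, and the predicted class is the one whose scorer returns the highest score. A sketch with made-up score values (not produced by any real model):

```python
import numpy as np

# Each column holds the scores from one binary "this class vs. the rest"
# scorer; the prediction is the column index with the highest score per row.
scores = np.array([
    [ 1.2, -0.4, 0.1],   # sample 0: class 0 wins
    [-0.8,  0.9, 0.3],   # sample 1: class 1 wins
])
pred = scores.argmax(axis=1)
print(pred)  # [0 1]
```

The multi_full builder avoids these separate per-class ensembles by letting each leaf of a single ensemble hold one value per class.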
[4]:
import time
import neoml

# Train and test gradient boosting for every builder type
for builder in ['full', 'hist', 'multi_full']:
    start = time.time()
    boost_kwargs = { **shared_kwargs, 'builder_type' : builder }
    classifier = neoml.GradientBoost.GradientBoostClassifier(**boost_kwargs)
    model = classifier.train(train_data.data, train_data.target)
    run_time = time.time() - start
    acc = accuracy(model, test_data.data, test_data.target)
    print(f'{builder}  Accuracy: {acc:.4f}  Time: {run_time:.2f} sec.')
full Accuracy: 0.7868 Time: 111.54 sec.
hist Accuracy: 0.7926 Time: 198.95 sec.
multi_full Accuracy: 0.6609 Time: 209.08 sec.