Copyright © 2017-2021 ABBYY Production LLC

```
#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

# Identity recurrent neural network (IRNN)


In this tutorial, we’ll demonstrate that an identity recurrent neural network (IRNN) can efficiently process long temporal sequences, reproducing one of the experiments described in the Identity RNN article.

The experiment tests the IRNN on the MNIST dataset, first transforming its 28 x 28 images into 784-pixel-long sequences. The article claims that IRNN can achieve 0.9+ accuracy in these conditions.
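
To make the transformation concrete, here is a minimal NumPy sketch (not part of the experiment code) that flattens one 28 x 28 image into a 784-step sequence with one pixel per step. The `image` array is a synthetic stand-in for an MNIST digit, and row-by-row pixel order is an assumption.

```python
import numpy as np

# Synthetic stand-in for a single 28 x 28 image (assumption: row-major pixel order)
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Flatten row by row: 784 time steps, each carrying a single pixel value
sequence = image.reshape(-1, 1)

print(sequence.shape)  # (784, 1)
```

Step 28 of the sequence is the first pixel of the second image row, so the network has to carry information across hundreds of steps to relate vertically adjacent pixels.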

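The tutorial includes the following steps:

- Download and prepare the dataset
- Build the network
- Train the network and evaluate the results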

## Download and prepare the dataset

We will download the MNIST dataset from scikit-learn.

```
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
```

Now we need to normalize it and convert to 32-bit datatypes for NeoML.

```
import numpy as np
# Normalize
X = (255 - X) * 2 / 255 - 1
# Fix data types
X = X.astype(np.float32)
y = y.astype(np.int32)
```
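
As a quick sanity check of the normalization formula, the sketch below applies it to three sample intensities. The helper name `normalize` is introduced here for illustration only; the formula itself is the one used in the cell above.

```python
import numpy as np

def normalize(pixels):
    """Inverts intensities and rescales the [0, 255] range to [-1, 1]."""
    return (255 - pixels) * 2 / 255 - 1

sample = np.array([0.0, 127.5, 255.0])
normalized = normalize(sample).astype(np.float32)
# 0 maps to 1, 127.5 maps to 0, and 255 maps to -1
print(normalized)
```

Because of the `255 - X` term, the intensities are inverted as well as rescaled: raw value 0 maps to 1 and raw value 255 maps to -1.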

Finally, we’ll split the data into subsets used for training and for testing.

```
# Split into train/test
train_size = 60000
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
del X, y
```

## Build the network

### Choose the device

We need to create a math engine that will perform all calculations and allocate data for the neural network. The math engine is tied to the processing device.

In this tutorial we’ll use a single-threaded CPU math engine.

```
import neoml
math_engine = neoml.MathEngine.CpuMathEngine(1)
```

### Create the network and connect layers

Create a `neoml.Dnn.Dnn` object that represents a neural network (a directed graph of layers). The network requires a math engine to perform its operations; it must be specified at creation and can’t be changed later.

```
dnn = neoml.Dnn.Dnn(math_engine)
```

A `neoml.Dnn.Source` layer feeds the data into the network.

```
data = neoml.Dnn.Source(dnn, 'data') # source for data
```

Now we need to transpose this data into sequences of 784 pixels each. We can do that using the `neoml.Dnn.Transpose` layer, which swaps two dimensions of the blob.

The original data will be wrapped into a 2-dimensional blob with `BatchWidth` equal to the batch size and `Channels` equal to the image size. (The blobs are created right before training the network; see below.) This layer will transform it into sequences (`BatchLength`) whose length equals the image size, where each element of the sequence has size `1`.

```
transpose = neoml.Dnn.Transpose(data, first_dim='batch_length',
                                second_dim='channels', name='transpose')
```
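
The effect of the transposition on the blob shape can be illustrated with a plain NumPy array. This is only an analogy: it assumes the axes are ordered as (`BatchLength`, `BatchWidth`, `ListSize`, `Height`, `Width`, `Depth`, `Channels`) and says nothing about how NeoML implements the layer internally.

```python
import numpy as np

batch_size, image_size = 4, 784
# NumPy stand-in for a data blob of shape (BatchWidth, Channels),
# with all other dimensions equal to 1
blob_like = np.zeros((1, batch_size, 1, 1, 1, 1, image_size), dtype=np.float32)

# Swap the BatchLength axis (0) with the Channels axis (6)
sequences = np.swapaxes(blob_like, 0, 6)

print(sequences.shape)  # (784, 4, 1, 1, 1, 1, 1)
```

After the swap, each of the 4 objects in the batch is a sequence of 784 steps with a single channel per step.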

We add the `neoml.Dnn.Irnn` layer, connecting its input to the output of the transposition layer.

```
hidden_size = 100
irnn = neoml.Dnn.Irnn(transpose, hidden_size, identity_scale=1.,
                      input_weight_std=1e-3, name='irnn')
```
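
For intuition, here is a rough NumPy sketch of the IRNN recurrence, h_t = ReLU(W_x x_t + W_h h_{t-1} + b), with the recurrent matrix initialized to a scaled identity and the input weights drawn from a narrow normal distribution. The function `irnn_forward` is hypothetical, and the way `identity_scale` and `input_weight_std` are used here is an assumption based on their names; the NeoML layer may differ in its details.

```python
import numpy as np

def irnn_forward(x_seq, hidden_size, identity_scale=1.0,
                 input_weight_std=1e-3, seed=0):
    """Applies the IRNN recurrence to an array of shape (steps, batch, features)."""
    rng = np.random.default_rng(seed)
    n_features = x_seq.shape[2]
    # Input weights: small random values
    w_x = rng.normal(0.0, input_weight_std, size=(n_features, hidden_size))
    # Recurrent weights: scaled identity matrix
    w_h = identity_scale * np.eye(hidden_size)
    b = np.zeros(hidden_size)
    h = np.zeros((x_seq.shape[1], hidden_size))
    outputs = []
    for x_t in x_seq:
        h = np.maximum(0.0, x_t @ w_x + h @ w_h + b)  # ReLU activation
        outputs.append(h)
    return np.stack(outputs)

x_seq = np.ones((784, 2, 1))  # 784 steps, batch of 2, 1 feature per step
hidden_states = irnn_forward(x_seq, hidden_size=100)
print(hidden_states.shape)  # (784, 2, 100)
```

With an identity recurrent matrix and zero bias, the hidden state is initially just carried forward from step to step, which is what lets the untrained network preserve information over long sequences.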

Recurrent layers in NeoML usually return whole sequences, but to reproduce the experiment we only need the last element of each. The `neoml.Dnn.SubSequence` layer will help us here.

```
subseq = neoml.Dnn.SubSequence(irnn, start_pos=-1,
                               length=1, name='subseq')
```
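
In NumPy terms, taking the last element of each sequence looks roughly like the slice below. The sketch assumes that a negative `start_pos` counts from the end of the sequence, as the values used above suggest.

```python
import numpy as np

# Stand-in for the recurrent layer output: (steps, batch, hidden)
hidden_states = np.arange(5 * 2 * 3).reshape(5, 2, 3)

start_pos, length = -1, 1
# Assumption: a negative start position is an offset from the sequence end
start = start_pos if start_pos >= 0 else hidden_states.shape[0] + start_pos
last = hidden_states[start:start + length]

print(last.shape)  # (1, 2, 3)
```

The result is still a sequence, but of length 1, so each object in the batch is represented by its final hidden state only.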

Now we use a fully-connected layer to form logits (non-normalized distribution) over MNIST classes.

```
n_classes = 10
fc = neoml.Dnn.FullyConnected(subseq, n_classes, name='fc')
```

To train the network, we also need to define a loss function to be optimized. In this tutorial we’ll be optimizing cross-entropy loss.

A loss function needs to compare the network output with the correct labels, so we’ll add another source layer to pass the correct labels in.

```
labels = neoml.Dnn.Source(dnn, 'labels') # Source for labels
loss = neoml.Dnn.CrossEntropyLoss((fc, labels), name='loss')
```
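
The quantity being optimized can be sketched in NumPy as follows. This version assumes that softmax is applied to the logits inside the loss and that the result is averaged over the batch; the function name `softmax_cross_entropy` is introduced here for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of softmax(logits) against integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.zeros((2, 10))  # uniform prediction over 10 classes
labels = np.array([3, 7])
loss_value = softmax_cross_entropy(logits, labels)
print(loss_value)  # ln(10), approximately 2.3026
```

A network that assigns equal probability to all 10 digits therefore starts at a loss of about 2.3, which is a useful reference point when watching training progress.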

NeoML also provides a `neoml.Dnn.Accuracy` layer to calculate network accuracy. Let’s connect this layer and create an additional `neoml.Dnn.Sink` layer for extracting its output.

```
# Auxiliary layers for collecting statistics
accuracy = neoml.Dnn.Accuracy((fc, labels), name='accuracy')
# The accuracy layer writes its result to its output,
# so we need an additional sink layer to extract it
accuracy_sink = neoml.Dnn.Sink(accuracy, name='accuracy_sink')
```
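
For classification over logits, accuracy is the share of objects whose highest-scoring class matches the label. A minimal NumPy equivalent, with made-up logits and a hypothetical helper name, might look like this:

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max class equals the label."""
    return float((logits.argmax(axis=1) == labels).mean())

logits = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2, 0.3],
                   [0.0, 0.1, 0.9]])
labels = np.array([1, 0, 1])
acc = accuracy_from_logits(logits, labels)  # 2 of the 3 predictions are correct
```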

### Create a solver

A solver is an object that optimizes the weights using gradient values. It is necessary for training the network. In this sample we’ll use a `neoml.Dnn.AdaptiveGradient` solver, which is the NeoML implementation of Adam.

```
lr = 1e-6
# Create solver
dnn.solver = neoml.Dnn.AdaptiveGradient(math_engine, learning_rate=lr,
                                        l1=0., l2=0.,  # no regularization
                                        max_gradient_norm=1.,  # clip gradients
                                        moment_decay_rate=0.9,
                                        second_moment_decay_rate=0.999)
```
```
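
To show what these hyperparameters control, here is a textbook Adam update with gradient-norm clipping, written in NumPy. It is a sketch, not NeoML's implementation: the bias correction, the `eps` term, and the exact clipping rule are assumptions, and `adam_step` is a hypothetical helper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-6, beta1=0.9, beta2=0.999,
              max_gradient_norm=1.0, eps=1e-8):
    """One Adam update with gradient clipping; t is the 1-based step number."""
    norm = np.linalg.norm(grad)
    if norm > max_gradient_norm:
        grad = grad * (max_gradient_norm / norm)  # rescale to the maximum norm
    m = beta1 * m + (1 - beta1) * grad            # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
grad = np.array([3.0, 4.0])  # norm 5, clipped down to norm 1
w_new, m, v = adam_step(w, grad, np.zeros(2), np.zeros(2), t=1)
```

On the first step each weight moves by roughly the learning rate, regardless of the gradient magnitude. Here `beta1` and `beta2` play the role of `moment_decay_rate` and `second_moment_decay_rate`.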

## Train the network and evaluate the results

NeoML networks accept data only as `neoml.Blob.Blob`.

Blobs are 7-dimensional arrays located in device memory. Each dimension has a specific purpose:

- `BatchLength` - temporal axis (used in recurrent layers)
- `BatchWidth` - classic batch
- `ListSize` - list axis, used when objects are related to the same entity, but without ordering (unlike `BatchLength`)
- `Height` - height of the image
- `Width` - width of the image
- `Depth` - depth of the 3-dimensional image
- `Channels` - channels of the image (also used when object is a 1-dimensional vector)
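
The dimension order above determines the 7-element shape tuples used when wrapping arrays into blobs. The helper below is purely illustrative (`blob_shape` and `BLOB_DIMS` are not part of NeoML): it builds such a tuple from named sizes and defaults every other dimension to 1.

```python
BLOB_DIMS = ('BatchLength', 'BatchWidth', 'ListSize',
             'Height', 'Width', 'Depth', 'Channels')

def blob_shape(**sizes):
    """Builds a 7-element shape tuple; unspecified dimensions default to 1."""
    unknown = set(sizes) - set(BLOB_DIMS)
    if unknown:
        raise ValueError(f'Unknown dimensions: {sorted(unknown)}')
    return tuple(sizes.get(dim, 1) for dim in BLOB_DIMS)

data_shape = blob_shape(BatchWidth=40, Channels=784)
labels_shape = blob_shape(BatchWidth=40)
print(data_shape)    # (1, 40, 1, 1, 1, 1, 784)
print(labels_shape)  # (1, 40, 1, 1, 1, 1, 1)
```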

We will use `ndarray` to split data into batches, then create blobs from these batches right before feeding them into the network.

```
def irnn_data_iterator(X, y, batch_size, math_engine):
    """Slices numpy arrays into batches and wraps them in blobs"""
    def make_blob(data, math_engine):
        """Wraps numpy data into neoml blob"""
        shape = data.shape
        if len(shape) == 2:  # data
            # Wrap 2-D array into blob of (BatchWidth, Channels) shape
            return neoml.Blob.asblob(math_engine, data,
                                     (1, shape[0], 1, 1, 1, 1, shape[1]))
        elif len(shape) == 1:  # dense labels
            # Wrap 1-D array into blob of (BatchWidth,) shape
            return neoml.Blob.asblob(math_engine, data,
                                     (1, shape[0], 1, 1, 1, 1, 1))
        else:
            assert(False)

    start = 0
    data_size = y.shape[0]
    while start < data_size:
        yield (make_blob(X[start : start+batch_size], math_engine),
               make_blob(y[start : start+batch_size], math_engine))
        start += batch_size
```
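
The slicing logic can be checked on its own, without NeoML. The sketch below mirrors the batching part of the iterator with plain NumPy arrays; `numpy_batches` and the demo arrays are introduced here for illustration.

```python
import numpy as np

def numpy_batches(X, y, batch_size):
    """Yields consecutive (X, y) slices; the last batch may be smaller."""
    for start in range(0, y.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X_demo = np.zeros((100, 784), dtype=np.float32)
y_demo = np.zeros(100, dtype=np.int32)
batch_sizes = [len(y_b) for _, y_b in numpy_batches(X_demo, y_demo, batch_size=40)]
print(batch_sizes)  # [40, 40, 20]
```

When the dataset size is not a multiple of the batch size, the final batch is shorter, which is why the blob shapes are taken from each slice rather than from `batch_size`.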

To train the network, call `dnn.learn` with data as its argument.

To run the network without training, call `dnn.run` with data as its argument.

The input data is a `dict` where each key is a `neoml.Dnn.Source` layer name and the corresponding value is the `neoml.Blob.Blob` that should be passed in to this layer.

```
import time


def run_net(X, y, batch_size, dnn, is_train):
    """Runs dnn on given data"""
    start = time.time()
    total_loss = 0.
    run_iter = dnn.learn if is_train else dnn.run
    math_engine = dnn.math_engine
    layers = dnn.layers
    loss = layers['loss']
    accuracy = layers['accuracy']
    sink = layers['accuracy_sink']

    accuracy.reset = True  # Reset previous statistics
    # Iterate over batches
    for X_batch, y_batch in irnn_data_iterator(X, y, batch_size, math_engine):
        # Run the network on the batch data
        run_iter({'data': X_batch, 'labels': y_batch})
        total_loss += loss.last_loss * y_batch.batch_width  # Update epoch loss
        accuracy.reset = False  # Don't reset statistics within one epoch

    avg_loss = total_loss / y.shape[0]
    avg_acc = sink.get_blob().asarray()[0]
    run_time = time.time() - start
    return avg_loss, avg_acc, run_time
```
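
The loss bookkeeping in `run_net` weights each batch by its size before averaging. The arithmetic below, using made-up numbers and assuming `last_loss` is the mean loss over the batch, shows why this matters when the last batch is smaller than the others.

```python
batch_losses = [0.5, 0.5, 2.0]  # hypothetical mean loss of each batch
batch_sizes = [40, 40, 20]      # the last batch is smaller

# Weighted average: every object contributes equally
total_loss = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
avg_loss = total_loss / sum(batch_sizes)

# Naive average over batches: the small batch is over-represented
naive_avg = sum(batch_losses) / len(batch_losses)

print(avg_loss, naive_avg)
```

Here the weighted average is 0.8, while the unweighted mean over batches would report 1.0.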

*Note*: It will take 3-4 hours to train. You may uncomment print statements to see the progress.

```
%%time
import time

batch_size = 40
n_epoch = 200

for epoch in range(n_epoch):
    # Train
    train_loss, train_acc, run_time = run_net(X_train, y_train, batch_size,
                                              dnn, is_train=True)
    # print(f'Train #{epoch}\tLoss: {train_loss:.4f}\t'
    #       f'Accuracy: {train_acc:.4f}\tTime: {run_time:.2f} sec')
    # Test
    test_loss, test_acc, run_time = run_net(X_test, y_test, batch_size,
                                            dnn, is_train=False)
    # print(f'Test #{epoch}\tLoss: {test_loss:.4f}\t'
    #       f'Accuracy: {test_acc:.4f}\tTime: {run_time:.2f} sec')
print(f'Final test acc: {test_acc:.4f}')
```

```
Final test acc: 0.9050
Wall time: 3h 54min 34s
```

As we can see, the model reached a test accuracy of 0.9050 on these 784-step sequences, which is consistent with the 0.9+ accuracy claimed in the paper.