Backpropagation through time (BPTT) may be considered the earliest
learning algorithm for DTRNN. The most commonly used reference for
BPTT is the book chapter by Rumelhart et al. (1986), although an earlier
description of BPTT may be found in
Werbos's (1974) PhD
dissertation (see also Werbos, 1990). The central idea of
BPTT is the unfolding of the discrete-time recurrent neural
network
into a multilayer feedforward neural network (FFNN)
each time a sequence is processed. The FFNN has a layer for each
``time step'' in the sequence; each layer has $n_X$ units, that is, as
many as there are state units in the original network. It is as if we
were using time to index layers in the FFNN. The next-state function is
implemented by connecting state units in layer $t-1$ and inputs at time
$t$ to state units in layer $t$. Output units (which are also repeated
in each ``time step'' where targets are available) are connected to
state units (and to input lines when the DTRNN is a Mealy NSM) as in
the DTRNN itself.
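For concreteness, assume a first-order next-state function of the usual
form (the symbols $f$, $\mathbf{W}$, $\mathbf{V}$, $\mathbf{u}[t]$, and
the sequence length $L$ are illustrative here, not fixed by this
section). Unfolding then places one copy of this map in each layer,
\[
\mathbf{x}^{[t]} = f\!\left(\mathbf{W}^{[t]}\,\mathbf{x}^{[t-1]}
+ \mathbf{V}^{[t]}\,\mathbf{u}[t]\right),
\qquad t = 1, \ldots, L,
\]
with the weight-sharing constraint $\mathbf{W}^{[t]} = \mathbf{W}$ and
$\mathbf{V}^{[t]} = \mathbf{V}$ for all $t$.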
The resulting FFNN is trained using the standard backpropagation (BP) algorithm, but with one restriction: since the layers have been obtained by replicating the DTRNN over and over, the weights in all layers must be identical. To achieve this, BPTT updates each weight using the sum of the gradients obtained for its copies in all layers, a sum which may be shown to be the exact gradient of the error function for the original DTRNN.
In BPTT, weights can only be updated after a complete forward step and a complete backward step, just as in regular backpropagation. When processing finite sequences, weights are usually updated after a complete presentation of the sequence.
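The whole procedure fits in a few lines. The following is a minimal
sketch, not a definitive implementation: it assumes a first-order,
Elman-style Moore NSM with next state
$\mathbf{x}[t] = \tanh(\mathbf{W}\mathbf{x}[t-1] + \mathbf{V}\mathbf{u}[t])$,
output $\mathbf{y}[t] = \mathbf{U}\mathbf{x}[t]$, targets at every time
step, and a summed squared error; all names are illustrative.
\begin{verbatim}
import numpy as np

def bptt(W, V, U, inputs, targets, x0):
    # W: (n_X, n_X) state weights; V: (n_X, n_U) input weights;
    # U: (n_Y, n_X) output weights; inputs/targets: lists of vectors.
    L = len(inputs)
    # Forward step: unfold the DTRNN, keeping one state per layer.
    xs = [x0]
    for t in range(L):
        xs.append(np.tanh(W @ xs[-1] + V @ inputs[t]))
    # Backward step: plain backpropagation through the unfolded FFNN;
    # the gradients of all replicated copies of W and V are summed.
    dW, dV, dU = np.zeros_like(W), np.zeros_like(V), np.zeros_like(U)
    g_next = np.zeros_like(x0)  # error flowing back from layer t + 1
    error = 0.0
    for t in reversed(range(L)):
        x_t = xs[t + 1]               # state at time t (xs[0] is x0)
        e = U @ x_t - targets[t]      # output error at this layer
        error += 0.5 * float(e @ e)
        dU += np.outer(e, x_t)
        delta = (U.T @ e + g_next) * (1.0 - x_t ** 2)  # tanh'
        dW += np.outer(delta, xs[t])      # this layer's copy of W
        dV += np.outer(delta, inputs[t])  # this layer's copy of V
        g_next = W.T @ delta          # send error back to layer t - 1
    return error, dW, dV, dU
\end{verbatim}
In keeping with the update rule above, a gradient-descent update such
as $\mathbf{W} \leftarrow \mathbf{W} - \eta\,\mathtt{dW}$ (and likewise
for $\mathbf{V}$ and $\mathbf{U}$) would then be applied once per
presentation of the sequence.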
The time complexity of BPTT is one of its most attractive features: for
first-order DTRNN in which the number of states is larger than the
number of inputs ($n_X > n_U$), the temporal cost of the backward step
used to compute the derivatives grows as $O(n_X^2 L)$, where $L$ is the
length of the sequence, that is, the same as the cost of the forward
step used to process the sequence and compute the outputs. The main
drawback of BPTT is its space complexity, $O(n_X L)$, which comes from
the need to replicate the DTRNN for each step of the sequence and to
store all of the replicated states until the backward step is complete.
This also makes it a bit trickier to program than RTRL.
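For comparison (a standard result, stated here under the same assumed
notation rather than taken from this section), RTRL computes the same
exact gradient online, trading memory for time:
\[
\underbrace{O(n_X^2 L)\ \text{time},\quad O(n_X L)\ \text{space}}_{\text{BPTT}}
\qquad \text{versus} \qquad
\underbrace{O(n_X^4 L)\ \text{time},\quad O(n_X^3)\ \text{space}}_{\text{RTRL}}.
\]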
For more details on BPTT, the reader is referred to Haykin (1998, 751) and Hertz et al. (1991, 182).