Real-time recurrent learning (RTRL) has been independently derived by
many authors, although the most commonly cited reference for it is
Williams and Zipser (1989b) (for more details see also
Hertz et al. (1991, 184) and
Haykin (1998, 756)). This algorithm
computes the derivatives of states and outputs with
respect to all weights as the network processes the sequence, that is,
during the forward step. No unfolding is performed, nor is any necessary. For
instance, if the network has a simple next-state
dynamics such as
the one described in eq. (3.10), derivatives may be
computed together with the next state. The derivative of the states with
respect to, say, the state-state weights at time $t$ would be computed
from the states and derivatives at time $t-1$ and the input at time $t$
as follows:

\[
\frac{\partial x_i[t]}{\partial W^{xx}_{kl}} =
g'\!\left(\sum_{j=1}^{n_X} W^{xx}_{ij}\,x_j[t-1]
+ \sum_{m=1}^{n_U} W^{xu}_{im}\,u_m[t] + W^{x}_{i}\right)
\left(\delta_{ik}\,x_l[t-1]
+ \sum_{j=1}^{n_X} W^{xx}_{ij}\,
\frac{\partial x_j[t-1]}{\partial W^{xx}_{kl}}\right) \tag{4.28}
\]

where $\delta_{ik}$ is Kronecker's delta.
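To make the recursion concrete, here is a minimal NumPy sketch of one combined forward-and-derivative step, assuming a tanh activation for $g$, no bias term, and illustrative names (rtrl_step, Wxx, Wxu, dx_prev) that are not taken from the text:

```python
import numpy as np

def rtrl_step(x_prev, u_t, Wxx, Wxu, dx_prev):
    """One forward step of the next-state dynamics plus the RTRL
    derivative recursion of eq. (4.28).

    x_prev  : states x[t-1], shape (nX,)
    u_t     : input u[t], shape (nU,)
    Wxx     : state-state weights, shape (nX, nX)
    Wxu     : input-state weights, shape (nX, nU)
    dx_prev : d x[t-1] / d Wxx, shape (nX, nX, nX)
    Returns x[t] and d x[t] / d Wxx.
    """
    net = Wxx @ x_prev + Wxu @ u_t        # net input (bias omitted here)
    x_t = np.tanh(net)                    # g = tanh, for concreteness
    gp = 1.0 - x_t ** 2                   # g'(net) for tanh

    nX = x_prev.shape[0]
    dx_t = np.empty((nX, nX, nX))
    for i in range(nX):
        for k in range(nX):
            for l in range(nX):
                # recursive term: sum_j Wxx[i, j] * d x_j[t-1] / d Wxx[k, l]
                rec = Wxx[i] @ dx_prev[:, k, l]
                # explicit term: delta_{ik} * x_l[t-1]
                direct = x_prev[l] if i == k else 0.0
                dx_t[i, k, l] = gp[i] * (direct + rec)
    return x_t, dx_t
```

The four nested levels of work (the triple loop plus the inner dot product) are where the $n_X^4$ per-step cost mentioned below comes from.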
Since the derivatives of the outputs are easily defined in terms of the state derivatives for all architectures, the learnable parameters of the DTRNN may be updated after every time step in which output targets are defined (using the derivatives of the error for each output), that is, even after having processed only part of a sequence. This is one of the main advantages of RTRL in applications where online learning is necessary; the other is the ease with which it may be derived and programmed for a new architecture. However, its time complexity is much higher than that of BPTT: for first-order DTRNNs such as the one above with more state units than input lines ($n_X > n_U$), the dominant term in the time complexity per time step is $n_X^4$. A detailed derivation of RTRL for a second-order DTRNN architecture may be found in Giles et al. (1992).
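As an illustration of such a per-step update, the following sketch (reusing rtrl_step from above) contracts the state derivatives with the error derivatives of a hypothetical linear output layer $y[t] = W^{yx} x[t]$ under a squared error; the output layer and all names here are assumptions made for the example, not details from the text:

```python
import numpy as np

def online_step(x_t, dx_t, Wxx, Wyx, target, lr=0.01):
    """Update the state-state weights right after a time step with a target.

    Chain rule: dE/dWxx[k, l] = sum_i (dE/dx_i[t]) * (dx_i[t]/dWxx[k, l]),
    with E = 0.5 * ||Wyx @ x[t] - target||^2 (assumed linear output layer).
    """
    err = Wyx @ x_t - target                    # dE/dy[t]
    dE_dx = Wyx.T @ err                         # dE/dx[t]
    grad = np.einsum('i,ikl->kl', dE_dx, dx_t)  # contract over state index i
    return Wxx - lr * grad                      # gradient step for this step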
The reader should be aware that the name RTRL (Williams and Zipser, 1989c) is applied to two different concepts: it may be viewed solely as a method to compute the derivatives, or as a method to compute the derivatives and to update the weights (in each cycle). One may use RTRL to compute derivatives and update the weights after processing a complete learning set made up of a number of sequences (batch update), after processing each sequence (pattern update), or after processing each item in each sequence (online update). In these last two cases, the derivatives are not exact but approximate (they would be exact for a zero learning rate). For batch and pattern weight updates, RTRL and BPTT are equivalent, since they compute the same derivatives. The reader is referred to Williams and Zipser (1995) for a more detailed discussion.
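The difference between the three schedules is only where the weight update happens in the loop structure. A schematic sketch, again reusing rtrl_step and the hypothetical linear output layer from the previous examples, might look as follows:

```python
import numpy as np

def train_rtrl(sequences, Wxx, Wxu, Wyx, mode="pattern", lr=0.01):
    """Batch, pattern, and online update schedules for RTRL.

    sequences: list of (inputs, targets) pairs; targets[t] is None
    when no output target is defined at step t.
    """
    nX = Wxx.shape[0]
    batch_grad = np.zeros_like(Wxx)
    for inputs, targets in sequences:
        x = np.zeros(nX)                      # states reset per sequence
        dx = np.zeros((nX, nX, nX))           # derivatives reset per sequence
        seq_grad = np.zeros_like(Wxx)
        for u_t, tgt in zip(inputs, targets):
            x, dx = rtrl_step(x, u_t, Wxx, Wxu, dx)
            if tgt is None:
                continue                      # no target at this time step
            dE_dx = Wyx.T @ (Wyx @ x - tgt)
            grad = np.einsum('i,ikl->kl', dE_dx, dx)
            if mode == "online":
                Wxx = Wxx - lr * grad         # update mid-sequence: derivatives
            else:                             # become approximate
                seq_grad += grad
        if mode == "pattern":
            Wxx = Wxx - lr * seq_grad         # update after each sequence
        elif mode == "batch":
            batch_grad += seq_grad
    if mode == "batch":
        Wxx = Wxx - lr * batch_grad           # single update for the whole set
    return Wxx
```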