When we want to train a DTRNN as a sequence processor, the usual procedure is first to choose the architecture and its structural parameters. The number of input lines and the number of output neurons are usually determined by the nature of the input sequences and by the processing we want to perform.4.9 The number of state neurons, in contrast, has to be determined through experimentation, or may be chosen to act as a computational bias that restricts the computational power of the DTRNN when a priori knowledge about the computational requirements of the task is available. Since DTRNN are state-based sequence processors (see section 3.1.1), the choice of the number of state units is crucial: the resulting state space has to be large enough to store all the information about an input sequence that is necessary to produce a correct output for it, assuming that the DTRNN architecture is capable of extracting that information from the inputs and of computing correct outputs from the states. It is also possible to modify the architecture as training proceeds (see, e.g., Fahlman (1991)), as will be mentioned later.
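As a loose illustration (not taken from the text), the following sketch shows how these architectural choices fix the shapes of the learnable parameters of a simple first-order (Elman-style) DTRNN; all names and sizes are illustrative.

```python
import numpy as np

n_inputs = 2    # number of input lines: fixed by the nature of the input sequences
n_states = 4    # number of state neurons: chosen by experimentation or a priori knowledge
n_outputs = 1   # number of output neurons: fixed by the processing to be performed

rng = np.random.default_rng(0)

# Learnable parameters: weights, biases and the initial state.
Wxx = rng.normal(scale=0.1, size=(n_states, n_states))   # state -> state weights
Wxu = rng.normal(scale=0.1, size=(n_states, n_inputs))   # input -> state weights
bx  = np.zeros(n_states)                                  # state biases
Wyx = rng.normal(scale=0.1, size=(n_outputs, n_states))   # state -> output weights
by  = np.zeros(n_outputs)                                 # output biases
x0  = np.zeros(n_states)                                  # initial state

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, u):
    """One time step: compute the next state and the corresponding output."""
    x_next = sigmoid(Wxx @ x + Wxu @ u + bx)
    y = sigmoid(Wyx @ x_next + by)
    return x_next, y
```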
Then we train the DTRNN on examples of processed sequences. Training a DTRNN as a discrete-time sequence processor involves adjusting its learnable parameters, which in a DTRNN are the weights, the biases and the initial states.4.10 To train the network we usually need an error measure that describes how far the actual outputs are from their desired targets; the learnable parameters are modified so as to minimize this error measure. It is very convenient for the error to be a differentiable function of the learnable parameters (this is usually the case with most sigmoid-like activation functions, as discussed in the previous section). A number of problems may occur when training DTRNN --and, in general, any neural network-- by error minimization; these problems are reviewed in section 3.5.
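The following sketch, which builds on the toy DTRNN above, illustrates the idea of error minimization over the learnable parameters; for clarity the gradient of a quadratic error is approximated by finite differences rather than computed exactly, so this is only an illustration of the principle, not a practical training algorithm. The names run, error and train_step are ours.

```python
def run(params, inputs):
    """Process an input sequence and return the output after the last item."""
    Wxx, Wxu, bx, Wyx, by, x0 = params
    x = x0
    for u in inputs:
        x = sigmoid(Wxx @ x + Wxu @ u + bx)
    return sigmoid(Wyx @ x + by)

def error(params, learning_set):
    """Quadratic error between actual outputs and desired targets over the learning set."""
    return sum(np.sum((run(params, seq) - target) ** 2) for seq, target in learning_set)

def train_step(params, learning_set, lr=0.5, eps=1e-5):
    """One gradient-descent step; each partial derivative of the error is
    estimated with a central finite difference."""
    new_params = [p.copy() for p in params]
    for p, q in zip(params, new_params):
        for i in np.ndindex(p.shape):
            old = p[i]
            p[i] = old + eps; e_plus = error(params, learning_set)
            p[i] = old - eps; e_minus = error(params, learning_set)
            p[i] = old
            q[i] = old - lr * (e_plus - e_minus) / (2 * eps)
    return new_params

# Toy usage: one sequence of two input vectors with a single scalar target.
params = [Wxx, Wxu, bx, Wyx, by, x0]
learning_set = [([np.array([1.0, 0.0]), np.array([0.0, 1.0])], np.array([1.0]))]
for epoch in range(100):
    params = train_step(params, learning_set)
```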
Learning algorithms (also called training algorithms) for DTRNN may be classified according to diverse criteria. All learning algorithms (except trivial ones such as random search) implement a heuristic to search the high-dimensional space of learnable parameters for minima of the chosen error function; the nature of this heuristic may be used to classify them. Some of the divisions described in the following may also apply to non-recurrent neural networks.
A major division occurs between gradient-based algorithms, which compute the gradient of the error function with respect to the learnable parameters at the current search point and use this vector to define the next point in the search sequence, and non-gradient-based algorithms, which use other (usually local) information to choose the next point. Obviously, gradient-based algorithms require that the error function be differentiable, whereas most non-gradient-based algorithms may dispense with this requirement. In the following, this will be used as the main division.
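A schematic contrast between the two families might look as follows (our own illustration, reusing the error function sketched above). Here grad_fn stands for some assumed procedure returning the exact error gradient, which only exists if the error is differentiable, whereas the perturbation step needs nothing but error values.

```python
def gradient_step(params, learning_set, grad_fn, lr=0.1):
    """Gradient-based: move against the error gradient at the current search point."""
    grads = grad_fn(params, learning_set)           # requires a differentiable error
    return [p - lr * g for p, g in zip(params, grads)]

def perturbation_step(params, learning_set, rng, scale=0.01):
    """Non-gradient-based: keep a random perturbation only if it lowers the error."""
    candidate = [p + rng.normal(scale=scale, size=p.shape) for p in params]
    return candidate if error(candidate, learning_set) < error(params, learning_set) else params
```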
Another division relates to the schedule used to decide the next set of learnable parameters. Batch algorithms compute the total error function for all of the patterns in the current learning set and update the learnable parameters only after a complete evaluation of the total error function has been performed. Pattern algorithms compute the contribution of a single pattern to the error function and update the learnable parameters after computing this contribution. This formulation of the division may be applied to most neural network learning algorithms; however, in the case of DTRNN used as sequence processors, targets may be available not only for a whole sequence (as, for instance, in a classification task) but also for parts of a sequence (as would be the case in a synchronous translation task in which the targets are known after each item of the sequence). In the second case, a third learning mode, online learning, is possible: the contribution to the error function of each partial target may be used to update some of the learnable parameters even before the complete sequence has been processed. Online learning is the only possible choice when the learning set consists of a single sequence without a defined endpoint or when patterns can only be presented once.4.11
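The three schedules could be sketched as follows (our own illustration); per_pattern_grad and per_item_grad are assumed helper functions returning the gradient of the contribution of one whole sequence, or of one partial target within a sequence, to the error.

```python
def batch_epoch(params, learning_set, lr):
    """Batch: accumulate the gradients over the whole learning set, then update once."""
    total = [np.zeros_like(p) for p in params]
    for seq, target in learning_set:
        total = [g + dg for g, dg in zip(total, per_pattern_grad(params, seq, target))]
    return [p - lr * g for p, g in zip(params, total)]

def pattern_epoch(params, learning_set, lr):
    """Pattern: update right after each pattern's contribution has been computed."""
    for seq, target in learning_set:
        params = [p - lr * g for p, g in zip(params, per_pattern_grad(params, seq, target))]
    return params

def online_pass(params, sequence_with_targets, lr):
    """Online: update after each partial target, before the sequence has ended.
    (Bookkeeping of the state carried between items is omitted for brevity.)"""
    for item, target in sequence_with_targets:
        params = [p - lr * g for p, g in zip(params, per_item_grad(params, item, target))]
    return params
```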
A third division has already been mentioned. Most learning algorithms for DTRNN do not change the architecture during the learning process. However, there are some algorithms that modify the architecture of the DTRNN while training it (for example, the recurrent cascade correlation algorithm by Fahlman (1991) adds neurons to the network during training).
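As a very loose illustration of this idea (and not of Fahlman's actual recurrent cascade correlation algorithm), one could enlarge the state space of the toy DTRNN of the earlier sketches whenever the error stops decreasing, and then resume training:

```python
def add_state_neuron(params, rng, scale=0.1):
    """Grow the toy DTRNN by one state neuron, enlarging the affected
    weight matrices, biases and initial state while keeping the old values."""
    Wxx, Wxu, bx, Wyx, by, x0 = params
    n = Wxx.shape[0]
    Wxx_new = np.zeros((n + 1, n + 1))
    Wxx_new[:n, :n] = Wxx                                     # keep existing weights
    Wxx_new[n, :]  = rng.normal(scale=scale, size=n + 1)      # new neuron starts with small weights
    Wxx_new[:n, n] = rng.normal(scale=scale, size=n)
    Wxu_new = np.vstack([Wxu, rng.normal(scale=scale, size=(1, Wxu.shape[1]))])
    Wyx_new = np.hstack([Wyx, rng.normal(scale=scale, size=(Wyx.shape[0], 1))])
    return [Wxx_new, Wxu_new, np.append(bx, 0.0), Wyx_new, by, np.append(x0, 0.0)]
```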