In addition to the weight itself, the connection type in Bp, BpCon, has two additional variables:
float dwt
The change in weight value, as computed by the UpdateWeights function; it retains its value between updates so that the momentum terms can use the previous weight change.
float dEdW
The derivative of the error with respect to the weight, as computed by the Compute_dWt function. It will accumulate until the UpdateWeights function is called, which happens either on a trial-by-trial or an epoch-wise (batch mode) basis.
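For concreteness, a minimal sketch of such a connection type is shown below; the actual BpCon inherits its weight from a base connection class, so the flattened layout here is an assumption for illustration only:

    // Sketch only: the real BpCon inherits wt from a base connection
    // class, so this flattened layout is an assumption.
    class BpCon {
    public:
      float wt;    // the weight value itself
      float dwt;   // the weight change applied by UpdateWeights
      float dEdW;  // error derivative accumulated by Compute_dWt
    };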
The connection specifications control the behavior and updating of connections (see section 10.5 Connections). Thus, in Bp, this is where you will find things like the learning rate and momentum parameters. A detailed description of the parameters is given below:
float lrate
The learning rate, which scales the size of each weight change.
float momentum
The momentum parameter, which determines how strongly the previous weight change carries over into the current one.
MomentumType momentum_type
Determines how momentum enters into the weight update. In the AFTER_LRATE version, momentum is added to the weight change after the learning rate has been applied:

    cn->dwt = lrate * cn->dEdW + momentum * cn->dwt;
    cn->wt += cn->dwt;

This was used in the original pdp software. The
BEFORE_LRATE
model holds that momentum is something to be applied to the gradient
computation itself, not to the actual weight changes made. Thus,
momentum is computed before the learning rate is applied to the
weight gradient:
    cn->dwt = cn->dEdW + momentum * cn->dwt;
    cn->wt += lrate * cn->dwt;

Finally, both of the previous forms of momentum introduce a learning rate confound, since higher momentum values result in larger effective weight changes when the previous weight change points in the same direction as the current one. This is controlled for in the
NORMALIZED
momentum update, which normalizes the total
contribution of the previous and current weight changes (it also uses
the BEFORE_LRATE
model of when momentum should be applied):
    cn->dwt = (1.0 - momentum) * cn->dEdW + momentum * cn->dwt;
    cn->wt += lrate * cn->dwt;

Note that NORMALIZED actually uses a variable called
momentum_c
which is pre-computed to be 1.0 - momentum, so that this extra
computation is not incurred needlessly during actual weight updates.
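Since the three variants differ only in where the learning rate and the normalizing factor enter, it can be helpful to see them side by side. The following self-contained sketch gathers the three update rules from above into one routine; the function name, struct layout, and parameter passing are illustrative assumptions, not the actual BpConSpec interface:

    // Illustrative sketch: the three momentum update rules in one place.
    // Everything except the update arithmetic itself is an assumption.
    enum MomentumType { AFTER_LRATE, BEFORE_LRATE, NORMALIZED };

    struct Con { float wt, dwt, dEdW; };

    void update_weight(Con* cn, float lrate, float momentum,
                       float momentum_c,   // pre-computed 1.0 - momentum
                       MomentumType momentum_type) {
      switch (momentum_type) {
      case AFTER_LRATE:   // original pdp: momentum after the learning rate
        cn->dwt = lrate * cn->dEdW + momentum * cn->dwt;
        cn->wt += cn->dwt;
        break;
      case BEFORE_LRATE:  // momentum applied to the gradient itself
        cn->dwt = cn->dEdW + momentum * cn->dwt;
        cn->wt += lrate * cn->dwt;
        break;
      case NORMALIZED:    // normalized mix of previous and current changes
        cn->dwt = momentum_c * cn->dEdW + momentum * cn->dwt;
        cn->wt += lrate * cn->dwt;
        break;
      }
      cn->dEdW = 0.0f;    // assumed: the accumulated gradient is consumed here
    }

With NORMALIZED, the previous and current contributions are mixed with weights that sum to 1.0, so the effective step size stays comparable across different momentum values.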
float decay
The weight decay constant. If decay_fun is NULL, weight decay is not performed. However, if a decay function is set, the amount of decay it produces is scaled by this parameter. Note that weight decay is applied before either momentum or the learning rate, so that its effects are relatively invariant with respect to manipulations of these other parameters.
decay_fun
The weight decay function to use. Two standard functions are provided: Bp_Simple_WtDecay, which simply subtracts a fraction of the current weight value, and Bp_WtElim_WtDecay, which uses the "weight elimination" procedure of Weigend, Rumelhart, and Huberman (1991). This procedure allows large weights to avoid a strong decay pressure, while small weights are encouraged to be eliminated:
    float denom = 1.0f + (cn->wt * cn->wt);
    cn->dEdW -= spec->decay * ((2.0f * cn->wt) / (denom * denom));

The decay pressure this produces is roughly proportional to the weight itself for small weights, much as in simple decay, but it falls off toward zero for weights larger than 1, which is what allows large weights to escape elimination.
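Putting this together, hedged sketches of the two standard decay functions might look as follows; the struct definitions and signatures are assumptions for illustration, and only the update arithmetic is taken from the text above. Note that both adjust cn->dEdW directly, which is what makes decay take effect before momentum and the learning rate are applied:

    // Illustrative sketches of the two standard decay functions; the
    // struct layouts and signatures are assumptions.
    struct BpCon { float wt, dwt, dEdW; };
    struct BpConSpec { float decay; };

    // Simple decay: subtract a fraction of the current weight value.
    void Bp_Simple_WtDecay(BpConSpec* spec, BpCon* cn) {
      cn->dEdW -= spec->decay * cn->wt;
    }

    // Weight elimination (Weigend, Rumelhart, & Huberman, 1991): the
    // decay pressure vanishes for weights much larger than 1.
    void Bp_WtElim_WtDecay(BpConSpec* spec, BpCon* cn) {
      float denom = 1.0f + (cn->wt * cn->wt);
      cn->dEdW -= spec->decay * ((2.0f * cn->wt) / (denom * denom));
    }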