Suppose you have a model whose final layer is a dot product between a vector produced only from the context and a vector produced only from the response. I use models of this form as "level 1" models because they let you precompute a fast serving index, but note that the following trick will not apply to architectures like bidirectional attention, where the two inputs interact before the final score. Anyway, for these models you can train more efficiently by drawing the negatives from the same mini-batch: a batch of B (left, right) pairs gives each left one positive and B-1 negatives essentially for free. This is a well-known trick, but I couldn't find anybody describing explicitly how to do it in PyTorch.
Structure your model to have a leftforward and a rightforward like this:
```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    ...  # the two towers, leftforward and rightforward, are defined here

    def forward(self, leftinput, rightinput):
        leftvec = self.leftforward(leftinput)
        rightvec = self.rightforward(rightinput)
        return torch.mul(leftvec, rightvec).sum(dim=1, keepdim=True)
```
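This split is also what makes the "level 1" serving index cheap: rightforward never sees the left input, so you can run it over your entire candidate set offline and answer a query with a single matrix multiply against the stored vectors. Here is a minimal sketch of query-time scoring; the tensors are random stand-ins (and the 10000 x 8 shapes are made up), where in practice they would come from model.rightforward and model.leftforward:

```python
import torch

# Offline: run rightforward over every candidate once and store the vectors.
# (Random stand-in here; in practice index = model.rightforward(all_candidate_inputs).)
index = torch.randn(10000, 8)    # (num_candidates, embed_dim)

# Query time: encode only the left side; one matrix multiply scores every candidate.
# (Again a stand-in for query = model.leftforward(query_input).)
query = torch.randn(1, 8)        # (1, embed_dim)

scores = torch.mm(query, index.t())      # (1, num_candidates) dot products
best_candidate = scores.argmax(dim=1)    # id of the highest-scoring right
```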
At training time, compute leftforward and rightforward for your mini-batch separately and hand both matrices of vectors to the batch loss:
```python
...
criterion = BatchPULoss()
model = MyModel()
...
leftvectors = model.leftforward(batch.leftinput)
rightvectors = model.rightforward(batch.rightinput)

(loss, preds) = criterion.fortraining(leftvectors, rightvectors)
loss.backward()

# "preds" contains the highest-scoring right for each left, so you can,
# for instance, calculate "mini-batch precision at 1". The gold label for
# row i is i, because each left's true partner is the right in the same
# position of the batch.
gold_labels = torch.arange(0, batch.batch_size).long().cuda()
n_correct += (preds.data == gold_labels).sum()
...
```
The loss is where the in-batch negatives happen: a single matrix multiply scores every left against every right in the mini-batch, and cross-entropy treats each row as a classification problem whose correct answer is the diagonal entry.

```python
import torch

class BatchPULoss():
    def __init__(self):
        self.loss = torch.nn.CrossEntropyLoss()

    def fortraining(self, left, right):
        # (B, d) x (d, B) -> (B, B): score of every left against every right.
        outer = torch.mm(left, right.t())
        # The true pair for row i sits in column i, so the labels are just 0..B-1.
        labels = torch.arange(0, outer.shape[0]).long().cuda()
        loss = self.loss(outer, labels)
        _, preds = torch.max(outer, dim=1)
        return (loss, preds)

    # Outside of training, fall through to the underlying CrossEntropyLoss.
    def __call__(self, *args, **kwargs):
        return self.loss(*args, **kwargs)
```
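To sanity-check the shapes end to end, here is a self-contained toy version: the linear towers, dimensions, and batch size below are made up, the random tensors stand in for a real featurized mini-batch, and a CUDA device is assumed because the loss above hardcodes .cuda() for its labels.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Hypothetical stand-in for MyModel with trivial linear towers."""
    def __init__(self, left_dim=32, right_dim=16, embed_dim=8):
        super().__init__()
        self.left = nn.Linear(left_dim, embed_dim)
        self.right = nn.Linear(right_dim, embed_dim)

    def leftforward(self, leftinput):
        return self.left(leftinput)

    def rightforward(self, rightinput):
        return self.right(rightinput)

    def forward(self, leftinput, rightinput):
        leftvec = self.leftforward(leftinput)
        rightvec = self.rightforward(rightinput)
        return torch.mul(leftvec, rightvec).sum(dim=1, keepdim=True)

batch_size = 4
model = ToyModel().cuda()
criterion = BatchPULoss()   # as defined above

# Random stand-ins for a featurized mini-batch of (left, right) pairs.
leftvectors = model.leftforward(torch.randn(batch_size, 32).cuda())
rightvectors = model.rightforward(torch.randn(batch_size, 16).cuda())

loss, preds = criterion.fortraining(leftvectors, rightvectors)
loss.backward()

# Mini-batch precision at 1: how often the true (diagonal) pair wins its row.
gold_labels = torch.arange(0, batch_size).long().cuda()
n_correct = (preds == gold_labels).sum()
print(loss.item(), n_correct.item(), "of", batch_size)
```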