Common Mistakes:

  1. Forgot to overfit on a single batch first (see the sketches after this list).
  2. Forgot to toggle train/eval mode for the net.
  3. Forgot to .zero_grad() before .backward().
  4. you passed softmaxed outputs to a loss that expects raw logits.
  5. you didn’t use bias=False for your Linear/Conv2d layers when they are followed by BatchNorm, or conversely forgot to include a bias for the output layer.
  6. thinking view() and permute() are the same thing (& incorrectly using view)
  7. forgetting to specify the dim/axis when calling sum, mean, or max; this fails silently, e.g. when you’re averaging batch errors for your loss function :/
  8. Not shuffling training data, or otherwise using batches that have too much correlation between the examples in each batch.
  9. you forgot that PyTorch’s .view() reads from the last dimension first and fills the last dimension first too, so you’re sending the wrong input to the model but aren’t getting an error since the shape is right.
  10. softmax or other loss operation over wrong dim
  11. you left a relu activation before a softmax
  12. using softmax instead of sigmoid in multilabel classification (sketch below)
  13. You initialize parameters to zero as opposed to a truncated normal or Xavier initialization.
  14. Using BatchNorm in a place other than right before an activation function.
  15. using softmax with a negative log-likelihood loss (NLLLoss) instead of log-softmax
  16. Not reshuffling the data every epoch
  17. Not masking your variable-length sequences (see the packing sketch after the LSTM notes)
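
A minimal sketch of mistakes 1-3 together; the model, data, and hyperparameters are made up purely for illustration. The idea is to drive the loss on one fixed batch toward zero, toggling train/eval mode and clearing gradients every step:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model, data, and hyperparameters, purely for illustration.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Grab ONE fixed batch and try to drive its loss toward zero before training for real.
x = torch.randn(16, 10)
y = torch.randint(0, 3, (16,))

model.train()                      # mistake 2: train mode (matters for dropout/BatchNorm layers)
for step in range(200):
    optimizer.zero_grad()          # mistake 3: clear stale gradients before backward()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

model.eval()                       # mistake 2: eval mode for validation/inference
with torch.no_grad():
    print(criterion(model(x), y).item())   # should be close to zero if the pipeline is sound
```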
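
Mistakes 4 and 15 are two sides of the same pairing rule: CrossEntropyLoss wants raw logits, while NLLLoss wants log-softmax (not plain softmax). A small sketch with made-up logits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 5)                 # raw, unnormalized model outputs (made up)
targets = torch.randint(0, 5, (8,))

# Wrong: CrossEntropyLoss applies log-softmax itself, so softmaxing first
# double-normalizes the outputs and quietly distorts the gradients.
bad = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), targets)

# Right: feed raw logits to CrossEntropyLoss ...
good = nn.CrossEntropyLoss()(logits, targets)

# ... or, equivalently, pair NLLLoss with log_softmax (not plain softmax).
also_good = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(bad.item(), good.item(), also_good.item())   # the last two match
```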
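
For mistake 5, a schematic block (layer sizes are arbitrary): the bias of a layer feeding into BatchNorm is redundant because BatchNorm's affine shift absorbs it, while the output layer keeps its bias:

```python
import torch.nn as nn

# The affine shift inside BatchNorm makes a bias in the preceding layer redundant,
# so drop it there; the output layer has no BatchNorm after it, so its bias stays.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),   # bias folded into the BN shift
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
head = nn.Linear(16, 10)   # e.g. after global pooling; keeps the default bias=True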
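
For mistakes 6 and 9, a tiny example of why view() and permute() give the same shape but different data, which is exactly why the bug is silent:

```python
import torch

x = torch.arange(6).reshape(2, 3)     # [[0, 1, 2],
                                       #  [3, 4, 5]]

# permute() actually moves data: element (i, j) goes to position (j, i).
print(x.permute(1, 0))                 # [[0, 3], [1, 4], [2, 5]]

# view() only reinterprets the same memory in row-major order
# (the last dimension varies fastest), so the result is different.
print(x.view(3, 2))                    # [[0, 1], [2, 3], [4, 5]]

# Both results have shape (3, 2), so mixing them up raises no error;
# the model just silently sees scrambled inputs.
```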
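
For mistakes 7 and 10, a sketch of what happens when the dim argument is omitted or points at the wrong axis (tensor sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

scores = torch.randn(32, 10)           # (batch, classes), sizes made up

# Without dim, reductions silently collapse *all* dimensions.
print(scores.mean())                   # one scalar, averaged over batch AND classes
print(scores.mean(dim=0).shape)        # torch.Size([10]): per-class average over the batch

# Same story for softmax: normalize over the class dim, not the batch dim.
wrong = F.softmax(scores, dim=0)       # columns sum to 1, i.e. normalized across the batch
right = F.softmax(scores, dim=1)       # rows sum to 1, i.e. per-example class probabilities
print(right.sum(dim=1)[:3])            # each entry ~1.0
```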
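
For mistake 12, a sketch of multilabel classification with a made-up batch: each label is an independent yes/no decision, so a per-label sigmoid (here via BCEWithLogitsLoss) is the right tool, not softmax:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 6)                        # 4 examples, 6 independent labels (made up)
targets = torch.randint(0, 2, (4, 6)).float()     # each label can be on or off

# Softmax forces the 6 scores to compete and sum to 1, which is wrong when an
# example can carry several labels at once. Use a per-label sigmoid instead,
# here via BCEWithLogitsLoss (which applies the sigmoid internally for stability).
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())
```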

In using LSTMs:

  1. not setting shuffle to False, so the data gets reshuffled after each epoch of training,
  2. not making the network stateful across batches, only resetting the state after each epoch,
  3. not checking if increasing the regularization strength increases the loss as well
  4. not checking the activation/gradient distributions per layer to ensure a uniform distribution of activations across the applicable range of the activation function,
  5. not checking whether the final learning rate sits at the edge of the search interval, in which case you might be missing the optimal hyperparameters,
  6. Not performing gradient clipping / normalization on LSTM networks (sketch below)
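
A sketch of where gradient clipping goes in an LSTM training step; the toy model, data, and max_norm value are made up for illustration:

```python
import torch
import torch.nn as nn

# Toy model, data, and max_norm, just to show where clipping goes in the loop.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(4, 20, 8)              # (batch, time, features)
y = torch.randn(4, 1)

optimizer.zero_grad()
out, _ = lstm(x)
loss = nn.functional.mse_loss(head(out[:, -1]), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # clip BEFORE the optimizer step
optimizer.step()
```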
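
And for masking variable-length sequences (item 17 in the first list), one common approach is packing, sketched here with made-up lengths; packing is one option, explicit masks applied to the loss are another:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three padded sequences with true lengths 5, 3, and 2 (feature size 8, all made up).
padded = torch.randn(3, 5, 8)           # (batch, max_time, features)
lengths = torch.tensor([5, 3, 2])

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Packing tells the LSTM where each sequence really ends, so the padded
# time steps never contribute to the hidden state (or, later, to the loss).
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                         # torch.Size([3, 5, 16]); steps past each length are zeros
```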

Beginner Notes

  1. Input to a Linear layer must match the layer’s dtype, which is float32 by default. If the original numpy array is float32 then the converted torch tensors are also float32 and everything works out; numpy defaults to float64, though, so arrays created without an explicit dtype will trigger a dtype-mismatch error (sketch below).
  2. The default CrossEntropyLoss already applies log-softmax before calculating the loss, so pass it raw logits. The target variable should contain a 1-dim array of ground-truth class indices, not one-hot vectors (sketch below).
  3. MSE Loss calculates the loss elementwise: difference -> square -> mean.
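
A sketch of the dtype point in note 1, with throwaway shapes; the exact error message text varies by PyTorch version:

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                              # parameters are float32 by default

ok = torch.from_numpy(np.random.rand(3, 4).astype(np.float32))
print(layer(ok).dtype)                               # torch.float32, works fine

doubles = torch.from_numpy(np.random.rand(3, 4))     # numpy defaults to float64
try:
    layer(doubles)                                   # float64 input vs float32 weights
except RuntimeError as err:
    print(err)                                       # dtype mismatch error
print(layer(doubles.float()).shape)                  # fix: cast the input (or the whole model)
```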
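
And a sketch of note 2 with made-up logits and labels; recent PyTorch versions also accept probability targets, but class indices remain the standard form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)                   # raw logits for 4 examples, 3 classes (made up)
targets = torch.tensor([2, 0, 1, 1])         # class *indices*, shape (4,), not one-hot
print(nn.CrossEntropyLoss()(logits, targets).item())

# If your labels arrive one-hot, convert them back to indices first.
one_hot = F.one_hot(targets, num_classes=3)
print(nn.CrossEntropyLoss()(logits, one_hot.argmax(dim=1)).item())   # same value as above
```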