The four stages derived from existing STR models are as follows:
- Transformation (Trans.) normalizes the input text image using the Spatial Transformer Network (STN [12]) to ease downstream stages.
- Feature extraction (Feat.) maps the input image to a representation that focuses on the attributes relevant for character recognition, while suppressing irrelevant features such as font, color, size, and background.
- Sequence modeling (Seq.) captures the contextual information within a sequence of characters so that the next stage can predict each character more robustly, rather than predicting each one independently.
- Prediction (Pred.) estimates the output character sequence from the identified features of an image.
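
To make the data flow between the four stages concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `STRPipeline` and the four stand-in functions are hypothetical names, images are nested lists of floats, and each stage is a toy placeholder (identity instead of an STN, column slicing instead of a CNN, neighbor averaging instead of a learned sequence model, per-step argmax instead of a real decoder). It only shows how the stages compose, not how any actual STR model implements them.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]      # H x W grayscale pixels (assumed representation)
Features = List[List[float]]   # T x C: one feature vector per horizontal step


@dataclass
class STRPipeline:
    """Chains the four stages; each field is a hypothetical stand-in callable."""
    trans: Callable[[Image], Image]
    feat: Callable[[Image], Features]
    seq: Callable[[Features], Features]
    pred: Callable[[Features], str]

    def __call__(self, image: Image) -> str:
        normalized = self.trans(image)    # Trans.
        visual = self.feat(normalized)    # Feat.
        contextual = self.seq(visual)     # Seq.
        return self.pred(contextual)      # Pred.


def identity_trans(image: Image) -> Image:
    # Stand-in for an STN: returns the image unchanged.
    return image


def column_feat(image: Image) -> Features:
    # Stand-in for a CNN: one feature vector per image column.
    return [list(col) for col in zip(*image)]


def neighbor_seq(features: Features) -> Features:
    # Stand-in for a sequence model: average each step with its neighbors,
    # so every output vector depends on more than one position.
    out = []
    for t in range(len(features)):
        window = features[max(0, t - 1): t + 2]
        out.append([sum(v) / len(window) for v in zip(*window)])
    return out


def argmax_pred(features: Features, charset: str = "ab") -> str:
    # Stand-in for a decoder: pick the highest-scoring character per step.
    return "".join(
        charset[max(range(len(f)), key=f.__getitem__)] for f in features
    )


pipeline = STRPipeline(identity_trans, column_feat, neighbor_seq, argmax_pred)
image = [[1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
result = pipeline(image)
```

Because each stage is just a callable with a fixed input and output type, one stage can be swapped (for example, a different `seq`) without touching the other three.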
 
Can I use a transformer block to generate contextual features?
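
One way to explore this question: the core of a transformer block is self-attention, in which every position in the feature sequence is rewritten as a weighted mix of all positions, so each output vector carries context from the whole sequence. The sketch below is a simplified, stdlib-only illustration and not a full transformer block: the function name `self_attention` is hypothetical, the query/key/value projections are assumed to be the identity, and it omits multiple heads, positional encodings, residual connections, layer normalization, and the feed-forward layer.

```python
import math
from typing import List


def self_attention(features: List[List[float]]) -> List[List[float]]:
    """Single-head scaled dot-product self-attention with identity projections.

    features: T x C list of per-step feature vectors.
    Returns a T x C list where each row mixes information from all T steps.
    """
    dim = len(features[0])
    out = []
    for query in features:
        # Similarity of this step to every step, scaled by sqrt(C).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of all feature vectors: the contextual feature.
        out.append([
            sum(w * v[c] for w, v in zip(weights, features))
            for c in range(dim)
        ])
    return out


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
ctx = self_attention(feats)
```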
CRNN
