The four stages derived from existing STR models are as follows:
- Transformation (Trans.) normalizes the input text image using the Spatial Transformer Network (STN [12]) to ease downstream stages.
- Feature extraction (Feat.) maps the input image to a representation that focuses on the attributes relevant for character recognition, while suppressing irrelevant features such as font, color, size, and background.
- Sequence modeling (Seq.) captures contextual information within the character sequence so that the next stage can predict each character more robustly than it would by treating characters independently.
- Prediction (Pred.) estimates the output character sequence from the identified features of an image.
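
The four stages above compose into a single pipeline in which each stage can be swapped independently. The following is a minimal sketch of that composition, not the implementation of any particular STR model: the names `STRPipeline`, `columns_as_features`, `greedy_decode`, and `CHARSET` are hypothetical, and the toy stages are placeholders standing in for an STN, a CNN backbone, a sequence model, and a real decoder.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]      # H x W grayscale image (hypothetical type alias)
Features = List[List[float]]   # T x D sequence of feature vectors

CHARSET = "abc"                # toy alphabet, assumed for illustration only


@dataclass
class STRPipeline:
    """Chains the four stages: Trans. -> Feat. -> Seq. -> Pred."""
    transform: Callable[[Image], Image]
    extract: Callable[[Image], Features]
    model_sequence: Callable[[Features], Features]
    predict: Callable[[Features], str]

    def __call__(self, image: Image) -> str:
        normalized = self.transform(image)           # Trans.
        visual = self.extract(normalized)            # Feat.
        contextual = self.model_sequence(visual)     # Seq.
        return self.predict(contextual)              # Pred.


def identity(x):
    """Placeholder for a stage that is switched off."""
    return x


def columns_as_features(image: Image) -> Features:
    """Toy feature extractor: treat each image column as one time step."""
    return [list(column) for column in zip(*image)]


def greedy_decode(features: Features) -> str:
    """Toy predictor: pick the highest-scoring character at each time step."""
    return "".join(
        CHARSET[max(range(len(step)), key=step.__getitem__)] for step in features
    )


pipeline = STRPipeline(
    transform=identity,            # no rectification
    extract=columns_as_features,
    model_sequence=identity,       # no contextual modeling
    predict=greedy_decode,
)

# 3 rows (one per character in CHARSET) x 2 columns (two time steps).
toy_image = [
    [0.9, 0.1],
    [0.1, 0.8],
    [0.0, 0.1],
]
decoded = pipeline(toy_image)
```

Because each stage is just a callable, replacing `model_sequence` with a different contextual model leaves the other three stages untouched.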
Can I use a transformer block to generate contextual features?
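
One way to explore this question is to look at self-attention, the core operation of a transformer block, applied to the sequence of visual features that the Seq. stage receives. The sketch below is a simplification under stated assumptions: it uses a single head with identity query/key/value projections, and it omits the learned projections, positional encoding, residual connections, layer normalization, and feed-forward sublayer that a full transformer block would include. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the input features themselves
    (identity projections); a real block would learn these projections.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    contextual = []
    for query in features:
        # Similarity of this time step to every time step in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in features
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all time steps, so it
        # carries information from the whole sequence.
        contextual.append(
            [sum(w * value[j] for w, value in zip(weights, features))
             for j in range(dim)]
        )
    return contextual


# Three time steps with two-dimensional features.
visual_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual_features = self_attention(visual_features)
```

The output has the same shape as the input (one vector per time step), which is what makes an attention-based module a candidate drop-in for the Seq. stage. Note that attention by itself is order-agnostic, so some form of positional information would likely be needed in practice.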