Training an Auto-regressive Model
Feed the correct (ground-truth) output back in at each step and learn to decode from it (teacher forcing).
Greedy Decoding
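Greedy decoding picks the single most probable next word at each step, with no way to undo an early mistake. A minimal sketch, using a hypothetical last-token probability table standing in for a real model (a real decoder conditions on the whole prefix and the source sentence x):

```python
VOCAB = ["<s>", "a", "b", "</s>"]
# Hypothetical table standing in for P_LM(y_i | y_1..y_{i-1}, x);
# here the next-word distribution depends only on the previous token.
PROBS = {
    "<s>": [0.0, 0.6, 0.3, 0.1],
    "a":   [0.0, 0.1, 0.5, 0.4],
    "b":   [0.0, 0.2, 0.2, 0.6],
}

def greedy_decode(max_len=10):
    """Take the argmax word at each step; no backtracking is possible."""
    out = ["<s>"]
    for _ in range(max_len):
        p = PROBS[out[-1]]
        out.append(VOCAB[max(range(len(p)), key=p.__getitem__)])
        if out[-1] == "</s>":
            break
    return out

print(greedy_decode())  # -> ['<s>', 'a', 'b', '</s>']
```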
Exhaustive Search Decoding
At step t there are V^t possible partial translations, where V is the vocabulary size. Far too expensive to enumerate.
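To see the blow-up concretely, with a hypothetical vocabulary of 50,000 words and a target length of 20:

```python
# Hypothetical numbers: V = vocabulary size, T = target sentence length.
V, T = 50_000, 20
candidates = V ** T          # every possible length-T output sequence
print(len(str(candidates)))  # -> 94 (a 94-digit count of candidates)
```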
Beam Search
Keep track of the K most probable partial translations (aka hypotheses).
A hypothesis $y_1, \dots, y_t$ has a score which is its log probability:
$$\text{score}(y_1, \dots, y_t) = \log P_{LM}(y_1, \dots, y_t \mid x) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \dots, y_{i-1}, x)$$
- Scores are all negative; a higher score is better
- We search for high-scoring hypotheses, tracking the top K on each step
- Not guaranteed to find the optimal solution, but much more efficient than exhaustive search
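The procedure above can be sketched as follows. This reuses a hypothetical last-token probability table in place of a real model; at each step every beam is expanded, the candidates are ranked by summed log probability, and only the top K survive:

```python
import math

VOCAB = ["a", "b", "</s>"]
# Hypothetical stand-in for P_LM(y_i | y_1..y_{i-1}, x), keyed on the
# previous token only; "<s>" starts every sequence.
PROBS = {
    "<s>": [0.6, 0.3, 0.1],
    "a":   [0.1, 0.5, 0.4],
    "b":   [0.2, 0.2, 0.6],
}

def beam_search(k=2, max_len=10):
    """Keep the k highest-scoring hypotheses; score = sum of log-probs."""
    beams = [(0.0, ["<s>"])]          # (score, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, toks in beams:     # expand every surviving hypothesis
            for word, p in zip(VOCAB, PROBS[toks[-1]]):
                candidates.append((score + math.log(p), toks + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, toks in candidates[:k]:   # keep only the top k
            if toks[-1] == "</s>":
                finished.append((score, toks))   # completed hypothesis
            else:
                beams.append((score, toks))
        if not beams:
            break
    finished.extend(beams)  # anything still alive at max_len counts too
    return max(finished, key=lambda c: c[0])[1]

print(beam_search(k=2))  # -> ['<s>', 'a', '</s>']
```

Note that hypotheses ending in `</s>` are set aside as complete rather than expanded further; this is exactly the stopping question discussed next.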
Problem with Beam Search Decoding
When to stop?
Different hypotheses may produce the <END> token at different timesteps. A hypothesis that produces <END> is complete: set it aside and continue exploring the remaining hypotheses.