Hypothesis
- Does the language modelling objective implicitly encode/learn the entire parse tree?
- Can I detect a path from the root of the dependency tree to each word in the sentence?
Previous Approach
- Define tasks which require syntactic information and benchmark how much of that information the representations have absorbed based on task performance (see the sketch after this list).
- CCG supertagging
- predict the parent and grandparent of each word in a sentence
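A minimal sketch of this style of probing, assuming PyTorch, frozen contextual embeddings, and placeholder inputs (`embeddings`, `head_positions`, `HIDDEN_DIM`, `MAX_LEN` are illustrative names, not from the papers): train only a small linear classifier on top of the frozen vectors to predict each word's parent position, and read off how much syntactic information the representations carry from the probe's accuracy.

```python
# Hypothetical probing setup (not the exact benchmarks above): a linear
# classifier on frozen contextual embeddings predicts each word's parent
# position in the dependency tree; only the probe's parameters are trained.
import torch
import torch.nn as nn

HIDDEN_DIM = 768   # e.g. BERT-base hidden size
MAX_LEN = 50       # maximum sentence length handled by this probe

probe = nn.Linear(HIDDEN_DIM, MAX_LEN)            # scores each position as the parent
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(embeddings, head_positions):
    """embeddings: (num_words, HIDDEN_DIM) frozen vectors for one sentence.
    head_positions: (num_words,) index of each word's parent in the sentence."""
    logits = probe(embeddings)                    # (num_words, MAX_LEN)
    loss = nn.functional.cross_entropy(logits, head_positions)
    optimizer.zero_grad()
    loss.backward()                               # gradients flow only into the probe
    optimizer.step()
    return loss.item()
```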
Contribution
For each pair of words in a sentence, the L2 distance between their embeddings in the vector space roughly approximates the number of hops between them in the dependency tree. They show that ELMo and BERT representations roughly capture this information.
This is interesting because ELMo and BERT pretraining involves none of this information, yet the models somehow learn it implicitly.
The basic idea is to project word embeddings to a vector space where the L2 distance between a pair of words in a sentence approximates the number of hops between them in the dependency tree. The proposed method shows that ELMo and BERT representations, trained with no syntactic supervision, embed many of the unlabeled, undirected dependency attachments between words in the same sentence.
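A minimal sketch of this idea, assuming PyTorch and placeholder data (`H` for one sentence's frozen embeddings, `tree_dist` for the gold hop-count matrix; the rank and learning rate are arbitrary choices here): learn a linear projection `B` so that squared L2 distances between projected word vectors approximate pairwise distances in the dependency tree.

```python
# Sketch of a distance probe: learn a linear map B so that squared L2 distance
# between projected word vectors approximates tree distance (number of hops).
import torch

HIDDEN_DIM = 768    # size of the frozen contextual embeddings
PROBE_RANK = 64     # rank of the learned projection (a free choice)

B = torch.randn(PROBE_RANK, HIDDEN_DIM, requires_grad=True)
optimizer = torch.optim.Adam([B], lr=1e-3)

def predicted_distances(H):
    """H: (num_words, HIDDEN_DIM). Returns the (num_words, num_words) matrix
    of squared L2 distances between projected word vectors."""
    proj = H @ B.T                                 # (num_words, PROBE_RANK)
    diff = proj.unsqueeze(1) - proj.unsqueeze(0)   # all pairwise differences
    return (diff ** 2).sum(dim=-1)

def probe_step(H, tree_dist):
    """tree_dist: (num_words, num_words) gold hop counts in the parse tree."""
    loss = (predicted_distances(H) - tree_dist).abs().mean()  # L1-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained on a treebank, the predicted distance matrix can be turned into an undirected tree (e.g. via a minimum spanning tree) and compared against the gold parse, which is how the paper evaluates how many dependency attachments the representations recover.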
Possible Extension
- does this claim hold across languages?
- comparison between Stanford Dependencies and Universal Dependencies
paper: https://nlp.stanford.edu/pubs/hewitt2019structural.pdf