Hypothesis
- Does the language modelling objective implicitly encode/learn the entire parse tree?
- Can I detect a path from the root of the dependency tree to each word in the sentence?
Previous Approach
- Define tasks which require syntactic information and benchmark how much of that information the representations have absorbed based on task performance (see the sketch after this list).
- CCG supertagging
- predict the parent and grandparent of each word in a sentence
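A minimal sketch of this style of probing, assuming PyTorch, frozen contextual embeddings, and placeholder inputs (`embeddings`, `head_positions`, `HIDDEN_DIM`, `MAX_LEN` are illustrative names, not from the papers): train only a small linear classifier on top of the frozen vectors to predict each word's parent position, and read off how much syntactic information the representations carry from the probe's accuracy.

```python
# Hypothetical probing setup (not the exact benchmarks above): a linear
# classifier on frozen contextual embeddings predicts each word's parent
# position in the dependency tree; only the probe's parameters are trained.
import torch
import torch.nn as nn

HIDDEN_DIM = 768   # e.g. BERT-base hidden size
MAX_LEN = 50       # maximum sentence length handled by this probe

probe = nn.Linear(HIDDEN_DIM, MAX_LEN)            # scores each position as the parent
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(embeddings, head_positions):
    """embeddings: (num_words, HIDDEN_DIM) frozen vectors for one sentence.
    head_positions: (num_words,) index of each word's parent in the sentence."""
    logits = probe(embeddings)                    # (num_words, MAX_LEN)
    loss = nn.functional.cross_entropy(logits, head_positions)
    optimizer.zero_grad()
    loss.backward()                               # gradients flow only into the probe
    optimizer.step()
    return loss.item()
```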
Contribution
For each pair of words in a sentence, the L2 distance between their embeddings in the vector space roughly approximates the number of hops between them in the dependency tree. They show that ELMo and BERT representations roughly capture this information.
This is interesting because ELMo and BERT pretraining involves none of this information, yet the models somehow learn it implicitly.
The basic idea is to project word embeddings to a vector space where the L2 distance between a pair of words in a sentence approximates the number of hops between them in the dependency tree. The proposed method shows that ELMo and BERT representations, trained with no syntactic supervision, embed many of the unlabeled, undirected dependency attachments between words in the same sentence.
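A minimal sketch of this idea, assuming PyTorch and placeholder data (`H` for one sentence's frozen embeddings, `tree_dist` for the gold hop-count matrix; the rank and learning rate are arbitrary choices here): learn a linear projection `B` so that squared L2 distances between projected word vectors approximate pairwise distances in the dependency tree.

```python
# Sketch of a distance probe: learn a linear map B so that squared L2 distance
# between projected word vectors approximates tree distance (number of hops).
import torch

HIDDEN_DIM = 768    # size of the frozen contextual embeddings
PROBE_RANK = 64     # rank of the learned projection (a free choice)

B = torch.randn(PROBE_RANK, HIDDEN_DIM, requires_grad=True)
optimizer = torch.optim.Adam([B], lr=1e-3)

def predicted_distances(H):
    """H: (num_words, HIDDEN_DIM). Returns the (num_words, num_words) matrix
    of squared L2 distances between projected word vectors."""
    proj = H @ B.T                                 # (num_words, PROBE_RANK)
    diff = proj.unsqueeze(1) - proj.unsqueeze(0)   # all pairwise differences
    return (diff ** 2).sum(dim=-1)

def probe_step(H, tree_dist):
    """tree_dist: (num_words, num_words) gold hop counts in the parse tree."""
    loss = (predicted_distances(H) - tree_dist).abs().mean()  # L1-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained on a treebank, the predicted distance matrix can be turned into an undirected tree (e.g. via a minimum spanning tree) and compared against the gold parse, which is how the paper evaluates how many dependency attachments the representations recover.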
Possible Extension
- does this claim hold across languages?
- comparison between Stanford Dependencies and Universal Dependencies
paper: https://nlp.stanford.edu/pubs/hewitt2019structural.pdf