Missing normalization based on phrase length?

According to the paper (section 2.2), the representation of each phrase (non-terminal) is taken to be the *average* of the representations of its constituent tokens.

[The code](https://github.com/dheerajrajagopal/SelfExplain/blob/master/model/SE_XLNet.py#L139) actually does a batch matrix multiplication, and therefore computes the *sum* of the hidden token representations rather than their average. This may affect both the magnitude and the direction of the phrase-level representation after the activation is applied.
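To make the difference concrete, here is a minimal sketch. It is not the repository's code: the tensor names (`hidden`, `phrase_mask`), the shapes, and the assumption that the phrase index matrix is a binary 0/1 mask are all mine. It shows that `torch.bmm` with such a mask yields a sum, and that dividing by the phrase length would give the average described in the paper.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes: 1 sentence, 4 tokens, hidden size 3.
hidden = torch.randn(1, 4, 3)  # (batch, tokens, hidden)

# Hypothetical binary phrase mask (assumption: 1 marks a token in the phrase).
# Phrase 0 covers tokens 0-1, phrase 1 covers tokens 0-3.
phrase_mask = torch.tensor([[[1.0, 1.0, 0.0, 0.0],
                             [1.0, 1.0, 1.0, 1.0]]])  # (batch, phrases, tokens)

# bmm with a 0/1 mask adds up the token vectors of each phrase.
phrase_sum = torch.bmm(phrase_mask, hidden)  # (batch, phrases, hidden)

# Dividing by the number of tokens per phrase turns the sum into a mean.
# clamp avoids division by zero for empty (padding) phrases.
lengths = phrase_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)
phrase_mean = phrase_sum / lengths

print(phrase_sum.norm(dim=-1))   # grows with phrase length
print(phrase_mean.norm(dim=-1))  # length-normalized
```

Before the activation, the two versions differ only by a per-phrase scale factor (the phrase length). Whether the direction also changes afterwards depends on the activation: a saturating one such as tanh would change it, while ReLU on its own would only rescale.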

Am I missing something?
