Skip to content

Missing normalization based on phrase length? #7

@akshaylive

Description

@akshaylive

According to the paper (section 2.2), constituent word representations are taken to be the average of token representations of the phrase (non-terminal) tokens.

The code actually does a batch matrix multiplication, and therefore achieves the sum of hidden token representations. This may affects both the magnitude and the direction of the phrase level representation after applying the activation.

Am I missing something?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions