Support for Structured Languages like Code and Context-Free Grammars #54

@QinengWang-Aiden

I've been exploring SONAR's multilingual capabilities and am impressed by its ability to handle diverse languages through its encoder-decoder architecture. I'm wondering if it would be possible to extend SONAR to support structured languages, such as programming languages or other context-free grammars, by treating them as new languages in the system.

Given SONAR's language-agnostic design and its use of SentencePiece tokenization, it seems theoretically possible to train SONAR to handle structured languages by defining them as new language codes (e.g., "py_Code" for Python, "java_Code" for Java, or "cfg_Form" for formal grammars).
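
To make the idea concrete, here is a minimal sketch of what I have in mind. It assumes that a language code acts as a special token prepended to the token sequence, which is my understanding of NLLB-style tokenizers and may not match SONAR's internals. The names `ToyVocab`, `add_language_code`, and `encode_with_lang` are hypothetical and not part of SONAR's API, and the whitespace split is only a placeholder for real SentencePiece subword tokenization.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVocab:
    """Hypothetical stand-in for a subword vocabulary plus language-code tokens."""
    token_to_id: dict = field(default_factory=dict)

    def add(self, token: str) -> int:
        # Assign the next free id if the token is new; otherwise reuse its id.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]


def add_language_code(vocab: ToyVocab, lang_code: str) -> int:
    # Register a language code such as "py_Code" as a special token.
    return vocab.add(f"__{lang_code}__")


def encode_with_lang(vocab: ToyVocab, text: str, lang_code: str) -> list:
    # Prepend the language token, then map each (whitespace-split) token to an id.
    lang_token = f"__{lang_code}__"
    if lang_token not in vocab.token_to_id:
        raise KeyError(f"unknown language code: {lang_code}")
    return [vocab.token_to_id[lang_token]] + [vocab.add(tok) for tok in text.split()]


vocab = ToyVocab()
for code in ("eng_Latn", "py_Code", "java_Code", "cfg_Form"):
    add_language_code(vocab, code)

ids = encode_with_lang(vocab, "def f ( x ) : return x", "py_Code")
```

In this toy setup, a structured language differs from a natural language only in its language token and in the text it is paired with. Whether that is sufficient for SONAR, or whether the tokenizer and embedding table would need to be extended and retrained, is exactly what I am unsure about.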

Therefore, may I ask the following:

  1. Is it possible to train a structured language as a new 'language', so that SONAR can encode and decode its expressions?
  2. Will any modifications be needed to handle strict syntactic rules?
    If so, may I further ask how I can add a new language to SONAR, i.e., what the training recipe is?

Looking forward to hearing from you :)
