Labels and problem with Classification model #91

FernandoFS18 · 2023-05-14T19:12:49Z


Hi Lambeq community,

I started a small research project and wanted to include a quantum-based model to compare results and study the current state. While building my own model from the tutorials provided, I ran into two main questions.

What is the idea behind defining the labels of the data in this two-dimensional binary way? And if I were to build a model for multiple labels, what would be the correct way to define the labels for use with lambeq?
I replicated the steps of the classical case with my own dataset, which contains over 900 sentences classified into two categories: 1 - the user is addressing the bot directly, 0 - the user is not addressing the bot. The idea is to build a model with lambeq that can make this kind of classification. The problem is that I get really poor results (rarely above 55% accuracy), no matter how I set the hyperparameters. Compared with the example notebook, the only change is the addition of the atomic type PREPOSITIONAL_PHRASE = Ty('p').

*Edit: Here is an example of the sentences I am working with:
0 i really do not like horror games .
0 I believe that exercise is crucial for staying healthy .
0 i love comic books they keep me entertain .
0 i like a lot of different music .
1 Do you like coffee ?
1 what is your name ?
1 can we talk about the batman film ?
1 What are your thoughts on animal testing ?

Any ideas/tips to improve my model?
Thanks a lot for the help.

dimkart · 2023-05-15T10:51:30Z


Hi, regarding your first question: if you have $n$ classes, the number of qubits you need on your S wires to represent them is $\lceil \log_2(n) \rceil$. For binary classification (i.e. 2 labels) you need 1 qubit (remember that 1 qubit is equivalent to a 2-dimensional vector). With 4 labels you need 2 qubits, with 8 labels 3 qubits, and so on.
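To make the counting rule concrete, here is a minimal sketch in plain Python (it does not call lambeq). The helper name `qubits_for_classes` is hypothetical; it rounds up, on the assumption that a class count that is not a power of two still needs a whole number of qubits.

```python
def qubits_for_classes(n_classes: int) -> int:
    """Smallest q such that 2**q basis states can hold n_classes labels."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2 and avoids
    # floating-point rounding issues.
    return (n_classes - 1).bit_length()


for n in (2, 4, 6, 8):
    q = qubits_for_classes(n)
    print(f"{n} classes -> {q} qubit(s), output dimension {2 ** q}")
```

For a class count such as 6 the output dimension (8) is larger than the number of classes, so some outputs do not correspond to any label.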

Regarding the losses: for a binary classification task you should use binary cross-entropy loss, while for a multi-class classification task you should use the generalisation of this function to many labels, known as categorical cross-entropy. Have a look at lambeq's tutorial for classification for more information.
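As a rough illustration of how the two losses relate, the sketch below implements both in plain Python. The function names and the ordering of the one-hot labels (`[P(class 0), P(class 1)]`) are assumptions made for this example, not lambeq's API. With two classes and one-hot labels, the categorical form reduces to the binary one, which is one way to read the two-dimensional label format.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean BCE; y_true holds 0/1 labels, p_pred holds predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean CCE; y_true holds one-hot rows, p_pred holds rows of probabilities."""
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return total / len(y_true)


# Two sentences: the first has label 1, the second has label 0.
labels = [1, 0]
p_label_1 = [0.8, 0.3]

# The same data written as two-dimensional one-hot labels.
one_hot = [[0, 1], [1, 0]]
probs = [[0.2, 0.8], [0.7, 0.3]]

bce = binary_cross_entropy(labels, p_label_1)
cce = categorical_cross_entropy(one_hot, probs)
print(bce, cce)  # the two values agree up to floating-point error
```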

Regarding your second question: from your description, it's really hard to say what could be wrong with your model. Your dataset seems very simple, and classifying a sentence correctly appears to depend only on the last token (whether it is a question mark or not). My guess is that there is something wrong with your code. Here are a few suggestions:

Make sure the two classes in your dataset are balanced, both for train and test parts.
Try to use F-score to evaluate your model (more info in this tutorial)
Visualise the diagrams to make sure the text is converted properly.
If you are using a syntax-based model try a simpler one, e.g. a bag-of-words model or a word-sequence model (spiders or stairs-reader).
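The first two suggestions can be checked with a few lines of plain Python. This is only a sketch: `class_balance` and `f1_score` are hypothetical helper names, and the labels below are made-up toy values, not results from the dataset in question.

```python
from collections import Counter


def class_balance(labels):
    """Fraction of the examples that falls into each class."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def f1_score(y_true, y_pred, positive=1):
    """F1 for the class `positive`: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels and predictions, for illustration only.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]
print(class_balance(y_true))
print(f1_score(y_true, y_pred))
```

Running `class_balance` separately on the train, dev and test labels shows whether each split is balanced on its own.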

Let us know how it goes.

2 replies

FernandoFS18 May 21, 2023
Author

Hi @dimkart,

First of all, thanks a lot for all the answers! I still have some doubts remaining:

Regarding your first response, I now understand the meaning behind the label definition, but I have a follow-up question: what happens to the number of qubits required if we have a classification task with 6 categories? Would we have to round ~2.59 up to 3 (so a 6-dimensional vector)?

As for the second part of my problem, there are also sentences in my data with label 1 that do not end with the '?' symbol, such as 'play the last song from yesterday .', so I think it is not just a matter of identifying the last token.
I have also tried some of your ideas. I made sure that the classes are balanced, which they now are, and I have switched from the syntax-based model using BobcatParser to the bag-of-words model using spiders_reader. This is what some of the diagrams look like:

With these changes I managed to reach almost 70% accuracy, so at least it is no longer stuck around 50% as before, but the loss curve keeps growing for the validation set and I do not know why...

Again, any ideas are welcome, and thanks a lot for the help.

dimkart May 27, 2023

Really sorry for the delay on this. For your first question, you are correct: you'll have to round up to the next integer. This means you will have more outputs than actual classes, so you should be careful to mask out the redundant values and not include them in the loss calculation.
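One possible way to do the masking is sketched below in plain Python. It assumes that the classes are assigned to the first `n_classes` outputs and that the kept outputs are renormalised before the loss is computed; both choices, and the name `masked_cross_entropy`, are assumptions of this example rather than something lambeq prescribes.

```python
import math


def masked_cross_entropy(outputs, target, n_classes, eps=1e-12):
    """Cross-entropy over the first n_classes outputs only.

    outputs: non-negative scores (e.g. measurement probabilities) of length 2**q
    target:  integer index of the correct class, with 0 <= target < n_classes
    """
    kept = outputs[:n_classes]           # drop the redundant outputs
    norm = sum(kept)
    if norm <= 0:
        raise ValueError("kept outputs must have positive total mass")
    probs = [v / norm for v in kept]     # renormalise over the real classes
    return -math.log(max(probs[target], eps))


# 6 classes on 3 qubits: 8 outputs, of which the last two are unused.
outputs = [0.1, 0.2, 0.3, 0.1, 0.05, 0.05, 0.1, 0.1]
print(masked_cross_entropy(outputs, target=2, n_classes=6))
```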

From your diagrams, it's obvious the model is overfitting. If your classes are balanced, as you say (all of your datasets, train, dev and test, must be balanced), the first thing to try is a sequence model, such as StairsReader or CupsReader; a model that respects the order of the words might be more useful for your task than the bag-of-words model you're using. There are also other things you can do to avoid overfitting (google the terms):

Use regularisation, i.e. add a penalty term to your loss function
Use cross-validation
Use early stopping
If you have more data, augment your datasets
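Two of these ideas, the penalty term and early stopping, are simple enough to sketch in plain Python. The helper names, the L2 form of the penalty, and the `patience` rule are assumptions chosen for illustration; the validation losses are made-up numbers that first fall and then rise, which is the pattern a growing validation loss curve shows.

```python
def l2_penalty(weights, strength=0.01):
    """Penalty term to add to the loss: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1  # patience never exhausted


# Made-up validation losses: improving at first, then getting worse.
val_losses = [0.70, 0.62, 0.58, 0.60, 0.63, 0.67, 0.72]
best_epoch, stop_epoch = early_stopping(val_losses, patience=3)
print(f"keep the weights from epoch {best_epoch}, stop at epoch {stop_epoch}")
```

In practice the model weights saved at `best_epoch` would be the ones used for evaluation on the test set.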

Hope this helps.

FernandoFS18 · 2023-06-17T16:11:19Z


hi @dimkart,

Sorry for not answering sooner. First I wanted to thank you: the tip about using a sequence model got the accuracy up to 80%.
However, when revisiting the work I did on the syntax-based model, I noticed that the BobcatParser reader I was using to create the diagrams (following the Classical_pipeline tutorial) was omitting the last symbol (either the '?' or the '.'), which, as you said, should be essential for classifying these sentences. Can you tell me what I have to change in order to have it included in the diagrams? Or is this not possible with BobcatParser?
An example of what I mean:

Again, thanks a lot for the help.


dimkart · 2023-06-17T20:14:54Z


Yes, it's the default behaviour of Bobcat to ignore punctuation rules and tokens since they are not standard CCG. I've written a short method to show you how to fix this, by replacing the punctuation rule with backward application.

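from lambeq import CCGTree, CCGRule, BobcatParser, diagram2str
from discopy import Ty

def to_tree_with_punct(tree: CCGTree) -> CCGTree:
    """Turn a top-level `s punc` node into backward application, in place.

    Only the root is inspected; punctuation deeper in the tree is left as is.
    """
    s = Ty('s')
    # Match a root with exactly two children: a sentence and a punctuation token.
    if (len(tree.children) == 2 and tree.children[0].biclosed_type == s
            and tree.children[1].biclosed_type == Ty('punc')):
        # Retype the punctuation as s >> s so it takes the sentence on its left.
        tree.children[1].biclosed_type = s >> s
        tree.rule = CCGRule.BACKWARD_APPLICATION
    return tree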

parser = BobcatParser()

# We now start by getting the CCG tree of the sentence, 
# not directly the diagram.
t = parser.sentence2tree("What is the meaning of life ?")
print(t.deriv())

Output (without using the method) is:

 What      is     the  meaning    of     life   ?  
═══════  ═══════  ═══  ═══════  ═══════  ════  ════
s/(s\n)  (s\n)/n  n/n     n     (n\n)/n   n    punc
                  ───────────>           ─<U>      
                       n                  n        
                                ────────────>      
                                     n\n           
                  ──────────────────────────<      
                               n                   
         ───────────────────────────────────>      
                         s\n                       
────────────────────────────────────────────>      
                      s                            
─────────────────────────────────────────────────>p
                         s

Note the non-standard punc rule. This is not visible in the diagram, as you noticed:

print(diagram2str(t.to_diagram()))

Output:

  What       is      the   meaning      of     life
───────  ─────────  ─────  ───────  ─────────  ────
s·s.l·n  n.r·s·n.l  n·n.l     n     n.r·n·n.l   n
│  │  ╰───╯  │  │   │  ╰──────╯      │  │  ╰────╯
│  ╰─────────╯  │   ╰────────────────╯  │
│               ╰───────────────────────╯

Now if you use the function:

new_tree = to_tree_with_punct(t)
print(diagram2str(new_tree.to_diagram()))

Output:

  What       is      the   meaning      of     life    ?
───────  ─────────  ─────  ───────  ─────────  ────  ─────
s·s.l·n  n.r·s·n.l  n·n.l     n     n.r·n·n.l   n    s.r·s
│  │  ╰───╯  │  │   │  ╰──────╯      │  │  ╰────╯     │  │
│  ╰─────────╯  │   ╰────────────────╯  │             │  │
│               ╰───────────────────────╯             │  │
╰─────────────────────────────────────────────────────╯  │

Hope this helps.


FernandoFS18 · 2023-06-18T13:14:35Z


Hi @dimkart, thanks for the method and the explanation.
Unfortunately, I see no improvement in model performance when using the syntax-based model. I have already made sure the data is balanced, and I am using more sentences than the example dataset that you use for binary classification. I have been following the classical pipeline example step by step, and the only part that differs from yours is that for my dataset I have to introduce the prepositional_phrase atomic type in the parametrisation step. Apart from that, everything is the same at this point, and I do not know what is wrong or why the model is not learning.
Does the inclusion of another atomic type impact the performance that much?

Thanks for the help.

