stjiris/metadata-knowledge-distillation

 
 

metadata-knowledge-distillation

Trial Technique to improve information retrieval through dense vectors: Metadata Knowledge Distillation

In our dataset, multiple documents carry “Descritores” (descriptors): brief tags manually annotated by experts to identify each document's main subjects. For example, a tag may indicate that a crime was committed with a knife, or that a case is related to COVID-19. Given this annotation, we assumed that documents sharing a tag are, to some extent, related to one another, and therefore that the sentences of those documents have some small degree of entailment with each other.

We started by identifying the documents related to a subject (e.g., COVID-19) and encoded those documents’ sentences. The generated embeddings form a cluster. We then calculated the centroid of those embeddings and moved each embedding slightly (1-5%) towards the centroid. This minor adjustment is based on the assumption that those sentences are related and should therefore lie closer to one another. This process relies on the tags we have available. The idea is shown in the following figure:

Metadata Knowledge Distillation Ideology
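
The centroid adjustment described above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's code: the helper names `centroid` and `pull_towards_centroid`, the `alpha` parameter (the 1-5% step), and the toy 2-D vectors are all assumptions made for the example. It also assumes that "adjusting towards the centroid" means linear interpolation between each embedding and the cluster mean.

```python
from typing import List

Vector = List[float]


def centroid(embeddings: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def pull_towards_centroid(embeddings: List[Vector], alpha: float = 0.05) -> List[Vector]:
    """Move every embedding a fraction `alpha` of the way to the cluster centroid.

    alpha is hypothetical here; the text only says the shift is 1-5%.
    """
    c = centroid(embeddings)
    return [
        [(1.0 - alpha) * x + alpha * m for x, m in zip(vector, c)]
        for vector in embeddings
    ]


# Toy 2-D "sentence embeddings" standing in for sentences that share one tag.
tagged = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
targets = pull_towards_centroid(tagged, alpha=0.05)
```

With linear interpolation the cluster centroid itself does not move; only the spread around it shrinks by a factor of `1 - alpha`.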

Finally, the adjusted embeddings serve as gold labels for what the embeddings produced by the same model should look like. We then trained the model with a mean-squared error loss, similar to Multilingual Knowledge Distillation. The process is illustrated in the following figure:

Metadata Knowledge Distillation
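
As a rough illustration of the training signal, the sketch below computes the mean-squared error between a model's current embeddings and the adjusted gold embeddings, then takes one gradient step. It is a simplification under stated assumptions: the functions `mse_loss` and `mse_gradient`, the learning rate, and the toy vectors are hypothetical, and the step is applied directly to the embeddings as a stand-in for backpropagating the same gradient through the encoder's weights.

```python
from typing import List

Vector = List[float]


def mse_loss(predicted: List[Vector], target: List[Vector]) -> float:
    """Mean of squared differences over every component of every vector."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def mse_gradient(predicted: List[Vector], target: List[Vector]) -> List[Vector]:
    """Gradient of mse_loss with respect to each predicted component."""
    count = sum(len(vector) for vector in predicted)
    return [
        [2.0 * (p - t) / count for p, t in zip(p_vec, t_vec)]
        for p_vec, t_vec in zip(predicted, target)
    ]


# Current embeddings of the model (toy values) and the gold labels obtained
# by shifting them 5% towards their centroid at (1.0, 1.0).
current = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
gold = [[0.05, 0.05], [1.95, 0.05], [1.0, 2.9]]

loss_before = mse_loss(current, gold)

# One plain gradient-descent step; learning_rate is an arbitrary toy value.
learning_rate = 1.0
grads = mse_gradient(current, gold)
updated = [
    [p - learning_rate * g for p, g in zip(p_vec, g_vec)]
    for p_vec, g_vec in zip(current, grads)
]
loss_after = mse_loss(updated, gold)
```

In an actual setup the gold embeddings would be computed once from a frozen copy of the model, so that the targets do not drift while the trainable copy is updated.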
