How do you evaluate the in-domain generalization of the honesty probe?

I am curious how you evaluate the in-domain generalization of the honesty probe. I found this statement in the paper:

`With this setup, the resulting LAT reading vector reaches a classification accuracy of over 90% in distinguishing between held-out examples where the model is instructed to be honest or dishonest.`

The trained honesty probe predicts an honesty score at each token position in the response, so a complete response is associated with a sequence of scores rather than a single one. How do you turn that sequence of honesty scores into a binary prediction of whether the model was instructed to be honest or dishonest?
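To make the question concrete, here is a minimal sketch of what I imagine such a reduction could look like. It is only my guess, not the procedure from the paper: the function names, the pooling options (`mean` over tokens or the `last` token), the threshold of 0.0, and the toy scores are all hypothetical.

```python
from statistics import mean


def predict_honest(token_scores, threshold=0.0, reduce="mean"):
    """Collapse per-token honesty scores into one binary prediction.

    Hypothetical helper: the pooling rule and threshold are assumptions,
    not something stated in the paper.
    """
    if not token_scores:
        raise ValueError("need at least one token score")
    if reduce == "mean":
        pooled = mean(token_scores)   # average over all token positions
    elif reduce == "last":
        pooled = token_scores[-1]     # use only the final token position
    else:
        raise ValueError(f"unknown reduce: {reduce}")
    return pooled > threshold         # True means "predicted honest"


def accuracy(examples, **kwargs):
    """Fraction of (token_scores, is_honest) pairs classified correctly."""
    correct = sum(
        predict_honest(scores, **kwargs) == label for scores, label in examples
    )
    return correct / len(examples)


# Made-up held-out examples: (per-token scores, instructed-to-be-honest label).
held_out = [
    ([0.8, 0.5, 0.9], True),
    ([-0.4, -0.7, 0.1], False),
    ([0.2, -0.1, 0.3], True),
]

print(accuracy(held_out))                 # mean pooling
print(accuracy(held_out, reduce="last"))  # last-token pooling
```

On this toy data the two pooling rules already give different accuracies, which is why I would like to know which reduction (if any) was used for the reported number.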

How do you evaluate the in-domain generalization of the honesty probe? #54
