feat: expression to match pronoun-be, word pair or contraction #2944

Open
hippietrail wants to merge 8 commits into Automattic:master from hippietrail:pronoun-be

Conversation

@hippietrail
Collaborator

Issues

N/A

Description

An expression that matches pronouns followed by inflections of the word "be", including when they're joined as contractions.

Using this Expr will avoid some potential pitfalls:

  • Only includes subject pronouns ("I", "we"; not "me", "us"; etc.)
  • Only includes finite forms of the verb ("am", "is", "was", etc.; not "be", "been", "being").

But it does not avoid mismatched pairs such as "I are" or "you is".
Nor does it support common "wrong apostrophes": ; or ´
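A rough sketch of what tolerating those two "wrong apostrophes" could look like, written against plain strings rather than Harper's token and Expr types. The names `APOSTROPHE_LIKE` and `is_contraction_lenient` are hypothetical and are not part of this PR or of harper-core; only the contraction list is taken from the diff.

```rust
// Contractions as listed in the PR's WordSet.
const CONTRACTIONS: [&str; 7] = ["i'm", "we're", "you're", "he's", "she's", "it's", "they're"];

// Assumption: the characters to accept in place of a real apostrophe.
// The description above names ';' and the acute accent U+00B4.
const APOSTROPHE_LIKE: [char; 3] = ['\'', ';', '\u{00B4}'];

/// Hypothetical helper: true if `token` is one of the contractions once any
/// apostrophe-like character has been folded to a plain apostrophe.
fn is_contraction_lenient(token: &str) -> bool {
    let normalized: String = token
        .chars()
        .map(|c| if APOSTROPHE_LIKE.contains(&c) { '\'' } else { c })
        .collect();
    CONTRACTIONS
        .iter()
        .any(|c| c.eq_ignore_ascii_case(&normalized))
}
```

Folding before comparing keeps the word list itself unchanged, so the strict and lenient variants could share one list.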

Potential improvements:

  • A constructor that includes common mistakes, especially omitted apostrophes.
  • A constructor that doesn't include "it is" or "it's", which are more prone to false positives.
  • A constructor to only include matching pairs: "I am" but not "I is".
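For the last item in the list, a minimal sketch of the agreement check a "matching pairs only" constructor would need. It works on plain strings, and `pronoun_be_agrees` is a hypothetical name, not an existing Harper function.

```rust
/// Hypothetical helper: true when a subject pronoun and a finite form of
/// "be" agree in person and number ("I am", "they were"), false for
/// mismatches such as "I is" or "you is".
fn pronoun_be_agrees(pronoun: &str, be_form: &str) -> bool {
    let pronoun = pronoun.to_lowercase();
    let be_form = be_form.to_lowercase();
    match pronoun.as_str() {
        "i" => matches!(be_form.as_str(), "am" | "was"),
        "he" | "she" | "it" => matches!(be_form.as_str(), "is" | "was"),
        "you" | "we" | "they" => matches!(be_form.as_str(), "are" | "were"),
        // Anything else (including object pronouns like "me") never agrees.
        _ => false,
    }
}
```

This covers standard agreement only; dialectal forms and the subjunctive ("if I were") would need extra handling.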

How Has This Been Tested?

Unit tests for all standard pairs and contractions using sentences harvested from GitHub.
Handcrafted unit tests for edge cases.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

Contributor

@86xsk 86xsk left a comment

One thing I'm wondering is whether it might make sense to implement this as a function of SequenceExpr rather than creating a dedicated Expr.

On one hand, having a dedicated type does help to organize things a bit; on the other, it is just a wrapper for a preset SequenceExpr under the hood.

Comment on lines +14 to +24
let expr = SequenceExpr::default().then_any_of(vec![
    Box::new(
        SequenceExpr::default()
            .then_subject_pronoun()
            .t_ws()
            .t_set(&["am", "are", "is", "was", "were"]),
    ),
    Box::new(WordSet::new(&[
        "i'm", "we're", "you're", "he's", "she's", "it's", "they're",
    ])),
]);
Contributor

I would consider LazyLocking this or constructing and storing this inside the struct, especially since constructing a WordSet (currently) requires O(n²) string comparisons.
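A minimal sketch of the LazyLock option, using only the standard library (`std::sync::LazyLock`, stable since Rust 1.80). A plain `Vec` stands in for the WordSet, and `CONTRACTIONS`, `BUILD_COUNT`, `build_count` and `is_pronoun_be_contraction` are hypothetical names for illustration, not code from this PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counts how often the initializer runs (for illustration only).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the PR's WordSet: built on first access, then reused.
static CONTRACTIONS: LazyLock<Vec<&'static str>> = LazyLock::new(|| {
    BUILD_COUNT.fetch_add(1, Ordering::Relaxed);
    vec!["i'm", "we're", "you're", "he's", "she's", "it's", "they're"]
});

/// Case-insensitive membership test against the lazily built list.
fn is_pronoun_be_contraction(word: &str) -> bool {
    CONTRACTIONS.iter().any(|c| c.eq_ignore_ascii_case(word))
}

/// Number of times the list has been constructed so far.
fn build_count() -> usize {
    BUILD_COUNT.load(Ordering::Relaxed)
}
```

However many times `is_pronoun_be_contraction` is called, the initializer closure runs once, which is the property the suggestion is after.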

Collaborator Author

> I would consider LazyLocking this or constructing and storing this inside the struct, especially since constructing a WordSet (currently) requires O(n²) string comparisons.

Hmm even more like an implementation detail? You make it sound like there's a more efficient way to implement WordSet?

Contributor

The biggest cost with WordSet is the duplicate checking it does whenever a value is inserted. You can get a sizable performance benefit (~10%) on the lint_essay benchmark simply by removing those checks in WordSet::add and WordSet::add_chars. (Since WordSet stores its words in a SmallVec, checking for duplicates requires scanning through all words already contained.)
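To make the quadratic cost concrete, here is a small sketch that counts string comparisons for a Vec-backed set with a duplicate check on every insert. `LinearSet`, `build_linear` and `build_hashed` are hypothetical stand-ins and do not reproduce the real WordSet or its SmallVec storage; the `HashSet` variant is shown only as one possible alternative.

```rust
use std::collections::HashSet;

/// Hypothetical Vec-backed set that scans for duplicates on every insert.
struct LinearSet {
    words: Vec<String>,
    comparisons: usize,
}

impl LinearSet {
    fn new() -> Self {
        Self { words: Vec::new(), comparisons: 0 }
    }

    /// Inserts `word` unless it is already present, counting each comparison.
    fn add(&mut self, word: &str) {
        for existing in &self.words {
            self.comparisons += 1;
            if existing.as_str() == word {
                return; // duplicate: nothing to insert
            }
        }
        self.words.push(word.to_string());
    }
}

/// Builds a LinearSet; n distinct words cost n * (n - 1) / 2 comparisons.
fn build_linear(words: &[&str]) -> LinearSet {
    let mut set = LinearSet::new();
    for w in words {
        set.add(w);
    }
    set
}

/// Alternative: a hash set deduplicates without scanning earlier entries.
fn build_hashed(words: &[&str]) -> HashSet<String> {
    words.iter().map(|w| w.to_string()).collect()
}
```

For the seven contractions in this PR the linear version performs 21 comparisons. Whether hashing would actually be faster for sets this small is something only a benchmark such as lint_essay could settle.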

I've dabbled locally with trying to optimize it before, but I didn't spend too much time on it and didn't finish anything.

Collaborator

I haven't dedicated time to that problem because the vast majority of WordSet constructions are done at startup, and thus don't have a tangible impact on linting latency.


use super::{Expr, SequenceExpr};

#[derive(Default)]
Contributor

I don't think you strictly need the Default derive here since it's a ZST, though I guess it doesn't hurt to have.

My one concern is that it may lead to usage of PronounBe::default() even though PronounBe alone would suffice as long as it remains a ZST.
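A short illustration of the ZST point, with `PronounBeUnit` as a hypothetical stand-in for a fieldless expression type. A unit struct occupies zero bytes, and its name is already a value, so `Default` adds a second spelling for the same thing.

```rust
use std::mem::size_of;

// Hypothetical fieldless type; the derive is kept here only to show that
// both spellings below produce the same (empty) value.
#[derive(Default)]
struct PronounBeUnit;

/// Size in bytes of the unit struct.
fn unit_size() -> usize {
    size_of::<PronounBeUnit>()
}

/// The bare name and `::default()` both construct the value.
fn both_spellings() -> (PronounBeUnit, PronounBeUnit) {
    (PronounBeUnit, PronounBeUnit::default())
}
```

This equivalence only holds while the type has no fields; once it stores anything, the bare name stops being a value.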

Collaborator Author

> I don't think you strictly need the Default derive here since it's a ZST, though I guess it doesn't hurt to have.

Oh I thought I deleted that! I was doing lots of experimentation with different ctors for optional inclusions such as the common misspellings without the apostrophes.

> My one concern is that it may lead to usage of PronounBe::default() even though PronounBe alone would suffice as long as it remains a ZST.

Removed it. Thanks.

@hippietrail
Collaborator Author

hippietrail commented Mar 16, 2026

> One thing I'm wondering is whether it might make sense to implement this as a function of SequenceExpr rather than creating a dedicated Expr.

Hmm I hadn't considered that. I did it this way because in my head it's similar to the 'spelled number', 'time unit', etc. expressions. Though I'm not sure there were any similar ones not made by me so I wonder what I based the decision on with my first one. I think 'fixed expression' might've been the OG.

> On one hand, having a dedicated type does help to organize things a bit; on the other, it is just a wrapper for a preset SequenceExpr under the hood.

I think the idea was that semantically it's its own expression so it made semantic sense. On my one hand SequenceExpr is designed to be usable for components like this and on my other hand it's an implementation detail. Might be time to get an opinion from @elijah-potter

Collaborator

@elijah-potter elijah-potter left a comment

It is preferable to have it as its own type. The organization is worth any potential performance trade-off. I can't imagine there is a trade-off, though, since the compiler and the downstream branch predictor tend to be pretty good at optimizing these situations.

However, as @86xsk alluded to, you should not be constructing those WordSet instances in the hot loop. Please refactor this to construct them as part of a new constructor.
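A sketch of the requested shape: build the word lists once in a constructor, store them in the struct, and only read them while matching. It operates on plain word slices rather than Harper's tokens, and `PronounBeSketch`, `match_len`, `contains_word` and the exact subject-pronoun list are assumptions made for illustration, not the refactor that landed in this PR.

```rust
/// Hypothetical stand-in for the expression: all lists are built in `new`.
struct PronounBeSketch {
    subject_pronouns: Vec<&'static str>,
    be_forms: Vec<&'static str>,
    contractions: Vec<&'static str>,
}

/// Case-insensitive lookup in a prebuilt list.
fn contains_word(list: &[&str], word: &str) -> bool {
    list.iter().any(|w| w.eq_ignore_ascii_case(word))
}

impl PronounBeSketch {
    /// Construct once (for example at linter startup), not per match attempt.
    fn new() -> Self {
        Self {
            // Assumed list; the PR relies on then_subject_pronoun() instead.
            subject_pronouns: vec!["i", "we", "you", "he", "she", "it", "they"],
            be_forms: vec!["am", "are", "is", "was", "were"],
            contractions: vec!["i'm", "we're", "you're", "he's", "she's", "it's", "they're"],
        }
    }

    /// Returns how many leading words match: 1 for a contraction,
    /// 2 for a pronoun followed by a form of "be", None otherwise.
    /// Like the PR, it does not reject mismatched pairs such as "you is".
    fn match_len(&self, words: &[&str]) -> Option<usize> {
        let first = words.first()?;
        if contains_word(&self.contractions, first) {
            return Some(1);
        }
        let second = words.get(1)?;
        if contains_word(&self.subject_pronouns, first) && contains_word(&self.be_forms, second) {
            Some(2)
        } else {
            None
        }
    }
}
```

With this layout the hot path (`match_len`) allocates nothing, and the construction cost is paid once per `PronounBeSketch::new()` call.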

@hippietrail
Collaborator Author

> It is preferable to have it as its own type. The organization is worth any potential performance trade-off. I can't imagine there is a trade-off, though, since the compiler and the downstream branch predictor tend to be pretty good at optimizing these situations.

> However, as @86xsk alluded to, you should not be constructing those WordSet instances in the hot loop. Please refactor this to construct them as part of a new constructor.

Oh I didn't even realize! Fix is in...

3 participants