feat: expression to match pronoun-be, word pair or contraction by hippietrail · Pull Request #2944 · Automattic/harper

hippietrail · 2026-03-16T10:48:53Z

Issues

N/A

Description

An expression that matches pronouns followed by inflections of the word "be", including when they're joined as contractions.

Using this Expr will avoid some potential pitfalls:

Only includes subject pronouns ("I", "we"; not "me", "us"; etc.)
Only includes finite forms of the verb ("am", "is", "was", etc.), not "be", "been", or "being".

But it does not avoid mismatched pairs such as "I are" or "you is".
Nor does it support common "wrong apostrophes": ; or ´
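To make the matching behaviour described above concrete, here is a minimal standalone sketch on plain strings. It does not use harper's `Expr` machinery. The function name `is_pronoun_be` and the exact pronoun list are assumptions made for illustration; the verb forms and contractions are the ones listed in the diff.

```rust
/// Subject pronouns only ("I", "we"), never object forms ("me", "us").
/// Assumption: the real `then_subject_pronoun()` may cover a different set.
const SUBJECT_PRONOUNS: [&str; 7] = ["i", "we", "you", "he", "she", "it", "they"];

/// Finite forms of "be"; "be", "been" and "being" are deliberately absent.
const FINITE_BE: [&str; 5] = ["am", "are", "is", "was", "were"];

/// Contractions, spelled with the plain ASCII apostrophe only.
const CONTRACTIONS: [&str; 7] = [
    "i'm", "we're", "you're", "he's", "she's", "it's", "they're",
];

/// Returns true if `text` is a two-word pronoun-be pair or a contraction.
/// Mismatched pairs such as "you is" are accepted, as in the PR.
pub fn is_pronoun_be(text: &str) -> bool {
    let lower = text.to_lowercase();
    if CONTRACTIONS.contains(&lower.as_str()) {
        return true;
    }
    let mut words = lower.split_whitespace();
    match (words.next(), words.next(), words.next()) {
        // Exactly two words: a subject pronoun, then a finite form of "be".
        (Some(p), Some(v), None) => SUBJECT_PRONOUNS.contains(&p) && FINITE_BE.contains(&v),
        _ => false,
    }
}
```

In this sketch "I am", "they're" and the mismatched "you is" all match, while "me is", "we be" and "it´s" (wrong apostrophe) do not.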

Potential improvements:

A constructor that includes common mistakes, especially omitted apostrophes.
A constructor that doesn't include "it is" or "it's", which are more prone to false positives.
A constructor to only include matching pairs: "I am" but not "I is".
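The last improvement, accepting only agreeing pairs, could be sketched as a small agreement table. This is an illustration only: `agrees` is a hypothetical helper, not part of this PR, and it ignores the subjunctive ("if I were").

```rust
/// Returns true when `pronoun` and `verb` agree in person and number.
/// Hypothetical helper; simplification: subjunctive "I were" is rejected.
pub fn agrees(pronoun: &str, verb: &str) -> bool {
    let pronoun = pronoun.to_lowercase();
    let verb = verb.to_lowercase();
    match pronoun.as_str() {
        // First person singular.
        "i" => matches!(verb.as_str(), "am" | "was"),
        // Third person singular.
        "he" | "she" | "it" => matches!(verb.as_str(), "is" | "was"),
        // Second person and all plurals.
        "we" | "you" | "they" => matches!(verb.as_str(), "are" | "were"),
        // Anything else is not a subject pronoun.
        _ => false,
    }
}
```

With this table "I am" and "They were" pass, while "I is" and "you is" are rejected.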

How Has This Been Tested?

Unit tests for all standard pairs and contractions using sentences harvested from GitHub.
Handcrafted unit tests for edge cases.

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

86xsk

One thing I'm wondering is whether it might make sense to implement this as a function of SequenceExpr rather than creating a dedicated Expr.

On one hand, having a dedicated type does help to organize things a bit; on the other, it is just a wrapper for a preset SequenceExpr under the hood.

86xsk · 2026-03-16T20:43:11Z

harper-core/src/expr/pronoun_be.rs

+        let expr = SequenceExpr::default().then_any_of(vec![
+            Box::new(
+                SequenceExpr::default()
+                    .then_subject_pronoun()
+                    .t_ws()
+                    .t_set(&["am", "are", "is", "was", "were"]),
+            ),
+            Box::new(WordSet::new(&[
+                "i'm", "we're", "you're", "he's", "she's", "it's", "they're",
+            ])),
+        ]);


I would consider LazyLocking this or constructing and storing this inside the struct, especially since constructing a WordSet (currently) requires O(n²) string comparisons.
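For reference, the LazyLock option could look roughly like the sketch below. It uses `std::sync::LazyLock` from the standard library (stable since Rust 1.80) with a plain `Vec<String>` standing in for harper's `WordSet`. The names `PRONOUN_BE_CONTRACTIONS`, `is_contraction` and `build_count` are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how often the list is built, to show it happens only once.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Built on first access, then shared by every later call.
/// A `Vec<String>` stands in for harper's `WordSet` here.
static PRONOUN_BE_CONTRACTIONS: LazyLock<Vec<String>> = LazyLock::new(|| {
    BUILD_COUNT.fetch_add(1, Ordering::Relaxed);
    ["i'm", "we're", "you're", "he's", "she's", "it's", "they're"]
        .iter()
        .map(|s| s.to_string())
        .collect()
});

/// Hot path: only reads the already-built list.
pub fn is_contraction(word: &str) -> bool {
    let lower = word.to_lowercase();
    PRONOUN_BE_CONTRACTIONS.iter().any(|c| c.as_str() == lower)
}

/// Number of times the lazy initializer has run so far.
pub fn build_count() -> usize {
    BUILD_COUNT.load(Ordering::Relaxed)
}
```

However many times `is_contraction` is called, the initializer closure runs at most once.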

I would consider LazyLocking this or constructing and storing this inside the struct, especially since constructing a WordSet (currently) requires O(n²) string comparisons.

Hmm even more like an implementation detail? You make it sound like there's a more efficient way to implement WordSet?

The biggest cost with WordSet is the duplicate checking it does whenever a value is inserted. You can get a sizable performance benefit (~10%) on the lint_essay benchmark simply by removing those checks in WordSet::add and WordSet::add_chars. (Since WordSet stores its words in a SmallVec, checking for duplicates requires scanning through all words already contained.)

I've dabbled locally with trying to optimize it before, but I didn't spend too much time on it and didn't finish anything.

I haven't dedicated time to that problem because the vast majority of WordSet constructions are done at startup, and thus don't have a tangible impact on linting latency.
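To illustrate where the quadratic cost comes from, here is a toy Vec-backed set that scans for duplicates on every insert and counts the comparisons. `ToyWordSet` is a stand-in written for this example; it is not harper's `WordSet` and makes no claim about its internals beyond what is described above.

```rust
/// A toy word set backed by a Vec (a stand-in, not harper's `WordSet`).
pub struct ToyWordSet {
    words: Vec<String>,
    comparisons: usize,
}

impl ToyWordSet {
    pub fn new() -> Self {
        Self { words: Vec::new(), comparisons: 0 }
    }

    /// Inserts `word` unless it is already present.
    /// The duplicate check scans the words stored so far.
    pub fn add(&mut self, word: &str) {
        for existing in &self.words {
            self.comparisons += 1;
            if existing.as_str() == word {
                return; // duplicate: nothing to insert
            }
        }
        self.words.push(word.to_string());
    }

    /// Number of distinct words stored.
    pub fn len(&self) -> usize {
        self.words.len()
    }

    /// Total string comparisons performed by all `add` calls.
    pub fn comparisons(&self) -> usize {
        self.comparisons
    }
}
```

Inserting n distinct words costs 0 + 1 + ... + (n - 1) = n(n - 1)/2 comparisons, so the seven contractions from this PR cost 21 comparisons each time the set is rebuilt.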

86xsk · 2026-03-16T20:58:31Z

harper-core/src/expr/pronoun_be.rs

+
+use super::{Expr, SequenceExpr};
+
+#[derive(Default)]


I don't think you strictly need the Default derive here since it's a ZST, though I guess it doesn't hurt to have.

My one concern is that it may lead to usage of PronounBe::default() even though PronounBe alone would suffice as long as it remains a ZST.
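A small sketch of the point about ZSTs, using a field-less stand-in for the PR's type (the helper names are made up for the example): a unit struct's name is itself a value, so no constructor or `Default` impl is needed to obtain one.

```rust
/// A field-less unit struct, standing in for the PR's `PronounBe`.
pub struct PronounBe;

/// A unit struct occupies no memory: this returns 0.
pub fn pronoun_be_size() -> usize {
    std::mem::size_of::<PronounBe>()
}

/// The type name alone is a value expression; no `::default()` required.
pub fn make_pronoun_be() -> PronounBe {
    PronounBe
}
```

This only holds while the struct has no fields. Once it stores anything, callers need a constructor.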

I don't think you strictly need the Default derive here since it's a ZST, though I guess it doesn't hurt to have.

Oh I thought I deleted that! I was doing lots of experimentation with different ctors for optional inclusions such as the common misspellings without the apostrophes.

My one concern is that it may lead to usage of PronounBe::default() even though PronounBe alone would suffice as long as it remains a ZST.

Removed it. Thanks.

hippietrail · 2026-03-16T21:53:04Z

One thing I'm wondering is whether it might make sense to implement this as a function of SequenceExpr rather than creating a dedicated Expr.

Hmm I hadn't considered that. I did it this way because in my head it's similar to the 'spelled number', 'time unit', etc. expressions. Though I'm not sure there were any similar ones not made by me so I wonder what I based the decision on with my first one. I think 'fixed expression' might've been the OG.

On one hand, having a dedicated type does help to organize things a bit; on the other, it is just a wrapper for a preset SequenceExpr under the hood.

I think the idea was that semantically it's its own expression so it made semantic sense. On my one hand SequenceExpr is designed to be usable for components like this and on my other hand it's an implementation detail. Might be time to get an opinion from @elijah-potter

elijah-potter

It is preferable to have it as its own type. The organization is worth any potential performance trade-off. I can't imagine there is a trade-off, though, since the compiler and the downstream branch predictor tend to be pretty good at optimizing these situations.

However, as @86xsk alluded to, you should not be constructing those WordSet instances in the hot loop. Please refactor this to construct them as part of a new constructor.
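A rough sketch of the requested shape, with the word list built once in a constructor and reused on every call. A `HashSet<String>` stands in for the real `WordSet`/`SequenceExpr`, and `ContractionMatcher`, `matches` and `count_matches` are hypothetical names, not the code that landed in the PR.

```rust
use std::collections::HashSet;

/// Holds the pre-built word list so the hot path never rebuilds it.
/// `HashSet<String>` is a stand-in for harper's `WordSet`.
pub struct ContractionMatcher {
    contractions: HashSet<String>,
}

impl ContractionMatcher {
    /// All construction cost is paid here, once, at startup.
    pub fn new() -> Self {
        let contractions = ["i'm", "we're", "you're", "he's", "she's", "it's", "they're"]
            .iter()
            .map(|s| s.to_string())
            .collect();
        Self { contractions }
    }

    /// Hot path: a lookup in the stored set, no construction.
    pub fn matches(&self, word: &str) -> bool {
        self.contractions.contains(word.to_lowercase().as_str())
    }

    /// Counts matching tokens, reusing the same stored set for each one.
    pub fn count_matches(&self, tokens: &[&str]) -> usize {
        tokens.iter().filter(|t| self.matches(t)).count()
    }
}
```

The struct is no longer zero-sized in this form, so callers go through `new()` rather than naming the type directly.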

hippietrail · 2026-03-24T11:21:35Z

It is preferable to have it as its own type. The organization is worth any potential performance trade-off. I can't imagine there is a trade-off, though, since the compiler and the downstream branch predictor tend to be pretty good at optimizing these situations.

However, as @86xsk alluded to, you should not be constructing those WordSet instances in the hot loop. Please refactor this to construct them as part of a new constructor.

Oh I didn't even realize! Fix is in...

hippietrail added 2 commits March 16, 2026 17:19

feat: expression to match pronoun-be, word pair or contraction

d8029f2

fix: "it's" was missing - good/bad apostrophe tests

d8cab0d

86xsk reviewed Mar 16, 2026

hippietrail added 4 commits March 17, 2026 05:02

Merge branch 'master' of http://github.com/Automattic/harper into pro…

9a93c75

…noun-be

chore: remove derive Default

6f57b9b

Merge branch 'master' into pronoun-be

7a0294d

Merge branch 'master' into pronoun-be

ca92acd

elijah-potter requested changes Mar 23, 2026

hippietrail added 2 commits March 24, 2026 18:12

Merge branch 'master' of https://github.com/Automattic/harper into pr…

1330245

…onoun-be

refactor: add ctor as per PR review

76b3911

hippietrail wants to merge 8 commits into Automattic:master from hippietrail:pronoun-be

3 participants

