AC-Search

基于 AC自动机（Aho-Corasick） 的高性能多模式字符串匹配库，结合了字典树（Trie）和 KMP 失败指针思想，可在文本中高效查找多个模式串。

特性

泛型支持：模式可以是任意实现 Pattern 接口的结构体，支持携带用户标记、分类等元数据
支持中文和多字节字符
零内存分配的快速检测方法
支持字节位置和 Unicode 字符位置两种返回格式
自动处理重叠模式匹配
线程安全（构建完成后）

安装

go get github.com/rushteam/ac-search

快速开始

简单字符串匹配

package main

import (
    "fmt"
    acsearch "github.com/rushteam/ac-search"
)

func main() {
    // 使用便捷方法创建字符串匹配器
    ac := acsearch.NewStringMatcher()

    // 添加词库
    ac.AddPatterns([]acsearch.StringPattern{"敏感词", "违禁内容", "不良信息"})

    text := "这段文本包含敏感词和不良信息"

    // 快速检查是否包含
    if ac.Contains(text) {
        fmt.Println("包含敏感词")
    }

    // 获取所有匹配及位置
    results := ac.SearchRune(text)
    for _, r := range results {
        fmt.Printf("找到: %s, 位置: [%d, %d)\n", r.Pattern.String(), r.Start, r.End)
    }
}

自定义结构体（带用户标记）

package main

import (
    "fmt"
    acsearch "github.com/rushteam/ac-search"
)

// 自定义敏感词结构体
type SensitiveWord struct {
    Text     string
    Category string // 分类：政治、色情、广告等
    Level    int    // 级别：1-低 2-中 3-高
    UserID   int64  // 添加该词的用户ID
}

// 实现 fmt.Stringer 接口
func (s SensitiveWord) String() string {
    return s.Text
}

func main() {
    // 创建泛型AC自动机
    ac := acsearch.New[SensitiveWord]()

    // 添加带元数据的敏感词
    ac.AddPatterns([]SensitiveWord{
        {Text: "敏感词", Category: "政治", Level: 3, UserID: 1001},
        {Text: "广告", Category: "营销", Level: 1, UserID: 1002},
    })

    text := "这段文本包含敏感词和广告"

    // 获取匹配结果，包含完整的结构体信息
    results := ac.SearchRune(text)
    for _, r := range results {
        fmt.Printf("词: %s, 分类: %s, 级别: %d, 添加者: %d\n",
            r.Pattern.Text, r.Pattern.Category, r.Pattern.Level, r.Pattern.UserID)
    }
}

项目结构

ac-search/
├── go.mod          # Go模块定义
├── ac.go           # AC自动机核心实现（泛型版本）
├── ac_test.go      # 测试和基准测试
├── README.md       # 项目文档
└── example/
    └── main.go     # 使用示例

API 说明

Pattern 接口

直接使用标准库 fmt.Stringer，任何实现了 String() string 方法的类型都可作为模式：

type Pattern = fmt.Stringer

内置类型

// StringPattern 简单字符串模式
type StringPattern string

func (s StringPattern) Word() string {
    return string(s)
}

创建和初始化

// 创建泛型AC自动机
ac := acsearch.New[YourPatternType]()

// 使用便捷方法创建字符串匹配器
ac := acsearch.NewStringMatcher()

// 添加单个模式
ac.AddPattern(pattern)

// 批量添加模式
ac.AddPatterns([]YourPatternType{...})

// 手动构建（可选，Search/Contains会自动调用）
ac.Build()

// 清空重置
ac.Clear()

// 获取所有已添加的模式
patterns := ac.GetPatterns()

核心方法

方法	返回值	说明	适用场景
`Contains(text)`	`bool`	快速检查是否包含任意模式	零内存分配，敏感词过滤最优选
`ContainsAny(text)`	`(T, bool)`	返回第一个匹配的模式	需要知道具体命中哪个模式
`Search(text)`	`[]MatchResult[T]`	返回所有匹配及字节位置	需要精确字节位置
`SearchRune(text)`	`[]RuneMatchResult[T]`	返回所有匹配及 Unicode 字符位置	中文场景，字符位置更直观
`SearchFirst(text)`	`(MatchResult[T], bool)`	只返回第一个匹配详情	只需第一个匹配即可
`SearchCount(text)`	`map[string]int`	统计每个词出现次数	词频统计

返回结构

// 字节位置结果
type MatchResult[T Pattern] struct {
    Pattern T   // 匹配到的模式（包含完整信息）
    Start   int // 起始字节位置
    End     int // 结束字节位置（不包含）
}

// Unicode字符位置结果
type RuneMatchResult[T Pattern] struct {
    Pattern T   // 匹配到的模式（包含完整信息）
    Start   int // 起始字符位置
    End     int // 结束字符位置（不包含）
}

使用示例

1. 敏感词检测（推荐使用 Contains）

ac := acsearch.NewStringMatcher()
ac.AddPatterns([]acsearch.StringPattern{"敏感词", "违禁内容"})

// 最快速的检测方式，零内存分配
if ac.Contains(userInput) {
    return errors.New("包含敏感内容")
}

2. 多用户词库

type UserKeyword struct {
    Keyword string
    UserID  int64
    Tag     string
}

func (u UserKeyword) Word() string {
    return u.Keyword
}

ac := acsearch.New[UserKeyword]()

// 用户1的词库
ac.AddPattern(UserKeyword{Keyword: "苹果", UserID: 1, Tag: "水果"})

// 用户2的词库（同一个词，不同含义）
ac.AddPattern(UserKeyword{Keyword: "苹果", UserID: 2, Tag: "科技"})

results := ac.SearchRune("我买了一个苹果")
for _, r := range results {
    fmt.Printf("用户 %d 的词 %q 匹配\n", r.Pattern.UserID, r.Pattern.Keyword)
}
// 输出:
// 用户 1 的词 "苹果" 匹配
// 用户 2 的词 "苹果" 匹配

3. 敏感词过滤与替换

ac := acsearch.New[SensitiveWord]()
ac.AddPatterns([]SensitiveWord{
    {Text: "敏感", Category: "政治", Level: 3},
    {Text: "广告", Category: "营销", Level: 1},
})

text := "这是一段包含敏感内容和广告的文本"
runes := []rune(text)
results := ac.SearchRune(text)

// 从后往前替换，避免位置偏移
for i := len(results) - 1; i >= 0; i-- {
    r := results[i]
    var replacement string
    if r.Pattern.Level >= 3 {
        replacement = "***"
    } else {
        replacement = "**"
    }
    runes = append(runes[:r.Start], append([]rune(replacement), runes[r.End:]...)...)
}
fmt.Println(string(runes))
// 输出: 这是一段包含***内容和**的文本

4. 按分类统计

results := ac.SearchRune(text)

categoryCount := make(map[string]int)
for _, r := range results {
    categoryCount[r.Pattern.Category]++
}
// categoryCount: {"政治": 1, "营销": 2}

算法原理

AC自动机 = 字典树(Trie) + KMP失败指针

字典树（Trie）

存储所有模式串，共享公共前缀，节省内存。

        root
       /    \
      h      s
      |      |
      e      h
     / \     |
    r   (he) e
    |       (she)
    s
    |
   (hers)

失败指针（Fail Pointer）

类似 KMP 的 next 数组，当匹配失败时快速跳转到最长 proper 后缀对应的前缀位置，避免重复扫描。

复杂度

操作	时间复杂度	说明
构建	O(m)	m = 所有模式串总长度
搜索	O(n + z)	n = 文本长度，z = 匹配数量
总体	O(n + m + z)	线性时间复杂度

性能基准

测试环境：Apple M5

BenchmarkACAutomaton_Build-10        8397    141488 ns/op   218867 B/op   3449 allocs/op
BenchmarkACAutomaton_Search-10    1343772       892 ns/op      800 B/op      5 allocs/op
BenchmarkACAutomaton_Contains-10  5883663       204 ns/op        0 B/op      0 allocs/op

Contains 方法零内存分配，适合高频调用场景
1000个模式串的构建时间约 0.14ms
搜索性能约 500万次/秒

运行测试

# 运行所有测试
go test -v

# 运行基准测试
go test -bench=. -benchmem

# 运行示例
go run example/main.go

License

MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AC-Search

特性

安装

快速开始

简单字符串匹配

自定义结构体（带用户标记）

项目结构

API 说明

Pattern 接口

内置类型

创建和初始化

核心方法

返回结构

使用示例

1. 敏感词检测（推荐使用 Contains）

2. 多用户词库

3. 敏感词过滤与替换

4. 按分类统计

算法原理

字典树（Trie）

失败指针（Fail Pointer）

复杂度

性能基准

运行测试

License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
example		example
README.md		README.md
ac.go		ac.go
ac_test.go		ac_test.go
go.mod		go.mod

rushteam/ac-search

Folders and files

Latest commit

History

Repository files navigation

AC-Search

特性

安装

快速开始

简单字符串匹配

自定义结构体（带用户标记）

项目结构

API 说明

Pattern 接口

内置类型

创建和初始化

核心方法

返回结构

使用示例

1. 敏感词检测（推荐使用 Contains）

2. 多用户词库

3. 敏感词过滤与替换

4. 按分类统计

算法原理

字典树（Trie）

失败指针（Fail Pointer）

复杂度

性能基准

运行测试

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages