import "github.com/tonymet/regexpscanner"

© 2024 Anthony Metzidis

regexpscanner is a stream-based scanner and regex-based tokenizer in one: it scans io.Reader streams and returns the tokens matching a regular expression.
- func MakeScanner(in io.Reader, re *regexp.Regexp) *bufio.Scanner
- func MakeSplitter(re *regexp.Regexp) func([]byte, bool) (int, []byte, error)
- func ProcessTokens(in io.Reader, re *regexp.Regexp, handler func(string))
func MakeScanner
func MakeScanner(in io.Reader, re *regexp.Regexp) *bufio.Scanner

MakeScanner creates a scanner you can call scanner.Scan() and scanner.Text() on. Each call to scanner.Scan() followed by scanner.Text() returns the next token in the stream that matches the regex.
Example
Use MakeScanner to create a scanner that tokenizes input using the regex:
```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	scanner := rs.MakeScanner(
		strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"),
		regexp.MustCompile(`</?[a-z]+>`),
	)
	// scanner has a Split function defined using the regexp passed to MakeScanner
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```

Output:

```
<html>
<body>
<p>
</p>
</body>
</html>
```
func MakeSplitter
func MakeSplitter(re *regexp.Regexp) func([]byte, bool) (int, []byte, error)

MakeSplitter(re) creates a split function to pass to scanner.Split(); the regexp re is used to tokenize the input passed to the scanner.
Splitters can be wrapped by more elaborate splitters for further processing; see bufio.Scanner for examples of splitter wrappers.
Example
Use MakeSplitter to create a split function for scanner.Split():
```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	splitter := rs.MakeSplitter(regexp.MustCompile(`</?[a-z]+>`))
	scanner := bufio.NewScanner(strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"))
	// be sure to call Split()
	scanner.Split(splitter)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```

Output:

```
<html>
<body>
<p>
</p>
</body>
</html>
```
func ProcessTokens
func ProcessTokens(in io.Reader, re *regexp.Regexp, handler func(string))

ProcessTokens calls handler(string) for each token in the stream that matches the regex.
Example
Use ProcessTokens when a simple callback-based stream tokenizer is needed:
```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	rs.ProcessTokens(
		strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"),
		regexp.MustCompile(`</?[a-z]+>`),
		func(text string) {
			fmt.Println(text)
		})
}
```

Output:

```
<html>
<body>
<p>
</p>
</body>
</html>
```
regexpscanner is designed for high-efficiency, stream-based tokenization. It uses the regexp engine to find matches while minimizing buffer copies and splitter invocations.
Results from an AMD Ryzen 7 7735HS:
| Benchmark | Throughput | Memory Allocs | Notes |
|---|---|---|---|
| regexpscanner | ~470 MB/s | 13 allocs/op | Full regex-based tokenization |
| Control | ~12.8 GB/s | 7 allocs/op | Simple bytes.Index (non-regex) |
- Splitter Invocations: ~1.01 calls per match (highly efficient buffer utilization).
- Throughput: ~470 MB/s for large-scale stream scanning.
While primitive character searching (bytes.Index) is naturally faster for fixed patterns, regexpscanner provides excellent performance for complex tokenization tasks that require the full power of the Go regexp engine.