
regexpscanner


import "github.com/tonymet/regexpscanner"

©️ 2024 Anthony Metzidis

regexpscanner -- a stream-based scanner and regex-based tokenizer in one.

It scans io.Reader streams and returns the tokens that match a regular expression.

Index

func MakeScanner(in io.Reader, re *regexp.Regexp) *bufio.Scanner

MakeScanner returns a bufio.Scanner whose split function is built from re.

Call scanner.Scan() followed by scanner.Text() to retrieve each token in the stream that matches the regex.

Example

Use MakeScanner to create a scanner that tokenizes the stream using the regex.

package main

import (
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	scanner := rs.MakeScanner(strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"),
		regexp.MustCompile(`</?[a-z]+>`),
	)
	// the scanner's split function tokenizes using the regexp passed to MakeScanner
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output

<html>
<body>
<p>
</p>
</body>
</html>

func MakeSplitter(re *regexp.Regexp) func([]byte, bool) (int, []byte, error)

MakeSplitter(re) creates a split function to pass to scanner.Split(). The regex re is used to tokenize input passed to the scanner.

Split functions can be wrapped by more complicated split functions for further processing; see bufio.Scanner for examples of splitter wrappers.

Example

Use MakeSplitter to create a split function for scanner.Split().

package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	splitter := rs.MakeSplitter(regexp.MustCompile(`</?[a-z]+>`))
	scanner := bufio.NewScanner(strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"))
	// be sure to call Split()
	scanner.Split(splitter)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output

<html>
<body>
<p>
</p>
</body>
</html>

func ProcessTokens(in io.Reader, re *regexp.Regexp, handler func(string))

ProcessTokens scans in and calls handler(string) for each token matching re.

Example

Use ProcessTokens when a simple callback-based stream tokenizer is all you need.

package main

import (
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	rs.ProcessTokens(
		strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"),
		regexp.MustCompile(`</?[a-z]+>`),
		func(text string) {
			fmt.Println(text)
		})
}

Output

<html>
<body>
<p>
</p>
</body>
</html>

Performance

regexpscanner is designed for high-efficiency, stream-based tokenization. It uses the regexp engine to find matches while minimizing buffer copies and splitter invocations.

Benchmark Results

Results from an AMD Ryzen 7 7735HS:

| Benchmark      | Throughput | Allocations  | Notes                          |
|----------------|------------|--------------|--------------------------------|
| regexpscanner  | ~470 MB/s  | 13 allocs/op | Full regex-based tokenization  |
| Control        | ~12.8 GB/s | 7 allocs/op  | Simple bytes.Index (non-regex) |

Efficiency Stats

  • Splitter Invocations: ~1.01 calls per match (highly efficient buffer utilization).
  • Throughput: ~470 MB/s for large-scale stream scanning.

While primitive character searching (bytes.Index) is naturally faster for fixed patterns, regexpscanner provides excellent performance for complex tokenization tasks that require the full power of the Go regexp engine.
