import "github.com/tonymet/regexpscanner"

© 2024 Anthony Metzidis

regexpscanner is a stream-based scanner and regex-based tokenizer in one: it scans io.Reader streams and returns the tokens matching a regular expression.
- func MakeScanner(in io.Reader, re *regexp.Regexp) *bufio.Scanner
- func MakeSplitter(re *regexp.Regexp) func([]byte, bool) (int, []byte, error)
- func ProcessTokens(in io.Reader, re *regexp.Regexp, handler func(string))
func MakeScanner
func MakeScanner(in io.Reader, re *regexp.Regexp) *bufio.Scanner

MakeScanner creates a scanner you can call scanner.Scan() and scanner.Text() on. Each call to scanner.Scan() followed by scanner.Text() returns the next token in the stream that matches the regex.
Example
Use MakeScanner to create a scanner that tokenizes input using the regex:
```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	scanner := rs.MakeScanner(
		strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"),
		regexp.MustCompile(`</?[a-z]+>`),
	)
	// scanner has a Split function defined using the regexp passed to MakeScanner
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```

Output:

```
<html>
<body>
<p>
</p>
</body>
</html>
```
func MakeSplitter
func MakeSplitter(re *regexp.Regexp) func([]byte, bool) (int, []byte, error)

MakeSplitter(re) creates a split function to pass to scanner.Split(); the regexp re is used to tokenize the input passed to the scanner.
Splitters can be wrapped by more elaborate splitters for further processing; see bufio.Scanner for examples of splitter wrappers.
Example
Use MakeSplitter to create a split function for scanner.Split():
```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	splitter := rs.MakeSplitter(regexp.MustCompile(`</?[a-z]+>`))
	scanner := bufio.NewScanner(strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"))
	// be sure to call Split()
	scanner.Split(splitter)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```

Output:

```
<html>
<body>
<p>
</p>
</body>
</html>
```
func ProcessTokens
func ProcessTokens(in io.Reader, re *regexp.Regexp, handler func(string))

ProcessTokens calls handler(string) for each token in the stream that matches the regex.
Example
Use ProcessTokens when a simple callback-based stream tokenizer is needed:
```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	rs "github.com/tonymet/regexpscanner"
)

func main() {
	rs.ProcessTokens(
		strings.NewReader("<html><body><p>Welcome to My Website</p></body></html>"),
		regexp.MustCompile(`</?[a-z]+>`),
		func(text string) {
			fmt.Println(text)
		})
}
```

Output:

```
<html>
<body>
<p>
</p>
</body>
</html>
```
regexpscanner is designed for high-efficiency, stream-based tokenization. It uses the regexp engine to find matches while minimizing buffer copies and splitter invocations.
Results from an AMD Ryzen 7 7735HS:
| Benchmark | Throughput | Memory Allocs | Notes |
|---|---|---|---|
| regexpscanner | ~470 MB/s | 13 allocs/op | Full regex-based tokenization |
| Control | ~12.8 GB/s | 7 allocs/op | Simple bytes.Index (non-regex) |
- Splitter Invocations: ~1.01 calls per match (highly efficient buffer utilization).
- Throughput: ~470 MB/s for large-scale stream scanning.
While primitive character searching (bytes.Index) is naturally faster for fixed patterns, regexpscanner provides excellent performance for complex tokenization tasks that require the full power of the Go regexp engine.