Regular Expression Engine using DFA construction, written in Rust.
This is, primarily, another step in my journey to desperately learn Rust for my Studienarbeit lol
This will never be a fully perl-comatible expression engine, for the very simple reason that it uses a DFA and thus cannot possibly support backreferences. However, I'll probably also not include all of the syntactic sugar of deterministic perl-compatible expressions, as a simple form of regular expressions are equally strong from a theoretical standpoint.
- Klenee Star quantifier (
a*) - Union (
a|b) - Explicit precedence (
(r)) - Escaping (
\rwhereris a reserved symbol) -
^and$quantifiers and making the automata behave correctly to accpet parts of a word instead of the entire word
- One-or-more quantifier (
a+) - Zero-or-one quantifier (
a?) - Everything matcher (
.) (introduces non-determinism i'm afraid) - Character classes (
[ab])- Ranges (
[a-f])
- Ranges (
- n-times (
r{n}) - Escape sequences for non-reserved symbols (
\d)
- Non-greedy quantifiers
- Capture groups
Automata are represented as by their theoretical tuplet definition.
Caveat: An automaton has no explicit alphabet associated with it. As I only care about accepting words, if I encounter a symbol that's not present in any
transition it simply means the Automaton does not accept the word (i.e. not match in this case).
This goes for both NFAs and DFAs (latter of which usually must have a transition for every symbol of the alphabet for every state).
When finding patterns in a string, an unknown symbol simply means the run is reset back to q_start.
- Expression is parsed into a NFA using a predictive recursive descent
- NFA is transformed into a DFA using a powerset construction
- A run of the DFA on the input word is simulated
Note: I will also include a naive and non-optimized backtracking simulation of an NFA, as this is an easy byproduct of building NFAs correctly.
I may try my hands at a Thompson NFA at some point, to see how my approach stacks up.
One of my goals for this project is to make it as fast as possible (at the very least the actual matching, the construction is fine if it's a bit slower). This shall serve as a list of things I think I could do but don't want to now (because I'm already thinking too much and make slow progress (usual me problem)):
- Make transition relation a
Vec<(Set<state, symbol>, Vec<state>)>for O(len) iterations & better lookup - Re-introduce alphabet attribute to not have to deduct it at compile-time from the transitions -> saves one iteration over all transitions each time
successors_multipleis called - When I'm in the mood for a headache: Get rid of BTreeSets and use vectors (or hash maps)