Struct regex_automata::Regex
source · Expand description
A regular expression that uses deterministic finite automata for fast searching.
A regular expression is comprised of two DFAs, a “forward” DFA and a “reverse” DFA. The forward DFA is responsible for detecting the end of a match while the reverse DFA is responsible for detecting the start of a match. Thus, in order to find the bounds of any given match, a forward search must first be run followed by a reverse search. A match found by the forward DFA guarantees that the reverse DFA will also find a match.
The type of the DFA used by a Regex
corresponds to the D
type
parameter, which must satisfy the DFA
trait. Typically,
D
is either a DenseDFA
or a
SparseDFA
, where dense DFAs use more memory but
search faster, while sparse DFAs use less memory but search more slowly.
By default, a regex’s DFA type parameter is set to
DenseDFA<Vec<usize>, usize>
. For most in-memory work loads, this is the
most convenient type that gives the best search performance.
Sparse DFAs
Since a Regex
is generic over the DFA
trait, it can be used with any
kind of DFA. While this crate constructs dense DFAs by default, it is easy
enough to build corresponding sparse DFAs, and then build a regex from
them:
use regex_automata::Regex;
// First, build a regex that uses dense DFAs.
let dense_re = Regex::new("foo[0-9]+")?;
// Second, build sparse DFAs from the forward and reverse dense DFAs.
let fwd = dense_re.forward().to_sparse()?;
let rev = dense_re.reverse().to_sparse()?;
// Third, build a new regex from the constituent sparse DFAs.
let sparse_re = Regex::from_dfas(fwd, rev);
// A regex that uses sparse DFAs can be used just like with dense DFAs.
assert_eq!(true, sparse_re.is_match(b"foo123"));
Implementations§
source§impl Regex
impl Regex
sourcepub fn new(pattern: &str) -> Result<Regex, Error>
pub fn new(pattern: &str) -> Result<Regex, Error>
Parse the given regular expression using a default configuration and return the corresponding regex.
The default configuration uses usize
for state IDs, premultiplies
them and reduces the alphabet size by splitting bytes into equivalence
classes. The underlying DFAs are not minimized.
If you want a non-default configuration, then use the
RegexBuilder
to set your own configuration.
Example
use regex_automata::Regex;
let re = Regex::new("foo[0-9]+bar")?;
assert_eq!(Some((3, 14)), re.find(b"zzzfoo12345barzzz"));
source§impl Regex<SparseDFA<Vec<u8>, usize>>
impl Regex<SparseDFA<Vec<u8>, usize>>
sourcepub fn new_sparse(
pattern: &str
) -> Result<Regex<SparseDFA<Vec<u8>, usize>>, Error>
pub fn new_sparse(
pattern: &str
) -> Result<Regex<SparseDFA<Vec<u8>, usize>>, Error>
Parse the given regular expression using a default configuration and return the corresponding regex using sparse DFAs.
The default configuration uses usize
for state IDs, reduces the
alphabet size by splitting bytes into equivalence classes. The
underlying DFAs are not minimized.
If you want a non-default configuration, then use the
RegexBuilder
to set your own configuration.
Example
use regex_automata::Regex;
let re = Regex::new_sparse("foo[0-9]+bar")?;
assert_eq!(Some((3, 14)), re.find(b"zzzfoo12345barzzz"));
source§impl<D: DFA> Regex<D>
impl<D: DFA> Regex<D>
sourcepub fn is_match(&self, input: &[u8]) -> bool
pub fn is_match(&self, input: &[u8]) -> bool
Returns true if and only if the given bytes match.
This routine may short circuit if it knows that scanning future input
will never lead to a different result. In particular, if the underlying
DFA enters a match state or a dead state, then this routine will return
true
or false
, respectively, without inspecting any future input.
Example
use regex_automata::Regex;
let re = Regex::new("foo[0-9]+bar")?;
assert_eq!(true, re.is_match(b"foo12345bar"));
assert_eq!(false, re.is_match(b"foobar"));
sourcepub fn shortest_match(&self, input: &[u8]) -> Option<usize>
pub fn shortest_match(&self, input: &[u8]) -> Option<usize>
Returns the first position at which a match is found.
This routine stops scanning input in precisely the same circumstances
as is_match
. The key difference is that this routine returns the
position at which it stopped scanning input if and only if a match
was found. If no match is found, then None
is returned.
Example
use regex_automata::Regex;
let re = Regex::new("foo[0-9]+")?;
assert_eq!(Some(4), re.shortest_match(b"foo12345"));
// Normally, the end of the leftmost first match here would be 3,
// but the shortest match semantics detect a match earlier.
let re = Regex::new("abc|a")?;
assert_eq!(Some(1), re.shortest_match(b"abc"));
sourcepub fn find(&self, input: &[u8]) -> Option<(usize, usize)>
pub fn find(&self, input: &[u8]) -> Option<(usize, usize)>
Returns the start and end offset of the leftmost first match. If no
match exists, then None
is returned.
The “leftmost first” match corresponds to the match with the smallest
starting offset, but where the end offset is determined by preferring
earlier branches in the original regular expression. For example,
Sam|Samwise
will match Sam
in Samwise
, but Samwise|Sam
will
match Samwise
in Samwise
.
Generally speaking, the “leftmost first” match is how most backtracking
regular expressions tend to work. This is in contrast to POSIX-style
regular expressions that yield “leftmost longest” matches. Namely,
both Sam|Samwise
and Samwise|Sam
match Samwise
when using
leftmost longest semantics.
Example
use regex_automata::Regex;
let re = Regex::new("foo[0-9]+")?;
assert_eq!(Some((3, 11)), re.find(b"zzzfoo12345zzz"));
// Even though a match is found after reading the first byte (`a`),
// the leftmost first match semantics demand that we find the earliest
// match that prefers earlier parts of the pattern over latter parts.
let re = Regex::new("abc|a")?;
assert_eq!(Some((0, 3)), re.find(b"abc"));
sourcepub fn is_match_at(&self, input: &[u8], start: usize) -> bool
pub fn is_match_at(&self, input: &[u8], start: usize) -> bool
Returns the same as is_match
, but starts the search at the given
offset.
The significance of the starting point is that it takes the surrounding
context into consideration. For example, if the DFA is anchored, then
a match can only occur when start == 0
.
sourcepub fn shortest_match_at(&self, input: &[u8], start: usize) -> Option<usize>
pub fn shortest_match_at(&self, input: &[u8], start: usize) -> Option<usize>
Returns the same as shortest_match
, but starts the search at the
given offset.
The significance of the starting point is that it takes the surrounding
context into consideration. For example, if the DFA is anchored, then
a match can only occur when start == 0
.
sourcepub fn find_at(&self, input: &[u8], start: usize) -> Option<(usize, usize)>
pub fn find_at(&self, input: &[u8], start: usize) -> Option<(usize, usize)>
Returns the same as find
, but starts the search at the given
offset.
The significance of the starting point is that it takes the surrounding
context into consideration. For example, if the DFA is anchored, then
a match can only occur when start == 0
.
sourcepub fn find_iter<'r, 't>(&'r self, input: &'t [u8]) -> Matches<'r, 't, D>
pub fn find_iter<'r, 't>(&'r self, input: &'t [u8]) -> Matches<'r, 't, D>
Returns an iterator over all non-overlapping leftmost first matches in the given bytes. If no match exists, then the iterator yields no elements.
Note that if the regex can match the empty string, then it is
possible for the iterator to yield a zero-width match at a location
that is not a valid UTF-8 boundary (for example, between the code units
of a UTF-8 encoded codepoint). This can happen regardless of whether
allow_invalid_utf8
was enabled or not.
Example
use regex_automata::Regex;
let re = Regex::new("foo[0-9]+")?;
let text = b"foo1 foo12 foo123";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 4), (5, 10), (11, 17)]);
sourcepub fn from_dfas(forward: D, reverse: D) -> Regex<D>
pub fn from_dfas(forward: D, reverse: D) -> Regex<D>
Build a new regex from its constituent forward and reverse DFAs.
This is useful when deserializing a regex from some arbitrary memory region. This is also useful for building regexes from other types of DFAs.
Example
This example is a bit a contrived. The usual use of these methods
would involve serializing initial_re
somewhere and then deserializing
it later to build a regex.
use regex_automata::Regex;
let initial_re = Regex::new("foo[0-9]+")?;
assert_eq!(true, initial_re.is_match(b"foo123"));
let (fwd, rev) = (initial_re.forward(), initial_re.reverse());
let re = Regex::from_dfas(fwd, rev);
assert_eq!(true, re.is_match(b"foo123"));
This example shows how you might build smaller DFAs, and then use those smaller DFAs to build a new regex.
use regex_automata::Regex;
let initial_re = Regex::new("foo[0-9]+")?;
assert_eq!(true, initial_re.is_match(b"foo123"));
let fwd = initial_re.forward().to_u16()?;
let rev = initial_re.reverse().to_u16()?;
let re = Regex::from_dfas(fwd, rev);
assert_eq!(true, re.is_match(b"foo123"));
This example shows how to build a Regex
that uses sparse DFAs instead
of dense DFAs:
use regex_automata::Regex;
let initial_re = Regex::new("foo[0-9]+")?;
assert_eq!(true, initial_re.is_match(b"foo123"));
let fwd = initial_re.forward().to_sparse()?;
let rev = initial_re.reverse().to_sparse()?;
let re = Regex::from_dfas(fwd, rev);
assert_eq!(true, re.is_match(b"foo123"));