Module aho_corasick::packed
source · Expand description
A lower level API for packed multiple substring search, principally for a small number of patterns.
This sub-module provides vectorized routines for quickly finding matches of a
small number of patterns. In general, users of this crate shouldn’t need to
interface with this module directory, as the primary
AhoCorasick
searcher will use these routines automatically as a prefilter when applicable.
However, in some cases, callers may want to bypass the Aho-Corasick machinery
entirely and use this vectorized searcher directly.
Overview
The primary types in this sub-module are:
Searcher
executes the actual search algorithm to report matches in a haystack.Builder
accumulates patterns incrementally and can construct aSearcher
.Config
permits tuning the searcher, and itself will produce aBuilder
(which can then be used to build aSearcher
). Currently, the only tuneable knob are the match semantics, but this may be expanded in the future.
Examples
This example shows how to create a searcher from an iterator of patterns.
By default, leftmost-first match semantics are used. (See the top-level
MatchKind
type for more details about match
semantics, which apply similarly to packed substring search.)
use aho_corasick::packed::{MatchKind, Searcher};
let searcher = Searcher::new(["foobar", "foo"].iter().cloned())?;
let matches: Vec<usize> = searcher
.find_iter("foobar")
.map(|mat| mat.pattern())
.collect();
assert_eq!(vec![0], matches);
This example shows how to use Config
to change the
match semantics to leftmost-longest:
use aho_corasick::packed::{Config, MatchKind};
let searcher = Config::new()
.match_kind(MatchKind::LeftmostLongest)
.builder()
.add("foo")
.add("foobar")
.build()?;
let matches: Vec<usize> = searcher
.find_iter("foobar")
.map(|mat| mat.pattern())
.collect();
assert_eq!(vec![1], matches);
Packed substring searching
Packed substring searching refers to the use of SIMD (Single Instruction, Multiple Data) to accelerate the detection of matches in a haystack. Unlike conventional algorithms, such as Aho-Corasick, SIMD algorithms for substring search tend to do better with a small number of patterns, where as Aho-Corasick generally maintains reasonably consistent performance regardless of the number of patterns you give it. Because of this, the vectorized searcher in this sub-module cannot be used as a general purpose searcher, since building the searcher may fail. However, in exchange, when searching for a small number of patterns, searching can be quite a bit faster than Aho-Corasick (sometimes by an order of magnitude).
The key take away here is that constructing a searcher from a list of patterns is a fallible operation. While the precise conditions under which building a searcher can fail is specifically an implementation detail, here are some common reasons:
- Too many patterns were given. Typically, the limit is on the order of 100 or so, but this limit may fluctuate based on available CPU features.
- The available packed algorithms require CPU features that aren’t available.
For example, currently, this crate only provides packed algorithms for
x86_64
. Therefore, constructing a packed searcher on any other target (e.g., ARM) will always fail. - Zero patterns were given, or one of the patterns given was empty. Packed searchers require at least one pattern and that all patterns are non-empty.
- Something else about the nature of the patterns (typically based on heuristics) suggests that a packed searcher would perform very poorly, so no searcher is built.