Function bstr::decode_utf8
UTF-8 decode a single Unicode scalar value from the beginning of a slice.
When successful, the corresponding Unicode scalar value is returned along with the number of bytes it was encoded with. The number of bytes consumed for a successful decode is always between 1 and 4, inclusive.
When unsuccessful, None is returned along with the number of bytes that make up a maximal prefix of a valid UTF-8 code unit sequence. In this case, the number of bytes consumed is always between 0 and 3, inclusive, where 0 is only returned when slice is empty.
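The return contract described above can be approximated with the standard library alone. The sketch below is not bstr's implementation: decode_first is a hypothetical name, and it assumes that the error length reported by std::str::from_utf8 follows the same maximal-prefix rule described here.

```rust
use std::str;

/// Hypothetical std-only stand-in for the contract described above;
/// this is not how bstr implements `decode_utf8`.
fn decode_first(slice: &[u8]) -> (Option<char>, usize) {
    if slice.is_empty() {
        // The only case in which 0 bytes are consumed.
        return (None, 0);
    }
    // A scalar value occupies at most 4 bytes, so only look that far.
    let head = &slice[..slice.len().min(4)];
    let valid = match str::from_utf8(head) {
        Ok(s) => s,
        Err(e) if e.valid_up_to() == 0 => {
            // `error_len()` is `None` when the input ends mid-sequence;
            // in that case everything in `head` is the incomplete prefix.
            return (None, e.error_len().unwrap_or(head.len()));
        }
        // Some leading bytes decoded fine; keep only that part.
        Err(e) => str::from_utf8(&head[..e.valid_up_to()]).unwrap(),
    };
    // `valid` is non-empty on both remaining paths.
    let ch = valid.chars().next().unwrap();
    (Some(ch), ch.len_utf8())
}
```

A successful decode here consumes between 1 and 4 bytes, and a failed decode consumes between 1 and 3 bytes unless the input is empty, mirroring the ranges stated above.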
Examples
Basic usage:
use bstr::decode_utf8;
// Decoding a valid codepoint.
let (ch, size) = decode_utf8(b"\xE2\x98\x83");
assert_eq!(Some('☃'), ch);
assert_eq!(3, size);
// Decoding an incomplete codepoint.
let (ch, size) = decode_utf8(b"\xE2\x98");
assert_eq!(None, ch);
assert_eq!(2, size);
This example shows how to iterate over all codepoints in UTF-8 encoded bytes, while replacing invalid UTF-8 sequences with the replacement codepoint:
use bstr::{B, decode_utf8};
let mut bytes = B(b"\xE2\x98\x83\xFF\xF0\x9D\x9E\x83\xE2\x98\x61");
let mut chars = vec![];
while !bytes.is_empty() {
    let (ch, size) = decode_utf8(bytes);
    bytes = &bytes[size..];
    chars.push(ch.unwrap_or('\u{FFFD}'));
}
assert_eq!(vec!['☃', '\u{FFFD}', '𝞃', '\u{FFFD}', 'a'], chars);
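As a cross-check, the standard library's String::from_utf8_lossy performs a similar substitution and gives the same sequence of codepoints for this particular input. The helper name lossy_chars below is hypothetical, and the sketch makes no claim that the two routines agree on every input.

```rust
/// Hypothetical helper: collect the codepoints produced by std's lossy
/// conversion, which substitutes U+FFFD for each invalid sequence.
fn lossy_chars(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}
```

Unlike the decode_utf8 loop, this converts the whole input at once and does not report how many bytes each replacement covered.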