Parse a string containing a Unicode number into the corresponding Unicode character? - rust

Is there a function to do something like this:
fn string_to_unicode_char(s: &str) -> Option<char> {
// ...
}
fn main() {
let s = r"\u{00AA}"; // note the raw string literal!
string_to_unicode_char(s).unwrap();
}
Note that r"\u{00AA}" is a raw string, i.e. it isn't a Unicode escape sequence but 8 separate characters: \ u { 0 0 A A }.
I need to interpret/convert/parse this string and return a char if all is good, None otherwise. I don't have any experience with Unicode, so any ideas are welcome.

I believe the function you are looking for is char::from_u32:
fn string_to_unicode_char(s: &str) -> Option<char> {
// Do something more appropriate to find the actual number
let number = &s[3..7];
u32::from_str_radix(number, 16)
.ok()
.and_then(std::char::from_u32)
}
fn main() {
let s = r"\u{00AA}"; // note the raw string literal!
let ch = string_to_unicode_char(s);
assert_eq!(ch, Some('\u{00AA}'));
}
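The fixed slice `&s[3..7]` above only handles exactly four hex digits. A minimal sketch of a more tolerant version follows, assuming the input is exactly one escape of the form `\u{...}` with 1 to 6 hex digits and no underscores; the name `parse_unicode_escape` is made up for illustration.

```rust
/// Parses a single escape such as `\u{00AA}` into a char.
/// Returns None for malformed input or values that are not valid chars.
fn parse_unicode_escape(s: &str) -> Option<char> {
    // Strip the literal `\u{` prefix and the closing `}`.
    let hex = s.strip_prefix(r"\u{")?.strip_suffix('}')?;
    // Assumption: 1 to 6 hex digits, nothing else. from_str_radix would
    // otherwise accept a leading '+', so the digits are checked first.
    if hex.is_empty() || hex.len() > 6 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}
```

Surrogate values such as `\u{D800}` are rejected by `char::from_u32`, so no separate range check is needed.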

I indeed completely misunderstood your question; my old answer can be seen in the edit logs
Is there a builtin function to parse a string containing a Rust unicode escape into the corresponding unicode character?
AFAIK, no, there is not a builtin function to do that.
The answer to "how to do it yourself" is a bit broad, as there are many ways to do it (and it's not clear whether you also want to parse standard escapes, such as "\n"). Some options:
Use a regex
Do simple, naive manual parsing
Embed it into a bigger lexer (see the function in the Rust compiler that parses such Unicode escapes)

Related

How do I remove some chars at the end of a string?

I need to match a few words at the start of a string, handle them, then remove them. How should I remove a few chars or bytes at the end of a String?
I am using the regex crate to match the string. I can't find a way to remove chars at the end of the String.
Maybe something like this, but the string may contain non-ASCII chars:
use lazy_static::lazy_static;
use regex::Regex;
fn func(s: &mut String) {
lazy_static! {
static ref RE: Regex = Regex::new(r"123").unwrap();
}
let cap = match RE.captures(s.as_str()) {
Some(v) => v.get(0).unwrap(),
None => panic!("Error"),
};
do_something(cap.as_str());
s.delete(0, cap.end());
}
fn do_something(s: &str) {
assert_eq!(s, "123")
}
fn main() {
let s = String::from("123456");
func(s);
assert_eq!(s, "456");
}
I have seen the remove method, but its documentation says it's O(n). If so, I think O(nm) is a little too slow for me.
You can use the regex crate's Match::start to get the start of the match.
You can then use truncate to get rid of everything from that position onward.
fn main() {
let mut text: String = "this is a text with some garbage after!abc".into();
let re = regex::Regex::new("abc$").unwrap();
let m = re.captures(&text).unwrap();
let g = m.get(0).unwrap();
text.truncate(g.start());
dbg!(text);
}
What you're looking for is truncate, except with non-ASCII support.
For ASCII only, this works:
let mut s = String::from("123456789");
s.truncate(s.len() - 3);
assert_eq!(s, "123456");
However, since a String can contain Unicode characters that take more than 1 byte, this doesn't work for non-ASCII text (truncate panics if the new length does not lie on a char boundary).
If you want non-ASCII support, there isn't an O(1) solution according to this answer. That answer does give an implementation using char_indices(), which I think is the best way unless I'm missing something.
There is also the unicode-truncate crate, which also seems to use char_indices(); it might be worth a look.
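As a rough sketch of the char_indices() approach mentioned above, using only the standard library: walk the char boundaries from the end, find the byte offset of the n-th char from the end, and truncate there. The helper name `truncate_last_chars` is hypothetical, and this counts Unicode scalar values, not grapheme clusters.

```rust
/// Removes the last `n` chars (Unicode scalar values) from `s` in place.
/// If `s` has fewer than `n` chars, it is cleared.
fn truncate_last_chars(s: &mut String, n: usize) {
    if n == 0 {
        return;
    }
    // Byte offset where the n-th char from the end starts, if there is one.
    let cut = s.char_indices().rev().nth(n - 1).map(|(idx, _)| idx);
    match cut {
        Some(idx) => s.truncate(idx), // idx is always a char boundary
        None => s.clear(),
    }
}
```

The scan costs O(n) in the number of chars removed, not in the length of the whole string, because the iterator starts from the end.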

Simpler way to check if a string starts with a digit in Rust?

What I am currently using is this
fn main() {
let a = "abc123";
let b = "1a2b3c";
println!("{}", a[0..1].chars().all(char::is_numeric));
println!("{}", b[0..1].chars().all(char::is_numeric));
}
Is there a more idiomatic and/or simpler way to do this?
Note: The string is guaranteed to be non empty and made of ASCII characters.
If you are sure that it is non-empty and made out of ascii, you can operate directly on bytes (u8):
a.as_bytes()[0].is_ascii_digit()
or
(b'0'..=b'9').contains(&a.as_bytes()[0])
More general setting (and, in my opinion, more idiomatic):
a.chars().next().unwrap().is_numeric()
The reason all this looks a bit unwieldy is that several things can go wrong (and are easily overlooked in other languages):
The string might be empty, which leads us into Option/unwrap-land.
Strings in Rust are UTF-8, which complicates random access into a string. Note that Rust does not consider only 0-9 to be numeric, as shown here.
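To illustrate both pitfalls without unwrap, here is a small sketch that returns false for an empty string and contrasts the ASCII-only check with the broader Unicode one. The function names are made up for this example.

```rust
/// True if the first char is one of '0'..='9'. An empty string gives false.
fn starts_with_ascii_digit(s: &str) -> bool {
    s.chars().next().map_or(false, |c| c.is_ascii_digit())
}

/// True if the first char is numeric in the Unicode sense,
/// which covers more than 0-9 (for example '¾').
fn starts_with_numeric(s: &str) -> bool {
    s.chars().next().map_or(false, char::is_numeric)
}
```

Which of the two is appropriate depends on whether characters such as '¾' should count as digits for your input.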
Starting from your original solution and using parse:
fn main() {
let a = "abc123";
let b = "1a2b3c";
println!("{:?}", a[0..1].parse::<u8>().is_ok()); // false
println!("{:?}", b[0..1].parse::<u8>().is_ok()); // true
}
This works if the first character is guaranteed to be ASCII and the string is not empty.

How to get the last character of a &str?

In Python, this would be final_char = mystring[-1]. How can I do the same in Rust?
I have tried
mystring[mystring.len() - 1]
but I get the error the type 'str' cannot be indexed by 'usize'
That is how you get the last char (which may not be what you think of as a "character"):
mystring.chars().last().unwrap();
Use unwrap only if you are sure that there is at least one char in your string.
Warning: About the general case (doing the same thing as mystring[-n] in Python): UTF-8 strings are not meant to be accessed by index, because indexing is not an O(1) operation (a string in Rust is not an array of characters). Please read this for more information.
However, if you want to index from the end like in Python, you must do this in Rust:
mystring.chars().rev().nth(n - 1) // Python: mystring[-n]
and check if there is such a character.
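A minimal sketch of that check, wrapping the expression above in a function that returns an Option instead of panicking. The name `char_from_end` and the 1-based convention (n = 1 is the last char, like Python's mystring[-1]) are choices made for this example.

```rust
/// Returns the n-th char from the end (n = 1 is the last char),
/// or None if n is 0 or the string has fewer than n chars.
fn char_from_end(s: &str, n: usize) -> Option<char> {
    if n == 0 {
        return None; // avoids underflow in n - 1
    }
    s.chars().rev().nth(n - 1)
}
```

The caller can then decide how to handle the missing case, for example with `if let Some(c) = char_from_end(mystring, 1)`.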
If you miss the simplicity of Python syntax, you can write your own extension:
trait StrExt {
fn from_end(&self, n: usize) -> char;
}
impl<'a> StrExt for &'a str {
fn from_end(&self, n: usize) -> char {
self.chars().rev().nth(n).expect("Index out of range in 'from_end'")
}
}
fn main() {
println!("{}", "foobar".from_end(2)) // prints 'b'
}
One option is to use slices. Here's an example:
let len = my_str.len();
let final_str = &my_str[len-1..];
This returns a string slice from position len-1 through the end of the string. That is to say, the last byte of your string. If your string consists of only ASCII values, then you'll get the final character of your string.
The reason why this only works with ASCII values is because they only ever require one byte of storage. Anything else, and Rust is likely to panic at runtime. This is what happens when you try to slice out one byte from a 2-byte character.
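If a panic is not acceptable, one option is `str::get`, which returns None when the requested range does not fall on char boundaries. The sketch below assumes you really want the last byte as a string slice; `last_byte_as_str` is a hypothetical helper name.

```rust
/// Returns the last byte of `s` as a &str, or None if the string is
/// empty or its last char is longer than one byte.
fn last_byte_as_str(s: &str) -> Option<&str> {
    // checked_sub avoids underflow on an empty string.
    let start = s.len().checked_sub(1)?;
    // get returns None instead of panicking on a non-boundary index.
    s.get(start..)
}
```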
For a more detailed explanation, please see the strings section of the Rust book.
As @Boiethios mentioned:
let last_ch = mystring.chars().last().unwrap();
Or
let last_ch = codes.chars().rev().nth(0).unwrap();
I would rather have (how hard is that!?)
let last_ch = codes.chars(-1); // Not implemented as of rustc 1.56.1

Why is capitalizing the first letter of a string so convoluted in Rust?

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:
let mut s = "foobar";
s[0] = s[0].to_uppercase();
But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.
let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Similar question
Why is it so convoluted?
Let's break it down, line-by-line
let s1 = "foobar";
We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
let mut v: Vec<char> = s1.chars().collect();
This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
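The memory trade-off described above can be checked with a short sketch that compares the UTF-8 byte length of a string with the payload size of the equivalent Vec<char>. The function name `storage_sizes` is made up for this example, and the Vec's own header and spare capacity are ignored.

```rust
use std::mem::size_of;

/// Returns (bytes used as UTF-8, bytes used by the chars of a Vec<char>).
fn storage_sizes(s: &str) -> (usize, usize) {
    let utf8_bytes = s.len();
    // Every char occupies 4 bytes, regardless of the code point.
    let vec_bytes = s.chars().count() * size_of::<char>();
    (utf8_bytes, vec_bytes)
}
```

For "foobar" this gives 6 bytes against 24, while a string made only of 4-byte code points such as "😎😎" takes 8 bytes either way.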
v[0] = v[0].to_uppercase().nth(0).unwrap();
This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.
This code does not panic when a code point has no uppercase variant: in that case char::to_uppercase yields the same char unchanged. It does panic on an empty string, because v[0] is out of bounds. It can also be semantically wrong when a code point has an uppercase variant that has multiple characters, such as the German ß, because nth(0) keeps only the first of them. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
let s2: String = v.into_iter().collect();
Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.
let s3 = &s2;
And now we take a reference to that String.
It's a simple problem
Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?
I presume char::to_uppercase already properly handles Unicode.
Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases.
Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
why the need for all data type conversions?
Because the data types you are working with are important when you are worried about correctness and performance. A char is 32 bits wide and a string is UTF-8 encoded. They are different things.
indexing could return a multi-byte, Unicode character
There may be some mismatched terminology here. A char is a multi-byte Unicode character.
Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
to_uppercase could return an upper case character
As mentioned above, ß is a single character that, when capitalized, becomes two characters.
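A short sketch makes this concrete: char::to_uppercase returns an iterator precisely because the result can be more than one char. The helper `uppercase_of` is a name invented for this example.

```rust
/// Collects the full uppercase mapping of a single char into a String.
fn uppercase_of(c: char) -> String {
    // to_uppercase yields one or more chars; a char with no uppercase
    // mapping is yielded unchanged.
    c.to_uppercase().collect()
}
```

With this helper, 'a' maps to "A", 'ß' maps to the two-char string "SS", and '1' comes back as "1".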
Solutions
See also trentcl's answer which only uppercases ASCII characters.
Original
If I had to write the code, it'd look like:
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().chain(c).collect(),
}
}
fn main() {
println!("{}", some_kind_of_uppercase_first_letter("joe"));
println!("{}", some_kind_of_uppercase_first_letter("jill"));
println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
println!("{}", some_kind_of_uppercase_first_letter("ß"));
}
But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.
Improved
Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.
However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:
extern crate inflector;
use inflector::Inflector;
let capitalized = "some string".to_title_case();
It's not especially convoluted if you are able to limit your input to ASCII-only strings.
Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:
fn make_ascii_titlecase(s: &mut str) {
if let Some(r) = s.get_mut(0..1) {
r.make_ascii_uppercase();
}
}
This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard".
Use with caution.
I did it this way:
fn str_cap(s: &str) -> String {
// Byte slicing: panics if s is empty or its first character is not ASCII.
format!("{}{}", s[..1].to_uppercase(), &s[1..])
}
If it is not an ASCII string:
fn str_cap(s: &str) -> String {
// Works on chars, but unwrap still panics on an empty string.
format!("{}{}", s.chars().next().unwrap().to_uppercase(),
s.chars().skip(1).collect::<String>())
}
The OP's approach taken further:
replace the first character with its uppercase representation
let mut s = "foobar".to_string();
let r = s.remove(0).to_uppercase().to_string() + &s;
or
let r = format!("{}{s}", s.remove(0).to_uppercase());
println!("{r}");
This works with Unicode characters as well, e.g. "😎foobar".
If the first character is guaranteed to be ASCII, it can be changed to a capital letter in place:
let mut s = "foobar".to_string();
if !s.is_empty() {
s[0..1].make_ascii_uppercase(); // Foobar
}
Panics with a non ASCII character in first position!
Since the method to_uppercase() returns a new String, you should be able to just add the remainder of the string like so.
This was tested in Rust version 1.57+ but is likely to work in any version that supports slicing. Note that s[0..1] panics if s is empty or if its first character is longer than one byte.
fn uppercase_first_letter(s: &str) -> String {
s[0..1].to_uppercase() + &s[1..]
}
Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:
fn capitalize_first(s: &str) -> String {
let mut chars = s.chars();
chars
.next()
.map(|first_letter| first_letter.to_uppercase())
.into_iter()
.flatten()
.chain(chars)
.collect()
}
This is how I solved this problem. Notice that I had to check whether self is ASCII before transforming to uppercase, because split_at(1) panics if the first character is longer than one byte.
trait TitleCase {
fn title(&self) -> String;
}
impl TitleCase for &str {
fn title(&self) -> String {
if !self.is_ascii() || self.is_empty() {
return String::from(*self);
}
let (head, tail) = self.split_at(1);
head.to_uppercase() + tail
}
}
pub fn main() {
println!("{}", "bruno".title());
println!("{}", "b".title());
println!("{}", "🦀".title());
println!("{}", "ß".title());
println!("{}", "".title());
println!("{}", "བོད་སྐད་ལ".title());
}
Output
Bruno
B
🦀
ß
བོད་སྐད་ལ
Inspired by the get_mut examples, I wrote something like this:
fn make_capital(in_str: &str) -> String {
let mut v = String::from(in_str);
// get_mut returns None if the string is empty or 0..1 is not a char boundary.
if let Some(first) = v.get_mut(0..1) {
first.make_ascii_uppercase();
}
v
}

How to iterate through characters in a string in Rust to match words?

I'd like to iterate through a sentence to extract out simple words from the string. Here's what I have so far, trying to make the parse function first match world in the input string:
fn parse(input: String) -> String {
let mut val = String::new();
for c in input.chars() {
if c == "w".to_string() {
// guessing I have to test one character at a time
val.push_str(c.to_str());
}
}
return val;
}
fn main() {
let s = "Hello world!".to_string();
println!("{}", parse(s)); // should say "world"
}
What is the correct way to iterate through the characters in a string to match patterns in Rust (such as for a basic parser)?
Checking for words in a string is easy with the str::contains method.
As for writing a parser itself, I don't think it's any different in Rust than other languages. You have to create some sort of state machine.
For examples, you could check out serialize::json. I also wrote a CSV parser that uses a buffer with a convenient read_char method. The advantage of using this approach is that you don't need to load the whole input into memory at once.
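As a rough illustration of the state-machine idea, here is a sketch that walks the chars once and collects runs of alphabetic characters as words. The two states are implicit: "inside a word" (the buffer is non-empty) and "between words". The function name `extract_words` and the definition of a word as a run of alphabetic chars are assumptions made for this example.

```rust
/// Splits `input` into words, where a word is a run of alphabetic chars.
fn extract_words(input: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for c in input.chars() {
        if c.is_alphabetic() {
            // Inside a word: keep accumulating.
            current.push(c);
        } else if !current.is_empty() {
            // A word just ended: store it and reset the buffer.
            words.push(std::mem::take(&mut current));
        }
    }
    // Flush a word that runs to the end of the input.
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

For the simple case in the question, `"Hello world!".contains("world")` is enough; the loop above is only worth it once the parser needs to track more state than a substring search can express.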

Resources