Split string in Rust, treating consecutive delimiters as one - rust

How do I split a string in Rust such that contiguous delimiters are collapsed into one? For example:
"1  2   3".splitX(" ")
should yield this Vec: ["1", "2", "3"] (when collected from the Split object, or any other intermediate object there may be). This example is for whitespace but we should be able to extend this for other delimiters too.
I believe we can use .filter() to remove empty items after using .split(), but it would be cleaner if it could be done as part of the original .split() directly. I obviously searched this thoroughly and am surprised I can't find the answer anywhere.
I know for whitespace we already have split_whitespace() and split_ascii_whitespace(), but I am looking for a solution that works for a general delimiter string.

The standard solution is to use split then filter:
let output: Vec<&str> = input
.split(pattern)
.filter(|s| !s.is_empty())
.collect();
This is fast and clear.
You can also use a regular expression to avoid the filter step:
let output: Vec<&str> = regex::Regex::new(" +").unwrap()
.split(input)
.collect();
If it's in a function which will be called several times, you can avoid repeating the Regex compilation with lazy_regex:
let output: Vec<&str> = lazy_regex::regex!(" +")
.split(input)
.collect();

IMO, by far the cleanest way is to write .split(" ").filter(|s| !s.is_empty()). It works for all separators and the intent is obvious from reading the code.
If that's too "ugly", you could perhaps pull it into a trait:
trait SplitNonEmpty {
// you might want to define your own struct for the return type
fn split_non_empty<'a, P>(&'a self, p: P) -> ... where P: Pattern<'a>;
}
impl SplitNonEmpty for &str {
// ...
}
If it's very important that this function returns a Split, you might need to refactor your code to use traits more; do you really care that it was created by splitting a string, or do you care that you can iterate over it? If the latter, maybe that function should take an impl IntoIterator<Item = &'a str>?
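For what it's worth, here is a minimal sketch of how such a trait could be filled in on stable Rust. It is only one possible shape, under a few assumptions of mine: the delimiter is restricted to &str (as far as I know, naming the Pattern trait in a bound still needs nightly), the trait is implemented for str rather than &str, and the return type is a boxed iterator instead of a custom struct.

```rust
/// Hypothetical extension trait; the name `SplitNonEmpty` is taken from the
/// sketch above, the `&str`-only delimiter and boxed return type are assumptions.
trait SplitNonEmpty {
    fn split_non_empty<'a>(&'a self, delim: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

impl SplitNonEmpty for str {
    fn split_non_empty<'a>(&'a self, delim: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        // Plain split, then drop the empty pieces produced by adjacent delimiters.
        Box::new(self.split(delim).filter(|piece| !piece.is_empty()))
    }
}
```

With this in scope, "a,,b,,,c".split_non_empty(",") iterates over "a", "b" and "c". The caller only sees an iterator of &str, which is usually all it needs.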

As others have said, split followed by filter, or the regex approach, is the better choice here. There is one more pattern that can be used, though: flat_map. In this context it doesn't add much value.
fn main() {
let output: Vec<&str> = "1  2   3"
.split(" ")
.flat_map(|x| if !x.is_empty() { Some(x) } else { None })
.collect();
println!("{:#?}", output)
}
You can use this pattern, say, if you want to parse these strings as numbers and ignore error values.
fn main() {
let output: Vec<i32> = "1  2   3"
.split(" ")
.flat_map(|x| x.parse())
.collect();
println!("{:#?}", output)
}
All flat_map cares about is that the closure returns something that implements IntoIterator; both Option and Result do.
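The same "parse and ignore errors" idea can also be written with filter_map, which arguably states the intent more directly. This is just a sketch; the function name parse_ints is made up for illustration.

```rust
/// Hypothetical helper: parse every space-separated piece as an i32 and
/// silently drop the pieces that fail to parse (including empty ones).
fn parse_ints(input: &str) -> Vec<i32> {
    input
        .split(' ')
        // `parse()` returns a Result; `.ok()` turns it into an Option,
        // and `filter_map` keeps only the `Some` values.
        .filter_map(|piece| piece.parse().ok())
        .collect()
}
```

The empty strings produced by consecutive spaces fail to parse, so they disappear together with any other invalid piece.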

Related

How to succinctly convert an iterator over &str to a collection of String

I am new to Rust, and it seems very awkward to use sequences of functional transformations on strings, because they often return &str.
For example, here is an implementation in which I try to read lines of two words separated by a space, and store them into a container of tuples:
use itertools::Itertools;
fn main() {
let s = std::io::stdin()
.lines()
.map(|l| l.unwrap())
.map(|l| {
l.split(" ")
.collect_tuple()
.map(|(a, b)| (a.to_string(), b.to_string()))
.unwrap()
})
.collect::<Vec<_>>();
println!("{:?}", s);
}
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=7f6d370457cc3254195565f69047018c
Because split returns an iterator to &str objects, whose scope is the lambda used for the map, the only way I saw to return them was to manually convert them back to strings. This seems really awkward.
Is there a better way to implement such a program?
Rust is explicit about allocation. The Strings returned by the lines() iterator don't persist beyond the iterator chain, so you can't just store references into them. Therefore, logically, there needs to be a to_string() (or to_owned, or String::from) somewhere.
But putting it after the tuple creation is a bit awkward, because it requires you to call the function twice. You can turn the result of the split() into owned objects instead. This should work:
.map(|l| {
l.split(" ")
.map(String::from)
.collect_tuple()
.unwrap()
})
.collect::<Vec<(_,_)>>();
Note that now you have to be explicit about the tuple type, though.
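If pulling in itertools only for this feels heavy, str::split_once from the standard library is an alternative worth considering. The helper below is a sketch (the name parse_pair is mine), and it behaves slightly differently from collect_tuple: it splits at the first space only, so any further spaces stay in the second element.

```rust
/// Hypothetical helper: split a line at the first space and return both
/// halves as owned Strings, or None if the line contains no space.
fn parse_pair(line: &str) -> Option<(String, String)> {
    let (first, second) = line.split_once(' ')?;
    Some((first.to_owned(), second.to_owned()))
}
```

The conversion to String still has to happen somewhere, but here it sits in one obvious place.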

How to extract a value from a set of Strings?

I have a set of strings that I am getting using the lines() function. The strings look like this:
abcdjf hfdf
test oinf=ddfn
cbdfk test12345=my value
mngf jdk
I want to extract my value from the strings above, so I am using this code:
body.lines()
.filter(|s| s.contains("test12345="))
.map(|x| x.split("=")[1]).to_string();
But it's not working and not returning any value. What is the correct code for this?
First of all, you cannot call to_string on an iterator. Second, split returns an iterator as well, so you cannot index it (i.e. [1]); instead you'd need to call nth(1).
body
.lines()
.filter(|s| s.contains("test12345="))
.map(|x| x.split("=").nth(1))
In case there can be multiple = after the first one, which you want to retain in the value, then instead use splitn(2, "="), i.e.:
.map(|x| x.splitn(2, "=").nth(1))
Also, because of the filter, every remaining item is now needlessly wrapped in Some(..). To avoid that, you can combine the filter and map using filter_map.
body
.lines()
.filter_map(|s| {
if s.contains("test12345=") {
s.splitn(2, "=").nth(1)
} else {
None
}
});
Since you attempted to use to_string: if you do want the iterator to return String instead of &str, you can add .map(ToString::to_string) either after nth(1) or after filter_map(..).
Iterator::map() returns an iterator, not a value, so you can't use to_string() on it. On the other hand, str::split() does not return a slice, but an iterator, so you can't access the value like [1]; instead, you must access it with the iterator API. As far as Rust can know, there could be multiple lines that contain "test12345=", so it must deal with that. To do so, you would need to .collect() your results in a Vec<String>:
let values: Vec<String> = body.lines()
.filter(|s| s.contains("test12345="))
.map(|x| x.split("=").nth(1).unwrap().to_string())
.collect();
Now, that doesn't look nice or idiomatic, does it? Since .filter().map() is a common pattern, there's .filter_map(), which accomplishes both in a single function. It's quite handy that it expects the closure to return Option<T>, so you could use ? for early returns if needed.
let values: Vec<String> = body.lines()
.filter_map(|line| {
if !line.contains("test12345=") {
return None;
}
line.split("=").nth(1).map(String::from)
})
.collect();
Iterator::nth() will give you the nth element of the iterator, but that element might not exist, which is why it returns an Option. By using Option::map() you can convert from &str to String if there's a value. In this case, passing the String::from function as the argument to .map() converts from Option<&str> to Option<String>, which matches the return type of the closure, so now you'll have what you're looking for.
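If only the first matching line matters, a sketch along these lines might be enough. It assumes the key is followed directly by = as in the sample data; the function name find_value and its parameters are made up for illustration.

```rust
/// Hypothetical helper: return everything after "<key>=" on the first line
/// that contains it, or None if no line does.
fn find_value<'a>(body: &'a str, key: &str) -> Option<&'a str> {
    let marker = format!("{}=", key);
    body.lines()
        // `split_once` yields (before, after) around the first occurrence
        // of the marker; we keep only the part after it.
        .find_map(|line| line.split_once(marker.as_str()).map(|(_, value)| value))
}
```

Because split_once splits at the first occurrence only, any further = characters stay part of the value, much like splitn(2, "=") above.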

Most idiomatic way to double every character in a string in Rust

I have a String, and I want to make a new String, with every character in the first one doubled. So "abc" would become "aabbcc" and so on.
The best I've come up with is:
let mut result = String::new();
for c in original_string.chars() {
result.push(c);
result.push(c);
}
result
This works fine, but is there a more succinct (or more idiomatic) way to do this?
In JavaScript I would probably write something like:
original.split('').map(c => c+c).join('')
Or in Ruby:
(original.chars.map { |c| c+c }).join('')
Since Rust also has functional elements, I was wondering if there is a similarly succinct solution.
I would use std::iter::repeat to repeat every char value from the input. This creates an infinite iterator, but for your case we only need to iterate 2 times, so we can use take to limit our iterator, then flatten all the iterators that hold the doubled chars.
use std::iter;
fn main() {
let input = "abc"; //"abc".to_string();
let output = input
.chars()
.flat_map(|c| iter::repeat(c).take(2))
.collect::<String>();
println!("{:?}", output);
}
Note: To double we are using take(2) but you can use any usize to increase the repetition.
Personally, I would just do exactly what you're doing. Its intent is clear (more clear than the functional approaches you presented from JavaScript or Ruby, in my opinion) and it is efficient. The only thing I would change is perhaps reserve space for the characters, since you know exactly how much space you will need.
let mut result = String::with_capacity(original_string.len() * 2);
However, if you are really in love with this style, you could use flat_map
let result: String = original_string.chars()
.flat_map(|c| std::iter::repeat(c).take(2))
.collect();
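Putting the reserve suggestion together with the original loop gives something like the sketch below. The function name double_chars is only for illustration.

```rust
/// Hypothetical helper: return a new String with every char of the input doubled.
fn double_chars(original: &str) -> String {
    // `len()` is the size in bytes, so twice that is exactly the space the
    // doubled string needs, whatever the width of each character.
    let mut result = String::with_capacity(original.len() * 2);
    for c in original.chars() {
        result.push(c);
        result.push(c);
    }
    result
}
```

double_chars("abc") gives "aabbcc" with a single allocation.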

Searching for a matching subslice? [duplicate]

I have a &[u8] slice over a binary buffer. I need to parse it, but a lot of the methods that I would like to use (such as str::find) don't seem to be available on slices.
I've seen that I can convert both my buffer slice and my pattern to str by using from_utf8_unchecked(), but that seems a little dangerous (and also really hacky).
How can I find a subsequence in this slice? I actually need the index of the pattern, not just a slice view of the parts, so I don't think split will work.
Here's a simple implementation based on the windows iterator.
fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
haystack.windows(needle.len()).position(|window| window == needle)
}
fn main() {
assert_eq!(find_subsequence(b"qwertyuiop", b"tyu"), Some(4));
assert_eq!(find_subsequence(b"qwertyuiop", b"asd"), None);
}
The find_subsequence function can also be made generic:
fn find_subsequence<T>(haystack: &[T], needle: &[T]) -> Option<usize>
where for<'a> &'a [T]: PartialEq
{
haystack.windows(needle.len()).position(|window| window == needle)
}
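One caveat with the windows approach: slice::windows panics when the window size is zero, so an empty needle needs a guard. Here is a sketch with that guard added. Treating an empty needle as a match at index 0 is my assumption (it mirrors what str::find does for an empty pattern), and the bound is simplified to T: PartialEq.

```rust
/// Sketch of the generic search with an explicit guard for the empty needle.
fn find_subsequence<T: PartialEq>(haystack: &[T], needle: &[T]) -> Option<usize> {
    if needle.is_empty() {
        // Assumption: an empty needle matches at the very start.
        return Some(0);
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}
```

A needle longer than the haystack needs no special case, since windows then yields nothing and the result is None.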
I don't think the standard library contains a function for this. Some libcs have memmem, but at the moment the libc crate does not wrap this. You can use the twoway crate however. rust-bio implements some pattern matching algorithms, too. All of those should be faster than using haystack.windows(..).position(..)
I found the memmem crate useful for this task:
use memmem::{Searcher, TwoWaySearcher};
let search = TwoWaySearcher::new("dog".as_bytes());
assert_eq!(
search.search_in("The quick brown fox jumped over the lazy dog.".as_bytes()),
Some(41)
);
How about Regex on bytes? That looks very powerful. See this Rust playground demo.
extern crate regex;
use regex::bytes::Regex;
fn main() {
//see https://doc.rust-lang.org/regex/regex/bytes/
let re = Regex::new(r"say [^,]*").unwrap();
let text = b"say foo, say bar, say baz";
// Collect the starting byte offset of every match.
// The unwrap is OK here since capture group 0 (the whole match) always exists.
let starts: Vec<usize> =
re.captures_iter(text)
.map(|c| c.get(0).unwrap().start())
.collect();
assert_eq!(starts, vec![0, 9, 18]);
}

Why is capitalizing the first letter of a string so convoluted in Rust?

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:
let mut s = "foobar";
s[0] = s[0].to_uppercase();
But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.
let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Why is it so convoluted?
Let's break it down, line-by-line
let s1 = "foobar";
We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
let mut v: Vec<char> = s1.chars().collect();
This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
v[0] = v[0].to_uppercase().nth(0).unwrap();
This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.
This code would panic if to_uppercase() yielded no characters at all. As far as I can tell from the documentation that never happens: a character without an uppercase mapping yields itself. (Indexing v[0] does panic on an empty string, though.) It could also semantically fail when a code point has an uppercase variant that has multiple characters, such as the German ß. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
let s2: String = v.into_iter().collect();
Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.
let s3 = &s2;
And now we take a reference to that String.
It's a simple problem
Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?
I presume char::to_uppercase already properly handles Unicode.
Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases.
Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
why the need for all data type conversions?
Because the data types you are working with are important when you are worried about correctness and performance. A char is 32-bits and a string is UTF-8 encoded. They are different things.
indexing could return a multi-byte, Unicode character
There may be some mismatched terminology here. A char is a multi-byte Unicode character.
Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
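To make the character-boundary point concrete, here is a small sketch. The helper names are made up; it relies only on char::len_utf8 and str::get, which returns None instead of panicking when a range does not fall on character boundaries.

```rust
/// Hypothetical helper: number of bytes the first char occupies in UTF-8.
fn first_char_len(s: &str) -> Option<usize> {
    s.chars().next().map(char::len_utf8)
}

/// Hypothetical helper: the first char as a string slice, without panicking.
fn first_char_str(s: &str) -> Option<&str> {
    let len = first_char_len(s)?;
    // `get` checks bounds and char boundaries and returns None on failure.
    s.get(..len)
}
```

For "a" the first character is 1 byte, for "ß" it is 2 and for "😎" it is 4, which is why slicing with a fixed 0..1 only works for ASCII.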
to_uppercase could return an upper case character
As mentioned above, ß is a single character that, when capitalized, becomes two characters.
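This is easy to check: char::to_uppercase returns an iterator precisely because the result may be more than one char. A tiny sketch (the helper name uppercase_char is mine):

```rust
/// Hypothetical helper: collect the uppercase expansion of one char into a String.
fn uppercase_char(c: char) -> String {
    // The iterator yields one or more chars; a char with no uppercase
    // mapping yields itself.
    c.to_uppercase().collect()
}
```

uppercase_char('ß') produces the two-character string "SS", while uppercase_char('a') produces "A".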
Solutions
See also trentcl's answer which only uppercases ASCII characters.
Original
If I had to write the code, it'd look like:
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().chain(c).collect(),
}
}
fn main() {
println!("{}", some_kind_of_uppercase_first_letter("joe"));
println!("{}", some_kind_of_uppercase_first_letter("jill"));
println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
println!("{}", some_kind_of_uppercase_first_letter("ß"));
}
But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.
Improved
Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.
However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:
extern crate inflector;
use inflector::Inflector;
let capitalized = "some string".to_title_case();
It's not especially convoluted if you are able to limit your input to ASCII-only strings.
Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:
fn make_ascii_titlecase(s: &mut str) {
if let Some(r) = s.get_mut(0..1) {
r.make_ascii_uppercase();
}
}
This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard".
Use with caution.
I did it this way:
fn str_cap(s: &str) -> String {
format!("{}{}", (&s[..1].to_string()).to_uppercase(), &s[1..])
}
If it is not an ASCII string:
fn str_cap(s: &str) -> String {
format!("{}{}", s.chars().next().unwrap().to_uppercase(),
s.chars().skip(1).collect::<String>())
}
The OP's approach taken further:
replace the first character with its uppercase representation
let mut s = "foobar".to_string();
let r = s.remove(0).to_uppercase().to_string() + &s;
or
let r = format!("{}{s}", s.remove(0).to_uppercase());
println!("{r}");
works with Unicode characters as well eg. "😎foobar"
If the first character is guaranteed to be ASCII, it can be changed to a capital letter in place:
let mut s = "foobar".to_string();
if !s.is_empty() {
s[0..1].make_ascii_uppercase(); // Foobar
}
Panics with a non ASCII character in first position!
Since the method to_uppercase() returns a new string, you should be able to just add the remainder of the string, like so.
This was tested in Rust version 1.57+ but is likely to work in any version that supports slicing.
fn uppercase_first_letter(s: &str) -> String {
s[0..1].to_uppercase() + &s[1..]
}
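Note that slicing with 0..1 panics on an empty string and when the first character is wider than one byte. A variant that splits at the width of the first char avoids both; this is only a sketch, and the name uppercase_first is made up.

```rust
/// Hypothetical helper: uppercase the first char of `s`, whatever its width.
fn uppercase_first(s: &str) -> String {
    match s.chars().next() {
        None => String::new(),
        Some(first) => {
            // Split exactly after the first char, so the index is always
            // on a character boundary.
            let (head, tail) = s.split_at(first.len_utf8());
            head.to_uppercase() + tail
        }
    }
}
```

This turns "édouard" into "Édouard" and returns an empty String for empty input. It still uses the locale-independent Unicode mapping, so the caveats about ß and the Turkish i apply here as well.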
Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:
fn capitalize_first(s: &str) -> String {
let mut chars = s.chars();
chars
.next()
.map(|first_letter| first_letter.to_uppercase())
.into_iter()
.flatten()
.chain(chars)
.collect()
}
This is how I solved this problem. Note that I had to check whether self is ASCII before transforming to uppercase.
trait TitleCase {
fn title(&self) -> String;
}
impl TitleCase for &str {
fn title(&self) -> String {
if !self.is_ascii() || self.is_empty() {
return String::from(*self);
}
let (head, tail) = self.split_at(1);
head.to_uppercase() + tail
}
}
pub fn main() {
println!("{}", "bruno".title());
println!("{}", "b".title());
println!("{}", "🦀".title());
println!("{}", "ß".title());
println!("{}", "".title());
println!("{}", "བོད་སྐད་ལ".title());
}
Output
Bruno
B
🦀
ß
བོད་སྐད་ལ
Inspired by the get_mut examples, I wrote something like this:
fn make_capital(in_str : &str) -> String {
let mut v = String::from(in_str);
v.get_mut(0..1).map(|s| { s.make_ascii_uppercase(); &*s });
v
}
