How to parse string containing vectors to actual vector - rust

I have a string that I'm reading from input stream:
"['App01', 'App02', 'App03' , 'App04']"
What's the most efficient way to convert it to a Vec<String> type in Rust?

As a Code Golf answer:
serde_json::from_str(&s.replace('\'', "\"")).unwrap()
But both replace and from_str have to copy the string data, so memory-wise it's probably not the best.

There are a lot of ways to parse a string like this, and it depends on what you mean by "efficient" (performance? lines of code?), but here's a possible solution:
fn main() {
    let input = "['App01', 'App02', 'App03' , 'App04']";
    // Start by removing the surrounding square brackets.
    let trimmed = input.trim_matches(['[', ']'].as_slice());
    // Make the vector by:
    let vector: Vec<String> = trimmed
        .split(',') // separating the string parts by the comma
        .map(str::trim) // cleaning each part by removing surrounding space
        .map(|item| item.trim_matches('\'')) // then removing the single quotes
        .map(String::from) // and converting each item from a slice to an owned String
        .collect(); // and putting it all into a vector
    println!("{vector:#?}");
}
And its output:
[
    "App01",
    "App02",
    "App03",
    "App04",
]
That's assuming you truly want owned Strings. If string slices borrowed from the input are enough, it's a bit better to use those instead. That would look like this:
let vector: Vec<&str> = trimmed
    .split(',') // separating the string parts by the comma
    .map(str::trim) // cleaning each part by removing surrounding space
    .map(|item| item.trim_matches('\'')) // then removing the single quotes
    .collect(); // and putting it all into a vector
Notice that the .map(String::from) call is gone and the resulting type is Vec<&str>. This avoids duplicating the strings and instead just uses a reference to the original input, using less memory overall.
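One edge case the chain above does not cover is an empty list: "[]" splits into a single empty item. A minimal sketch of a borrowing parser that handles it follows. The name parse_list is hypothetical, and it assumes items never contain commas or single quotes.

```rust
/// Hypothetical helper (not from the answers above): parses "['a', 'b']"
/// into borrowed items, returning None when the square brackets are missing.
/// Assumes items contain no commas or single quotes.
fn parse_list(input: &str) -> Option<Vec<&str>> {
    // Remove exactly one '[' at the start and one ']' at the end.
    let inner = input.trim().strip_prefix('[')?.strip_suffix(']')?;
    let items = inner
        .split(',')
        .map(|item| item.trim().trim_matches('\''))
        // "[]" would otherwise yield one empty item; note that this
        // also drops genuinely empty entries such as ''.
        .filter(|item| !item.is_empty())
        .collect();
    Some(items)
}

fn main() {
    let parsed = parse_list("['App01', 'App02', 'App03' , 'App04']");
    println!("{parsed:?}");
}
```

Returning an Option lets the caller decide what to do with input that is not bracketed at all, instead of silently parsing it.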
Hope this helps.

Related

What is the most efficient way of taking a number of integer user inputs and storing it in a Vec<i32>?

I was trying to use rust for competitive coding and I was wondering what is the most efficient way of storing user input in a Vec. I have come up with a method but I am afraid that it is slow and redundant.
Here is my code:
use std::io;
fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("cant read line");
    let input:Vec<&str> = input.split(" ").collect();
    let input:Vec<String> = input.iter().map(|x| x.to_string()).collect();
    let input:Vec<i32> = input.iter().map(|x| x.trim().parse().unwrap()).collect();
    println!("{:?}", input);
}
PS: I am new to rust.
I see these ways of improving the performance of the code:
Although not really relevant for std::io::stdin(), std::io::BufReader may have a great effect when reading e.g. from std::fs::File. Buffer capacity can also matter.
Using locked stdin: let si = std::io::stdin(); let si = si.lock();
Avoiding allocations by keeping vectors around and using extend instead of collect, if the code reads multiple lines (unlike in the sample you posted in the question).
Maybe avoiding temporary vectors altogether and just chaining Iterator operations together. Or using a loop like for line in input.split(...) { ... }. It may affect performance either way, so you need to experiment to find out.
Avoiding to_string() and just storing references into the input buffer (which can also be used to parse() into i32). Note that this may invite the famous Rust borrowing and lifetime complexity.
Maybe finding some fast SIMD-enhanced string-to-int parser instead of libstd's parse().
Maybe streaming the result to the algorithm instead of collecting everything into a Vec first. This can be beneficial especially if multiple threads can be used. For performance, you would still likely need to send data in chunks, not one i32 at a time.
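A few of these suggestions (locked stdin, reusing one line buffer, no temporary vectors) can be combined in a short sketch. The helper name read_all_ints is hypothetical, and whether it is actually faster for a given input should be measured rather than assumed.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: reads whitespace-separated i32 values from every
/// line of `reader`, reusing one String buffer and one Vec throughout.
fn read_all_ints<R: BufRead>(mut reader: R) -> io::Result<Vec<i32>> {
    let mut numbers = Vec::new();
    let mut line = String::new(); // allocated once, cleared after each line
    while reader.read_line(&mut line)? != 0 {
        for token in line.split_ascii_whitespace() {
            // Parse straight from the borrowed slice; no to_string() needed.
            let n = token
                .parse::<i32>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            numbers.push(n);
        }
        line.clear();
    }
    Ok(numbers)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    // lock() takes the stdin lock once instead of once per read.
    let numbers = read_all_ints(stdin.lock())?;
    println!("{:?}", numbers);
    Ok(())
}
```

Taking any BufRead rather than stdin directly also makes the function easy to exercise with an in-memory byte slice.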
Yeah, there are some changes you can make that will make your code more concise, simpler and faster.
A better version:
use std::io;
fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).unwrap();
    let input: Vec<i32> = input.split_whitespace().map(|x| x.parse().unwrap()).collect();
    println!("{:?}", input);
}
Explanation
input.split_whitespace() returns an iterator over the substrings that are separated by any kind of whitespace, including line breaks. This saves the time spent splitting on a single space with input.split(" ") and then iterating again to call .trim() on each string slice to remove any surrounding whitespace.
(You can also check out input.split_ascii_whitespace() if you want to restrict the split to ASCII whitespace.)
There was no need for input.iter().map(|x| x.to_string()).collect(), since you can call .trim() and .parse() directly on a string slice.
This saves time both at runtime and when writing the code, since .collect() is called only once and there is just one pass over the input.
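To make the difference concrete, here is a small sketch comparing the two splitting methods on a line with a doubled space and a trailing newline. The helper parse_line is a hypothetical variant that returns a Result instead of calling unwrap().

```rust
use std::num::ParseIntError;

/// Hypothetical helper: same idea as the one-liner above, but reports
/// bad input to the caller instead of panicking.
fn parse_line(line: &str) -> Result<Vec<i32>, ParseIntError> {
    line.split_whitespace().map(|t| t.parse::<i32>()).collect()
}

fn main() {
    let line = "1  2 3\n";

    // split(' ') keeps an empty piece for the doubled space and leaves
    // the newline attached to the last piece, hence the extra trim().
    let by_space: Vec<&str> = line.split(' ').collect();
    assert_eq!(by_space, ["1", "", "2", "3\n"]);

    // split_whitespace() yields only the non-empty tokens.
    let by_whitespace: Vec<&str> = line.split_whitespace().collect();
    assert_eq!(by_whitespace, ["1", "2", "3"]);

    println!("{:?}", parse_line(line)); // prints Ok([1, 2, 3])
}
```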

Get a random character from a string and append to another string

I'm trying to write the Rust equivalent of the following C++ code:
result += consonants[rand() % consonants.length()];
It is meant to take a random character from the string consonants and append it to the string result.
I seem to have found a working Rust equivalent, but it's... monstrous, to say the least. What would be a more idiomatic equivalent?
format!("{}{}", result, consonants.chars().nth(rand::thread_rng().gen_range(1, consonants.chars().count())).unwrap().to_string());
A few things:
You don't need to use format!() here. There is String::push() which appends a single char.
There is also the rand::sample() function which can randomly choose multiple elements from an iterator. This looks like the perfect fit!
So let's see how this fits together! I created three different versions for different use cases.
1. Unicode string (the general case)
let consonants = "bcdfghjklmnpqrstvwxyz";
let mut result = String::new();
result.push(rand::sample(&mut rand::thread_rng(), consonants.chars(), 1)[0]);
// rand::sample(.., 1) samples one element from the iterator;
// [0] gets the first element from the returned vector.
We sample only one element from the iterator and immediately push it to the string. Still not as short as with C's rand(), but please note that rand() is considered harmful for any kind of serious use! Using C++'s <random> header is a lot better, but will require a little bit more code, too. Additionally, your C++ version can't handle multi-byte characters (e.g. UTF-8 encoding), while the Rust version has full UTF-8 support.
2. ASCII string
However, if you only want to have a string with English consonants, then UTF-8 is not needed and we can make use of O(1) indexing, by using a byte slice:
use rand::{thread_rng, Rng};
let consonants = b"bcdfghjklmnpqrstvwxyz";
let mut result = String::new();
result.push(thread_rng().choose(consonants).cloned().unwrap().into());
// .cloned() converts Option<&u8> into Option<u8>
// .unwrap() is fine because we know `consonants` is not empty
// .into() converts `u8` into `char`
3. Collection of characters with Unicode support
As mentioned in the comments, you probably just want a collection of characters ("consonants"). This means, we don't have to use a string, but rather an array of chars. So here is one last version which does have UTF-8 support and avoids O(n) indexing:
use rand::{thread_rng, Rng};
// If you need to avoid the heap allocation here, you can create a static
// array like this: let consonants = ['b', 'c', 'd', ...];
let consonants: Vec<_> = "bcdfghjklmnpqrstvwxyz".chars().collect();
let mut result = String::new();
result.push(*thread_rng().choose(&consonants).unwrap());

Efficiently extract prefix substrings

Currently I'm using the following function to extract prefix substrings:
fn prefix(s: &String, k: usize) -> String {
    s.chars().take(k).collect::<String>()
}
This can then be used for comparisons like so:
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?
Yes, you can take subslices of strings using the Index operation:
fn prefix(s: &str, k: usize) -> &str {
    &s[..k]
}
fn main() {
    let my_string = "ACGT".to_string();
    let same = prefix(&my_string, 3) == prefix(&my_string, 2);
    println!("{}", same);
}
Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice lengths lie on valid UTF-8 boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Breaking either of these will result in a panic!.
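If a panic is not acceptable, one option is str::get, which performs the same byte-based slicing but returns None for an out-of-range or non-boundary index. A minimal sketch follows; the name checked_prefix is hypothetical.

```rust
/// Hypothetical helper: byte-based prefix that returns None instead of
/// panicking when `k` is past the end or not on a UTF-8 boundary.
fn checked_prefix(s: &str, k: usize) -> Option<&str> {
    s.get(..k)
}

fn main() {
    assert_eq!(checked_prefix("ACGT", 3), Some("ACG"));
    // Past the end of the string.
    assert_eq!(checked_prefix("ACGT", 9), None);
    // '\u{e9}' (e with acute accent) takes two bytes, so byte index 2
    // falls inside it.
    assert_eq!(checked_prefix("h\u{e9}llo", 2), None);
    assert_eq!(checked_prefix("h\u{e9}llo", 3), Some("h\u{e9}"));
}
```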
A bit more defensive version would be
fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[0..idx]
}
The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
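A short sketch of how the char_indices version behaves on non-ASCII input, including the combining-character caveat. The function body is the one shown above; the sample strings are only illustrative.

```rust
fn prefix(s: &str, k: usize) -> &str {
    // Byte offset of the k-th char, or the whole string if it is shorter.
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[..idx]
}

fn main() {
    // Counts chars rather than bytes, so a two-byte char is kept whole.
    assert_eq!(prefix("h\u{e9}llo", 2), "h\u{e9}");
    // Asking for more chars than exist returns the whole string.
    assert_eq!(prefix("ACGT", 10), "ACGT");
    // Caveat: 'a' followed by the combining mark U+0310 is displayed as
    // one character but is two chars, so k = 1 cuts the mark off.
    assert_eq!(prefix("a\u{310}bc", 1), "a");
}
```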
Most other languages I'm familiar with have an efficient way
Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.
Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...
Two additional notes
Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same number of bytes.
You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs — knowing the capacity of a String, but without being able to modify the string.
While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.
If you know in advance the set of characters you're working with ("ATGC" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8] instead of string slices &str. You can always get a byte slice out of a string slice and a Vec<u8> out of a String, if necessary:
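let s: String = "ATGC".into();
let ss: &str = &s;
let bs: &[u8] = ss.as_bytes(); // borrow the bytes of the string slice
let b: Vec<u8> = s.clone().into_bytes(); // into_bytes() consumes the String, so clone to keep `s` borrowed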
Also, there are byte slice and byte character literals, just prefix regular string/char literals with b:
let sl: &[u8] = b"ATGC";
let bl: u8 = b'G';
Working with byte slices gives you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy (like Shepmaster's first variant, but without the possibility of panics unless k is too large):
fn prefix(s: &[u8], k: usize) -> &[u8] {
    &s[..k]
}
If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap():
let ss2: &str = std::str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();
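Putting the byte-slice approach together, here is a minimal sketch that compares prefixes of two ASCII sequences. The name byte_prefix is hypothetical, and it uses slice get so that an oversized k yields None rather than a panic.

```rust
/// Hypothetical helper: first `k` bytes of `s`, or None if `s` is shorter.
fn byte_prefix(s: &[u8], k: usize) -> Option<&[u8]> {
    s.get(..k)
}

fn main() {
    let a = b"ACGTTT";
    let b = b"ACGAAA";

    // Both sequences share the 3-byte prefix "ACG" but differ after it.
    assert_eq!(byte_prefix(a, 3), byte_prefix(b, 3));
    assert_ne!(byte_prefix(a, 4), byte_prefix(b, 4));

    // For a fixed, known prefix, slices also offer starts_with.
    assert!(a.starts_with(b"ACG"));

    // Converting back is safe to unwrap here because the data is ASCII.
    let text = std::str::from_utf8(byte_prefix(a, 3).unwrap()).unwrap();
    println!("{}", text); // prints ACG
}
```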

How to partition a string at a fixed index?

I have a String (in particular a SHA1 hex digest) that I would like to split into two substrings - the first two characters and the rest of the string. Is there a clean way to do this in Rust?
If you know that your string only contains ASCII characters (as is the case with SHA digests), you can use slices directly:
let s = "13e3f28a65a42bf6258cbd1d883d1ce3dac8f085";
let first = &s[..2]; // "13"
let rest = &s[2..]; // "e3f28a65a42bf6258cbd1d883d1ce3dac8f085"
It won't work correctly if your string contains non-ASCII characters because slicing uses byte offsets, and if any index used in slicing points into the middle of a code point representation, your program will panic.
There has been a split_at method since Rust 1.4. Use it like this:
let s = "13e3f28a65a42bf6258cbd1d883d1ce3dac8f085";
let (first, last) = s.split_at(2);
assert_eq!("13", first);
assert_eq!("e3f28a65a42bf6258cbd1d883d1ce3dac8f085", last);
Note that the index is a byte position and must lie on a character boundary. In this case this works because you know that your input string is ASCII.
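When the input is not guaranteed to be ASCII, the boundary can be checked before splitting. A minimal sketch using str::is_char_boundary follows; the name split_checked is hypothetical.

```rust
/// Hypothetical helper: like split_at, but returns None instead of
/// panicking when `byte_index` is out of range or inside a character.
fn split_checked(s: &str, byte_index: usize) -> Option<(&str, &str)> {
    // is_char_boundary is also false for indices past the end.
    if s.is_char_boundary(byte_index) {
        Some(s.split_at(byte_index))
    } else {
        None
    }
}

fn main() {
    let s = "13e3f28a65a42bf6258cbd1d883d1ce3dac8f085";
    let (first, last) = split_checked(s, 2).expect("index 2 is a boundary in ASCII");
    assert_eq!(first, "13");
    assert_eq!(last, "e3f28a65a42bf6258cbd1d883d1ce3dac8f085");
}
```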
If you are expecting two Strings instead of slices, you can use the chars() method and some Iterator methods to obtain them.
let text = "abcdefg".to_string();
let start: String = text.chars().take(2).collect();
let end: String = text.chars().skip(2).collect();
If you don't want to do heap allocations, you can use slices instead:
let start: &str = text.slice_chars(0, 2);
let end: &str = text.slice_chars(2, text.char_len());
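Note that slice_chars and char_len were unstable, nightly-only APIs when this was written, and they are not available in stable Rust. On stable, compute the byte offset with char_indices() and then slice or use split_at.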
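A sketch of the same character-counted split using only stable APIs, without heap allocation. The name split_at_char is hypothetical; it counts chars (Unicode scalar values), not bytes.

```rust
/// Hypothetical helper: splits `s` after its first `n` chars. If `s` has
/// fewer than `n` chars, the second half is empty.
fn split_at_char(s: &str, n: usize) -> (&str, &str) {
    // Byte offset where the n-th char starts, or the end of the string.
    let byte_index = s.char_indices().nth(n).map_or(s.len(), |(i, _)| i);
    s.split_at(byte_index)
}

fn main() {
    let text = "abcdefg".to_string();
    let (start, end) = split_at_char(&text, 2);
    assert_eq!(start, "ab");
    assert_eq!(end, "cdefg");
}
```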
Here is a way to efficiently split a String into two Strings, in case you have owned string data. The allocation of the input string is retained in the first piece by just using truncation.
/// Split a **String** at a particular index
///
/// **Panic** if **byte_index** is not a character boundary
fn split_string(mut s: String, byte_index: usize) -> (String, String) {
    let tail = s[byte_index..].into();
    s.truncate(byte_index);
    (s, tail)
}
Note: The .into() method is from the generic conversion trait Into and in this case it converts &str into String.
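The standard library also provides String::split_off, which does the copy-and-truncate in one call. A sketch of the same function written with it, under the same panic condition:

```rust
/// Same contract as the function above, written with String::split_off.
/// Panics if `byte_index` is not on a character boundary.
fn split_string(mut s: String, byte_index: usize) -> (String, String) {
    // split_off returns the tail as a newly allocated String and leaves
    // the head in `s`.
    let tail = s.split_off(byte_index);
    (s, tail)
}

fn main() {
    let digest = String::from("13e3f28a65a42bf6258cbd1d883d1ce3dac8f085");
    let (head, tail) = split_string(digest, 2);
    assert_eq!(head, "13");
    assert_eq!(tail, "e3f28a65a42bf6258cbd1d883d1ce3dac8f085");
}
```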

How do you iterate over a string by character

I have a string and I need to scan for every occurrence of "foo" and read all the text following it until a second ". Since Rust does not have a contains function for strings, I need to iterate by characters scanning for it. How would I do this?
Edit: Rust's &str has contains() and find() methods.
I need to iterate by characters scanning for it.
The .chars() method returns an iterator over characters in a string. e.g.
for c in my_str.chars() {
    // do something with `c`
}
for (i, c) in my_str.chars().enumerate() {
    // do something with character `c` and index `i`
}
If you are interested in the byte offsets of each char, you can use char_indices.
Look into .peekable(), and use peek() for looking ahead. Strings are exposed through iterators like this because they are stored as UTF-8, where code points vary in byte length, rather than as a simple vector of characters.
You could also create a vector of chars and work on it from there, but that's more time and space intensive:
let my_chars: Vec<_> = my_str.chars().collect();
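For the task in the question itself, the substring-search methods on &str avoid manual character scanning. The sketch below assumes one reading of the question: for every occurrence of foo, collect the text that follows it up to the next double quote. The name text_after_foo is hypothetical.

```rust
/// Hypothetical helper: for each "foo" in `s`, returns the text between
/// that match and the next '"'. Matches with no closing quote are skipped.
fn text_after_foo(s: &str) -> Vec<&str> {
    s.match_indices("foo")
        .filter_map(|(start, matched)| {
            // Everything after this occurrence of "foo".
            let rest = &s[start + matched.len()..];
            // find returns a byte offset, which is valid for slicing.
            rest.find('"').map(|end| &rest[..end])
        })
        .collect()
}

fn main() {
    let input = r#"foo bar" x foo baz" foo qux"#;
    println!("{:?}", text_after_foo(input));
}
```

Because find and match_indices return byte offsets that always lie on character boundaries, the slices here cannot panic on non-ASCII input.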
The concept of a "character" is very ambiguous and can mean many different things depending on the type of data you are working with. The most obvious answer is the chars method. However, chars iterates over Unicode scalar values, which may not match what you think of as a character. What looks like a single "character" to you may actually be made up of multiple Unicode code points, which can lead to unexpected results:
"a̐".chars() // => ['a', '\u{310}']
For a lot of string processing, you want to work with graphemes. A grapheme consists of one or more Unicode code points represented as a string slice. These map better to the human perception of "characters". To create an iterator of graphemes, you can use the unicode-segmentation crate:
use unicode_segmentation::UnicodeSegmentation;
for grapheme in my_str.graphemes(true) {
    // ...
}
If you are working with raw ASCII then none of the above applies to you, and you can simply use the bytes iterator:
for byte in my_str.bytes() {
    // ...
}
Although, if you are working with ASCII then arguably you shouldn't be using String/&str at all and instead use Vec<u8>/&[u8] directly.
fn main() {
    let s = "Rust is a programming language";
    for i in s.chars() {
        print!("{}", i);
    }
}
Output: Rust is a programming language
I use the chars() method to iterate over each element of the string.
