Checking a byte inside of a loop doesn't work - rust

I'm building a translator that converts words into Pig Latin (e.g. apple becomes apple-hay and happy becomes appy-fay). If the word begins with a vowel, it keeps every letter and adds "-hay" to the end; if it begins with a consonant, it drops the first letter and adds "-fay" to the end:
use std::str;
fn main() {
// The case when it works perfectly well
let dict = String::from("Hello").into_bytes();
let vowels: Vec<u8> = vec![b'a', b'e', b'i', b'o', b'u'];
let mut result = String::new();
for c in vowels.iter() {
if &dict[0] == c {
result = str::from_utf8(&dict).unwrap().to_owned() + "-hay ";
} else {
result = str::from_utf8(&dict[1..]).unwrap().to_owned() + "-fay";
}
}
println!("{}", result);
}
The code compiles without any errors or warnings and if I pass a string that begins with consonant it works perfectly well. However, when I pass a string that starts with a vowel, e.g. apple, the function behaves just like it began from a consonant and still performs actions from the else block. What is my error here?

You need to break once you find a matching vowel. Otherwise, unless the first letter of the string happens to be the last vowel in your set, the loop keeps comparing that letter against every remaining vowel after the match, finds that it doesn't match (a letter can't be two vowels at once), and each time overwrites result in the else branch. The last comparison wins, so the word is treated as if it began with a consonant.
Better still, move this into a separate function instead of main() and simply return as soon as you find a match; then you need neither a result variable nor a break.
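A minimal sketch of that advice, keeping the suffix rules from the question. The names starts_with_vowel and pig_latin are hypothetical, and the case-insensitive comparison is an added assumption (the question's code only checks lowercase vowels):

```rust
/// Returns true when `word` starts with an ASCII vowel.
/// Hypothetical helper; lowercasing the first letter is an assumption.
fn starts_with_vowel(word: &str) -> bool {
    const VOWELS: [char; 5] = ['a', 'e', 'i', 'o', 'u'];
    let first = match word.chars().next() {
        Some(c) => c.to_ascii_lowercase(),
        None => return false, // empty word: there is no first letter
    };
    for v in VOWELS.iter() {
        if first == *v {
            // Returning here means later vowels can never undo the match.
            return true;
        }
    }
    false
}

/// Applies the question's rules: "-hay" for vowel-initial words,
/// otherwise drop the first letter and append "-fay".
fn pig_latin(word: &str) -> String {
    if starts_with_vowel(word) {
        format!("{}-hay", word)
    } else {
        let mut chars = word.chars();
        chars.next(); // skip the first character, whatever its byte width
        format!("{}-fay", chars.as_str())
    }
}

fn main() {
    println!("{}", pig_latin("apple"));
    println!("{}", pig_latin("Hello"));
}
```

Working on chars rather than raw bytes also avoids slicing in the middle of a multi-byte first character.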

Related

How to get a substring of a &str based on character index?

I am trying to write a program that takes a list of words and then, if the word has an even length, prints the two middle letters. If the word has an odd length, it prints the single middle letter.
I can find the index of the middle letter(s), but I do not know how to use that index to print the corresponding letters of the word.
fn middle(wds: &[&str]) {
    for word in wds {
        let index = word.chars().count() / 2;
        match word.chars().count() % 2 {
            0 => println!("Even word found"),
            _ => println!("odd word found"),
        }
    }
}
fn main(){
let wordlist = ["Some","Words","to","test","testing","elephant","absolute"];
middle(&wordlist);
}
You can use slices for this, specifically &str slices. Note the &.
These links might be helpful:
https://riptutorial.com/rust/example/4146/string-slicing
https://doc.rust-lang.org/book/ch04-03-slices.html
fn main() {
let s = "elephant";
let mid = s.len() / 2;
let sliced = &s[mid - 1..mid + 1];
println!("{}", sliced);
}
Hey, after posting I found two different ways of doing it. Having two separate approaches in my head was confusing me and keeping me from finding the exact answer.
//i fixed printing the middle letter of the odd numbered string with
word.chars().nth(index).unwrap()
//to fix the even index problem i did
&word[index-1..index+1]
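The two fixes above can be folded into one function. This is only a sketch: middle_of is a hypothetical name, and it counts chars rather than bytes so that non-ASCII words do not panic when sliced:

```rust
/// Returns the middle character of an odd-length word, or the two middle
/// characters of an even-length word. Lengths are counted in chars.
/// `middle_of` is a hypothetical helper name.
fn middle_of(word: &str) -> &str {
    let count = word.chars().count();
    if count == 0 {
        return "";
    }
    // Char position of the first middle character.
    let first = (count - 1) / 2;
    let width = if count % 2 == 0 { 2 } else { 1 };
    // Byte offset of every char, followed by the total byte length,
    // so the end of the last char is also a valid slice bound.
    let mut offsets = word
        .char_indices()
        .map(|(i, _)| i)
        .chain(Some(word.len()));
    let start = offsets.nth(first).unwrap();
    // `nth` resumes after `start`, so this lands `width` chars further on.
    let end = offsets.nth(width - 1).unwrap();
    &word[start..end]
}

fn main() {
    for word in ["Some", "Words", "to", "test", "testing", "elephant", "absolute"].iter() {
        println!("{} -> {}", word, middle_of(word));
    }
}
```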

Split string and skip empty substrings

I am learning Rust and discovered this problem:
I would like to split a string by a pattern and remove all cases where the resulting substring is empty.
Here is an example:
let s = "s,o,m,e,";
for elem in s.split(",").skip_while(|&x| x.is_empty()) {
print!(" <{}> ", elem);
//print!(" <{}>({}) ", elem, elem.is_empty());
}
But the result is the following:
<s> <o> <m> <e> <>
My thoughts were: The struct Split returned by split implements Iterator which provides skip_while. IntelliSense told me the x in the closure is of type &&str so I would assume all the elements of the iterator (of type &str) which are empty to be omitted.
But it doesn't skip the empty substring.
I also tried printing the result of is_empty. It shows that the last slice is indeed empty. If I instead use skip_while(|&x| x == "s"), it correctly leaves out the "s" (printed with is_empty here):
<o>(false) <m>(false) <e>(false) <>(true)
So somehow the slice behaves differently in the iterator?
Why is that or where am I mistaken?
If you only need to omit the 1 empty string at the end of input, just use split_terminator instead of split. This adapter is basically just like split but it treats the pattern argument as a terminator instead of a separator, so the empty string at the end of input is not considered a new element.
If you truly want to skip all the empty strings, keep reading.
skip_while is doing exactly as its documentation says (emphasis mine):
skip_while() takes a closure as an argument. It will call this closure on each element of the iterator, and ignore elements until it returns false.
After false is returned, skip_while()'s job is over, and the rest of the elements are yielded.
Filtering out all the elements that match a predicate, regardless of where they are in the sequence, is the job of filter:
let s = ",s,o,,m,e,";
for elem in s.split(",").filter(|&x| !x.is_empty()) {
print!(" <{}> ", elem);
}
The above code will have the effect you wanted.
Note that the predicate to filter has the opposite meaning to skip_while: instead of returning true for elements that should not be yielded, it returns true for elements that should be yielded.
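To compare the adapters discussed in this answer side by side, here is a small sketch that collects each variant into a Vec. The helper name non_empty_fields is hypothetical:

```rust
/// Splits on commas and drops every empty piece, wherever it occurs.
/// `non_empty_fields` is a hypothetical helper name.
fn non_empty_fields(s: &str) -> Vec<&str> {
    s.split(',').filter(|x| !x.is_empty()).collect()
}

fn main() {
    let s = ",s,o,,m,e,";

    // split keeps every empty piece.
    let all: Vec<&str> = s.split(',').collect();
    assert_eq!(all, vec!["", "s", "o", "", "m", "e", ""]);

    // split_terminator drops only the empty piece after the trailing comma.
    let terminated: Vec<&str> = s.split_terminator(',').collect();
    assert_eq!(terminated, vec!["", "s", "o", "", "m", "e"]);

    // skip_while drops only the leading run of empty pieces.
    let skipped: Vec<&str> = s.split(',').skip_while(|x| x.is_empty()).collect();
    assert_eq!(skipped, vec!["s", "o", "", "m", "e", ""]);

    // filter drops all of them.
    assert_eq!(non_empty_fields(s), vec!["s", "o", "m", "e"]);
}
```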
skip_while ignores elements until the closure returns false; that is, in your case it yields every element from the first non-empty string onward. For instance, given:
let s = ",,s,o,m,e,";
skip_while will ignore the first two empty strings but keep the last one:
let s = ",,s,o,m,e,";
for elem in s.split(",").skip_while(|&x| x.is_empty()) {
print!(" <{}> ", elem);
}
Will print:
<s> <o> <m> <e> <>
For your case, it seems you just need filter:
let s = "s,o,m,e,";
for elem in s.split(",").filter(|&x| !x.is_empty()) {
print!(" <{}> ", elem);
}

Slice a string containing Unicode chars

I have a piece of text with characters of different byte lengths.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:
slice = text[start:end] ?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
Possible solutions to codepoint slicing
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. Usually, people expect the [] operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices by Unicode codepoints. This is what a Rust char represents. It's 32 bits wide (a few fewer bits would suffice, but we round up to a power of 2).
But what you actually want to do is slice user perceived characters. But this is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more unicode codepoints. Consider this Python 3 code:
>>> s = "Jürgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'
Also not what you'd expect. This time, what looks like "fi" is actually the ligature ﬁ (U+FB01), which is one codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
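The same two surprises can be reproduced in Rust with only the standard library. This is a sketch; first_codepoints is a hypothetical helper that mimics Python's s[0:n], and the strings are written with explicit escapes so the codepoints are unambiguous:

```rust
/// Takes the first `n` codepoints (Rust chars) of `s`.
/// `first_codepoints` is a hypothetical helper name.
fn first_codepoints(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // J, u, COMBINING DIAERESIS (U+0308), r, g, e, n
    let decomposed = "Ju\u{0308}rgen";
    // Six user-perceived characters, but seven codepoints and eight bytes.
    assert_eq!(decomposed.chars().count(), 7);
    assert_eq!(decomposed.len(), 8);
    // Cutting after two codepoints strands the diaeresis.
    assert_eq!(first_codepoints(decomposed, 2), "Ju");

    // LATIN SMALL LIGATURE FI (U+FB01) renders as two letters
    // but is a single codepoint.
    let ligature = "\u{FB01}re";
    assert_eq!(ligature.chars().count(), 3);
    assert_eq!(first_codepoints(ligature, 2), "\u{FB01}r");
}
```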
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
Further resources on this topic:
Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions"
http://utf8everywhere.org/
A UTF-8 encoded string may contain characters that consist of multiple bytes. In your case, п starts at byte index 6 (inclusive) and ends at byte index 8 (exclusive), so index 7 is not the start of a character. This is why your error occurred.
You may use str::char_indices() for solving this (remember, that getting to a position in a UTF-8 string is O(n)):
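fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(|(start_pos, _)| {
        let rest = &string[start_pos..];
        // Offsets yielded here are relative to `rest`; chaining rest.len()
        // lets `end` point one past the last character.
        rest.char_indices()
            .map(|(pos, _)| pos)
            .chain(Some(rest.len()))
            .nth(end - start)
            .map(|len| &rest[..len])
    })
}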
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();
Here is a function which retrieves a UTF-8 slice, with the following pros:
handles all edge cases (empty input, 0-width output ranges, out-of-scope ranges);
never panics;
uses start-inclusive, end-exclusive ranges.
pub fn utf8_slice(s: &str, start: usize, end: usize) -> Option<&str> {
let mut iter = s.char_indices()
.map(|(pos, _)| pos)
.chain(Some(s.len()))
.skip(start)
.peekable();
let start_pos = *iter.peek()?;
for _ in start..end { iter.next(); }
Some(&s[start_pos..*iter.peek()?])
}

How to compare upper and lowercase letters in a conditional in Swift

Apologies if this is a duplicate. I have a helper function called inputString() that takes user input and returns a String. I want to proceed based on whether an upper or lowercase character was entered. Here is my code:
print("What do you want to do today? Enter 'D' for Deposit or 'W' for Withdrawl.")
operation = inputString()
if operation == "D" || operation == "d" {
print("Enter the amount to deposit.")
My program quits after the first print function, but gives no compiler errors. I don't know what I'm doing wrong.
It's important to keep in mind that there is a whole slew of purely whitespace characters that show up in strings, and sometimes, those whitespace characters can lead to problems just like this.
So, whenever you are certain that two strings should be equal, it can be useful to print them with some sort of non-whitespace character on either end of them.
For example:
print("Your input was <\(operation)>")
That should print the user input with angle brackets on either side of the input.
And if you stick that line into your program, you'll see it prints something like this:
Your input was <D
>
So it turns out that your inputString() method is capturing the newline character (\n) that the user presses to submit their input. You should improve your inputString() method to go ahead and trim that newline character before returning its value.
I feel it's really important to mention here that your inputString method is really clunky and requires importing modules. But there's a way simpler pure Swift approach: readLine().
Swift's readLine() method does exactly what your inputString() method is supposed to be doing, and by default, it strips the newline character off the end for you (there's an optional parameter you can pass to prevent the method from stripping the newline).
My version of your code looks like this:
func fetchInput(prompt: String? = nil) -> String? {
if let prompt = prompt {
print(prompt, terminator: "")
}
return readLine()
}
if let input = fetchInput("Enter some input: ") {
if input == "X" {
print("it matches X")
}
}
The cause of the error that you experienced is explained in "Swift how to compare string which come from NSString". Essentially, we need to remove any whitespace or non-printing characters such as newlines.
I also used .uppercaseString to simplify the comparison.
The amended code is as follows:
func inputString() -> String {
var keyboard = NSFileHandle.fileHandleWithStandardInput()
var inputData = keyboard.availableData
let str: String = (NSString(data: inputData, encoding: NSUTF8StringEncoding)?.stringByTrimmingCharactersInSet(
NSCharacterSet.whitespaceAndNewlineCharacterSet()))!
return str
}
print("What do you want to do today? Enter 'D' for Deposit or 'W' for Withdrawl.")
let operation = inputString()
if operation.uppercaseString == "D" {
print("Enter the amount to deposit.")
}

Accessing a character in a borrowed string by index

How would you access an element in a borrowed string by index?
Straightforward in Python:
my_string_list = list(my_string)
print my_string_list[0]
print my_string[0] # same as above
Rust (attempt 1):
let my_string_vec = vec![my_string]; // doesn't work
println!("{}", my_string_vec[0]); // prints all of `my_string`
Rust (attempt 2):
let my_string_vec = my_string.as_bytes(); // returns a &[u8]
println!("{}", my_string_vec[0]); // prints nothing
My end goal is to stick it into a loop like this:
for pos in 0..my_string_vec.len() {
while shift <= pos && my_string_vec[pos] != my_string_vec[pos-shift] {
shift += shifts[pos-shift];
}
shifts[pos+1] = shift;
}
for ch in my_string_vec {
let pos = 0; // simulate some runtime index
if my_other_string_vec[pos] != ch {
...
}
}
I think it's possible to use my_string_vec.as_bytes()[pos] and my_string_vec.as_bytes()[pos-shift] in my condition statement, but I feel that this has a bad code smell.
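To access a specific character, use my_string.chars().nth(index), which returns an Option<char> (older Rust versions offered char_at(index), but it is not available in current stable Rust). If you want to iterate over the characters in a string, you can use the chars() method, which yields an iterator over the characters in the string.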
The reason it was specifically not made possible to use indexing syntax is, IIRC, because indexing syntax would give the impression that it was like accessing a character in your typical C-string-like string, where accessing a character at a given index is a constant time operation (i.e. just accessing a single byte in an array). Strings in Rust, on the other hand, are Unicode and a single character may not necessarily consist of just one byte, making a specific character access a linear time operation, so it was decided to make that performance difference explicit and clear.
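The difference between byte access and character access can be seen in a short sketch. nth_char is a hypothetical helper name, and the sample string is an assumption chosen because it contains one two-byte character:

```rust
/// Returns the character at char position `index`, walking from the start.
/// `nth_char` is a hypothetical helper name; the lookup is O(n).
fn nth_char(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}

fn main() {
    // "héllo" with é written as an explicit escape (U+00E9, two bytes in UTF-8).
    let my_string: &str = "h\u{e9}llo";

    // Byte access is constant time but yields a u8, which only
    // corresponds to a whole character for ASCII text.
    let first_byte: u8 = my_string.as_bytes()[0];
    assert_eq!(first_byte, b'h');

    // Character access returns an Option because the index may be out of range.
    assert_eq!(nth_char(my_string, 1), Some('\u{e9}'));
    assert_eq!(nth_char(my_string, 10), None);

    // Byte length and character count differ once non-ASCII text is involved.
    assert_eq!(my_string.len(), 6);
    assert_eq!(my_string.chars().count(), 5);
}
```

If the input is known to be ASCII, indexing as_bytes() directly is a reasonable choice and keeps the loop from the question O(1) per access.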
As far as I know, there is no method available for swapping characters in a string (see this question). Note that this wouldn't have been possible anyways via an immutably borrowed string, since such a string isn't yours to modify. You would have to most likely use a String, or perhaps a &mut str if you're strictly swapping, but I'm not too familiar with Unicode's intricacies.
I recommend instead you build up a String the way you want it, that way you don't have to worry about the mutability of the borrowed string. You'd refer/look into the borrowed string, and write into the output/build-up string accordingly based on your logic.
So this:
for pos in 0..my_string_vec.len() {
while shift <= pos && my_string_vec[pos] != my_string_vec[pos-shift] {
shift += shifts[pos-shift];
}
shifts[pos+1] = shift;
}
Might become something like this (not tested; not clear what your logic is for):
for (pos, ch) in my_string.chars().enumerate() {
    while shift <= pos && Some(ch) != my_string.chars().nth(pos - shift) {
        // assuming shifts is a vec; not clear in question
        shift += shifts[pos - shift];
    }
    shifts.push(shift);
}
Your last for loop:
for ch in my_string_vec {
let pos = 0; // simulate some runtime index
if my_other_string_vec[pos] != ch {
...
}
}
That kind of seems like you want to compare a given character in string A with the corresponding character (in the same position) of string B. For this I would recommend zipping the chars iterator of the first with the second, something like:
for (left, right) in my_string.chars().zip(my_other_string.chars()) {
if left != right {
}
}
Note that zip() stops iterating as soon as either iterator stops, meaning that if the strings are not the same length, then it'll only go as far as the shortest string.
If you need access to the "character index" information, you could add .enumerate() to that, so the above would change to:
for (index, (left, right)) in my_string.chars().zip(my_other_string.chars()).enumerate()
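As a complete, runnable sketch of the zip plus enumerate approach, the function below reports the first position where two strings differ. The name first_mismatch and its return shape are assumptions made for illustration:

```rust
/// Returns the char index and the differing characters at the first
/// position where `a` and `b` disagree, or None if no compared pair differs.
/// `first_mismatch` is a hypothetical helper name.
fn first_mismatch(a: &str, b: &str) -> Option<(usize, char, char)> {
    a.chars()
        .zip(b.chars()) // stops at the end of the shorter string
        .enumerate()
        .find(|(_, (left, right))| left != right)
        .map(|(index, (left, right))| (index, left, right))
}

fn main() {
    match first_mismatch("kitten", "kitchen") {
        Some((index, left, right)) => {
            println!("differ at {}: {} vs {}", index, left, right)
        }
        None => println!("no difference in the compared range"),
    }
}
```

Because zip() stops at the shorter input, a None result does not prove the strings are equal; compare their lengths separately if that matters.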
