How to split Devanagari bi-, tri-, and tetra-conjunct consonants as a whole from a string?

I am new to Rust and I am trying to split Devanagari text into vowels and bi-, tri-, and tetra-conjunct consonants, keeping each conjunct whole together with its vowel sign and virama, so that I can later map them to other Indic scripts. I first tried Rust's chars(), which didn't work. Then I came across grapheme clusters. I have been googling and searching SO about Unicode and UTF-8, grapheme clusters, and complex scripts.
I have used grapheme clusters in my current code, but it does not give me the desired output. I understand that this method may not work for complex scripts like Devanagari or other Indic scripts.
How can I achieve the desired output? I have another code where I attempted to build a simple cluster using an answer from Stack Overflow, converting it from Python to Rust, but I have not had any luck yet. It's been 2 weeks and I have been stuck on this problem.
Here are the Wikipedia articles on the Devanagari script and its conjuncts:
Devanagari Script: https://en.wikipedia.org/wiki/Devanagari
Devanagari Conjuncts: https://en.wikipedia.org/wiki/Devanagari_conjuncts
Here's what I wrote to split:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    let hsi = hs.graphemes(true).collect::<Vec<&str>>();
    for i in hsi {
        print!("{} ", i); // double space eye comfort
    }
}
Current output:
हि न् दी मु ख् य मं त् री हि मं त
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
My other attempt:
I also tried to create a simple grapheme cluster following this SO answer https://stackoverflow.com/a/6806203/2724286
fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result = vec![];
    let mut temp = String::new();
    for c in text.chars() {
        if (c as u32) >= 0x0300 && (c as u32) <= 0x036F {
            temp.push(c);
        } else {
            temp.push(c);
            if !temp.is_empty() {
                result.push(temp.clone());
                temp.clear();
            }
        }
    }
    if !temp.is_empty() {
        result.push(temp);
    }
    result
}
fn main() {
    let text = "संस्कृतम्";
    let split_tokens = split_conjuncts(text);
    println!("{:?}", split_tokens);
}
Output:
["स", "\u{902}", "स", "\u{94d}", "क", "\u{943}", "त", "म", "\u{94d}"]
So, how can I get the desired output?
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
I also checked other SO answers (links below) dealing with Unicode, graphemes, and UTF-8 issues, but no luck yet.
Combined diacritics do not normalize with unicodedata.normalize (PYTHON)
what-is-the-difference-between-combining-characters-and-grapheme-extenders
extended-grapheme-clusters-stop-combining
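One possible direction, sketched here under stated assumptions rather than as a definitive answer: walk the chars and keep a character in the current cluster when it is a dependent Devanagari mark, or when it is a consonant that directly follows a virama (U+094D). The function names split_aksharas, is_dependent_mark, and is_consonant are hypothetical, the code point ranges are a hand-picked subset of the Devanagari block, whitespace is dropped, and ZWJ/ZWNJ are not handled.

```rust
const VIRAMA: char = '\u{094D}';

/// Assumed set of Devanagari code points that attach to a preceding base.
fn is_dependent_mark(c: char) -> bool {
    matches!(c,
        '\u{0900}'..='\u{0903}'   // candrabindu, anusvara, visarga
        | '\u{093A}'..='\u{093C}' // vowel signs oe/ooe, nukta
        | '\u{093E}'..='\u{094F}' // dependent vowel signs and virama
        | '\u{0951}'..='\u{0957}' // accent marks and extra vowel signs
        | '\u{0962}'..='\u{0963}' // vocalic l/ll vowel signs
    )
}

/// Assumed set of Devanagari consonants (ka..ha plus the nukta forms).
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0915}'..='\u{0939}' | '\u{0958}'..='\u{095F}')
}

/// Splits text into clusters, keeping virama-joined consonants together.
fn split_aksharas(text: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    let mut prev: Option<char> = None;
    for c in text.chars() {
        if c.is_whitespace() {
            // Whitespace ends the current cluster and is not emitted.
            prev = None;
            continue;
        }
        // A character joins the previous cluster if it is a dependent mark,
        // or a consonant that immediately follows a virama.
        let joins = prev.is_some()
            && (is_dependent_mark(c) || (prev == Some(VIRAMA) && is_consonant(c)));
        match clusters.last_mut() {
            Some(last) if joins => last.push(c),
            _ => clusters.push(c.to_string()),
        }
        prev = Some(c);
    }
    clusters
}
```

Calling split_aksharas("हिन्दी मुख्यमंत्री हिमंत") and joining the result with spaces should give the desired output shown in the question. A word-final virama, as in "संस्कृतम्", simply stays attached to its consonant.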

Related

How to rotate a vector without standard library?

I'm getting into Rust and Arduino at the same time.
I was programming my LCD display to show a long string by rotating it through the top column of characters. That is, every second I shift all characters by one position and show the new string.
This was fairly complex in the Arduino language, especially because I had to know the size of the String at compile time (given my limited knowledge).
Since I'd like to use Rust in the long term, I was curious to see if that could be done more easily in a modern language. Not so much.
This is the code I came up with, after hours of experimentation:
#![no_std]
extern crate alloc;
use alloc::{vec::Vec};
fn main() {
}
fn rotate_by<T: Copy>(rotate: Vec<T>, by: isize) -> Vec<T> {
    let real_by = modulo(by, rotate.len() as isize) as usize;
    Vec::from_iter(rotate[real_by..].iter().chain(rotate[..real_by].iter()).cloned())
}
fn modulo(a: isize, b: isize) -> isize {
    a - b * (a as f64 /b as f64).floor() as isize
}
mod tests {
    use super::*;
    #[test]
    fn test_rotate_five() {
        let chars: Vec<_> = "I am the string and you should rotate me! ".chars().collect();
        let res_chars: Vec<_> = "the string and you should rotate me! I am ".chars().collect();
        assert_eq!(rotate_by(chars, 5), res_chars);
    }
}
My questions are:
Could you provide an optimized version of this function? I'm aware that there already is Vec::rotate but it uses unsafe code and can panic, which I would like to avoid (by returning a Result).
Explain whether or not it is possible to achieve this in-place without unsafe code (I failed).
Is Vec<_> the most efficient data structure to work with? I tried hard to use [char], which I thought would be more efficient, but then I have to know the size at compile time, which hardly works. I thought Rust arrays would be similar to Java arrays, which can be sized at runtime yet are also fixed size once created, but they seem to have a lot more constraints.
Oh and also what happens if I index into a vector at an invalid index? Will it panic? Can I do this better? Without "manually" checking the validity of the slice indices?
I realize that's a lot of questions, but I'm struggling and this is bugging me a lot, so if somebody could set me straight it would be much appreciated!
You can use slice::rotate_left and slice::rotate_right:
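#![no_std]
extern crate alloc;
use alloc::vec::Vec;
// Positive `by` rotates left, zero or negative `by` rotates right.
// Note: rotate_left and rotate_right panic if the shift exceeds data.len().
fn rotate_by<T>(data: &mut [T], by: isize) {
    if by > 0 {
        data.rotate_left(by.unsigned_abs());
    } else {
        data.rotate_right(by.unsigned_abs());
    }
}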
I made it rotate in-place because that is more efficient. If you don't want to do it in-place, you still have the option of cloning the vector first. This is more flexible than a function that creates a new vector, as yours does, because with that design the caller cannot opt out of the allocation.
Notice that rotate_by takes a mutable slice, but you can still pass a mutable reference to a vector, because of deref coercion.
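Since the question asks for a version that cannot panic, here is a minimal sketch of one way to get there: reduce the shift modulo the slice length before rotating. The name rotate_wrapping is hypothetical, and wrapping an oversized shift (rather than reporting it as an error) is an assumption about the desired behavior.

```rust
/// Rotates in place; positive `by` rotates left, negative rotates right.
/// The shift wraps around the slice length, so no value of `by` panics.
fn rotate_wrapping<T>(data: &mut [T], by: isize) {
    if data.is_empty() {
        // Nothing to rotate, and this avoids a remainder by zero below.
        return;
    }
    let shift = by.unsigned_abs() % data.len();
    if by >= 0 {
        data.rotate_left(shift);
    } else {
        data.rotate_right(shift);
    }
}
```

This only uses safe slice methods from core, so it should also work under #![no_std]. If an out-of-range shift ought to be an error instead, the same length check could return a Result before rotating.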
#[test]
fn test_rotate_five() {
    let mut chars: Vec<_> = "I am the string and you should rotate me! ".chars().collect();
    let res_chars: Vec<_> = "the string and you should rotate me! I am ".chars().collect();
    rotate_by(&mut chars, 5);
    assert_eq!(chars, res_chars);
}
There are some edge cases with moving chars around like this, because some strings contain grapheme clusters that are made up of multiple codepoints (chars in Rust). This will result in strange effects when a grapheme cluster is split between the start and end of the string. For example, if the é in "abcdéfghijk" is stored as e followed by the combining acute accent U+0301, rotating by 5 will result in "\u{301}fghijkabcde", with the acute accent stranded on its own at the start, away from the e that has moved to the end.
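To see the split happen, a small sketch (the helper name rotate_chars_left is hypothetical) can rotate the chars of a string whose é is written in decomposed form, that is, e followed by U+0301:

```rust
/// Rotates the chars of `s` left by `by` positions (wrapping) and
/// collects them back into a String. Works per char, not per grapheme.
fn rotate_chars_left(s: &str, by: usize) -> String {
    let mut chars: Vec<char> = s.chars().collect();
    if !chars.is_empty() {
        let shift = by % chars.len();
        chars.rotate_left(shift);
    }
    chars.into_iter().collect()
}
```

With the input "abcde\u{301}fghijk" and a shift of 5, the five chars a, b, c, d, e move to the end, so the result begins with the lone combining accent: "\u{301}fghijkabcde". Rotating whole grapheme clusters instead of chars would avoid this, at the cost of a segmentation step.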
If your strings are ASCII then you don't have to worry about that, but then you can also just treat them as byte strings anyway:
#[test]
fn test_rotate_five_ascii() {
    let mut chars = b"I am the string and you should rotate me! ".clone();
    let res_chars = b"the string and you should rotate me! I am ";
    rotate_by(&mut chars, 5);
    assert_eq!(chars, &res_chars[..]);
}
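On the question of invalid indices: indexing a Vec or slice with [] panics when the index is out of bounds. The get method returns an Option instead, which avoids checking bounds by hand. A brief sketch, with hypothetical helper names char_at and window:

```rust
/// Returns the char at `index`, or None if the index is out of bounds.
fn char_at(chars: &[char], index: usize) -> Option<char> {
    chars.get(index).copied()
}

/// Returns the sub-slice of `len` elements starting at `start`,
/// or None if any part of that range falls outside the slice.
fn window(chars: &[char], start: usize, len: usize) -> Option<&[char]> {
    // checked_add guards against overflow when computing the end index.
    let end = start.checked_add(len)?;
    chars.get(start..end)
}
```

Both helpers work on a &[char], so a &Vec<char> can be passed directly thanks to deref coercion.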

How to find the number of times that a substring occurs in a given string (include jointed)?

By jointed I mean:
let substring = "CNC";
And the string:
let s = "CNCNC";
In my version "jointed" would mean that there are 2 such substrings present.
What is the best way of doing that in Rust? I can think of a few but then it's basically ugly C.
I have something like that:
fn find_a_string(s: &String, sub_string: &String) -> u32 {
    s.matches(sub_string).count() as u32
}
But that returns 1, because matches() finds only disjointed substrings.
What's the best way to do that in Rust?
Probably there is a better algorithm. Here I just move a window with the size of the sub-string we are looking for over the input string and compare if that window is the same as the substring.
fn main() {
    let string = "aaaa";
    let substring = "aa";
    let substrings = string
        .as_bytes()
        .windows(substring.len())
        .filter(|&w| w == substring.as_bytes())
        .count();
    println!("{}", substrings);
}
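A similar count can be sketched directly on string slices with str::find, restarting the search one character after the start of each match so that overlapping occurrences are found. The name count_overlapping is hypothetical, and treating an empty needle as zero matches is an assumption.

```rust
/// Counts occurrences of `needle` in `haystack`, including overlapping ones.
fn count_overlapping(haystack: &str, needle: &str) -> usize {
    if needle.is_empty() {
        return 0;
    }
    let mut count = 0;
    let mut start = 0;
    while let Some(pos) = haystack[start..].find(needle) {
        count += 1;
        let match_start = start + pos;
        // Step past the first char of this match, staying on a
        // UTF-8 char boundary so the next slice is valid.
        let step = haystack[match_start..]
            .chars()
            .next()
            .map_or(1, char::len_utf8);
        start = match_start + step;
    }
    count
}
```

For the example from the question, count_overlapping("CNCNC", "CNC") gives 2, and for the strings used above, count_overlapping("aaaa", "aa") gives 3.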
The approach of iterating over all windows is perfectly serviceable when your needle/haystack is small. And indeed, it might even be the preferred solution for small needles/haystacks, since a theoretically optimal solution is a fair bit more complicated. But it can get quite a bit slower as the lengths grow.
While Aho-Corasick is more well known for its support for searching multiple patterns simultaneously, it can be used with a single pattern to find overlapping matches in linear time. (In this case, it looks a lot like Knuth-Morris-Pratt.)
The aho-corasick crate can do this:
use aho_corasick::AhoCorasick;
fn main() {
    let haystack = "CNCNC";
    let needle = "CNC";
    // Written against aho-corasick 0.7. In 1.x, `new` returns a Result,
    // so the error must be handled (for example with `.unwrap()`).
    let matcher = AhoCorasick::new(&[needle]);
    for m in matcher.find_overlapping_iter(haystack) {
        let (s, e) = (m.start(), m.end());
        println!("({:?}, {:?}): {:?}", s, e, &haystack[s..e]);
    }
}
Output:
(0, 3): "CNC"
(2, 5): "CNC"
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ab6c547b1700bbbc4a29a99adcaceabe

Rust inputting 3 integers in one line [duplicate]

To learn Rust, I'm looking at things like the HackerRank 30 days challenge, Project Euler, and other programming contests. My first obstacle is to read multiple integers from a single line of stdin.
In C++ I can conveniently say:
cin >> n >> m;
How do I do this idiomatically in Rust?
The best way, as far as I know, is just to split the input line and then map those to integers, like this:
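use std::io;

fn main() {
    let mut line = String::new();
    io::stdin().read_line(&mut line).expect("Failed to read line");
    // split_whitespace also discards the trailing newline that read_line keeps;
    // split(" ") would leave it on the last token and make parse fail.
    let inputs: Vec<u32> = line
        .split_whitespace()
        .map(|x| x.parse().expect("Not an integer!"))
        .collect();
    // inputs is a Vec<u32> of the inputs.
}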
Be aware that this will panic! if the input is invalid; you should instead handle the result values properly if you wish to avoid this.
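One way to handle the result values instead of panicking, sketched with a hypothetical helper parse_ints: collect the per-token parse results into a single Result, so the first invalid token is returned as an error.

```rust
use std::num::ParseIntError;

/// Parses whitespace-separated unsigned integers from one line.
/// Returns the first parse error instead of panicking.
fn parse_ints(line: &str) -> Result<Vec<u32>, ParseIntError> {
    line.split_whitespace().map(str::parse::<u32>).collect()
}
```

The caller can then match on the Result, for example printing a message and asking for the line again when it is an Err.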
You can use the scan-rules crate (docs), which makes this kind of scanning easy (and has features to make it powerful too).
The following example code uses scan-rules version 0.1.3 (the file can be run directly with cargo-script).
The example program accepts two integers separated by whitespace, on the same line.
// cargo-deps: scan-rules="^0.1"
#[macro_use]
extern crate scan_rules;
fn main() {
    let result = try_readln! {
        (let n: u32, let m: u32) => (n, m)
    };
    match result {
        Ok((n, m)) => println!("I read n={}, m={}", n, m),
        Err(e) => println!("Failed to parse input: {}", e),
    }
}
Test runs:
4 5
I read n=4, m=5
5 a
Failed to parse input: scan error: syntax error: expected integer, at offset: 2

How to split a String in Rust that I take as an input with read!()

I want to split a String that I give as an input according to white spaces in it.
I have used the split_whitespace() function, but when I use it on custom input it gives me only the first string slice.
let s: String = read!();
let mut i: usize = 0;
for token in s.split_whitespace() {
    println!("token {} {}", i, token);
    i += 1;
}
What am I missing?
As far as I know, read! is not a standard macro. A quick search suggests that it is probably from the text_io crate (if you are using external crates, you should say so in the question).
From the docs in that crate:
The read!() macro will always read until the next ascii whitespace character (\n, \r, \t or space).
So what you are seeing is by design.
If you want to read a whole line from stdin, you can try the standard function std::io::Stdin::read_line.
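A minimal sketch of that approach: read one full line, then split it on whitespace. The helper name read_tokens is hypothetical, and it takes any BufRead so that it can be exercised without a terminal.

```rust
use std::io::{self, BufRead};

/// Reads a single line from `reader` and returns its
/// whitespace-separated tokens as owned strings.
fn read_tokens<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut line = String::new();
    // read_line stops after the first newline and keeps it in `line`.
    reader.read_line(&mut line)?;
    Ok(line.split_whitespace().map(str::to_owned).collect())
}
```

In a program reading from the terminal, this would be called as read_tokens(io::stdin().lock()).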
You are missing test cases which could locate the source of the problem. Split the code into a function and replace the read!()-macro with a test case, which you could put in main for now, where you provide different strings to the function and observe the output.
fn strspilit(s: String) {
    let mut i: usize = 0;
    for token in s.split_whitespace() {
        println!("token {} {}", i, token);
        i += 1;
    }
}
fn main() {
    println!("Hello, world!");
    strspilit("Hello Huge World".to_string());
}
Then you will see that your code works as it should. But, as noted in the other answer, the read!() macro only returns the string up to the first whitespace, so you should probably read your input another way.
