What is the easiest way to determine if a character is in Unicode range, in Rust? - unicode-string

I'm looking for easiest way to determine if a character in Rust is between two Unicode values.
For example, I want to know if a character s is between [#x1-#x8] or [#x10FFFE-#x10FFFF]. Is there a function that does this already?

The simplest way for me to match a character was this
fn match_char(data: &char) -> bool {
    match *data {
        // `..=` is the inclusive range pattern; the older `...` form is no longer accepted
        '\x01'..='\x08' |
        '\u{10FFFE}'..='\u{10FFFF}' => true,
        _ => false,
    }
}
Pattern matching a character was the easiest route for me, compared to a bunch of if statements. It might not be the most performant solution, but it served me very well.

The simplest way, assuming that they are not Unicode categories (in which case you should use the relevant char methods, such as char::is_alphabetic), is to use the regular comparison operators:
(s >= '\x01' && s <= '\x08') || s == '\u{10FFFE}' || s == '\u{10FFFF}'
(In case you weren't aware of the literal forms of these things: \xXX takes exactly two hexadecimal digits and, in a char literal, only goes up to \x7F; \u{X...} takes one to six hexadecimal digits and covers any Unicode scalar value. Conversions work too, e.g. char::from_u32(0x10FFFE), which returns an Option<char> because not every u32 is a valid char; only a u8 can be cast directly with as. They are just less easily readable.)
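For comparison, here is a minimal sketch of two other ways to write the same check using only the standard library: RangeInclusive::contains and the matches! macro. The function names in_ranges and in_ranges_matches are made up for illustration.

```rust
/// Returns true if `c` is in U+0001..=U+0008 or U+10FFFE..=U+10FFFF,
/// using `RangeInclusive::contains`.
/// (The function name is hypothetical, chosen for this sketch.)
fn in_ranges(c: char) -> bool {
    ('\u{1}'..='\u{8}').contains(&c) || ('\u{10FFFE}'..='\u{10FFFF}').contains(&c)
}

/// The same check written with the `matches!` macro and range patterns.
fn in_ranges_matches(c: char) -> bool {
    matches!(c, '\u{1}'..='\u{8}' | '\u{10FFFE}'..='\u{10FFFF}')
}
```

Both functions should compile to the same handful of comparisons as the hand-written operators, so the choice is mostly about readability.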

Related

How to check if a string is alphanumeric?

I can use occursin function, but its haystack argument cannot be a regular expression, which means I have to pass the entire alphanumeric string to it. Is there a neat way of doing this in Julia?
I'm not sure your assumption about occursin is correct:
julia> occursin(r"[a-zA-Z]", "ABC123")
true
julia> occursin(r"[a-zA-Z]", "123")
false
but its haystack argument cannot be a regular expression, which means I have to pass the entire alphanumeric string to it.
If you mean its needle argument, it can be a Regex, e.g.:
julia> occursin(r"^[[:alnum:]]*$", "adf24asg24y")
true
julia> occursin(r"^[[:alnum:]]*$", "adf24asg2_4y")
false
This checks that the given haystack string is alphanumeric using the Unicode-aware character class [[:alnum:]], which you can think of as equivalent to [a-zA-Z\d], extended to non-English characters too. (As always with Unicode, a "perfect" solution involves more work and complication, but this takes you most of the way.)
If you do mean you want the haystack argument to be a Regex, it's not clear why you'd want that here, and also why "I have to pass the entire alphanumeric string to it" is a bad thing.
As has been noted, you can indeed use regexes with occursin, and it works well. But you can also roll your own version, quite simply:
isalphanumeric(c::AbstractChar) = isletter(c) || ('0' <= c <= '9')
isalphanumeric(str::AbstractString) = all(isalphanumeric, str)
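
For readers coming from the Rust questions on this page, the same "roll your own" idea can be sketched in Rust. This is only an analogue of the Julia functions above, not a translation of them; the function names are hypothetical, and note that char::is_alphanumeric accepts any Unicode letter or number, which is broader than '0' through '9'.

```rust
/// Returns true if every character of `s` is alphanumeric in the Unicode sense.
/// An empty string yields true, like the `^[[:alnum:]]*$` regex.
/// (Hypothetical helper name, for illustration only.)
fn is_alphanumeric_str(s: &str) -> bool {
    s.chars().all(char::is_alphanumeric)
}

/// ASCII-only variant: accepts only a-z, A-Z and 0-9.
fn is_ascii_alphanumeric_str(s: &str) -> bool {
    s.chars().all(|c| c.is_ascii_alphanumeric())
}
```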

Hash function to see if one string is scrambled form/permutation of another?

I want to check if string A is just a reordered version of string B. For example, "abc" = "bca" = "cab"...
There are other solutions here: https://www.geeksforgeeks.org/check-if-two-strings-are-permutation-of-each-other/
However, I was thinking a hash function would be an easy way of doing this, but the typical hash function takes order into consideration. Are there any hash functions that do not care about character order?
Are there any hash functions that do not care about character order?
I don't know of real-world hash functions that have this property, no. Because this is not a problem they are designed to solve.
However, in this specific case, you can make your own "hash" function (a very very bad one) that will indeed ignore order: just sum ASCII codes of characters. This works due to the commutative property of addition (a + b == b + a)
def isAnagram(self, a, b):
    sum_a = 0
    sum_b = 0
    for c in a:
        sum_a += ord(c)
    for c in b:
        sum_b += ord(c)
    return sum_a == sum_b
To reiterate, this is absolutely a hack. Equal sums do not prove that two strings are permutations of each other (for example, "ad" and "bc" both sum to 197), so it can only pass when the judge's inputs (lowercase ASCII characters, no spaces) happen to avoid such collisions. It will not work reliably on arbitrary strings.
For a fast check you could use a kind of hash function.
Candidates are:
XOR all characters of a String
add all characters of a String
multiply all characters of a String (be careful, this might overflow for long Strings)
If the hash-value is equal, it could still be a collision of two not 'equal' strings. So you still need to make a dedicated compare. (e.g. sort the characters of each string before comparing them).
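
As a rough sketch of that two-step idea in Rust: an order-independent fingerprint is used as a cheap pre-check, followed by a dedicated comparison. Here the dedicated comparison counts characters instead of sorting, which is a swapped-in technique rather than what the answer describes, and all function names are hypothetical.

```rust
use std::collections::HashMap;

/// Order-independent fingerprint: wrapping sum of the characters' scalar values.
/// Collisions are expected, so equal fingerprints prove nothing on their own.
fn order_free_hash(s: &str) -> u64 {
    s.chars()
        .fold(0u64, |acc, c| acc.wrapping_add(u64::from(u32::from(c))))
}

/// Counts how many times each character occurs in `s`.
fn char_counts(s: &str) -> HashMap<char, usize> {
    let mut counts = HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
}

/// True if `a` is a reordering of `b`: cheap fingerprint check first,
/// then the exact comparison of character counts.
fn is_permutation(a: &str, b: &str) -> bool {
    order_free_hash(a) == order_free_hash(b) && char_counts(a) == char_counts(b)
}
```

The fingerprint only rejects obvious mismatches early; the character counts decide the final answer, which is why "ad" and "bc" are correctly reported as different despite having equal sums.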

What is an efficient way to compare strings while ignoring case?

To compare two Strings, ignoring case, it looks like I first need to convert to a lower case version of the string:
let a_lower = a.to_lowercase();
let b_lower = b.to_lowercase();
a_lower.cmp(&b_lower)
Is there a method that compares strings, ignoring case, without creating the temporary lower case strings, i.e. that iterates over the characters, performs the to-lowercase conversion there and compares the result?
There is no built-in method, but you can write one to do exactly as you described, assuming you only care about ASCII input.
use itertools::{EitherOrBoth::*, Itertools as _}; // 0.9.0
use std::cmp::Ordering;
fn cmp_ignore_case_ascii(a: &str, b: &str) -> Ordering {
    a.bytes()
        .zip_longest(b.bytes())
        .map(|ab| match ab {
            Left(_) => Ordering::Greater,
            Right(_) => Ordering::Less,
            Both(a, b) => a.to_ascii_lowercase().cmp(&b.to_ascii_lowercase()),
        })
        .find(|&ordering| ordering != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}
As some comments below have pointed out, case-insensitive comparison is not going to work properly for UTF-8, without operating on the whole string, and even then there are multiple representations of some case conversions, which could give unexpected results.
With those caveats, the following will work for a lot of extra cases compared with the ASCII version above (e.g. most accented Latin characters) and may be satisfactory, depending on your requirements:
fn cmp_ignore_case_utf8(a: &str, b: &str) -> Ordering {
    a.chars()
        .flat_map(char::to_lowercase)
        .zip_longest(b.chars().flat_map(char::to_lowercase))
        .map(|ab| match ab {
            Left(_) => Ordering::Greater,
            Right(_) => Ordering::Less,
            Both(a, b) => a.cmp(&b),
        })
        .find(|&ordering| ordering != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}
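If you would rather avoid the itertools dependency, a similar comparison can be sketched with the standard library alone, since Iterator::cmp already compares two iterators lexicographically and treats the shorter one as Less when it is a prefix of the other. The function name cmp_ignore_case_std is made up for this sketch, and the same Unicode caveats apply.

```rust
use std::cmp::Ordering;

/// Compares two strings character by character after per-char lowercasing,
/// without allocating temporary lowercase strings.
/// (Hypothetical name; same per-character limitations as discussed above.)
fn cmp_ignore_case_std(a: &str, b: &str) -> Ordering {
    a.chars()
        .flat_map(char::to_lowercase)
        .cmp(b.chars().flat_map(char::to_lowercase))
}
```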
If you are only working with ASCII, you can use eq_ignore_ascii_case:
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
Unicode
The best way to support Unicode is to use to_lowercase() or to_uppercase().
This is because Unicode has many caveats and these functions handle most situations. Some locale-specific strings are not handled correctly.
let left = "first".to_string();
let right = "FiRsT".to_string();
assert!(left.to_lowercase() == right.to_lowercase());
Efficiency
It is possible to iterate and return on the first non-equal character, so in essence you only convert one character at a time. However, iterating with the chars function does not account for every situation Unicode can throw at us.
See the answer by Peter Hall for details on this.
ASCII
If you are only using ASCII, the most efficient option is eq_ignore_ascii_case (as per Ibraheem Ahmed's answer). It does not allocate or copy temporaries.
This is only good if your code controls at least one side of the comparison and you are certain that it will only include ASCII.
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
Locale
Rust's case functions are best effort regarding locales and do not handle all locales. To support proper internationalisation, you will need to look for other crates that do this.

Convert a char to upper case

I have a variable which contains a single char. I want to convert this char to upper case. However, the to_uppercase function returns a rustc_unicode::char::ToUppercase struct instead of a char.
Explanation
ToUppercase is an Iterator that may yield more than one char. This is necessary because the uppercase version of some Unicode characters consists of multiple "Unicode Scalar Values" (each of which is what a Rust char represents).
A nice example is the so-called ligatures. Try this, for example:
let fi_upper: Vec<_> = 'ﬁ'.to_uppercase().collect();
println!("{:?}", fi_upper); // prints: ['F', 'I']
The 'ﬁ' ligature (U+FB01) is a single character whose uppercase version consists of two letters/characters.
Solution
There are multiple possibilities how to deal with that:
Work on &str: if your data is actually in string form, use str::to_uppercase which returns a String which is easier to work with.
Use ASCII methods: if you are sure that your data is ASCII only and/or you don't care about Unicode symbols, you can use char::to_ascii_uppercase (formerly provided by the now-deprecated std::ascii::AsciiExt trait), which returns just a char. But it only changes the letters 'a' to 'z' and ignores all other characters!
Deal with it manually: Collect into a String or Vec like in the example above.
ToUppercase is an iterator because the uppercase version of the character may be composed of several code points, as delnan pointed out in the comments. You can convert that to a Vec of characters:
c.to_uppercase().collect::<Vec<_>>();
Then, you should collect those characters into a string, as ker pointed out.
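
A minimal sketch of the two options discussed above: collecting the iterator straight into a String, and the ASCII-only method that returns a plain char. The helper names upper_string and upper_ascii are made up for illustration.

```rust
/// Collects the (possibly multi-char) uppercase form of `c` into a String.
/// (Hypothetical helper name.)
fn upper_string(c: char) -> String {
    c.to_uppercase().collect()
}

/// ASCII-only uppercase: changes 'a'..='z' and returns every other char unchanged.
fn upper_ascii(c: char) -> char {
    c.to_ascii_uppercase()
}
```

With these, upper_string('\u{FB01}') produces the two-letter string "FI", while upper_ascii leaves a non-ASCII character such as 'ß' untouched.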

How do you make a function detect whether a string is binary safe or not

How does one detect if a string is binary safe or not in Go?
A function like:
IsBinarySafe(str) //returns true if its safe and false if its not.
Everything after this is just what I have thought about or tried in order to solve this:
I assumed that there must exist a library that already does this but had a tough time finding it. If there isn't one, how do you implement this?
I was thinking of some solution but wasn't really convinced they were good solutions.
One of them was to iterate over the bytes, and have a hash map of all the illegal byte sequences.
I also thought of maybe writing a regex with all the illegal strings but wasn't sure if that was a good solution.
I also was not sure if a sequence of bytes from other languages counted as binary safe. Say the typical golang example:
世界
Would:
IsBinarySafe(世界) //true or false?
Would it return true or false? I was assuming that all binary safe strings should only use 1 byte per character. So iterating over it in the following way:
const nihongo = "日本語abc日本語"
for i, w := 0, 0; i < len(nihongo); i += w {
	runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
	fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
	w = width
}
and returning false whenever the width was greater than 1. These are just some ideas I had in case there wasn't already a library for something like this, but I wasn't sure.
Binary safety has nothing to do with how wide a character is, it's mainly to check for non-printable characters more or less, like null bytes and such.
From Wikipedia:
Binary-safe is a computer programming term mainly used in connection
with string manipulating functions. A binary-safe function is
essentially one that treats its input as a raw stream of data without
any specific format. It should thus work with all 256 possible values
that a character can take (assuming 8-bit characters).
I'm not sure what your goal is, almost all languages handle utf8/16 just fine now, however for your specific question there's a rather simple solution:
package main

import (
	"fmt"
	"unicode"
)

// checks if s is ascii and printable, aka doesn't include tab, backspace, etc.
func IsAsciiPrintable(s string) bool {
	for _, r := range s {
		if r > unicode.MaxASCII || !unicode.IsPrint(r) {
			return false
		}
	}
	return true
}

func main() {
	s := "世界" // sample input; an assumption, since the original snippet left s undefined
	fmt.Printf("len([]rune(s)) = %d, len([]byte(s)) = %d\n", len([]rune(s)), len([]byte(s)))
	fmt.Println(IsAsciiPrintable(s), IsAsciiPrintable("test"))
}
From unicode.IsPrint:
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
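
For comparison with the Rust material elsewhere on this page, here is a rough Rust sketch of the same "ASCII and printable" check. It assumes "printable" means the ASCII space plus the graphic range U+0021 to U+007E, which is what the Go version above accepts; the function name is hypothetical.

```rust
/// Returns true if every character of `s` is a printable ASCII character:
/// either the space U+0020 or an ASCII graphic character (U+0021..=U+007E).
/// Tabs, newlines, other control characters and all non-ASCII input are rejected.
/// (Hypothetical helper name, for illustration only.)
fn is_ascii_printable(s: &str) -> bool {
    s.chars().all(|c| c == ' ' || c.is_ascii_graphic())
}
```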
