Generating strings and identifying substrings is very slow

I'd like to benchmark certain operations in Rust, but I seem to be having some trouble:
fn main() {
    let needle = (0..100).map(|_| "b").collect::<String>();
    let haystack = (0..100_000).map(|_| "a").collect::<String>();
    println!("Data ready.");
    for _ in 0..1_000_000 {
        if haystack.contains(&needle) {
            // Stuff...
        }
    }
}
The above takes a very long time to complete while the same operation in Ruby finishes in around 4.5 seconds:
needle = 'b' * 100
haystack = 'a' * 100_000
puts 'Data ready.'
1_000_000.times do
haystack.include? needle
end
I can't help but think that I'm doing something fundamentally wrong.
What would be the proper way to do this in Rust?
rustc 1.0.0 (a59de37e9 2015-05-13) (built 2015-05-14)
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]

A fix for this issue was merged today. It should be part of the next nightly and is expected to be released in Rust 1.3. The fix revived the Two-way substring search implementation that Rust used to have and adapted it to the new Pattern API in the standard library.
The Two-way algorithm is a good match for Rust's libcore since it is a linear time substring search algorithm that uses O(1) space and needs no dynamic allocation.
This particular implementation contains a simple addition that rejects the query in the question extremely quickly (and no, it was not written because of this question; it was part of the old code too).
During setup the searcher computes a kind of fingerprint for the needle: For each byte in the needle, take its low 6 bits, which is a number 0-63, then set the corresponding bit in the u64 variable byteset.
let byteset = needle.iter().fold(0, |a, &b| (1 << ((b & 0x3f) as usize)) | a);
Since the needle only contains 'b's, the value of byteset will have only the 34th bit set (98 & 63 == 34).
Now we can test whether any given byte can possibly be part of the needle. If its corresponding bit isn't set in byteset, the needle cannot match. Each byte we test in the haystack in this case will be 'a' (97 & 63 == 33), and it can't match. So the algorithm will read a single byte, reject it, and then skip ahead by the needle's length.
fn byteset_contains(&self, byte: u8) -> bool {
    (self.byteset >> ((byte & 0x3f) as usize)) & 1 != 0
}

// Quickly skip by large portions unrelated to our substring
if !self.byteset_contains(haystack[self.position + needle.len() - 1]) {
    self.position += needle.len();
    continue 'search;
}
From libcore/str/pattern.rs in rust-lang/rust
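The fingerprint idea described above can be tried in isolation. The following is a minimal standalone sketch, not the standard library's code: the free functions `byteset_of` and `byteset_contains` are hypothetical names introduced here for illustration. It also shows that the test is only a filter: a byte that shares its low 6 bits with a needle byte (here `"`, which is 34) passes even though it is not in the needle.

```rust
/// Builds the 64-bit fingerprint: one bit per distinct value of `byte & 0x3f`.
/// (Hypothetical helper, modelled on the fold shown in the answer.)
fn byteset_of(needle: &[u8]) -> u64 {
    needle.iter().fold(0u64, |acc, &b| acc | (1u64 << (b & 0x3f)))
}

/// Returns false only if `byte` definitely does not occur in the needle.
/// A `true` result may be a false positive.
fn byteset_contains(byteset: u64, byte: u8) -> bool {
    (byteset >> (byte & 0x3f)) & 1 != 0
}

fn main() {
    let needle = "b".repeat(100);
    let byteset = byteset_of(needle.as_bytes());

    // Only one bit is set, at position 98 & 63.
    println!("{}", byteset.trailing_zeros()); // 34

    println!("{}", byteset_contains(byteset, b'a')); // false: rejected at once
    println!("{}", byteset_contains(byteset, b'b')); // true
    println!("{}", byteset_contains(byteset, b'"')); // true: 34 & 63 == 34, a false positive
}
```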

Related

How do I convert an integer to binary in Rust, such that I can iterate over the digits?

I am trying to convert a decimal i32 in Rust to a binary representation I can manipulate. I understand from this question that I can print and declare i32 values to and from binary format, but I'm trying to implement binary division using logical operators (as opposed to just using the standard i32 math operators; this is for a LeetCode question). I'm not familiar enough with Rust syntax to understand how I can do this with the Binary trait that i32 implements, or whether there is a dedicated binary type or iterator that I should be using.
While writing this question, I was able to figure out a workable solution by using the String type like so:
fn main() {
    let a : i32 = 5;
    let bin_a = String::from(format!("{a:b}"));
    let bin_s = String::from("101");
    assert_eq!(bin_a, bin_s);
}
But this feels clunky and like there's probably a more direct way of iterating over binary representations of numbers, so I'll submit this question in hopes someone more knowledgeable can contribute; thanks in advance for any help.
Basic operations:
(x >> n) & 1 gives you the nth bit,
x & !(1 << n) clears the nth bit,
and x | (1 << n) sets the nth bit.
So iterating over the digits is as simple as (0..32).map(|n| (x >> n) & 1) from least to most significant, or (0..32).rev().map(|n| (x >> n) & 1) from most to least significant.
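As a small sketch of the iteration described above, the function below collects the bits of an i32 from most to least significant and drops leading zeros, so that 5 yields the same digits as the "101" string in the question. The name `bits_msb_first` is made up for this example.

```rust
/// Collects the bits of `x`, most significant first, without leading zeros.
/// Zero is returned as a single 0 bit.
fn bits_msb_first(x: i32) -> Vec<u8> {
    let bits: Vec<u8> = (0..32)
        .rev()
        .map(|n| ((x >> n) & 1) as u8) // the nth bit, as in the answer
        .skip_while(|&bit| bit == 0)   // drop leading zeros
        .collect();
    if bits.is_empty() { vec![0] } else { bits }
}

fn main() {
    println!("{:?}", bits_msb_first(5)); // [1, 0, 1]
}
```

Negative values keep all 32 bits, because the sign bit of a two's complement i32 is set.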

What does the int value returned from compareTo function in Kotlin really mean?

In the documentation of compareTo function, I read:
Returns zero if this object is equal to the specified other object, a
negative number if it's less than other, or a positive number if it's
greater than other.
What does "less than" or "greater than" mean in the context of strings? For example, is Hello World less than the single character a?
val epicString = "Hello World"
println(epicString.compareTo("a")) //-25
Why -25 and not -10 or -1 (for example)?
Other examples:
val epicString = "Hello World"
println(epicString.compareTo("HelloWorld")) //-55
Is Hello World less than HelloWorld? Why?
Why does it return -55 and not -1, -2, -3, etc?
val epicString = "Hello World"
println(epicString.compareTo("Hello  World")) //55
Is Hello World greater than Hello  World (with two spaces)? Why?
Why does it return 55 and not 1, 2, 3, etc?
I believe you're asking about the implementation of the compareTo method for java.lang.String. Here is the source code for Java 11:
public int compareTo(String anotherString) {
    byte v1[] = value;
    byte v2[] = anotherString.value;
    if (coder() == anotherString.coder()) {
        return isLatin1() ? StringLatin1.compareTo(v1, v2)
                          : StringUTF16.compareTo(v1, v2);
    }
    return isLatin1() ? StringLatin1.compareToUTF16(v1, v2)
                      : StringUTF16.compareToLatin1(v1, v2);
}
This delegates to either StringLatin1 or StringUTF16, so we need to look further.
Fortunately, StringLatin1 and StringUTF16 have similar implementations when it comes to comparison.
Here is the implementation for StringLatin1, for example:
public static int compareTo(byte[] value, byte[] other) {
    int len1 = value.length;
    int len2 = other.length;
    return compareTo(value, other, len1, len2);
}
public static int compareTo(byte[] value, byte[] other, int len1, int len2) {
    int lim = Math.min(len1, len2);
    for (int k = 0; k < lim; k++) {
        if (value[k] != other[k]) {
            return getChar(value, k) - getChar(other, k);
        }
    }
    return len1 - len2;
}
As you can see, it iterates over the characters of the shorter string, and if the characters at the same index of the two strings differ, it returns the difference between them. If it finds no difference during the iteration (one string is a prefix of the other), it falls back to comparing the lengths of the two strings.
In your case, there is a difference in the first iteration already...
So it's the same as "H".compareTo("a"), which gives -25.
The code of "H" is 72
The code of "a" is 97
So, 72 - 97 = -25
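To check the arithmetic for the other examples, here is a rough port of the same loop to Rust. It is only a sketch under the assumption that both strings are ASCII (so one byte is one character), and `java_style_compare` is a hypothetical name, not a real API.

```rust
/// Mimics the Latin-1 comparison loop shown above: the difference of the first
/// mismatching characters, or the difference in lengths if one is a prefix.
/// Assumes ASCII input, so bytes and characters coincide.
fn java_style_compare(a: &str, b: &str) -> i32 {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    for (&x, &y) in a.iter().zip(b.iter()) {
        if x != y {
            return x as i32 - y as i32;
        }
    }
    a.len() as i32 - b.len() as i32
}

fn main() {
    // 'H' (72) - 'a' (97)
    println!("{}", java_style_compare("Hello World", "a")); // -25
    // ' ' (32) - 'W' (87) at index 5
    println!("{}", java_style_compare("Hello World", "HelloWorld")); // -55
    // No mismatch within the shorter string: 5 - 11
    println!("{}", java_style_compare("Hello", "Hello World")); // -6
}
```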
Short answer: The exact value doesn't have any meaning; only its sign does.
As the specification for compareTo() says, it returns a -ve number if the receiver is smaller than the other object, a +ve number if the receiver is larger, or 0 if the two are considered equal (for the purposes of this ordering).
The specification doesn't distinguish between different -ve numbers, nor between different +ve numbers — and so neither should you.  Some classes always return -1, 0, and 1, while others return different numbers, but that's just an implementation detail — and implementations vary.
Let's look at a very simple hypothetical example:
class Length(val metres: Int) : Comparable<Length> {
    override fun compareTo(other: Length)
        = metres - other.metres
}
This class has a single numerical property, so we can use that property to compare them.  One common way to do the comparison is simply to subtract the two lengths: that gives a number which is positive if the receiver is larger, negative if it's smaller, and zero if they're the same length, which is just what we need.
In this case, the value of compareTo() would happen to be the signed difference between the two lengths.
However, that method has a subtle bug: the subtraction could overflow, and give the wrong results if the difference is bigger than Int.MAX_VALUE.  (Obviously, to hit that you'd need to be working with astronomical distances, both positive and negative — but that's not implausible.  Rocket scientists write programs too!)
To fix it, you might change it to something like:
class Length(val metres: Int) : Comparable<Length> {
    override fun compareTo(other: Length) = when {
        metres > other.metres -> 1
        metres < other.metres -> -1
        else -> 0
    }
}
That fixes the bug; it works for all possible lengths.
But notice that the actual return value has changed in most cases: now it only ever returns -1, 0, or 1, and no longer gives an indication of the actual difference in lengths.
If this was your class, then it would be safe to make this change because it still matches the specification.  Anyone who just looked at the sign of the result would see no change (apart from the bug fix).  Anyone using the exact value would find that their programs were now broken — but that's their own fault, because they shouldn't have been relying on that, because it was undocumented behaviour.
Exactly the same applies to the String class and its implementation.  While it might be interesting to poke around inside it and look at how it's written, the code you write should never rely on that sort of detail.  (It could change in a future version.  Or someone could apply your code to another object which didn't behave the same way.  Or you might want to expand your project to be cross-platform, and discover the hard way that the JavaScript implementation didn't behave exactly the same as the Java one.)
In the long run, life is much simpler if you don't assume anything more than the specification promises!
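The overflow bug in the subtraction-based comparison described above is easy to reproduce. The sketch below uses Rust rather than Kotlin, with `wrapping_sub` standing in for the JVM's silently wrapping Int subtraction; both function names are invented for the example.

```rust
use std::cmp::Ordering;

/// Subtraction-based comparison with wrapping arithmetic, as on the JVM.
fn compare_by_subtraction(a: i32, b: i32) -> i32 {
    a.wrapping_sub(b)
}

/// Sign-only comparison that cannot overflow.
fn compare_safely(a: i32, b: i32) -> Ordering {
    a.cmp(&b)
}

fn main() {
    let (a, b) = (i32::MAX, -1);
    // a is clearly larger than b, yet the wrapped difference is negative.
    println!("{}", compare_by_subtraction(a, b) < 0); // true
    println!("{:?}", compare_safely(a, b)); // Greater
}
```

Returning an ordering instead of a number also removes the temptation to read meaning into the magnitude.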

What is the simplest way to split a string into a list of characters?

This appears to be covered by the Str module in the API documentation, but according to this issue that is an oversight.
This is perhaps the simplest, though certainly not the most efficient:
let split = s =>
  s |> Js.String.split("")
    |> Array.to_list
    |> List.map(s => s.[0])
This is more efficient, and cross-platform:
let split = s => {
  let rec aux = (acc, i) =>
    if (i >= 0) {
      aux([s.[i], ...acc], i - 1)
    } else {
      acc
    }
  aux([], String.length(s) - 1)
}
I don't think it usually makes much sense to convert a string to a list though, since the conversion will have significant overhead regardless of method and it'd be better to just iterate the string directly. If it does make sense it's probably when the strings are small enough that the difference between the first and second method matters little.

Panicked at 'attempt to subtract with overflow' when cycling backwards through a list

I am writing a cycle method for a list that moves an index either forwards or backwards. The following code is used to cycle backwards:
(i-1)%list_length
In this case, i is of the type usize, meaning it is unsigned. If i is equal to 0, this leads to an 'attempt to subtract with overflow' error. I tried to use the correct casting methods to work around this problem:
(((i as isize)-1)%(list_length as isize)) as usize
This results in an integer overflow.
I understand why the errors happen, and at the moment I've solved the problem by checking if the index is equal to 0, but I was wondering if there was some way to solve it by casting the variables to the correct types.
As DK. points out, you don't want wrapping semantics at the integer level:
fn main() {
    let idx: usize = 0;
    let len = 10;
    let next_idx = idx.wrapping_sub(1) % len;
    println!("{}", next_idx) // Prints 5!!!
}
Instead, you want to use modulo logic to wrap around:
let next_idx = (idx + len - 1) % len;
This only works if idx + len does not overflow the type. That is much easier to see with a u8 instead of usize: just set idx to 200 and len to 250.
If you can't guarantee that the sum of the two values will always be less than the maximum value, I'd probably use the "checked" family of operations. This does the same level of conditional checking you mentioned you already have, but is neatly tied into a single line:
let next_idx = idx.checked_sub(1).unwrap_or(len - 1);
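The options above can be compared side by side. The sketch below also shows why the cast in the question fails: `%` in Rust is a remainder, so -1 % 10 is -1, which becomes a huge number when cast back to usize. One cast-based alternative, not mentioned in the answers, is `rem_euclid`, which always returns a non-negative result. The function names are made up, and both assume `len > 0` and values that fit in an isize.

```rust
/// Previous index using the checked subtraction from the answer.
fn prev_checked(idx: usize, len: usize) -> usize {
    idx.checked_sub(1).unwrap_or(len - 1)
}

/// Previous index via signed casts and a Euclidean remainder.
/// Assumes idx and len both fit in an isize.
fn prev_rem_euclid(idx: usize, len: usize) -> usize {
    (idx as isize - 1).rem_euclid(len as isize) as usize
}

fn main() {
    // `%` keeps the sign of the left operand, which breaks the cast approach.
    println!("{}", -1isize % 10); // -1
    println!("{}", (-1isize).rem_euclid(10)); // 9

    // The u8 case from the answer: idx + len overflows before the modulo.
    println!("{:?}", 200u8.checked_add(250)); // None

    println!("{}", prev_checked(0, 10)); // 9
    println!("{}", prev_rem_euclid(0, 10)); // 9
}
```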
If your code can have overflowing operations, I would suggest using Wrapping. You don't need to worry about casting or overflow panics when you allow it:
use std::num::Wrapping;
let zero = Wrapping(0u32);
let one = Wrapping(1u32);
assert_eq!(u32::MAX, (zero - one).0);

How to apply step when iterating over string characters?

Is there an easy way to apply a step during iteration? I have seen a reference to step_by() in the book but I can't seem to get it to work.
For example, to print every other character of a string I can do this, but is there an easier way?
let s1 = "whhaatt".to_string();
for letter in s1.chars().enumerate() {
    let (i, l) = letter;
    if i % 2 == 0 {
        println!("{:?}", l);
    }
}
The simplest way would be to use the step adaptor from the itertools crate. In this case, you could use s1.chars().step(2).
Aside: Your code does not iterate over "characters"; it iterates over code points. It's quite likely that you want the graphemes method from the unicode-segmentation crate.
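On newer toolchains the extra crate is not needed for this: the standard library's Iterator::step_by has been stable since Rust 1.28. A minimal sketch, with `every_other` as a hypothetical helper name, and still stepping over code points rather than graphemes:

```rust
/// Keeps every second code point of `s`, starting with the first.
fn every_other(s: &str) -> String {
    s.chars().step_by(2).collect()
}

fn main() {
    println!("{}", every_other("whhaatt")); // what
}
```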
