trying to break apart a string in rust - rust

I have this string
abcdef x y z
or this one
"ab cd ef" x y z
I am trying to parse this in rust to
s1 = "abcdef"
arr = ["x","y","z"]
or
s1 = "ab cd ef"
arr = ["x","y","z"]
I tried the following (str is the starting string)
let chars = str.chars().peekable();
let s1:String = if *chars.peek().expect("value isnt empty") == '\"'{
chars.skip(1).take_while(|c| *c!= '\"').collect()
}else{
chars.take_while(|c| *c!= ' ').collect()
};
let remainder_str = chars.collect::<String>();
let remainder = remainder_str.split_whitespace();
let mut arr: Vec<&OsStr> =
remainder.map(|s| OsStr::new(s)).collect();
I.e. create an iterator over the chars and walk it down the string, pulling pieces out as I go.
This doesn't work because the first collect consumes chars. I'm sure I could do it by walking the array of chars with an index and inspecting each one in turn (i.e. brute force), but that doesn't seem like idiomatic Rust.
Can anybody suggest a better way?

Your idea taken to compilation:
fn doit(str: String) -> (String, Vec<String>) {
    let mut chars = str.chars().peekable();
    let s1: String = if *chars.by_ref().peek().expect("value isnt empty") == '\"' {
        chars.by_ref().skip(1).take_while(|c| *c != '\"').collect()
    } else {
        chars.by_ref().take_while(|c| *c != ' ').collect()
    };
    let remainder_str = chars.collect::<String>();
    let remainder = remainder_str.split_whitespace();
    let arr: Vec<String> =
        remainder.map(|s| s.to_string()).collect();
    (s1, arr)
}
by_ref is possibly your friend here.
Note, however, that this method is very likely not what you actually want, as it allocates stuff all over the place (all the Strings, etc.).
You could be better off using only string slices that refer to the original string.
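One way to follow the string-slice suggestion is sketched below. The function name split_command and its Option return type are assumptions made for illustration and are not part of the answer above; the sketch also assumes that a missing closing quote should be reported as None rather than guessed at.

```rust
/// Splits `input` into a leading token (optionally wrapped in double quotes)
/// and the remaining whitespace-separated arguments.
/// Every returned slice borrows from `input`; no new strings are allocated.
/// `split_command` is a hypothetical name used only for this sketch.
fn split_command(input: &str) -> Option<(&str, Vec<&str>)> {
    let input = input.trim_start();
    if input.is_empty() {
        return None;
    }
    let (head, rest) = if let Some(stripped) = input.strip_prefix('"') {
        // Quoted head: take everything up to the closing quote.
        // An unterminated quote yields None via the `?` operator.
        let end = stripped.find('"')?;
        (&stripped[..end], &stripped[end + 1..])
    } else {
        // Unquoted head: take everything up to the first whitespace.
        match input.find(char::is_whitespace) {
            Some(end) => (&input[..end], &input[end..]),
            None => (input, ""),
        }
    };
    Some((head, rest.split_whitespace().collect()))
}
```

Called as split_command("abcdef x y z"), this returns the head "abcdef" together with the slices "x", "y" and "z". The only allocation is the Vec that holds the slices.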


efficiently creating a list of pointers to a character in a buffer using arm neon simd

I've been rewriting some performance-sensitive parts of my code in AArch64 NEON. For some things, like population count, I've managed to get a 12x speedup. But for some algorithms I'm having trouble.
The high-level problem is quickly adding a list of newline-separated strings to a hash set. Assuming the hash set functionality is optimal (I am looking into it next), I first need to scan for the strings in the buffer.
I have tried various techniques, but my intuition tells me that I can create a list of pointers to each newline and then insert the resulting slices into the hash set afterwards.
The fundamental problem is that I can't work out an efficient way to load a vector, compare it against the newline, and emit a list of pointers to the newlines. That is, the output has a variable length, depending on how many newlines were found in the input vector.
Here is my approach:
fn read_file7(mut buffer: Vec<u8>, needle: u8) -> Result<HashSet<Vec<u8>>, Error>
{
let mut set = HashSet::new();
let mut chunk_offset: usize = 0;
let special_finder_big = [
0x80u8, 0x40u8, 0x20u8, 0x10u8, 0x08u8, 0x04u8, 0x02u8, 0x01u8, // high
0x80u8, 0x40u8, 0x20u8, 0x10u8, 0x08u8, 0x04u8, 0x02u8, 0x01u8, // low
];
let mut next_start: usize = 0;
let needle_vector = unsafe { vdupq_n_u8(needle) };
let special_finder_big = unsafe { vld1q_u8(special_finder_big.as_ptr()) };
let mut line_counter = 0;
// we process 16 chars at a time
for chunk in buffer.chunks(16) {
unsafe {
let src = vld1q_u8(chunk.as_ptr());
let out = vceqq_u8(src, needle_vector);
let anded = vandq_u8(out, special_finder_big);
// each of these is a bitset of each matching character
let vadded = vaddv_u8(vget_low_u8(anded));
let vadded2 = vaddv_u8(vget_high_u8(anded));
let list = [vadded2, vadded];
// combine bitsets into one big one!
let mut num = std::mem::transmute::<[u8; 2], u16>(list);
// while our bitset has bits left, find the set bits
while num > 0 {
let mut xor = 0x8000u16; // only set the highest bit
let clz = (num).leading_zeros() as usize;
set.get_or_insert_owned(&buffer[(next_start)..(chunk_offset + clz)]);
// println!("found '{}' at {} | clz is {} ", needle.escape_ascii(), start_offset + clz, clz);
// println!("string is '{}'", input[(next_start)..(start_offset + clz)].escape_ascii());
xor = xor >> clz;
num = num ^ xor;
next_start = chunk_offset + clz + 1;
//println!("new num {:032b}", num);
line_counter += 1;
}
}
chunk_offset += 16;
}
// get the remaining
set.get_or_insert_owned(&buffer[(next_start)..]);
println!(
"line_counter: {} unique elements {}",
line_counter,
set.len()
);
Ok(set)
}
If I unroll this to process 64 bytes at a time, on a big input it is slightly faster than memchr, but not by much.
Any tips would be appreciated.
I've shown this to a colleague who's come up with better intrinsics code than I would. Here's his suggestion, it's not been compiled, so there needs to be some finishing off of pseudo-code pieces etc, but something along the lines of below should be much faster & work:
let mut line_counter = 0;
for chunk in buffer.chunks(32) { // Read 32 bytes at a time (a shorter final chunk needs separate handling)
    unsafe {
        let src1 = vld1q_u8(chunk.as_ptr());
        let src2 = vld1q_u8(chunk.as_ptr().add(16));
        let out1 = vceqq_u8(src1, needle_vector);
        let out2 = vceqq_u8(src2, needle_vector);
        // We slot these next to each other in the same vector.
        // In this case the bottom 64 bits of the vector will tell you
        // if there are any needle values inside the first vector and
        // the top 64 bits tell you if you have any needle values in the
        // second vector.
        let combined = vpmaxq_u8(out1, out2);
        // Now we do another pairwise max, which compresses this information into
        // a single 64-bit value, where the bottom 32 bits tell us about
        // src1 and the top 32 bits about src2.
        let combined = vpmaxq_u8(combined, combined);
        let remapped = vreinterpretq_u64_u8(combined);
        let val = vgetq_lane_u64(remapped, 0);
        if val == 0 { // most chunks won't have a newline
            ... // If val is 0, no match was found in either vector: adjust offset and continue.
        }
        if val & 0xFFFF_FFFF != 0 {
            ... // there must be a match in src1. use below code in a function
        }
        if val & 0xFFFF_FFFF_0000_0000 != 0 {
            ... // there must be a match in src2. use below code in a function
        }
        ...
    }
}
Now that we know which vector to look in, we should find the index within that vector.
As an example, let's assume matchvec is the vector we found above (so either out1 or out2).
To find the first index:
// We create a mask of repeating 0xf00f chunks. When we fill an entire vector
// with it we get a pattern where every byte is 0xf0 or 0x0f. We'll use this
// to find the index of the matches.
let mask = unsafe { vreinterpretq_u8_u16(vdupq_n_u16(0xf00f)) };
// We first clear the bits we don't want, which leaves 4 bits of free space
// in each 8-bit entry, alternating between the high and the low half.
let masked = vandq_u8(matchvec, mask);
// This means that when we do a pairwise addition we are sure that no overflow
// will ever happen. The entries slot next to each other, and a non-zero
// 4-bit entry marks a matching element.
// We've also compressed the values into the lower 64 bits again.
let compressed = vpaddq_u8(masked, masked);
let val = vgetq_lane_u64(vreinterpretq_u64_u8(compressed), 0);
// val now holds one 4-bit entry per input byte. On a little-endian target,
// element 0 lands in the lowest 4 bits, so the first match is found by
// counting trailing zeros. If the transfer to a general-purpose register is
// a bottleneck, it may be better to do the bit counting on the lower 64 bits
// of compressed on the SIMD side and transfer only the result.
let pos = val.trailing_zeros() as usize;
// Each entry is 4 bits wide, so shift pos right by 2 to get the actual index.
let pos = pos >> 2;
pos will now contain the index of the first needle value.
If you were processing out2, remember to add 16 to the result.
To find all the indices, we can run through the bitmask without using clz, which avoids the repeated register file transfers.
// set masked and compressed as above
let masked = vandq_u8(matchvec, mask);
let compressed = vpaddq_u8(masked, masked);
let mut val = vgetq_lane_u64(vreinterpretq_u64_u8(compressed), 0);
let mut idx = current_offset;
while val != 0 {
    if val & 0xf != 0 {
        // entry found at idx.
    }
    idx += 1;
    val >>= 4;
}
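The index extraction described above can be tried out without NEON hardware. The sketch below is a portable scalar stand-in, not the SIMD code itself: nibble_mask builds the same kind of 64-bit value (one 4-bit entry per input byte, element 0 in the lowest bits, as on a little-endian target) that the compare, mask and pairwise-add steps are meant to produce. All three function names are hypothetical and chosen only for this illustration.

```rust
/// Scalar stand-in for the NEON compare + mask + pairwise-add steps:
/// the 4-bit entry `k` (bits 4k..4k+3) is 0xF when `block[k] == needle`.
/// Hypothetical helper, used here only to illustrate the mask layout.
fn nibble_mask(block: &[u8; 16], needle: u8) -> u64 {
    let mut mask = 0u64;
    for (k, &byte) in block.iter().enumerate() {
        if byte == needle {
            mask |= 0xFu64 << (4 * k);
        }
    }
    mask
}

/// Index of the first match within the block, or None if there is none.
/// Each entry is 4 bits wide, hence the shift right by 2.
fn first_match(mask: u64) -> Option<usize> {
    if mask == 0 {
        None
    } else {
        Some((mask.trailing_zeros() >> 2) as usize)
    }
}

/// Indices of all matches, offset by `base` (the block's position in the
/// buffer), found by walking the mask 4 bits at a time.
fn all_matches(mut mask: u64, base: usize) -> Vec<usize> {
    let mut found = Vec::new();
    let mut idx = base;
    while mask != 0 {
        if mask & 0xF != 0 {
            found.push(idx);
        }
        idx += 1;
        mask >>= 4;
    }
    found
}
```

For the 16-byte block b"ab\ncd\n0123456789" and the needle b'\n', first_match reports index 2 and all_matches reports indices 2 and 5 (plus whatever base offset is passed in). The loop stops as soon as no set bits remain, so blocks with few matches cost few iterations.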

Swift - best practice to find the longest string at [String] array

I'm trying to find the most efficient way to get the longest string in a string array. For example:
let array = ["I'm Roi","I'm asking here","Game Of Thrones is just good"]
and the outcome should be "Game Of Thrones is just good".
I've tried using the maxElement function (maxElement()), but it returns the maximum string in alphabetical order.
Any suggestions? Thanks!
Instead of sorting, which is O(n log(n)) for a good sort, use max(by:), which is O(n) on Array, and provide it with a closure that compares string lengths:
Swift 4:
For Swift 4 you can get the string length with the count property on String:
let array = ["I'm Roi","I'm asking here","Game Of Thrones is just good"]
if let max = array.max(by: {$1.count > $0.count}) {
print(max)
}
Swift 3:
Use .characters.count on String to get the string lengths:
let array = ["I'm Roi","I'm asking here","Game Of Thrones is just good"]
if let max = array.max(by: {$1.characters.count > $0.characters.count}) {
print(max)
}
Swift 2:
Use maxElement on Array providing it a closure to compare string lengths:
let array = ["I'm Roi","I'm asking here","Game Of Thrones is just good"]
if let max = array.maxElement({$1.characters.count > $0.characters.count}) {
print(max)
}
Note: maxElement is O(n). A good sort is O(n log(n)), so for large arrays, this will be much faster than sorting.
You can use reduce to do this. It will iterate through your array, keeping track of the current longest string, and then return it when finished.
For example:
let array = ["I'm Roi","I'm asking here","Game Of Thrones is just good"]
if let longestString = array.reduce(Optional<String>.None, combine:{$0?.characters.count > $1.characters.count ? $0:$1}) {
print(longestString) // "Game Of Thrones is just good"
}
(Note that Optional.None is now Optional.none in Swift 3)
This uses a nil starting value to account for the fact that the array could be empty, as pointed out by @JHZ (it will return nil in that case). If you know your array has at least one element, you can simplify it to:
let longestString = array.reduce("") {$0.characters.count > $1.characters.count ? $0:$1}
Because it only iterates through each element once, it will be quicker than using sort(). I did a quick benchmark and sort() appears to be around 20x slower (although there is no point in premature optimisation, I feel it is worth mentioning).
Edit: I recommend you go with @vacawama's solution, as it's even cleaner than reduce!
Here you go:
let array = ["I'm Roi","I'm asking here","Game Of Thrones is just good"]
var sortedArr = array.sort() { $0.characters.count > $1.characters.count }
let longestElement = sortedArr[0]
You can also practice with the use of Generics by creating this function:
func longestString<T:Sequence>(from stringsArray: T) -> String where T.Iterator.Element == String{
return (stringsArray.max {$0.count < $1.count}) ?? ""
}
Explanation: Create a function named longestString. Declare a generic type T that conforms to the Sequence protocol (Sequence is defined here: https://developer.apple.com/documentation/swift/sequence). The function returns a single String (the longest one). The where clause constrains the generic type T to sequences whose elements are of type String.
Inside the function, call max on stringsArray with a closure that compares the lengths of its elements. The result is the longest String (an optional, since it is nil if the array is empty). If the result is nil, the ?? operator substitutes an empty string instead.
Now call it:
let longestA = longestString(from:["Shekinah", "Chesedh", "Agape Sophia"])
If you get the hang of using generics, even if the strings are hidden inside objects, you can make use of the pattern of coding above. You can change the element to objects of the same class (Person for example).
Thus:
class Person {
let name: String
init(name: String){
self.name = name
}
}
func longestName<T:Sequence>(from stringsArray: T) -> String where T.Iterator.Element == Person{
return (stringsArray.max {$0.name.count < $1.name.count})?.name ?? ""
}
Then call the function like this:
let longestB = longestName(from:[Person(name: "Shekinah"), Person(name: "Chesedh"), Person(name: "Agape Sophia")])
You also get to rename your function based on the appropriateness of its use. You can tweak the pattern to return something else, like the object itself, or the length (count) of the String. And finally, becoming familiar with generics may improve your coding ability.
Now, with a little tweak again, you may extend further so that you can compare strings owned by many different types as long as they implement a common protocol.
protocol Nameable {
var name: String {get}
}
This defines a protocol named Nameable that requires conforming types to have a name property of type String. Next, we define two different types that both conform to the protocol.
class Person: Nameable {
let name: String
init(name: String){
self.name = name
}
}
struct Pet: Nameable {
let name: String
}
Then we tweak our generic function so that it requires that the elements must conform to Nameable, vastly different though they are.
func longestName<T:Sequence>(from stringsArray: T) -> String where T.Iterator.Element == Nameable{
return (stringsArray.max {$0.name.count < $1.name.count})?.name ?? ""
}
Let's collect the different objects into an array. Then call our function.
let myFriends: [Nameable] = [Pet(name: "Bailey"), Person(name: "Agape Sophia")]
let longestC = longestName(from: myFriends)
Lastly, now that you know "where" and "Sequence" from above, you may simply extend Sequence:
extension Sequence where Iterator.Element == String {
func topString() -> String {
self.max(by: { $0.count < $1.count }) ?? ""
}
}
Or the protocol type:
extension Sequence where Iterator.Element == Nameable {
func theLongestName() -> Nameable? {
self.max(by: { $0.name.count < $1.name.count })
}
}

reading integers from a string

I want to read a line from a file, initialize an array from that line and then display the integers.
Why is it not reading the five integers in the line? I want to get the output 1 2 3 4 5, but I get 1 1 1 1 1.
open Array;;
open Scanf;;
let print_ints file_name =
let file = open_in file_name in
let s = input_line(file) in
let n = ref 5 in
let arr = Array.init !n (fun i -> if i < !n then sscanf s "%d" (fun a -> a) else 0) in
let i = ref 0 in
while !i < !n do
print_int (Array.get arr !i);
print_string " ";
i := !i + 1;
done;;
print_ints "string_ints.txt";;
My file is just: 1 2 3 4 5
You might want to try the following approach. Split your string into a list of substrings representing numbers. This answer describes one way of doing so. Then use the resulting function in your print_ints function.
let ints_of_string s =
List.map int_of_string (Str.split (Str.regexp " +") s)
let print_ints file_name =
let file = open_in file_name in
let s = input_line file in
let ints = ints_of_string s in
List.iter (fun i -> print_int i; print_char ' ') ints;
close_in file
let _ = print_ints "string_ints.txt"
When compiling, pass str.cma or str.cmxa as an argument (see this answer for details on compilation):
$ ocamlc str.cma print_ints.ml
Another alternative would be to use the Scanf.bscanf function; this question contains an example (use with caution).
The Scanf.sscanf function may not be particularly suitable for this task.
An excerpt from the OCaml manual:
the scanf facility is not intended for heavy duty lexical analysis and parsing. If it appears not expressive enough for your needs, several alternative exists: regular expressions (module Str), stream parsers, ocamllex-generated lexers, ocamlyacc-generated parsers
There is though a way to parse a string of ints using Scanf.sscanf (which I wouldn't recommend):
let rec int_list_of_string s =
try
Scanf.sscanf s
"%d %[0-9-+ ]"
(fun n rest_str -> n :: int_list_of_string rest_str)
with
| End_of_file | Scanf.Scan_failure _ -> []
The trick here is to split the input string s into a part that is parsed into an integer (%d) and the rest of the string, which is matched with the range format %[0-9-+ ]. That range accepts only decimal digits 0-9, the - and + signs, and whitespace.

Swift String.removeRange cannot compile

I don't understand what to do with the issue reported by the compiler. I tried to create a Range, but it says Index is not known:
//let range = matches.first!.range.location
let range = Range(
start:matches.first!.range.location,
end: matches.first!.range.location+matches.first!.range.length
)
id = text[range]
var t = text
t.removeRange(range)
return t
Compiler says: Cannot invoke 'removeRange' with an argument list of type '(Range)' on t.removeRange(range).
I'm pretty sure it's evident, but I lost a great deal of time on such a small issue… any help highly appreciated!
As your error says:
Cannot invoke 'removeRange' with an argument list of type '(Range)'
This means there is a problem with the type of your range instance: the removeRange function only accepts an argument of type Range<String.Index>. Its declaration is:
/// Remove the indicated `subRange` of characters
///
/// Invalidates all indices with respect to `self`.
///
/// Complexity: O(\ `count(self)`\ ).
mutating func removeRange(subRange: Range<String.Index>)
And here is working example with removeRange:
var welcome = "hello there"
let range = advance(welcome.endIndex, -6)..<welcome.endIndex
welcome.removeRange(range)
println(welcome) //hello
Hope this will help.
Swift 2.2 example of removing first 4 characters:
let range = text.startIndex..<text.startIndex.advancedBy(4)
text.removeRange(range)
That first line feels verbose. I hope newer Swift versions improve upon it.
Here is the working equivalent snippet:
static func unitTest() {
let text = "a👿bbbbb🇩🇪c"
let tag = Tag(id: "🇩🇪")
tag.regex = "👿b+"
print ("Unit test tag.foundIn(\(text)) ? = \(tag.foundIn(text))")
}
func foundIn(text: String) -> (id:String, remainingText:String)? {
// if a regex is provided, use it to capture, and keep the capture as a tag ID
if let regex = regex {
let r = Regex(regex) // text =~ regex
let matches = r.matches(text)
if matches.count >= 1 {
let first = matches.first!.range
let start = advance(text.startIndex, first.location)
let end = advance(start, first.length-1)
let range = Range(start: start, end: end)
id = text[range]
var t = text
t.removeRange(range)
return (id, t)
}
return nil
}
else if let range = text.rangeOfString(id) {
var t = text
t.removeRange(range)
return (id, t)
}
else {
return nil
}
}
The unit test returns :
Unit test tag.foundIn(a👿bbbbb🇩🇪c) ? = Optional(("👿bbbbb", "a🇩🇪c"))

Remove nth character from string

I have seen many methods for removing the last character from a string. Is there however a way to remove any old character based on its index?
Here is a safe Swift 4 implementation.
var s = "Hello, I must be going"
var n = 5
if let index = s.index(s.startIndex, offsetBy: n, limitedBy: s.endIndex) {
s.remove(at: index)
print(s) // prints "Hello I must be going"
} else {
print("\(n) is out of range")
}
While string indices aren't random-access and aren't numbers, you can advance them by a number in order to access the nth character:
var s = "Hello, I must be going"
s.removeAtIndex(advance(s.startIndex, 5))
println(s) // prints "Hello I must be going"
Of course, you should always check that the string is longer than 5 characters before doing this!
Edit: as @MartinR points out, you can use the with-end-index version of advance to avoid the risk of running past the end:
let index = advance(s.startIndex, 5, s.endIndex)
if index != s.endIndex { s.removeAtIndex(index) }
As ever, optionals are your friend:
// find returns index of first match,
// as an optional with nil for no match
if let idx = s.characters.index(of:",") {
// this will only be executed if non-nil,
// idx will be the unwrapped result of find
s.removeAtIndex(idx)
}
Swift 3.2
let str = "hello"
let position = 2
let subStr = str.prefix(upTo: str.index(str.startIndex, offsetBy: position)) + str.suffix(from: str.index(str.startIndex, offsetBy: (position + 1)))
print(subStr)
"helo"
var hello = "hello world!"
Let's say we want to remove the "w". (It's at the 6th index position.)
First: Create an Index for that position. (I'm making the return type Index explicit; it's not required).
let index:Index = hello.startIndex.advancedBy(6)
Second: Call removeAtIndex() and pass it our just-made index. (Notice it returns the character in question)
let choppedChar:Character = hello.removeAtIndex(index)
print(hello) // prints hello orld!
print(choppedChar) // prints w
