Rust: How to concatenate bytes together

I have already read the following link but still get some errors with my current attempt:
let data = &[37u8, 42u8];
let data_two = &[0x34u8, 0x32u8];
let res:Vec<u8> = [data, data_two].concat();
Also, ideally I would like to avoid concatenation, and write an array of u8 to a buffer, where I reserve the first two bytes for storing length and index like:
let nb:u8 = get_chunks_nb();
let index:u8 = get_chunk_index();
let header = &[nb, index];
// this kind of things in C:
memcpy(&buffer, header, 2);
memcpy(&buffer[2], chunk, chunk_len);
Thank you for your help!

I took a shot at it, though I'm still new to Rust, so I'm not 100% sure about the explanation.
It looks like the compiler sees data and data_two as references to fixed-size arrays (&[u8; 2]), so [data, data_two] is an array of array references and not an array of slices. That is probably why it couldn't find the concat method on it.
By explicitly saying that data is a slice, everything seems to fall into place:
let data:&[u8] = &[37u8, 42u8];
let data_two = &[0x34u8, 0x32u8];
let mut res:Vec<u8> = [data, data_two].concat();
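For the second part of the question (reserving the first two bytes for a header and copying the chunk after it), something along these lines might work. This is only a sketch: the function names concat_bytes, build_packet and write_packet are made up for illustration, and nb / index stand in for the values returned by the asker's get_chunks_nb() and get_chunk_index().

```rust
/// Concatenate two byte slices into a new Vec (works because both are `&[u8]`).
fn concat_bytes(a: &[u8], b: &[u8]) -> Vec<u8> {
    [a, b].concat()
}

/// Build a growable buffer: two header bytes followed by the chunk.
/// Hypothetical helper, roughly the two memcpy calls from the question.
fn build_packet(nb: u8, index: u8, chunk: &[u8]) -> Vec<u8> {
    let mut buffer = Vec::with_capacity(2 + chunk.len());
    buffer.extend_from_slice(&[nb, index]);
    buffer.extend_from_slice(chunk);
    buffer
}

/// Write the same layout into a caller-provided buffer.
/// Returns the number of bytes written, or None if the buffer is too small.
fn write_packet(buffer: &mut [u8], nb: u8, index: u8, chunk: &[u8]) -> Option<usize> {
    let total = 2 + chunk.len();
    if buffer.len() < total {
        return None;
    }
    buffer[0] = nb;
    buffer[1] = index;
    // copy_from_slice panics on a length mismatch, so the ranges must agree exactly.
    buffer[2..total].copy_from_slice(chunk);
    Some(total)
}
```

extend_from_slice and copy_from_slice are the closest safe equivalents of memcpy here; the first grows a Vec, the second fills an existing slice of exactly the same length.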

Related

Why does this loop repeatedly log the last pair in the array instead of all the pairs?

I am using Node v16.15.1 and TypeScript v4.7.4
I want to split an object into multiple objects, and then insert each object as a value in another object.
i.e.
{key1:"value1", key2:"value2"}
-> {key1:"value1"} and {key2:"value2"}
-> {key3:"value3", key4:"value4", key5:{key1:"value1"}} and {key3:"value3", key4:"value4", key5:{key2:"value2"}}
Below is the code I am using:
let a:any = {}
let entries = Object.entries({key1:"value1", key2:"value2"});
for(const el of entries) {
let b = a;
b.key = Object.fromEntries(new Map([el]));
console.log(b.key);
console.log(b)
}
However, the output I get is this.
{key2:"value2"} is in both objects, instead of just the second one.
If I use the following code, however, I get the correct result:
let entries = Object.entries({key1:"value1", key2:"value2"});
for(const el of entries) {
let b:any = {};
b.key = Object.fromEntries(new Map([el]));
console.log(b.key);
console.log(b)
}
The problem with this is that I am not inserting into a blank object, and am passing it as a parameter in a function.
Why does this happen?
How would I be able to fix this?
TIA
In JavaScript, when you write let a:any = {}; and then let b = a;, you assign to b a reference to the same object as a (not a copy of its value). So if you update b, you are actually updating a, because both variables point to the same object.
If you want b to be a copy of a, you should do something like let b = {...a}. Note that this is a shallow copy: nested objects are still shared.

efficiently creating a list of pointers to a character in a buffer using arm neon simd

I've been rewriting some performance-sensitive parts of my code with aarch64 NEON intrinsics. For some things, like population count, I've managed to get a 12x speedup. But for some algorithms I'm having trouble.
The high-level problem is quickly adding a list of newline-separated strings to a hashset. Assuming the hashset functionality is optimal (I am looking into it next), I first need to scan for the strings in the buffer.
I have tried various techniques, but my intuition tells me that I can create a list of pointers to each newline and then insert the resulting slices into the hashset afterwards.
The fundamental problem is that I can't work out an efficient way to load a vector, compare it against the newline, and spit out a list of pointers to the newlines. That is, the output has a variable length, depending on how many newlines were found in the input vector.
Here is my approach:
fn read_file7(mut buffer: Vec<u8>, needle: u8) -> Result<HashSet<Vec<u8>>, Error>
{
let mut set = HashSet::new();
let mut chunk_offset: usize = 0;
let special_finder_big = [
0x80u8, 0x40u8, 0x20u8, 0x10u8, 0x08u8, 0x04u8, 0x02u8, 0x01u8, // high
0x80u8, 0x40u8, 0x20u8, 0x10u8, 0x08u8, 0x04u8, 0x02u8, 0x01u8, // low
];
let mut next_start: usize = 0;
let needle_vector = unsafe { vdupq_n_u8(needle) };
let special_finder_big = unsafe { vld1q_u8(special_finder_big.as_ptr()) };
let mut line_counter = 0;
// we process 16 chars at a time
for chunk in buffer.chunks(16) {
unsafe {
let src = vld1q_u8(chunk.as_ptr());
let out = vceqq_u8(src, needle_vector);
let anded = vandq_u8(out, special_finder_big);
// each of these is a bitset of each matching character
let vadded = vaddv_u8(vget_low_u8(anded));
let vadded2 = vaddv_u8(vget_high_u8(anded));
let list = [vadded2, vadded];
// combine bitsets into one big one!
let mut num = std::mem::transmute::<[u8; 2], u16>(list);
// while our bitset has bits left, find the set bits
while num > 0 {
let mut xor = 0x8000u16; // only set the highest bit
let clz = (num).leading_zeros() as usize;
set.get_or_insert_owned(&buffer[(next_start)..(chunk_offset + clz)]);
// println!("found '{}' at {} | clz is {} ", needle.escape_ascii(), start_offset + clz, clz);
// println!("string is '{}'", input[(next_start)..(start_offset + clz)].escape_ascii());
xor = xor >> clz;
num = num ^ xor;
next_start = chunk_offset + clz + 1;
//println!("new num {:032b}", num);
line_counter += 1;
}
}
chunk_offset += 16;
}
// get the remaining
set.get_or_insert_owned(&buffer[(next_start)..]);
println!(
"line_counter: {} unique elements {}",
line_counter,
set.len()
);
Ok(set)
}
If I unroll this to do 64 bytes at a time, on a big input it will be slightly faster than memchr, but not by much.
Any tips would be appreciated.
I've shown this to a colleague, who came up with better intrinsics code than I would have. Here's his suggestion. It has not been compiled, so some pseudo-code pieces need finishing off, but something along the lines of the code below should be much faster and work:
let mut line_counter = 0;
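// Read 32 bytes at a time. chunks_exact skips a shorter final chunk,
// which has to be handled separately (the loads below always read 16 bytes).
for chunk in buffer.chunks_exact(32) {
unsafe {
let src1 = vld1q_u8(chunk.as_ptr());
let src2 = vld1q_u8(chunk.as_ptr().add(16)); // raw pointers need .add(), not +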
let out1 = vceqq_u8(src1, needle_vector);
let out2 = vceqq_u8(src2, needle_vector);
// We slot these next to each other in the same vector.
// In this case the bottom 64-bits of the vector will tell you
// if there are any needle values inside the first vector and
// the top 64-bits tell you if you have any needle values in the
// second vector.
let combined = vpmaxq_u8(out1, out2);
// Now we get another maxp which compresses this information into
// a single 64-bit value, where the bottom 32-bits tell us about
// src1 and the top 32-bit about src2.
let combined = vpmaxq_u8(combined, combined);
let remapped = vreinterpretq_u64_u8 (combined);
let val = vgetq_lane_u64 (remapped, 0);
if val == 0 { // most chunks won't have a newline
... // no match was found in either vector: adjust the offset and continue.
}
// val is 64 bits wide: the bottom 32 bits describe src1, the top 32 bits src2.
if val & 0xFFFF_FFFF != 0 {
... // there must be a match in src1. Use the code below in a function.
}
if val & 0xFFFF_FFFF_0000_0000 != 0 {
... // there must be a match in src2. Use the code below in a function.
}
...
}
}
Now that we know which vector to look in, we should find the index in the vector.
As an example, let's assume matchvec is the vector we found above (so either out1 or out2).
To find the first index:
// We create a mask of repeating 0xf00f chunks. When we fill an entire vector
// with it we get a pattern where every byte is 0xf0 or 0x0f. We'll use this
// to find the index of the matches.
let mask = unsafe { vreinterpretq_u8_u16(vdupq_n_u16(0xf00f)) };
// We first clear the bits we don't want, which leaves for each adjacent 8-bit entries
// 4 bits of free space alternatingly.
let masked = vandq_u8 (matchvec, mask);
// Which means when we do a pairwise addition
// we are sure that no overflow will ever happen. The entries slot next to each other
// and a non-zero bit indicates the start of the first element.
// We've also compressed the values into the lower 64-bits again.
let compressed = vpaddq_u8 (masked, masked);
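let val = vgetq_lane_u64(vreinterpretq_u64_u8(compressed), 0);
// val now holds one 4-bit entry per input byte: nibble k is non-zero when byte k matched.
// This assumes Rust has kept val on the SIMD side. If it did not, it is best to
// do the bit count on the lower 64 bits of compressed and transfer only the result.
// Byte 0 lands in the lowest nibble (little-endian lane order), so the first match
// is the lowest set bit: count trailing zeros rather than leading zeros.
let pos = val.trailing_zeros() as usize;
// Each entry is 4 bits wide, so shift pos right by 2 to get the actual byte index.
let pos = pos >> 2;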
pos will now contain the index of the first needle value.
If you were processing out2, remember to add 16 to the result.
To find all the indices, we can run through the bitmask without using clz; this avoids the repeated register file transfers.
// set masked and compressed as above
let masked = vandq_u8(matchvec, mask);
let compressed = vpaddq_u8(masked, masked);
let mut val = vgetq_lane_u64(vreinterpretq_u64_u8(compressed), 0);
let mut idx = current_offset;
while val != 0 {
if val & 0xf != 0 {
// entry found at idx.
}
idx += 1;
val >>= 4;
}
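The decoding loop above can be tried out without NEON hardware. The following is a portable sketch, assuming the mask layout described above (4 bits per input byte, byte 0 in the lowest nibble). Both function names are made up for illustration, and nibble_mask is only a scalar stand-in for the vandq_u8 / vpaddq_u8 step, not the SIMD code itself.

```rust
/// Scalar stand-in for the SIMD compression step (hypothetical helper):
/// nibble k of the result is 0xf when byte k of the block equals `needle`.
fn nibble_mask(block: &[u8; 16], needle: u8) -> u64 {
    let mut mask = 0u64;
    for (k, &byte) in block.iter().enumerate() {
        if byte == needle {
            mask |= 0xfu64 << (k * 4);
        }
    }
    mask
}

/// Walk the mask four bits at a time and collect the buffer offsets of all
/// matches. `base` is the offset of the 16-byte block inside the buffer.
fn indices_from_nibble_mask(mut mask: u64, base: usize) -> Vec<usize> {
    let mut found = Vec::new();
    let mut idx = base;
    while mask != 0 {
        if mask & 0xf != 0 {
            found.push(idx); // entry found at idx
        }
        idx += 1;
        mask >>= 4;
    }
    found
}
```

The loop stops as soon as the remaining mask is zero, so blocks whose last match is early in the vector exit after only a few iterations.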

trying to break apart a string in rust

I have this string
abcdef x y z
or this one
"ab cd ef" x y z
I am trying to parse this in rust to
s1 = "abcdef"
arr = ["x","y","z"]
or
s1 = "ab cd ef"
arr = ["x","y","z"]
I tried the following (str is the starting string)
let chars = str.chars().peekable();
let s1:String = if *chars.peek().expect("value isnt empty") == '\"'{
chars.skip(1).take_while(|c| *c!= '\"').collect()
}else{
chars.take_while(|c| *c!= ' ').collect()
};
let remainder_str = chars.collect::<String>();
let remainder = remainder_str.split_whitespace();
let mut arr: Vec<&OsStr> =
remainder.map(|s| OsStr::new(s)).collect();
I.e. create an iterator over the chars and walk it down the string, pulling bits out as I go.
This doesn't work because the first collect consumes chars. I am sure I could do it by walking the array of chars with an index and inspecting each one in turn (i.e. brute force), but that doesn't seem like idiomatic Rust.
Can anybody suggest a better way?
Your idea taken to compilation:
fn doit(str: String) -> (String, Vec<String>) {
let mut chars = str.chars().peekable();
let s1:String = if *chars.by_ref().peek().expect("value isnt empty") == '\"'{
chars.by_ref().skip(1).take_while(|c| *c!= '\"').collect()
}else{
chars.by_ref().take_while(|c| *c!= ' ').collect()
};
let remainder_str = chars.collect::<String>();
let remainder = remainder_str.split_whitespace();
let arr: Vec<String> =
remainder.map(|s| s.to_string()).collect();
(s1, arr)
}
by_ref is possibly your friend here.
Note, however, that this method is very likely not what you actually want, as it allocates stuff all over the place (all the Strings, etc.).
You could be better off using only string slices that refer to the original string.
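A slice-based variant could look roughly like this. It is a sketch rather than a drop-in replacement: split_command is a made-up name, and returning None for empty input or an unterminated quote is my own assumption about the desired behaviour (the code above panics on empty input instead).

```rust
/// Split `input` into a leading token (optionally double-quoted) and the
/// whitespace-separated arguments after it. Everything returned borrows
/// from `input`, so no Strings are allocated.
fn split_command(input: &str) -> Option<(&str, Vec<&str>)> {
    let input = input.trim_start();
    if input.is_empty() {
        return None;
    }
    let (head, rest) = if let Some(stripped) = input.strip_prefix('"') {
        // Quoted token: runs up to the closing quote (None if there is none).
        let end = stripped.find('"')?;
        (&stripped[..end], &stripped[end + 1..])
    } else {
        // Unquoted token: runs up to the first whitespace, or to the end.
        match input.find(char::is_whitespace) {
            Some(end) => (&input[..end], &input[end..]),
            None => (input, ""),
        }
    };
    Some((head, rest.split_whitespace().collect()))
}
```

The only allocation left is the Vec holding the argument slices; the strings themselves all point into the original input.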

Swift: CFArray : get values as UTF Strings

I call some functions that return a CFArray of CFStringRef values. I need to get utf strings from them. As I didn't want to make my code too complicated, I did the following:
let initString = "\(TISCreateInputSourceList(nil, false).takeUnretainedValue())"
And then I just split the resulting string by \ns to get an array of Swift strings. However, when the function started returning non-ASCII strings, trouble started. I started getting strings like "\U2345\U2344".
Then I tried to take the CFArray and iterate over it, getting the values and possibly converting them to Strings, but I can't get the values from it:
let ar = TISCreateInputSourceList(nil, true).takeUnretainedValue()
for i in 0...CFArrayGetCount(ar) - 1 {
print(">> \( CFArrayGetValueAtIndex(ar, i).memory )")
}
The values are always empty.
How can I get the actual values?
There are some issues here. First, TISCreateInputSourceList()
has "Create" in its name which means that it returns a (+1) retained
object and you have to take the value with takeRetainedValue(),
not takeUnretainedValue(), otherwise the code will leak memory:
let srcs = TISCreateInputSourceList(nil, true).takeRetainedValue()
You could now use the CFArray... methods to get values from the array,
but it is much easier to convert it to a NSArray (which is "toll-free bridged"):
let srcs = TISCreateInputSourceList(nil, true).takeRetainedValue() as NSArray
This is not an array of CFStringRef values but an array of
TISInputSource objects. You can convert the NSArray to a Swift array:
let srcs = TISCreateInputSourceList(nil, true).takeRetainedValue()
as NSArray as! [TISInputSource]
The forced cast as! is acceptable here because the function is
documented to return an array of input sources.
Now you can simply iterate over the elements of the array:
for src in srcs {
// do something with `src` (which is a `TISInputSource`)
}
The properties of an input source are retrieved with the TISGetInputSourceProperty() function, for example:
let ptr = TISGetInputSourceProperty(src, kTISPropertyInputSourceID)
This returns a "void pointer" (UnsafeMutablePointer<Void>) which has to be converted to an object
pointer of the appropriate type (which is CFStringRef for the
kTISPropertyInputSourceID property). Unfortunately, this is a bit
complicated (compare How to cast self to UnsafeMutablePointer<Void> type in swift):
let val = Unmanaged<CFString>.fromOpaque(COpaquePointer(ptr)).takeUnretainedValue()
Again we can take advantage of toll-free bridging, now from
CFStringRef to NSString and String:
let val = Unmanaged<CFString>.fromOpaque(COpaquePointer(ptr)).takeUnretainedValue()
as String
Putting it all together:
let srcs = TISCreateInputSourceList(nil, true).takeRetainedValue()
as NSArray as! [TISInputSource]
for src in srcs {
let ptr = TISGetInputSourceProperty(src, kTISPropertyInputSourceID)
let val = Unmanaged<CFString>.fromOpaque(COpaquePointer(ptr)).takeUnretainedValue()
as String
print(val)
}

SWIFT: performance of uppercaseString

I have a large file (25 MB) of text. I read it into an NSString var. I want to use "uppercaseString" to convert every char to upper case. But the function is so terribly slow that it needs minutes.
Any tips to get it to work much faster?
Added code:
if let path = NSBundle.mainBundle().pathForResource("GERMANU", ofType: "txt") {
var error: NSError?
if let data = NSData(contentsOfFile: path, options: NSDataReadingOptions(), error: &error) {
if let datastring = NSString(data: data, encoding: NSMacOSRomanStringEncoding) {
var upper = datastring.uppercaseString
...
That's the code, which works but is slow. Only the last line takes all the time.
String::uppercaseString is instantaneous; creating the string is not.
# Long time
12> var st : String = "".join(Array(count:25000000, repeatedValue: "a"))
st: String = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa..."
# Short time
13> st.uppercaseString
$R8: String = "AAAAAAAAAAAAAAAAAAAAAAAAAAAA..."
Given that you are using the Roman encoding, it is possible that the conversion to uppercase is non-trivial. Perhaps you can try another encoding (if any others are appropriate)? You might try the init?(... usedEncoding ...) variant and invoke fastestEncoding on the result to explore a bit.
Note: you can create a Swift string directly from a file with a particular encoding using:
if let datastring = String(contentsOfFile: path, encoding: ... , error: &error) {
var upper = datastring.uppercaseString
}
To me it looks like a poor library implementation. Using NSString.uppercaseString() is really fast (half a second). So I will use this, but I'm developing in Swift because I like the language, so I don't want to switch back to the old stuff.

Resources