Efficient SIMD dot product in Rust

I'm trying to create an efficient SIMD version of a dot product, in order to implement 2D convolution over i16 values for an FIR filter.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
#[target_feature(enable = "avx2")]
unsafe fn dot_product(a: &[i16], b: &[i16]) {
    let a = a.as_ptr() as *const [i16; 16];
    let b = b.as_ptr() as *const [i16; 16];
    let a = std::mem::transmute(*a);
    let b = std::mem::transmute(*b);
    let ms_256 = _mm256_mullo_epi16(a, b);
    dbg!(std::mem::transmute::<_, [i16; 16]>(ms_256));
    let hi_128 = _mm256_castsi256_si128(ms_256);
    let lo_128 = _mm256_extracti128_si256(ms_256, 1);
    dbg!(std::mem::transmute::<_, [i16; 8]>(hi_128));
    dbg!(std::mem::transmute::<_, [i16; 8]>(lo_128));
    let temp = _mm_add_epi16(hi_128, lo_128);
}
fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    let b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    unsafe {
        dot_product(&a, &b);
    }
}
I] ~/c/simd (master|…) $ env RUSTFLAGS="-C target-cpu=native" cargo run --release | wl-copy
warning: unused variable: `temp`
--> src/main.rs:16:9
|
16 | let temp = _mm_add_epi16(hi_128, lo_128);
| ^^^^ help: if this is intentional, prefix it with an underscore: `_temp`
|
= note: `#[warn(unused_variables)]` on by default
warning: 1 warning emitted
Finished release [optimized] target(s) in 0.00s
Running `target/release/simd`
[src/main.rs:11] std::mem::transmute::<_, [i16; 16]>(ms_256) = [
0,
1,
4,
9,
16,
25,
36,
49,
64,
81,
100,
121,
144,
169,
196,
225,
]
[src/main.rs:14] std::mem::transmute::<_, [i16; 8]>(hi_128) = [
0,
1,
4,
9,
16,
25,
36,
49,
]
[src/main.rs:15] std::mem::transmute::<_, [i16; 8]>(lo_128) = [
64,
81,
100,
121,
144,
169,
196,
225,
]
While I understand SIMD conceptually, I'm not familiar with the exact instructions and intrinsics.
I know that I need to multiply the two vectors and then sum the result horizontally, by repeatedly splitting it in half and vertically adding the two halves.
I've found the madd instruction, which supposedly performs one such summation right after the multiplication, but I'm not sure what to do with its result.
If I use mul instead of madd, I'm not sure which instructions to use to reduce the result further.
Any help is welcome!
PS
I've tried packed_simd, but it seems that it doesn't work on stable Rust.
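
For illustration, the madd-then-reduce idea described above might be sketched as follows. This is only a sketch under assumptions: the names dot_scalar, dot_avx2 and dot_product are made up for this example, sums are accumulated in i32 (which can overflow for long or large-valued inputs), and a scalar fallback is used when AVX2 is not detected at run time. _mm256_madd_epi16 multiplies the sixteen i16 pairs and adds adjacent products, giving eight i32 lanes, which are then folded in halves.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference; also the fallback when AVX2 is unavailable.
fn dot_scalar(a: &[i16], b: &[i16]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[i16], b: &[i16]) -> i32 {
    let n = a.len().min(b.len());
    let chunks = n / 16;
    let mut acc = _mm256_setzero_si256();
    for i in 0..chunks {
        // Unaligned loads of 16 x i16 from each input.
        let va = _mm256_loadu_si256(a.as_ptr().add(i * 16) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i * 16) as *const __m256i);
        // Multiply i16 lanes and add adjacent products: 8 x i32 partial sums.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // Fold 256 bits into 128 bits: low half + high half.
    let lo = _mm256_castsi256_si128(acc);
    let hi = _mm256_extracti128_si256::<1>(acc);
    let s = _mm_add_epi32(lo, hi);
    // Fold 4 lanes into 2, then 2 into 1.
    let s = _mm_add_epi32(s, _mm_shuffle_epi32::<0b01_00_11_10>(s));
    let s = _mm_add_epi32(s, _mm_shuffle_epi32::<0b10_11_00_01>(s));
    let sum = _mm_cvtsi128_si32(s);
    // Handle the elements that did not fill a whole 16-lane chunk.
    sum + dot_scalar(&a[chunks * 16..n], &b[chunks * 16..n])
}

/// Picks the AVX2 path when the CPU supports it, otherwise the scalar one.
fn dot_product(a: &[i16], b: &[i16]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Calling dot_product on the two arrays from the question should give the sum of the squares printed above. Whether this is actually faster than what the compiler auto-vectorizes from the scalar loop would need to be measured.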

Related

Borrowing/scoping issues when pushing &[u8] to a Vec [duplicate]

I'm doing something like this:
fn main() {
    //[1, 0, 0, 0, 99]; // return [2, 0, 0, 0, 99]
    //[2, 3, 0, 3, 99]; // return [2,3,0,6,99]
    //[2, 4, 4, 5, 99, 0]; // return [2,4,4,5,99,9801]
    //[1, 1, 1, 4, 99, 5, 6, 0, 99]; // return [30,1,1,4,2,5,6,0,99]
    let map: Vec<(&mut [usize], &[usize])> = vec![(&mut [1, 0, 0, 0, 99], &[2, 0, 0, 0, 99])];
    for (x, y) in map {
        execute_program(x);
        assert_eq!(x, y);
    }
}
pub fn execute_program(vec: &mut [usize]) {
    //do something inside vec
}
Here is the playground.
The problem is that I don't use a let binding for the first element of the tuple, which I want to borrow mutably and pass to execute_program:
error[E0716]: temporary value dropped while borrowed
--> src/main.rs:2:57
|
2 | let map: Vec<(&mut [usize], &[usize])> = vec![(&mut [1, 0, 0, 0, 99], &[2, 0, 0, 0, 99])];
| ^^^^^^^^^^^^^^^^ - temporary value is freed at the end of this statement
| |
| creates a temporary which is freed while still in use
3 |
4 | for (x, y) in map {
| --- borrow later used here
|
= note: consider using a `let` binding to create a longer lived value
But what I was doing was a refactoring exactly because I didn't want to do a let for every slice I want to test!
Is the let really needed?
Well, something has to own each of those arrays, because references can't own things. And the arrays are of different sizes, so the owner has to be a pointer. The most common array-like owning pointer is Vec:
let map: Vec<(Vec<usize>, &[usize])> = vec![
    (vec![1, 0, 0, 0, 99], &[2, 0, 0, 0, 99]),
    (vec![2, 3, 0, 3, 99], &[2, 3, 0, 6, 99]),
    (vec![2, 4, 4, 5, 99, 0], &[2, 4, 4, 5, 99, 9801]),
    (vec![1, 1, 1, 4, 99, 5, 6, 0, 99], &[30, 1, 1, 4, 2, 5, 6, 0, 99]),
];
for (mut x, y) in map {
    execute_program(&mut x);
    assert_eq!(x, y);
}
The arrays are therefore owned by map and borrowed when necessary, as loganfsmyth also suggested in the question comments.
You may be concerned about the performance cost of making unnecessary allocations. This is the cost of using a single let; since the arrays are not all the same size, if you want them on the stack there really isn't a way around declaring them with different lets. However, you could write a macro that removes the boilerplate.
Wait, why does it work for y?
You may wonder why I turned x into a vector, but left y as it is. The answer is that because y is a shared reference, those arrays are subject to static promotion, so that &[2, 0, 0, 0, 99] is actually of type &'static [usize; 5] which can be coerced to &'static [usize]. &mut references do not trigger static promotion because it is unsafe to mutate a static value without some kind of synchronization.
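
To make the difference concrete, here is a small sketch. The names promoted, bump_first and run_cases are made up for this example, and bump_first is only a hypothetical stand-in for execute_program. The first function compiles only because the shared reference is promoted to a static; the mutable inputs are owned by Vecs, as in the answer above.

```rust
/// Compiles because `&[2, 0, 0, 0, 99]` is a shared reference to a constant
/// expression, which the compiler promotes to a `'static` value.
fn promoted() -> &'static [usize] {
    &[2, 0, 0, 0, 99]
}

/// Hypothetical stand-in for `execute_program`: adds 1 to the first element.
fn bump_first(v: &mut [usize]) {
    if let Some(first) = v.first_mut() {
        *first += 1;
    }
}

/// Runs every case. Each input `Vec` owns its data, so taking `&mut` borrows
/// of it is fine; the expected values are plain shared slices.
fn run_cases(cases: Vec<(Vec<usize>, &[usize])>) -> bool {
    cases.into_iter().all(|(mut input, expected)| {
        bump_first(&mut input);
        input.as_slice() == expected
    })
}
```

Changing promoted to return &'static mut [usize] from &mut [2, 0, 0, 0, 99] would be expected to fail to compile, for the reason given above.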

How to write a loop that adds all the numbers from the list into a variable

I'm a total beginner and self-learner, and I'm trying to solve problem 5 from How to Think Like a Computer Scientist: Learning with Python 3. The problem looks like this:
xs = [12, 10, 32, 3, 66, 17, 42, 99, 20]
Write a loop that adds all the numbers from the list into a variable called total. You should set the total variable to have the value 0 before you start adding them up, and print the value in total after the loop has completed.
Here is what I tried to do:
for xs in [12, 10, 32, 3, 66, 17, 42, 99, 20]:
xs = [12, 10, 32, 3, 66, 17, 42, 99, 20]
total = 0
total = sum(xs)
print(total)
Should I use a for loop at all? Or should I use a sum function?
There is no need for a for loop here; simply use sum:
xs = [12, 10, 32, 3, 66, 17, 42, 99, 20]
total = sum(xs)
print(total)
If you really want to use a loop:
total = 0
xs = [12, 10, 32, 3, 66, 17, 42, 99, 20]
for i in xs:
    total += i
print(total)

Checking equality of two TriMat matrices written to Matrix Market format in sprs does not work if insertion order is different

I am trying to check if two Matrix Market format files contain the same matrix. In my actual code, due to the use of multi-threading, there is no guarantee that items are inserted into a TriMat in the same order before it is serialized to disk. As a result, when I load the resulting files and compare them, they do not always compare equal. How can I compare two different .mtx files and ensure that they contain the same matrix, regardless of insertion order?
Example code:
extern crate sprs;
use sprs::io::{write_matrix_market, read_matrix_market};
use sprs::TriMat;
fn main() {
    let mut mat = TriMat::new((4, 20));
    let mut vals = Vec::new();
    vals.push((0, 19, 1));
    vals.push((1, 14, 1));
    vals.push((1, 19, 1));
    vals.push((2, 17, 2));
    for (i, j, v) in vals {
        mat.add_triplet(i, j, v)
    }
    let _ = write_matrix_market("a.mtx", &mat).unwrap();
    let mut mat2 = TriMat::new((4, 20));
    let mut vals2 = Vec::new();
    vals2.push((0, 19, 1));
    vals2.push((1, 14, 1));
    vals2.push((2, 17, 2)); // different order
    vals2.push((1, 19, 1));
    for (i, j, v) in vals2 {
        mat2.add_triplet(i, j, v)
    }
    let _ = write_matrix_market("b.mtx", &mat2).unwrap();
    let seen_mat: TriMat<usize> = read_matrix_market("a.mtx").unwrap();
    let expected_mat: TriMat<usize> = read_matrix_market("b.mtx").unwrap();
    assert_eq!(seen_mat, expected_mat);
}
And the resulting error:
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `TriMatBase { rows: 4, cols: 20, row_inds: [0, 1, 1, 2], col_inds: [19, 14, 19, 17], data: [1, 1, 1, 2] }`,
right: `TriMatBase { rows: 4, cols: 20, row_inds: [0, 1, 2, 1], col_inds: [19, 14, 17, 19], data: [1, 1, 2, 1] }`', src/main.rs:31:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
You can see that these two matrices are actually identical, but that the items have been inserted in different orders.
Turns out you can convert to CSR to get it to work:
let seen_mat: TriMat<usize> = read_matrix_market("a.mtx").unwrap();
let expected_mat: TriMat<usize> = read_matrix_market("b.mtx").unwrap();
let a = seen_mat.to_csr();
let b = expected_mat.to_csr();
assert_eq!(a, b);
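
The reason a canonical form helps can be shown without sprs at all. The sketch below is not the sprs API: canonical is a hypothetical helper that takes the three triplet arrays as plain slices and collapses them into a map keyed by (row, column), so insertion order no longer matters. Summing duplicate entries is an assumption made here, and the matrix shape is not compared.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: collapses (row, col, value) triplets into a map keyed
/// by position, summing duplicates, so the insertion order is irrelevant.
fn canonical(
    rows: &[usize],
    cols: &[usize],
    vals: &[usize],
) -> BTreeMap<(usize, usize), usize> {
    let mut entries = BTreeMap::new();
    for ((&r, &c), &v) in rows.iter().zip(cols).zip(vals) {
        *entries.entry((r, c)).or_insert(0) += v;
    }
    entries
}
```

Applied to the row_inds, col_inds and data arrays from the panic message above, both matrices produce the same map.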

Why do Rust/sodiumoxide PublicKeys get prefixed with a length when serialised?

sodiumoxide defines PublicKey as:
new_type! {
    /// `PublicKey` for signatures
    public PublicKey(PUBLICKEYBYTES);
}
The new_type macro expands to:
pub struct $name(pub [u8; $bytes]);
Thus, PublicKey is defined as a simple wrapper of 32 bytes.
When I define my own wrapper of 32 bytes (MyPubKey) it bincode serialises to 32 bytes.
When I bincode serialise PublicKey, it is 40 bytes - the 32 bytes prefixed with a little-endian u64 containing the length.
#[macro_use]
extern crate serde_derive;
extern crate serde;
extern crate bincode;
extern crate sodiumoxide;
use sodiumoxide::crypto::{sign, box_};
use bincode::{serialize, deserialize, Infinite};
#[derive(Serialize, Deserialize, PartialEq, Debug)]
pub struct MyPubKey(pub [u8; 32]);
fn main() {
    let (pk, sk) = sign::gen_keypair();
    let arr: [u8; 32] = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31];
    let mpk = MyPubKey(arr);
    let encoded_pk: Vec<u8> = serialize(&pk, Infinite).unwrap();
    let encoded_arr: Vec<u8> = serialize(&arr, Infinite).unwrap();
    let encoded_mpk: Vec<u8> = serialize(&mpk, Infinite).unwrap();
    println!("encoded_pk len:{:?} {:?}", encoded_pk.len(), encoded_pk);
    println!("encoded_arr len:{:?} {:?}", encoded_arr.len(), encoded_arr);
    println!("encoded_mpk len:{:?} {:?}", encoded_mpk.len(), encoded_mpk);
}
Results:
encoded_pk len:40 [32, 0, 0, 0, 0, 0, 0, 0, 7, 199, 134, 217, 109, 46, 34, 155, 89, 232, 171, 185, 199, 190, 253, 88, 15, 202, 58, 211, 198, 49, 46, 225, 213, 233, 114, 253, 61, 182, 123, 181]
encoded_arr len:32 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
encoded_mpk len:32 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
What is the difference between the PublicKey type, created with sodiumoxide's new_type! macro, and the MyPubKey type?
How can I get the 32 bytes out of a PublicKey so that I can serialise them efficiently?
It's up to the implementation of the serialization. sodiumoxide has chosen to implement all serialization by converting the types to a slice and then serializing that:
#[cfg(feature = "serde")]
impl ::serde::Serialize for $newtype {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
        where S: ::serde::Serializer
    {
        serializer.serialize_bytes(&self[..])
    }
}
Since slices do not have a size known at compile time, serialization must include the length so that deserialization can occur.
You can probably implement your own serialization for a remote type or even just serialize the inner field directly:
serialize(&pk.0, Infinite)
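
The 32 versus 40 byte difference can be mimicked with a dependency-free sketch. This is not bincode's actual implementation; encode_array, encode_slice and decode_slice are hypothetical functions that only illustrate why a run-time-sized slice needs a length prefix while a fixed-size array does not.

```rust
/// Fixed-size array: the length is part of the type, so only the bytes are
/// written.
fn encode_array(bytes: &[u8; 32]) -> Vec<u8> {
    bytes.to_vec()
}

/// Slice: the length is only known at run time, so it is written first as a
/// little-endian u64, followed by the bytes.
fn encode_slice(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + bytes.len());
    out.extend_from_slice(&(bytes.len() as u64).to_le_bytes());
    out.extend_from_slice(bytes);
    out
}

/// Reads the length prefix back; without it the decoder could not know where
/// the slice ends.
fn decode_slice(input: &[u8]) -> Option<&[u8]> {
    if input.len() < 8 {
        return None;
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&input[..8]);
    let len = u64::from_le_bytes(len_bytes) as usize;
    input[8..].get(..len)
}
```

For a 32-byte key, encode_slice yields 40 bytes starting with 32, 0, 0, 0, 0, 0, 0, 0, which matches the shape of the encoded_pk output above.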

OS X GCD multi-thread concurrency uses more CPU but executes slower than single thread

I have a method which does a series of calculations which take quite a bit of time to complete. The objects that this method does computations on are generated at runtime and can range from a few to a few thousand. Obviously it would be better if I could run these computations across several threads concurrently, but when I try that, my program uses more CPU yet takes longer than running them one-by-one. Any ideas why?
let itemsPerThread = (dataArray.count / 4) + 1
for var i = 0; i < dataArray.count; i += itemsPerThread
{
    let name = "ComputationQueue\(i)".bridgeToObjectiveC().cString()
    let compQueue = dispatch_queue_create(name, DISPATCH_QUEUE_CONCURRENT)
    dispatch_async(compQueue,
    {
        let itemCount = i + itemsPerThread < dataArray.count ? itemsPerThread : dataArray.count - i - 1
        let subArray = dataArray.bridgeToObjectiveC().subarrayWithRange(NSMakeRange(i, dataCount)) as MyItem[]
        self.reallyLongComputation(subArray, increment: increment, outputIndex: self.runningThreads-1)
    })
    NSThread.sleepForTimeInterval(1)
}
Alternatively:
If I run this same thing, but a single dispatch_async call and on the whole dataArray rather than the subarrays, it completes much faster while using less CPU.
What you want to do (this is my guess) should look like this:
//
// main.swift
// test
//
// Created by user3441734 on 12/11/15.
// Copyright © 2015 user3441734. All rights reserved.
//
import Foundation
let computationGroup = dispatch_group_create()
var arr: Array<Int> = []
for i in 0..<48 {
arr.append(i)
}
print("arr \(arr)")
func job(inout arr: Array<Int>, workers: Int) {
    let count = arr.count
    let chunk = count / workers
    guard chunk * workers == count else {
        print("array.count divided by workers must be an integer!")
        return
    }
    let compQueue = dispatch_queue_create("test", DISPATCH_QUEUE_CONCURRENT)
    let syncQueue = dispatch_queue_create("aupdate", DISPATCH_QUEUE_SERIAL)
    for var i = 0; i < count; i += chunk
    {
        let j = i
        var tarr = arr[j..<j+chunk]
        dispatch_group_enter(computationGroup)
        dispatch_async(compQueue) { () -> Void in
            for k in j..<j+chunk {
                // long-running computation
                var z = 100000000
                repeat {
                    z--
                } while z > 0
                // update the chunk
                tarr[k] = j
            }
            dispatch_async(syncQueue, { () -> Void in
                for k in j..<j+chunk {
                    arr[k] = tarr[k]
                }
                dispatch_group_leave(computationGroup)
            })
        }
    }
    dispatch_group_wait(computationGroup, DISPATCH_TIME_FOREVER)
}
var stamp: Double {
    return NSDate.timeIntervalSinceReferenceDate()
}
print("running on dual core ...\n")
var start = stamp
job(&arr, workers: 1)
print("job done by 1 worker in \(stamp-start) seconds")
print("arr \(arr)\n")
start = stamp
job(&arr, workers: 2)
print("job done by 2 workers in \(stamp-start) seconds")
print("arr \(arr)\n")
start = stamp
job(&arr, workers: 4)
print("job done by 4 workers in \(stamp-start) seconds")
print("arr \(arr)\n")
start = stamp
job(&arr, workers: 6)
print("job done by 6 workers in \(stamp-start) seconds")
print("arr \(arr)\n")
with these results:
arr [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
running on dual core ...
job done by 1 worker in 5.16312199831009 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
job done by 2 workers in 2.49235796928406 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24]
job done by 4 workers in 3.18479603528976 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36]
job done by 6 workers in 2.51704299449921 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24, 24, 24, 32, 32, 32, 32, 32, 32, 32, 32, 40, 40, 40, 40, 40, 40, 40, 40]
Program ended with exit code: 0
You can use the following pattern to distribute a job among any number of workers (the number of workers that gives the best performance depends on how the worker is defined and on the resources available in your environment). Generally, for any kind of long-running calculation (transformation) you can expect some performance gain, up to 50% in a two-core environment. If your worker uses highly optimized functions that already use multiple cores by default, the performance gain can be close to nothing :-)
// generic implementation
// 1) job distributes the data between workers as fairly as possible
// 2) workers do their tasks in parallel
// 3) the order of the resulting array reflects the input array
// 4) there is no requirement for the worker block to return
// the same type as the result of your 'calculation'
func job<T,U>(arr: [T], workers: Int, worker: T->U)->[U] {
    guard workers > 0 else { return [U]() }
    var res: Dictionary<Int,[U]> = [:]
    let workersQueue = dispatch_queue_create("workers", DISPATCH_QUEUE_CONCURRENT)
    let syncQueue = dispatch_queue_create("sync", DISPATCH_QUEUE_SERIAL)
    let group = dispatch_group_create()
    var j = min(workers, arr.count)
    var i = (0, 0, arr.count)
    var chunk: ArraySlice<T> = []
    repeat {
        let a = (i.1, i.1 + i.2 / j, i.2 - i.2 / j)
        i = a
        chunk = arr[i.0..<i.1]
        dispatch_group_async(group, workersQueue) { [i, chunk] in
            let arrs = chunk.map{ worker($0) }
            dispatch_sync(syncQueue) {[i,arrs] in
                res[i.0] = arrs
            }
        }
        j--
    } while j != 0
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER)
    let idx = res.keys.sort()
    var results = [U]()
    idx.forEach { (idx) -> () in
        results.appendContentsOf(res[idx]!)
    }
    return results
}
You need to:
Get rid of the 1 second sleep. This is artificially reducing the degree to which you get parallel execution, because you're waiting before starting the next thread. You are starting 4 threads, and you are therefore artificially delaying the start (and potentially the finish) of the final thread by 3 seconds.
Use a single concurrent queue, not one per dispatch block. A concurrent queue will start blocks in the order in which they are dispatched, but does not wait for one block to finish before starting the next one, i.e. it will run blocks in parallel.
NSArray is a thread-safe class. I presume that it uses a multiple-reader/single-writer lock internally, which means there is probably no advantage to be obtained from creating a set of subarrays. You are, however, incurring the overhead of creating the subArray.
Multiple threads running on different cores cannot talk to the same cache line at the same time. Typical cache line size is 64 bytes, which seems unlikely to cause a problem here.
