Persistent thread style OpenCL implementation is very slow

I came across the persistent thread (PT) style of implementation for non-homogeneous work distribution and wrote a simple kernel to compare its computation time with that of a kernel doing the same computations the usual way. But my test implementation is about 6 times slower than the ordinary implementation, even without the overhead of sorting the buffer into groups of 32 corresponding operations. Is this a reasonable slowdown, or am I overlooking something? I launched the PT kernel with global_work_size = local_work_size = CL_DEVICE_MAX_WORK_GROUP_SIZE, which is 512. If I choose a smaller size, then obviously it gets even slower.
This is the ordinary kernel:
__kernel void myKernel(const __global int* buffer)
{
    int myIndex = get_local_id(0);
    doSomeComputations(buffer[myIndex]); // just many adds and mults, no conditionals
}
And this is the PT style kernel:
__constant int finalIndex = 655360;
__kernel void myKernel(const __global int* buffer)
{
    __local volatile int nextIndex;
    if (get_local_id(0) == 0)
        nextIndex = 0;
    mem_fence(CLK_LOCAL_MEM_FENCE);
    int myIndex;
    while (true) {
        // get next index
        myIndex = nextIndex + get_local_id(0);
        if (myIndex > finalIndex)
            return;
        if (get_local_id(0) == 0)
            nextIndex += 512;
        mem_fence(CLK_LOCAL_MEM_FENCE);
        doSomeComputations(buffer[myIndex]); // same computations as above
    }
}
I thought both implementations should take about the same time. Why is the PT style implementation so much slower? Thank you in advance.
------------Edited below this line-------------
So, just to be clear: this kernel, launched with global_work_size=655360 and local_work_size=512,
__kernel void myKernel()
{
    int myIndex = get_local_id(0);
    volatile float result;
    float4 test = (float4)(1.1f); // OpenCL C vector literal syntax
    for (int i = 0; i < 1000; i++)
        test = (test*test + test*test) / 2.0f;
    result = test.x;
}
runs 6 times faster than this kernel, launched with global_work_size=512 and local_work_size=512:
__kernel void myKernel()
{
    for (size_t idx = 0; idx < 655360; idx += get_local_size(0))
    {
        volatile float result;
        float4 test = (float4)(1.1f); // OpenCL C vector literal syntax
        for (int i = 0; i < 1000; i++)
            test = (test*test + test*test) / 2.0f;
        result = test.x;
    }
}

You can reduce your second kernel to just this:
__kernel void myKernel(const __global int* buffer)
{
    for (int x = 0; x < 655360; x += get_local_size(0))
        doSomeComputations(buffer[x + get_local_id(0)]);
}
Update: added summary of the below conversation
The first kernel (global_work_size=655360 and local_work_size=512) is split into 655360/512 = 1280 work groups, which fully utilize the GPU. The second kernel (global_work_size=512 and local_work_size=512) consists of a single work group, and a work group runs on just one compute unit, which explains why the first one runs faster.
More details about persistent threads in GPU: persistent-threads-in-opencl-and-cuda.
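The work-group arithmetic in the summary can be checked with a small host-side sketch. It is written in Java purely for illustration and does not call any OpenCL API. The names `LaunchSizing`, `workGroups`, and `persistentGlobalSize`, and the figure of 16 compute units, are hypothetical; the real number would come from querying the device. The second helper assumes the common persistent-thread convention of launching enough work groups to occupy every compute unit, rather than a single group.

```java
class LaunchSizing {
    // Number of work groups a launch is split into (global size must be a multiple of local size).
    static int workGroups(int globalSize, int localSize) {
        if (localSize <= 0 || globalSize % localSize != 0) {
            throw new IllegalArgumentException("global size must be a positive multiple of local size");
        }
        return globalSize / localSize;
    }

    // Global size that gives every compute unit groupsPerUnit work groups (assumed PT launch shape).
    static int persistentGlobalSize(int computeUnits, int groupsPerUnit, int localSize) {
        return computeUnits * groupsPerUnit * localSize;
    }

    public static void main(String[] args) {
        System.out.println(workGroups(655360, 512));          // ordinary kernel: 1280 groups
        System.out.println(workGroups(512, 512));             // PT kernel as launched: 1 group
        System.out.println(persistentGlobalSize(16, 1, 512)); // hypothetical 16 compute units
    }
}
```

With one work group, only one compute unit has anything to do, no matter how long the loop inside the kernel runs.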

Related

How to divide a huge loop into multiple threads and then add result in collection?

I am performing some task in a loop. I need to divide this loop of 1.2 million into multiple threads. Each thread will have some result in list. When all threads are completed I need to add all threads list data into one common list. I can not use ExecutorService. How can I do this?
It should be compatible to jdk 1.6.
This is what I am doing right now:
List<Thread> threads = new ArrayList<Thread>();
int elements = 1200000;
public void function1() {
    int oneTheadElemCount = 10000;
    float fnum_threads = (float)elements / (float)oneTheadElemCount ;
    String s = String.valueOf(fnum_threads);
    int num_threads = Integer.parseInt(s.substring(0, s.indexOf("."))) + 1 ;
    for(int count =0 ; count < num_threads ; count++) {
        int endIndex = ((oneTheadElemCount * (num_threads - count)) + 1000) ;
        int startindex = endIndex - oneTheadElemCount ;
        if(count == (num_threads-1) )
        {
            startindex = 0;
        }
        if(startindex == 0 && endIndex > elements) {
            endIndex = elements -1 ;
        }
        dothis( startindex,endIndex);
    }
    for(Thread t : threads) {
        t.run();
    }
}
public List dothis(int startindex, int endIndex) throws Exception {
    Thread thread = new Thread(new Runnable() {
        @Override
        public void run() {
            for (int i = startindex;
                 (i < endIndex && (startindex < elements && elements) ) ; i++)
            {
                //task adding elements in list
            }
        }
    });
    thread.start();
    threads.add(thread);
    return list;
}

I don't know which version of Java you are using but in Java 7 and higher, you can use Fork/Join ForkJoinPool.
Basically,
Fork/Join, introduced in Java 7, isn't intended to replace or compete
with the existing concurrency utility classes; instead it updates and
completes them. Fork/Join addresses the need for divide-and-conquer,
or recursive task-processing in Java programs (see Resources).
Fork/Join's logic is very simple: (1) separate (fork) each large task
into smaller tasks; (2) process each task in a separate thread
(separating those into even smaller tasks if necessary); (3) join the
results.
Citation.
There are various examples online that can help with it. I haven't used it myself.
I hope this helps.
For Java6, you can follow this related SO question.
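For the Java 6 case without ExecutorService, a minimal sketch using plain Thread objects might look like the following. It is not taken from the linked question. The class name `ChunkedLoop`, the method `runChunked`, and the placeholder task (adding the loop index to a list) are assumptions for illustration. Each thread fills its own list, so no locking is needed, and the lists are merged only after `join()` returns.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedLoop {
    // Splits the range [0, elements) into chunks, runs one thread per chunk, merges the results in order.
    static List<Integer> runChunked(final int elements, final int chunkSize) {
        int numThreads = (elements + chunkSize - 1) / chunkSize; // ceiling division
        List<Thread> threads = new ArrayList<Thread>();
        List<List<Integer>> partials = new ArrayList<List<Integer>>();
        for (int t = 0; t < numThreads; t++) {
            final int start = t * chunkSize;
            final int end = Math.min(start + chunkSize, elements);
            final List<Integer> local = new ArrayList<Integer>(); // private to this thread
            partials.add(local);
            Thread thread = new Thread(new Runnable() {
                public void run() {
                    for (int i = start; i < end; i++) {
                        local.add(i); // placeholder for the real task
                    }
                }
            });
            threads.add(thread);
            thread.start();
        }
        List<Integer> merged = new ArrayList<Integer>(elements);
        for (int t = 0; t < numThreads; t++) {
            try {
                threads.get(t).join(); // wait for the thread; makes its list safely visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            merged.addAll(partials.get(t));
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(runChunked(1200000, 10000).size());
    }
}
```

Note that the threads are started with `start()` and awaited with `join()`; calling `run()` directly would execute the body on the calling thread instead.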

correct usage of compare_exchange_weak

#include <array>
#include <atomic>
#include <cstdlib>
#include <thread>
#include <vector>

const int SIZE = 20;
struct Node { Node* next; };
std::atomic<Node*> head (nullptr);
void push (void* p)
{
    Node* n = (Node*) p;
    n->next = head.load ();
    while (!head.compare_exchange_weak (n->next, n));
}
void* pop ()
{
    Node* n = head.load ();
    while (n &&
           !head.compare_exchange_weak (n, n->next));
    return n ? n : malloc (SIZE);
}
void thread_fn()
{
    std::array<char*, 1000> pointers;
    for (int i = 0; i < 1000; i++) pointers[i] = nullptr;
    for (int i = 0; i < 10000000; i++)
    {
        int r = random() % 1000;
        if (pointers[r] != nullptr) // allocated earlier
        {
            push (pointers[r]);
            pointers[r] = nullptr;
        }
        else
        {
            pointers[r] = (char*) pop (); // allocate
            // stamp the memory
            for (int i = 0; i < SIZE; i++)
                pointers[r][i] = 0xEF;
        }
    }
}
int main(int argc, char *argv[])
{
    int N = 8;
    std::vector<std::thread*> threads;
    threads.reserve (N);
    for (int i = 0; i < N; i++)
        threads.push_back (new std::thread (thread_fn));
    for (int i = 0; i < N; i++)
        threads[i]->join();
}
What is wrong with this usage of compare_exchange_weak? The above code crashes 1 in 5 times using clang++ (MacOSX).
At the time of the crash, head.load() contains "0xEFEFEFEFEF". pop is like malloc and push is like free. Each of the 8 threads randomly allocates or deallocates memory from head.

It could be a nice lock-free allocator, but the ABA problem arises:
A: Assume that thread1 executes pop() and reads the current value of head into the variable n, but immediately afterwards the thread is preempted, and a concurrent thread2 executes a full pop() call; that is, it reads the same value from head and performs a successful compare_exchange_weak.
B: Now the object referred to by n in thread1 no longer belongs to the list and can be modified by thread2. So n->next is garbage in general: reading from it can return any value. For example, it can be 0xEFEFEFEFEF, where the first 5 bytes are the stamp (EF) written by thread2, and the last 3 bytes are still 0, from nullptr. (The total value is interpreted numerically in little-endian order.) It seems that, because the value of head has changed, thread1 will fail its compare_exchange_weak call, but...
A: The concurrent thread2 push()es the resulting pointer back into the list. So thread1 sees the initial value of head and performs a successful compare_exchange_weak, which writes an incorrect value into head. The list is corrupted.
Note that the problem is more than the possibility that another thread can modify the content of n->next. The problem is that the value of n->next is no longer coupled with the list. So even if it is not modified concurrently, it becomes invalid (as a replacement for head) when, for example, other threads pop() 2 elements from the list but push() back only the first of them. (Then n->next points to the second element, which no longer belongs to the list.)
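One common mitigation, which the answer above does not itself propose, is to pair the head pointer with a version stamp so that a compare-and-swap fails when the head has been changed and changed back. The following is a minimal single-threaded Java sketch that replays the A, B, A sequence by hand using `java.util.concurrent.atomic.AtomicStampedReference`. The class name `AbaStampDemo` and the string stand-ins for nodes are hypothetical, and the "threads" are simulated sequentially so the outcome is deterministic.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

class AbaStampDemo {
    // Returns true if the stale compare-and-set from "thread1" succeeds (it should not).
    static boolean staleCasSucceeds() {
        String a = "A";
        String b = "B";
        AtomicStampedReference<String> head = new AtomicStampedReference<String>(a, 0);

        // "thread1" reads head and its stamp, then is preempted.
        int[] stampHolder = new int[1];
        String seen = head.get(stampHolder);
        int seenStamp = stampHolder[0];

        // "thread2" pops A (head becomes B), then pushes A back (head becomes A again).
        head.compareAndSet(a, b, 0, 1);
        head.compareAndSet(b, a, 1, 2);

        // "thread1" resumes: the reference matches, but the stamp has moved from 0 to 2.
        return head.compareAndSet(seen, b, seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println(staleCasSucceeds()); // prints false
    }
}
```

A plain compare-and-swap on the reference alone would have succeeded in the last step, which is exactly the corruption described above.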

increase number of threads decrease time

I'm a newbie in OpenMP. I am beginning with a tutorial from the official OpenMP page:
https://www.youtube.com/playlist?list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG
On that page there is a hello world program that calculates pi by approximating an integral.
I simply wrote the code below following the instructions, but its run time increases as I increase the number of threads by changing NUM_THREADS. In the video, the run time goes down.
I'm executing the program on a remote server with 64 CPUs having 8 cores each.
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    double start_t;
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    start_t = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) {
        pi += sum[i] * step;
    }
    printf("%f\n", omp_get_wtime() - start_t);
}

This is a bad approach to implementing reduction using shared arrays. The successive elements of sum are too close to each other and therefore reside in the same cache line. On cache-coherent architectures like x86/x64, this leads to a problem known as false sharing. The following simple modification will get rid of it:
double sum[8*NUM_THREADS];
#pragma omp parallel
{
    ...
    for (i = id, sum[8*id] = 0.0; i < num_steps; i = i + nthrds) {
        ...
        sum[8*id] += 4.0 / (1.0 + x*x);
    }
}
for (i = 0, pi = 0.0; i < nthreads; i++) {
    pi += sum[8*i] * step;
}
Only the relevant changes are shown. The idea is simple: instead of having threads access successive elements of sum, make them access every 8th element. This guarantees that threads do not share the same cache line, because on most modern CPUs a cache line is 64 bytes long, which corresponds to 64 / sizeof(double) = 8 array elements.
Edit: my mistake, should have watched the video in the first place. False sharing is explained just after the results from running the code are shown. If you don't get any speedup in your case, that's probably because newer CPU generations handle false sharing better.
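The same padding idea can be sketched with plain Java threads instead of OpenMP. This is an illustrative port, not the tutorial's code. The class name `PaddedPiSum`, the constant `PAD`, and the 64-byte cache line it encodes are assumptions. Each thread accumulates into slot `PAD * id`, so two threads' slots are always 64 bytes apart. No timing claim is made here, since a JIT compiler may keep the accumulator in a register regardless of the array layout.

```java
class PaddedPiSum {
    // Assumed 64-byte cache line divided by 8-byte doubles.
    static final int PAD = 8;

    static double computePi(final int numSteps, final int numThreads) {
        final double step = 1.0 / numSteps;
        final double[] sum = new double[PAD * numThreads]; // one padded slot per thread
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    // Cyclic distribution of iterations, as in the OpenMP version.
                    for (int i = id; i < numSteps; i += numThreads) {
                        double x = (i + 0.5) * step;
                        sum[PAD * id] += 4.0 / (1.0 + x * x);
                    }
                }
            });
            workers[t].start();
        }
        double pi = 0.0;
        for (int t = 0; t < numThreads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            pi += sum[PAD * t] * step; // read only the padded slots
        }
        return pi;
    }

    public static void main(String[] args) {
        System.out.println(computePi(100000, 2));
    }
}
```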

Convert For loop into Parallel.For loop

public void DoSomething(byte[] array, byte[] array2, int start, int counter)
{
int length = array.Length;
int index = 0;
while (count >= needleLen)
{
index = Array.IndexOf(array, array2[0], start, count - length + 1);
int i = 0;
int p = 0;
for (i = 0, p = index; i < length; i++, p++)
{
if (array[p] != array2[i])
{
break;
}
}

Given that your for loop appears to be using a loop body dependent on ordering, it's most likely not a candidate for parallelization.
However, you aren't showing the "work" involved here, so it's difficult to tell what it's doing. Since the loop relies on both i and p, and it appears that they would vary independently, it's unlikely to be rewritten using a simple Parallel.For without reworking or rethinking your algorithm.
For a loop body to be a good candidate for parallelization, it typically needs to be order independent and have no ordering constraints. The fact that you're basing your loop on two independent variables suggests that these requirements are not met in this algorithm.

Analyze "whistle" sound for pitch/note

I am trying to build a system that can process a recording of someone whistling and output the notes.
Can anyone recommend an open-source platform that I can use as the base for note/pitch recognition and analysis of wave files?
Thanks in advance

As many others have already said, FFT is the way to go here. I've written a little example in Java using FFT code from http://www.cs.princeton.edu/introcs/97data/. In order to run it, you will need the Complex class from that page also (see the source for the exact URL).
The code reads in a file, goes window-wise over it and does an FFT on each window. For each FFT it looks for the maximum coefficient and outputs the corresponding frequency. This does work very well for clean signals like a sine wave, but for an actual whistle sound you probably have to add more. I've tested with a few files with whistling I created myself (using the integrated mic of my laptop computer), the code does get the idea of what's going on, but in order to get actual notes more needs to be done.
1) You might need a more intelligent windowing technique. What my code uses now is a simple rectangular window. Since the FFT assumes that the input signal can be continued periodically, additional frequencies are detected when the first and the last sample in the window don't match. This is known as spectral leakage ( http://en.wikipedia.org/wiki/Spectral_leakage ). Usually one uses a window that down-weights samples at the beginning and the end of the window ( http://en.wikipedia.org/wiki/Window_function ). Although the leakage shouldn't cause the wrong frequency to be detected as the maximum, using a window will increase the detection quality.
2) To match the frequencies to actual notes, you could use an array containing the frequencies (like 440 Hz for a') and then look for the frequency that's closest to the one that has been identified. However, if the whistling is off standard tuning, this won't work any more. Given that the whistling is still correct but only tuned differently (like a guitar or other musical instrument can be tuned differently and still sound "good", as long as the tuning is done consistently for all strings), you could still find notes by looking at the ratios of the identified frequencies. You can read http://en.wikipedia.org/wiki/Pitch_%28music%29 as a starting point on that. This is also interesting: http://en.wikipedia.org/wiki/Piano_key_frequencies
3) Moreover it might be interesting to detect the points in time when each individual tone starts and stops. This could be added as a pre-processing step. You could do an FFT for each individual note then. However, if the whistler doesn't stop but just bends between notes, this would not be that easy.
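As an illustration of point 1, here is a minimal sketch of a Hann window, one widely used choice among the window functions on the linked page. It is not part of the code below. The class name `HannWindow` and its methods are hypothetical. The coefficients follow w[n] = 0.5 * (1 - cos(2*pi*n / (N - 1))), which tapers both ends of the frame toward zero.

```java
class HannWindow {
    // Hann coefficients for a frame of n samples (n >= 2).
    static double[] coefficients(int n) {
        if (n < 2) {
            throw new IllegalArgumentException("window length must be at least 2");
        }
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            w[i] = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
        }
        return w;
    }

    // Returns a windowed copy of the frame; the input is left untouched.
    static double[] apply(double[] frame) {
        double[] w = coefficients(frame.length);
        double[] out = new double[frame.length];
        for (int i = 0; i < frame.length; i++) {
            out[i] = frame[i] * w[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] w = coefficients(5);
        for (int i = 0; i < w.length; i++) {
            System.out.println(w[i]);
        }
    }
}
```

In an FFT loop like the one described, the windowed frame would be used in place of the raw samples when filling the complex input array.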
Definitely have a look at the libraries the others suggested. I don't know any of them, but maybe they contain already functionality for doing what I've described above.
And now to the code. Please let me know what worked for you, I find this topic pretty interesting.
Edit: I updated the code to include overlapping and a simple mapper from frequencies to notes. It works only for "tuned" whistlers though, as mentioned above.
package de.ahans.playground;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
public class FftMaxFrequency {
// taken from http://www.cs.princeton.edu/introcs/97data/FFT.java.html
// (first hit in Google for "java fft")
// needs Complex class from http://www.cs.princeton.edu/introcs/97data/Complex.java
public static Complex[] fft(Complex[] x) {
int N = x.length;
// base case
if (N == 1) return new Complex[] { x[0] };
// radix 2 Cooley-Tukey FFT
if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2"); }
// fft of even terms
Complex[] even = new Complex[N/2];
for (int k = 0; k < N/2; k++) {
even[k] = x[2*k];
}
Complex[] q = fft(even);
// fft of odd terms
Complex[] odd = even; // reuse the array
for (int k = 0; k < N/2; k++) {
odd[k] = x[2*k + 1];
}
Complex[] r = fft(odd);
// combine
Complex[] y = new Complex[N];
for (int k = 0; k < N/2; k++) {
double kth = -2 * k * Math.PI / N;
Complex wk = new Complex(Math.cos(kth), Math.sin(kth));
y[k] = q[k].plus(wk.times(r[k]));
y[k + N/2] = q[k].minus(wk.times(r[k]));
}
return y;
}
static class AudioReader {
private AudioFormat audioFormat;
public AudioReader() {}
public double[] readAudioData(File file) throws UnsupportedAudioFileException, IOException {
AudioInputStream in = AudioSystem.getAudioInputStream(file);
audioFormat = in.getFormat();
int depth = audioFormat.getSampleSizeInBits();
long length = in.getFrameLength();
if (audioFormat.isBigEndian()) {
throw new UnsupportedAudioFileException("big endian not supported");
}
if (audioFormat.getChannels() != 1) {
throw new UnsupportedAudioFileException("only 1 channel supported");
}
byte[] tmp = new byte[(int) length];
byte[] samples = null;
int bytesPerSample = depth/8;
int bytesRead;
while (-1 != (bytesRead = in.read(tmp))) {
if (samples == null) {
samples = Arrays.copyOf(tmp, bytesRead);
} else {
int oldLen = samples.length;
samples = Arrays.copyOf(samples, oldLen + bytesRead);
for (int i = 0; i < bytesRead; i++) samples[oldLen+i] = tmp[i];
}
}
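double[] data = new double[samples.length/bytesPerSample];
for (int i = 0; i + bytesPerSample <= samples.length; i += bytesPerSample) {
int sample = 0;
// little endian: mask the low bytes so they are not sign-extended; only the top byte carries the sign
for (int j = 0; j < bytesPerSample - 1; j++) sample |= (samples[i+j] & 0xFF) << (j*8);
sample |= samples[i+bytesPerSample-1] << ((bytesPerSample-1)*8);
data[i/bytesPerSample] = (double) sample / Math.pow(2, depth);
}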
return data;
}
public AudioFormat getAudioFormat() {
return audioFormat;
}
}
public class FrequencyNoteMapper {
private final String[] NOTE_NAMES = new String[] {
"A", "Bb", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"
};
private final double[] FREQUENCIES;
private final double a = 440;
private final int TOTAL_OCTAVES = 6;
private final int START_OCTAVE = -1; // relative to A
public FrequencyNoteMapper() {
FREQUENCIES = new double[TOTAL_OCTAVES*12];
int j = 0;
for (int octave = START_OCTAVE; octave < START_OCTAVE+TOTAL_OCTAVES; octave++) {
for (int note = 0; note < 12; note++) {
int i = octave*12+note;
FREQUENCIES[j++] = a * Math.pow(2, (double)i / 12.0);
}
}
}
public String findMatch(double frequency) {
if (frequency == 0)
return "none";
double minDistance = Double.MAX_VALUE;
int bestIdx = -1;
for (int i = 0; i < FREQUENCIES.length; i++) {
if (Math.abs(FREQUENCIES[i] - frequency) < minDistance) {
minDistance = Math.abs(FREQUENCIES[i] - frequency);
bestIdx = i;
}
}
int octave = bestIdx / 12;
int note = bestIdx % 12;
return NOTE_NAMES[note] + octave;
}
}
public void run (File file) throws UnsupportedAudioFileException, IOException {
FrequencyNoteMapper mapper = new FrequencyNoteMapper();
// size of window for FFT
int N = 4096;
int overlap = 1024;
AudioReader reader = new AudioReader();
double[] data = reader.readAudioData(file);
// sample rate is needed to calculate actual frequencies
float rate = reader.getAudioFormat().getSampleRate();
// go over the samples window-wise
for (int offset = 0; offset < data.length-N; offset += (N-overlap)) {
// for each window calculate the FFT
Complex[] x = new Complex[N];
for (int i = 0; i < N; i++) x[i] = new Complex(data[offset+i], 0);
Complex[] result = fft(x);
// find index of maximum coefficient
double max = -1;
int maxIdx = 0;
for (int i = result.length/2; i >= 0; i--) {
if (result[i].abs() > max) {
max = result[i].abs();
maxIdx = i;
}
}
// calculate the frequency of that coefficient
double peakFrequency = (double)maxIdx*rate/(double)N;
// and get the time of the start and end position of the current window
double windowBegin = offset/rate;
double windowEnd = (offset+(N-overlap))/rate;
System.out.printf("%f s to %f s:\t%f Hz -- %s\n", windowBegin, windowEnd, peakFrequency, mapper.findMatch(peakFrequency));
}
}
public static void main(String[] args) throws UnsupportedAudioFileException, IOException {
new FftMaxFrequency().run(new File("/home/axr/tmp/entchen.wav"));
}
}
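The table lookup in FrequencyNoteMapper can also be expressed in closed form by inverting the equal-temperament formula f = 440 * 2^(n/12) that the constructor uses. The following is a minimal sketch under that assumption. The class name `SemitoneMapper` and its methods are hypothetical. It picks the nearest note on a logarithmic scale, which can differ slightly from the linear nearest-frequency search above for pitches that fall between two notes.

```java
class SemitoneMapper {
    // Same naming order as the FrequencyNoteMapper table, starting at A.
    static final String[] NOTE_NAMES = {
        "A", "Bb", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"
    };

    // Number of semitones (rounded to nearest) between the frequency and A at 440 Hz.
    static int semitonesFromA440(double frequency) {
        double exact = 12.0 * Math.log(frequency / 440.0) / Math.log(2.0);
        return (int) Math.round(exact);
    }

    // Note name without octave; "none" for non-positive input.
    static String noteName(double frequency) {
        if (frequency <= 0) {
            return "none";
        }
        int n = semitonesFromA440(frequency);
        int index = ((n % 12) + 12) % 12; // keep the index non-negative for notes below A
        return NOTE_NAMES[index];
    }

    public static void main(String[] args) {
        System.out.println(noteName(440.0));
        System.out.println(noteName(261.63));
    }
}
```

The fractional part that `Math.round` discards is the deviation from the nearest note, which is the quantity of interest when the whistling is consistently off standard tuning.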

I think this open-source platform suits you:
http://code.google.com/p/musicg-sound-api/

Well, you could always use fftw to perform the Fast Fourier Transform. It's a very well respected framework. Once you've got an FFT of your signal you can analyze the resultant array for peaks. A simple histogram style analysis should give you the frequencies with the greatest volume. Then you just have to compare those frequencies to the frequencies that correspond with different pitches.

In addition to the other great options:
csound pitch detection: http://www.csounds.com/manual/html/pvspitch.html
fmod: http://www.fmod.org/ (has a free version)
aubio: http://aubio.org/doc/pitchdetection_8h.html

You might want to consider Python(x,y). It's a scientific programming framework for python in the spirit of Matlab, and it has easy functions for working in the FFT domain.

If you use Java, have a look at the TarsosDSP library. It has a pretty good ready-to-go pitch detector.
Here is an example for Android, but I think it doesn't require many modifications to use it elsewhere.

I'm a fan of the FFT but for the monophonic and fairly pure sinusoidal tones of whistling, a zero-cross detector would do a far better job at determining the actual frequency at a much lower processing cost. Zero-cross detection is used in electronic frequency counters that measure the clock rate of whatever is being tested.
If you are going to analyze anything other than pure sine wave tones, then FFT is definitely the way to go.
A very simple implementation of zero cross detection in Java on GitHub
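A minimal sketch of the zero-crossing idea, independent of the GitHub implementation mentioned above: count the upward zero crossings and divide the number of completed periods by the time between the first and last crossing. The class name `ZeroCrossingPitch` and its methods are hypothetical, and the input is assumed to be a clean, roughly sinusoidal mono signal with no DC offset.

```java
class ZeroCrossingPitch {
    // Estimates the frequency in Hz from upward zero crossings; returns 0 if fewer than two are found.
    static double estimateFrequency(double[] samples, double sampleRate) {
        int first = -1;
        int last = -1;
        int count = 0;
        for (int i = 1; i < samples.length; i++) {
            // Upward crossing: the previous sample is negative, the current one is not.
            if (samples[i - 1] < 0.0 && samples[i] >= 0.0) {
                if (first < 0) {
                    first = i;
                }
                last = i;
                count++;
            }
        }
        if (count < 2) {
            return 0.0;
        }
        // count - 1 full periods elapsed between the first and the last crossing.
        return (count - 1) * sampleRate / (last - first);
    }

    // Test-signal helper: n samples of a sine wave at the given frequency.
    static double[] sine(double frequency, double sampleRate, int n) {
        double[] s = new double[n];
        for (int i = 0; i < n; i++) {
            s[i] = Math.sin(2.0 * Math.PI * frequency * i / sampleRate);
        }
        return s;
    }

    public static void main(String[] args) {
        double[] tone = sine(1000.0, 44100.0, 44100);
        System.out.println(estimateFrequency(tone, 44100.0));
    }
}
```

Noise near the zero line can produce spurious crossings, so a real recording would likely need filtering or a small hysteresis threshold first.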
