multithreading or shared memory - Architecture - multithreading

There are 3 parts to my application:
A numerical simulator solving a 21-variable differential equation with a Runge-Kutta method taken directly from Numerical Recipes in C; the step size is 0.0001 s.
C code that pings a PIC-based microprocessor every 1 s and receives data at about 3600 samples per second over a USB COM port; it sends the relevant data to the front end over TCP/IP.
A Java front end that reads the data from the numerical simulator via SWIG (for the C code) and JNI, modifies the parameters with input from the microprocessor, and finally plots everything in the GUI.
I want to recode the Java front end in C++ now, with the option of using HTML/JavaScript for plotting.
Would rewriting the front end in C++ so that the numerical simulator runs on a separate thread be a good approach?
I don't understand threading well, though I have used it for the listening and plotting functions in the Java code. It seems like running everything on multiple threads instead of in separate processes would slow down my simulations.
Can I combine 1, 2 and 3 into a single program, or should they remain separate to retain the 0.0001 s simulation step and the ability to handle the large amount of microprocessor data?
Please help me pick a path forward!
Thanks in Advance!

On a multicore platform, multithreading will generally improve performance. However, general-purpose operating systems (GPOS) such as Linux and Windows are not deterministic, so there are no guarantees.
That said, a modern PC will hardly be stretched by this task and data rate, so perhaps it hardly matters?
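To make the threading option concrete, here is a minimal C++ sketch (not the poster's actual code) of how the simulator and the front end could live in one process: the solver runs in its own thread and pushes results into a mutex-protected queue that the GUI/plotting thread drains, so the numerical loop never blocks on plotting. Sample and simulateStep are placeholder names standing in for the Runge-Kutta state and step.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Sample { double t; std::vector<double> y; };   // placeholder for the 21 state variables

std::queue<Sample> samples;            // shared buffer between the two threads
std::mutex m;
std::condition_variable cv;
std::atomic<bool> running{true};

Sample simulateStep(double t) {        // placeholder for one Runge-Kutta step (dt = 0.0001 s)
    return Sample{t, std::vector<double>(21, 0.0)};
}

void simulatorThread() {
    double t = 0.0;
    while (running) {
        Sample s = simulateStep(t);    // heavy numerical work happens with no lock held
        t += 0.0001;
        { std::lock_guard<std::mutex> lock(m); samples.push(std::move(s)); }
        cv.notify_one();
    }
}

void frontEndThread() {
    while (running) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !samples.empty() || !running; });
        while (!samples.empty()) {
            Sample s = std::move(samples.front());
            samples.pop();
            lock.unlock();
            // plot s / send it to the GUI here
            lock.lock();
        }
    }
}

int main() {
    std::thread sim(simulatorThread);
    std::thread gui(frontEndThread);
    std::this_thread::sleep_for(std::chrono::seconds(1));   // stand-in for "run until the user quits"
    running = false;
    cv.notify_all();
    sim.join();
    gui.join();
}

The point of this layout is that the simulator thread only touches the queue for a few microseconds per step, so on a multicore machine it should run at essentially the same speed as a separate process would, while the JNI layer and the internal TCP/IP hop could go away.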

Related

Communication frequency vs Simulation Time for FMU

Let's say we have an FMU which gets its inputs from Python and simulates at an interval of 0.001 s. Does the FMI/FMU standard allow us to run the FMU multiple times for the same input (so Python provides the input at a 0.01 s interval and the FMU takes 10 internal steps per input)? Would that be faster, since we have cut the communication overhead to one tenth?
(For co-simulation FMUs:) Updating the inputs only every 10th step can be seen as a special co-simulation algorithm and is OK. Input variables keep their values until they are newly set.
This will only improve simulation speed if the internal calculation time (of a doStep) is small compared to the communication overhead.
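As an illustration of that pattern, here is a rough C++ sketch. The Fmu class with setInputs/doStep is a hypothetical wrapper invented for the example (not the FMI API itself); the point is simply that inputs are exchanged once per 0.01 s while the FMU internally advances ten times by 0.001 s.

#include <vector>

// Hypothetical wrapper around a co-simulation FMU; setInputs/doStep stand in
// for the real FMI calls (set the input variables, then advance one step).
struct Fmu {
    void setInputs(const std::vector<double>& u) { (void)u; /* forward to the FMU */ }
    void doStep(double t, double h) { (void)t; (void)h; /* advance the FMU by h */ }
};

int main() {
    Fmu fmu;
    const double h = 0.001;        // internal FMU step size
    const int substeps = 10;       // inputs refreshed only every 10th step (0.01 s)
    double t = 0.0;

    for (int k = 0; k < 1000; ++k) {               // 1000 input exchanges = 10 s of simulated time
        std::vector<double> u = {0.0};             // stand-in for the inputs provided by Python
        fmu.setInputs(u);                          // communicate once ...
        for (int i = 0; i < substeps; ++i) {       // ... then let the FMU take 10 internal steps
            fmu.doStep(t, h);
            t += h;
        }
    }
    return 0;
}

Whether this pays off depends, as noted above, on the doStep cost being small compared to the per-exchange communication overhead.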

Measuring Multiple Voltages in LabView w/USB 6001

I'm trying to set up my LabVIEW VI and my USB-6001 I/O box to read multiple independent voltages at once, while also outputting a single constant voltage.
I've successfully gotten my USB box to output the voltage I want while reading back a single voltage, but so far I've been unable to read back more than one voltage (and if I do, the two voltages seem to be co-dependent on one another in some way).
Here's a screenshot of my VI:
Everything to the right of the screenshot window should be unimportant to the question.
If anyone is curious, this is to drive multiple LVDT's and read back their respective voltages.
Thank you all for your help!
Look at your DAQ's manual, especially the pages I noted below.
http://www.ni.com/pdf/manuals/374259a.pdf
Page 11
All the AI channels get multiplexed, and the low-side reference can be switched (RSE vs. differential), so the two channels you're sampling require both of those to switch. It might be a settling issue, where the ADC takes a sample before the input value is stable.
To verify this, try using the same low-side configuration (differential or RSE) on both channels. Also try slowing down your sample rate (although your 1 kHz should already be slow enough...).
Page 14
Check this to make sure you have everything connected and grounded correctly.
Page 18
Check this for more details about switching between 2 sources quickly.
Perhaps you could try it using the DAQmx Express VIs:
http://www.ni.com/tutorial/2744/en/
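For reference, here is roughly the same configuration expressed against the NI-DAQmx C API (callable from C++); it is a sketch, not a drop-in replacement for the VI, and the device name, channel range and voltage limits are assumptions for the example. The key points are that both AI channels get the same terminal configuration and a modest per-channel rate, which is what the settling suggestion above amounts to.

#include <NIDAQmx.h>   // NI-DAQmx C API
#include <cstdio>

int main() {
    TaskHandle task = 0;
    DAQmxCreateTask("", &task);

    // Both AI channels share one multiplexed ADC, so give them the *same*
    // terminal configuration (here: differential) instead of mixing RSE/DIFF.
    DAQmxCreateAIVoltageChan(task, "Dev1/ai0:1", "", DAQmx_Val_Diff,
                             -10.0, 10.0, DAQmx_Val_Volts, NULL);

    // Modest rate; 1 kHz per channel leaves time for the input to settle.
    DAQmxCfgSampClkTiming(task, "", 1000.0, DAQmx_Val_Rising,
                          DAQmx_Val_ContSamps, 1000);

    DAQmxStartTask(task);

    float64 data[2000];            // 1000 samples x 2 channels
    int32   read = 0;
    DAQmxReadAnalogF64(task, 1000, 10.0, DAQmx_Val_GroupByChannel,
                       data, 2000, &read, NULL);
    std::printf("read %d samples per channel\n", (int)read);

    DAQmxStopTask(task);
    DAQmxClearTask(task);
    return 0;
}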

how to deal with large linnet object

I am trying to use a whole-city street network for a particular analysis, and I know it is very large. I have also set it up as a sparse network.
library(maptools)
library(rgdal)
StreetsUTM=readShapeSpatial("cityIN_UTM")
#plot(StreetsUTM)
library(spatstat)
SS_StreetsUTM =as.psp(StreetsUTM)
SS_linnetUTM = as.linnet(SS_StreetsUTM, sparse=TRUE)
> SS_linnetUTM
Linear network with 321631 vertices and 341610 lines
Enclosing window: rectangle = [422130.9, 456359.7] x [4610458,
4652536] units
> SS_linnetUTM$sparse
[1] TRUE
I have the following problems:
It took 15-20 minutes to build the psp object
It took almost 5 hours to build the linnet object
Every time I want to analyse it for a point pattern or an envelope, R crashes
I understand I should try to reduce the network size, but:
I was wondering if there is a smart way to overcome this problem. Would rescaling help?
How can I put more processing power behind it?
I am also curious to know whether spatstat can be used with the parallel package.
In the end, what are the limitations on network size for spatstat?
R crashes
R crashes when I use the instructions from the spatstat book:
KN <- linearK(spiders, correction="none")   # on my network (linnet) of course
envelope(spiders, linearK, correction="none", nsim=39)   # on my network
I do not think RAM is the problem: I have 16 GB of RAM and a 2.5 GHz dual-core i5 processor in a machine with an SSD.
Could someone guide me, please?
Please be more specific about the commands you used.
Did you build the linnet object from a psp object using as.linnet.psp (in which case the connectivity of the network must be guessed, and this can take a long time), or did you have information about the connectivity of the network that you passed to the linnet() command?
Exactly what commands to "analyse it for a point pattern or envelope" cause a crash, and what kind of crash?
The code for linear networks in spatstat is research code which is still under development. Faster algorithms for the K-function will be released soon.
I could only resolve this by simplifying my network in QGIS with the Douglas-Peucker algorithm in the Simplify Geometries tool, so it is a slight compromise on the geometry of the linear network in the shapefile.

Measuring time: differences among gettimeofday, TSC and clock ticks

I am doing some performance profiling for part of my program, and I am trying to measure the execution time with the following four methods. Interestingly, they show different results and I don't fully understand the differences. My CPU is an Intel(R) Core(TM) i7-4770 and the system is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function; the result is in seconds.
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Methods 3 and 4 use Intel's Performance Counter Monitor (PCM) API.
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (which I don't quite understand):
Computes the number core clock cycles when signal on a specific core is running (not halted)
Returns the number of used cycles (halted cycles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two sample runs generate results as follows (Method 1 is in seconds; Methods 2-4 are divided by the same number to show a per-item cost):
Method 1    Method 2    Method 3    Method 4
0.016489    0.533603    0.588103    4.15136
0.020374    0.659265    0.730308    5.15672
Some observations:
The ratio of Method 1 to Method 2 is very consistent, while the others are not, i.e. 0.016489/0.533603 ≈ 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yes, I read on the Internet that the current generation of Intel CPUs has this feature for rdtsc.)
Method 3 reports higher numbers than Method 2. I guess it is somehow different from the TSC, but what is it?
Method 4 is the most confusing one. It reports numbers an order of magnitude larger than Methods 2 and 3. Shouldn't it also be some kind of cycle count, especially given that it carries the "invariant" name?
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
The PCM API is great for fine-tuned performance measurement when you know exactly what you are doing, which generally means obtaining a variety of separate memory, core, cache, low-power, ... performance figures. Don't start messing with it if you are not sure exactly which services you need from it that you can't get from clock_gettime.
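A minimal C++ sketch of the monotonic-clock approach (Linux/POSIX; the loop being timed is just a stand-in for the profiled code section):

#include <ctime>
#include <cstdio>

// Returns the elapsed wall-clock time of fn() in seconds, using the POSIX
// monotonic clock (unlike gettimeofday(), it is not affected by clock jumps).
template <typename F>
double timeIt(F fn) {
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main() {
    volatile double x = 0.0;
    double s = timeIt([&] {                 // the code section being profiled
        for (int i = 0; i < 10000000; ++i) x = x + 1.0;
    });
    std::printf("wall time: %.6f s\n", s);
    // For CPU time of this process or thread, swap in CLOCK_PROCESS_CPUTIME_ID
    // or CLOCK_THREAD_CPUTIME_ID respectively.
    return 0;
}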

How to search for Possibilities to parallelize?

I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized a for loop within a for loop), and having done that I get some speedup. I am looking for more places/ideas/options to parallelize... I know this might sound a bit vague without much reference to the problem, but I am looking for generic ideas here which I can explore in my code.
Overview of the algorithm (the following algorithm is run over all levels of the image, starting with the smallest and increasing the width and height by 2 each time until you reach the actual height and width):
For all image pairs starting with the smallest pair
    For height = 2 to image_height - 2
        Create a 5 by image_width ROI of both the left and right images.
        For width = 2 to image_width - 2
            Create a 5 by 5 window of the left ROI centered around width and find the best match in the right ROI using NCC
            Create a 5 by 5 window of the right ROI centered around width and find the best match in the left ROI using NCC
            Disparity = current_width - best match
        The edge pixels that did not receive a disparity get the disparity of their neighbors
    For height = 0 to image_height
        For width = 0 to image_width
            Check smoothness, uniqueness and order constraints (parallelized separately)
    For height = 0 to image_height
        For width = 0 to image_width
            For disparities that failed the constraints, use the average disparity of neighbors that passed the constraints
    Normalize all disparities and output to screen
Just for some perspective, it may not always be worthwhile to parallelize something.
Just because you have a for loop where each iteration can be done independently of the others doesn't always mean you should parallelize it.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But if each iteration is extremely expensive (like in CirrusFlyer's example), then feel free to parallelize it.
More specifically, look for places where the overhead of setting up the parallel computation is small relative to the amount of work being parallelized.
Also, be careful about doing nested parallel_for loops, as this can get expensive. You may want to just stick with parallelizing the outer for loop.
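As a rough illustration, here is a sketch of parallelizing only the outer (row) loop of the disparity computation with tbb::parallel_for; computeDisparityRow and the flat disparity buffer are invented for the example and stand in for the per-row NCC work.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Hypothetical per-row worker: computes NCC-based disparities for one row.
void computeDisparityRow(int row, int imageWidth, std::vector<float>& disparity) {
    for (int col = 2; col < imageWidth - 2; ++col) {
        disparity[row * imageWidth + col] = 0.0f;   // placeholder for the NCC match
    }
}

int main() {
    const int imageHeight = 480, imageWidth = 640;
    std::vector<float> disparity(imageHeight * imageWidth, 0.0f);

    // Parallelize only the outer (row) loop; each task gets a chunk of at
    // least 16 rows so the scheduling overhead stays small.
    tbb::parallel_for(tbb::blocked_range<int>(2, imageHeight - 2, 16),
        [&](const tbb::blocked_range<int>& rows) {
            for (int row = rows.begin(); row != rows.end(); ++row)
                computeDisparityRow(row, imageWidth, disparity);
        });
    return 0;
}

With a grain size like this, each task does enough NCC work to amortize the scheduling overhead, which is the trade-off described above; nesting another parallel_for over the width loop would mostly add overhead.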
The silly answer is anything that is time-consuming or iterative. I use Microsoft's .NET v4.0 Task Parallel Library, and one of the interesting things about its design is "expressed parallelism," an interesting term for "attempted parallelism." Even though your code says "use the TPL here," if the host platform doesn't have the necessary cores it will simply invoke the old-fashioned serial code in its place.
I have begun to use the TPL in all my projects, especially any place there are loops (this requires that I design my classes and methods so that there are no dependencies between the loop iterations). And anywhere that might have been just good old-fashioned multithreaded code, I look to see whether it's something I can now place on different cores.
My favorite so far has been an application I have that downloads ~7,800 different URLs to analyze the contents of the pages and, if it finds the information it's looking for, does some additional processing... This used to take between 26 and 29 minutes to complete. My Dell T7500 workstation with dual quad-core 3 GHz Xeon processors, 24 GB of RAM, and Windows 7 Ultimate 64-bit now crunches the entire thing in about 5 minutes. A huge difference for me.
I also have a publish/subscribe communication engine that I have been refactoring to take advantage of the TPL (especially for "pushing" data from the server to clients... you may have 10,000 client computers that have registered interest in specific things, and once such an event occurs, I need to push data to all of them). I don't have this done yet, but I'm REALLY LOOKING FORWARD to seeing the results on this one.
Food for thought ...
