I have a loop in MATLAB which solves linear systems of equations. The total number of equations is about 500,000. Each system has 4 unknowns, but the number of equations per system varies between 10 and 50. In addition, the systems need to be solved over time (about 3000 timesteps).
for ivertex = 1 : nVertices
    b = obj.f( obj.neighIDs{ ivertex } );
    x = obj.coeffMatrix{ ivertex } \ b;
    obj.solution( ivertex , : ) = x( 1 : 3 );
end
I tried to use parfor to accelerate this loop but did not see any improvement. I also tried using cellfun to avoid looping over the systems with an explicit MATLAB for loop:
tmpSolution = cellfun( ...
    @(x, y) x \ obj.f( y ), ...
    obj.coeffMatrix, ...
    obj.neighIDs, ...
    'UniformOutput', false );
But each timestep takes about 11 seconds (3000 × 11 s ≈ 9 hours). Is there a way to make this faster? Is this problem GPU friendly? I am using MATLAB R2013b on 64-bit Windows 7 with 32 GB of RAM.
I am very new to OpenCL and am trying my first program. I implemented a simple sinc filtering of waveforms. The code works; however, I have two questions:
1. Once I increase the size of the input matrix (numrows needs to go up to 100 000) I get "clEnqueueReadBuffer failed: OUT_OF_RESOURCES", even though the matrix is relatively small (a few MB). I think this is to some extent related to the work-group size, but could someone elaborate on how I could fix this issue?
Could it be a driver issue?
UPDATE:
Leaving the work-group size as None crashes.
Adjusting the work-group size to (1,600) for the GPU and (1,50) for the Intel HD lets me go up to some 6400 rows. For larger sizes, the GPU crashes and the Intel HD just freezes and does nothing (0% in the resource monitor).
2. I have an Intel HD 4600 and an Nvidia K1100M GPU available; however, the Intel is ~2 times faster. I understand this is partly because the integrated Intel GPU shares memory with the host, so I don't need to copy my arrays the way I do for the external GPU. However, I expected the difference to be marginal. Is this normal, or should my code be better optimized for the GPU? (resolved)
Thanks for your help!
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import os
os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
import matplotlib.pyplot as plt
def resample_opencl(y, key='GPU'):
    #
    # selecting to run on GPU or CPU
    #
    newlen = 1200
    my_platform = cl.get_platforms()[0]
    device = my_platform.get_devices()[0]
    for found_platform in cl.get_platforms():
        if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):
            my_platform = found_platform
            device = my_platform.get_devices()[0]
            print("using GPU")
    #
    # Create context for GPU/CPU
    #
    ctx = cl.Context([device])
    #
    # Create queue for each kernel execution
    #
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    # queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, """
    __kernel void resample(
        int M,
        __global const float *y_g,
        __global float *res_g)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
        int gs = get_global_size(1);
        __private float tmp, tmp2, x;
        __private float t;
        t = (float)(col) / 2 + 1;
        tmp = 0;
        tmp2 = 0;
        for (int i = 0; i < M; i++)
        {
            x = (float)(i + 1);
            tmp2 = (t - x) * 3.14159;
            if (t == x) {
                tmp += y_g[row * M + i];
            }
            else
                tmp += y_g[row * M + i] * sin(tmp2) / tmp2;
        }
        res_g[row * gs + col] = tmp;
    }
    """).build()
    mf = cl.mem_flags
    y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
    res = np.zeros((np.shape(y)[0], newlen)).astype(np.float32)
    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)
    M = np.array(600).astype(np.int32)
    prg.resample(queue, res.shape, (1, 200), M, y_g, res_g)
    event = cl.enqueue_copy(queue, res, res_g)
    print("success")
    event.wait()
    return res, event

if __name__ == "__main__":
    #
    # this is the number i need to increase (up to some 100 000)
    numrows = 2000
    Gaussian = lambda t: 10 * np.exp(-(t - 50)**2 / (2. * 2**2))
    x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
    t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
    y = np.zeros((numrows, np.size(x)))
    y[:] = Gaussian(x).astype(np.float32)
    y = y.astype(np.float32)
    res, event = resample_opencl(y, 'GPU')
    print("OpenCl GPU profiler", (event.profile.end - event.profile.start) * 1e-9)
    #
    # test plot if it worked
    #
    plt.figure()
    plt.plot(x, y[1, :], '+')
    plt.plot(t, res[1, :])
Re 1.
Your newlen has to be divisible by 200 because that is what you set as the local dimensions: (1, 200). I increased it to 9600 and that still worked fine.
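As a reference, here is a quick sanity check you could drop in before the kernel launch (just a sketch reusing res and the (1, 200) work-group size from the code above): each global dimension must be an exact multiple of the corresponding local dimension, otherwise the launch can fail.
global_size = res.shape      # (numrows, newlen)
local_size = (1, 200)
assert all(g % l == 0 for g, l in zip(global_size, local_size)), \
    "global size must be a multiple of the work-group size"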
Update
After your update, I would suggest not specifying the local dimensions and letting the implementation decide:
prg.resample(queue, res.shape, None, M, y_g, res_g)
It may also improve performance if newlen and numrows were multiples of 16.
It is not a rule that an Nvidia GPU must perform better than an Intel GPU, especially since, according to Wikipedia, there is not a big difference in GFLOPS between them (549.89 vs 288–432). This GFLOPS comparison should be taken with a grain of salt, as one algorithm may be more suitable for one GPU than the other. In other words, looking at these numbers you may expect one GPU to typically be faster than the other, but that may vary from algorithm to algorithm.
Kernel for 100000 rows requires:
y_g: 100000 * 600 * 4 = 240000000 bytes ≈ 229 MB
res_g: 100000 * 1200 * 4 = 480000000 bytes ≈ 457.8 MB
The Quadro K1100M has 2 GB of global memory, which should be sufficient for processing 100000 rows. The Intel HD 4600, from what I found, is limited only by the system memory, so I suspect that shouldn't be a problem either.
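As a quick sanity check, the same arithmetic in a few lines of Python (this just reproduces the numbers above for float32 buffers; the variable names are illustrative):
import numpy as np

numrows, M, newlen = 100000, 600, 1200
itemsize = np.dtype(np.float32).itemsize                        # 4 bytes
print("y_g:  ", numrows * M * itemsize / 2.0**20, "MiB")        # ~228.9 MiB
print("res_g:", numrows * newlen * itemsize / 2.0**20, "MiB")   # ~457.8 MiB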
Re 2.
The time is not measured correctly. Instead of measuring the kernel execution time, the time of copying the data back to the host is being measured, so it is no surprise that this number is lower for the CPU. To measure the kernel execution time, do:
event = prg.resample(queue, res.shape, (1,200),M, y_g, res_g)
event.wait()
print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)
I don't know how to measure the whole thing, including copying the data back to the host, using OpenCL profiling events in pyopencl, but using plain Python gives similar results:
import time

start = time.time()
...  # code to be measured
end = time.time()
print(end - start)
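For example, wrapping the call from the question (illustrative only, assuming y has been built as in the main block above):
import time

start = time.time()
res, event = resample_opencl(y, 'GPU')   # kernel + copy back to host
end = time.time()
print("total wall-clock time:", end - start, "seconds")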
I think I figured out the issue:
Intel HD: turning off profiling fixes everything; I can run the code without any issues.
The K1100M GPU still crashes, but I suspect this might be the timeout issue, as I am using the same video card to drive my display.
I'm hoping someone with more VHDL experience can enlighten me! To summarise, I have an LCD entity and a Main entity which instantiates it. The LCD takes an 84-character wide string ("msg"), which seems to cause me huge problems as soon as I index it using a variable or signal, and I have no idea why. Since the string displays hex values and I read a 16-bit value each clock cycle, I need to update 4 characters of the string, one for each nybble of that 16-bit value. This doesn't need to be done in a single clock cycle, since a new value is only read after a large number of cycles; however, experimenting with incrementing a "t" variable and only changing one character of the string per "t" made no difference, for whatever reason.
The error is: "Error (170048): Selected device has 26 RAM location(s) of type M4K. However, the current design needs more than 26 to successfully fit."
Here is the compilation report with the problem:
Flow Status Flow Failed - Tue Aug 08 18:49:21 2017
Quartus II 64-Bit Version 13.0.1 Build 232 06/12/2013 SP 1 SJ Web Edition
Revision Name Revision1
Top-level Entity Name Main
Family Cyclone II
Device EP2C5T144C6
Timing Models Final
Total logic elements 6,626 / 4,608 ( 144 % )
Total combinational functions 6,190 / 4,608 ( 134 % )
Dedicated logic registers 1,632 / 4,608 ( 35 % )
Total registers 1632
Total pins 50 / 89 ( 56 % )
Total virtual pins 0
Total memory bits 124,032 / 119,808 ( 104 % )
Embedded Multiplier 9-bit elements 0 / 26 ( 0 % )
Total PLLs 1 / 2 ( 50 % )
The RAM summary table contains 57 rows of the form "LCD:display|altsyncram:Mux####_rtl_0|altsyncram_####:auto_generated|ALTSYNCRAM".
Here is the LCD entity:
entity LCD is
    generic(
        delay_time  : integer := 50000;
        half_period : integer := 7
    );
    port(
        clk  : in  std_logic;
        SCE  : out std_logic := '1';
        DC   : out std_logic := '1';
        RES  : out std_logic := '0';
        SCLK : out std_logic := '1';
        SDIN : out std_logic := '0';
        op   : in  std_logic_vector(2 downto 0);
        msg  : in  string(1 to 84);
        jx   : in  integer range 0 to 255 := 0;
        jy   : in  integer range 0 to 255 := 0;
        cx   : in  integer range 0 to 255 := 0;
        cy   : in  integer range 0 to 255 := 0
    );
end entity;
The following code is what causes the problem, where a, b, c and d are variables which are incremented by 4 after each read:
msg(a) <= getHex(data(3 downto 0));
msg(b) <= getHex(data(7 downto 4));
msg(c) <= getHex(data(11 downto 8));
msg(d) <= getHex(data(15 downto 12));
Removing some of these lines causes the memory and logic element usages to both drop, but they still seem absurdly high, and I don't understand the cause.
Replacing a, b, c and d with integers, like 1, 2, 3 and 4 causes the problem to go away completely, with the logic elements at 22%, and RAM usage at 0%!
If anybody has any ideas at all, I'd be very grateful! I will post the full code below in case anybody needs it... but be warned, it's a bit messy, and I feel like the problem could be simple. Many thanks in advance!
Main.vhd
LCD.vhd
There are a few issues here.
The first is that HDL synthesis tools do an awful lot of optimization. What this basically means is that if you don't properly connect up inputs and outputs to/from something, it is likely (but not certain) to be eliminated by the optimizer.
The second is that you have to be very careful with loops and functions. Basically, loops will be unrolled and functions will be inlined, so a small amount of code can generate an awful lot of logic.
The third is that under some circumstances arrays will be translated to memory elements.
As pointed out in a comment, this loop is the root cause of the large amount of memory usage:
for j in 0 to 83 loop
    for i in 0 to 5 loop
        pixels((j*6) + i) <= getByte(msg(j+1), i);
    end loop;
end loop;
This has the potential to use a hell of a lot of memory resources. Each call to "getByte" requires a read port on (parts of) "ram", but block RAMs only have two read ports, so "ram" gets duplicated to satisfy the need for more read ports. The inner loop reads different parts of the same location, so essentially each iteration of the outer loop needs an independent read port on the RAM. That works out to about 40 copies of the RAM and, reading the Cyclone II datasheet, each copy requires 2 M4K blocks, i.e. roughly 80 M4K blocks against the 26 available on the device.
So why doesn't this happen when you use numbers instead of the variables a,b,c and d?
If the compiler can figure out that something is a constant, it can compute it at compile time. This would limit the number of calls to "pixels" that actually have to be translated to memory blocks rather than just having their result hardcoded. Still, I'm surprised it's dropping to zero.
I notice your code doesn't actually have any inputs other than the clock and an "rx" input that doesn't actually seem to be used for anything, so it is quite possible that the synthesizer is figuring out a hell of a lot of stuff at build time. Often eliminating one bit of code allows another bit to be eliminated, until you have nothing left.
In a continuous model, how do I save the minimum value of a variable during the simulation?
When a simulation has finished, I want to display a variable T_min with a graphical annotation that shows me the lowest value of a temperature T during the simulation.
For example, if the simulated temperature T were a sine function, the desired result would be for T_min to follow T downward and then hold the lowest value reached so far.
In discrete code this would look something like this:
T_min := Modelica.Constants.inf "Start value";
if T < T_min then
  T_min := T;
else
  T_min := T_min;
end if;
... but I would like a continuous implementation to avoid sampling, a high number of events, etc.
I'm not sure if Rene's solution is optimal. It generates many state events, caused by the if conditions. Embedded in the following model:
model globalMinimum2
  Real T, T_min;
  Boolean is_true;
initial equation
  T_min = T;
equation
  T = time/10*sin(time);
  // if statement ensures that 'T_min' doesn't integrate downwards...
  // ... whenever der(T) is negative;
  if T < T_min then
    der(T_min) = min(0, der(T));
    is_true = true;
  else
    der(T_min) = 0;
    is_true = false;
  end if;
end globalMinimum2;
The simulation log is the following:
Integration started at T = 0 using integration method DASSL
(DAE multi-step solver (dassl/dasslrt of Petzold modified by Dynasim))
Integration terminated successfully at T = 50
WARNING: You have many state events. It might be due to chattering.
Enable logging of event in Simulation/Setup/Debug/Events during simulation
CPU-time for integration : 0.077 seconds
CPU-time for one GRID interval: 0.154 milli-seconds
Number of result points : 3801
Number of GRID points : 501
Number of (successful) steps : 2519
Number of F-evaluations : 4799
Number of H-evaluations : 18822
Number of Jacobian-evaluations: 2121
Number of (model) time events : 0
Number of (U) time events : 0
Number of state events : 1650
Number of step events : 0
Minimum integration stepsize : 1.44e-005
Maximum integration stepsize : 5.61
Maximum integration order : 3
Perhaps it is better to detect two events as given in the following example:
model unnamed_2
  Real T;
  Real hold;
  Real T_min;
  Boolean take_signal;
initial equation
  hold = T;
equation
  T = time/10*sin(time);
  when (T < pre(hold)) then
    take_signal = true;
    hold = T;
  elsewhen (der(T) >= 0) then
    take_signal = false;
    hold = T;
  end when;
  if (take_signal) then
    T_min = T;
  else
    T_min = hold;
  end if;
end unnamed_2;
The simulation log shows that this solution is more efficient:
Log-file of program ./dymosim
(generated: Tue May 24 14:13:38 2016)
dymosim started
... "dsin.txt" loading (dymosim input file)
... "unnamed_2.mat" creating (simulation result file)
Integration started at T = 0 using integration method DASSL
(DAE multi-step solver (dassl/dasslrt of Petzold modified by Dynasim))
Integration terminated successfully at T = 50
CPU-time for integration : 0.011 seconds
CPU-time for one GRID interval: 0.022 milli-seconds
Number of result points : 549
Number of GRID points : 501
Number of (successful) steps : 398
Number of F-evaluations : 771
Number of H-evaluations : 1238
Number of Jacobian-evaluations: 373
Number of (model) time events : 0
Number of (U) time events : 0
Number of state events : 32
Number of step events : 0
Minimum integration stepsize : 4.65e-006
Maximum integration stepsize : 3.14
Maximum integration order : 1
Calling terminal section
... "dsfinal.txt" creating (final states)
It seems I was able to find an answer to my own question simply by looking at the figure above.
The code is quite simple:
model globalMinimum
  Modelica.SIunits.Temperature T, T_min;
initial equation
  T_min = T;
equation
  // if statement ensures that 'T_min' doesn't integrate downwards...
  // ... whenever der(T) is negative;
  der(T_min) = if T < T_min then min(0, der(T)) else 0;
end globalMinimum;
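For anyone who wants to play with the idea outside of a Modelica tool, here is a rough forward-Euler sketch in Python (purely illustrative, using the same test signal T = time/10*sin(time) as the models above):
import numpy as np

dt = 1e-3
t = np.arange(0.0, 50.0, dt)
T = t / 10.0 * np.sin(t)                 # test signal from the examples above
dT = np.gradient(T, dt)                  # approximate der(T)

T_min = np.empty_like(T)
T_min[0] = T[0]                          # initial equation: T_min = T
for k in range(1, len(t)):
    # der(T_min) = min(0, der(T)) while T < T_min, otherwise 0
    dTmin = min(0.0, dT[k]) if T[k] < T_min[k - 1] else 0.0
    T_min[k] = T_min[k - 1] + dTmin * dt

print(T_min[-1], T.min())                # the two should roughly agree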
I am trying to get the Intel MKL version of PARDISO to work with multiple cores. I'm using it to solve a structurally symmetric system (mtype=1) with around 60K equations.
iparm= 0
iparm(1) = 1 !
iparm(2) = 3 !
iparm(3) = omp_get_max_threads() !
iparm(4) = 0 !
iparm(5) = 0 !
iparm(6) = 0 !
iparm(7) = 0 !
iparm(8) = 9 !
iparm(9) = 0 !
iparm(10) = 13
iparm(11) = 1
iparm(12) = 0
iparm(13) = 0
iparm(14) = 0
iparm(15) = 0
iparm(16) = 0
iparm(17) = 0
iparm(18) = -1
iparm(19) = -1
iparm(20) = 0
These are my iparm parameters. When compiling I have:
F90FLAGS = ${F77FLAGS} -I${SOLIDroot} -openmp -mkl=parallel -d-lines -debug
Before calling PARDISO I also set the number of threads available to MKL and OpenMP:
call mkl_set_num_threads(3)
call omp_set_num_threads(3)
call mkl_set_dynamic(0) ! disabling dynamic adjustment of the number of threads
As far as I understand, all MKL functions will try to use multiple threads, if allowed or enabled, for "sufficiently" large problems. I already have some parallelism using OpenMP and the code runs on several cores. The region from which I call PARDISO is serial. My question is: what else is needed to make PARDISO work with multiple cores?
I tried the default values for iparm, i.e. iparm(1) = 0, and there was no change.
I cannot add a comment (not enough reputation).
You can try setting the environment variable OMP_NUM_THREADS before you run the code (e.g. set OMP_NUM_THREADS=3 on Windows or export OMP_NUM_THREADS=3 in a Unix shell) and see if that works.
When using Applesoft BASIC on the Apple II with an 80-column card, is there a way to create double hi-res (DHR) graphics using only POKEs?
I have found a number of solutions using third-party extensions such as Beagle Graphics, but I really want to implement it myself. I've searched my Nibble magazine collection and my BASIC books, but have been unable to find any detailed information.
Wikipedia:
Double High-Resolution: The composition of the Double Hi-Res screen is very complicated. In addition to the 64:1 interleaving, the pixels in the individual rows are stored in an unusual way: each pixel was half its usual width and each byte of pixels alternated between the first and second bank of 64KB memory. Where three consecutive on pixels were white, six were now required in double high-resolution. Effectively, all pixel patterns used to make color in Lo-Res graphics blocks could be reproduced in Double Hi-Res graphics.
The ProDOS implementation of its RAM disk made access to the Double Hi-Res screen easier by making the first 8 KB file saved to /RAM store its data at 0x012000 to 0x013fff by design. Also, a second page was possible, and a second file (or a larger first file) would store its data at 0x014000 to 0x015fff. However, access via the ProDOS file system was slow and not well suited to page-flipping animation in Double Hi-Res, beyond the memory requirements.
Wikipedia says that DHR uses 64:1 interleaving, but gives no reference for the implementation. It also says you can use the /RAM disk for access, but again gives no reference for the implementation.
I am working on a small program that plots a simple version of Connet's Circle Pattern. Speed isn't really as important as resolution.
A member of the comp.sys.apple2.programmer newsgroup answered my question at: http://groups.google.com/group/comp.sys.apple2.programmer/browse_thread/thread/b0e8ec8911b8723b/78cd953bca521d8f
Basically, you map in the auxiliary memory from the 80-column card, then plot on the hi-res screen and POKE the DHR memory location for the pixel you are trying to light or darken.
The best full example routine is:
5 HGR : POKE 49237,0 : CALL 62450 : REM clear hires then hires.aux
6 POKE 49246,0 : PG = 49236
7 SVN = 7 : HCOLOR= SVN : P5 = .5
9 GOTO 100
10 X2 = X * 4 : CL = CO : TMP = 8 : FOR I = 3 TO 0 STEP -1 : BIT = CL >= TMP : CL = CL - BIT * TMP : TMP = TMP * P5
20 X1 = X + I: HCOLOR= SVN * BIT
30 XX = INT (X1 / SVN): H = XX * P5: POKE PG + (H= INT (H)),0
40 XX = INT (( INT (H) + (( X1 / SVN) - XX)) * SVN + P5)
50 HPLOT XX,Y: POKE PG, 0: NEXT : RETURN
100 FOR CO = 0 TO 15 : C8 = CO * 8
110 FOR X = C8 TO C8 + SVN: FOR Y = 0 TO 10 : GOSUB 10 : NEXT : NEXT
120 NEXT
130 REM color is 0 to 15
140 REM X coordinate is from 0 to 139
150 REM Y coordinate is from 0 to 191