Does recursion in Excel give a wrong result on AMD processors? - excel

I use iterative (circular-reference) calculation in Excel. I have just switched my laptop from Intel to AMD and noticed a very strange issue.
Enable iterative calculation with a maximum of 100 iterations.
Enter a self-referencing formula, for example =A1+1 in cell A1.
On my AMD laptop I get 700 (!) instead of the 100 I get on Intel (tried on several PCs). I couldn't try this on another AMD PC because there is none nearby. I tried turning off multithreaded calculation (only 1 core) and got the same result: 700.
This issue appears only when the formula contains constants. If the formula uses only cell references, we get the same result as on Intel.
What could be causing this?

Related

RK3399 yields invalid MIPI-DSI signal with correct timings and valid signal with incorrect timings

We build our own Yocto environment and distribution using a 5.10.119 kernel and mesa 20.3.
Currently we are trying to get a MIPI-DSI (ILI9881C) screen up and running on an SOM-RK3399v2 (Friendly-Elec) but are having some trouble: we can get the screen to display an image, but it is shifted by roughly 100px. This shift depends heavily on the MIPI clock used (mbps).
All timings and clocks are correct and triple-checked with the screen vendor. We tried many configurations, did several hardware revisions of our mainboard and even tested the SOM-RK3399-Eval board from Friendly-Elec. All show the same behavior.
By accident, we found a configuration that actually works for the screen. However, on a mathematical basis, these settings should never work, but they do!
The screen vendor supplied us with the following timings:
H 800
HSW 20
HBP 20
HFP 20
V 1280
VSW 10
VBP 20
VFP 10
PCLK 68112 (60fps)
The driver implementation for the RK3399 MIPI-DSI selects mbps=510 for these timings.
But using these values results in the shifted image:
Notice how y100 is directly at the top of the screen, rather than 100px from the bottom.
After some trial and error, we found a configuration that works for the screen but shouldn't:
H 800
HSW 33
HBP 500
HFP 500
V 1280
VSW 10
VBP 20
VFP 10
PCLK 145173 (60fps)
MBPS 457
As you can see, those timings are ridiculously off the charts and the PCLK does not fit the (hardcoded) MBPS of 457. However, the screen shows a correct and nicely aligned image without any flickering whatsoever.
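As a cross-check, the arithmetic behind both PCLK values can be sketched in a few lines of Python. The function names are ours, and the 24 bits per pixel and 4 data lanes are assumptions that may not match the actual panel configuration. The lane rate computed here is only the raw payload rate, without any driver margin or protocol overhead.

```python
def pixel_clock_khz(h, hsw, hbp, hfp, v, vsw, vbp, vfp, fps=60):
    """Pixel clock in kHz: total pixels per frame (active + blanking) times frame rate."""
    htotal = h + hsw + hbp + hfp
    vtotal = v + vsw + vbp + vfp
    return htotal * vtotal * fps // 1000


def min_lane_mbps(pclk_khz, bpp=24, lanes=4):
    """Raw per-lane payload rate in Mbps.

    bpp=24 and lanes=4 are assumptions, not values taken from the question.
    """
    return pclk_khz * bpp / lanes / 1000


vendor = pixel_clock_khz(800, 20, 20, 20, 1280, 10, 20, 10)
working = pixel_clock_khz(800, 33, 500, 500, 1280, 10, 20, 10)

print(vendor, min_lane_mbps(vendor))
print(working, min_lane_mbps(working))
```

Both timing sets reproduce the PCLK values quoted above (68112 and 145173 kHz). Under the stated assumptions, the vendor timings need roughly 409 Mbps per lane, while the "working" timings would need roughly 871 Mbps, well above the hardcoded 457.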
We diagnosed this further and found that, with the correct timings, the RK3399 sends a strange or even malformed MIPI-DSI data stream to the screen when switching from LP (low-power) to HS (high-speed) mode, whereas with the incorrect timings it sends a perfectly fine LP-to-HS data stream:
Observe how, with the correct timings, the signal is held high far too long (about 3 times as long) and the switch from low-power to high-speed appears corrupted.
Therefore, we assume that the shift in our image is related to that incorrect LP-to-HS transition.
Have you seen such behavior before? Do you know what could yield this behavior? It seems like it does 3 blankings instead of 1?

Numexpr for Python returning all zero arrays on certain hardware configurations

I've recently discovered what appears to be a bug in Numexpr. Although I've already opened an issue on their GitHub, I figured I would avail myself of the collective wisdom here as well.
In a nutshell, evaluate sometimes (unpredictably) returns incorrect results when doing a straightforward array operation. The bug, which can be reproduced by the Python code below, results in a zero array being returned rather than the correct result. Although the sample code shows a multiplication, this bug has manifested for us on addition and exponentiation as well. Notably, there are no errors or warnings that are raised by Numexpr, the computational load appears normal (i.e. the RAM and CPU are taxed as expected when monitoring task manager), and the correct shape array is returned. It was a rather insidious bug to isolate for those reasons! In our tests, this bug has only manifested in the following hardware builds:
Windows Server 2012 R2, Intel Xeon 2680 v3, 2 processors, 48 logical cores
Windows 8.1, Intel Xeon 2690, 1 processor, 24 logical cores
In all the many thousands of runs of our software completed on our Windows 7, 64 bit, Intel i7 machines, this has never manifested. Furthermore, we have run the attached code many times (with bigger arrays and more iterations) and have not seen the error on the Windows 7, i7 machines. The Xeon computers, though, manifest it regularly. Unfortunately we don't have any other builds on which to test.
Other items of note:
We are running from the WinPython distribution 3.4.3.6.
We have not invoked any supporting Numexpr functions, just evaluate... so we are using its default settings.
The version of Numexpr is 2.4.4, as included in WinPython 3.4.3.6
Sample Code:
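import numpy as np
import numexpr as ne

# Array sizes must be integers; np.ones(1e6) raises a TypeError on newer NumPy.
x = np.ones(1000000)
y = np.ones(1000000)

for ii in range(1000):
    rr = ne.evaluate('x * y')
    # The product of two all-ones arrays should never be all zeros.
    test = np.all(rr == 0)
    if test:
        print('Gotcha! %d' % ii)
print('Complete!')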

multithreading or shared memory - Architecture

There are 3 parts to my application:
A numerical simulator solving a 21-variable differential equation by the Runge-Kutta method, taken directly from Numerical Recipes in C; the step size is 0.0001 s.
A C program pinging a PIC-based microprocessor every 1 s and receiving data at about 3600 samples per second over the USB-COM port. It sends the relevant data to the front end over TCP/IP.
A Java front end reading the data from the numerical simulator via SWIG (for the C code) and JNI, modifying the parameters with input from the microprocessor and finally plotting it to the GUI.
I now want to recode the Java front end in C++, with the option of using HTML/JavaScript for plotting.
Would rewriting the front end in C++ so that the numerical simulator runs on a separate thread be a good approach?
I don't really understand threading, though I have used it for the listening and plotting functions in the Java code. It seems like having it all run on multiple threads instead of separate processes would slow down my simulations.
Can I combine 1, 2 and 3 into a single program, or should they remain separate to retain the 0.0001 s simulation step and the ability to handle the large amount of microprocessor data?
Please help me pick a path forward!
Thanks in Advance!
On a multicore platform, multithreading will generally improve performance. However, general-purpose operating systems (GPOS) such as Linux and Windows are not deterministic, so there are no guarantees.
That said, a modern PC has so much computational performance that this task and data rate will hardly stretch it, so the choice probably matters very little.
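The "simulator on its own thread" layout asked about could look roughly like the sketch below. It is written in Python only to keep it short; the function names, the one-variable equation dx/dt = -x and the explicit Euler step are all stand-ins for the real 21-variable Runge-Kutta solver. Note that Python threads do not run CPU-bound code in parallel, so this only illustrates the structure; in C++ the same shape with std::thread and a thread-safe queue would actually use a second core.

```python
import queue
import threading


def simulate(steps, dt, out):
    """Worker thread: integrate dx/dt = -x with explicit Euler (stand-in solver)."""
    x = 1.0
    for i in range(steps):
        x += dt * (-x)
        # Hand a sample to the front end every 1000 steps instead of every step.
        if (i + 1) % 1000 == 0:
            out.put((round((i + 1) * dt, 6), x))
    out.put(None)  # sentinel: simulation finished


def run(steps=10000, dt=0.0001):
    """Front-end side: start the simulator thread and consume its samples."""
    samples = queue.Queue()
    worker = threading.Thread(target=simulate, args=(steps, dt, samples))
    worker.start()
    received = []
    while True:
        item = samples.get()  # blocks until the simulator produces something
        if item is None:
            break
        received.append(item)  # a real front end would plot here
    worker.join()
    return received


if __name__ == "__main__":
    for t, x in run():
        print(t, x)
```

The queue is the only shared object, so the simulator never waits on plotting code and the front end never touches solver state directly.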

Can you write to [PC]?

According to the DCPU specification, the only time a SET instruction fails is if the "a" value is a literal.
So would the following work?
SET [PC],0x1000
A more useful version would be setting an offset of PC, so a rather strange infinite loop would be:
SET [PC+0x2],0x89C3 ; = SUB PC,0x2
Probably (I think it should work, but I didn't try it).
This is called "self-modifying" code and was quite common in the 8-bit era because of a) limited RAM and b) limited code size. Code like that is very powerful but error-prone. If your code base grows, this can quickly become a maintenance nightmare.
Famous use cases:
Windows 95 used code like this to build graphics rendering code on the stack.
Viruses and trojans use this as an attack vector (writing code on the stack or manipulating return addresses to simulate a JMP).
Simulating switch statements on the C64.
There's no value for [PC], so I'm guessing you need to do it in a roundabout way by storing PC in something you can use as a pointer (register or memory).
SET A , PC
SET [A+3], 0x8dc3 ; SUB PC, 3 (if A can't be changed from outside SUB PC,2 works too.)
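For anyone wondering where the constants 0x89C3 and 0x8dc3 come from, here is a small Python sketch of the instruction encoding. It assumes the DCPU-16 specification version 1.1 word layout (bbbbbbaaaaaaoooo) and its value codes; the helper names are ours, and later spec versions use a different layout.

```python
# Assumed from DCPU-16 spec 1.1: basic opcodes and the value code for PC.
OPCODES = {"SET": 0x1, "ADD": 0x2, "SUB": 0x3}
PC = 0x1C


def short_literal(n):
    """Value code for an inline literal 0x00..0x1f (codes 0x20..0x3f)."""
    if not 0 <= n <= 0x1F:
        raise ValueError("inline literals must be in the range 0..31")
    return 0x20 + n


def encode(op, a, b):
    """Pack one basic instruction: b in the top 6 bits, a in the next 6, opcode in the low 4."""
    return (b << 10) | (a << 4) | OPCODES[op]


print(hex(encode("SUB", PC, short_literal(2))))  # SUB PC, 2
print(hex(encode("SUB", PC, short_literal(3))))  # SUB PC, 3
```

Under that layout, SUB PC, 2 encodes to 0x89c3 and SUB PC, 3 to 0x8dc3, matching the words written by the two snippets above.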

How to search for Possibilities to parallelize?

I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized a for within a for loop), and having done that I now get some speedup. I am looking for more places/ideas/options to parallelize. I know this might sound a bit vague without much reference to the problem, but I am looking for generic ideas here which I can explore in my code.
Overview of the algorithm (it is run over all levels of the image, starting with the smallest and increasing the width and height by 2 each time until the actual height and width are reached):
For all image pairs starting with the smallest pair
    For height = 2 to image_height - 2
        Create a 5 by image_width ROI of both left and right images.
        For width = 2 to image_width - 2
            Create a 5 by 5 window of the left ROI centered around width and find best match in the right ROI using NCC
            Create a 5 by 5 window of the right ROI centered around width and find best match in the left ROI using NCC
            Disparity = current_width - best match
    The edge pixels that did not receive a disparity gets the disparity of its neighbors
    For height = 0 to image_height
        For width = 0 to image_width
            Check smoothness, uniqueness and order constraints*(parallelized separately)
    For height = 0 to image_height
        For width = 0 to image_width
            For disparity that failed constraints, use the average disparity of neighbors that passed the constraints
    Normalize all disparity and output to screen
Just for some perspective: it may not always be worthwhile to parallelize something.
Just because each iteration of a for loop can be done independently of the others doesn't mean you should parallelize it.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But if each iteration is extremely expensive (like in CirrusFlyer's example), then feel free to parallelize it.
More specifically, look for places where the overhead of running in parallel is small relative to the work being parallelized.
Also, be careful about nesting parallel_for loops, as this can get expensive. You may want to stick with parallelizing just the outer for loop.
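To illustrate "parallelize only the outer loop", here is a rough Python sketch, not TBB code. The function names are made up, and the per-pixel absolute difference is a deliberately simplified stand-in for the 5 by 5 NCC window match. Each image row becomes one task and the inner loop over the width stays serial. Python threads won't speed up pure-Python arithmetic, so read this purely as the shape a single outer parallel_for would take.

```python
from concurrent.futures import ThreadPoolExecutor


def row_disparity(job):
    """Serial inner loop: best horizontal shift for every pixel of one row."""
    left_row, right_row, max_shift = job
    result = []
    for x, value in enumerate(left_row):
        best_shift, best_cost = 0, None
        for shift in range(max_shift + 1):
            if x - shift < 0:
                break
            # Stand-in matching cost; the real algorithm would use NCC on a window.
            cost = abs(value - right_row[x - shift])
            if best_cost is None or cost < best_cost:
                best_shift, best_cost = shift, cost
        result.append(best_shift)
    return result


def disparity_map(left, right, max_shift=4, workers=4):
    """Outer loop over rows only: one independent task per row."""
    jobs = [(l_row, r_row, max_shift) for l_row, r_row in zip(left, right)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row_disparity, jobs))


if __name__ == "__main__":
    left = [[0, 0, 9, 0, 0]] * 3
    right = [[9, 0, 0, 0, 0]] * 3
    print(disparity_map(left, right))
```

Because rows share no state, no locking is needed, and the task count equals the image height, which keeps scheduling overhead small relative to the per-row work.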
The silly answer is anything that is time-consuming or iterative. I use Microsoft's .NET v4.0 Task Parallel Library (TPL), and one of the interesting things about their setup is its "expressed parallelism", an interesting term for "attempted parallelism". Though your code may say "use the TPL here", if the host platform doesn't have the necessary cores it will simply run the old-fashioned serial code in its place.
I have begun to use the TPL on all my projects, especially in any place where there are loops (this requires that I design my classes and methods so that there are no dependencies between the loop iterations). But any place that might have been just good old-fashioned multithreaded code, I look to see if it's something I can place on different cores now.
My favorite so far has been an application that downloads ~7,800 different URLs to analyze the contents of the pages and, if it finds the information it's looking for, does some additional processing. This used to take between 26 and 29 minutes to complete. My Dell T7500 workstation with dual quad-core Xeon 3GHz processors, 24GB of RAM and Windows 7 Ultimate 64-bit edition now crunches the entire thing in about 5 minutes. A huge difference for me.
I also have a publish/subscribe communication engine that I have been refactoring to take advantage of the TPL, especially for "push" data from the server to clients: you may have 10,000 client computers that have stated their interest in specific things, and once that event occurs, I need to push data to all of them. I don't have this done yet, but I'm really looking forward to seeing the results on this one.
Food for thought ...
