I am new to programming; I am actually a mechanical engineer. For my research I have written a Fortran routine for modelling a process.
This routine is quite slow, both because it was written by me (so it is not computationally perfect) and because it performs many iterations to reach convergence, so it needs time.
But I have a 6-core CPU, and I think that if I could exploit all of the cores the routine would run faster than it does now.
The routine is like this:
PROGRAM my_routine
INCLUDE 'dimensions_of_arrays.dim'
INCLUDE 'subroutines.sub'
INCLUDE 'subroutines2.sub'
DECLARATION OF VARIABLES
..
.
DO LOOP OVER MANY STEPS
.
CALL MANY SUBROUTINES
.
.
.
PERFORM SOME ITERATION
END DO
.
WRITE RESULTS
END
In the subroutines file 'subroutines.sub' I have more than 20 subroutines, like this:
SUBROUTINE xxx(a,b)
INCLUDE 'dimensions_of_arrays.dim'
DECLARATION OF VARIABLES
COMMON/PATH1/PATH2/G,J,K
.
.
SOME CALCULATION
.
END
In the file 'dimensions_of_arrays.dim' there are COMMON blocks and PARAMETER statements used during compilation.
In your opinion, is it possible to use multiple processors with this routine, while trying not to modify it "heavily"?
I use Intel Composer XE 2011 with Visual Studio 2010 to compile the code.
Any help is much appreciated.
Thanks
Since you are using Intel Fortran, I suggest that your first step should be to add the automatic parallelization option. In Visual Studio on Windows this is project property Fortran > Optimization > Parallelization > Yes. While you're at it, I suggest setting option /QxHost. I don't remember if the old version you're using supports this as a project property - if it does, it would be Fortran > Code Generation > Intel Processor-Specific Optimization > Same as the host processor. Of course, you should be building a Release configuration to enable optimization.
This may give you enough of a performance boost to be satisfactory. If not, the next step I would suggest is to turn on optimization diagnostics and see what they say about why certain loops could not be parallelized.
You are using a quite old version of the compiler - newer versions are much better at parallelization and optimization, and I'd recommend you use the latest you have access to. If none of this produces the results you want, then I agree you'll need to "get your hands dirty" and add OpenMP directives, but this will require a good understanding of how the program works and of which variables should be shared and which private. An intermediate step would be to use the Intel parallelization directives, but these aren't much different from OpenMP.
When converting a serial program to parallel, especially an old Fortran code, you have to be very careful when it comes to global variables (COMMONs usually). These can either block parallelization or lead to incorrect results. The Intel Inspector XE tool (part of larger Intel Parallel Studio XE editions) can be good at finding these for you.
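To make the shared/private distinction concrete, here is a minimal OpenMP sketch. It is in C for brevity - the Fortran form uses !$OMP PARALLEL DO directives with the same SHARED/PRIVATE/REDUCTION clauses - and the loop body is just a stand-in for a real calculation:

/* compile with: gcc -fopenmp omp_demo.c (Intel: icc /Qopenmp) */
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    double scratch;   /* per-iteration temporary */
    int i;

    /* 'scratch' must be private to each thread; 'sum' is a reduction.
       A global variable (e.g. a COMMON variable in Fortran) that is
       written by several iterations while left shared would be a data
       race - exactly the problem described above. */
    #pragma omp parallel for private(scratch) reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        scratch = (double)i * 0.5;
        sum += scratch;
    }

    printf("sum = %f\n", sum);
    return 0;
}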
I know questions like this have been asked before, but none has the exact answer I'm searching for.
Today I took part in an ACM-ICPC contest with my team. We usually use the GNU C++ 4.8.1 compiler (which was available in the contest). We had written code which exceeded the time limit on test case 10. At the end of the contest, with less than 2 minutes remaining, I sent exactly the same submission with Visual C++ 2013 (same source file, different language setting), and it got accepted and worked. There were more than 60 test cases, and our code passed them all.
Once more: there were no differences between the source codes.
Now I'm just curious why this happened.
Does anyone know what the reason is?
Without knowing the exact compiler options you used, this question is difficult to answer. Usually, compilers come with many options and provide some default values which are used as long as the user does not override them. This is also true for code optimization options. Both of the compilers mentioned are capable of significantly improving the speed of the generated binary when told to do so. A wild guess would be that in your case the optimization settings used by the GNU compiler did not improve the executable's performance much, but the VC++ settings did - for example because no optimization flags were passed in one case. Another wild guess would be that one compiler generated a debug binary and the other did not (check for the option -g with GCC, which switches debug symbol generation on).
On the other hand, depending on the program you created, it could of course be that VC++ was simply better at performing the optimization than g++.
If you are interested in an easy performance increase, have a look at the high-level optimization flags at https://gcc.gnu.org/onlinedocs/gnat_ugn/Optimization-Levels.html or, for the full story, at the complete list at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
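As a quick, rough way to see how much the optimization level alone can matter, here is a small self-contained benchmark (timings are illustrative only; compile it once as `gcc -O0 bench.c` and once as `gcc -O2 bench.c` and compare):

#include <stdio.h>
#include <time.h>

int main(void) {
    const long N = 100000000L;
    double sum = 0.0;
    clock_t start = clock();

    for (long i = 0; i < N; i++)
        sum += (double)i * 0.5;   /* simple work the optimizer can speed up */

    clock_t end = clock();
    /* print the result so the compiler cannot remove the loop entirely */
    printf("sum = %f, cpu time = %f s\n",
           sum, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}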
More input on comparing compilers:
http://willus.com/ccomp_benchmark2.shtml?p1
http://news.dice.com/2013/11/26/speed-test-2-comparing-c-compilers-on-windows
I have found a piece of code (a function) in a library which could be improved by compiler optimization (the main idea being to find good material for digging deeper into compilers). I want to automate the measurement of this function's execution time with a script. As it is a low-level function in the library and takes arguments, it is difficult to extract it on its own. So I want to find a way to measure exactly this function (precise CPU time) without modifying the library, application or environment. Do you have any ideas how to achieve that?
I could write a wrapper, but in the near future I will need many more applications for performance testing, and I think writing a wrapper for every one is very ugly.
P.S.: My code will run on the ARM (armv7el) architecture, which has some kind of "Performance Monitor Control" registers. I have read about "perf" in the Linux kernel, but I don't know whether it is what I need.
It is not clear whether you have access to the source code of the function you want to profile or improve, i.e. whether you are able to recompile the library in question.
If you are using a recent GCC (that is, 4.6 at least) on a recent Linux system, you could use profilers like gprof (assuming you are able to recompile the library) or, better, oprofile (which you can use without recompiling), and you could customize GCC for your needs.
Be aware that like any measurements, profiling may alter the observed phenomenon.
If you are considering customizing the GCC compiler for optimization purposes, consider making a GCC plugin, or better yet, a MELT extension, for that purpose (MELT is a high-level domain specific language to extend GCC). You could also customize GCC (with MELT) for your own specific profiling purposes.
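Going back to the measurement itself: the manual-instrumentation baseline that profilers like gprof automate looks roughly like this - a sketch using clock_gettime() with CLOCK_PROCESS_CPUTIME_ID on Linux, where the_function_under_test is a hypothetical stand-in for the library function you want to measure:

#include <stdio.h>
#include <time.h>

/* hypothetical stand-in for the library function being measured */
static long the_function_under_test(long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += i % 7;
    return acc;
}

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    long result = the_function_under_test(10000000L);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* on older glibc you may need to link with -lrt */
    printf("result = %ld, cpu time = %f s\n", result, secs);
    return 0;
}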
I currently have a parallel for loop similar to this:
int testValues[16] = {5,2,2,10,4,4,2,100,5,2,4,3,29,4,1,52};
parallel_for (1, 100, 1, [&](int i){
    int var4;
    int values[16] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
    /* ...nested for loops */
    for (var4 = 0; var4 < 16; var4++) {
        if (values[var4] != testValues[var4]) break;
    }
    /* ...end nested loops */
});  // note the closing ");" - the lambda is the last argument of parallel_for
I have optimised as much as I can; the only thing more I can do is add more resources.
I am interested in utilising the GPU to help process the task in parallel. I have read that embarrassingly parallel tasks like this can make use of a modern GPU quite effectively.
Using any language, what is the easiest way to use the GPU for a simple parallel for loop like this?
I know nothing about GPU architectures or native GPU code.
As Li-aung Yip said in the comments, the simplest way to use a GPU is with something like Matlab, which supports array operations and automatically (more or less) moves them to the GPU. But for that to work you need to rewrite your code as pure matrix-based operations.
Otherwise, most GPU use still requires coding in CUDA or OpenCL (you would need to use OpenCL with an AMD card). Even if you use a wrapper for your favourite language, the actual code that runs on the GPU is still usually written in OpenCL (which looks vaguely like C), so this requires a fair amount of learning/effort. You can start by downloading OpenCL from AMD and reading through the docs; a small sketch of what it looks like follows below.
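To give a feel for what that involves, here is a minimal OpenCL 1.x sketch in C (host code plus kernel); the kernel just doubles an array rather than doing your actual comparison, and error checking and resource release are omitted for brevity:

#include <stdio.h>
#include <CL/cl.h>

/* the kernel source: this is what actually runs on the GPU,
   once per work-item */
static const char *src =
    "__kernel void scale(__global float *a) {  \n"
    "    size_t i = get_global_id(0);          \n"
    "    a[i] = a[i] * 2.0f;                   \n"
    "}                                         \n";

int main(void) {
    float data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (float)i;

    /* boilerplate: pick a platform and device, create context and queue */
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    /* the kernel is compiled from source at run time */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", &err);

    /* copy the data to the device, run 1024 work-items, read it back */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof data, data, &err);
    clSetKernelArg(k, 0, sizeof buf, &buf);
    size_t global = 1024;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);

    printf("data[2] = %f\n", data[2]);  /* expect 4.0 */
    return 0;
}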
Both of those options require learning new ideas, I suspect. What you really want, I think, is a high-level, but still traditional-looking, language targeted at the GPU. Unfortunately, such languages don't seem to exist much yet. The only example I can think of is Theano - you might try that. Even there, you still need to learn Python/NumPy, and I am not sure how solid the Theano implementation is, but it may be the least painful way forward (in that it allows a "traditional" approach - using matrices is in many ways easier, but some people seem to find that very hard to grasp conceptually).
P.S. It's not clear to me that a GPU will help with your problem, by the way.
You might want to check out ArrayFire.
http://www.accelereyes.com/products/arrayfire
If you use OpenCL, you need to download separate implementations for the different device vendors: Intel, AMD, and Nvidia.
You might want to look into OpenACC, which enables parallelism via directives. You can port your code (C/C++/Fortran) to heterogeneous systems while maintaining a source that still runs well on a homogeneous system. Take a look at this introduction video. OpenACC is not GPU programming as such, but a way of expressing parallelism in your code, which may help you achieve performance improvements without too much knowledge of low-level languages such as CUDA or OpenCL. OpenACC is available in commercial compilers from PGI, Cray, and CAPS (PGI offers new users a free 30-day trial); a sketch of what it looks like follows below.
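As a hedged sketch of the directive style, here is the comparison loop from the question annotated with OpenACC in C. The inner work is only a stand-in for your nested loops; with a supporting compiler (e.g. PGI's) the pragma offloads the loop, while other compilers simply ignore it and run the code serially:

#include <stdio.h>

int main(void) {
    int testValues[16] = {5,2,2,10,4,4,2,100,5,2,4,3,29,4,1,52};
    int match[100] = {0};

    /* each outer iteration is independent, so it can be offloaded;
       copyin/copyout describe how data moves to and from the device */
    #pragma acc parallel loop copyin(testValues) copyout(match)
    for (int i = 0; i < 100; i++) {
        int values[16];
        for (int j = 0; j < 16; j++)
            values[j] = (i + j) % 101;   /* stand-in for the nested loops */

        int ok = 1;
        for (int j = 0; j < 16; j++)
            if (values[j] != testValues[j]) { ok = 0; break; }
        match[i] = ok;
    }

    printf("match[0] = %d\n", match[0]);
    return 0;
}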
I have automatically generated code (around 18,000 lines, basically a wrapper around data) and about 2,000 lines of other code in a C++ project. The project has link-time code generation turned on, together with /O2 and fast-code optimization. To compile the code, VC++ 2008 Express takes an incredibly long time (around 1.5 hours). After all, it is only 18,000 lines - why does the compiler take so much time?
A little explanation of the 18,000 lines: it is plain C, not even C++, and consists of many unrolled for-loops; a sample would be:
a[0].a1 = 0.1284;
a[0].a2 = 0.32186;
a[0].a3 = 0.48305;
a[1].a1 = 0.543;
..................
It basically fills a complex struct - but not one that should be complex for the compiler, I guess.
Debug mode is fast; only Release mode has this issue. Before I had the 18,000 lines of code everything was fine (at that time the data was in an external location). However, Release mode does a lot of work that reduces the size of the exe from 1,800 KB to 700 KB.
This issue happens at the link stage, because all the .obj files have already been generated. I suspect link-time code generation too, but cannot figure out what is wrong.
Several factors influence link time, including but not limited to:
Computer speed, especially available memory
Libraries included in the build.
Programming paradigm - are you using boost by any chance?
With 18,000 lines of template metaprogramming, 1.5 hours of linking wouldn't completely surprise me, even on a new quad-core.
Historically, a common cause of slow C++ compilation is excessive header file inclusion, usually a result of poor modularization. You can get a lot of redundant compilation by including the same big headers in lots of small source files. The usual reference in these cases is Lakos.
You don't state whether you are using precompiled headers, which are the quick-and-dirty substitute for a header file refactoring.
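As a minimal illustration of the header-hygiene point (all names hypothetical): if a widely-included header only handles a type through a pointer, a forward declaration keeps the big header out of every translation unit that includes it.

/* widget.h - BEFORE: '#include "big_data.h"' pulled the whole big
   header into every file that includes widget.h */

/* AFTER: a forward declaration is enough, because this header only
   refers to struct big_data through a pointer */
struct big_data;

struct widget {
    struct big_data *data;
};

void widget_process(struct widget *w);

/* only widget.c, which actually dereferences the pointer,
   needs to #include "big_data.h" */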
Long link times for monolithic executables are why we generate lots of DLLs for our debug builds, but generally link everything in for our release builds. It's easier (for our particular purposes) to deal with more monolithic executables, but it takes a long time to link them.
As said in one of the comments, you probably have Link Time Code Generation (/LTCG) enabled, which moves the majority of code generation and optimization to the link stage.
This enables some amazing optimizations, but also increases link times significantly.
The C++ team says they've significantly optimized the linker for VS 2010.
What are the input limitations of a bare-metal cross compiler? For instance, will it not compile programs with pointers or mallocs, or anything else that would require more than the underlying hardware? And how can one find these limitations?
I also wanted to ask: I built a cross compiler targeting MIPS. I need to create a MIPS executable using this cross compiler, but I am not able to find where the executable is. There is one executable I found, mipsel-linux-cpp, which is supposed to compile, assemble and link and then produce a.out, but it is not doing so.
However, ./cc1 does give MIPS assembly.
There is an install folder which contains a gcc executable, but it emits i386 assembly and then gives an exe. I don't understand how the gcc executable can give i386 and not MIPS assembly when I have specified the target as MIPS.
Please help; I'm really not able to understand what is happening.
I followed these steps:
1. Installed binutils 2.19
2. Configured gcc for MIPS (g++, core)
I would suggest that you should have asked these as two separate questions.
The GNU toolchain does not have any OS dependencies, but the GNU library does. Most bare-metal cross builds of GCC use the Newlib C library, which provides a set of syscall stubs that you must map to your target yourself. These stubs include the low-level calls necessary to implement stream I/O and heap management. They can be very simple or very complex, depending on your needs. If the only I/O support is a UART for stdin/stdout/stderr, then it is simple. You don't have to implement everything, but if you do not implement the I/O stubs, you won't be able to use printf(), for example. You must implement the sbrk()/sbrk_r() syscall if you want malloc() to work; a sketch follows below.
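As a hedged sketch (the symbol '_end' and the UART address are assumptions that depend entirely on your linker script and board), minimal _sbrk() and _write() stubs for Newlib might look like this:

#include <sys/types.h>

/* assumed to be defined by the linker script as the first address
   after .bss; the heap grows upward from here */
extern char _end;
static char *heap_ptr = &_end;

/* heap stub: lets Newlib's malloc() obtain memory */
caddr_t _sbrk(int incr) {
    char *prev = heap_ptr;
    heap_ptr += incr;
    /* a real stub should also check for collision with the stack */
    return (caddr_t)prev;
}

/* I/O stub: route stdout/stderr to a hypothetical memory-mapped UART */
#define UART_TX (*(volatile unsigned int *)0xBF000000)  /* assumed address */

int _write(int fd, char *buf, int len) {
    (void)fd;
    for (int i = 0; i < len; i++)
        UART_TX = (unsigned int)buf[i];
    return len;
}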
The GNU C++ library will work correctly with Newlib as its underlying library. If you use C++, the C runtime start-up (usually crt0.s) must include the static-initialiser loop that invokes the constructors of any static objects your code may include (see the sketch after this paragraph). The run-time start-up must also, of course, initialise the processor, clocks, SDRAM controller, timers, MMU, etc.; that is your responsibility, not the compiler's.
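For the static-initialiser loop, a common pattern with GNU linker scripts (which provide the __init_array_start/__init_array_end symbols; older scripts use .ctors instead) is something like this, called from crt0 before main():

typedef void (*init_fn)(void);

/* bounds of the array of pointers to static constructors,
   provided by the GNU linker script */
extern init_fn __init_array_start[];
extern init_fn __init_array_end[];

/* call every static constructor in turn; crt0 should invoke this
   after zeroing .bss and copying .data, but before calling main() */
void call_static_constructors(void) {
    for (init_fn *fn = __init_array_start; fn != __init_array_end; ++fn)
        (*fn)();
}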
I have no experience of MIPS targets, but the principles are the same for all processors. There is a very useful article called "Building Bare Metal ARM with GNU" which you may find helpful; much of it will be relevant, especially the parts regarding implementing the Newlib stubs.
Regarding your other question: if your compiler is called mipsel-linux-cpp, then it is not a 'bare-metal' build but rather a Linux build. Also, this executable does not really "compile, assemble and link"; it is rather a driver that separately calls the preprocessor, compiler, assembler and linker. It has to be configured correctly to invoke the cross-tools rather than the host tools. I generally invoke the linker separately, in order to enforce decisions about which standard library to link (-nostdlib), and also because it makes more sense when an application is comprised of multiple execution units. I cannot offer much help beyond that here, since I have always used GNU ARM tools built by people with obviously more patience than me, and moreover hosted on Windows, where there is less chance of the host tool-chain being invoked instead (one reason why I have also avoided the tool-chains that rely on Cygwin).
EDIT
With more time available, I have rewritten my original answer in an attempt to provide something more useful.
I cannot provide a specific answer for your question. I have never tried to get code running on a MIPS machine. What I do have is plenty of experience getting a variety of "bare metal" boards up and running. All kinds of CPUs and all kinds of compilers and cross compilers. So I have an understanding of the principles that apply in all such situations. I will point out the kind of knowledge you will need to absorb before you can hope to succeed with a job like this, and hopefully I can list some links to resources to get you started on learning that knowledge.
I am worried that you don't know that pointers are exactly the kind of thing a bare-metal compiler can handle; they are a basic machine primitive. This tells me you are probably not an expert embedded developer who is just stuck in this particular scenario. Never mind. There isn't anything magic about programming an embedded system, and you can learn what you need to know.
The first step is getting to understand the relationship between C and the machine you wish to run code on. Basically C is a portable assembly language. This means that C is good for manipulating the basic operations of the machine. In this sense the basic operations of the machine are reading and writing memory locations, performing arithmetic and boolean operations on the data read from memory, and making branching and looping decisions based on that data. In particular the C concept of pointers allows you to manipulate data at locations in memory that you specify.
So far so good, but just doing raw computations in memory is not usually enough - you need a way to get data into and out of memory. To do that you need to manipulate the hardware peripherals on your board. If the hardware peripherals are memory-mapped, then the machine registers used to control them look exactly like memory locations, and C can manipulate them directly. Even in that case, though, it is much more likely that useful I/O is best handled by extending the C core language with a library of routines provided just for that purpose. These library routines handle all the nasty details (timers, interrupts, non-memory-mapped I/O) involved in manipulating the peripheral hardware on the board, and wrap them up in a convenient C function-call interface. The idea is that you can simply write printf("hello world"); and the library call takes care of the details of displaying the string.
An appropriately skilled developer knows how to adapt an existing I/O library to a new board, or how to develop new library routines to provide access to non-standard custom hardware. The classic way to develop these skills is to start with something simple, usually an LED for an output device and a switch for an input device. Write a program that pulses an LED in a predictable way, or reads a switch and reflects its state on an LED (a sketch follows below). The first time you get this working will be hugely satisfying.
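To make that concrete, here is a hedged LED-blink sketch in C; the register address and bit position are invented placeholders you would replace with values from your board's datasheet:

/* hypothetical memory-mapped GPIO output register and LED bit */
#define GPIO_OUT (*(volatile unsigned int *)0x1F800000)
#define LED_BIT  (1u << 3)

/* crude software delay; real code would use a hardware timer */
static void delay(volatile unsigned long n) {
    while (n--) { /* spin */ }
}

/* pulse the LED forever in a predictable way */
void blink_forever(void) {
    for (;;) {
        GPIO_OUT ^= LED_BIT;   /* toggle the LED */
        delay(100000);
    }
}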
Okay, I have rambled enough. It is time to provide some more resources for you to study. The good news is that there has never been a better time to learn how things work at the interface between hardware and software. There is a wealth of freely available code and docs. Stack Overflow is a great resource, as you know. Good luck! Links follow:
Embedded systems overview
Knowing the C language well is fundamental
Why not get your code working on a simulator before you try real hardware
Another emulated environment
Linux device drivers - an overlapping subject
Another book about bare metal programming