I recently wrote a program in MATLAB which relies heavily on MATLAB's 'importdata' function and the 'lsqcurvefit' function from the Optimization Toolbox. This code takes approximately 15 seconds to execute using MATLAB R2011b on Windows. When I transferred the code to a Linux (CentOS) machine, it took approximately 30 minutes.
Using the Profiler, I determined that the bulk of the additional computation time was spent in the 'importdata' and 'lsqcurvefit' functions. I cleared all variables in both environments and imported identical data files in both environments using 'importdata'. On Linux this took ~5 seconds, whereas on Windows it took ~0.1 seconds.
Is there any way to fix this problem? It's absolutely essential that the code run quickly on Linux. Both the memory and the processing speed of the Linux machine far exceed those of the Windows machine. After doing some reading, I tried increasing the Java heap memory, but this had no effect.
importdata itself is really only a wrapper which calls various other functions - for example, xlsread - depending on the type of input data. Since it accepts many kinds of file as input (you can even load images with it, although why you would is another matter), it needs to work out what the file is before it can call the appropriate function.
dlmread, on the other hand, only takes a specific type of file (ASCII-delimited numeric).
In Short:
Never use a one-size-fits-all function like importdata when a more specific function will do. You can see a list of file formats and the corresponding read/write functions here.
I don't think this is limited to Linux, either. A thousand repeats of loading a small tab-delimited file on a Windows machine:
dlmread : 0.756655 seconds
csvread : 1.332842 seconds
importdata : 69.508252 seconds
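For reference, a minimal timing harness along these lines could produce such numbers; the file name and repeat count below are placeholders, not the original test setup:

    % Hypothetical benchmark: time repeated reads of a small tab-delimited file.
    nReps = 1000;
    fname = 'small_data.txt';   % placeholder file name

    tic
    for k = 1:nReps
        d = dlmread(fname, '\t');
    end
    fprintf('dlmread    : %f seconds\n', toc);

    tic
    for k = 1:nReps
        s = importdata(fname);
    end
    fprintf('importdata : %f seconds\n', toc);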
I have a class that does some parsing of two large CSV files (~90K rows and 11 columns in the first, ~20K rows and 5 columns in the second). According to the specification I'm working with, the CSV files can be changed externally (rows removed or added; the columns remain constant, as do the paths). Such updates can happen at any time (though it's highly unlikely that updates will arrive less than a couple of minutes apart). An update to either file has to terminate the current processing of all the data (CSV, XML from an HTTP GET request, UDP telegrams), followed by re-parsing the content of both files (or just one, if only one has changed).
I keep the CSV data in memory (quite reduced, since I apply multiple filters to remove unwanted entries) to speed up working with it and to avoid unnecessary I/O operations (opening, reading, and closing the file).
Right now I'm looking into QFileSystemWatcher, which seems to be exactly what I need. However, I'm unable to find any information on how it actually works internally.
Since all I need is to monitor two files for changes, the number of files shouldn't be an issue. Do I need to run the watcher in a separate thread (it is part of the same class where the CSV parsing happens), or is it safe to say that it can run without too much fuss (that is, it works asynchronously, like QNetworkAccessManager)? My development environment for now is a 64-bit Ubuntu VM (VirtualBox) on a relatively powerful host (an HP Z240 workstation); however, the target system is an embedded one. While the whole parsing of the CSV files takes 2-3 seconds at most, I don't know how much performance impact there will be once the application is deployed, so additional overhead is a concern of mine. A sketch of the setup I have in mind follows.
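For context, a minimal sketch of that setup, assuming Qt 5 signal/slot syntax; the file paths and the lambda body are placeholders:

    #include <QCoreApplication>
    #include <QFileSystemWatcher>
    #include <QObject>
    #include <QDebug>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);

        // Watch the two CSV files; the paths are placeholders.
        QFileSystemWatcher watcher;
        watcher.addPath("/path/to/first.csv");
        watcher.addPath("/path/to/second.csv");

        // fileChanged is delivered through the event loop of the thread
        // that owns the watcher, so no extra thread is needed as long as
        // that loop is not blocked by long-running work.
        QObject::connect(&watcher, &QFileSystemWatcher::fileChanged,
                         [](const QString &path) {
                             qDebug() << "changed:" << path;
                             // abort current processing and re-parse here
                         });

        return app.exec();
    }

One Qt-specific caveat: if an external tool replaces a watched file by renaming a new file over it, the watcher may stop tracking that path and it has to be re-added.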
I am considering the speed/memory cost of scaling up a large number of small bash scripts. The set of permutations of scripts calling each other will be quite large: up to 5000 scripts.
Is there a benchmark, or a way of profiling, which of the following methods performs best for calling each script, say:
searching $PATH (there could be many deep directories that would need to be added to $PATH)
using MYFILE=/full/directory/path/file.sh in a number of config files (sketched after this list)
using a large number of alias commands for the more heavily used scripts
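For illustration, a minimal sketch of the config-file approach; the file names and variables here are hypothetical:

    #!/usr/bin/env bash
    # /etc/myapp/paths.conf (hypothetical) maps names to script paths:
    #   MYFILE=/full/directory/path/file.sh
    #   REPORT=/full/directory/path/report.sh

    # Load the path table once per shell, then dispatch through variables;
    # this avoids repeated $PATH searches over many deep directories.
    source /etc/myapp/paths.conf

    "$MYFILE" -p param
    "$REPORT" --daily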
The target system will be a scaling cluster of m8.xl-type EC2 instances that are general-compute optimized with standard SSD I/O.
The reason for this approach is that the scripts will be developed by several teams, and the only common reference document is a basic DB with entries like function=myfile -p param, return=x, etc. So if a developer needs a function or script, they call the script name, and one of the methods above points to the actual script.
UPDATE
To clarify: the speed of launching a script is not that important if one method is only marginally faster than another. The concern is whether the weight of the script names held in memory has a detrimental effect on overall performance.
Several hundred or a few thousand scripts may be spawned in a burst over a few seconds, so peak load is a concern.
Thanks
Art
I am evaluating tools for profiling my Python program. One of the interesting tools here is memory_profiler. Before moving forward, I just want to know whether memory_profiler affects runtime. The reason I am asking is that memory_profiler outputs a lot of memory-usage figures, so I suspect it might affect runtime.
Thanks
Derek
It depends on how you are using memory_profiler. It can be used in two different ways:
To get memory usage line by line (run with python -m memory_profiler my_script.py); a sketch follows the list. This needs to query the OS for memory information after every line executed within the profiled function. How much this affects run time depends on the number of lines in the function: if it has many lines with fast execution times, the overhead can be significant. On the other hand, if the function to profile has few lines and each line has a significant computing time, the overhead will be negligible.
To get memory as a function of time (run with mprof run my_script.py and plot with mprof plot). In this case the function that collects the memory usage runs in a different process from the one that runs your script, hence the overhead is minimal (unless you are using all CPUs).
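A minimal sketch of the line-by-line mode; the script name and the workload inside the function are arbitrary:

    # my_script.py -- run with: python -m memory_profiler my_script.py
    from memory_profiler import profile

    @profile  # memory usage is reported for each line of this function
    def build_lists():
        a = [0] * 10**6        # allocate roughly 8 MB
        b = [0] * (2 * 10**6)  # allocate roughly 16 MB more
        del a                  # release the first list
        return b

    if __name__ == "__main__":
        build_lists()

Running mprof run my_script.py on the same file instead samples total memory over time from a separate process, as described above.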
How can you profile a very long-running script that spawns lots of other processes?
We have a job that takes a long time to run - 11 or more hours, sometimes more than 17 - so it runs on an Amazon EC2 instance.
(It is doing cufflinks DNA alignment and stuff.)
The job executes lots of processes, scripts, utilities and such.
How can we profile it and determine which component parts of the job take the longest time?
A simple CPU utilisation per process per second would probably suffice. How can we obtain it?
There are many solutions to your question:
munin is a great monitoring tool that can scan almost everything in your system and make nice graphs about it :). It is very easy to install and use.
atop could be a simple solution: it can scan CPU, memory, and disk regularly, and you can store all that information in files (the -w option); then you'll have to analyze those files to detect the bottleneck (a recording sketch follows this list).
sar, which can scan just about everything on your system, but is a little harder to interpret (you'll have to make the graphs yourself, with RRDtool for example)
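As an example of the atop approach, a recording session along these lines; the interval and file name are arbitrary:

    # Record a snapshot of all processes every 30 seconds for the
    # duration of the job (the file name is arbitrary).
    atop -w /var/log/atop_job.raw 30 &

    # ... run the 11-17 hour job here ...

    # Replay the recording afterwards and page through the samples.
    atop -r /var/log/atop_job.raw

If per-process CPU per second is really all that's needed, pidstat -u 1 (from the same sysstat package as sar) prints exactly that.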
Hi, I've looked online but I can't seem to find the answer: do I need to do anything to make MATLAB use all cores? From what I understand, multi-threading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest hang at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB seems to be utilizing both cores at 100% - not all of the time, but in a sufficient number of cases. On the 64-bit, 4-core PC this never happens.
Do I have to do anything on 64-bit to get MATLAB to use all the cores whenever possible? I had to do some custom linking after the install, as MATLAB wasn't finding the libraries (e.g. libc.so.6) because it wasn't looking in the correct places.
As standard, since the latest release, you can use up to 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores can be accessed with the MATLAB Distributed Computing Server, where you actually pay per worker.
To make MATLAB use your multiple cores you have to run
matlabpool open
And of course it works better if you actually have parallel code (like spmd blocks or parfor loops).
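A minimal sketch of that pattern, using the matlabpool syntax of that era; the per-iteration workload is just a placeholder:

    matlabpool open            % start a pool of local workers

    n = 1000;
    results = zeros(1, n);
    parfor k = 1:n
        % placeholder workload: each iteration runs on some worker
        results(k) = sum(svd(rand(100)));
    end

    matlabpool close           % shut the pool down when finished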
More info at the Matlab homepage
MATLAB has only a single thread for computation.
That said, multiple threads are created for certain functions which use the multithreaded features of the BLAS libraries that MATLAB uses underneath.
Thus, you would only gain a 'multithreaded' advantage if you are calling functions which use these multithreaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now, whether all your cores are used depends on your OS: the OS has to load-balance your threads across the cores. You CANNOT set thread affinities from within MATLAB. You can, however, give worker MATLAB processes affinities to particular cores via the Parallel Computing Toolbox.
However, you could always try setting the affinity of the MATLAB process to all your processors manually, using the details available at the following link for Linux.
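On Linux that would typically be done with taskset; the PID and core list below are placeholders:

    # Pin an already-running MATLAB process (PID 12345 is a placeholder)
    # to cores 0-3:
    taskset -cp 0-3 12345

    # Or launch MATLAB with the affinity set from the start:
    taskset -c 0-3 matlab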
Windows users can simply right-click the process in Task Manager and set the affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.