Running Pandoc command through subprocess.run() - python-3.x

I am running a Pandoc command through Python's subprocess.run() and find that it runs very slowly compared to running it in the terminal. It is instantaneous in the terminal and takes 16 seconds in Python.
The larger the files, the more time it takes. My test file is only 4K, and it takes 2 seconds to process it.
I am sending the command as a list like so:
['pandoc', '--defaults=defaults.yaml', '--bibliography', 'bibliography.bib']
In addition, without the --bibliography argument the command runs faster, probably because there is less processing to do.
The other thing I find puzzling is that when I tried shell=True just for kicks, the run that took 16 seconds on my files came down to 2 seconds. I know shell=True is a big no-no. What I am looking for is an explanation of why this command runs so slowly and what I can do to remedy it.
Thanks in advance.
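For reference, a minimal sketch of the two invocations being compared, with a simple timing wrapper; input.md is a placeholder for the actual input file, and the join for the shell form assumes no arguments need quoting:

import subprocess
import time

cmd = ['pandoc', '--defaults=defaults.yaml',
       '--bibliography', 'bibliography.bib', 'input.md']

# List form: no shell involved, arguments are passed to pandoc directly.
start = time.perf_counter()
subprocess.run(cmd, check=True)
print('list form: %.2fs' % (time.perf_counter() - start))

# shell=True form: the command string is run through /bin/sh.
start = time.perf_counter()
subprocess.run(' '.join(cmd), shell=True, check=True)
print('shell=True: %.2fs' % (time.perf_counter() - start))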

Related

Running python3 multiprocessing job with slurm makes lots of core.###### files. What are they?

So I have a python3 job that is being run by slurm. The job uses lots of multiprocessing, spawning about 20 or so processes. The code is far from perfect, uses lots of memory, and occasionally hits some unexpected data and throws an error. That in itself is not a problem; I don't need every one of the 20 processes to complete.
The issue is that sometimes something causes the program to create files named like core.356729 (the number after the dot changes), and these files are massive, gigabytes of data. Eventually I end up with so many that I have no disk space left and all my jobs are stopped. I can't tell what they are; their contents are not human-readable. Google searches for "core files slurm" or "core.number files" turn up nothing relevant.
The quick and dirty solution would be to add a process that deletes these files as soon as they appear, but I'd rather understand why they are being created first.
Does anyone know what would create a file of the format "core.######"? Is there a name for this type of file? Is there any way to identify which slurm job created the file?
Those are core dump files, used for debugging. They are essentially the contents of memory for the process that crashed. You can disable their creation with ulimit -c 0.
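If the shell limits of the slurm job can't easily be changed, a sketch of the same idea from inside the Python job itself, using the standard resource module (Unix-only; placing it at the top of the entry script is an assumption):

import resource

# Set the maximum core-file size to 0 bytes for this process and any
# children it spawns, which prevents core.###### files from being written.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))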

Linux vs. Windows execution of lsqcurvefit and importdata in MatLab

I recently wrote a program in MATLAB which relies heavily on MATLAB's 'importdata' function and the 'lsqcurvefit' function from the Optimization Toolbox. This code takes approximately 15 seconds to execute using MATLAB R2011b on Windows. When I transferred the code to a Linux (CentOS) machine, it took approximately 30 minutes.
Using the Profiler, I determined that the bulk of the additional computation time was spent in the 'importdata' and 'lsqcurvefit' functions. I cleared all variables in both environments and imported identical data files in both environments using 'importdata'. On Linux this took ~5 seconds, whereas on Windows it took ~0.1 seconds.
Is there any way to fix this problem? It's absolutely essential that the code operate rapidly in Linux. Both the memory and the processing speed of the Linux machine far exceed that of the Windows machine. After doing some reading, I tried increasing the Java heap memory, but this had no effect.
importdata itself is really only a wrapper which calls various other functions - for example, xlsread - depending on the type of input data. Since it takes multiple types of file as input (you can even load images with it, although why you would is another matter), it needs to work out what the file is, then call the appropriate function.
dlmread, on the other hand, only takes a specific type of file (ASCII-delimited numeric).
In Short:
Never use one-size-fits-all functions like importdata when you can use more specific functions; the MATLAB documentation lists the supported file formats and their specific read/write functions.
I don't think this is limited to Linux, either. A thousand repeats of loading a small tab-delimited file on a Windows machine:
dlmread : 0.756655 seconds
csvread : 1.332842 seconds
importdata : 69.508252 seconds
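The wrapper-overhead point is not MATLAB-specific. A hypothetical Python sketch of the same pattern, where a generic loader pays for type detection and dispatch on every call (data.tsv and both loader names are illustrative, not a real API):

import csv

# Specific reader: assumes tab-delimited text and does nothing else.
def load_specific(path):
    with open(path, newline='') as f:
        return list(csv.reader(f, delimiter='\t'))

# Generic reader: sniffs the file type on every call, then dispatches.
def load_generic(path):
    with open(path, 'rb') as f:
        head = f.read(512)          # type-detection work repeated per call
    if head[:4] == b'\x89PNG':
        raise ValueError('images are not handled in this sketch')
    return load_specific(path)      # dispatch to the specific reader

In this toy case the detection cost is tiny; importdata's detection and dispatch work is far heavier, which is consistent with the timings above.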

Benchmarking two binary file in linux

I have two binary files. These files were not made by me.
I need to benchmark these files to see how fast and how well they work.
I tried to use the time command, but the big problems are:
running the files at the same time
stopping the running files at the same time
using the time command with these two files.
If I use this solution (Benchmarking programs on Linux) and change the position of the files in the time command, the output changes:
0m0.010s file1
0m0.017s file2
After changing the order in the time command:
0m0.002s file2
0m0.013s file1
Thank you. Regards.
There are many ways to do what you want, but probably the simplest is to run one program in a loop many times (say, 1000 or more) so that the total execution time becomes something easy to measure - say, 50 seconds. Then repeat the same for the other one.
This allows you to get much more accurate measurements, and also minimizes inherent jitter between runs.
Having said that, note that with run times as low as you observe, the time to start a process may not be a small fraction of the total measurement. So, if you run a loop, be sure to account for the cost of starting a new process 1000 times.
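A sketch of that loop-based approach in Python, assuming the two binaries are ./file1 and ./file2 (placeholder names) and that /bin/true is available to estimate process-startup overhead:

import subprocess
import time

def bench(cmd, runs=1000):
    # Average wall-clock time per run over many repetitions.
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs

# Estimate per-run process-startup overhead with a trivial command.
overhead = bench(['/bin/true'])

for prog in (['./file1'], ['./file2']):
    avg = bench(prog)
    print('%s: %.6fs per run (%.6fs minus startup)'
          % (prog[0], avg, avg - overhead))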

Profiling very long running tasks

How can you profile a very long running script that is spawning lots of other processes?
We have a job that takes a long time to run - 11 or more hours, sometimes more than 17 - so it runs on an Amazon EC2 instance.
(It is doing cufflinks DNA alignment and stuff.)
The job is executing lots of processes, scripts and utilities and such.
How can we profile it and determine which component parts of the job take the longest time?
A simple CPU utilisation per process per second would probably suffice. How can we obtain it?
There are many solutions to your question:
munin is a great monitoring tool that can scan almost everything in your system and make nice graphs of it :). It is very easy to install and use.
atop could be a simple solution; it can scan CPU, memory, and disk regularly, and you can store all that information in files (the -w option). Then you'll have to analyze those files to detect the bottleneck.
sar can scan more than everything on your system, but it is a little harder to interpret (you'll have to make the graphs yourself, with RRDtool for example).
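For the simple "CPU utilisation per process per second" figure, a rough Python sketch using the third-party psutil library (an assumption; any of the tools above would also work):

import time
import psutil  # third-party: pip install psutil

# Sample CPU usage of every process once per second and log the busy ones.
# The first cpu_percent() call per process returns 0.0; later calls report
# usage since the previous call.
while True:
    for p in psutil.process_iter(['pid', 'name']):
        try:
            cpu = p.cpu_percent(interval=None)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        if cpu > 0:
            print('%d %s %s %.1f%%' % (time.time(), p.pid, p.info['name'], cpu))
    time.sleep(1)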

Benchmark a linux Bash script

Is there a way to benchmark a bash script's performance? The script downloads a remote file and then makes calls to multiple command-line programs to manipulate it. I would like to know (or as much as possible):
Total time
Time spent downloading
Time spent on each command called
-=[ I think these could be wrapped in "time" calls right? ]=-
Average download speed
uses wget
Total Memory used
Total CPU usage
CPU usage per command called
I'm able to edit the bash script to insert any benchmark commands needed at specific points (i.e., between app calls). Not sure if some "top" ninja-ry could solve this or not. I wasn't able to find anything useful (at least to my limited understanding) in the man file.
I will be running the benchmarks in the OS X Terminal as well as on Ubuntu (if either matters).
strace -o trace -c -Ttt ./script
-c counts time, calls, and errors for each system call and prints a summary.
-Ttt shows the time spent in each system call (-T) and prints timestamps with microsecond precision (-tt).
-o saves the output in the file "trace".
You should be able to achieve this in a number of ways. One way is to use the time built-in for each command of interest and capture the results. You may have to be careful about any pipes and redirects.
You may also consider trapping the SIGCHLD, DEBUG, RETURN, ERR and EXIT signals and putting timing information there, though some results may be missing.
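As a rough Python alternative to wrapping each step in time, a sketch using the standard resource module to capture wall-clock and child CPU time per command (the URL and the gzip step are placeholders):

import resource
import subprocess
import time

def timed(cmd):
    # Wall-clock time plus cumulative CPU time of child processes.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = ((after.ru_utime - before.ru_utime) +
           (after.ru_stime - before.ru_stime))
    print('%s: wall %.3fs, cpu %.3fs' % (cmd[0], wall, cpu))

timed(['wget', '-q', 'https://example.com/file'])  # download step
timed(['gzip', '-kf', 'file'])                     # example manipulation step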
This concept of CPU usage for each command won't give you anything useful; most commands use 100% of the CPU while they run. Memory usage is something you can pull out, but you should look at
If you want deep process statistics then you will want to use strace... See the strace(1) man page for details. I doubt that -Ttt, as suggested elsewhere, is useful; all it tells you is system call times, and you want other process trace info.
You may also want to look at the ltrace and dstat tools.
A similar question is answered here: Linux benchmarking tools.
