Use several threads when rendering PDF to image using MuPDF - multithreading

Is it possible to run mutool.exe draw using several threads to increase PDF to Image conversion speed?
The command help lists -B and -T parameters, but I do not understand what maximum band_height means. What values should I set for -B?
-B - maximum band_height (pXm, pcl, pclm, ocr.pdf, ps, psd and png output only)
-T - number of threads to use for rendering (banded mode only)
Executing mutool with -B 100 -T 6 increased conversion speed only slightly, by about 10%. CPU usage spiked from 6% to 11%, but why not 60%?
mutool.exe draw -r 300 -B 100 -T 6 -o "C:\test%d.png" "C:\test-large.pdf"

Every system and PDF is different, but let's use a single page without text for timings on my system.
I know this file is complex, but not too unusual: without text, other objects behave much as text would, minus the complexity of font look-up etc., so rendering time is generally fairly similar for a given run.
Let's start with a low resolution, since I know the file well enough to have seen it fail with a malloc error on this machine at around 300 dpi.
mutool draw -Dst -r 50 -o complex.png complex.pdf
page complex.pdf 1 1691ms
total 1691ms (0ms layout) / 1 pages for an average of 1691ms
mutool draw -Dst -r 100 -o complex.png complex.pdf
page complex.pdf 1 3299ms
total 3299ms (0ms layout) / 1 pages for an average of 3299ms
mutool draw -Dst -r 200 -o complex.png complex.pdf
page complex.pdf 1 7959ms
total 7959ms (0ms layout) / 1 pages for an average of 7959ms
mutool draw -Dst -r 400 -o complex.png complex.pdf
page complex.pdf 1error: malloc of 2220451350 bytes failed
error: cannot draw 'complex.pdf'
So this is where "banding" is required to avoid memory issues, since my target is 400 dpi output.
You may notice I used -D above; that has to be removed for threading, since multiple threads cannot be used without a display list. Let's start small, since bands that are too large, or too many threads, can also trigger malloc errors.
mutool draw -st -B 32 -T 2 -r 400 -o complex.png complex.pdf
page complex.pdf 1 14111ms
total 14111ms (0ms layout) / 1 pages for an average of 14111ms
14 seconds for this file is not a bad result based on the progressive timings above, but perhaps on this 8-thread device I could do better? Let's try bigger bands and more threads.
mutool draw -st -B 32 -T 3 -r 400 -o complex.png complex.pdf
page complex.pdf 1 12726ms
total 12726ms (0ms layout) / 1 pages for an average of 12726ms
mutool draw -st -B 256 -T 3 -r 400 -o complex.png complex.pdf
page complex.pdf 1 12234ms
total 12234ms (0ms layout) / 1 pages for an average of 12234ms
mutool draw -st -B 256 -T 6 -r 400 -o complex.png complex.pdf
page complex.pdf 1 12258ms
total 12258ms (0ms layout) / 1 pages for an average of 12258ms
So increasing threads up to 3 helps, and upping the band size helps, but 6 threads is no better. Is there another tweak we can consider? Playing around over many runs, the best I got on this kit/configuration was about 12 seconds.
mutool draw -Pst -B 128 -T 4 -r 400 -o complex.png complex.pdf
page complex.pdf 1 1111ms (interpretation) 10968ms (rendering) 12079ms (total)
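The last run above adds -P, which runs interpretation and rendering in parallel. Since the best band size and thread count depend on the machine and the file, one option is simply to sweep the parameters and compare the -st timings. A rough bash sketch (assuming mutool is on the PATH and complex.pdf stands in for your own file; the -B/-T values tried are only illustrative):
for B in 32 64 128 256; do
  for T in 2 3 4 6; do
    echo "== band=$B threads=$T =="
    mutool draw -st -B "$B" -T "$T" -r 400 -o sweep.png complex.pdf
  done
done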

Related

tcpdump capture limit size with latest capture

tcpdump -W 5 -C 10 -w capfile
I know what this command does: it keeps a rotating buffer of 5 files (-W 5), and tcpdump switches to another file once the current file reaches 10,000,000 bytes, about 10 MB (-C works in units of 1,000,000 bytes, so -C 10 = 10,000,000 bytes). The prefix of the files will be capfile (-w capfile), and a one-digit integer will be appended to each: how to save a new file when tcpdump file size reaches 10 MB
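For illustration, such a rotating capture keeps a fixed set of numbered files that tcpdump overwrites in turn; a sketch of what you would expect to see (based on the naming described above, exact padding may vary by tcpdump version):
tcpdump -W 5 -C 10 -w capfile
# after enough traffic has been captured:
ls capfile*
capfile0  capfile1  capfile2  capfile3  capfile4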
My question is what happens if I set -W to 1:
tcpdump -W 1 -C 10 -w capfile
Is this going to keep only 1 file, with a max size of 10 MB, containing the latest capture?

Why is using a pipe for sort (Linux command) slow?

I have a large text file of ~8 GB on which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with an SSD and 128 GB of RAM. I have tried
Method 1
awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted
Method 2
awk '...' myBigFile > myBigFile.tmp
sort --parallel 56 myBigFile.tmp > myBigFile.sorted
Surprisingly, Method 1 takes 11.5 min while Method 2 only takes (0.75 + 1 < 2) min. Why is sorting so slow when piped? Is it not parallelized?
EDIT
The awk and myBigFile are not important; this experiment is repeatable simply by using seq 1 10000000 | sort --parallel 56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speed improvement using the un-piped version on my machine.
When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
This is because sort in method 2 can tell from the size of the file that it is huge, and it can seek within the file (neither of which is possible with a pipe). Further, since you have so much memory compared to these file sizes, the data for myBigFile.tmp need not be written to disc before awk exits, and sort will be able to read the file from cache rather than disc. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover the data is huge, and it cannot use any parallelism in reading the input since it can't seek the pipe.
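So a variant of method 1 that hands the piped sort a big buffer up front might look like this (a sketch, assuming GNU sort; -S just needs to be large enough to hold the ~8 GB of data):
awk '...' myBigFile | sort --parallel=56 -S 8G > myBigFile.sorted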
I think sort does not use threads when read from pipe.
I have used this command for your first case. And it shows that sort uses only 1 CPU even though it is told to use 4. atop actually also shows that there is only one thread in sort:
/usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4 > bf.txt"
I have used this command for your second case. And it shows that sort uses 2 CPUs. atop actually also shows that there are four threads in sort:
/usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
In your first scenario sort is an I/O-bound task: it does lots of read syscalls from stdin. In your second scenario sort uses mmap syscalls to read the file, and it avoids being an I/O-bound task.
Below are results for the first and second scenarios:
$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4 > bf.txt"
Command being timed: "bash -c seq 1 10000000 | sort --parallel 4 > bf.txt"
User time (seconds): 35.85
System time (seconds): 0.84
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2899
Voluntary context switches: 1920
Involuntary context switches: 1323
Swaps: 0
File system inputs: 0
File system outputs: 459136
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
User time (seconds): 43.03
System time (seconds): 0.85
Percent of CPU this job got: 175%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1018004
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2445
Voluntary context switches: 299
Involuntary context switches: 4387
Swaps: 0
File system inputs: 0
File system outputs: 308160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
You have more system calls if you use the pipe.
seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059
Without the pipe the file is mapped into memory.
seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33
Kernel calls are in most cases the bottleneck. That is the reason why sendfile was invented.

Optimize PDF files (with Ghostscript or other)

Is Ghostscript the best option if you want to optimize a PDF file and reduce the file size?
I need to store a lot of PDF files and therefore I need to optimize and reduce the file size as much as possible.
Does anyone have any experience with Ghostscript and/or other tools?
command line
exec('gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -sOutputFile='.$file_new.' '.$file);
If you are looking for free (as in 'libre') software, Ghostscript is surely your best choice. However, it is not always easy to use -- some of its (very powerful) processing options are not easy to find documented.
Have a look at this answer, which explains how to execute a more detailed control over image resolution downsampling than what the generic -dPDFSETTINGS=/screen does (that defines a few overall defaults, which you may want to override):
How to downsample images within pdf file?
Basically, it tells you how to make Ghostscript downsample all images to a resolution of 72dpi (this value is what -dPDFSETTINGS=/screen uses -- you may want to go even lower):
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
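Put together into a complete command, that could look like this (a sketch; input.pdf and output.pdf are placeholder names, and -o implies -dBATCH -dNOPAUSE):
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
input.pdf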
If you want to test whether Ghostscript is also able to 'un-embed' the fonts used (sometimes it works, sometimes not -- depending on the complexity of the embedded font, and also on the font type used), you can try adding the following to your gs command:
gs \
-o output.pdf \
[...other options...] \
-dEmbedAllFonts=false \
-dSubsetFonts=true \
-dConvertCMYKImagesToRGB=true \
-dCompressFonts=true \
-c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
-c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
-f input.pdf
Note: Be aware that downsampling image resolution will surely reduce quality (irreversibly), and dis-embedding fonts will make it difficult or impossible to display and print the PDFs unless the same fonts are installed on the machine....
Update
One option which I had overlooked in my original answer is to add
-dDetectDuplicateImages=true
to the command line. This parameter leads Ghostscript to try to detect any images which are embedded in the PDF multiple times. This can happen if you use an image as a logo or page background, and if the PDF-generating software is not optimized for this situation. This used to be the case with older versions of OpenOffice/LibreOffice (I tested the latest release of LibreOffice, v4.3.5.2, and it no longer does such stupid things).
It also happens if you concatenate PDF files with the help of pdftk. To show you the effect, and how you can discover it, let's look at a sample PDF file:
pdfinfo p1.pdf
Producer: libtiff / tiff2pdf - 20120922
CreationDate: Tue Jan 6 19:36:34 2015
ModDate: Tue Jan 6 19:36:34 2015
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 595 x 842 pts (A4)
Page rot: 0
File size: 20983 bytes
Optimized: no
PDF version: 1.1
Recent versions of Poppler's pdfimages utility have added support for a -list parameter, which can list all images included in a PDF file:
pdfimages -list p1.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 423 600 rgb 3 8 jpeg no 7 0 52 52 19.2K 2.6%
This sample PDF is a 1-page document, containing an image, which is compressed with JPEG-compression, has a width of 423 pixels and a height of 600 pixels and renders at a resolution of 52 PPI on the page.
If we concatenate 3 copies of this file with the help of pdftk like so:
pdftk p1.pdf p1.pdf p1.pdf cat output p3.pdf
then the result shows these image properties via pdfimages -list:
pdfimages -list p3.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 423 600 rgb 3 8 jpeg no 4 0 52 52 19.2K 2.6%
2 1 image 423 600 rgb 3 8 jpeg no 8 0 52 52 19.2K 2.6%
3 2 image 423 600 rgb 3 8 jpeg no 12 0 52 52 19.2K 2.6%
This shows that there are 3 identical PDF objects (with the IDs 4, 8 and 12) which are embedded in p3.pdf now. p3.pdf consists of 3 pages:
pdfinfo p3.pdf | grep Pages:
Pages: 3
Optimize PDF by replacing duplicate images with references
Now we can apply the above mentioned optimization with the help of Ghostscript
gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf
Checking:
pdfimages -list p3-optim.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 423 600 rgb 3 8 jpeg no 10 0 52 52 19.2K 2.6%
2 1 image 423 600 rgb 3 8 jpeg no 10 0 52 52 19.2K 2.6%
3 2 image 423 600 rgb 3 8 jpeg no 10 0 52 52 19.2K 2.6%
There is still one image listed per page -- but the PDF object ID is always the same now: 10.
ls -ltrh p1.pdf p3.pdf p3-optim.pdf
-rw-r--r--# 1 kp staff 20K Jan 6 19:36 p1.pdf
-rw-r--r-- 1 kp staff 60K Jan 6 19:37 p3.pdf
-rw-r--r-- 1 kp staff 16K Jan 6 19:40 p3-optim.pdf
As you can see, the "dumb" concatenation made with pdftk increased the file size to three times the original. The optimization by Ghostscript brought it down by a considerable amount.
The most recent versions of Ghostscript may even apply the -dDetectDuplicateImages by default. (AFAIR, v9.02, which introduced it for the first time, didn't use it by default.)
You can obtain good results by converting from PDF to Postscript, then back to PDF using
pdf2ps file.pdf file.ps
ps2pdf -dPDFSETTINGS=/ebook file.ps file-optimized.pdf
The value of argument -dPDFSETTINGS defines the quality of the images in the resulting PDF. Options are, from low to high quality: /screen, /default, /ebook, /printer, /prepress, see http://milan.kupcevic.net/ghostscript-ps-pdf/ for a reference.
The Postscript file can become quite large, but the results are worth it. I went from a 60 MB PDF to a 140 MB Postscript file, but ended up with a 1.1 MB optimized PDF.
I use Ghostscript with following options taken from here.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
You may find that pdftocairo (from Poppler) can make smaller PDFs but beware that it will strip some features (such as hyperlinks) away.
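A minimal invocation would be something like this (a sketch; the file names are placeholders):
pdftocairo -pdf input.pdf output.pdf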
This worked for me
Convert your PDF to PS (this creates a large file):
pdf2ps large.pdf very_large.ps
Convert the new PS back to a PDF
ps2pdf very_large.ps small.pdf
Source:
https://pandemoniumillusion.wordpress.com/2008/05/07/compress-a-pdf-with-pdftk/
You will lose some quality, but if that's not an issue then ImageMagick's convert may prove helpful:
convert original.pdf reduced.pdf
Note that it doesn't always work: I once converted a 126 MB file into a 14 MB one using this command, but another time it doubled the size of a 350 KB file.
Anyway it's worth giving it a try…
As mentioned in the comments, there is of course no point in applying this command to a vector-based PDF; it will only be useful on rasterized images.
See also this post for related options.
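If you need more control over the size/quality trade-off, convert's density, quality and compression settings can be set explicitly (a sketch with illustrative values; note that this rasterizes every page):
convert -density 150 -quality 60 -compress JPEG original.pdf reduced.pdf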
Ghostscript comes with the ps2pdf14 utility, which can be used to optimise PDF file(s), but on some occasions the size of the "optimised" file may be bigger than the original.
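The invocation is straightforward (a sketch; ps2pdf14 is a thin wrapper around gs with the pdfwrite device, so it should also accept PDF input):
ps2pdf14 input.pdf output.pdf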
For PDFs whose size is mainly due to embedded images (pdfimages -list is your friend), typically scanned documents, I would recommend ocrmypdf, which is quite good at optimizing, with an optional OCR layer as a bonus.
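A sketch of such a run (the --optimize level and --skip-text flag are illustrative; check ocrmypdf --help for the options available in your version):
ocrmypdf --optimize 3 --skip-text input.pdf output.pdf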

How to measure CPU usage

I would like to log CPU usage at a frequency of 1 second.
One possible way to do it is via vmstat 1 command.
The problem is that the time between each output is not always exactly one second, especially on a busy server. I would like to be able to output the timestamp along with the CPU usage every second. What would be a simple way to accomplish this, without installing special tools?
There are many ways to do that. Besides top, another way is to use the "sar" utility. So something like
sar -u 1 10
will give you the CPU utilization 10 times, once every second. At the end it will print averages for each of sys, user, iowait and idle.
Another utility is "mpstat", which gives you similar information to sar.
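For example, a roughly equivalent mpstat run would be (a sketch):
mpstat 1 10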
Use the well-known UNIX tool top that is normally available on Linux systems:
top -b -d 1 > /tmp/top.log
The first line of each output block from top contains a timestamp.
I see no command line option to limit the number of rows that top displays.
Section 5a. SYSTEM Configuration File and 5b. PERSONAL Configuration File of the top man page describes pressing W when running top in interactive mode to create a $HOME/.toprc configuration file.
I did this, then edited my .toprc file and changed all maxtasks values so that they are maxtasks=4. Then top only displays 4 rows of output.
For completeness, the alternative way to do this using pipes is:
top -b -d 1 | awk '/load average/ {n=10} {if (n-- > 0) {print}}' > /tmp/top.log
You might want to try htop and atop. htop is beautifully interactive while atop gathers information and can report CPU usage even for terminated processes.
I found a neat way to get the timestamp information to be displayed along with the output of vmstat.
Sample command:
vmstat -n 1 3 | while read line; do echo "$(date --iso-8601=seconds) $line"; done
Output:
2013-09-13T14:01:31-0700 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
2013-09-13T14:01:31-0700 r b swpd free buff cache si so bi bo in cs us sy id wa
2013-09-13T14:01:31-0700 1 1 4197640 29952 124584 12477708 12 5 449 147 2 0 7 4 82 7
2013-09-13T14:01:32-0700 3 0 4197780 28232 124504 12480324 392 180 15984 180 1792 1301 31 15 38 16
2013-09-13T14:01:33-0700 0 1 4197656 30464 124504 12477492 344 0 2008 0 1892 1929 32 14 43 10
To monitor disk usage, CPU and load, I created a small bash script that writes the values to a log file every 10 seconds.
This log file is processed by Logstash, Kibana and Riemann.
#!/usr/bin/env bash
# Define a timestamp function
LOGPATH="/var/log/systemstatus.log"
timestamp() {
date +"%Y-%m-%dT%T.%N"
}
#server load
while ( sleep 10 ) ; do
echo -n "$(timestamp) linux::systemstatus::load " >> $LOGPATH
cat /proc/loadavg >> $LOGPATH
#cpu usage
echo -n "$(timestamp) linux::systemstatus::cpu " >> $LOGPATH
top -bn 1 | sed -n 3p >> $LOGPATH
#disk usage
echo -n "$(timestamp) linux::systemstatus::storage " >> $LOGPATH
df --total|grep total|sed "s/total//g"| sed 's/^ *//' >> $LOGPATH
done

Linux space check

Collectively check the space used by files in Linux...
I have more than 100 files... and I want to check their size collectively...
Edit: What I need is: I have a folder containing 1000 files, and I need a way to calculate the total size of only the 100 files I need... not all 1000 files.
This command will give you the size in kilobytes of all the individual files/directories in the current directory:
du -ks *
This command will give you the combined total size of the current directory:
du -ks .
If you need to recurse and get more detailed info, the find command might help.
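For example, to total only a chosen subset of files, du's -c option prints a grand total; combined with find it might look like the following (a sketch; the *.log pattern is only an illustration, and --files0-from needs GNU du):
find . -maxdepth 1 -name '*.log' -print0 | du -ch --files0-from=- | tail -n 1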
If you want the total size of all files in the current directory (in "human-readable format"):
du -sh
This is a bit vague ... Assuming all you want is to get the total size of a bunch of files, there's any number of solutions.
If the files are all in the same directory, one very easy way is to just use
ls -lh | head -1
This prints a single line showing the "total" number, with a friendly "human-readable" (that's the -h option to ls) unit even.
Note that this does not work with wildcards, since then ls suppresses its "total"-line.
I'm no linux guru, but there should be some switch of the ls command that shows size.
If that fails, look into using du.
Using gdu:
aaa:vim70> gdu
5028 ./doc
4420 ./syntax
.
.
.
176 ./compiler
16 ./macros/hanoi
16 ./macros/life
48 ./macros/maze
20 ./macros/urm
200 ./macros
252 ./keymap
18000 .
You can use --max-depth to limit the depth of the search:
aaa:vim70> gdu --max-depth=1
5028 ./doc
136 ./print
76 ./colors
4420 ./syntax
420 ./indent
628 ./ftplugin
1260 ./autoload
64 ./plugin
800 ./tutor
3348 ./spell
176 ./compiler
200 ./macros
112 ./tools
844 ./lang
252 ./keymap
18000 .
Notice that the subdirectories of macros don't appear.
or even:
aaa:vim70> gdu --max-depth=0
18000 .
The default unit is kilobytes. You can use -h to get it in human readable form:
aaa:vim70> gdu --max-depth=1 -h
5.0M ./doc
136k ./print
76k ./colors
4.4M ./syntax
420k ./indent
628k ./ftplugin
1.3M ./autoload
64k ./plugin
800k ./tutor
3.3M ./spell
176k ./compiler
200k ./macros
112k ./tools
844k ./lang
252k ./keymap
18M .
