Is tesseract 3.00 multi-threaded? - multithreading

I've read posts suggesting that multi-threading support would be added in 3.00, but I'm not sure whether it actually made it into the release.
Other than multi-threading, is running multiple processes of tesseract a feasible option to achieve concurrency?
Thanks.

One thing I've done is use GNU Parallel to run as many instances of Tesseract as a multi-core system can handle, for multi-page documents converted to single-page images.
GNU Parallel is a short program, easily compiled on most Linux distros (I'm using openSUSE 11.4).
Here's the command line that I use:
/usr/local/bin/parallel -j 4 \
/usr/local/bin/tesseract -psm 1 -l eng {} {.} \
::: /tmp/tmp/*.jpg
The -j 4 tells parallel to run up to four jobs at a time, one for each of the four CPU cores on my server.
If you run this and do a 'top' in another terminal, you'll see up to four tesseract processes at a time until it has worked through all of the JPGs in the directory specified.
With -j set to the number of cores, the load from this job should stay at or below the number of CPU cores in your system (on Linux, at least).
Here's the link to GNU Parallel:
http://www.gnu.org/software/parallel/
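Rather than hard-coding -j 4, the job count can be derived from the machine. A minimal sketch, assuming a Linux-like system where either nproc (GNU coreutils) or getconf _NPROCESSORS_ONLN is available. The cpu_count helper is a made-up name for illustration, and the paths in the comment are the ones from the command above:

```shell
#!/bin/sh
# cpu_count: print the number of CPU cores available (hypothetical helper).
# Falls back to 1 if neither nproc nor getconf can answer.
cpu_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
        getconf _NPROCESSORS_ONLN
    else
        echo 1
    fi
}

njobs=$(cpu_count)
# Then, with the same paths as in the command above:
#   parallel -j "$njobs" tesseract -psm 1 -l eng {} {.} ::: /tmp/tmp/*.jpg
echo "jobs to run at once: $njobs"
```

Lowering the number (for example, one less than the core count) leaves some headroom for other work on the box.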

No. You can browse the code at http://code.google.com/p/tesseract-ocr/source/browse/ and see that none of the current code in trunk seems to make use of multi-threading (at least judging from the base classes, the API, and the neural network classes).

I used parallel as well, on CentOS, this way:
ls | parallel --gnu "tesseract {} {.}"
I used the --gnu option as suggested from the stdout log which was:
parallel: Warning: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.
The {} and {.} are placeholders for parallel: {} is replaced by the input file name and {.} by the same name without its extension, so tesseract gets the file as its first argument and the extension-less name as its second. Everything is well explained in the parallel man page.
Now, if you have, say, three .tif files, run tesseract three times, once per file, and sum the execution times. Then run the command above with time in front of parallel, and you can easily check the speedup.
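To make the two placeholders concrete: {} is the input line as given, and {.} is the same line with its final extension removed. The sketch below mimics that in plain shell. strip_ext is a hypothetical helper written for illustration, not part of parallel, and it only approximates what parallel does internally:

```shell
#!/bin/sh
# strip_ext PATH: print PATH without the last extension of its file name,
# roughly what parallel substitutes for {.} (hypothetical helper).
strip_ext() {
    local path base dir
    path=$1
    base=${path##*/}          # file name without leading directories
    dir=${path%"$base"}       # directory part, including the trailing slash
    case $base in
        *.*) printf '%s\n' "$dir${base%.*}" ;;   # drop the final ".ext"
        *)   printf '%s\n' "$path" ;;            # no extension: leave as-is
    esac
}

# Show the command line each input would produce.
for f in /tmp/tmp/page1.jpg scan.tif; do
    echo "tesseract $f $(strip_ext "$f")"
done
# prints:
#   tesseract /tmp/tmp/page1.jpg /tmp/tmp/page1
#   tesseract scan.tif scan
```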

Related

What exactly happens when you download multiple files from firefox?

I'm trying to understand what happens when you download multiple files from Firefox at the same time. Is it using multithreading or multiprocessing? Assuming it's using multithreading, how many threads is it using? How can I figure this out?
One way to figure it out is to look at the source.
Also, a script like this might give you a good starting point:
while ps x | grep -i firefox | wc; do sleep 1; done
which prints something like this:
16 392 6607
16 392 6607
16 392 6607
...
Then start downloading a file that takes more than a few seconds and see how many new processes appear.
Depending upon your OS, you might want to consult the man page for ps to figure out how to report threads as well as processes. You might need a little more smarts than grep.
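On Linux specifically, one way to report threads without parsing ps output is to count the entries under /proc/PID/task, where the kernel lists one directory per thread. A minimal sketch under that assumption (it will not work on systems without a Linux-style /proc); thread_count is a name chosen for illustration:

```shell
#!/bin/sh
# thread_count PID: print the number of threads of process PID (Linux only).
thread_count() {
    if [ ! -d "/proc/$1/task" ]; then
        echo "no such process, or no Linux-style /proc: $1" >&2
        return 1
    fi
    # One subdirectory per thread; count them.
    ls "/proc/$1/task" | wc -l
}

# Example: threads of the current shell.
thread_count "$$"
```

Running it against the Firefox PIDs before and during a download (for instance in a loop over the output of pgrep -i firefox) would show whether the thread count changes.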
If you need to dive deeper, but don't want to go into the source, most OSes have some mechanism to view a trace of the system calls a program makes. This could be truss, strace, dtrace, ...
There is also a possibility that it uses no threads or processes to download files; it merely relies on select (man 2 select).
Happy hunting.

Why are some Bash commands both built-in and external?

Some commands are internal built-in Bash commands while others are external (other programs). I see why certain commands need to be built-in. Some of the reasons are:
If a command needs to change the internal state of the shell process.
If a command performs a very basic operation in the shell.
If a command is called often and needs to be made fast. An external command is executed by loading an external program and hence is slower.
But why are some commands both built-in and external, for example echo and test? I understand echo is used a lot and thus is built-in (Reason 3). But then why also have it as an external command and have a binary for it in /bin/echo? The built-in version of echo will always take precedence over the external version and thus, the external version is hardly ever used. So, why then have an external version of it at all?
It's exactly your point 3. When a command does very little (echo is a good example), spawning a new process dominates the run time behavior. With growing disks and bandwidth and code bases you always reach a spot when you have so much data and so many files (our code base at work has 100k files!!) that one less spawn per file makes a difference of minutes.
That's also why the typical built-in is a drop-in replacement which takes (perhaps a superset of) the same arguments as the binary.
You also ask why the old binary is still retained even though Bash has it as a built-in — the answer is that a lot of programs rely on the existence of that /bin/echo. It's actually standardized.
Bash is only one of many user interfaces and offline command interpreters. They all have different sets of built-ins. Some shells are purposefully small and rely a lot on what you could call "legacy" binaries. One example is ash and its successor, Dash. Dash is now the default /bin/sh in Ubuntu and Debian due to its speed, and is popular for embedded systems due to its small size. (But even Dash has builtins for echo, test and dozens of other commands, and provides a command history for interactive use.)
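You can see both versions side by side from a shell. A small sketch: type reports how the current shell resolves a name, and find_external (a hypothetical helper, written here only for illustration) walks PATH to locate a binary of the same name. The exact path it prints depends on the system:

```shell
#!/bin/sh
# find_external NAME: print the first executable file called NAME on PATH.
find_external() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        candidate=${dir:-.}/$name      # an empty PATH entry means "."
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            IFS=$old_ifs
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

type echo            # how the shell itself resolves the name
find_external echo   # the external binary, if one is installed
type test
find_external test
```

In Bash, type -a echo lists the builtin and the external binary in one go.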

Shell script, parallel and multithreaded?

I have a shell script that contains file/folder locations and a command to run a JavaScript program. I am running the job on a Linux cluster with 8 processors and 4 CPU cores per processor. I want to run this on multiple processors at the same time, with each job using multiple threads, to reduce the total runtime of the script. My question is:
Is such a thing possible? If yes what would be the command or code snippet for this?
You may be able to use GNU parallel. Example:
parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
"This will recompress all files in the current directory with names ending in .gz using bzip2, running one job per CPU (-j+0) in parallel."
If you don't have parallel, then you can roll your own coordinator in a couple of ways:
Use make -j
Use bash in-built backgrounding (&)
You can use ssh for running commands on remote machines, and you can use nfs or its ilk for networked storage. However you do it, you will need to think about how to partition your jobs so that they can cooperate and, as needed, coalesce results.
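For the bash backgrounding option, a minimal sketch of a hand-rolled coordinator: it starts jobs with & in batches of at most MAX and waits for each batch to finish. run_limited and the demo job function are hypothetical names. A batch only advances when its slowest job is done, so this is cruder than what GNU parallel or make -j do:

```shell
#!/bin/sh
# run_limited MAX CMD ARG...: run CMD once per ARG, at most MAX at a time.
run_limited() {
    max=$1
    cmd=$2
    shift 2
    running=0
    for arg in "$@"; do
        "$cmd" "$arg" &               # start the job in the background
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait                      # let the whole batch finish
            running=0
        fi
    done
    wait                              # wait for the last, partial batch
}

# Demo: a stand-in job that just records that it ran.
outdir=$(mktemp -d)
job() { echo "done $1" > "$outdir/$1.log"; }
run_limited 4 job a b c d e f
ls "$outdir"
```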

How to set core dump naming scheme without su/sudo?

I am developing a MPI program on a Linux machine where I do not have sudo/su access. As my program currently segfaults, I would like to examine the core dumps via gdb. Unfortunately, as the program is multi-threaded, all the threads write to one core dump. So I would like to be able to append the PID to each separate core dump for every process.
I know there is a way to do it via /proc/sys/kernel/core_pattern, however I do not have access to write to this.
Thanks for any help.
It can be a pain to debug MPI apps on systems that are configured this way when you do not have root access. One option for working around this is to use Valgrind to get stack traces for your segfault(s). This will only be useful provided that your application will fail in a reasonable period of time when slowed down via Valgrind, and that it still segfaults at all in this case.
I usually run MPI apps under Valgrind like this:
% mpiexec -n 5 valgrind -q /path/to/my_app
That will send all of the Valgrind output to standard error. If you want the output separated into different files, you can get a bit fancier:
% mpiexec -n 5 valgrind -q --log-file='vg_out.%q{PMI_RANK}' /path/to/my_app
That's the setup for MPICH2. I think that for Open MPI you'll need to replace PMI_RANK with OMPI_MCA_ns_nds_vpid, but if that doesn't work for you then you'll need to check with the Open MPI developers on their discussion list. In either case, this will yield N files, where N is the size of MPI_COMM_WORLD, each named vg_out.0, vg_out.1, ..., to vg_out.$(($N-1)), each corresponding to a rank in MPI_COMM_WORLD.
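If it is unclear which launcher sets which variable, a wrapper can pick the rank from whichever one is present. A sketch with assumptions: PMI_RANK is the MPICH2 variable used above, OMPI_COMM_WORLD_RANK is the variable that recent Open MPI releases are documented to export (worth verifying against your version's mpirun man page), and rank_log_name is a hypothetical helper. It falls back to the process ID when neither is set:

```shell
#!/bin/sh
# rank_log_name: print a per-rank Valgrind log file name.
rank_log_name() {
    # MPICH2 sets PMI_RANK; Open MPI is assumed to set OMPI_COMM_WORLD_RANK.
    # With neither set, fall back to this shell's PID so names stay distinct.
    printf 'vg_out.%s\n' "${PMI_RANK:-${OMPI_COMM_WORLD_RANK:-pid$$}}"
}

rank_log_name
```

In a wrapper script launched by mpiexec in place of the application, the last line would become something like exec valgrind -q --log-file="$(rank_log_name)" "$@".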

How to set CPU load on a Red Hat Linux box?

I have a RHEL box that I need to put under a moderate and variable amount of CPU load (50%-75%).
What is the best way to go about this? Is there a program that can do this that I am not aware of? I am happy to write some C code to make this happen, I just don't know what system calls will help.
This is exactly what you need (internet archive link):
https://web.archive.org/web/20120512025754/http://weather.ou.edu/~apw/projects/stress/stress-1.0.4.tar.gz
From the homepage:
"stress is a simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system. It is written in C, and is free software licensed under the GPL."
Find a simple prime number search program that has source code. Modify the source code to add a nanosleep call to the main loop with whichever delay gives you the desired CPU load.
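The same duty-cycle idea (burn CPU for part of each period, sleep for the rest) can be sketched in shell instead of C. This is an open-loop approximation for one core, not a prime search and not a measured feedback loop. It assumes a sleep that accepts fractional seconds (GNU coreutils does), and duty_cycle_load is a hypothetical name:

```shell
#!/bin/sh
# duty_cycle_load PERCENT PERIODS
# For PERIODS one-second periods, keep one core busy for roughly PERCENT
# of each period and idle for the rest.
duty_cycle_load() {
    percent=$1
    periods=$2
    case $percent in
        ''|*[!0-9]*) echo "percent must be an integer 0-100" >&2; return 1 ;;
    esac
    if [ "$percent" -gt 100 ]; then
        echo "percent must be an integer 0-100" >&2
        return 1
    fi

    # Split each one-second period into a busy part and an idle part.
    busy=$(LC_ALL=C awk -v p="$percent" 'BEGIN { printf "%.2f", p / 100 }')
    idle=$(LC_ALL=C awk -v p="$percent" 'BEGIN { printf "%.2f", 1 - p / 100 }')

    i=0
    while [ "$i" -lt "$periods" ]; do
        ( while :; do :; done ) &              # spinner: burns one core
        spinner=$!
        sleep "$busy"
        kill "$spinner"
        wait "$spinner" 2>/dev/null || true    # reap it quietly
        sleep "$idle"
        i=$((i + 1))
    done
}

# Example: about 60% load on one core for 2 seconds.
duty_cycle_load 60 2
```

To load several cores, start one copy per core in the background. To vary the load between 50% and 75%, call it repeatedly with different percentages.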
One common way to get some load on a system is to compile a large software package over and over again. Something like the Linux kernel.
Get a copy of the source code, extract the tar.bz2, go into the top-level source directory, copy your kernel config from /boot to .config (or run zcat /proc/config.gz > .config), then do make oldconfig, then while true; do make clean && make bzImage; done
If you have an SMP system, then make -j bzImage is fun: it will spawn make tasks in parallel (with no number after -j, make puts no limit on how many run at once).
One problem with this is adjusting the CPU load. It will be at maximum CPU load except when waiting on disk I/O.
You could possibly do this using a Bash script. Use "ps -eo pcpu | grep -v CPU" to get the CPU usage of all the processes, and add those values together to get the current usage. Then have a loop that keeps checking those values, works out the current CPU usage, and waits a calculated amount of time to keep the processor at a certain threshold. More detail is needed, but hopefully this will give you a good starting point.
Take a look at this CPU Monitor script I found and try to get some other ideas on how you can accomplish this.
It really depends what you're trying to test. If you're just testing CPU load, simple scripts to eat empty CPU cycles will work fine. I personally had to test the performance of a RAID array recently and I relied on Bonnie++ and IOZone. IOZone will put a decent load on the box, particularly if you set the file size higher than the RAM.
You may also be interested in this Article.
Lookbusy lets you set a target value for the CPU load.
Project site
lookbusy -c util[-high_util], --cpu-util util[-high_util]
For example, for 60% load:
lookbusy -c 60
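Use the "nice" command to change the scheduling priority of the process generating the load. Nice values run from -20 to 19, and negative values need root.
a) Highest priority:
$ nice -n -20 my_command
or
b) Lowest priority:
$ nice -n 19 my_command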
A simple script to load and hammer the CPU using awk. The script does mathematical calculations, so the CPU load peaks at higher values passed to loadserver.sh.
Check out the script: http://unixfoo.blogspot.com/2008/11/linux-cpu-hammer-script.html
You can probably use some load-generating tool to accomplish this, or run a script to take all the CPU cycles and then use nice and renice on the process to vary the percentage of cycles that the process gets.
Here is a sample bash script that will occupy all the free CPU cycles:
#!/bin/bash
while true ; do
    true
done
Not sure what your goal is here. I believe glxgears will use 100% CPU.
So find any process that you know will max out the CPU to 100%.
If you have four CPU cores (0 1 2 3), you could use "taskset" to bind this process to, say, CPUs 0 and 1. That should load your box 50%, provided the process runs at least one busy thread per bound CPU. To load it 75%, bind the process to CPUs 0, 1 and 2.
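The arithmetic behind the 50% and 75% figures can be sketched as a helper that turns a target percentage and a core count into a CPU list for taskset -c. cpu_list_for_load is a hypothetical name, and the sketch assumes the bound process keeps every listed core busy (for example, one spinning thread or process per core):

```shell
#!/bin/sh
# cpu_list_for_load PERCENT NCORES: print a comma-separated CPU list
# (starting at CPU 0) covering roughly PERCENT of NCORES cores.
cpu_list_for_load() {
    percent=$1
    ncores=$2
    count=$(( percent * ncores / 100 ))   # rounds down
    [ "$count" -ge 1 ] || count=1         # always bind to at least one CPU
    list=0
    i=1
    while [ "$i" -lt "$count" ]; do
        list="$list,$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

cpu_list_for_load 50 4   # prints 0,1
cpu_list_for_load 75 4   # prints 0,1,2
```

The result would then be passed along the lines of taskset -c "$(cpu_list_for_load 75 4)" my_cpu_burner, where my_cpu_burner stands in for whatever process maxes out the CPU.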
Disclaimer: I haven't tested this. Please let us know your results. Even if this works, I'm not sure what you will achieve with it.

Resources