qsub: Specifying non-consecutive datasets with the -t option

I'm submitting jobs to a Sun Grid Engine using the qsub command. The -t option to qsub enables me to specify the datasets upon which I want to call my script -- e.g.,
$ qsub . . . -t 101-103 my_script.sh
My question is, is it possible to specify non-consecutive datasets with the -t option? For example, say I wanted to run the script on 101 and 103, but not 102. How would I accomplish that?
And, more generally, how would I select arbitrarily numbered datasets?
I would like an answer that works in practice for a large number of datasets -- far beyond the two used in this toy example.

Not sure about that, but quoting from qsub's man page, in the paragraph where -t is explained:
. . .
The task id range specified in the option argument may be a single
number, a simple range of the form n-m or a range with a step size.
Hence, the task id range specified by 2-10:2 would result in the
task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks,
. . .
So, maybe:
$ qsub . . . -t 101-103:2 my_script.sh
would do.
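For completeness, each array task can read its own index from the SGE_TASK_ID environment variable, which SGE sets for array jobs. A minimal sketch of my_script.sh (process_dataset is a hypothetical stand-in for the real processing step):
#!/bin/bash
# Each array task sees its own index; with -t 101-103:2 this is 101 or 103.
dataset="$SGE_TASK_ID"
process_dataset "$dataset"   # hypothetical helper, not a real command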

If the goal is to run regularly spaced datasets -- e.g., 1, 3, 5, . . . or 10, 15, 20, . . . -- then #chrk's answer is the one to use.
For arbitrarily numbered datasets, using -t is not possible. The same functionality can be attained, however, using the submit command (with the -f option) instead of qsub.
$ submit . . . -s my_script.sh -f my_datasets.txt
The file my_datasets.txt contains one dataset per line, as in
101
103
I'm not sure how specific this solution is to the particular configuration of my computing environment.
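For a large number of datasets, the file can be generated rather than typed by hand. A minimal sketch (the IDs here are placeholders):
$ printf '%s\n' 101 103 107 112 > my_datasets.txt
Any command that emits one dataset ID per line can be redirected into the file the same way.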


How to execute sequential commands in Linux

I have what I guess is a pretty basic question about Linux. Let us say I have a series of files which come in pairs. Each pair of files is the input to my command, which produces a single output. I would then like to execute the same command on the next pair, producing a new output.
Let us say, for the example, that I have four files: F1.R1, F1.R2, F2.R1, and F2.R2
My first command would be:
myfunction F1.R1 F1.R2 -output F1
And the second:
myfunction F2.R1 F2.R2 -output F2
I would like to produce a command so that all pairs are processed sequentially until all files have been handled.
Thanks a lot for your help.
Best
Loop over the .R1 files, derive the matching .R2 filename and the output name with parameter expansion, then run the command:
for file in *.R1
do
    file2=${file/.R1/.R2}    # F1.R1 -> F1.R2
    output=${file%.R1}       # F1.R1 -> F1
    myfunction "$file" "$file2" -output "$output"
done
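The two parameter expansions do the real work: ${file/.R1/.R2} substitutes the first .R1 with .R2, and ${file%.R1} strips the .R1 suffix. A quick illustration with one of the example filenames:
$ file=F1.R1
$ echo "${file/.R1/.R2}" "${file%.R1}"
F1.R2 F1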

Use more than one core in bash

I have a Linux tool that (greatly simplifying) trims the sequences specified in an Illumina adapter file. I have 32 files to process. One file takes about 5 hours. I have a server running CentOS with 128 cores.
I've found a few solutions, but each one only uses one core. The last one fires off 32 nohups, but it still pushes everything through a single core.
My question is, does anyone have any idea how to use the server's potential? Every file can be processed independently; there are no relations between them.
This is the current version of the script, and I don't know why it only uses one core. I wrote it with the help of advice here on Stack and found on the Internet:
#!/bin/bash
FILES=/home/daw/raw/*
count=0
for f in $FILES
do
    base=${f##*/}
    echo "process $f file..."
    nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" $f &
    (( count++ ))
    if (( count = 31 )); then
        wait
        count=0
    fi
done
To explain: FILES is the list of files from the raw folder.
The "core" nohup line: the first path is the path to the tool; -a is the path to the file with the adapter patterns to trim; -o saves the output under the same file name as the input, prefixed with OUT. The last parameter is the input file to be processed.
The tool's README is here:
https://github.com/vsbuffalo/scythe
Does anybody know how to handle this?
P.S. I also tried moving nohup before count, but it still uses one core. There are no limits set on the server.
IMHO, the most likely solution is GNU Parallel, so you can run, say, up to 64 jobs in parallel with something like this:
parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*
This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes. That is better than waiting potentially 4.9 hours for all 32 of your jobs to finish before starting the last one, which then takes a further 5 hours. The {} is replaced by each input file and {/} by its basename, so the outputs match the OUT$base names from your script. Note that I chose 64 jobs arbitrarily; if you don't specify otherwise, GNU Parallel runs one job per CPU core.
Useful additional parameters are:
parallel --bar ... gives a progress bar
parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything
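For example, to preview the exact commands for the files in the question without running anything (same paths as above):
$ parallel --dry-run -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*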
If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:
parallel -S server1,server2,server3 ...

Execute the output of previous command line

I need to execute the result of a previous command, but I don't know how to proceed.
I have a first command that returns an instruction to log in to the server and then I want to execute it just after.
my-first-command returns: docker login ...
For example:
> my-first-command | execute the result of my-first-command
This should do it I believe.
my-first-command | bash
I use $(!!) for this. As Charles points out, this may not be what everyone wants to do, but it works for me and suits my purpose better than the other answer.
$ find ./ -type f -name "some.sh"
$ $(!!)
!! is the shell's history expansion for the previous command; putting it inside $( ) re-runs that command and substitutes its output, which the shell then tries to execute.
This is also useful for taking other actions on the output, since the result of $( ) can be used anywhere a value is expected.
The handiest way is to use backticks `your_command` to execute your sub-command inline and immediately use its output in your main command.
Example:
`find ~/Library/Android/sdk/build-tools/* -d 0 | tail -1`/zipalign -f 4 ./app-release-unsigned.apk ./app-release.apk
In this example I first find the directory from which I will execute zipalign. There could be several directories, as in my case (find returns two), so I take the last one using tail. Then I execute zipalign directly, using the previous result as the path to the correct zipalign binary.

creating multiple copies of a file in bash with a script

I am starting to learn how to use bash shell commands and scripting in Linux.
I want to create a script that will take a source file, and create a chosen number of named copies.
for example, I have the source as testFile, and I choose 15 copies, so it creates testFile1, 2, 3 ... 14, 15 in the same location.
To try and achieve this I have tried to make the following command:
for LABEL in {$X..$Y}; do cp $INPUT $INPUT$LABEL; done
However, instead of creating files numbered X to Y, it makes just one file with (for example) {1..5} appended, rather than files 1, 2, 3, 4 and 5.
How can I change it so it properly uses the variable as a number for the loop?
The brace expansion mechanism is a bit limited; it doesn't work with variables, only literals, because brace expansion happens before variable expansion.
For what you want, you probably have the seq command, and could write:
INPUT=testFile
for num in $(seq 1 15)
do
    cp "$INPUT" "$INPUT$num"
done
Using a C-style for loop:
$ x=0 y=15
$ for ((i=x; i<=y; i++)); do cp "$INPUT" "$INPUT$i"; done

grep but indexable?

I have over 200 MB of source code files that I constantly have to search (I am part of a very big team). I notice that grep does not create an index, so each lookup requires going through the entire source code database.
Is there a command line utility similar to grep which has indexing ability?
The solutions below are rather simple. There are a lot of corner cases that they do not cover:
searching for start of line ^
filenames containing \n or : will fail
filenames containing white space will fail (though that can be fixed by using GNU Parallel instead of xargs)
searching for a string that matches the path of another file will be suboptimal
The good part about the solutions is that they are very easy to implement.
Solution 1: one big file
Fact: Seeking is dead slow; reading one big file is often faster.
Given that, the idea is to simply make an index containing all the files with all their content, each line prepended with the filename and the line number:
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . > .index
Use the index:
grep foo .index
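Because of the -Han flags, every line in .index carries its filename and line number, so a hit looks something like this (hypothetical file and match):
./src/main.c:42:int foo(void) {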
Solution 2: one big compressed file
Fact: Hard drives are slow. Seeking is dead slow. Multi-core CPUs are the norm.
So it may be faster to read a compressed file and decompress it on the fly than to read the uncompressed file, especially if you have enough RAM to cache the compressed file but not the uncompressed one.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo
Solution 3: use index for finding potential candidates
Generating the index can be time-consuming, and you might not want to redo it for every single change in the dir.
To speed that up, use the index only to identify filenames that might match, then run an actual grep through those (hopefully few) files. The real grep pass weeds out files that no longer match, but it will not discover new files that do match.
The sort -u is needed to avoid grepping the same file multiple times.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo | sed 's/:.*//' | sort -u | xargs grep foo
Solution 4: append to the index
Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append to the index with newly changed files. The index will again only be used for locating potential candidates, so if a file no longer matches it will be discovered when grepping through the actual file.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Append to the index:
find . -type f -newer .index -print0 | xargs -0 grep -Han . | pbzip2 >> .index
Use the index:
pbzcat .index | grep foo | sed 's/:.*//' | sort -u | xargs grep foo
It can be even faster if you use pzstd instead of pbzip2/pbzcat.
Solution 5: use git
git grep can grep through a git repository. But it seems to do a lot of seeks and is 4 times slower on my system than solution 4.
The good part is that the .git index is smaller than the .index.bz2.
Index a dir:
git init
git add .
Append to the index:
git add .
Use the index:
git grep foo
Solution 6: optimize git
Git puts its data into many small files, which results in seeking. But you can ask git to compress the small files into a few bigger files:
git gc --aggressive
This takes a while, but it packs the index very efficiently into a few files.
Now you can do:
find .git -type f | xargs cat >/dev/null
git grep foo
git will do a lot of seeking into the index, but by running cat first, you put the whole index into RAM.
Adding to the index is the same as in solution 5, but run git gc now and then to avoid many small files, and git gc --aggressive to save more disk space, when the system is idle.
git will not free disk space if you remove files. So if you remove large amounts of data, remove .git and do git init; git add . again.
There is the https://code.google.com/p/codesearch/ project, which is capable of creating an index and searching it quickly. Regexps are supported and evaluated using the index (actually, only a subset of regexps can use the index to filter the file set; the real regexp is then re-evaluated on the matched files).
The index from codesearch is usually 10-20% of the source code size, building the index is about as fast as running classic grep 2 or 3 times, and searching is almost instantaneous.
The ideas used in the codesearch project come from Google's Code Search site (RIP). E.g. the index contains a map from n-grams (3-grams, or every 3-byte sequence found in your sources) to the files; the regexp is translated to 4-grams when searching.
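A minimal sketch of indexing and searching, assuming the cindex/csearch tools that the codesearch project provides:
$ cindex ~/src         # build (or update) the index for the source tree
$ csearch 'foo.*bar'   # regexp search answered from the index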
PS There are also ctags and cscope for navigating C/C++ sources. Ctags can find declarations/definitions; cscope is more capable, but has problems with C++.
PPS There are also clang-based tools for the C/C++/ObjC languages: http://blog.wuwon.id.au/2011/10/vim-plugin-for-navigating-c-with.html and clang-complete.
I notice that grep does not create an index so lookup requires going through the entire source code database each time.
Without addressing the indexing part, git grep will have, with Git 2.8 (Q1 2016), the ability to run in parallel!
See commit 89f09dd, commit 044b1f3, commit b6b468b (15 Dec 2015) by Victor Leschuk (vleschuk).
(Merged by Junio C Hamano -- gitster -- in commit bdd1cc2, 12 Jan 2016)
grep: add --threads=<num> option and grep.threads configuration
"git grep" can now be configured (or told from the command line) how
many threads to use when searching in the working tree files.
grep.threads:
Number of grep worker threads to use.
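So, once on Git 2.8 or later, something like this sketch would spread a search over eight threads:
$ git grep --threads=8 foo
$ git config grep.threads 8    # or set it once via configuration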
ack is a code searching tool that is optimized for programmers, especially programmers dealing with large heterogeneous source code trees: http://beyondgrep.com/
Are some of your searches cases where you only want to search a certain type of file, like only Java files? Then you can do
ack --java function
ack does not index the source code, but that may not matter depending on what your search patterns are like. In many cases, searching only certain types of files gives the speedup you need, because you're not also searching all those other XML etc. files.
And if ack doesn't do it for you, here is a list of many tools designed for searching source code: http://beyondgrep.com/more-tools/
We use a tool internally to index very large log files and make efficient searches of them. It has been open-sourced. I don't know how well it scales to large numbers of files, though. It multithreads by default, it searches inside gzipped files, and it caches indexes of previously searched files.
https://github.com/purestorage/4grep
This grep-cache article has a script for caching grep results. The examples were run on Windows with Linux tools installed, so the script can easily be used on *nix/Mac with little modification; it's mostly just a Perl script anyway.
Also, the filesystem itself (assuming you're using *nix) often caches recently read data, making subsequent greps faster, since grep is then effectively searching memory instead of disk.
The page cache can be dropped via /proc/sys/vm/drop_caches if you want to erase it manually, to see the speed difference between an uncached and a cached grep.
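For example (needs root; writing 3 discards the page cache, dentries and inodes):
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches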
Since you mention various kinds of text files that are not really code, I suggest you have a look at GNU ID utils. For example:
cd /tmp
# create index file named 'ID'
mkid -m /dev/null -d text /var/log/messages.*
# query index
gid -r 'spamd|kernel'
These tools focus on tokens, so queries on strings of multiple tokens are not possible. There is minimal integration with the gid command in Emacs.
For the more specific case of indexing source code, I prefer to use GNU global, which I find more flexible. For example:
cd sourcedir
# index source tree
gtags .
# look for a definition
global -x main
# look for a reference
global -xr printf
# look for another kind of symbol
global -xs argc
Global natively supports C/C++ and Java, and with a bit of configuration can be extended to support many more languages. It also has very good Emacs integration: successive queries are stacked, and updating a source file updates the index efficiently. However, I'm not aware that it can index plain text (yet).
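To keep the index fresh after editing, global can update its tag files incrementally instead of re-running gtags from scratch:
$ global -u    # incrementally update the existing tag files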
