How to pass a list of files to the parallel command and execute a downstream command such as samtools?

I have a list of files which I want to sort and index. I listed all those files in a text file:
/run/media/punit/data1/GSE74246/tophat_output/CMP_SRR2753096/CMP_6792.bam
/run/media/punit/data1/GSE74246/tophat_output/CMP_SRR2753104/CMP_7256.bam
The above is just a sample of the data I want to sort and index.
Now I want to use this command:
ls *.bam | parallel "samtools view -b -S {} | samtools sort - {.}; samtools index {.}.bam"
Meanwhile, I also have files with the .bam extension, such as unmapped.bam, which I don't want to sort and index.
How can I exclude those unmapped.bam files? They aren't in my list, but I still wonder: if I use parallel with ls like this, would it sort and index them too?

ls *.bam | grep -v unmapped | parallel ...
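Since the files are already listed in a text file, parallel can read that list directly instead of the ls output. A minimal sketch, assuming the list is saved as files.txt (a hypothetical name) and reusing the same downstream command; parallel's {} is the input path and {.} is that path with its extension stripped:

# drop any unmapped.bam entries from the list, then sort and index each BAM in parallel
grep -v unmapped files.txt | parallel "samtools view -b -S {} | samtools sort - {.}; samtools index {.}.bam"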

Related

grep search for piped terms: "Argument list too long"

I have something like
grep ... | grep -f - *orders*
where the first grep ... gives a list of order numbers like
1393
3435
5656
4566
7887
6656
and I want to find those orders in multiple files (a_orders_1, b_orders_3, etc.). These files look something like
1001|strawberry|sam
1002|banana|john
...
However, when the first grep ... returns too many order numbers, I get the error "Argument list too long".
I also tried to give the grep command one order number at a time using a while loop but that's just way too slow. I did
grep ... | while read order; do grep $order *orders*; done
I'm very new to Unix clearly, explanations would be greatly appreciated, thanks!
The problem is the expansion of *orders* in grep ... | grep -f - *orders*. Your shell expands the pattern to the full list of files before passing that list to grep.
So we need to pass fewer "orders" files to each grep invocation. The find program is one way to do that, because it accepts wildcards and expands them internally:
find . -name '*orders*' # note this searches subdirectories too
Now that you know how to generate the list of filenames without running into the command line length limit, you can tell find to execute your second grep:
grep ... | find . -name '*orders*' -exec grep -f - {} +
The {} is where find places the filenames, and the + terminates the command. It also tells find you're OK with passing multiple filenames to each invocation of grep -f; find still respects the command line length limit, invoking grep -f more than once if the list of files exceeds the allowed length of a single command.
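An equivalent sketch using xargs, which batches filenames under the length limit the same way -exec ... + does. Capturing the order numbers in a temporary file first (/tmp/orders.txt is a hypothetical name) also keeps each batched grep from competing for the same stdin:

grep ... > /tmp/orders.txt   # capture the order numbers first
find . -name '*orders*' -print0 | xargs -0 grep -f /tmp/orders.txt

The -print0/-0 pair keeps filenames containing spaces intact.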

Quickly list a random set of files in a directory in Linux

Question:
I am looking for a performant, concise way to list N randomly selected files in a Linux directory using only Bash. The files must be randomly selected from different subdirectories.
Why I'm asking:
In Linux, I often want to test a random selection of files in a directory for some property. The directories contain thousands of files, so I only want to test a small number of them, but I want to take them from different subdirectories of the directory of interest.
The following returns the paths of 50 "randomly"-selected files:
find /dir/of/interest/ -type f | sort -R | head -n 50
The directory contains many files and resides on a mounted file system with slow read times (accessed over ssh), so the command can take many minutes. I believe the issue is that find first lists every file (slow), and only then is a random selection printed.
If you are using locate and updatedb updates regularly (daily is probably the default), you could:
$ locate /home/james/test | sort -R | head -5
/home/james/test/10kfiles/out_708.txt
/home/james/test/10kfiles/out_9637.txt
/home/james/test/compr/bar
/home/james/test/10kfiles/out_3788.txt
/home/james/test/test
How often do you need it? Do the work periodically in advance to have it quickly available when you need it.
Create a refreshList script.
#!/usr/bin/env bash
find /dir/of/interest/ -type f | sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
Put it in your crontab.
0 7-20 * * 1-5 nice -n 19 ~/refreshList
Then, during those working hours, you will always have a ~/rand.list that's under an hour old.
If you don't want to use cron and aren't too picky about how old the list is, just write a function that refreshes the file each time after you use it.
randFiles() {
    cat ~/rand.list                  # serve the existing list immediately
    {                                # ...and refresh it in the background
        find /dir/of/interest/ -type f |
            sort -R | head -n 50 > /tmp/rand.list
        mv -f /tmp/rand.list ~
    } &
}
If you can't run locate and the find command is too slow, is there any reason this has to be done in real time?
Would it be possible to use cron to dump the output of the find command into a file and then do the random pick out of there?
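As an aside, if GNU coreutils is available, shuf -n picks a random subset without sorting the entire list, so it can stand in for sort -R | head in any of the snippets above:

# pick 50 random files; shuf -n samples instead of shuffling the whole list
find /dir/of/interest/ -type f | shuf -n 50

It won't speed up the slow find over ssh, but it removes the cost of the full random sort.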

How to reverse and sort files in Linux?

I'm trying to take the file names in a directory, reverse them, and then sort the results alphabetically.
So that
Cat
Dog
would produce the output
God
Tac
If you use a terminal, write:
ls -r
It shows your files and directories in reverse order.
I don't know what kind of graphical interface you use. In GNOME, you can show files as a list and sort by name in descending order by clicking the column header.
If you want to work with the file and folder names without changing them, I hope this will help you:
ls your-path | rev | sort
This will do the trick in plain shell script (list the files of a directory, reverse their names, and sort them alphabetically):
ls | rev | sort
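A quick check with the two example names (note that rev reverses each line character by character, so each capital letter stays attached to its own position):

$ printf 'Cat\nDog\n' | rev | sort
goD
taC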

Listing the results of the du command in alphabetical order

How can I list the results of the du command in alphabetical order?
I know I can use the find command to list them alphabetically, but that gives no directory sizes. I also use the max-depth option with both commands so that the listing only goes down one subdirectory.
Here's the assignment question:
Write a shell script that implements a directory size analyzer. In your script you may use common Linux commands. The script should list the disk storage occupied by each immediate subdirectory of a given argument or the current directory (if no argument is given) with the subdirectory names sorted alphabetically. Also, list the name of the subdirectory with the highest disk usage along with its storage size. If more than one subdirectory has the same highest disk usage, list any one of those subdirectories. Include meaningful brief comments. List of bash commands applicable for this script includes the following but not limited: cat, cut, du, echo, exit, for, head, if, ls, rm, sort, tail, wc. You may use bash variables as well as temporary files to hold intermediate results. Delete all temporary files at the end of the execution.
Here is my result after entering du $dir -hk --max-depth=2 | sort -o temp1.txt and then cat temp1.txt at the command line:
12 ./IT_PLAN/Inter_Disciplinary
28 ./IT_PLAN
3 ./IT_PLAN/Core_Courses
3 ./IT_PLAN/Pre_reqs
81 .
9 ./IT_PLAN/IT_Electives
It should look like this:
28 ./IT_PLAN
3 ./IT_PLAN/Core_Courses
12 ./IT_PLAN/Inter_Disciplinary
9 ./IT_PLAN/IT_Electives
The subdirectory with the maximum disk space use:
28 ./IT_PLAN
Once again, I'm having trouble sorting the results alphabetically.
Try this:
du $dir -hk --max-depth=2 | sort -k2
-k2 sorts on column 2 (the pathname).
See http://www.manpagez.com/man/1/sort/
du $dir -hk --max-depth=2 | awk '{print $2"\t"$1}' | sort -d -k1 -o temp1.txt
and if you want to remove the leading ./ from the paths:
du $dir -hk --max-depth=2 | awk '{print $2"\t"$1}' | sed -e 's/\.\///g' | sort -d -k1 -o temp1.txt
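Neither answer covers the second half of the assignment (reporting the subdirectory with the highest disk usage), so here is a minimal sketch along the same lines. It assumes GNU du and sort, and paths without whitespace; plain -k is used rather than -hk so the sizes sort numerically, and the temporary file name is arbitrary:

#!/usr/bin/env bash
# Directory size analyzer: list each immediate subdirectory's disk usage
# sorted alphabetically, then report the subdirectory using the most space.
dir=${1:-.}                     # use the argument if given, else the current directory
tmp=/tmp/sizes.$$               # temporary file for intermediate results
du -k --max-depth=1 "$dir" |    # sizes of immediate subdirectories (plus $dir itself)
    awk -v d="$dir" '$2 != d' | # drop the line for $dir itself
    sort -k2 > "$tmp"           # alphabetical by pathname (column 2)
cat "$tmp"
echo "The subdirectory with the maximum disk space use:"
sort -n "$tmp" | tail -n 1      # numeric sort on the size column; the last line is the largest
rm -f "$tmp"                    # delete temporary files at the end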

Sort & uniq in Linux shell

What is the difference between the following two commands?
sort -u FILE
sort FILE | uniq
Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files, as it can eliminate duplicates while sorting each batch. If the data is highly duplicative, this can be beneficial; if there are in fact few duplicates, it won't make much difference (definitely a second-order performance effect, compared to the first-order effect of the pipe).
Note that there are times when the piping is appropriate. For example:
sort FILE | uniq -c | sort -n
This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
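A quick illustration on toy input (uniq -c prepends the count of each line, and the final sort -n orders by that count):

$ printf 'b\na\nb\nb\na\n' | sort | uniq -c | sort -n
      2 a
      3 b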
There are times when not using the pipe is important. For example:
sort -u -o FILE FILE
This sorts the file 'in situ': the output file is specified by -o FILE, and the operation is guaranteed safe because sort reads the input file completely before overwriting it with output.
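For contrast, the naive pipe version is unsafe, because the shell truncates the output file before the pipeline runs, so sort reads an already-empty file:

sort FILE | uniq > FILE   # WRONG: > FILE truncates the input before sort reads it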
There is one slight difference: the return code.
The thing is that unless set -o pipefail is enabled, the return code of a pipeline is the return code of its last command, and uniq always returns zero (success). Try examining the exit code, and you'll see something like this (pipefail is not set here):
pavel@lonely ~ $ sort -u file_that_doesnt_exist ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
2
pavel@lonely ~ $ sort file_that_doesnt_exist | uniq ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
0
Other than this, the commands are equivalent.
Beware! While it's true that "sort -u" and "sort|uniq" are equivalent, any additional options to sort can break the equivalence. Here's an example from the coreutils manual:
For example, 'sort -n -u' inspects only the value of the initial numeric string when checking for uniqueness, whereas 'sort -n | uniq' inspects the entire line.
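A concrete illustration of that difference (under -n -u the two lines compare numerically equal, so only one survives; which one is unspecified):

$ printf '1 apple\n1 banana\n' | sort -n -u
1 apple
$ printf '1 apple\n1 banana\n' | sort -n | uniq
1 apple
1 banana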
Similarly, if you sort on key fields, the uniqueness test used by sort won't necessarily look at the entire line anymore. After being bitten by that bug in the past, these days I tend to use "sort|uniq" when writing Bash scripts. I'd rather have higher I/O overhead than run the risk that someone else in the shop won't know about that particular pitfall when they modify my code to add additional sort parameters.
sort -u will be slightly faster, because it does not need to pipe the output between two commands
also see my question on the topic: calling uniq and sort in different orders in shell
I have worked on some servers where sort doesn't support the -u option; there we had to use
sort xyz | uniq
Nothing, they will produce the same result.
