Generate file names for proper sequential sorting under shell globbing - linux

I am generating a sequence of PNG images in my program, one by one, in the order I want them. The files are then passed through a tool that converts them to a video file. I want to name them in such a way that the video conversion tool will pick them up in the proper sequence under the file name globbing used by the shell (I am using bash on Linux). I tried a numeric suffix like scene1.png, scene10.png, scene12.png, but the shell doesn't sort globs numerically. I could pass a sorted list like this:
convert -antialias -delay 1x10 $(ls povs/*.png | sort -V) mymovie.mp4
But some programs do their own globbing rather than relying on the shell's (FFmpeg, for example), so this approach does not always work. I am therefore looking for a naming scheme that is guaranteed to produce the correct sequence under shell globbing rules.

You can prefix your file names with a zero-padded integer.
This script emulates what ls * should output after the renaming:
for i in {0..11}; do
    printf '%05d_%s\n' "$i" "file$i"
done
00000_file0
00001_file1
00002_file2
00003_file3
00004_file4
00005_file5
00006_file6
00007_file7
00008_file8
00009_file9
00010_file10
00011_file11
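Applied to the original question, a minimal renaming sketch (assuming the existing files are named povs/scene<N>.png as described, that five digits are enough padding, and that it is run once over the original, non-padded names):
for f in povs/scene*.png; do
    n=${f##*scene}        # strip the path and the "scene" prefix
    n=${n%.png}           # strip the ".png" extension, leaving the number
    mv -- "$f" "$(printf 'povs/scene%05d.png' "$n")"
done
Better still, have the program that generates the images emit zero-padded names directly (the printf-style %05d above), so no renaming pass is needed and a plain povs/*.png glob is already in order.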

Related

shell script to process data from two input files

I have a lot of fastq files for ~100 samples (two per sample: reads1 and reads2). For each sample, I need to feed the two fastq files into Prinseq, a Perl program. Ideally this would be done with a shell script for all samples so I don't have to call the program 100 times manually, but I don't know how to indicate two input files rather than one (i.e., for i in *.fastq; do [perl commands]; done). If it helps, the Prinseq command format is as follows:
perl prinseq-lite.pl -fastq [file for reads1] -fastq2 [file for
reads2] -derep [options]
This is probably a very easy answer, but I can't find it.
You can loop over all R1 files and use parameter expansion to chop off the R1_001.fastq part (and replace it by the R2 version):
for i in *_R1_001.fastq; do
    perl prinseq-lite.pl -fastq "$i" -fastq2 "${i%R1_001.fastq}R2_001.fastq"
done
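To see what the expansion produces, here is a quick check with a hypothetical name (sampleA_R1_001.fastq is just an illustration, not one of your files):
$ i=sampleA_R1_001.fastq
$ echo "${i%R1_001.fastq}R2_001.fastq"
sampleA_R2_001.fastq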

concatenate two strings and one variable using bash

I need to generate a filename from three parts: two strings and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the following files instead:
_1.fastq.gze_11
_1.fastq.gze_12
It looks as if the string after the variable overwrites the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer to: Dangerous Backticks
while read -r f
do
    echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
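With the two sample lines shown in the question (and Unix line endings in files.csv), the expected output would be:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze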
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo "fastq/${f}_1.fastq.gze"
See this answer for some details about the general concept, as well.
Edit: An additional thought, looking at the now-provided output: this may not be a coding problem at all, but rather a conflict between the file's line endings and the terminal.
Specifically, if the CSV file ends its lines with a carriage return (ASCII/Unicode 13), the carriage return at the end of Sample_11 "rewinds" the cursor to the start of the line, so the text printed after it overwrites the beginning.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script around something like while) with something that strips the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
    echo "fastq/${f}_1.fastq.gze"
done
As the cited article explains, octal 11 is a tab, 12 a line feed, and 40-176 are printable characters (Unicode would require more thought). If for some reason there are no line feeds in the file at all, you probably want tr '\015' '\012' instead, which converts the carriage returns to line feeds.
Of course, at that point, the better fix is to find whatever produces the file and ask for reasonable line endings in the first place...
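For completeness, a minimal sketch of the while-read re-architecture that tolerates CRLF input (the ${f%$'\r'} expansion is a bash-ism that simply drops a trailing carriage return, if any):
while IFS= read -r f
do
    f=${f%$'\r'}
    echo "fastq/${f}_1.fastq.gze"
done < files.csv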

how to use do loop to read several files with similar names in shell script

I have several files named scale1.dat, scale2.dat, scale3.dat, ... up to scale9.dat.
I want to read these files in a do loop one by one, and do some manipulation with each file (I want to write the 1st column of each scale*.dat file to the corresponding scale*.txt).
So my question is: is there a way to read files with similar names? Thanks.
The regular syntax for this is
for file in scale*.dat; do
    awk '{print $1}' "$file" > "${file%.dat}.txt"
done
The asterisk * matches any text or no text; if you want to constrain to just single non-zero digits, you could say for file in scale[1-9].dat instead.
In Bash, there is additionally the non-standard brace expansion scale{1..9}.dat (strictly speaking this is not globbing: the names are generated whether or not the files exist), but it is Bash-only, and so will not work in #!/bin/sh scripts. (Your question has both sh and bash, so it's not clear which you require. Your comment that the Bash syntax is not working for you suggests that you may need a POSIX-portable solution.) Furthermore, Bash has something called extended globbing, which allows for quite elaborate pattern matching. See also http://mywiki.wooledge.org/glob
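A quick illustration of the difference, run in a directory with no matching files: brace expansion generates the names unconditionally, while an unmatched glob is (by default) left as-is:
$ echo scale{1..3}.dat
scale1.dat scale2.dat scale3.dat
$ echo scale[1-9].dat
scale[1-9].dat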
For a simple task like this, you don't really need the shell at all, though.
awk 'FNR==1 { if (f) close(f); f=FILENAME; sub(/\.dat/, ".txt", f); }
     { print $1 > f }' scale[1-9]*.dat
(Okay, maybe that's slightly intimidating for a first-timer. But the basic point is that you will often find that the commands you want to use will happily work on multiple files, and so you don't need shell loops at all in those cases.)
I don't think so. Similar names or not, you will have to iterate through all your files (perhaps with a for loop) and use a nested loop to iterate through lines or words or whatever you plan to read from those files.
Alternatively, you can copy your files into one (say, scale-all.dat) and read that single file.

Order of the file reading from a directory in linux

Suppose a directory contains 100 files with names like file.pcap1, file.pcap2, file.pcap3, ..., file.pcap100. In a shell script, to read these files one by one, I have written a loop like:
for $file in /root/*pcap*
do
Something
done
What is the order in which the files are read? Are they read in increasing order of the numbers at the end of the file names? Is this the same on all Linux machines?
It is sorted by file name, just like the default ls output (with no flags).
Also, you need to remove the $ in your for loop:
for file in /root/*pcap*
A POSIX shell returns the paths sorted according to the current locale:
If the pattern matches any existing filenames or pathnames, the
pattern shall be replaced with those filenames and pathnames, sorted
according to the collating sequence in effect in the current locale
This means file.pcap10 comes before file.pcap2. You probably want a natural sort order instead, e.g., the Python analog of the natsort function (sort a list using a "natural order" algorithm).
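If the names cannot be changed to zero-padded numbers, one workaround is to sort the glob yourself. A sketch, assuming GNU coreutils (so sort -V, "version sort", is available) and file names without embedded newlines:
printf '%s\n' /root/*pcap* | sort -V | while IFS= read -r file
do
    echo "processing $file"    # replace with the real work
done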

grepping for a large binary value from an even larger binary file

As the title suggests, I would like to grep a reasonably large (about 100 MB) binary file for a binary string; this binary string is just under 5 KB.
I've tried grep with the -P option, but it only seems to return matches when the pattern is a few bytes long; at around 100 bytes it no longer finds any matches.
I've also tried bgrep. This worked well originally; however, when I extended the pattern to its current length I just get "invalid/empty search string" errors.
The irony is that on Windows I can use HxD to search the file and it finds it in an instant. What I really need, though, is a Linux command-line tool.
Thanks for your help,
Simon
Say we have a couple of big binary data files. For a big one that shouldn't match, we create a 100MB file whose contents are all NUL bytes.
dd ibs=1 count=100M if=/dev/zero of=allzero.dat
For the one we do want to match, create 100 MB of random bytes (save the script below as mkrand).
#! /usr/bin/env perl
use warnings;
binmode STDOUT or die "$0: binmode: $!";
for (1 .. 100 * 1024 * 1024) {
    print chr rand 256;
}
Execute it as ./mkrand >myfile.dat.
Finally, extract a known match into a file named pattern.
dd skip=42 count=10 if=myfile.dat of=pattern
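As a sanity check on the size: dd's default block size is 512 bytes, so skip=42 count=10 copies 10 blocks, i.e. 5120 bytes, in the region of the ~5 K pattern from the question.
$ wc -c < pattern
5120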
I assume you want only the files that match (-l) and want your pattern to be treated literally (-F or --fixed-strings). I suspect you may have been running into a length limit with -P.
You may be tempted to use the --file=PATTERN-FILE option, but grep interprets the contents of PATTERN-FILE as a list of newline-separated patterns, so in the likely case that your 5 KB pattern contains newline bytes, it will be split into many shorter patterns rather than treated as one.
So hope your system's ARG_MAX is big enough and go for it. Be sure to quote the contents of pattern. For example:
$ grep -l --fixed-strings "$(cat pattern)" allzero.dat myfile.dat
myfile.dat
Try using grep -U, which treats the file(s) as binary.
Also, how are you specifying the search pattern? It might just need escaping to survive shell parameter expansion.
As the string you are searching for is pretty long, you could benefit from an implementation of the Boyer-Moore search algorithm, which is very efficient when the search string is long:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
The wiki also has links to some sample code.
You might want to look at a simple Python script.
match= (b"..."
b"...."
b"..." ) # Some byte string literal of immense proportions
with open("some_big_file","rb") as source:
block= read(len(match))
while block != match:
byte= read(1)
if not byte: break
block= block[1:]+read(1)
This might work reliably under Linux as well as Windows.
