Referring to files two by two in a bash command (Linux)

I have this list of files that I have to analyse in pairs (a_1 with a_2, b_1 with b_2, and so on):
a_1.fq
a_2.fq
b_1.fq
b_2.fq
c_1.fq
...
I want to set up a for loop that refers to these pairs of files in a command. Below
is just an example of what I want to do (with invalid syntax):
$ for File1 File2 in *1.fq *2.fq; do STAR --readFilein File1 File2 ; done
Thank you a lot for your help

Use a function to process pairs and iterate the glob expansion:
process_pair()
{
while [ $# -gt 0 ]
do
f1=$1 # get 1st argument
shift # shift to next argument
f2=$1 # get 2nd argument
shift # shift to next for the next round
# Do stuff with f1 and f2
printf 'f1=%s\tf2=%s\n' "$f1" "$f2"
done
}
# Submit pattern expansion as process_pair arguments
process_pair ./*_[12].fq

You can iterate over just one type of the files and use parameter expansion to derive the second one:
for file1 in *1.fq; do
file2=${file1%1.fq}2.fq
...
The %pattern removes the pattern from the end of the variable's value.
You might want to check for the file2's existence before running the command.
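A minimal sketch of that existence check, assuming the two mates differ only in the trailing 1.fq / 2.fq. The name mate_of is a hypothetical helper, and the printf lines stand in for the real command:

```bash
# Hypothetical helper: derive the name of the second file from the first.
mate_of() {
    printf '%s\n' "${1%1.fq}2.fq"
}

for file1 in *1.fq; do
    [ -e "$file1" ] || continue        # the glob matched nothing
    file2=$(mate_of "$file1")
    if [ -e "$file2" ]; then
        printf 'pair: %s %s\n' "$file1" "$file2"
    else
        printf 'skipping %s: %s not found\n' "$file1" "$file2" >&2
    fi
done
```

For example, mate_of a_1.fq gives a_2.fq.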
Or, if you can get the files listed so that the pairs are adjacent, you can populate $@ with them and shift the parameters by two:
set -- [a-z]_[12].fq
while (( $# )) ; do
file1=$1
file2=$2
shift 2
...
done
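If a file is missing, the list has an odd length and every later pair is misaligned. A rough sketch of a guard, with explicit arguments instead of the glob so it can be tried anywhere; pair_up is a hypothetical name and the printf stands in for the real command:

```bash
# Hypothetical wrapper: refuse to run when the files cannot be paired up.
pair_up() {
    if (( $# % 2 )); then
        echo "pair_up: odd number of files ($#)" >&2
        return 1
    fi
    while (( $# )); do
        printf '%s\t%s\n' "$1" "$2"
        shift 2
    done
}

pair_up a_1.fq a_2.fq b_1.fq b_2.fq
```

This only catches an odd count; it does not check that the two names in a pair actually belong together.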

You can try something like that:
for letter in {a..z}
do
# Your logic
echo "Working on $letter"
cat "${letter}_1.fq"
cat "${letter}_2.fq"
done

Related

Create variable equal to the lines in the file, and assign the variables the value from the file sequentially

I want to create a number of variables equal to the lines in a file and assign to each of those variables a value from the file sequentially.
Say,
file1 contains device1 device2 device3 .....
file2 contains olddevice1 olddevice2 olddevice3 .....
I want values as when I do echo $A = device1
Similarly echo $B = device2 and echo $Z = device26
I tried a for loop, and even an array, but couldn't get through it.
I have tried something like below:
iin=0
var=({A..Z})
for jin in `cat file1`
do
array[$iin]="$var=$jin";
iin=$(($iin+1));
var="$(echo $var | tr '[A-Y]Z' '[B-Z]A')"
printf '%s\n' "${array[@]}"
done
I believe you're missing the point: variables have fixed names in programming languages, like $A, $B, ..., $Z. While programming, you need to specify those variables inside your program; you can't expect your program to invent its own variables.
What you are looking for are collections, such as arrays or lists:
You create a collection A and you can add values to it (A[n]=value_n, or A.SetAt(n, value_n), ..., depending on the kind of collection you're using).
With bash (v4 and later) something like this mapfile code should work:
mapfile -O 1 -t array1 < file1
mapfile -O 1 -t array2 < file2
# output line #2 from both files
echo "${array1[2]}" "${array2[2]}"
# output the last line from both files
echo "${array1[-1]}" "${array2[-1]}"
Notes: mapfile just loads an array, but with a few more options.
-O 1 sets the array subscript to start at 1 rather than the default 0; this isn't necessary, but it makes the code easier to read.
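To try this without creating the files, here is a small sketch that feeds three made-up lines to mapfile through process substitution. The array name devices and the sample values are only for illustration:

```bash
# Stand-in for file1: three lines supplied by printf.
mapfile -O 1 -t devices < <(printf '%s\n' device1 device2 device3)

echo "${devices[1]}"       # first line
echo "${#devices[@]}"      # number of lines loaded
# "${!devices[@]}" expands to the indices in use (1 2 3 here, since -O 1 skips 0)
for i in "${!devices[@]}"; do
    printf 'line %d: %s\n' "$i" "${devices[i]}"
done
```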

How do you interpret ${VAR#*:*:*} in Bourne Shell

I am using Bourne Shell. Need to confirm if my understanding of following is correct?
$ echo $SHELL
/bin/bash
$ VAR="NJ:NY:PA" <-- declare an array with semicolon as separator?
$ echo ${VAR#*} <-- show entire array without separator?
NJ:NY:PA
$ echo ${VAR#*:*} <-- show array after first separator?
NY:PA
$ echo ${VAR#*:*:*} <-- show string after two separator
PA
${var#pattern} is a parameter expansion that expands to the value of $var with the shortest possible match for pattern removed from the front of the string.
Thus, ${VAR#*:} removes everything up to and including the first :; ${VAR#*:*:} removes everything up to and including the second :.
The trailing *s on the end of the expansions given in the question don't have any use, and should be avoided: There's no reason whatsoever to use ${var#*:*:*} instead of ${var#*:*:} -- since these match the smallest amount of text possible, and * is allowed to expand to 0 characters, the final * matches and removes nothing.
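For comparison, a short sketch of the related forms on the same string: ## removes the longest match from the front, while % and %% do the same from the end.

```bash
VAR="NJ:NY:PA"

echo "${VAR#*:}"     # shortest match from the front  -> NY:PA
echo "${VAR#*:*:}"   # up to the second colon         -> PA
echo "${VAR##*:}"    # longest match from the front   -> PA
echo "${VAR%:*}"     # shortest match from the end    -> NJ:NY
echo "${VAR%%:*}"    # longest match from the end     -> NJ
```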
If what you really want is an array, you might consider using a real array instead.
# read contents of string VAR into an array of states
IFS=: read -r -a states <<<"$VAR"
echo "${states[0]}" # will echo NJ
echo "${states[1]}" # will echo NY
echo "${#states[@]}" # count states; will emit 3
...which also gives you the ability to write:
printf ' - %s\n' "${states[@]}" # put *all* state names into an argument list

IFS and moving through single positions in directory

I have two questions.
I found the following line in a script: IFS=${IFS#??}
I would like to understand what exactly it is doing.
I am also trying to do something with every component of a directory path, e.g.:
$1 = home/user/bin/etc/something...
so I need to change IFS to "/" and then process it in a for loop like
while [ -e "$1" ]; do
for F in `$1`
#do something
done
shift
done
Is that the correct way ?
${var#??} is a shell parameter expansion. It tries to match the beginning of $var with the pattern written after #. If it does, it returns the variable $var with that part removed. Since ? matches any character, this means that ${var#??} removes the first two chars from the var $var.
$ var="hello"
$ echo ${var#??}
llo
So with IFS=${IFS#??} you are resetting IFS to its value with the first two characters removed.
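As an illustration, and assuming IFS still has its default value (space, tab, newline) at that point in the script, the expansion leaves only the newline. The sketch below works on a copy so the real IFS is untouched; the variable names are made up:

```bash
# Assumption: the default IFS value, held in a copy.
default_ifs=$' \t\n'
newline_only=${default_ifs#??}   # drop the space and the tab

printf '%q\n' "$newline_only"    # prints $'\n'
```

With IFS reduced to a newline, word splitting happens on line breaks only, not on spaces or tabs.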
To loop through the words in a /-delimited string, you can store the split string in an array and then loop through it:
$ IFS="/" read -r -a myarray <<< "home/user/bin/etc/something"
$ for w in "${myarray[@]}"; do echo "-- $w"; done
-- home
-- user
-- bin
-- etc
-- something
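One detail worth knowing if the path is absolute: the text before the leading / becomes an empty first element. A small sketch that skips it; split_path is a hypothetical helper that fills a global array named parts:

```bash
# Hypothetical helper: split a path on "/" into the global array "parts".
split_path() {
    IFS=/ read -r -a parts <<< "$1"
}

split_path "/home/user/bin"
for w in "${parts[@]}"; do
    [ -n "$w" ] || continue      # skip the empty field before the leading /
    echo "-- $w"
done
```

Because IFS is set only as a prefix to read, the shell's IFS is unchanged afterwards.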

How to get value from command line using for loop

Following is the code for extracting input from command line into bash script:
input=(*);
for i in {1..5..1}
do
input[i]=$($i);
done;
My question is: how to get $1, $2, $3, $4 values from input command line, where command line code input is:
bash script.sh "abc.txt" "|" "20" "yyyy-MM-dd"
Note: Not using for i in "${@}"
#!/bin/bash
for ((i=$#-1;i>=0;i--)); do
echo "${BASH_ARGV[$i]}"
done
Example: ./script.sh a "foo bar" c
Output:
a
foo bar
c
I don't know what you have against for i in "$@"; do..., but you can certainly do it with shift, for example:
while [ $# -gt 0 ]; do
printf " '%s'\n" "$1"
shift
done
Output
$ bash script.sh "abc.txt" "|" "20" "yyyy-MM-dd"
'abc.txt'
'|'
'20'
'yyyy-MM-dd'
Personally, I don't see why you exclude for i in "$@"; do ... it is a valid way to iterate through the args that will preserve quoted whitespace. You can also use the array and C-style for loop as indicated in the other answers.
Note: if you are going to use your input array, you should use input=("$@") instead of input=($*). Using the latter will not preserve quoted whitespace in your positional parameters, e.g.
input=("$@")
for ((i = 0; i < ${#input[@]}; i++)); do
printf " '%s'\n" "${input[i]}"
done
works fine, but if you use input=($*) with arguments like "a b", it will treat those as two separate arguments.
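A quick sketch of the difference; the two function names are made up for the demonstration:

```bash
# Each function copies its arguments into an array and reports the element count.
count_quoted()   { local a=("$@"); echo "${#a[@]}"; }
count_unquoted() { local a=($*);   echo "${#a[@]}"; }

count_quoted   "a b" c    # 2: "a b" stays one element
count_unquoted "a b" c    # 3: "a b" is split into a and b
```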
If I'm correctly understanding what you're trying to do, you can write:
input=("$@")
to copy the positional parameters into an array named input.
If you specifically want only the first five positional parameters, you can write:
input=("${@:1:5}")
Edited to add: Or are you asking, given a variable i that contains the integer 2, how you can get $2? If that's your question, then you can use indirect expansion, where Bash retrieves the value of a variable and then uses that value as the name of the variable to substitute. Indirect expansion uses the ! character:
i=2
input[i]="${!i}" # same as input[2]="$2"
This is almost always a bad idea, though. You should rethink what you're doing.
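For completeness, a sketch of what the loop from the question could look like with indirect expansion. collect_args is a hypothetical name; it fills a global array input whose indices start at 1 so that they line up with $1, $2, and so on:

```bash
# Hypothetical helper: copy the positional parameters into "input" one by one.
collect_args() {
    local i
    input=()
    for (( i = 1; i <= $#; i++ )); do
        input[i]=${!i}        # ${!i} with i=2 expands to $2
    done
}

collect_args "abc.txt" "|" "20" "yyyy-MM-dd"
printf "'%s'\n" "${input[@]}"
```

Apart from the index offset this does the same job as input=("$@"), which is shorter and harder to get wrong.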

Use grep to remove words from dictionary whose roots are already present

I am trying to write a random passphrase generator. I have a dictionary with a bunch of words and I would like to remove words whose root is already in the dictionary, so that a dictionary that looks like:
ablaze
able
abler
ablest
abloom
ably
would end up with only
ablaze
able
abloom
ably
because abler and ablest contain able which was previously used.
I would prefer to do this with grep so that I can learn more about how it works. I am capable of writing a program in C or Python that will do this.
If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words
If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at beginning of line (this is now the prefix regex r) and also remember the prefix string in p.
If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.
awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
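Both scripts use the previous word as a regex, so a word containing a character such as . would match more loosely than intended. A possible variant, still assuming sorted input, compares literal prefixes with index() instead; the variable names prev and result are made up, and printf stands in for the words file:

```bash
# index($1, prev) == 1 means the current word starts with the previously kept word.
result=$(printf '%s\n' ablaze able abler ablest abloom ably |
    awk 'NR > 1 && index($1, prev) == 1 { next } { print; prev = $1 }')

printf '%s\n' "$result"
```

On the sample list from the question this keeps ablaze, able, abloom and ably.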
This is based on the assumption that the input file is sorted. In that case, when looking up each word, all matches after the first one can be safely skipped (because they will correspond to "the same word with a different suffix").
#!/bin/bash
input=$1
while read -r word ; do
# ignore short words
if [ ${#word} -lt 4 ] ; then continue; fi
# output this line
echo "$word"
# skip next lines that start with $word as prefix
skip=$(grep -c -E -e "^${word}" "$input")
for ((i=1; i<skip; i++)) ; do read -r word ; done
done <"$input"
Call as ./filter.sh input > output
This takes somewhat less than 2 minutes on all words of 4 or more letters found in my /usr/share/dict/american-english dictionary. The algorithm is O(n²), and therefore unsuitable for large files.
However, you can speed things up a lot if you avoid using grep at all. This version takes only 4 seconds to do the job (because it does not need to scan the whole file almost once per word). Since it performs a single pass over the input, its complexity is O(n):
#!/bin/bash
input=$1
while true ; do
# use already-read word, or fail if cannot read new
if [ -n "$next" ] ; then word=$next; unset next;
elif ! read -r word ; then break; fi
# ignore short words
if [ ${#word} -lt 4 ] ; then continue; fi
# output this word
echo "$word"
# skip words that start with $word as prefix
while read -r next ; do
# quoting "$word" makes the prefix match literal rather than a glob pattern
unique=${next#"$word"}
if [ ${#next} -eq ${#unique} ] ; then break; fi
done
done <"$input"
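The same single-pass idea can be written more compactly with a [[ ... ]] prefix test. This is only a sketch: dedupe_prefixes is a hypothetical name, it reads sorted words from standard input, and it leaves out the minimum-length filter used above.

```bash
# Print a word unless it starts with the most recently printed word.
dedupe_prefixes() {
    local word root=
    while IFS= read -r word; do
        if [[ -n $root && $word == "$root"* ]]; then
            continue             # same root as the last kept word
        fi
        printf '%s\n' "$word"
        root=$word
    done
}

printf '%s\n' ablaze able abler ablest abloom ably | dedupe_prefixes
```

Quoting "$root" inside the pattern keeps the comparison literal, so glob characters in a word are not interpreted.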
Supposing you want to start with words that share the same first four (up to ten) letters, you could do something like this:
cp /usr/share/dict/words words
str="...."
for num in 4 5 6 7 8 9 10; do
for word in `grep "^$str$" words`; do
grep -v "^$word." words > words.tmp
mv words.tmp words
done
str=".$str"
done
You wouldn't want to start with 1 letter, unless 'a' is not in your dictionary, etc.
Try this Bash script:
a=()
while read -r w; do
[[ ${#a[@]} -eq 0 ]] && a+=("$w") && continue
grep -qvf <(printf "^%s\n" "${a[@]}") <<< "$w" && a+=("$w")
done < file
printf "%s\n" "${a[@]}"
ablaze
able
abloom
ably
It seems like you want to group adverbs together. Some adverbs, including those that can also be adjectives, use er and est to form comparisons:
able, abler, ablest
fast, faster, fastest
soon, sooner, soonest
easy, easier, easiest
This procedure is known as stemming in natural language processing, and can be achieved using a stemmer or lemmatizer. There are popular implementations in Python's NLTK module, but the problem is not completely solved. The best out-of-the-box stemmer is the Snowball stemmer, but it does not stem adverbs to their root.
import nltk
initial = '''
ablaze
able
abler
ablest
abloom
ably
fast
faster
fastest
'''.splitlines()
snowball = nltk.stem.snowball.SnowballStemmer("english")
stemmed = [snowball.stem(word) for word in initial]
print(set(stemmed))
output...
set(['', u'abli', u'faster', u'abl', u'fast', u'abler', u'abloom', u'ablest', u'fastest', u'ablaz'])
The other option is to use a regex stemmer, but this has its own difficulties, I'm afraid.
patterns = "er$|est$"
regex_stemmer = nltk.stem.RegexpStemmer(patterns, 4)
stemmed = [regex_stemmer.stem(word) for word in initial]
print(set(stemmed))
output...
set(['', 'abloom', 'able', 'abl', 'fast', 'ably', 'ablaze'])
If you just want to weed out some of the words, this crude command will work. Note that it'll throw out some legitimate words like best, but it's dead simple. It assumes you have a test.txt file with one word per line.
egrep -v "er$|est$" test.txt >> results.txt
egrep is the same as grep -E. -v means throw out matching lines. x|y means if x or y match, and $ means end of line, so you'd be looking for words that end in er or est
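To see what the filter keeps and drops, here is a small sketch on a made-up word list fed through printf instead of test.txt, written with grep -E; the variable name filtered is only for the demonstration:

```bash
# Drop every word that ends in "er" or "est".
filtered=$(printf '%s\n' able abler ablest best fast water |
    grep -v -E 'er$|est$')

printf '%s\n' "$filtered"
```

Only able and fast survive: abler and ablest are removed as intended, but so are best and water, which shows how blunt the pattern is.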

Resources