Is there a way to make permutations for file names in a for loop in linux bash?

The idea is that you have three text files, let's name them A, B, and C, each containing a single column of strings (the content doesn't matter in this example). What you want is to join these files pairwise, so you'll have a join for A - B, another one for B - C, and a last one for A - C, as if it were a permutation.
Let's make a graphic example.
The individual code would be
join -1 1 -2 1 A.txt B.txt > AB.txt
and so on for the other two pairs.
Imagine A has
100
101
102
104
B has
101
103
104
105
C has
100
103
104
105
So A - B comparison (AB.txt) would be:
101
104
A - C comparison (AC.txt):
100
104
B - C comparison (BC.txt):
103
105
And you'll have three output files named after the comparisons: AB.txt, AC.txt and BC.txt.

A solution might look like this:
#!/usr/bin/env bash
# Read positional parameters into array
list=("$#")
# Loop over all but the last element
for ((i = 0; i < ${#list[#]} - 1; ++i)); do
# Loop over the elements starting with the first after the one i points to
for ((j = i + 1; j < ${#list[#]}; ++j)); do
# Run the join command and redirect to constructed filename
join "${list[i]}" "${list[j]}" > "${list[i]%.txt}${list[j]%.txt}".txt
done
done
Notice that the -1 1 -2 1 is the default behaviour for join and can be skipped.
The script has to be called with the filenames as the parameters:
./script A.txt B.txt C.txt
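Run against the example data from the question (which happens to be in the sorted order join needs), this produces the three expected files:
./script A.txt B.txt C.txt
cat AB.txt    # 101 and 104
cat AC.txt    # 100 and 104
cat BC.txt    # 103 and 105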

A function that does nothing but generate the possible combinations of two among its arguments:
#!/bin/bash
combpairs() {
local a b
until [ $# -lt 2 ]; do
a="$1"
for b in "${#:2}"; do
echo "$a - $b"
done
shift
done
}
combpairs A B C D E
A - B
A - C
A - D
A - E
B - C
B - D
B - E
C - D
C - E
D - E
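The function above only prints the pairs. A minimal sketch of driving join with the same loop, assuming the arguments are sorted .txt files and the output names are built as in the other answers:
combpairs_join() {
  local a b
  until [ $# -lt 2 ]; do
    a="$1"
    for b in "${@:2}"; do
      # join each remaining file with the current first one
      join "$a" "$b" > "${a%.txt}${b%.txt}.txt"
    done
    shift
  done
}
combpairs_join A.txt B.txt C.txt   # writes AB.txt, AC.txt, BC.txt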

I would put the files in an array, and use the index like this:
files=(a.txt b.txt c.txt) # or files=(*.txt)
for ((i=0; i<${#files[@]}; i++)); do
f1=${files[i]} f2=${files[i+1]:-$files}
join -1 1 -2 1 "$f1" "$f2" > "${f1%.txt}${f2%.txt}.txt"
done
Using echo join to debug (and quoting >), this is what would be executed:
join -1 1 -2 1 a.txt b.txt > ab.txt
join -1 1 -2 1 b.txt c.txt > bc.txt
join -1 1 -2 1 c.txt a.txt > ca.txt
Or for six files:
join -1 1 -2 1 a.txt b.txt > ab.txt
join -1 1 -2 1 b.txt c.txt > bc.txt
join -1 1 -2 1 c.txt d.txt > cd.txt
join -1 1 -2 1 d.txt e.txt > de.txt
join -1 1 -2 1 e.txt f.txt > ef.txt
join -1 1 -2 1 f.txt a.txt > fa.txt
With LC_ALL=C, files=(*.txt) would use all .txt files in the current directory, sorted by name, which may be relevant.
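The echo-debug form mentioned above might look like this (quoting the > keeps the redirection from actually happening while debugging):
files=(a.txt b.txt c.txt)
for ((i=0; i<${#files[@]}; i++)); do
  f1=${files[i]} f2=${files[i+1]:-${files[0]}}
  echo join -1 1 -2 1 "$f1" "$f2" '>' "${f1%.txt}${f2%.txt}.txt"
done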

One in GNU awk:
$ gawk '{
a[ARGIND][$0] # hash all files to arrays
}
END { # after hashing
for(i in a) # form pairs
for(j in a)
if(i<j) { # avoid self and duplicate comparisons
f=ARGV[i] ARGV[j] ".txt" # form output filename
print ARGV[i],ARGV[j] > f # output pair info
for(k in a[i])
if(k in a[j])
print k > f # output matching records
}
}' a b c
Output, for example:
$ cat ab.txt
a b
101
104
All files are hashed in memory at the beginning, so if the files are huge you may run out of memory.
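Note that the output name is simply the two argument names concatenated, so inputs named with .txt extensions would produce names like a.txtb.txt.txt. A possible tweak (not part of the original answer) strips the extensions with GNU awk's gensub before building the name:
f=gensub(/\.txt$/, "", 1, ARGV[i]) gensub(/\.txt$/, "", 1, ARGV[j]) ".txt"   # e.g. a.txt + b.txt -> ab.txt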

Another variation
declare -A seen
for a in {A,B,C}; do
for b in {A,B,C}; do
[[ $a == $b || -v seen[$a$b] || -v seen[$b$a] ]] && continue
seen[$a$b]=1
comm -12 "$a.txt" "$b.txt" > "$a$b.txt"
done
done
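comm expects its inputs to be sorted; if the .txt files may be unsorted, the comm line could sort them on the fly with process substitution:
comm -12 <(sort "$a.txt") <(sort "$b.txt") > "$a$b.txt"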

Related

Executing Concatenation for all rows

I'm working with GWAS data.
Using p-link command I was able to get SNPslist, SNPs.map, SNPs.ped.
Here are the data files and commands I have for 2 SNPs (rs6923761, rs7903146):
$ cat SNPs.map
0 rs6923761 0 0
0 rs7903146 0 0
$ cat SNPs.ped
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C
bash commands I used:
echo -n IID > SNPs.csv
cat SNPs.map | awk '{printf ",%s", $2}' >> SNPs.csv
echo >> SNPs.csv
cat SNPs.ped | awk '{printf "%s,%s%s,%s%s\n", $1, $7, $8, $9, $10}' >> SNPs.csv
cat SNPs.csv
Output:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
This was for 2 SNPs, so I could see their positions manually and write the commands above by hand. But now I have 2000 SNP IDs and their values. I need help with a bash command that can parse over 2000 SNPs in the same way.
One awk idea that replaces all of the current code:
awk '
BEGIN { printf "IID" }
# process 1st file:
FNR==NR { printf ",%s", $2; next }
# process 2nd file:
FNR==1 { print "" } # terminate 1st line of output
{ printf "%s", $1 # print 1st column
for (i=7;i<=NF;i=i+2) # loop through columns 7-NF, incrementing index +2 on each pass
printf ",%s%s", $i, $(i+1) # print (i)th and (i+1)th columns
print "" # terminate line
}
' SNPs.map SNPs.ped
NOTE: remove comments to declutter code
This generates:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
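To end up with SNPs.csv as in the original pipeline, redirect the output of the awk command above:
awk '...' SNPs.map SNPs.ped > SNPs.csv   # same script as above, output captured in SNPs.csv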
You can use the --recodeA flag in plink to have your IIDs as rows and SNPs as columns.
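A hypothetical invocation (flag spelling differs slightly between PLINK versions, so treat this as a sketch rather than the exact command):
plink --file SNPs --recodeA --out SNPs_additive
# expected to write SNPs_additive.raw: one row per sample, one additive-coded (0/1/2) column per SNP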

Merge Two files of columns but insert columns of second file into columns of first file

Assume two files with same amount of columns.
file_A:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
and
file_B:
A B C D E
A B C D E
A B C D E
A B C D E
A B C D E
I want to merge two files in order like
file_C:
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
I have found a solution in the community like this
paste file_A file_B | awk '{print $1,$6,$2,$7,$3,$8,$4,$9,$5,$10}'
But considering the number of columns is around 100 for each file, or not constant, I want to know if there is a better method.
Thanks in advance.
You can use a loop in awk, for example
paste file_A file_B | awk '{
half = NF/2;
for(i = 1; i < half; i++)
{
printf("%s %s ", $i, $(i+half));
}
printf("%s %s\n", $half, $NF);
}'
or
paste file_A file_B | awk '{
i = 1; j = NF/2 + 1;
while(j < NF)
{
printf("%s %s ", $i, $j);
i++; j++;
}
printf("%s %s\n", $i, $j);
}'
The code assumes that the number of columns in awk's input is even.
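To see why $(i + half) picks the matching column, it helps to look at what paste alone emits: on every line, fields 1..NF/2 come from file_A and fields NF/2+1..NF from file_B:
$ paste file_A file_B | head -n 1
1 2 3 4 5       A B C D E      # tab between the two halves; awk still sees 10 whitespace-separated fields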
Use this Perl one-liner after paste to print alternating columns:
paste file_A file_B | perl -F'\t' -lane 'print join "\t", @F[ map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 ];'
Example:
Create tab-delimited input files:
perl -le 'print join "\t", 1..5 for 1..2;' > file_A
perl -le 'print join "\t", "A".."E" for 1..2;' > file_B
head file_A file_B
Prints:
==> file_A <==
1 2 3 4 5
1 2 3 4 5
==> file_B <==
A B C D E
A B C D E
Paste files side by side, also tab-delimited:
paste file_A file_B | perl -F'\t' -lane 'print join "\t", @F[ map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 ];'
Prints:
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'\t' : Split into @F on TAB, rather than on whitespace.
$#F : last index of the array @F with the input fields, split on tab.
0 .. ( $#F - 1 ) / 2 : array of indexes of the array @F, from the start (0) to half of the array. These are all indexes that correspond to file_A.
map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 : map takes the above array of indexes from 0 to half of the length of @F, and returns a new array with twice the number of elements. Its elements alternate: (a) the index corresponding to file_A ($_) and (b) that index plus half the length of the array ($_ + ( @F/2 )), which is the corresponding index from file_B.
@F[ map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 ] : a slice of array @F with the specified indexes, namely alternating fields from file_A and file_B.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perldata: Slices
With one awk script parsing the files:
FNR==NR {
rec[NR] = $0
next
}
{
split(rec[FNR], fields)
for (i=1;i<=NF;i++) $i = fields[i] FS $i
print
}
Usage:
awk -f tst.awk file_A file_B

Select rows in one file based on specific values in the second file (Linux)

I have two files:
One is "total.txt". It has two columns: the first column is natural numbers (indicator) ranging from 1 to 20, the second column contains random numbers.
1 321
1 423
1 2342
1 7542
2 789
2 809
2 5332
2 6762
2 8976
3 42
3 545
... ...
20 432
20 758
The other one is "index.txt". It has three columns (1: indicator, 2: low value, 3: high value):
1 400 5000
2 600 800
11 300 4000
I want to output the rows of "total.txt" whose first column matches the first column of "index.txt", and whose second column is at the same time larger than (>) the second column of "index.txt" and smaller than (<) the third column of "index.txt".
The expected result is as follows:
1 423
1 2342
2 809
2 5332
2 6762
11 ...
11 ...
I have tried this:
awk '$1==(awk 'print($1)' index.txt) && $2 > (awk 'print($2)' index.txt) && $1 < (awk 'print($2)' index.txt)' total.txt > result.txt
But it failed!
Can you help me with this? Thank you!
You need to read both files in the same awk script. When you read index.txt, store the other columns in an array.
awk 'FNR == NR { low[$1] = $2; high[$1] = $3; next }
$2 > low[$1] && $2 < high[$1] { print }' index.txt total.txt
FNR == NR is the common awk idiom to detect when you're processing the first file.
Use join like Barmar said:
# To join on the first columns
join -11 -21 total.txt index.txt
And if the files aren't sorted in lexical order by the first column then:
join -11 -21 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt)
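Note that join by itself only merges the matching lines; the range test from the question still has to be applied afterwards, e.g. by piping into awk (assuming each joined line comes out as key, value, low, high):
join <(sort -k1,1 total.txt) <(sort -k1,1 index.txt) | awk '$2 > $3 && $2 < $4 { print $1, $2 }'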

How to find common rows in multiple files using awk

I have tab-delimited text files in which common rows are to be found, using columns 1 and 2 as key columns.
Sample files:
file1.txt
aba 0 0
aba 0 0 1
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
aba 0 0 0 1
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
The code below does this and returns the rows only if the key columns are found in all N files (3 files in this case).
awk '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
num_files = ARGC -1
for ( key in arr ) {
if ( arr[key] < num_files ) { continue }
split( line[ key ], line_arr, SUBSEP )
for ( i = 1; i <= length( line_arr ); i++ ) {
printf "%s\n", line_arr[ i ]
}
}
}
' *.txt > commoninall.txt
Output:
xxx 0 0
aba 0 0
aba 0 0 1
However, now I would like to get the output if 'x' files have the key columns.
For example x=2 i.e. rows which are common in two files based on key columns 1 and 2. The output in this case would be:
xyz 0 0
abc 1 1
In a real scenario I have to specify different values of x. Can anybody suggest an edit to this, or a new solution?
First attempt
I think you just need to modify the END block a little, and the command invocation:
awk -v num_files=${x:-0} '
…
…script as before…
…
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
'
Basically, this takes a command line parameter based on $x, defaulting to 0, and assigning it to the awk variable num_files. In the END block, the code checks for num_files being zero, and resets it to the number of files passed on the command line. (Interestingly, the value in ARGC discounts any -v var=value options and either a command line script or -f script.awk, so the ARGC-1 term remains correct. The array ARGV contains awk (or whatever name you invoked it with) in ARGV[0] and the files to be processed in ARGV[1] through ARGV[ARGC-1].) The loop then checks for the required number of matches and prints as before. You can change == to >= if you want the 'or more' option.
Does it work?
I observed in a comment:
I'm not clear what you are asking. I took it that your code was working for the example with three files and producing the right answer. I simply suggested how to modify the working code to handle N files and at least M of them sharing an entry. I have just realized, while typing this, that there is a bit more work to do. An entry could be missing from the first file but present in the others and will need to be processed, therefore. It is easy to report all occurrences in every file, or the first occurrence in any file. It is harder to report all occurrences only in the first file with a key.
The response was:
It is perfectly fine to report first occurrence in any file and need not be only from the first file. However, the issue with the suggested modification is, it is producing the same output for different values of x.
That's curious: I was able to get sane output from the amended code with different values for the number of files where the key must appear. I used this shell script. The code in the awk program up to the END block is the same as in the question; the only change is in the END processing block.
#!/bin/bash
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1;;
esac
done
shift $(($OPTIND - 1))
awk -v num_files=${num_files:-$#} '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] (line[$1,$2] ? SUBSEP : "") $0
next
}
FNR == 1 { delete found }
{ if (arr[$1,$2] && ! found[$1,$2]) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
' "$#"
Sample runs (data files from question):
$ bash common.sh file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 3 file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 2 file?.txt
$ bash common.sh -n 1 file?.txt
abc 0 1
abd 1 1
$
That shows different answers depending on the value specified via -n. Note that this only shows lines that appear in the first file and appear in exactly N files in total. The only key that appears in two files (abc/1) does not appear in the first file, so it is not listed by this code which stops paying attention to new keys after the first file is processed.
Rewrite
However, here's a rewrite, using some of the same ideas, but working more thoroughly.
#!/bin/bash
# SO 30428099
# Given that the key for a line is the first two columns, this script
# lists all appearances in all files of a given key if that key appears
# in N different files (where N defaults to the number of files). For
# the benefit of debugging, it includes the file name and line number
# with each line.
usage()
{
echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1
}
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) usage;;
esac
done
shift $(($OPTIND - 1))
if [ "$#" = 0 ]
then usage
fi
# Record count of each key, regardless of file: keys
# Record count of each key in each file: key_file
# Count of different files containing each key: files
# Accumulate line number, filename, line for each key: lines
awk -v num_files=${num_files:-$#} '
{
keys[$1,$2]++;
if (++key_file[$1,$2,FILENAME] == 1)
files[$1,$2]++
#printf "%s:%d: Key (%s,%s); keys = %d; key_file = %d; files = %d\n",
# FILENAME, FNR, $1, $2, keys[$1,$2], key_file[$1,$2,FILENAME], files[$1,$2];
sep = lines[$1,$2] ? RS : ""
#printf "B: [[\n%s\n]]\n", lines[$1,$2]
lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
#printf "A: [[\n%s\n]]\n", lines[$1,$2]
}
END {
#print "END"
for (key in files)
{
#print "Key =", key, "; files =", files[key]
if (files[key] == num_files)
{
#printf "TAG\n%s\nEND\n", lines[key]
print lines[key]
}
}
}
' "$#"
Sample output (given the data files from the question):
$ bash common.sh file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 2 file?.txt
file2.txt 5 abc 1 1
file3.txt 5 abc 1 1
$ bash common.sh -n 1 file?.txt
file1.txt 3 abc 0 1
file3.txt 1 xyx 0 0
file1.txt 4 abd 1 1
file2.txt 1 xyz 0 0
$ bash common.sh -n 3 file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 4 file?.txt
$
You can fettle this to give the output you want (probably missing file name and line number). If you only want the lines from the first file containing a given key, you only add the information to lines when files[$1,$2] == 1. You can separate the recorded information with SUBSEP instead of RS and OFS if you prefer.
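A sketch of that first-file-only tweak: guard the accumulation so lines are only collected while the key is still known from a single file (a hypothetical edit to the script above, following the description in the previous paragraph):
if (files[$1,$2] == 1)
    lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0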
Can't you simply use uniq to search for repeated lines in your files?
Something like:
cat file1.txt file2.txt file3.txt | uniq -d
For your complete scenario, you could use uniq -c to get the number of repetitions for each line, and filter this with grep.
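A sketch of that idea (uniq needs sorted input, and it assumes a given line does not repeat within a single file): count how many files each whole line occurs in, then keep the ones with the desired count:
sort file1.txt file2.txt file3.txt | uniq -c | grep -E '^ *2 '   # whole lines that occur in exactly two files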

join command in linux says that files aren't sorted but they are

I have 2 files that I want to join. They are both sorted.
I sorted them by these commands:
$sort -n -k1,1 f1 > t1
$echo $?
0
$mv t1 f1
$sort -n -k1,1 f2 > t1
$echo $?
0
$mv t1 f2
now I run the join command
$join -1 1 -2 1 f1 f2 > fjoin
$echo $?
1
It says those files aren't sorted
$cat f1
0 0
5 0
9 0
10 0 <----- problem is here
$cat f2
0 1
3 1
11 2 <----- problem is here
I suggest removing sort's -n option: join compares the join fields as text, so a numeric sort puts 10 after 9 where join's lexical order expects it earlier.
From man join:
Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use sort -k 1b,1 if join has no options, or use join -t '' if sort has no options. Note, comparisons honor the rules specified by LC_COLLATE. If the input is not sorted and some lines cannot be joined, a warning message will be given.
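Concretely, re-sorting without -n (using the sort -k 1b,1 the manual suggests) puts the files in the order join expects:
sort -k 1b,1 f1 > t1 && mv t1 f1
sort -k 1b,1 f2 > t2 && mv t2 f2
join f1 f2 > fjoin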
