How to efficiently join 'n' files in an ordered way using paste/join, Linux tools, or Perl?

There are thousands of files ending with *.tab. The first column in each file is a header column. Every file has its own headers (so they are all different); I don't mind keeping the header column from just one file.
The number of rows is the same in all the files, and the rows are in a fixed order. My desired output must keep that same order.
Example files in a directory
test_1.tab
test_2.tab
.
.
.
.
test_1990.tab
test_2000.tab
test_1.tab
Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 .....0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 .....1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 .....1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 .....0
test_2000.tab
Pro_1901 1 1 1 1 0 1 1 0 0 0 0 1 .....0
Pro_1902 1 1 1 0 0 0 1 0 0 0 0 0 .....1
Pro_1903 1 1 0 1 0 1 0 0 0 0 0 1 .....1
.
.
.
Pro_2000 1 0 0 0 0 1 1 1 1 1 0 .....0
desired output
Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 0 ..... 1 1 1 1 0 1 1 0 0 0 0 1 0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 1 ..... 1 1 1 0 0 0 1 0 0 0 0 0 1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 1 ..... 1 1 0 1 0 1 0 0 0 0 0 1 1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 0 ..... 1 0 0 0 0 1 1 1 1 1 0 0
My code
touch allCol.tab
for i in *.tab; do paste allCol.tab <(cut -f 2- "$i") > intermediate.csv; mv intermediate.csv allCol.tab; done
paste <(cut -f1 test_1.tab) allCol.tab > final.tab
rm allCol.tab
It takes quite a long time, about 3 hours. What would be a better way?
Also, is there a command to cross-check the output file against all the input files, something like diff or wc?

Try this.
#!/bin/bash
TMP=tmp
mkdir "$TMP"
RESULT=result
# read each input file and append each of its lines (without the key
# column) to a per-line file in the tmp directory; zero-padding the
# file names keeps the later glob in numeric order
for f in *.tab; do
    i=1
    while read -r l; do
        echo "$l" >> "$TMP"/"$(printf '%06d' "$i")"
        ((i++))
    done < <(cut -f2- "$f")
done
# merge each file in the tmp dir into a single line of the $RESULT file
exec 1>>"$RESULT"
for f in "$TMP"/*; do
    while read -r l; do
        printf '%s\t' "$l"
    done < "$f"
    echo
done
rm -r "$TMP"
This algorithm can be split across a number of processors so the task gets done faster.
You can also harden it, for example by checking that $TMP was created successfully.
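For instance, a minimal sketch of that check; if mkdir fails (say, because tmp is left over from an earlier run), the script stops instead of mixing old and new intermediate files:
if ! mkdir "$TMP"; then
    echo "could not create $TMP" >&2
    exit 1
fi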

A recursive function is a good tool. As a first cut -- short, but simple:
pasteAll() {
    first=$1; shift
    case $# in
        0) cut -f 2- "$first" ;;
        *) paste <(cut -f 2- "$first") <(pasteAll "$@") ;;
    esac
}
set -- *.tab
paste <(cut -f 1 "$1") <(pasteAll "$@")
Checking that all files and lines were included -- if every input file contains an identical number of lines -- is as simple as checking the output file's line count and the number of columns in its last line.
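For example, a quick cross-check along those lines (a sketch; it assumes the inputs are tab-separated and match test_*.tab, and that the merged file is final.tab as in the question):
rows=$(wc -l < final.tab)
cols=$(awk -F'\t' 'END{print NF}' final.tab)
# expected columns: 1 key column + the data columns of every input file
expected=1
for f in test_*.tab; do
    expected=$(( expected + $(awk -F'\t' 'NR==1{print NF-1; exit}' "$f") ))
done
echo "rows: $rows (want $(wc -l < test_1.tab)); columns: $cols (want $expected)"
If the row counts and the column totals agree, every file contributed all of its lines.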

Related

How to filter a matrix based on another column

I want to filter a matrix file using a column from another file.
I have 2 tab-separated files; one contains a matrix. I want to filter the matrix file (FileA) based on the first column of FileB: if a header (column name) of FileA is present in the first column of FileB, I want to keep that column in a new file. All the solutions I could find were based on filtering rows, not columns. Any help is appreciated. Thanks!
FileA
A B C D E F G H I J K L M N
R1 0 0 0 0 0 0 0 0 0 1 0 0 1 1
R2 1 1 0 1 0 0 0 0 1 0 1 0 0 0
R3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
R4 1 1 0 1 0 0 0 1 0 1 0 1 0 0
R5 0 0 0 0 1 0 1 0 1 0 1 0 1 0
FileB
A Green
B Purple
K Blue
L Blue
Z Green
M Purple
N Red
O Red
U Red
My expected output is:
ExpectedOutput
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
Oh, what the heck, I'm not sure having you post an R script is really going to make any difference other than satisfying my need to be pedantic so here y'go:
$ cat tst.awk
NR == FNR {
    outFldNames2Nrs[$1] = ++numOutFlds
    next
}
FNR == 1 {
    $0 = "__" FS $0
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        outFldNr = outFldNames2Nrs[$inFldNr]
        out2inFldNrs[outFldNr] = inFldNr
    }
}
{
    printf "%s", $1
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNrs[outFldNr]
        if (inFldNr) {
            printf "%s%s", OFS, $inFldNr
        }
    }
    print ""
}
$ awk -f tst.awk fileB fileA
__ A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
I'm using the term "field name" for the letter at the top of each column ("field" in awk). Try to figure the rest out for yourself from the man pages, adding "print" statements if/when useful, and then feel free to ask questions if you have any.
I added __ at the front of your header line so you'd have the same number of columns in every line of output; that makes the result easier to pass along to other tools, but it's easy to tweak the code not to do that if you don't like it.
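For instance, assuming the default output field separator (a single space), one way to drop the placeholder again without touching the script is to strip it from the first output line:
$ awk -f tst.awk fileB fileA | sed '1s/^__ //'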
As @EdMorton mentions, bash may not be a suitable tool for manipulating a complex data structure such as a table, from a maintainability and robustness point of view.
Here is a bash script example just for information:
#!/bin/bash
declare -A seen
declare -a ary include
while read -r alpha color; do
    seen["$alpha"]=1
done < FileB
while read -r -a ary; do
    if (( nr++ == 0 )); then            # handle header line
        echo -n " "
        for (( i=0; i<${#ary[@]}; i++ )); do
            alpha="${ary[$i]}"
            if [[ ${seen["$alpha"]} = 1 ]]; then
                echo -n " $alpha"
                include[$((i+1))]=1
            fi
        done
    else
        echo -n "${ary[0]}"
        for (( i=1; i<${#ary[@]}; i++ )); do
            if [[ ${include[$i]} = 1 ]]; then
                echo -n " ${ary[$i]}"
            fi
        done
    fi
    echo
done < FileA
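A possible invocation, assuming the script above is saved as filter.sh (a hypothetical name) in the same directory as FileA and FileB:
$ bash filter.sh > ExpectedOutput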
If Python is an option, you can instead say something like:
import pandas as pd
# FileB has no header line; its first column holds the wanted column names
dfb = pd.read_csv("./FileB", sep=r"\s+", header=None)
vb = [x[0] for x in dfb.values.tolist()]
# FileA has a header line; its first column (R1..R5) becomes the row index
dfa = pd.read_csv("./FileA", sep=r"\s+")
va = dfa.columns.tolist()
print(dfa[sorted(set(va) & set(vb))])
Note that sorted() emits the selected columns in alphabetical order, which here happens to match FileA's column order.
Output:
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0

Loop every three consecutive rows in linux

I have a file hundred.txt containing 100 rows.
For example:
1 0 0 1
1 1 0 1
1 0 1 0
1 0 1 0
0 1 1 0
....
1 0 0 1
I need to perform some calculations on every 3 consecutive rows; for instance, I first need Row1-Row3:
1 0 0 1
1 1 0 1
1 0 1 0
then the Row2-Row4:
1 1 0 1
1 0 1 0
1 0 1 0
...and so on, up to Row98-Row100.
Each window will generate a file (e.g. Row1.txt, Row2.txt, ..., Row98.txt). How can I solve this problem? Thank you.
bash isn't a great choice for data processing tasks, but it is possible (albeit slow):
{ read -r row1
  read -r row2
  count=0
  while read -r row3; do
      # Do something with rows 1-3; here, write the three-row window to its own file
      printf '%s\n' "$row1" "$row2" "$row3" > "Row$((count+=1)).txt"
      # Slide the window
      row1=$row2
      row2=$row3
  done
} < hundred.txt
awk to the rescue!
$ awk 'NR>2{printf "%s", a2 ORS a1 ORS $0 ORS > FILENAME"."(++c)}
{a2=a1;a1=$0}' file
for the input file
$ cat file
1 0 0 1
1 1 0 1
1 0 1 0
1 0 1 0
0 1 1 0
generates these 3
$ head file.{1..3}
==> file.1 <==
1 0 0 1
1 1 0 1
1 0 1 0
==> file.2 <==
1 1 0 1
1 0 1 0
1 0 1 0
==> file.3 <==
1 0 1 0
1 0 1 0
0 1 1 0
You can embed your computation in the script and output only the results, but you didn't provide any details on that.
Explanation
NR>2 starting third row
printf ... start printing last 3 rows
> FILENAME"."(++c) to a file derived from input filename with counter suffix
a2=a1;a1=$0 update last two rows
if your rolling window size n is small, you can scale this script by changing the condition to NR>(n-1), keeping track of the last rows in a(n-1)...a1, and printing accordingly. If n is large, it's better to use an array (or, better, a circular array).
This is perhaps the most generic version...
$ awk -v n=3 'NR>n-1{fn=FILENAME"."(NR-n+1);
    for(i=c+1;i<c+n;i++) printf "%s\n", a[(i-n)%n] > fn;
    print > fn}
  {a[(c++)%n]=$0}' file
One hundred rows of four binary-valued columns is not too much; just read it all in at once.
mapfile -t rows < inputfile
for r in "${!rows[@]}"; do          # loop by row index
    (( r >= 2 )) || continue
    # process "${rows[r-2]}" "${rows[r-1]}" and "${rows[r]}"
    # into file Row$((r-1))
done
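As a minimal sketch of the processing step, the loop can simply write each window to the file named in the comment above (Row1.txt for the window starting at the first row, and so on):
mapfile -t rows < inputfile
for r in "${!rows[@]}"; do
    (( r >= 2 )) || continue
    printf '%s\n' "${rows[r-2]}" "${rows[r-1]}" "${rows[r]}" > "Row$((r-1)).txt"
done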
If the quantity of data grows significantly, you really want to use a better tool, such as Python+numpy (because your data looks like binary matrices).

How to select only those lines which have the same string in all columns after the first (header) column, in Linux

I have a text file with 200 columns like:
sample1 0 12 11 23 12
sample2 3 16 89 12 0
sample3 0 0 0 0 0
sample4 33 22 0 0 0
sample5 0 0 0 0 0
And I want only those lines which have only 0 from column 2 to 6. The desired output is:
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Like this, for example:
$ awk '!$2 && !$3 && !$4 && !$5 && !$6' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Which is the same as:
$ awk '!($2 || $3 || $4 || $5 || $6)' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
As per your comment
that is for example but i want to do that from column 2 to 200th
This can be a way:
$ awk '{for (i=2;i<=200;i++) if ($i) {next}}1' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Note that $i refers to the field at position i, and $i is true when it holds a "true" value; hence $i is false when it is 0.
Based on that approach, we loop through all the values. If one value is true, meaning not 0, then next is executed and the line is not analyzed any further. Otherwise (the 2nd to 200th columns all being 0 or empty), next is never reached, so awk evaluates the trailing 1, which makes {print $0} execute.
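If the check should cover every column after the first, whatever their number, a variant of the same idea (a sketch, not part of the original answer) can use NF instead of the hard-coded 200:
$ awk '{for (i=2;i<=NF;i++) if ($i) next} 1' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0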

Matlab string operation

I have converted a string to binary as follows
message='hello my name is kamran';
messagebin=dec2bin(message);
Is there any method for storing it in an array?
I am not really sure what you want to do here, but if you need to concatenate the rows of the binary representation (which is a numchars-by-bits_per_char matrix), this is the code:
message = 'hello my name is kamran';
messagebin = dec2bin(double(message));
linearmessagebin = reshape(messagebin',1,numel(messagebin));
Please note that the double conversion returns the ASCII codes. I do not have access to a Matlab installation here, but Octave, for example, complains about the code you provided in the original question.
NOTE
As was kindly pointed out to me, you have to transpose messagebin before "serializing" it, in order to get the correct result.
If you want the result as a numeric matrix, try:
>> str = 'hello world';
>> b = dec2bin(double(str),8) - '0'
b =
0 1 1 0 1 0 0 0
0 1 1 0 0 1 0 1
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0
0 1 1 1 0 1 1 1
0 1 1 0 1 1 1 1
0 1 1 1 0 0 1 0
0 1 1 0 1 1 0 0
0 1 1 0 0 1 0 0
Each row corresponds to a character. You can easily reshape it into a sequence of 0s and 1s.

Convert column pattern

I have this kind of file:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1



8 0 1


10 0 1
11 0 1
The RS separator is an empty line by default.
If there is a double blank line, we have to substitute one of them with a record $1 0 0, where $1 is the number from the preceding $1 0 * record, increased by one.
If the separator is an empty line plus 1 extra empty line, we have to increase $1 by 1; if it is an empty line plus 2 extra empty lines, we have to increase $1 by 2; and so on.
and I need to get this output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
Thanks in advance!
awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
Output
$ awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
Explanation
NF{f=0;n=$1;print;next}: if the current line has data, unset flag f, save the number in the first field to n, print the line and skip the rest of the script
{print;f=1}: We only reach this action if the current line is blank. If so, print the line and set the flag f
f{print ++n " 0 0"}: We only execute this action if the flag f is set which only happens if the previous line was blank. If we enter this action, print the missing fields with an incremented n
You can try something like this. The benefit of this approach is that your input file need not have extra empty lines marking the missing numbers.
awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
Input File:
[jaypal:~/Temp] cat file
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

8 0 1

10 0 1
11 0 1
Script Output:
[jaypal:~/Temp] awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
