How to delete the first subset of each set of columns in a data file? - linux

I have a data file with more than 40000 columns. In the header, each column's name begins with c1, c2, ..., cn, and each set cX has one or more subsets; for example, c1 has 2 subsets. I need to delete the first column (subset) of each set. For example, if the input looks like:
input:
c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444
1 1 0 1 0 0 0 1
2 0 1 0 0 1 0 1
3 0 1 0 0 1 1 0
4 1 0 1 0 0 1 0
5 1 0 1 0 0 1 0
6 1 0 1 0 0 1 0
I need the output to look like:
c1.31012 c2.87634 c2.22233 c3.44444
1 0 0 0 1
2 1 0 1 1
3 1 0 1 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
Any suggestion please?
Update: what should I do if there is no space between the digits in a row (which is the real situation in my data set)? I mean, my real data looks like this:
input:
c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444
1 1010001
2 0100101
3 0100110
4 1010010
5 1010010
6 1010010
and output:
c1.31012 c2.87634 c2.22233 c3.44444
1 0001
2 1011
3 1010
4 0000
5 0000
6 0000

Perl solution: it first reads the header line, uses a regex to extract each column's name before the dot, and builds the list of column indices to keep. It then uses those indices to print only the wanted columns from the header and from the remaining lines.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @header = split ' ', <>;
my $last   = q();
my @keep   = 0;                 # always keep column 0: the row number
for my $i (0 .. $#header) {
    my ($prefix) = $header[$i] =~ /(.*)\./;
    if ($prefix eq $last) {
        push @keep, $i + 1;     # +1: data rows carry a leading row number
    }
    $last = $prefix;
}
unshift @header, q();           # align the header with the data columns
say join "\t", @header[@keep];

while (<>) {
    my @columns = split;
    say join "\t", @columns[@keep];
}
Update:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @header = split ' ', <>;
my $last = q();
my @keep;
for my $i (0 .. $#header) {
    my ($prefix) = $header[$i] =~ /(.*)\./;
    if ($prefix eq $last) {
        push @keep, $i;
    }
    $last = $prefix;
}
say join "\t", @header[@keep];

while (<>) {
    my ($line_number, $all_digits) = split;
    my @digits = split //, $all_digits;
    say join "\t", $line_number, join q(), @digits[@keep];
}
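For comparison, here is an awk sketch for the original space-separated layout (untested beyond the sample above; it assumes every header name contains a dot and every data row starts with a row number):
awk 'NR == 1 {
         # Keep a header field when its prefix before the dot matches the
         # previous field's prefix (i.e. it is not the first subset of a set).
         for (i = 1; i <= NF; i++) {
             split($i, p, ".")
             if (p[1] == last) keep[++n] = i
             last = p[1]
         }
         for (j = 1; j <= n; j++) printf "%s%s", $(keep[j]), (j < n ? OFS : ORS)
         next
     }
     {
         # Data rows carry a leading row number, so header field i is field i+1 here.
         printf "%s", $1
         for (j = 1; j <= n; j++) printf "%s%s", OFS, $(keep[j] + 1)
         print ""
     }' input.txt
For the no-space variant, the updated Perl script above is the way to go.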

Related

How to filter a matrix based on another column

I want to filter a matrix file using a column from another file.
I have 2 tab-separated files; one contains a matrix. I want to filter the matrix file (FileA) based on the first column of FileB: if a header (column name) of FileA is present in the first column of FileB, I want to keep that column in a new file. All the solutions I could find were based on filtering rows, not fields. Any help is appreciated. Thanks!
FileA
A B C D E F G H I J K L M N
R1 0 0 0 0 0 0 0 0 0 1 0 0 1 1
R2 1 1 0 1 0 0 0 0 1 0 1 0 0 0
R3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
R4 1 1 0 1 0 0 0 1 0 1 0 1 0 0
R5 0 0 0 0 1 0 1 0 1 0 1 0 1 0
FileB
A Green
B Purple
K Blue
L Blue
Z Green
M Purple
N Red
O Red
U Red
My expected output is:
ExpectedOutput
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
Oh, what the heck, I'm not sure having you post an R script is really going to make any difference other than satisfying my need to be pedantic so here y'go:
$ cat tst.awk
NR == FNR {
    outFldNames2Nrs[$1] = ++numOutFlds
    next
}
FNR == 1 {
    $0 = "__" FS $0
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        outFldNr = outFldNames2Nrs[$inFldNr]
        out2inFldNrs[outFldNr] = inFldNr
    }
}
{
    printf "%s", $1
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNrs[outFldNr]
        if (inFldNr) {
            printf "%s%s", OFS, $inFldNr
        }
    }
    print ""
}
$ awk -f tst.awk fileB fileA
__ A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
I'm using the term "field name" to apply to the letter at the top of each column ("field" in awk). Try to figure the rest out for yourself from looking at the man pages and adding "prints" if/when useful and then feel free to ask questions if you have any.
I added __ at the front of your header line so you'd have the same number of columns in every line of output - that makes it easier to pass along to other tools to manipulate further but it's easy to tweak the code to not do that if you don't like it.
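If you don't want the __ padding, one low-tech option (just a sketch) is to strip it from the first output line afterwards; this leaves an empty leading cell on the header:
awk -f tst.awk fileB fileA | sed '1s/^__//'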
As @EdMorton mentions, bash may not be a suitable tool for manipulating a complex data structure such as a table, from a maintainability and robustness point of view.
Here is a bash script example just for information:
#!/bin/bash
declare -A seen
declare -a ary include

while read -r alpha color; do
    seen["$alpha"]=1
done < FileB

while read -r -a ary; do
    if (( nr++ == 0 )); then                # handle header line
        echo -n " "
        for (( i=0; i<${#ary[@]}; i++ )); do
            alpha="${ary[$i]}"
            if [[ ${seen["$alpha"]} = 1 ]]; then
                echo -n " $alpha"
                include[$((i+1))]=1
            fi
        done
    else
        echo -n "${ary[0]}"
        for (( i=1; i<${#ary[@]}; i++ )); do
            if [[ ${include[$i]} = 1 ]]; then
                echo -n " ${ary[$i]}"
            fi
        done
    fi
    echo
done < FileA
If Python is an option, you can instead say something like:
import pandas as pd

dfb = pd.read_csv("./FileB", sep=r"\s+", header=None)
vb = [x[0] for x in dfb.values.tolist()]
dfa = pd.read_csv("./FileA", sep=r"\s+")
va = dfa.columns.tolist()
print(dfa[sorted(set(va) & set(vb))])
Output:
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0

BASH Script - Check if consecutive numbers in a string are above a value

I am echoing some data from an Oracle DB cluster, via a bash script. Currently, my output into a variable in the script from SQLPlus is:
11/12 0 0 0 0 0 0 1 0 1 0 5 4 1 0 0 0 0 0 0 0 0 0 0 0
What I'd like to be able to do is evaluate that string of numbers, excluding the first one (the date), to see if any 6 consecutive numbers are above a certain value, let's say 10.
I only want the logic to return true if all 6 consecutive values are above 10.
So for example, if the output was:
11/12 0 0 8 10 5 1 1 0 8 10 25 40 6 2 0 0 0 0 0 0 0 0 0 0
The logic should return false/null/zero, anything I can handle negatively.
But if the string looked like this:
11/12 0 0 0 0 5 9 1 0 1 10 28 10 12 19 15 11 6 7 0 0 0 0
Then it would return true/1, etc.
Is there any bash component that I can make use of to do this? I've been stuck on this part for a while now.
For variety, here is a solution not depending on awk:
#!/usr/bin/env bash
contains() {
    local nums=$* count=0 threshold=10 limit=6 i
    for i in ${nums#* }; do             # ${nums#* } drops the first word (the date)
        if (( i >= threshold )); then
            (( ++count >= limit )) && return 0
        else
            count=0
        fi
    done
    return 1
}

output="11/12 0 0 0 0 5 9 1 0 1 10 28 10 12 19 15 11 6 7 0 0 0 0"
if contains "$output"; then
    echo "Yaaay!"
else
    echo "Noooo!"
fi
Say your string is in $S, then
echo "$S" | awk '
{
    threshold = 10; reqLength = 6; L = 0
    for (i = 2; i <= NF; ++i) {
        if ($i >= threshold) {
            L += 1
            if (L >= reqLength) {
                exit(1)
            }
        } else {
            L = 0
        }
    }
}'
would do it. ($? will be 1 if there are 6 consecutive numbers at or above your threshold.)
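To branch on that from the shell, here is one way, assuming the awk program above is saved to a file named check.awk (the name is just a placeholder):
S="11/12 0 0 0 0 5 9 1 0 1 10 28 10 12 19 15 11 6 7 0 0 0 0"
echo "$S" | awk -f check.awk
if [ $? -eq 1 ]; then
    echo "true: found 6 consecutive values at or above 10"
else
    echo "false"
fi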

Loop every three consecutive rows in linux

I have a file hundred.txt containing 100 rows.
For example:
1 0 0 1
1 1 0 1
1 0 1 0
1 0 1 0
0 1 1 0
....
1 0 0 1
I need to perform a calculation on every 3 consecutive rows; for instance, I first need to use Row1-Row3 for my calculation:
1 0 0 1
1 1 0 1
1 0 1 0
then Row2-Row4:
1 1 0 1
1 0 1 0
1 0 1 0
... and so on, up to Row98-Row100.
Each window's output should go to its own file (e.g. Row1.txt, Row2.txt, ..., Row98.txt). How can I solve this problem? Thank you.
bash isn't a great choice for data processing tasks, but it is possible (albeit slow):
{
    read -r row1
    read -r row2
    count=0
    while read -r row3; do
        # Do something with rows 1-3; here, write the window to its own file
        printf '%s\n' "$row1" "$row2" "$row3" > "Row$((count+=1)).txt"
        # Slide the window
        row1=$row2
        row2=$row3
    done
} < hundred.txt
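A quick sanity check of the generated files (this mirrors the head idiom used in the awk answer below):
head Row{1..3}.txt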
awk to the rescue!
$ awk 'NR>2{printf "%s", a2 ORS a1 ORS $0 ORS > FILENAME"."(++c)}
{a2=a1;a1=$0}' file
for the input file
$ cat file
1 0 0 1
1 1 0 1
1 0 1 0
1 0 1 0
0 1 1 0
generates these 3
$ head file.{1..3}
==> file.1 <==
1 0 0 1
1 1 0 1
1 0 1 0
==> file.2 <==
1 1 0 1
1 0 1 0
1 0 1 0
==> file.3 <==
1 0 1 0
1 0 1 0
0 1 1 0
You can embed your computation in the script and output only the results, but you didn't provide any details on that.
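For instance, to print each window's sum instead of writing files (the sum is only a stand-in for the unspecified calculation):
awk 'NR>2 { s = 0
            n = split(a2 " " a1 " " $0, v)   # flatten the 3-row window into one list
            for (i = 1; i <= n; i++) s += v[i]
            print "window " (NR-2) ": sum = " s }
     { a2 = a1; a1 = $0 }' file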
Explanation
NR>2 starting third row
printf ... start printing last 3 rows
> FILENAME"."(++c) to a file derived from input filename with counter suffix
a2=a1;a1=$0 update last two rows
If your rolling window size n is small, you can scale this script by changing the condition to NR>(n-1), keeping track of the last rows in a(n-1)...a1, and printing accordingly. If n is large, it is better to use an array (better yet, a circular array).
This is perhaps the most generic version...
$ awk -v n=3 'NR>n-1 { fn = FILENAME "." (NR-n+1)
                       for (i=c+1; i<c+n; i++) printf "%s\n", a[(i-n)%n] > fn
                       print > fn }
             { a[(c++)%n] = $0 }' file
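For example, with the script above saved as window.awk (the name is illustrative), a 4-row window is just:
awk -v n=4 -f window.awk file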
One hundred rows of four binary-valued columns is not too much; just read it all in at once.
mapfile -t rows < inputfile
for r in "${!rows[@]}"; do      # loop by row index
    (( r >= 2 )) || continue
    # process "${rows[r-2]}" "${rows[r-1]}" and "${rows[r]}"
    # into file Row$((r-1))
done
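A concrete version, assuming the per-window "calculation" is simply writing the three rows out (substitute your real computation):
mapfile -t rows < hundred.txt
for r in "${!rows[@]}"; do
    (( r >= 2 )) || continue
    # rows r-2 .. r form the window; the window index is r-1
    printf '%s\n' "${rows[r-2]}" "${rows[r-1]}" "${rows[r]}" > "Row$((r-1)).txt"
done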
If the quantity of data grows significantly, you really want to use a better tool, such as Python+numpy (because your data looks like binary matrices).

Convert column pattern

I have this kind of file:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1



8 0 1


10 0 1

11 0 1
The record separator (RS) is an empty line by default. If there is a double blank line, we have to substitute one of them with a record $1 0 0, where $1 is the incremented number from the preceding $1 0 * record.
If the separator is an empty line plus 1 extra empty line, we have to increase $1 by 1.
If the separator is an empty line plus 2 extra empty lines, we have to increase $1 by 2.
...
and I need to get this output:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 0 1
11 0 1
Thanks in advance!
awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
Output
$ awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
Explanation
NF{f=0;n=$1;print;next}: if the current line has data, unset flag f, save the number in the first field to n, print the line and skip the rest of the script
{print;f=1}: We only reach this action if the current line is blank. If so, print the line and set the flag f
f{print ++n " 0 0"}: We only execute this action if the flag f is set which only happens if the previous line was blank. If we enter this action, print the missing fields with an incremented n
You can try something like this. The benefit of this approach is that your input file need not have extra empty lines for the missing numbers.
awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
Input File:
[jaypal:~/Temp] cat file
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

8 0 1

10 0 1

11 0 1
Script Output:
[jaypal:~/Temp] awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1

Pattern decoding II [duplicate]

Possible Duplicate:
Pattern decoding
I have a new question concerning the previous post about pattern decoding:
I have almost the same data file, BUT there are double empty (blank) lines, which have to be taken into account in the decoding.
So, a double empty line means that there was a street/group (for definitions see the previous post: Pattern decoding) in which there were zero (0) houses, but we have to count these kinds of patterns too. (Yes, you may think this is an absolutely wrong statement, because there is no street without at least one house, but this is just an analogy, so please accept it as it is.)
Here is the new data file, with the double blank lines:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)

0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)

0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2

0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)

0 0 # <--- Group 5

# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7

# <--- Group 8 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 9

0 0 # <--- Group 10
I need to convert this in an elegant way, as was done before, but in this case we have to take the new groups into account too and indicate them in the same format (following Kent, for example): groupIdx houseIdx numberOfRooms, where houseIdx is set to zero (houseIdx = 0) and numberOfRooms is set to zero as well (numberOfRooms = 0). So I need to get this kind of output, for example:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 1
8 0 0
9 0 1
10 0 1
Can we tune the previous code in this way?
UPDATE: the second empty line indicates a new group. If there is an additional empty line after the separator empty line, as in this case:
0 0 # <--- Group 5

# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7

# <--- Group 8 << ---- THIS IS THE NEW GROUP
we just treat the new empty line (the second of the 2 blank lines) as a new group and indicate it as group_index 0 0. See the desired output above!
Try:
$ cat houses.awk
BEGIN { max = 1; group = 1 }
NF == 0 {
    empty++
    if (empty == 1) group++
    next
}
{
    max = ($1 > max) ? $1 : max
    if (empty <= 1) {
        a[group,$1]++
    } else {
        a[group,$1] = -1
    }
    empty = 0
}
END {
    for (i = 1; i <= group; i++) {
        for (j = 0; j <= max; j++) {
            if (a[i,j] >= 1)
                print i, j, a[i,j]
            if (a[i,j] == -1)
                print i, j, 0
        }
        printf "\n"
    }
}
Command:
awk -f houses.awk houses
Output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1
