How to select only those lines which have the same string in all columns, in Linux

I have a text file with 200 columns like:
sample1 0 12 11 23 12
sample2 3 16 89 12 0
sample3 0 0 0 0 0
sample4 33 22 0 0 0
sample5 0 0 0 0 0
And I want only those lines which have only 0 from column 2 to 6. The desired output is:
sample3 0 0 0 0 0
sample5 0 0 0 0 0

Like this, for example:
$ awk '!$2 && !$3 && !$4 && !$5 && !$6' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Which is the same as:
$ awk '!($2 || $3 || $4 || $5 || $6)' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
As per your comment:
that is just an example, but I want to do that from column 2 to the 200th
This can be a way:
$ awk '{for (i=2;i<=200;i++) if ($i) {next}}1' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Note that $i refers to the field in position i, and $i is true when it holds a "true" (non-zero, non-empty) value. Hence, $i is false when it is 0.
Based on that, we loop through all the values. If one value is true, meaning not 0, we execute next, so the line is not analyzed any further. Otherwise (columns 2 to 200 all being 0 or empty), next is never reached and awk goes on to evaluate the 1, a true pattern whose default action is {print $0}.
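If the column count varies from file to file, the same idea works with awk's NF (number of fields) variable instead of the hardcoded 200; a minimal variant:
$ awk '{for (i=2;i<=NF;i++) if ($i) next} 1' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0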

Related

How to filter a matrix based on another column

I want to filter a matrix file using a column from another file.
I have 2 tab-separated files, one of which contains a matrix. I want to filter the matrix file (FileA) based on the first column of FileB: if a header (column name) of FileA is present in the first column of FileB, I want to keep that column in a new file. All the solutions I could find were based on filtering rows, not fields. Any help is appreciated. Thanks!
FileA
A B C D E F G H I J K L M N
R1 0 0 0 0 0 0 0 0 0 1 0 0 1 1
R2 1 1 0 1 0 0 0 0 1 0 1 0 0 0
R3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
R4 1 1 0 1 0 0 0 1 0 1 0 1 0 0
R5 0 0 0 0 1 0 1 0 1 0 1 0 1 0
FileB
A Green
B Purple
K Blue
L Blue
Z Green
M Purple
N Red
O Red
U Red
My expected output is:
ExpectedOutput
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
Oh, what the heck, I'm not sure having you post an R script is really going to make any difference other than satisfying my need to be pedantic so here y'go:
$ cat tst.awk
NR == FNR {
    outFldNames2Nrs[$1] = ++numOutFlds
    next
}
FNR == 1 {
    $0 = "__" FS $0
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        outFldNr = outFldNames2Nrs[$inFldNr]
        out2inFldNrs[outFldNr] = inFldNr
    }
}
{
    printf "%s", $1
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNrs[outFldNr]
        if (inFldNr) {
            printf "%s%s", OFS, $inFldNr
        }
    }
    print ""
}
$ awk -f tst.awk fileB fileA
__ A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
I'm using the term "field name" to apply to the letter at the top of each column ("field" in awk). Try to figure the rest out for yourself from looking at the man pages and adding "prints" if/when useful and then feel free to ask questions if you have any.
I added __ at the front of your header line so you'd have the same number of columns in every line of output - that makes it easier to pass along to other tools to manipulate further but it's easy to tweak the code to not do that if you don't like it.
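For instance, one simple way to drop the placeholder afterwards without touching the script (a sketch piping through sed):
$ awk -f tst.awk fileB fileA | sed '1s/^__ //'
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0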
As @EdMorton mentions, bash may not be a suitable tool for manipulating a complex data structure such as a table, from a maintainability and robustness point of view.
Here is a bash script example, just for information:
#!/bin/bash
declare -A seen
declare -a ary include
while read -r alpha color; do
    seen["$alpha"]=1
done < FileB
while read -r -a ary; do
    if (( nr++ == 0 )); then            # handle header line
        echo -n " "
        for (( i=0; i<${#ary[@]}; i++ )); do
            alpha="${ary[$i]}"
            if [[ ${seen["$alpha"]} = 1 ]]; then
                echo -n " $alpha"
                include[$((i+1))]=1
            fi
        done
    else
        echo -n "${ary[0]}"
        for (( i=1; i<${#ary[@]}; i++ )); do
            if [[ ${include[$i]} = 1 ]]; then
                echo -n " ${ary[$i]}"
            fi
        done
    fi
    echo
done < FileA
If python is your option, you can say instead something like:
import pandas as pd
dfb = pd.read_csv("./FileB", sep=r"\s+", header=None)
vb = [x[0] for x in dfb.values.tolist()]
dfa = pd.read_csv("./FileA", sep=r"\s+")
va = dfa.columns.tolist()
print(dfa[sorted(set(va) & set(vb))])
Output:
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
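Note that sorted(set(va) & set(vb)) matches FileA's column order here only because the column names happen to sort alphabetically; to preserve FileA's own column order in general, one could select the columns like this instead (a sketch):
wanted = set(vb)
print(dfa[[c for c in va if c in wanted]])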

BASH Script - Check if consecutive numbers in a string are above a value

I am echoing some data from an Oracle DB cluster, via a bash script. Currently, my output into a variable in the script from SQLPlus is:
11/12 0 0 0 0 0 0 1 0 1 0 5 4 1 0 0 0 0 0 0 0 0 0 0 0
What I'd like to be able to do is evaluate that string of numbers, excluding the first one (the date), to see if any 6 consecutive numbers are above a certain value, let's say 10.
I only want the logic to return true if all 6 consecutive values are above 10.
So for example, if the output was:
11/12 0 0 8 10 5 1 1 0 8 10 25 40 6 2 0 0 0 0 0 0 0 0 0 0
The logic should return false/null/zero, anything I can handle negatively.
But if the string looked like this:
11/12 0 0 0 0 5 9 1 0 1 10 28 10 12 19 15 11 6 7 0 0 0 0
Then it would return true/1 etc..
Is there any bash component that I can make use of to do this? I've been stuck on this part for a while now.
For variety, here is a solution not depending on awk:
#!/usr/bin/env bash
contains() {
    local nums=$* count=0 threshold=10 limit=6 i
    for i in ${nums#* }; do
        if (( i >= threshold )); then
            (( ++count >= limit )) && return 0
        else
            count=0
        fi
    done
    return 1
}

output="11/12 0 0 0 0 5 9 1 0 1 10 28 10 12 19 15 11 6 7 0 0 0 0"
if contains "$output"; then
    echo "Yaaay!"
else
    echo "Noooo!"
fi
Say your string is in $S; then
echo "$S" | awk '
{
    L = 0; threshold = 10; reqLength = 6
    for (i = 2; i <= NF; ++i) {
        if ($i >= threshold) {
            L += 1
            if (L >= reqLength) {
                exit(1)
            }
        } else {
            L = 0
        }
    }
}'
would do it. ($? will be 1 if you have a run of enough numbers exceeding your threshold.)
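Since the script exits with status 1 on a match, the shell test is inverted compared to the usual convention; for example, right after running the pipeline above:
if [ $? -eq 1 ]; then
    echo "found a run of 6 consecutive values above the threshold"
fi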

How to join 'n' number of files in an ordered way efficiently using paste/join, Linux, or Perl?

Thousands of files end with *.tab. The first column in each file is a header. Every file has its own headers (so they are all different). I don't mind keeping the header from just one file.
The number of rows is equal in all the files, and the rows have an order. My desired output keeps the same order.
Example files in a directory
test_1.tab
test_2.tab
.
.
.
.
test_1990.tab
test_2000.tab
test_1.tab
Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 .....0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 .....1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 .....1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 .....0
test_2000.tab
Pro_1901 1 1 1 1 0 1 1 0 0 0 0 1 .....0
Pro_1902 1 1 1 0 0 0 1 0 0 0 0 0 .....1
Pro_1903 1 1 0 1 0 1 0 0 0 0 0 1 .....1
.
.
.
Pro_2000 1 0 0 0 0 1 1 1 1 1 0 .....0
desired output
Pro_01 0 0 0 0 0 1 1 1 0 1 1 0 0 ..... 1 1 1 1 0 1 1 0 0 0 0 1 0
Pro_02 0 0 0 0 0 1 1 0 0 0 0 0 1 ..... 1 1 1 0 0 0 1 0 0 0 0 0 1
Pro_03 1 1 1 1 1 0 0 1 0 1 1 0 1 ..... 1 1 0 1 0 1 0 0 0 0 0 1 1
.
.
.
Pro_200 0 0 0 0 1 1 1 1 1 1 0 0 ..... 1 0 0 0 0 1 1 1 1 1 0 0
My code
for i in *.tab; do paste allCol.tab <(cut -f 2- "$i") > intermediate.csv; mv intermediate.csv allCol.tab; done
paste <(cut -f1 test_1.tab) allCol.tab > final.tab
rm allCol.tab
It takes quite a long time, about 3 hrs. Is there a better way?
Also, is there any other command to cross-check the output file against all input files, like diff or wc?
Try this.
#!/bin/bash
TMP=tmp
mkdir "$TMP"
RESULT=result
#read each file and append the contents of each line in them
#to a new file for each line in the tmp directory
for f in *.tab; do
    i=1
    while read -r l; do
        #zero-pad the per-line file names so the glob below
        #iterates in numeric line order (1, 10, 11, ... would not)
        echo "$l" >> "$TMP"/"$(printf '%06d' "$i")"
        ((i++))
    done < <(cut -f2- "$f")
done
#integrate each file in tmp dir into a single line of the $RESULT file
exec 1>>"$RESULT"
for f in "$TMP"/*; do
    while read -r l; do
        printf '%s\t' "$l"
    done < "$f"
    echo
done
rm -r "$TMP"
This algorithm can be split across a number of processors so the task gets done faster.
You can also add things like checking whether $TMP was created successfully, as sketched below.
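A minimal sketch of that check, aborting early when the directory cannot be created:
mkdir "$TMP" || { echo "cannot create $TMP" >&2; exit 1; }
Using mktemp -d instead would also give a unique directory name and avoid clobbering an existing tmp.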
A recursive function is a good tool. As a first cut -- short, but simple:
pasteAll() {
    first=$1; shift
    case $# in
        0) cut -f 2- "$first" ;;
        *) paste <(cut -f 2- "$first") <(pasteAll "$@") ;;
    esac
}
set -- *.tab
paste <(cut -f 1 "$1") <(pasteAll "$@")
Checking that all files and lines were included -- if every input file contains an identical number of lines -- is as simple as checking the output file's line count and the number of columns in its last line.
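A quick sanity check along those lines might look like this (a sketch; final.tab is the output file produced by the question's own code, and test_*.tab follows the question's file naming so the glob does not match the output file):
$ wc -l final.tab test_1.tab                          # line counts should agree
$ awk 'END{print NF}' final.tab                       # columns in the output's last line
$ awk 'FNR==1{n += NF-1} END{print n+1}' test_*.tab   # expected column count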

Searching a file for a string in the first field depending on the input from another file and piping the result to a new file

I have an input file like below
Model related text
Model specifications
*ELEMENT_SHELL
$# eid pid n1 n2 n3 n4 n5 n6 n7 n8
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
76742 1 79515 79206 79207 79516 0 0 0 0
76743 1 79516 79207 79208 79517 0 0 0 0
76744 1 79517 79208 79209 79518 0 0 0 0
76745 1 79518 79209 79210 79519 0 0 0 0
76746 1 79519 79210 79211 79520 0 0 0 0
In another file, File2.txt, I have only numbers like
76737
76738
76739
76740
76741
I have to compare each of these numbers from File2.txt with the number in the first field of each line of File1.txt, and if it matches, the complete line from File1.txt should be output to model.txt.
The output would be
Model related text
Model specifications
*ELEMENT_SHELL
$# eid pid n1 n2 n3 n4 n5 n6 n7 n8
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
Can anybody suggest a way to do this with awk, sed, etc.?
This can be very easily done using awk
awk 'FNR==NR{ value[$1]; next} $1 in value || FNR < 5'
Test
$ awk 'FNR==NR{ value[$1]; next} $1 in value || FNR < 5' file2 file1
Model related text
Model specifications
*ELEMENT_SHELL
$# eid pid n1 n2 n3 n4 n5 n6 n7 n8
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
If you are not interested in the leading headers in the output, the script can be further simplified as
awk 'FNR==NR{ value[$1]; next} $1 in value' file2 file1
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
What it does:
FNR==NR checks whether the number of records read from the current file equals the total number of records read; this is true only while the first file, here file2, is being read.
value[$1]; next creates an entry in an associative array indexed by $1, the values from file2, and skips to the next record.
$1 in value checks whether column 1 is present as an index in the associative array.
EDIT
Print only the first occurrence.
You can use delete to remove the entry from the associative array once the line has been printed. This ensures that the line is not printed for a second occurrence.
awk 'FNR==NR{ value[$1]; next} $1 in value{ print; delete value[$1] }'
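To write the result to model.txt as asked in the question, redirect the output of any of these, e.g.:
$ awk 'FNR==NR{ value[$1]; next} $1 in value || FNR < 5' file2 file1 > model.txt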

Matlab string operation

I have converted a string to binary as follows
message='hello my name is kamran';
messagebin=dec2bin(message);
Is there any method for storing it in an array?
I am not really sure what you want to do here, but if you need to concatenate the rows of the binary representation (which is a matrix of numchars times bits_per_char), this is the code:
message = 'hello my name is kamran';
messagebin = dec2bin(double(message));
linearmessagebin = reshape(messagebin',1,numel(messagebin));
Please note that the double conversion returns the ASCII codes. I do not have access to a Matlab installation here, but Octave, for example, complains about the code you provided in the original question.
NOTE
As was kindly pointed out to me, you have to transpose messagebin before "serializing" it in order to get the correct result.
If you want the result as numeric matrix, try:
>> str = 'hello world';
>> b = dec2bin(double(str),8) - '0'
b =
0 1 1 0 1 0 0 0
0 1 1 0 0 1 0 1
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0
0 1 1 1 0 1 1 1
0 1 1 0 1 1 1 1
0 1 1 1 0 0 1 0
0 1 1 0 1 1 0 0
0 1 1 0 0 1 0 0
Each row corresponds to a character. You can easily reshape it into a sequence of 0s and 1s, as sketched below.
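For example (a sketch, reusing the transpose-then-reshape trick from the first answer):
bits = reshape(b', 1, numel(b));  % flatten row by row into one long 0/1 vector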
