keep groups of lines with specific keywords (bash) - linux

I have a text file with many lines in this format (the lines between every two # markers form a group):
# some str for test
hdfv 12 9 b
cgj 5 11 t
# another string to examine
kinj 58 96 f
dfg 7 26 u
fds 9 76 j
---
key.txt:
string to
---
output:
# another string to examine
kinj 58 96 f
dfg 7 26 u
fds 9 76 j
I need to search for some keywords (string, to) in the lines that start with #, and if the keywords do not exist in key.txt (a file with two columns), remove that line and the following lines of that group. I've written this code without result! (The keywords appear together in the input file, as in the example.)
cat input.txt | while IFS=$'#' read -r -a myarray
do
    a=${myarray[1]}
    b=${myarray[0]}
    unset IFS
    read -r a x y z <<< "$a"
    key=$(echo "$x $y")
    if grep "$key" key.txt > /dev/null
    then
        echo $key exists
    else
        grep -v -e "$a" -e "$b" input.txt > $$ && mv $$ input.txt
    fi
done
Can someone help me?

A simple way to get the correct block is to use awk with an appropriate Record Separator (RS):
awk 'FNR==NR {a[$0];next} { RS="#";for (i in a) if ($0~i) print}' key.txt input.txt
another string to examine
kinj 58 96 f
dfg 7 26 u
fds 9 76 j
This reinserts the # that was consumed as the separator and removes the extra empty line. There may be simpler ways to do this, but this works.
awk 'FNR==NR {a[$0];next} { RS="#";for (i in a) if ($0~i) {sub(/^ /,RS);sub(/\n$/,x);print}}' key.txt input.txt
#another string to examine
kinj 58 96 f
dfg 7 26 u
fds 9 76 j
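An alternative that avoids switching RS mid-stream is to process input.txt line by line and toggle a flag at each # header. A sketch, assuming the keywords in key.txt contain no regex metacharacters:
awk 'FNR==NR { keys[$0]; next }                           # first file: remember each keyword line
/^#/ { keep = 0; for (k in keys) if ($0 ~ k) keep = 1 }   # header line: decide whether to keep the group
keep' key.txt input.txt
Here the bare pattern keep prints every line (header included) while the flag is set, so whole groups are kept or dropped together.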

Related

I have two huge sequence files where I want to extract the same line numbers from file1 in file2

I have my two sequence files and a list of rows/lines of interest from file1. I want to extract the lines of file2 with the same line numbers as in file1. The list is just one column of numbers.
I tried using awk in a loop, but all I get is an empty output file.
My code looks like this:
for i in <listfile>;
do awk -F lnr="$i" 'NR==lnr' <file2> > outputfile
The output file is created but is just empty.
I could not find this question asked before, but if it has been, sorry for wasting your time.
If I understand the question, file1 has a list of line numbers and you want to print those lines from file2:
awk 'FNR==NR{line[$1]=1;next}{if(line[FNR]==1)print FNR, $0}' file1 file2
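For reference, the original loop fails for two separate reasons: -F sets awk's field separator rather than a variable (that needs -v), and > outputfile truncates the output on every iteration. A fixed version of the loop, as a sketch with listfile and file2 standing in for the real names:
while read -r i; do
    awk -v lnr="$i" 'NR == lnr' file2   # print only line number lnr
done < listfile > outputfile
That said, the single-pass awk above avoids rereading file2 once per line number and will be much faster.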
Given the input...
for i in {a..z}; do echo $i; done > /tmp/list-1
for i in {z..a}; do echo $i; done > /tmp/list-2
The current line number within each file is stored in FNR, so you can use that.
$ awk -v a=4 -v b=9 'FNR >= a && FNR <= b { print FILENAME, NR, FNR, $0 }' /tmp/list-*
Sample output:
/tmp/list-1 4 4 d
/tmp/list-1 5 5 e
/tmp/list-1 6 6 f
/tmp/list-1 7 7 g
/tmp/list-1 8 8 h
/tmp/list-1 9 9 i
/tmp/list-2 30 4 w
/tmp/list-2 31 5 v
/tmp/list-2 32 6 u
/tmp/list-2 33 7 t
/tmp/list-2 34 8 s
/tmp/list-2 35 9 r

Adding a number to column [line by line]

I have a text file named text. The rows and columns are:
1 A 18 -180
2 B 19 -180
3 C 20 -150
50 D 21 -100
128 E 22 -130
10 F 23 -0
10 G 23 -0
What I want to do is print out the 4th column, adding a constant number to it on each line (except where it is 0). This is what I have done so far:
#!/bin/bash
FILE="/dir/text"
while IFS= read -r line
do
    echo "$line"
done <"$FILE"
I can read the lines, but I also want to take an argument $1 that adds a constant number to the fourth column on every line, except lines where the fourth column is 0.
UPDATE:
The desired output would be like this (lines where the fourth column is zero are ignored):
-160
-160
-130
-80
-110
For example, if the program is named example.sh, I want to add a number to the fourth column using an argument. It would be run as:
example.sh $1
where $1 could be any number I want added to the 4th column.
You should use awk here, which will be faster than bash:
awk -v number="100" '$4!=0{$4+=number} 1' Input_file
number is an awk variable where you could set its value as per your need.
Explanation: a detailed explanation of the above code.
awk -v number="100" '  ##Start the awk program and create a variable number whose value is 100.
$4!=0{                 ##Check whether the 4th column is NOT zero; if so, do the following.
  $4+=number           ##Add the variable number to the 4th column.
}
1                      ##A bare 1 prints all lines, edited or not.
' Input_file           ##Mention the Input_file name here.
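For instance, with the sample file saved as text and number set to 20, the run produces the desired values in the fourth column:
$ awk -v number="20" '$4!=0{$4+=number} 1' text
1 A 18 -160
2 B 19 -160
3 C 20 -130
50 D 21 -80
128 E 22 -110
10 F 23 -0
10 G 23 -0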
To preserve your formatting while adding a value to the 4th field, you can calculate the new value and then use sub to patch it into the record, which keeps awk from rebuilding the line and collapsing the whitespace.
For example, with your file stored as text and adding a value of 180 to the 4th field (except where 0), you could do:
awk -v n=180 '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' text
Doing so would produce the following output:
$ awk -v n=180 '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' text
1 A 18 0
2 B 19 0
3 C 20 30
50 D 21 80
128 E 22 50
10 F 23 -0
10 G 23 -0
If called within a shell script, you could pass your $1 parameter as:
awk -v n="$1" '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' text
Though I would suggest checking that an argument has been provided to the script with:
[ -z "$1" ] && {
echo "error: value require as argument"
exit 1
}
or you can provide a default value -- up to you.
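Putting the pieces together, a minimal example.sh sketch (assuming the data file sits in the current directory as text):
#!/bin/bash
# example.sh -- add the first argument to the 4th column of text, skipping zeros
[ -z "$1" ] && { echo "error: value required as argument"; exit 1; }
awk -v n="$1" '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)} 1' text
Run it as ./example.sh 180 to add 180 to every non-zero fourth column.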
With bash:
while read -ra a; do [[ ${a[3]} != -0 ]] && ((a[3]+=42)); echo "${a[@]}"; done < file
Output:
1 A 18 -138
2 B 19 -138
3 C 20 -108
50 D 21 -58
128 E 22 -88
10 F 23 -0
10 G 23 -0

Sum all the numbers in a file given by positional parameter

I want to sum all the numbers in a file (across columns and lines) given as the first parameter, but my program builds the literal string sum+... instead of computing a numeric sum:
sum=0;
file=$1

for i in $file
do
    sum=sum+$i;
done;

echo "The sum is: " $sum
Input file:
$cat file.txt
10 20 10
40
50
Expected output:
The sum is: 130
Maybe there is an awk method to solve this?
Try this -
$cat file1.txt
10 20 10
40
50
$awk '{for(i=1;i<=NF;i++) {sum+=$i}} END {print sum}' file1.txt
130
OR
$xargs < file1.txt | tr ' ' + | bc
130
cat file.txt | xargs | sed -e 's/\ /+/g' | bc
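As an aside, the original loop never sums anything: for i in $file iterates over the filename itself, not the file's contents, and sum=sum+$i is a plain string assignment (shell arithmetic needs $((...))). A corrected version of the original approach, as a sketch assuming whitespace-separated numbers:
sum=0
for i in $(cat "$1"); do    # word splitting yields one number at a time
    sum=$((sum + i))        # $(( )) performs the arithmetic
done
echo "The sum is: $sum"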
You can also use a simple read and an array to sum the values, relying on word splitting via the default IFS (Internal Field Separator) to separate the values into an array, e.g.
#!/bin/bash

declare -i sum=0
fn="${1:-/dev/stdin}"           ## read from file given as 1st argument (default stdin)

while read -r line; do          ## read each line
    a=( $line )                 ## separate values into an array
    for i in "${a[@]}"; do      ## for each value in the array
        ((sum += i))            ## add to sum
    done
done <"$fn"

echo "sum: $sum"
Example Input File
$ cat dat/numfile.txt
10 20 10
40
50
Example Use/Output
$ bash sumnumfile.sh dat/numfile.txt
sum: 130
Another option, for some awks (at least mawk and gawk), is a regular-expression record separator: every non-digit character ends a record, so each number lands in $1 of its own record:
$ awk -v RS="[^0-9]" '{s+=$1}END{print s}' file
130

AWK--Comparing the value of two variables in two different files

I have two text files, A.txt and B.txt. Each line of A.txt contains a single number:
A.txt
100
222
398
B.txt
1 2 103 2
4 5 1026 74
7 8 209 55
10 11 122 78
What I am looking for is something like this:
for each line of A
    search B;
    if (the value of the third column in a line of B - the value of the variable in A > 10)
        print that line of B;
Is there any awk for doing that?
How about something like this?
I had some trouble understanding your question, but maybe this will give you some pointers.
#!/bin/bash

# Read the interesting values from file2 into an array,
for line in $(cat 2.txt | awk '{print $3}')
do
    arr+=($line)
done

# Line counter,
linenr=0

# Loop through every line in file 1,
for val in $(cat 1.txt)
do
    # Increment line counter,
    ((linenr++))
    # Loop through every element in the array (containing values from the 3rd column of file2)
    for el in "${!arr[@]}";
    do
        # If that value minus the value from file 1 is bigger than 10, print the line
        if [[ $((${arr[$el]} - $val)) -gt 10 ]]
        then
            sed -n "$(($el+1))p" 2.txt
            # echo "Value ${arr[$el]} (on line $(($el+1)) from 2.txt) - $val (on line $linenr from 1.txt) equals $((${arr[$el]} - $val)) and is hence bigger than 10"
        fi
    done
done
Note: this is a quick and dirty solution with room for improvement, but I think it'll do the job.
Use awk like this:
cat f1
1
4
9
16
cat f2
2 4 10 8
3 9 20 8
5 1 15 8
7 0 30 8
awk 'FNR==NR{a[NR]=$1;next} $3-a[FNR] < 10' f1 f2
2 4 10 8
5 1 15 8
UPDATE: Based on OP's edited question:
awk 'FNR==NR{a[NR]=$1;next} {for (i in a) if ($3-a[i] > 10) print}' A.txt B.txt
and see how simple the awk-based solution is compared to nested for loops.
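One caveat: the for (i in a) loop prints a line of B once per matching value from A, so a line can appear several times (1026 is more than 10 above all three values in A.txt, for example). If each line should be printed at most once, break after the first match, e.g.:
awk 'FNR==NR {a[NR]=$1; next} {for (i in a) if ($3-a[i] > 10) {print; break}}' A.txt B.txt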

How to extract one column from multiple files, and paste those columns into one file?

I want to extract the 5th column from multiple files, named in numerical order, and paste those columns in sequence, side by side, into one output file.
The file names look like:
sample_problem1_part1.txt
sample_problem1_part2.txt
sample_problem2_part1.txt
sample_problem2_part2.txt
sample_problem3_part1.txt
sample_problem3_part2.txt
......
Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines.
The content looks like:
sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7
sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14
sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g
The output should look like this (in the sequence problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, etc.):
1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...
I was using:
paste sample_problem1_part1.txt sample_problem1_part2.txt > \
sample_problem1_partall.txt
paste sample_problem2_part1.txt sample_problem2_part2.txt > \
sample_problem2_partall.txt
paste sample_problem3_part1.txt sample_problem3_part2.txt > \
sample_problem3_partall.txt
And then:
for i in `find . -name "sample_problem*_partall.txt"`
do
    l=`echo $i | sed 's/sample/extracted_col_/'`
    `awk '{print $5, $10}' $i > $l`
done
And:
paste extracted_col_problem1_partall.txt \
extracted_col_problem2_partall.txt \
extracted_col_problem3_partall.txt > \
extracted_col_problemall_partall.txt
It works fine with a few files, but it's a crazy method when the number of files is large (over 4000).
Could anyone help me with simpler solutions that are capable of dealing with multiple files, please?
Thanks!
Here's one way using awk and a sorted glob of files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)
Results:
1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g
Explanation:
For each line of each input file:
Add the file's line number to an array, with column 5 as the value.
(a[FNR] ? a[FNR] FS : "") is a ternary expression set up to build the array's value into a record. It asks whether the file's line number is already in the array: if so, take the existing value followed by the default field separator before appending the fifth column; if not, don't prepend anything and just use the fifth column.
At the end of the script:
Use a C-style loop to iterate through the array, printing each of its values.
For only ~4000 files, you should be able to do:
find . -name 'sample_problem*_part*.txt' | xargs paste
If find is giving names in the wrong order, pipe it to sort:
find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
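Note that paste merges whole lines rather than just the 5th columns. One way to keep only every fifth field of the pasted output, as a hedged sketch assuming each input file has exactly five whitespace-separated columns (sort -V is GNU sort's version sort, which puts problem10 after problem2):
find . -name 'sample_problem*_part*.txt' | sort -V | xargs paste |
    awk '{ line = ""; for (i = 5; i <= NF; i += 5) line = (line == "" ? $i : line OFS $i); print line }'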
# print filenames in sorted order
find -name sample\*.txt | sort |
# extract 5-th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python3 transpose.py ?
where transpose.py:
#!/usr/bin/env python3
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import zip_longest

missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in zip_longest(*[column.split() for column in sys.stdin],
                       fillvalue=missing_value):
    print(" ".join(row))
Output
1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g
Assuming the first and second files have fewer lines than the third one (missing values are replaced by '?').
Try this one. My script assumes that every file has the same number of lines.
# get the number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)

for ((i=1; i<=$lines; i++)); do
    for file in sample_problem*; do
        # get line number $i and delete everything except the last column,
        # and then print it
        # echo -n means that no newline is appended
        echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
    done
    echo
done
This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.
PS: The time for my script increases linearly with the number of lines. It would take a very long time to merge files with 1000 lines. You should consider learning awk and using the script from steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!
You can pass awk output to paste and redirect it to a new file as follows:
paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt
