Print between two patterns with filepath/filename in a directory - linux

I need a command that prints the data between two strings (Hello and End), along with the file name and file path, on each line. Here are the input and output. I appreciate your time and help.
Input
file1:
Hello
abc
xyz
End
file2:
Hello
123
456
End
file3:
Hello
Output:
/home/test/seq/file1 abc
/home/test/seq/file1 xyz
/home/test/seq/file2 123
/home/test/seq/file2 456
I tried awk and sed but was not able to print the file name with the path.
awk '/Hello/{flag=1;next}/End/{flag=0}flag' * 2>/dev/null

With awk:
awk '!/Hello/ && !/End/ {print FILENAME,$0} ' /home/test/seq/file?
Output:
/home/test/seq/file1 abc
/home/test/seq/file1 xyz
/home/test/seq/file2 123
/home/test/seq/file2 456
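A minimal sketch of this exclusion-based approach, using two throwaway sample files (file1/file2 here are hypothetical names, not the real paths):

```shell
# Build sample files matching the question's layout
printf 'Hello\nabc\nxyz\nEnd\n' > file1
printf 'Hello\n123\n456\nEnd\n' > file2
# Print every line that is neither Hello nor End, prefixed with its file name
out=$(awk '!/Hello/ && !/End/ {print FILENAME, $0}' file1 file2)
echo "$out"
rm file1 file2
```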

If your file contains lines above Hello and/or below End, then you can use a flag to control printing as you had attempted in your question, e.g.
awk -v f=0 '/End/{f=0} f == 1 {print FILENAME, $0} /Hello/{f=1}' file1 file2 file..
This would handle the case where your input file contained, e.g.
$ cat file
some text
some more
Hello
abc
xyz
End
still more text
The flag f is a simple ON/OFF flag to control printing. Placing the End rule first, with the actual print in the middle, eliminates the need for any next command.
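The flag technique can be sketched end to end on a file that has text above Hello and below End (demo.txt is a hypothetical name):

```shell
# Build the extended sample from above
printf 'some text\nsome more\nHello\nabc\nxyz\nEnd\nstill more text\n' > demo.txt
# End rule first, print in the middle, Hello rule last: no next needed
out=$(awk -v f=0 '/End/{f=0} f == 1 {print FILENAME, $0} /Hello/{f=1}' demo.txt)
echo "$out"
rm demo.txt
```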

Related

How to combine all files in a directory, adding their individual file names as a new column in final merged file

I have a directory with files that looks like this:
CCG02-215-WGS.format.flt.txt
CCG05-707-WGS.format.flt.txt
CCG06-203-WGS.format.flt.txt
CCG04-967-WGS.format.flt.txt
CCG05-710-WGS.format.flt.txt
CCG06-215-WGS.format.flt.txt
Contents of each files look like this
1 9061390 14 93246140
1 58631131 2 31823410
1 108952511 3 110694548
1 168056494 19 23850376
etc...
The ideal output would be a file, let's call it all-samples.format.flt.txt, containing the concatenation of all files plus an additional column that shows which sample/file each row came from (some minor formatting is involved to remove the .format.flt.txt):
1 9061390 14 93246140 CCG02-215-WGS
...
1 58631131 2 31823410 CCG05-707-WGS
...
1 108952511 3 110694548 CCG06-203-WGS
...
1 168056494 19 23850376 CCG04-967-WGS
Currently, I have the following code which works for individual files.
awk 'BEGIN{OFS="\t"; split(ARGV[1],f,".")}{print $1,$2,$3,$4,f[1]}' CCG05-707-WGS.format.flt.txt
#OUTPUT
1 58631131 2 31823410 CCG05-707-WGS
...
However, when I try to apply it to all files using the star, it adds the first filename it finds as the final column for every file.
awk 'BEGIN{OFS="\t"; split(ARGV[1],f,".")}{print $1,$2,$3,$4,f[1]}' *
#OUTPUT, the last column should be as seen in the previous code block
1 9061390 14 93246140 CCG02-215-WGS
...
1 58631131 2 31823410 CCG02-215-WGS
...
1 108952511 3 110694548 CCG02-215-WGS
...
1 168056494 19 23850376 CCG02-215-WGS
I feel like the solution may just lie in adding an additional parameter to awk... but I'm not sure where to start.
Thanks!
UPDATE
Using awk's built-in FILENAME variable solved the issue, plus some elegant formatting logic for the file names.
Thanks @RavinderSingh13!
awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME;sub(/\..*/,"",file)} {print $0,file}' *.txt
With your shown samples, please try the following awk code. We need to use awk's built-in FILENAME variable here. Whenever the first line of any txt file is read (all txt files are passed to this program), we remove everything from the first . to the end of the value; the main block then prints the current line followed by file (the file's name, as per the requirement).
awk '
BEGIN { OFS="\t" }
FNR==1{
file=FILENAME
sub(/\..*/,"",file)
}
{
print $0,file
}
' *.txt
Or, in one-liner form, try the following awk code:
awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME;sub(/\..*/,"",file)} {print $0,file}' *.txt
You may use:
Any version awk:
awk -v OFS='\t' 'FNR==1{split(FILENAME, a, /\./)} {print $0, a[1]}' *.txt
Or in gnu-awk:
awk -v OFS='\t' 'BEGINFILE{split(FILENAME, a, /\./)} {print $0, a[1]}' *.txt
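A runnable sketch of the FILENAME + sub() approach on two hypothetical sample files (names mimic the question's pattern):

```shell
# One data line per sample file
printf '1 9061390 14 93246140\n' > CCG02-215-WGS.format.flt.txt
printf '1 58631131 2 31823410\n' > CCG05-707-WGS.format.flt.txt
# On each file's first line, strip everything from the first dot onward
out=$(awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME; sub(/\..*/,"",file)} {print $0,file}' CCG0*.flt.txt)
echo "$out"
rm CCG02-215-WGS.format.flt.txt CCG05-707-WGS.format.flt.txt
```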

Getting Issues In Adding A Tab Using AWK

Sample Logs
Location Number Status Comment
Delhi 919xxx Processed Test File 1
Mumbai 918xxx Got Stucked Test File 123
I'm trying to add one tab space after Status using AWK, but I'm getting an error.
Sample Query
awk '{$3 = $3 "\t"; print}' z
Getting Output As
Location Number Status Comment
Delhi 919xxx Processed Test File 1
Mumbai 918xxx **Got** **Stucked** Test File 123
As it is taking 'Got Stucked' as multiple fields, please suggest.
If you only want one tab after the header text Status to make it look better, use sub to the first record only:
$ awk 'NR==1 {sub(/Status/,"Status\t")} 1' file
Location Number Status Comment
Delhi 919xxx Processed Test File 1
Mumbai 918xxx Got Stucked Test File 123
This way awk won't rebuild the record and replace FS with OFS etc.
@JamesBrown's answer sounds like what you asked for, but also consider:
$ awk -F' +' -v OFS='\t' '{$1=$1}1' file | column -s$'\t' -t
Location Number Status Comment
Delhi 919xxx Processed Test File 1
Mumbai 918xxx Got Stucked Test File 123
The awk converts every sequence of 2+ spaces to a tab so the result is a tab-separated stream which column can then convert to a visually aligned table if that's your ultimate goal. Or you could generate a CSV to read into Excel or similar:
$ awk -F' +' -v OFS=',' '{$1=$1}1' file
Location,Number,Status,Comment
Delhi,919xxx,Processed,Test File 1
Mumbai,918xxx,Got Stucked,Test File 123
$ awk -F' +' -v OFS=',' '{for(i=1;i<=NF;i++) $i="\""$i"\""}1' file
"Location","Number","Status","Comment"
"Delhi","919xxx","Processed","Test File 1"
"Mumbai","918xxx","Got Stucked","Test File 123"
or more robustly:
$ awk -F' +' -v OFS=',' '{$1=$1; for(i=1;i<=NF;i++) { gsub(/"/,"\"\"",$i); if ($i~/[[:space:],"]/) $i="\""$i"\"" } }1' file
Location,Number,Status,Comment
Delhi,919xxx,Processed,"Test File 1"
Mumbai,918xxx,"Got Stucked","Test File 123"
If your input fields aren't always separated by at least 2 blank chars then tell us how they are separated.
Try using
awk -F'  ' '{$3 = $3 "\t"; print}' z
The problem is that, by default, awk considers any run of whitespace as the separator between two columns. This means that Got Stucked is actually treated as two different columns.
With -F'  ' (two spaces) you tell awk to use a double space as the separator to distinguish between two columns.
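A quick sketch of why the default field splitting misbehaves here: with the default FS, any run of whitespace splits fields, so "Got Stucked" becomes two fields and the row has 7 fields rather than 4.

```shell
# Count fields in the problem row under awk's default FS
line='Mumbai 918xxx Got Stucked Test File 123'
nf=$(printf '%s\n' "$line" | awk '{print NF}')
echo "$nf"
```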

Replacing a substring in certain lines of file A by string in file B

I have 2 files at the moment, file A and file B. Certain lines in file A contain a substring of some line in file B. I would like to replace these substrings with the corresponding string in file B.
Example of file A:
#Name_1
foobar
info_for_foobar
evenmoreinfo_for_foobar
#Name_2
foobar2
info_for_foobar2
evenmoreinfo_for_foobar2
Example of file B:
#Name_1_Date_Place
#Name_2_Date_Place
The desired output I would like:
#Name_1_Date_Place
foobar
info_for_foobar
evenmoreinfo_for_foobar
#Name_2_Date_Place
foobar2
info_for_foobar2
evenmoreinfo_for_foobar2
What I have so far:
I was able to get the order of the names in file B to correspond to those in file A. I was thinking of using a while loop that goes through every line of file B, then finds and replaces the corresponding substring in file A with that line, but I'm not sure how to put this into a bash script.
The code I have so far is, but which is not giving a desired output:
grep '#' fileA.txt > fileAname.txt
while read line
do
replace="$(grep '$line' fileB.txt)"
sed -i 's/'"$line"'/'"$replace"'/g' fileA.txt
done < fileAname.txt
Anybody has an idea?
Thanks in advance!
You can do it with this script:
awk 'NR==FNR {str[i]=$0;i++;next;} $1~"^#" {for (i in str) {if(match(str[i],$1)){print str[i];next;}}} {print $0}' B.txt A.txt
This awk should work for you:
awk 'FNR==NR{a[$0]; next} {for (i in a) if (index(i, $0)) $0=i} 1' fileB fileA
#Name_1_Date_Place
foobar
info_for_foobar
evenmoreinfo_for_foobar
#Name_2_Date_Place
foobar2
info_for_foobar2
evenmoreinfo_for_foobar2
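A minimal sketch of the index() lookup on a stripped-down pair of files (fileA/fileB are throwaway names):

```shell
# fileA holds the short header; fileB holds the full header
printf '#Name_1\nfoobar\n' > fileA
printf '#Name_1_Date_Place\n' > fileB
# Load fileB's lines into an array, then replace any fileA line that is a
# substring of an array entry with that full entry
out=$(awk 'FNR==NR{a[$0]; next} {for (i in a) if (index(i, $0)) $0=i} 1' fileB fileA)
echo "$out"
rm fileA fileB
```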

Get specified content between two lines

I have a file t.txt of the form:
abc 0
file content1
file content2
abc 1
file content3
file content4
abc 2
file content5
file content6
Now I want to retain all the content between abc 1 and abc 2 i.e. I want to retain:
file content3
file content4
For this I am using sed:
sed -e "/abc\s4/, /abc\s5/ p" t.txt > j.txt
But when I do so, I get j.txt as a copy of t.txt. I don't know where I am making the mistake. Can someone please help?
You can use this sed:
$ sed -n '/abc 1/,/abc 2/{/abc 1/d; /abc 2/d; p}' file
file content3
file content4
Explanation
/abc 1/,/abc 2/ selects the range of lines from the one containing abc 1 to the one containing abc 2. It could also be /^abc 1$/ to match the full line.
p prints the lines. So for example sed -n '/file/p' file will print all the lines containing the string file.
d deletes the lines.
'/abc 1/,/abc 2/p' alone would print the abc 1 and abc 2 lines:
$ sed -n '/abc 1/,/abc 2/p' file
abc 1
file content3
file content4
abc 2
So you have to explicitly delete them with {/abc 1/d; /abc 2/d;} and then print the rest with p.
With awk:
$ awk '$0=="abc 2" {f=0} f; $0=="abc 1" {f=1}' file
file content3
file content4
It sets the flag f whenever abc 1 is found and clears it when abc 2 is found. While f is true, the line is printed.
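A runnable sketch that builds the sample t.txt and checks that the sed and awk approaches agree:

```shell
# Recreate the question's input file
printf 'abc 0\nfile content1\nfile content2\nabc 1\nfile content3\nfile content4\nabc 2\nfile content5\nfile content6\n' > t.txt
# sed: print the range, deleting the boundary lines
sed_out=$(sed -n '/abc 1/,/abc 2/{/abc 1/d; /abc 2/d; p}' t.txt)
# awk: flag on after abc 1, off at abc 2
awk_out=$(awk '$0=="abc 2" {f=0} f; $0=="abc 1" {f=1}' t.txt)
echo "$sed_out"
rm t.txt
```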

extracting data from two list using a shell script

I am trying to create a shell script that pulls a line from a file and checks another file for an instance of the same. If it finds an entry, it adds it to another file, and it loops through the first list until it has gone through the whole file. The data in the first file looks like this:
email@address.com;
email2@address.com;
and so on
The other file, in which I am looking for a match (placing the match in the blank file), looks like this:
12334 email@address.com;
32213 email2@address.com;
I want it to retain the numbers as well as the matching data. I have an idea of how this should work but need to know how to implement it.
My Idea
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
FILE_DATA=( $( /bin/cat $file1))
FILE_DATA1=( $( /bin/cat $file2))
for I in $((${#FILE_DATA[@]}))
do
echo $FILE_DATA[$i] | grep $FILE_DATA1[$i] >> output.txt
done
I want the output to look like this, but only for addresses that match:
12334 email@address.com;
32213 email2@address.com;
Thank You
I quite like manipulating text using SQL:
$ cat file1
b@address.com
a@address.com
c@address.com
d@address.com
$ cat file2
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
$ join -1 1 -2 2 <(sort file1) <(sort -k2 file2) | awk '{print $2,$1}'
11457 b@address.com
22519 d@address.com
sort both inputs on the keys (we use emails as keys here)
join on the keys (file1.column1, file2.column2)
format the output (use awk to reverse the columns)
As you've learned about diff and comm, now it's time to learn about another tool in the unix toolbox, join.
Join does just what the name indicates, it joins together 2 files. The way you join is based on keys embedded in the file.
The number one constraint on using join is that the data must be sorted in both files on the same column.
file1
a abc
b bcd
c cde
file2
a rec1
b rec2
c rec3
join file1 file2
a abc rec1
b bcd rec2
c cde rec3
you can consult the join man page for how to reduce and reorder the columns of output. for example
$ join -o 1.1,2.2 file1 file2
a rec1
b rec2
c rec3
You can use your code for file name input to turn this into a generalizable script.
Your solution using a pipeline inside a for loop will work for small sets of data, but as the size of data grows, the cost of starting a new process for each word you are searching for will drag down the run time.
I hope this helps.
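The sort-then-join pipeline can be sketched as follows (temporary files are used instead of bash-only process substitution; f1/f2 are throwaway names mirroring the sample data):

```shell
# Emails-only list, and the numbered list to match against
printf 'b@address.com\nd@address.com\n' > f1
printf '22519 d@address.com\n11457 b@address.com\n' > f2
# Both inputs must be sorted on the join column
sort f1 > f1.sorted
sort -k2 f2 > f2.sorted
# Join on f1 column 1 and f2 column 2, then swap back to "number email" order
out=$(join -1 1 -2 2 f1.sorted f2.sorted | awk '{print $2,$1}')
echo "$out"
rm f1 f2 f1.sorted f2.sorted
```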
Read the file1.txt file line by line and assign each line to the var ADDR. grep file2.txt for the content of var ADDR and append the output to file_result.txt.
while read ADDR; do grep "${ADDR}" file2.txt >> file_result.txt; done < file1.txt
This awk one-liner can help you do that -
awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
NR and FNR are awk's built-in variables that store line numbers. NR does not get reset to 0 when moving to the second file; FNR does. So while that condition is true, we add every line to the array a. Once the first file is completed, we check the second column of the second file. If a match is present in the array, we write the entire line to f3.txt. If not, we ignore it.
Using data from Kev's solution:
[jaypal:~/Temp] cat f1.txt
b@address.com
a@address.com
c@address.com
d@address.com
[jaypal:~/Temp] cat f2.txt
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
[jaypal:~/Temp] awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
[jaypal:~/Temp] cat f3.txt
11457 b@address.com
22519 d@address.com