split the file based on header and footer lines - linux

I have a text file structured like this:
[timestamp1] header with space
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
[timestamp6] footer with space
[timestamp7] junk
[timestamp8] header with space
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
[timestamp12] footer with space
[timestamp13] junk
[timestamp14] header with space
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
[timestamp19] footer with space
I need to find each part between header and footer and save it in another file. For example, file1 should contain (with or without timestamps; it doesn't matter):
data1
data2
data3
..
and the next block should be saved as file2, and so on.
This seems like a routine task, but I haven't found a solution yet.
I have this sed command that finds the first block:
sed -n "/header/,/footer/{p;/footer/q}" file
But I don't know how to iterate it over the subsequent matches. Maybe I should delete the first match after copying it to another file and then repeat the same command.
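Something like this loop over a scratch copy, perhaps? (A rough sketch; the 0,/footer/ address needs GNU sed.)
n=1
cp file work
while grep -q header work; do
    sed -n '/header/,/footer/{p;/footer/q}' work | grep -v -e header -e footer > "file$n"
    sed -i '0,/footer/d' work    # drop everything through the first footer
    n=$((n+1))
done
rm work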

I would harness GNU AWK for this task in the following way. Let file.txt content be
[timestamp1] header with space
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
[timestamp6] footer with space
[timestamp7] junk
[timestamp8] header with space
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
[timestamp12] footer with space
[timestamp13] junk
[timestamp14] header with space
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
[timestamp19] footer with space
then
awk '/header/{c+=1;p=1;next}/footer/{close("file" c);p=0}p{print $0 > ("file" c)}' file.txt
produces file1 with content
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
and file2 with content
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
and file3 with content
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
Explanation: my code has 3 pattern-action pairs. For a line containing header I increase the counter c by 1, set the flag p to 1, and go to the next line so no other action is undertaken. For a line containing footer I close the file named file followed by the current counter number and set the flag p to 0. For lines where p is true I print the current line ($0) to the file named file followed by the current counter number. If required, adjust /header/ and /footer/ to match solely the lines which are header and footer lines.
(tested in GNU Awk 5.0.1)
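If the timestamps should be stripped as well (the question allows either), a small variation could sub() away the leading [timestamp] field before printing; an untested sketch:
awk '/header/{c+=1;p=1;next}/footer/{close("file" c);p=0}p{sub(/^\[[^]]*\] /,""); print > ("file" c)}' file.txt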

Using any awk:
$ awk '/footer/{f=0} f{print > out} /header/{close(out); out="file" (++c); f=1}' file
$ head file?*
==> file1 <==
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
==> file2 <==
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
==> file3 <==
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..

A very naive approach in awk, coded fast and open to improvement, but it works:
BEGIN {
    i = 1
}
{
    if ($0 ~ /header/) {        # the sample lines contain "header", they never equal it
        write = 1
    } else if ($0 ~ /footer/) {
        write = 0
        i = i + 1
    } else if (write == 1) {
        print $0 > ("file" i)
    }
}
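Saved to a file (naive.awk is just a placeholder name here), it would be run as:
awk -f naive.awk file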

Based on this regex, here is a Ruby one-liner:
ruby -e 'cnt = 1
$<.read.scan(/^.*\bheader\b.*\s+([\s\S]*?)(?=^.*\bfooter\b)/) { |b|
  File.write("File_#{cnt}.txt", b[0])
  cnt += 1
}' file
Produces:
$ head File_*
==> File_1.txt <==
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
==> File_2.txt <==
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
==> File_3.txt <==
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
If you want to remove the timestamps:
ruby -e 'cnt = 1
$<.read.scan(/^.*\bheader\b.*\s+([\s\S]*?)(?=^.*\bfooter\b)/) { |b|
  File.write("File_#{cnt}.txt", b[0].gsub(/^\[[^\]]+\]\s+/, ""))
  cnt += 1
}' file
$ head File_*
==> File_1.txt <==
data1
data2
data3
..
==> File_2.txt <==
data4
data5
...
==> File_3.txt <==
data6
data7
data8
..
Note: If you want to include the header and/or footer, just move the capture group to include what you want.
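For instance, to keep the header line in each output file, the capture group could be widened to start before it; an untested sketch of the same one-liner:
ruby -e 'cnt = 1
$<.read.scan(/(^.*\bheader\b.*\s+[\s\S]*?)(?=^.*\bfooter\b)/) { |b|
  File.write("File_#{cnt}.txt", b[0])
  cnt += 1
}' file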

This might work for you (GNU csplit and sed):
csplit -qf file -b '%d' --supp file '/header/' '{*}' && sed -i '/footer/,$d' file? && rm file0
Use csplit to split file into multiple fileN files on header, suppressing the matching line.
Use sed to delete the footer and any following lines in each.
Remove the unwanted file0 file.
Alternative (GNU sed, using the e flag to execute shell commands):
sed -En '/header/{x;s/.*/echo $((0&+1))/e;x};/header/,/footer/!b;//b;G;s/(.*)\n/echo "\1" >>file/e' file
On each header line the counter kept in the hold space is incremented by executing echo $((counter+1)); lines outside a header/footer range, and the boundary lines themselves (//b), are skipped; each remaining line gets the counter appended (G) and is written out by executing echo "line" >>fileN.

Related

replace pattern in file 2 with pattern in file 1 if contingency is met

I have two tab-delimited data files. file1 looks like:
cluster_j_72 cluster-32 cluster-32 cluster_j_72
cluster_j_75 cluster-33 cluster-33 cluster_j_73
cluster_j_8 cluster-68 cluster-68 cluster_j_8
file2 looks like:
NODE_148 67545 97045 cluster-32
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster-68
I would like to confirm that, for a given row in file1, columns 2 and 3, as well as columns 1 and 4, are identical. If this is the case then I would like to take that row's value from column 2 (file1), find it in file2, and replace it with the value from column 1 (file1). Thus the new output of file2 would look like this (note: because columns 1 and 4 don't match for cluster 33 in file1, the pattern is not replaced in file2):
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
I have been able to get the contingency correct (here printing the value from file1 I'd like to use to replace a value in file2):
awk '{if ($2==$3 && $1==$4) {print $1}}' file1
If I could get sed to draw the values ($2 and $1) from file1 while looking in file2, this would work:
sed 's/$2(from file1)/$1(from file1)/' file2
But I don't seem to be able to nest this sed inside the previous awk statement, nor get sed to look for a pattern originating in a different file from the one it's searching.
thanks!
You never need sed when you're using awk since awk can do anything that sed can do.
This might be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {                           # first file: remember rows whose columns pair up
    if ( ($1 == $4) && ($2 == $3) ) {
        map[$2] = $1
    }
    next
}
$4 in map { $4 = map[$4] }          # second file: replace column 4 where mapped
{ print }
$ awk -f tst.awk file1 file2
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
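The same logic as a one-liner, if a separate tst.awk file is unwanted (equivalent, just inlined; the bare 1 replaces { print }):
awk 'BEGIN{FS=OFS="\t"} NR==FNR{if ($1==$4 && $2==$3) map[$2]=$1; next} $4 in map{$4=map[$4]} 1' file1 file2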

Split file using awk at pattern

Here is an example of the data that I have in a row in example.tsv:
somedata1:data1#||#somedata2:data2#||#somedata1:data3#||#somedata2:data4
I wanted to do two things:
Split the data on the pattern '#||#' and write it to another file. The number of columns after splitting is not fixed.
I have tried the awk command:
awk -F"#\|\|#" '{print;}' example.tsv > splitted.tsv
Output of the first file should be:
column 1
somedata1:data1
somedata2:data2
somedata1:data3
somedata2:data4
Next I want to split the data in splitted.tsv based on the ':'.
somedata1
data1
data3
And write it to a file.
Is there a way we could do this in a single awk command?
You need to escape the | correctly. Then use split:
awk -F'#\\|\\|#' '{split($2,a,":");print a[2]}' file
data2
To print all data out in a table:
awk -F'#\\|\\|#' '{for (i=1;i<=NF;i++) print $i}' file
somedata1:data1
somedata2:data2
somedata1:data3
somedata2:data4
To split the data even more:
awk -F'#\\|\\|#' '{for (i=1;i<=NF;i++) {split($i,a,":");print a[1],a[2]}}' file
somedata1 data1
somedata2 data2
somedata1 data3
somedata2 data4
For the first split, you could try
$ awk 'BEGIN{print "column1"}{gsub(/#\|\|#/,"\n"); print }' file
column1
somedata1:data1
somedata2:data2
somedata1:data3
somedata2:data4
To then split on :, you could do:
$ awk -F: 'BEGIN{print "column1","column2"}
{gsub(/#\|\|#/,"\n"); gsub(/:/," ");print }' file
column1 column2
somedata1 data1
somedata2 data2
somedata1 data3
somedata2 data4
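The second step (writing out the split data) could also be done per key in one pass; a sketch that names each output file after the part before the colon:
awk -F'#\\|\\|#' '{for (i=1;i<=NF;i++) {split($i,a,":"); print a[2] > (a[1] ".txt")}}' file
With the sample row this would leave data1 and data3 in somedata1.txt, and data2 and data4 in somedata2.txt.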

to copy specific line numbers from one file to specific line numbers of another file using awk

I have a .txt file with 71 lines, and I have another set of 12 files (file1 to file12). I want to copy the first 5 lines from the .txt file to file1 at specific line numbers, similarly the next 5 lines from the .txt file to file2, again at specific line numbers, and so on.
This is my current code:
n=1
sed -i '52,56d' $dumpfile
awk '{print $'"$n"',$'"$n+1"',$'"$n+2"',$'"$n+3"'}' sample.txt > $dumpfile
n=$(($n + 1))
In $dumpfile I have put my 12 files.
Sample file (12 files: file1, file2, ...)
...........
................
..............
abc = 4,1,3
def = 1,2,6
dfg = 28,36,4
tyu = 68,47,6
rty = 65,6,97
file (sample.txt)
abc = 1,2,3
def = 4,5,6
dfg = 2,3,4
tyu = 8,7,6
rty = 5,6,7
abc = 21,2,32
def = 64,53,6
dfg = 28,3,4
tyu = 18,75,6
rty = 5,63,75
...........
...........
I want to replace these five lines of file1...file12 with five lines from the sample.txt file. The line numbers of the lines to be replaced are the same in all 12 files, whereas in sample.txt the first set of 5 lines goes into file1, the second set of 5 lines goes into file2, and so on up to file12.
What you need is something like this (it uses GNU awk for ARGIND and inplace editing):
awk -i inplace -v start=52 '
    NR==FNR { new[NR]=$0; next }             # slurp sample.txt, one blank-line-separated block per record
    FNR==start { print new[ARGIND-1]; c=5 }  # at the target line, print this file's block...
    !(c&&c--)                                # ...and suppress the 5 old lines being replaced
' RS="" sample.txt RS='\n' file1 file2 ... file12
but until you post some testable sample input and the associated output it's just a guess and, obviously, untested.

Get specified content between two lines

I have a file t.txt of the form:
abc 0
file content1
file content2
abc 1
file content3
file content4
abc 2
file content5
file content6
Now I want to retain all the content between abc 1 and abc 2, i.e. I want to retain:
file content3
file content4
For this I am using sed:
sed -e "/abc\s4/, /abc\s5/ p" t.txt > j.txt
But when I do so, j.txt is just a copy of t.txt. I don't know where I am making the mistake. Can someone please help?
Your command copies the whole file because it runs sed without -n, so every line is auto-printed regardless of the range. You can use this sed:
$ sed -n '/abc 1/,/abc 2/{/abc 1/d; /abc 2/d; p}' file
file content3
file content4
Explanation
/abc 1/,/abc 2/ selects the range of lines from the one containing abc 1 to the one containing abc 2. It could also be /^abc 1$/ to match the full string.
p prints the lines. So for example sed -n '/file/p' file will print all the lines containing the string file.
d deletes the lines.
'/abc 1/,/abc 2/p' alone would print the abc 1 and abc 2 lines:
$ sed -n '/abc 1/,/abc 2/p' file
abc 1
file content3
file content4
abc 2
So you have to explicitly delete them with {/abc 1/d; /abc 2/d;} and then print the rest with p.
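A more compact way to exclude the boundary lines relies on the empty regex //, which reuses the last regular expression that was applied:
$ sed -n '/abc 1/,/abc 2/{//!p}' file
file content3
file content4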
With awk:
$ awk '$0=="abc 2" {f=0} f; $0=="abc 1" {f=1}' file
file content3
file content4
It sets the flag f whenever abc 1 is found and unsets it when abc 2 is found. Between the two, the bare f pattern is true and hence prints the lines.
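The same flag technique parameterizes nicely if the boundary strings change; a sketch passing them in as -v variables:
$ awk -v beg='abc 1' -v end='abc 2' '$0==end{f=0} f; $0==beg{f=1}' file
file content3
file content4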

Batch script for editing a text file

We have a txt file that is populated by a database dump, but it contains CR and LF breaks that we don't want. Basically, I am trying to edit C:\app.txt, remove all CRs and LFs, then add !## in front of "_TEXT_", and finally add a CR in front of "!##_TEXT_". This way I only have CRs in the places I want them, not all over the place.
I have tried using change.exe by Bruce Gunthrie, which worked well in a 32 bit environment, but doesn't work on a 64bit PC.
Any help would be greatly appreciated.
I saw some similar posts here, but I have trouble reading the code because it is too complex, so I didn't know how to adapt it for our environment.
Thanks
Luke
e.g.
_TEXT_ data data1
data2 data3 data4
_TEXT_ data data1
data2 data3 data4
Should read:
_TEXT_ data data1 data2 data3 data4
_TEXT_ data data1 data2 data3 data4
Is PowerShell an option? (I would guess so, a 64-bit computer with a current Windows version has PowerShell installed.)
$(
    $line = ''
    switch -wildcard -File C:\app.txt {
        '_TEXT_*' {
            if ($line) { $line }
            $line = "!##$_"
        }
        default {
            $line += ' ' + $_
        }
    }
    if ($line) { $line }
) -join "`r"
This will join the lines with CR, as you wished. Pipe the result to Set-Content to write it to a file.
