Combine matching lines using sed or awk?


I have a file like the following:
1,
cake:01351
12,
bun:1063
scone:13581
biscuit:1931
14,
jelly:1385
I need to convert it so that when a number is read at the start of a line it is combined with the line beneath it, but if there is no number at the start the line is left as is. This would be the output that I need:
1,cake:01351
12,bun:1063
scone:13581
biscuit:1931
14,jelly:1385
I'm having a lot of trouble achieving this with sed; it may not be the best tool for what I think should be quite simple.
Any suggestions greatly appreciated.

A very basic sed implementation:
sed -e '/^[0-9]/{N;s/\n//;}'
This relies on the first character on only the 'number' lines being a number (as you specified).
It matches lines starting with a number (^[0-9]), brings in the next line (N), and deletes the embedded newline (s/\n//).
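If some data lines can also begin with a digit, the address can be tightened. The following is only a sketch, assuming the "number" lines consist of digits followed by one trailing comma; the 7up:22 line is a made-up data line added to show the difference.

```shell
# Sample input; "7up:22" is a hypothetical data line that starts with a digit.
input=$(printf '%s\n' '1,' 'cake:01351' '7up:22' 'scone:13581' '12,' 'bun:1063')

# Only lines made up of digits plus a trailing comma trigger the join:
# N appends the next line, s/\n// removes the newline between the two.
joined=$(printf '%s\n' "$input" | sed -e '/^[0-9][0-9]*,$/{N;s/\n//;}')
printf '%s\n' "$joined"
```

With the looser /^[0-9]/ address the 7up:22 line would have been glued to scone:13581; with the stricter one it is left alone.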

This is a file on my intranet. I can't recall where I found the handy sed one-liner. You might find something if you search for 'sed one-liner'
Have you ever needed to combine lines of text, but found it too tedious to do by hand?
For example, imagine that we have a text file with hundreds of lines which look like this:
14/04/2003,10:27:47,0
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
14/04/2003,10:30:51,600
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.010,0.975,0.005
14/04/2003,10:34:02,600
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.010,0.975,0.005
Each date (14/04/2003) is the start of a data record, and it continues on the next four lines.
We would like to input this to Excel as a 'comma separated value' file, and see each record in its own row.
In our example, we need to append any line starting with a G or I to the preceding line, and insert a comma, so as to produce the following:
14/04/2003,10:27:47,0,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
14/04/2003,10:30:51,600,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
14/04/2003,10:34:02,600,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
This is a classic application of a 'regular expression' and, once again, sed comes to the rescue.
The editing can be done with a single sed command:
sed -e :a -e '$!N;s/\n\([GI]\)/,\1/;ta' -e 'P;D' filename >newfilename
I didn't say it would be obvious, or easy, did I?
This is the kind of command you write down somewhere for the rare occasions when you need it.
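For anyone who wants to see what that one-liner is doing, here is the same script spread over several lines with a comment per command. The input is a shortened, made-up version of the records above, so treat it as an illustration rather than the original data.

```shell
# Two shortened records in the same layout as the example (hypothetical values).
input=$(printf '%s\n' \
  '14/04/2003,10:27:47,0' \
  'IdVg,3.000,-1.000' \
  'GmMax,0.011,0.975' \
  '14/04/2003,10:30:51,600' \
  'IdVg,3.000,-1.000')

combined=$(printf '%s\n' "$input" | sed '
:a
# append the next input line unless this is the last one
$!N
# if the appended line starts with G or I, turn the newline into a comma
s/\n\([GI]\)/,\1/
# a substitution happened: loop back and try to append another line
ta
# otherwise print the finished record (everything up to the newline)
P
# then delete it and restart the cycle with the leftover line
D
')
printf '%s\n' "$combined"
```

Each record comes out on a single line, with the continuation lines attached by commas.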

Try a regular expression, such as:
sed '/^[0-9]\+,/{N;s/\n//}'
That checks whether a line starts with a number (0-9) followed by a comma, appends the next line with N, then replaces the embedded newline with nothing, removing it. Note that \+ is a GNU sed extension.

Another awk solution, less cryptic than some other answers:
awk '/^[0-9]/ {n = $0; getline; print n $0; next} 1'
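One thing to watch with getline: if a number line is the very last line of the file, the unchecked getline fails and $0 is left unchanged, so the number line is printed twice on one line. A possible alternative without getline is sketched below; the variable names held and pending are my own, and it assumes number lines look like digits followed by a comma.

```shell
merged=$(printf '%s\n' '1,' 'cake:01351' '12,' 'bun:1063' 'scone:13581' '14,' |
  awk '
    # hold a number line back instead of printing it
    /^[0-9]+,$/ { held = held $0; pending = 1; next }
    # any other line is printed with whatever prefix is being held
    { print held $0; held = ""; pending = 0 }
    # a number line with nothing after it is printed as it was
    END { if (pending) print held }
  ')
printf '%s\n' "$merged"
```

Two number lines in a row would both be prefixed to the next data line, which may or may not be what you want.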

$ awk 'ORS= /^[0-9]+,$/?" ":"\n"' file
1, cake:01351
12, bun:1063
scone:13581
biscuit:1931
14, jelly:1385

Related

Simple way to remove multi-line string using sed

Using sed, is there a way to remove multiple lines from a text file based on some starting and ending expressions?
I have known markers in the file and want to remove everything between (markers inclusive). I have seen some really complicated solutions and I would like to do this without resorting to micro commands.
My file looks something like this:
cat /tmp/foobar.txt
this is line 1
this is line 3
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
ant-junit*.jar,\
ant-launcher.jar,\
ant.jar,\
asm-*.jar,\
aspectj*.jar,\
bootstrap.jar,\
catalina-ant.jar,\
catalina-ha.jar,\
catalina-ssi.jar,\
catalina-storeconfig.jar
the end leave me
and me
I want to remove everything starting at tomcat.util all the way to the last .jar

TL;DR
I think this is the simplest way, and there is no need for the assembly-like micro commands:
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
which produces
this is line 1
this is line 3
the end leave me
and me
If you want to remove the lines from the file itself rather than write the result to stdout, use the in-place flag -i:
sed -i '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
So... how does this work?
sed commands, like vi commands, operate on an address. Normally we don't specify an address, and the command is then applied to all lines of the file. For example, to replace "the" with "that" in a file we'd normally do
sed -i 's/the/that/g' /tmp/foobar.txt
ie applying the substitute or s command to all lines in the file.
In this case you want to delete some lines so we can use the delete or d command. But we need to tell it where to delete. So we need to give it an address.
The format of a sed command is
[addr][!]command[options]
(see the docs )
If no address is specified then the command is applied to all lines; if the ! is specified then it is applied to all lines that don't match the address. So far so good.
The trick here is that addr can be a single address or a range of addresses. An address can be a line number or a regex pattern. You use a , between two addresses to specify a range.
so to delete line 5 to 8 inclusive you could do
sed -i '5,8d' /tmp/foobar.txt
in this case rather than knowing the line number we know some "markers" and we can use Regex instead, so the first marker, a line starting with tomcat.util is found by the regex
/^tomcat\.util.*$/
The second marker is a bit more tricky but if we look we can see that the final line to remove is the first one that does not end with a \, so we can match a line that consists of "anything but does not end with \"
/^.*[^\]$/
On its own the second regex would match a whole bunch of lines, but in a range the second address only marks where the range ends: it is the first line after the start line that matches the regex.
Putting that all together, we want to delete (d) all lines in the range from the address that is found by the regex matching a line starting with tomcat.util and ending with a line that does not end in \ ie
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
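To see the range at work without touching a real file, here is a small sketch on a cut-down stand-in for /tmp/foobar.txt (the contents are abbreviated by me). The anchors and .* from the command above are dropped where they don't change what matches.

```shell
# Shortened stand-in for /tmp/foobar.txt (hypothetical contents).
sample=$(printf '%s\n' \
  'this is line 1' \
  'tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\' \
  'annotations-api.jar,\' \
  'catalina-storeconfig.jar' \
  'the end leave me')

# Range start: a line beginning with tomcat.util.
# Range end: the first later line whose last character is not a backslash.
kept=$(printf '%s\n' "$sample" | sed '/^tomcat\.util/,/[^\]$/d')
printf '%s\n' "$kept"
```

Only "this is line 1" and "the end leave me" survive. Note that an empty line has no last character, so it would not end the range.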
hope that helps ;-)
Cheers
Karl

Awk is generally more useful than sed for anything spanning lines. Using any awk in any shell on every Unix box:
$ awk '!/\.jar/{f=0} /tomcat\.util/{f=1} !f' file
this is line 1
this is line 3
the end leave me
and me

This might work for you (GNU sed):
sed -n '/tomcat\.util/{:a;n;/\.jar/ba};p' file
Turn off implicit printing using the -n option.
Match on a line containing tomcat.util.
Keep fetching lines (n) for as long as the fetched line contains .jar.
Print all other lines.
Alternative:
sed -E '/tomcat\.util/{:a;$!N;/\.jar(,\\)?$/s/\n//;ta;D}' file
Gather up lines beginning tomcat.util and ending either .jar,\ or .jar, removing newlines until the end-of-file or a mis-match and then delete the collection.

Linux remove whitespace first line

i have the file virt.txt contains:
0302 000000 23071SOCIETY 117
0602 000000000000000001 PAYMENT BANK
I want to remove 3 whitespace characters, from the 6th to the 8th column, on the first line only.
I do:
sed '1s/[[:blank:]]+[[:blank:]]+[[:blank:]]//6' virt.txt
It doesn't work.
Please help.

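In sed's default basic regular expressions an unescaped + is a literal plus sign, so your pattern looks for blank, +, blank, +, blank and never matches. Even with + treated as a quantifier (sed -E), it would consume all the available blanks from a sequence of three or more (in a quite inefficient way), and the trailing 6 flag would replace only the sixth such match. Because your first input line does not contain six or more separate stretches of three or more whitespace characters, it would still do nothing. But you can in fact use sed to do exactly what you say you want (note the three spaces after the closing parenthesis):
sed '1s/^\(.....\)   /\1/' virt.txt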
(or for convenience, if you have sed -E or the variant sed -r which works on some platforms, but neither of these is standard):
sed -E '1s/^(.{5}) {3}/\1/' virt.txt # -E is not portable
The parentheses capture the first five characters into a back reference, and we then use the first back reference \1 as the replacement string, effectively replacing only the text which matched outside the parentheses.
If your sed supports the -i option, you can use that to modify the file directly; but this is also not standard, so the most portable solution is to write the result to a new file, then move it back on top of the original file if you want to replace it.
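The portable "write to a new file, then move it back" approach might look like the sketch below. It runs in a scratch directory made with mktemp -d (widely available, but an assumption on my part) and uses a made-up two-line virt.txt with four spaces after 0302; the interval form \{3\} is used only so the space count is unambiguous.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
file="$workdir/virt.txt"

# Hypothetical sample: 0302, four spaces, then 000000 on the first line.
printf '0302%4s000000\n' '' > "$file"
printf '%s\n' '0602 000000000000000001' >> "$file"

# Write the edited copy next to the original, then replace the original
# only if sed succeeded.
tmp="$file.tmp"
sed '1s/^\(.\{5\}\) \{3\}/\1/' "$file" > "$tmp" && mv "$tmp" "$file"

first_line=$(sed -n '1p' "$file")
second_line=$(sed -n '2p' "$file")
printf '%s\n' "$first_line" "$second_line"
rm -r "$workdir"
```

The && matters: if sed fails, the original file is left in place instead of being overwritten by a partial result.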
sed is convenient if you are familiar with it, but as you are clearly not, perhaps a better approach would be to use a different language, ideally one which is not write-only for many users, like sed.
If you know the three characters will always be spaces, just do a static replacement.
awk 'NR==1 { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
On the first line (NR is the current input line number) replace the input line $0 with a catenation of the substrings on both sides of the part you want to cut.
For a simple replacement like that, you can also use basic Unix text manipulation utilities, though it's rather inefficient and inelegant:
head -n 1 virt.txt | cut -c1-5,9- >newfile.txt
tail -n +2 virt.txt >>newfile.txt
If you need to check that the three characters are spaces, the Awk script only needs a minor tweak.
awk 'NR==1 && /^.{5} {3}/ { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
You should vaguely recognize the regex from above. Awk is less succinct, but as a consequence also quite a lot more readable, than sed.

How to count number of lines with only 1 character?

I'm trying to print just the number of lines which have only 1 character.
I have a file with 200k lines some of the lines have only one character (any type of character)
Since I have no experience I have googled a lot and scraped documentation and come up with this mixed solution from different sources:
awk -F^\w$ '{print NF-1}' myfile.log
I was expecting that to filter the lines with a single character, and this part seems to work:
^\w$
However, I'm not getting the number of lines containing a single character. Instead I get something like this:

If a non-awk solution is OK:
grep -c '^.$' myfile.log

You could try the following:
awk '/^.$/{++c}END{print c+0}' file
The variable c is incremented for every line containing only 1 character (any character).
When the parsing of the file is finished, the variable is printed.

In awk, rules like your {print NF-1} are executed for each line. To print only one thing for the whole file you have to use END { print ... }. There you can print a counter which you increment each time you see a line with one character.
However, I'd use grep instead because it is easier to write and faster to execute:
grep -xc . yourFile

Using the awk command in linux terminal to ignore repeats?

So I am working on a program that needs to scan a file with a format that causes trouble when I use the awk function.
Basically, the trouble I am having is that there are repeats that I don't want to have. For example, file looks like this:
abcd
abcde
I do a line by line search for the string "abcd", and I only want it to give me the first line, not both. Is there anything I can add to the awk function so that it searches for just the thing I'm looking for and nothing more?
I apologize if this question is dumb, but I really couldn't figure out a way to search for the problem I'm having online, and from what I've read about awk, I couldn't find a way to fix my problem.
I also cannot edit the file at all, unfortunately.
Thanks for the help!

Just another way:
awk '/(^| )abcd( |$)/' file
It prints the line which contains the string abcd preceded by a space or a starting pattern and followed by a space or an end pattern.
Explanation:
(^| ) Matches the start of a line OR a space.
abcd Matches the Literal abcd.
( |$) Matches a space or a line end.
| is the alternation (logical OR) operator.
If a line matches the above mentioned pattern then it will be printed. You don't need to specify '{print}'. AWK would do it automatically.

You can do:
awk '/^abcd$/' file
or you can do
awk '$0=="abcd"' file
or, if it's the second field that needs to be exactly abcd and not abcde,
awk '$2=="abcd"' file
All of these match only abcd and not abcde.
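If abcd can turn up in any whitespace-separated field rather than a known one, a loop over the fields is one more option. This is a sketch with made-up input lines, not something taken from the question.

```shell
matches=$(printf '%s\n' 'abcd' 'abcde' 'x abcd y' 'xabcd' |
  awk '{
    # compare every field to the exact string; print the line on the first hit
    for (i = 1; i <= NF; i++)
      if ($i == "abcd") { print; next }
  }')
printf '%s\n' "$matches"
```

This prints the lines "abcd" and "x abcd y". Because awk splits fields on tabs as well as spaces by default, it also copes with tab-separated input.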

How to delete the line that matches a pattern and the line after it with sed?

I have a file that looks something like:
good text
good text
FLAG bad text
bad text
good text
good text
good test
bad Text FLAG bad text
bad text
good text
I need to delete any line containing "FLAG" and I always need to delete the one line immediately following the "FLAG" line too.
"FLAG" lines come irregularly enough that I can't rely on any sort of line number strategy.
Anyone know how to do this with sed?

Using an extension of the GNU version of sed:
sed -e '/FLAG/,+1 d' infile
It yields:
good text
good text
good text
good text
good test
good text

This works, and doesn't depend on any extensions:
sed '/FLAG/{N
d
}' infile
N reads the next line into the pattern space, then d deletes the pattern space.
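One edge case: when FLAG is on the last line there is no next line for N to read, and as far as I know implementations differ on whether the pattern space is printed before sed quits. A hedged variant that skips N on the last line, shown on made-up input:

```shell
cleaned=$(printf '%s\n' 'good text' 'FLAG bad text' 'bad text' 'good text' 'trailing FLAG' |
  sed '/FLAG/{
$!N
d
}')
printf '%s\n' "$cleaned"
```

Here $!N appends the following line only when one exists, so the final FLAG line is deleted by d like any other.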

Here is one way with awk:
awk '/FLAG/{f=1;next}f{f=0;next}1' file
or
awk '/FLAG/{getline;next}1' file
