I want to remove multiple line of text on linux - linux

Just like this.
Before:
1
19:22
abcde
2
19:23
3
19:24
abbff
4
19:25
abbc
After:
1
19:22
abcde
3
19:24
abbff
4
19:25
abbc
I want remove the section having no alphabet like section 2.
I think that I should use perl or sed. But I don't know how to do.
I tried like this. But it didn't work.
sed 's/[0-9]\n[0-9]\n%s\n//'

sed is for doing s/old/new/ on individual lines, that is all. For anything else you should be using awk:
$ awk -v RS= -v ORS='\n\n' '/[[:alpha:]]/' file
1
19:22
abcde
3
19:24
abbff
4
19:25
abbc
The above is simply this:
RS= tells awk the input records are separated by blank lines.
ORS='\n\n' tells awk the output records must also be separated by blank lines.
/[[:alpha:]]/ searches for and prints records that contain alphabetic characters.

Simple enough in Perl. The secret is to put Perl in "paragraph mode" by setting the input record separator ($/) to an empty string. Then we only print records if they contain a letter.
#!/usr/bin/perl
use strict;
use warnings;
# Paragraph mode
local $/ = '';
# Read from STDIN a record (i.e. paragraph) at a time
while (<>) {
# Only print records that include a letter
print if /[a-z]/i;
}
This is written as a Unix filter, i.e. it reads from STDIN and writes to STDOUT. So if it's in a file called filter, you can call it like this:
$ filter < your_input_file > your_output_file
Alternatively this is a simple command line script in Perl (-00 is the command line option to put Perl into paragraph mode):
$ perl -00 -ne'print if /[a-z]/' < your_input_file > your_output_file

If there's exactly one blank line after each paragraph you can use a long awk oneliner (three patterns, so probably not a oneliner actually):
$ echo '1
19:22
abcde
2
19:23
3
19:24
abbff
4
19:25
abbc
' | awk '/[^[:space:]]/ { accum = accum $0 "\n" } /^[[:space:]]*$/ { if(on) print accum $0; on = 0; accum = "" } /[[:alpha:]]/ { on = 1 }'
1
19:22
abcde
3
19:24
abbff
4
19:25
abbc
The idea is to accumulate non-blank lines, setting flag once an alphabetical character found, and on a blank input line, flush the whole accumulated paragraph if that flag is set, reset accum to empty string and reset flag to zero.
(Note that if the last line of input is not necessarily empty you might need to add an END block that checks if currently there's a paragraph unflushed and flush it as needed.)

This might work for you (GNU sed):
sed ':a;$!{N;/^$/M!ba};/[[:alpha:]]/!d' file
Gather up lines delimited by an empty line or end-of-file and delete the latest collection if it does not contain an alpha character.
This presupposes that the file format is fixed as in the example. To be more accurate use:
sed -r ':a;$!{N;/^$/M!ba};/^[1-9][0-9]*\n[0-9]{2}:[0-9]{2}\n[[:alpha:]]+\n?$/!d' file

Similar to the solution of Ed Morton but with the following assumptions:
The text blocks consist of 2 or 3 lines.
If there is a third line, it contains characters from any alphabet.
In essence, under these conditions we only need to check for a third field:
awk 'BEGIN{RS=;ORS="\n\n";FS="\n"}(NF<3)' file
or similar without BEGIN:
awk -v RS= -v ORS='\n\n' -F '\n' '(NF<3)' file

Related

Linux split a file in two columns

I have the following file that contains 2 columns :
A:B:IP:80 apples
C:D:IP2:82 oranges
E:F:IP3:84 grapes
How is possible to split the file in 2 other files, each column in a file like this:
File1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
File2
apples
oranges
grapes
Try:
awk '{print $1>"file1"; print $2>"file2"}' file
After runningl that command, we can verify that the desired files have been created:
$ cat file1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
And:
$ cat file2
apples
oranges
grapes
How it works
print $1>"file1"
This tells awk to write the first column to file1.
print $2>"file2"
This tells awk to write the second column to file2.
Perl 1-liner using (abusing) the fact that print goes to STDOUT, i.e. file descriptor 1, and warn goes to STDERR, i.e. file descriptor 2:
# perl -n means loop over the lines of input automatically
# perl -e means execute the following code
# chomp means remove the trailing newline from the expression
perl -ne 'chomp(my #cols = split /\s+/); # Split each line on whitespace
print $cols[0] . "\n";
warn $cols[1] . "\n"' <input 1>col1 2>col2
You could, of course, just use cut -b with the appropriate columns, but then you would need to read the file twice.
Here's an awk solution that'll work with any number of columns:
awk '{for(n=1;n<=NF;n++)print $n>"File"n}' input.txt
This steps through each field on the line and prints the field to a different output file based on the column number.
Note that blank fields -- or rather, lines with fewer fields than other lines, will cause line numbers to mismatch. That is, if your input is:
A 1
B
C 3
Then File2 will contain:
1
3
If this is a concern, mention it in an update to your question.
You could of course do this in bash alone, in a number of ways. Here's one:
while read -r line; do
a=($line)
for m in "${!a[#]}"; do
printf '%s\n' "${a[$m]}" >> File$((m+1))
done
done < input.txt
This reads each line of input into $line, then word-splits $line into values in the $a[] array. It then steps through that array, printing each item to the appropriate file, named for the index of the array (plus one, since bash arrays start at zero).

grep string after first occurrence of numbers

How do I get a string after the first occurrence of a number?
For example, I have a file with multiple lines:
34 abcdefg
10 abcd 123
999 abc defg
I want to get the following output:
abcdefg
abcd 123
abc defg
Thank you.
You could use Awk for this, loop through all the columns in each line upto NF (last column in each line) and once matching the first word, print the column next to it. The break statement would exit the for loop after the first iteration.
awk '{ for(i=1;i<=NF;i++) if ($i ~ /[[:digit:]]+/) { print $(i+1); break } }' file
It is not clear what you exactly want, but you can try to express it in sed.
Remove everything until the first digit, the next digits and any spaces.
sed 's/[^0-9]*[0-9]\+ *//'
Imagine the following two input files :
001 ham
03spam
3 spam with 5 eggs
A quick solution with awk would be :
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $1}' <file>
This line substitutes the first string of anything that does not contain a number followed by a number by an empty set (""). This way $0 is redefined and you can reprint the first field or the remainder of the field. This line gives exactly the following output.
ham
spam
spam
If you are interested in the remainder of the line
awk '{sub(/[^0-9]*[0-9]+ */,"",$0); print $0}' <file>
This will have as an output :
ham
spam
spam with 5 eggs
Be aware that an extra " *" is needed in the regular expression to remove all trailing spaces after the number. Without it you would get
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $0}' <file>
ham
spam
spam with 5 eggs
You can remove digits and whitespaces using sed:
sed -E 's/[0-9 ]+//' file
grep can do the job:
$ grep -o -P '(?<=[0-9] ).*' inputFIle
abcdefg
abcd 123
abc defg
For completeness, here is a solution with perl:
$ perl -lne 'print $1 if /[0-9]+\s*(.*)/' inputFIle
abcdefg
abcd 123
abc defg

How to remove only the first occurrence of a line in a file using sed

I have the following file
titi
tata
toto
tata
If I execute
sed -i "/tat/d" file.txt
It will remove all the lines containing tat. The command returns:
titi
toto
but I want to remove only the first line that occurs in the file containing tat:
titi
toto
tata
How can I do that?
You could make use of two-address form:
sed '0,/tat/{/tat/d;}' inputfile
This would delete the first occurrence of the pattern.
Quoting from info sed:
A line number of `0' can be used in an address specification like
`0,/REGEXP/' so that `sed' will try to match REGEXP in the first
input line too. In other words, `0,/REGEXP/' is similar to
`1,/REGEXP/', except that if ADDR2 matches the very first line of
input the `0,/REGEXP/' form will consider it to end the range,
whereas the `1,/REGEXP/' form will match the beginning of its
range and hence make the range span up to the _second_ occurrence
of the regular expression.
If you can use awk, then this makes it:
$ awk '/tata/ && !f{f=1; next} 1' file
titi
toto
tata
To save your result in the current file, do
awk '...' file > tmp_file && mv tmp_file file
Explanation
Let's activate a flag whenever tata is matched for the first time and skip the line. From that moment, keep not-skipping these lines.
/tata/ matches lines that contain the string tata.
{f=1; next} sets flag f as 1 and then skips the line.
!f{} if the flag f is set, skip this block.
1, as a True value, performs the default awk action: {print $0}.
Another approach, by Tom Fenech
awk '!/tata/ || f++' file
|| stands for OR, so this condition is true, and hence prints the line, whenever any of these happens:
tata is not found in the line.
f++ is true. This is the tricky part: first time f is 0 as default, so first f++ will return False and not print the line. From that moment, it will increment from an integer value and will be True.
Here's the general way to do it:
$ cat file
1 titi
2 tata
3 toto
4 tata
5 foo
6 tata
7 bar
$
$ awk '/tat/{ if (++f == 1) next} 1' file
1 titi
3 toto
4 tata
5 foo
6 tata
7 bar
$
$ awk '/tat/{ if (++f == 2) next} 1' file
1 titi
2 tata
3 toto
5 foo
6 tata
7 bar
$
$ awk '/tat/{ if (++f ~ /^(1|2)$/) next} 1' file
1 titi
3 toto
5 foo
6 tata
7 bar
Note that with the above approach you can skip whatever occurrence(s) of an RE you like (1st, 2nd, 1st and 2nd, whatever) and you only specify the RE once (as opposed to having to duplicate it for some alternative solutions).
Clear, simple, obvious, easily maintainable, extensible, etc....
Here is one way of doing it with sed:
sed ':a;$!{N;ba};s/\ntat[^\n]*//' file
titi
toto
tata
This might work for you (GNU sed):
sed '/pattern/{x;//!d;x}' file
Print all lines other than those containing the pattern as normal. Otherwise if the line contains the pattern and hold space does not (the first occurrence), delete that line.
You may find the first matching line number with grep and pass it to sed for deletion.
sed "$((grep -nm1 tat file.txt || echo 1000000000:) | cut -f 1 -d:) d" file.txt
grep -n combined with cut finds the line number to be deleted. grep -m1 ensures at most one line number is found. echo handles the case when there is no match so as not to return an empty result. sed "[line number] d" deletes the line.

How to use Linux command(sed?) to delete specific lines in a file?

I have a file that contains a matrix. For example, I have:
1 a 2 b
2 b 5 b
3 d 4 b
4 b 7 b
I know it's easy to use sed command to delete specific lines with specific strings. But what if I only want to delete those lines where the second field's value is b (i.e., second line and fourth line)?
You can use regex in sed.
sed -i 's/^[0-9]\s+b.*//g' xxx_file
or
sed -i '/^[0-9]\s+b.*/d' xxx_file
The "-i" argument will modify the file's content directly, you can remove "-i" and output the result to other files as you want.
Awk just work fine, just use code as below:
awk '{if ($2 != "b") print $0;}' file
if you want get more usage about awk, just man it!
awk:
cat yourfile.txt | awk '{if($2!="b"){print;}}'

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as many times as possible ('*') as long as this character is not matching the next token in regex (which is \d in our case).
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in regex is in parentheses, it means "save it so I can reference it later". First parenthese will be known as '$1', second as '$2' etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output, append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed:
sed 's/\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]\)*/\1 \2 \3/g' file.txt

Resources