Extract certain text from each line of text file using UNIX or perl - linux

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?

You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
A little explanation:
The regex here has the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as few times as possible ('*?' is the lazy form of '*'), stopping as soon as the next token in the regex (\d in our case) can match.
\d+:\d+ - match at least one digit, followed by a colon and at least one more digit
.*? - same as above
\d+ - match at least one digit
Additionally, if a token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses is known as '$1', the second as '$2', etc. In our case:
.*?(\d+):(\d+).*?(\d+)
    $1    $2      $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3

You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output; append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
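If you would rather not depend on \t support or on $'...' quoting, one possible variation is to keep a literal tab in a shell variable and strip the trailing tab in the same sed call. This is only a sketch; the names tab, input and result are made up for the example.

```shell
# A literal tab character, produced portably with printf.
tab=$(printf '\t')

# Sample lines copied from the question.
input='Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100'

# First command: drop any non-digits before each run of digits and
# put a tab after the run. Second command: remove the trailing tab.
result=$(printf '%s\n' "$input" |
  sed "s/[^0-9]*\([0-9][0-9]*\)/\1${tab}/g; s/${tab}\$//")

printf '%s\n' "$result"
```

Requiring at least one digit ([0-9][0-9]*) avoids empty matches, so the only thing left to clean up is the tab after the last number.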

My solution using sed:
sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*/\1 \2 \3/' file.txt

Related

sed replacing first occurrence of characters in each line of file only if they are first 2 characters

Is it possible using sed to replace the first occurrence of a character or substring in line of file only if it is the first 2 characters in the line?
For example we have this text file:
15 hello
15 h15llo
1 hello
1 h15loo
Using the following command: sed -i 's/15/0/' file.txt
Will give this output
0 hello
0 h15llo
1 hello
1 h0loo
What I am trying to avoid is it considering the characters past the first 2.
Is this possible?
Desired output:
0 hello
0 h15llo
1 hello
1 h15loo
You can use
sed -i 's/^15 /0 /' file.txt
sed -i 's/^15\([[:space:]]\)/0\1/' file.txt
sed -i 's/^15\(\s\)/0\1/' file.txt
Here, ^ matches the start of the line, 15 matches the 15 substring, and the space matches a literal space.
The second and third solutions are equivalent: instead of a literal space, they capture a whitespace character into Group 1 and put it back into the result with the \1 placeholder. Note that \s is a GNU sed extension, while [[:space:]] is POSIX.
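A quick way to check the anchored pattern without touching the file is to drop -i and feed the sample lines through a pipe. This is a minimal sketch; input and result are illustrative names.

```shell
# Sample lines copied from the question.
input='15 hello
15 h15llo
1 hello
1 h15loo'

# ^ pins the match to the start of the line, so a 15 that appears
# later in the line is never considered.
result=$(printf '%s\n' "$input" | sed 's/^15 /0 /')

printf '%s\n' "$result"
```

Once the output looks right, the same expression can be used with -i on the real file.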

How can I use grep and regular expression to display names with just 3 characters

I am new to grep and UNIX. I have a sample of data and want to display all the first names that contain exactly three characters, e.g. Lee_example, but I am having some difficulty doing that. I am currently using cat file.txt|grep -E "[A-Z][a-z]{2}", but it displays all the names that contain at least 3 characters, not only those with exactly 3.
Sample data
name
number
Lee_example
1
Hector_exaple
2
You need to match the _ after the first name.
grep -E "[A-Z][a-z]{2}_"
With awk:
awk -F_ 'length($1)==3{print $1}'
-F_ tells awk to split the input lines by _. length($1) == 3 checks whether the first field (the name) is 3 characters long and {print $1} prints the name in that case.
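Both commands can be compared on the sample data. In this sketch the line Ann_other (and its number 3) is a made-up addition to the question's data so that more than one name qualifies; LC_ALL=C is set so that [A-Z] and [a-z] mean plain ASCII ranges.

```shell
# Sample data from the question, plus one hypothetical entry (Ann_other).
names='name
number
Lee_example
1
Hector_exaple
2
Ann_other
3'

# grep prints the whole matching line.
with_grep=$(printf '%s\n' "$names" | LC_ALL=C grep -E '[A-Z][a-z]{2}_')

# awk prints only the first field, i.e. the name before the underscore.
with_awk=$(printf '%s\n' "$names" | awk -F_ 'length($1)==3{print $1}')

printf '%s\n' "$with_grep"
printf '%s\n' "$with_awk"
```

Note that the awk version tests only the length of the first field, so a three-character line without an underscore would also be printed.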

replace only white spaces (no tabs, no line end) of a tabular file with underscores

I need to replace only white spaces of a tab delimited file with underscores (but keeping the tabulation and the division in lines). The file is composed of 5 million lines and 8 columns, here the first two lines as example:
Contig505_strand1_frame2_coord21-810 sp|Q06605|GRZ1_RAT Granzyme-like protein 1 OS=Rattus norvegicus PE=2 SV=1 32.245 245 153 6 5.15e-33 123
Contig505_strand1_frame2_coord21-810 sp|P36178|CTRB2_LITVA Chymotrypsin BII OS=Litopenaeus vannamei PE=1 SV=1 34.483 232 140 7 1.78e-32 122
For now I am using these commands in sequence, but it's very slow. Is there a quicker way to do it?
tr -s '\t' ';' <inputfile.txt >file2.txt
tr -s '[:blank:]' '_' <file2.txt >file3.txt
tr -s ';' '\t' <file3.txt >file4.txt
thank you!
[:blank:] includes tabs, so if you want to replace one or more spaces with an underscore, this may work better:
sed -E 's/ +/_/g' inputfile.txt > file2.txt
The sed (stream editor) command searches for one or more spaces and replaces each run with an underscore. The 'g' flag is for global, meaning the replacement is applied to every match on a line; by default only the first occurrence is replaced.
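As a rough illustration of the difference from [:blank:], the sketch below runs a space-only replacement on one tab-separated row. The row is a shortened, hypothetical version of the data in the question. It also shows tr ' ' '_' as a possible alternative: tr replaces every single space one-for-one (it does not collapse runs of spaces), which gives the same result whenever spaces never appear twice in a row.

```shell
# A shortened, made-up row: three tab-separated columns, spaces in the second.
line=$(printf 'id_1\tGranzyme-like protein 1\t32.245')

# Replace each run of one or more spaces; tabs are left alone.
with_sed=$(printf '%s\n' "$line" | sed 's/  */_/g')

# Replace every space character individually; tabs are left alone.
with_tr=$(printf '%s\n' "$line" | tr ' ' '_')

printf '%s\n' "$with_sed"
printf '%s\n' "$with_tr"
```

Either form makes a single pass over the file instead of the three passes in the question.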

I want to remove multiple line of text on linux

Just like this.
Before:
1
19:22
abcde

2
19:23

3
19:24
abbff

4
19:25
abbc
After:
1
19:22
abcde

3
19:24
abbff

4
19:25
abbc
I want to remove every section that contains no letters, like section 2.
I think I should use perl or sed, but I don't know how.
I tried this, but it didn't work:
sed 's/[0-9]\n[0-9]\n%s\n//'
sed is for doing s/old/new/ on individual lines, that is all. For anything else you should be using awk:
$ awk -v RS= -v ORS='\n\n' '/[[:alpha:]]/' file
1
19:22
abcde

3
19:24
abbff

4
19:25
abbc
The above is simply this:
RS= tells awk the input records are separated by blank lines.
ORS='\n\n' tells awk the output records must also be separated by blank lines.
/[[:alpha:]]/ searches for and prints records that contain alphabetic characters.
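The three points above can be tried on a shortened copy of the data. This sketch assumes the sections are separated by single blank lines, as the answer does; input and kept are illustrative names.

```shell
# Three sections separated by blank lines; the second has no letters.
input='1
19:22
abcde

2
19:23

3
19:24
abbff'

# RS= switches awk to paragraph mode, so each section is one record.
# Only records containing an alphabetic character are printed, and
# ORS adds the blank line back after each one.
kept=$(printf '%s\n' "$input" | awk -v RS= -v ORS='\n\n' '/[[:alpha:]]/')

printf '%s\n' "$kept"
```

Sections 1 and 3 are printed and section 2 is dropped.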
Simple enough in Perl. The secret is to put Perl in "paragraph mode" by setting the input record separator ($/) to an empty string. Then we only print records if they contain a letter.
#!/usr/bin/perl
use strict;
use warnings;
# Paragraph mode
local $/ = '';
# Read from STDIN a record (i.e. paragraph) at a time
while (<>) {
    # Only print records that include a letter
    print if /[a-z]/i;
}
This is written as a Unix filter, i.e. it reads from STDIN and writes to STDOUT. So if it's in a file called filter, you can call it like this:
$ filter < your_input_file > your_output_file
Alternatively this is a simple command line script in Perl (-00 is the command line option to put Perl into paragraph mode):
$ perl -00 -ne 'print if /[a-z]/i' < your_input_file > your_output_file
If there's exactly one blank line after each paragraph you can use a long awk oneliner (three patterns, so probably not a oneliner actually):
$ echo '1
19:22
abcde

2
19:23

3
19:24
abbff

4
19:25
abbc
' | awk '/[^[:space:]]/ { accum = accum $0 "\n" } /^[[:space:]]*$/ { if(on) print accum $0; on = 0; accum = "" } /[[:alpha:]]/ { on = 1 }'
1
19:22
abcde

3
19:24
abbff

4
19:25
abbc
The idea is to accumulate non-blank lines, set a flag once an alphabetic character is found, and, on a blank input line, flush the whole accumulated paragraph if that flag is set, then reset accum to the empty string and the flag to zero.
(Note that if the last line of input is not necessarily empty you might need to add an END block that checks if currently there's a paragraph unflushed and flush it as needed.)
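One way the END block mentioned in the note could look is sketched below. It is an assumption about how to complete the script, not part of the original answer; the input deliberately ends without a blank line so that the last paragraph is flushed only by END.

```shell
# Input whose last paragraph is NOT followed by a blank line.
input='1
19:22
abcde

2
19:23

4
19:25
abbc'

result=$(printf '%s\n' "$input" | awk '
  /[^[:space:]]/   { accum = accum $0 "\n" }          # collect non-blank lines
  /^[[:space:]]*$/ { if (on) print accum              # flush on a blank line
                     on = 0; accum = "" }
  /[[:alpha:]]/    { on = 1 }                         # paragraph has a letter
  END              { if (on) printf "%s", accum }     # flush an unterminated paragraph
')

printf '%s\n' "$result"
```

Without the END rule, the final paragraph (4, 19:25, abbc) would be lost for this input.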
This might work for you (GNU sed):
sed ':a;$!{N;/^$/M!ba};/[[:alpha:]]/!d' file
Gather up lines delimited by an empty line or end-of-file and delete the latest collection if it does not contain an alpha character.
This presupposes that the file format is fixed as in the example. To be more accurate use:
sed -r ':a;$!{N;/^$/M!ba};/^[1-9][0-9]*\n[0-9]{2}:[0-9]{2}\n[[:alpha:]]+\n?$/!d' file
Similar to the solution of Ed Morton but with the following assumptions:
The text blocks consist of 2 or 3 lines.
If there is a third line, it contains characters from any alphabet.
In essence, under these conditions we only need to check whether a third field exists:
awk 'BEGIN{RS="";ORS="\n\n";FS="\n"}(NF>=3)' file
or similar without BEGIN:
awk -v RS= -v ORS='\n\n' -F '\n' '(NF>=3)' file

Swapping the first word with itself 3 times only if there are 4 words only using sed

Hi, I'm trying to solve a problem using only sed commands, without pipelines. I am allowed to redirect the output of a sed command to a file or to read from a file.
EX:
sed s/dog/cat/ >| tmp
or
sed s/dog/cat/ < tmp
Anyway, let's say I have a file F1 whose contents are:
Hello hi 123
if a equals b
you
one abc two three four
dany uri four 123
The output should be:
if if if a equals b
dany dany dany uri four 123
Explanation: the program must only print lines that have exactly 4 words and when it prints them it must print the first word of the line 3 times.
I've tried doing commands like this:
sed '/[^ ]*.[^ ]*.[^ ]*/s/[^ ]\+/& & &/' F1
or
sed 's/[^ ]\+/& & &/' F1
but I can't figure out how to make sed check that there are exactly 4 words in a line.
Any help will be appreciated.
$ sed -En 's/^([^[:space:]]+)([[:space:]]+[^[:space:]]+){3}$/\1 \1 &/p' file
if if if a equals b
dany dany dany uri four 123
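The above uses a sed that supports EREs with a -E option (e.g. GNU and OSX seds).
If the fields are tab separated, you can count the separators instead. This saves the line in the hold space, strips everything except the separators, deletes the line unless exactly three separators remain, then restores the line and triples its first word:
sed 'h;s/[^[:blank:]]//g;/^[[:blank:]]\{3\}$/!d;x;s/\([^[:blank:]]*[[:blank:]]\)/\1\1\1/' infile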
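The hold-space approach can be tried on a tab-separated copy of F1. The sketch below writes the script one command per line with comments; it tests for exactly three separators before restoring the line, so that single-word lines are also dropped. The tab-separated input is a made-up variant of the question's file.

```shell
# Tab-separated variant of the file from the question (hypothetical).
input=$(printf 'Hello\thi\t123\nif\ta\tequals\tb\nyou\none\tabc\ttwo\tthree\tfour')

result=$(printf '%s\n' "$input" | sed '
  # Save the original line in the hold space.
  h
  # Keep only the separators.
  s/[^[:blank:]]//g
  # Four words means exactly three separators; drop everything else.
  /^[[:blank:]]\{3\}$/!d
  # Bring the original line back and triple its first word.
  x
  s/\([^[:blank:]]*[[:blank:]]\)/\1\1\1/
')

printf '%s\n' "$result"
```

This assumes words are separated by a single tab or space; runs of separators would be counted once per character.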
