vi sed or awk. every line in a text file. replace 9 characters starting at position 75 - vim

I have a huge file
from line 3 to end of (#lines in file -1 )
starting at character position 75 on the line. I need to change the string to 123456789.
thought suggestions? I can't the existing characters per line are not duplicates so I can't search on that.
The joys of hiding pii data

In vim, you can do this:
%s/\(^.\{75\}\)\#<=........./1234567890/g
which basically does a lookbehind of 75 characters (which starts at the beginning of the line), and replaces the rest of the line with your string.

Let's consider this test file:
$ cat testfile
.........-.........-.........-.........-.........-.........-.........-....ReplaceMeKeep
.........-.........-.........-.........-.........-.........-.........-....OldData..Keep
Using sed
This replaces the nine characters starting with column 75 on with 123456789:
$ sed -E 's/(.{74}).{0,9}/\1123456789/' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
Using awk
This puts the new string in place of the first nine characters starting at position 75:
$ awk '{print substr($0,1,74) "123456789" substr($0,75+9)}' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep

Related

sed replacing first occurence of characters in each line of file only if they are first 2 characters

Is it possible using sed to replace the first occurrence of a character or substring in line of file only if it is the first 2 characters in the line?
For example we have this text file:
15 hello
15 h15llo
1 hello
1 h15loo
Using the following command: sed -i 's/15/0/' file.txt
Will give this output
0 hello
0 h15llo
1 hello
1 h0loo
What I am trying to avoid is it considering the characters past the first 2.
Is this possible?
Desired output:
0 hello
0 h15llo
1 hello
1 h15loo
You can use
sed -i 's/^15 /0 /' file.txt
sed -i 's/^15\([[:space:]]\)/0\1/' file.txt
sed -i 's/^15\(\s\)/0\1/' file.txt
Here, the ^ matches the start of string position, 15 matches the 15 substring and then a space matches a space.
The second and third solutions are the same, instead of a literal space, they capture a whitespace char into Group 1 and the group value is put back into the result using the \1 placeholder.

replace only white spaces (no tabs, no line end) of a tabular file with underscores

I need to replace only white spaces of a tab delimited file with underscores (but keeping the tabulation and the division in lines). The file is composed of 5 million lines and 8 columns, here the first two lines as example:
Contig505_strand1_frame2_coord21-810 sp|Q06605|GRZ1_RAT Granzyme-like protein 1 OS=Rattus norvegicus PE=2 SV=1 32.245 245 153 6 5.15e-33 123
Contig505_strand1_frame2_coord21-810 sp|P36178|CTRB2_LITVA Chymotrypsin BII OS=Litopenaeus vannamei PE=1 SV=1 34.483 232 140 7 1.78e-32 122
For now I am using these commands in sequence, but it's very slow...there is a quicker way to make it?
tr -s '\t' ';' <inputfile.txt >file2.txt
tr -s '[:blank:]' '_' <file2.txt >file3.txt
tr -s ';' '\t' <file3.txt >file4.txt
thank you!
[:blank:] includes tabs, so I think if you want to replace one or spaces with an underscore this may work better:
sed -E 's/ +/_/g' inputfile.txt > file2.txt
The sed (stream edit) command searches for one or more spaces and replaces them with an underscore. The 'g' is for global, meaning do the replacement multiple times on a line if found. The default action is to replace only the first occurrence.

I want to remove multiple line of text on linux

Just like this.
Before:
1
19:22
abcde
2
19:23
3
19:24
abbff
4
19:25
abbc
After:
1
19:22
abcde
3
19:24
abbff
4
19:25
abbc
I want remove the section having no alphabet like section 2.
I think that I should use perl or sed. But I don't know how to do.
I tried like this. But it didn't work.
sed 's/[0-9]\n[0-9]\n%s\n//'
sed is for doing s/old/new/ on individual lines, that is all. For anything else you should be using awk:
$ awk -v RS= -v ORS='\n\n' '/[[:alpha:]]/' file
1
19:22
abcde
3
19:24
abbff
4
19:25
abbc
The above is simply this:
RS= tells awk the input records are separated by blank lines.
ORS='\n\n' tells awk the output records must also be separated by blank lines.
/[[:alpha:]]/ searches for and prints records that contain alphabetic characters.
Simple enough in Perl. The secret is to put Perl in "paragraph mode" by setting the input record separator ($/) to an empty string. Then we only print records if they contain a letter.
#!/usr/bin/perl
use strict;
use warnings;
# Paragraph mode
local $/ = '';
# Read from STDIN a record (i.e. paragraph) at a time
while (<>) {
# Only print records that include a letter
print if /[a-z]/i;
}
This is written as a Unix filter, i.e. it reads from STDIN and writes to STDOUT. So if it's in a file called filter, you can call it like this:
$ filter < your_input_file > your_output_file
Alternatively this is a simple command line script in Perl (-00 is the command line option to put Perl into paragraph mode):
$ perl -00 -ne'print if /[a-z]/' < your_input_file > your_output_file
If there's exactly one blank line after each paragraph you can use a long awk oneliner (three patterns, so probably not a oneliner actually):
$ echo '1
19:22
abcde
2
19:23
3
19:24
abbff
4
19:25
abbc
' | awk '/[^[:space:]]/ { accum = accum $0 "\n" } /^[[:space:]]*$/ { if(on) print accum $0; on = 0; accum = "" } /[[:alpha:]]/ { on = 1 }'
1
19:22
abcde
3
19:24
abbff
4
19:25
abbc
The idea is to accumulate non-blank lines, setting flag once an alphabetical character found, and on a blank input line, flush the whole accumulated paragraph if that flag is set, reset accum to empty string and reset flag to zero.
(Note that if the last line of input is not necessarily empty you might need to add an END block that checks if currently there's a paragraph unflushed and flush it as needed.)
This might work for you (GNU sed):
sed ':a;$!{N;/^$/M!ba};/[[:alpha:]]/!d' file
Gather up lines delimited by an empty line or end-of-file and delete the latest collection if it does not contain an alpha character.
This presupposes that the file format is fixed as in the example. To be more accurate use:
sed -r ':a;$!{N;/^$/M!ba};/^[1-9][0-9]*\n[0-9]{2}:[0-9]{2}\n[[:alpha:]]+\n?$/!d' file
Similar to the solution of Ed Morton but with the following assumptions:
The text blocks consist of 2 or 3 lines.
If there is a third line, it contains characters from any alphabet.
In essence, under these conditions we only need to check for a third field:
awk 'BEGIN{RS=;ORS="\n\n";FS="\n"}(NF<3)' file
or similar without BEGIN:
awk -v RS= -v ORS='\n\n' -F '\n' '(NF<3)' file

Sed - Conditional Matching of pattern

I want to do the following:
Find pattern 1, then find the first instance of pattern 2. After doing so, I want to print the next line. This is for a sed script. I'm pretty lost on how to do this, since sed doesn't have if statements.
This might work for you (GNU sed):
sed -n '/first/,${/second/{n;p;q}}' file
Set -n option to emulate grep i.e. only print what you want. Focus on the range from first to the end of the file ($). Then match second and get the next line (n), print (p) and quit (q).
If filename j.txt contains below content:
10 20 30
40 50 60
10 90 80
sed -n '/10/p' j.txt | sed -n '/20/,+1p'
First it will search for pattern1 (10) and then it will search for pattern2 (20) and print corresponding next line with content match line
Output will be:
10 20 30
10 90 80

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as many times as possible ('*') as long as this character is not matching the next token in regex (which is \d in our case).
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in regex is in parentheses, it means "save it so I can reference it later". First parenthese will be known as '$1', second as '$2' etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output, append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed:
sed 's/\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]\)*/\1 \2 \3/g' file.txt

Resources