Replace only spaces (not tabs or line endings) in a tab-delimited file with underscores

I need to replace only the spaces in a tab-delimited file with underscores, keeping the tabs and the line breaks intact. The file has 5 million lines and 8 columns; here are the first two lines as an example:
Contig505_strand1_frame2_coord21-810 sp|Q06605|GRZ1_RAT Granzyme-like protein 1 OS=Rattus norvegicus PE=2 SV=1 32.245 245 153 6 5.15e-33 123
Contig505_strand1_frame2_coord21-810 sp|P36178|CTRB2_LITVA Chymotrypsin BII OS=Litopenaeus vannamei PE=1 SV=1 34.483 232 140 7 1.78e-32 122
For now I am running these commands in sequence, but it's very slow. Is there a quicker way to do it?
tr -s '\t' ';' <inputfile.txt >file2.txt
tr -s '[:blank:]' '_' <file2.txt >file3.txt
tr -s ';' '\t' <file3.txt >file4.txt
thank you!

[:blank:] includes tabs, so if you want to replace one or more spaces with an underscore, this may work better:
sed -E 's/ +/_/g' inputfile.txt > file2.txt
The sed (stream editor) command searches for runs of one or more spaces and replaces each run with a single underscore. The 'g' is for global, meaning the replacement is applied to every match on a line. Without it, only the first occurrence on each line is replaced.
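To check the behaviour on a small input before running it on the full file, here is a minimal sketch. The sample line is made up (modelled on the question's data, with explicit tabs), and the tr variant is an alternative not mentioned above: it maps every single space to an underscore instead of collapsing runs, which may or may not be what you want.

```shell
# Made-up sample: fields separated by tabs, spaces inside the description field.
sample=$(printf 'Contig505\tsp|Q06605|GRZ1_RAT Granzyme-like protein 1\t32.245\n')

# Replace each run of one or more spaces with one underscore; tabs are left alone.
with_sed=$(printf '%s\n' "$sample" | sed -E 's/ +/_/g')

# Alternative: tr turns every single space into an underscore (runs are not collapsed).
with_tr=$(printf '%s\n' "$sample" | tr ' ' '_')

printf '%s\n' "$with_sed"
```

On this sample both variants give the same line, because no field contains two consecutive spaces.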

Related

Sed Insert a newline before match

I'm trying to insert a string at a random line number in multiple text files. Before the string, I want to add a newline.
For example, a text file has 4 paragraphs.
paragraph 1
paragraph 2
paragraph 3
paragraph 4
I want the output to be
paragraph 1
STRING
paragraph 2
paragraph 3
paragraph 4
My code is working fine, but it's not adding the empty line before the string.
$ for i in *.txt; do sed -i "$(shuf -n 1 -e 2 4 6)i \n\rSTRING \n\r" $i ; done
The i command is actually i\, from the GNU manual:
'i\'
'TEXT'
insert TEXT before a line.
So the backslash before the n is "eaten" by the i command. Add an extra backslash and it should work.
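If the one-line form behaves differently across sed versions, the multi-line form of i\ is an alternative. This is a sketch, not the asker's code: the line number 2 and the three-line sample are placeholders. A text line that consists of a lone backslash continues the inserted text and contributes the empty line.

```shell
# Made-up sample input standing in for one of the .txt files.
input=$(printf 'paragraph 1\nparagraph 2\nparagraph 3\n')

# Insert an empty line followed by STRING before line 2.
# The lone backslash escapes its newline, so the inserted text starts with an empty line.
result=$(printf '%s\n' "$input" | sed '2i\
\
STRING')

printf '%s\n' "$result"
```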

How to replace extra spaces in text file with comma in linux

My text file contains runs of 3 or more spaces. I want to replace each run of 3 or more spaces with a comma, and leave runs of fewer than 3 spaces untouched.
ex:
input:
a b 3 c d 6 9
output:
a b,3,c,d,6,9
You can do it easily with sed:
$ sed -r 's/ {3,}/,/g' file
a b 3,c,d,6,9
The -r flag instructs sed to use the extended regular expression syntax (-E is the equivalent, more portable spelling), which we need for the {min,max} interval operator in the s/// search/replace command. With it we say: for each occurrence (note the g, or global, flag at the end) of the space character repeated 3 or more times (no upper limit), replace it with a comma. All other characters pass through unchanged.
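A small sketch of the interval operator at work. The input line is made up, with the spacing written out explicitly: one space between a and b, two spaces between x and y, and runs of three or four spaces elsewhere.

```shell
# Made-up input: single and double spaces must survive, longer runs become commas.
line='a b   3    c   x  y'

# -E enables extended regular expressions, so {3,} means "three or more".
result=$(printf '%s\n' "$line" | sed -E 's/ {3,}/,/g')

printf '%s\n' "$result"   # prints: a b,3,c,x  y
```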

Insert space after 3 characters in specific column in CSV file

In the file below I want to separate the month part and the date part of the value in the 5th column with a single space character.
Input File:
22144842,860998142,1001409110,DLY,Jan4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan2 2016,13:00,17:00
Required Output File:
22144842,860998142,1001409110,DLY,Jan 4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan 4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan 3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan 3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan 2 2016,13:00,17:00
How could I do this using the AWK language or the sed command ?
If you can assume a 3 letter month name in all cases and none of the preceding fields ever contain a comma, you should be able to do this using sed:
sed -r 's/([^,]*,){4}[A-Z][a-z]{2}/& /' file
The first four fields are described by zero or more characters that are not a comma [^,]* followed by a comma. The month name is described by an uppercase letter followed by two lowercase ones. The replacement is everything that is matched & with a space added afterwards.
awk -F, -v OFS=, '{sub(/.../, "& ", $5)}1' File
or
awk -F, -v OFS=, '{sub(/[A-Za-z]+/, "& ", $5)}1' File
Output:
22144842,860998142,1001409110,DLY,Jan 4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan 4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan 3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan 3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan 2 2016,13:00,17:00
The first command replaces the first 3 characters (/.../) of the 5th field with the same 3 characters (&) followed by a space. The second replaces the run of letters at the beginning of the 5th field with the same run (&) followed by a space.
This might work for you (GNU sed):
sed -r 's/([^,]{0,3})([^,]*)/\1 \2/5' file
Split the fifth set of non-delimiters into two and arrange as required.

vi sed or awk. every line in a text file. replace 9 characters starting at position 75

I have a huge file.
From line 3 to the second-to-last line,
starting at character position 75 on each line, I need to change the next 9 characters to 123456789.
Any thoughts or suggestions? The existing characters differ from line to line, so I can't search for them.
The joys of hiding PII data.
In vim, you can do this:
%s/\(^.\{74\}\)\@<=.\{9\}/123456789/
which does a lookbehind for the first 74 characters of the line and replaces the nine characters that follow (positions 75 to 83) with your string. To limit the change to line 3 through the second-to-last line, use the range 3,$-1 instead of %.
Let's consider this test file:
$ cat testfile
.........-.........-.........-.........-.........-.........-.........-....ReplaceMeKeep
.........-.........-.........-.........-.........-.........-.........-....OldData..Keep
Using sed
This replaces the nine characters starting at column 75 with 123456789:
$ sed -E 's/(.{74}).{0,9}/\1123456789/' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
Using awk
This puts the new string in place of the first nine characters starting at position 75:
$ awk '{print substr($0,1,74) "123456789" substr($0,75+9)}' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
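Neither command above limits the change to line 3 through the second-to-last line, as the question asks. One possible way to add that restriction to the sed version is sketched below. The function name mask_field and the five-line sample are illustrative only, and unlike the .{0,9} pattern above, this variant leaves lines shorter than 83 characters untouched.

```shell
# Hypothetical helper: overwrite columns 75-83 with 123456789,
# skipping lines 1-2 and the last line ("b" with no label jumps to the end of the script).
mask_field() {
  sed -E -e '1,2b' -e '$b' -e 's/^(.{74}).{9}/\1123456789/' "$@"
}

# Build a made-up five-line sample: 74 dots, nine data characters, then "Keep".
prefix=$(printf '%74s' '' | tr ' ' '.')
sample=$(for i in 1 2 3 4 5; do printf '%sOldData.%sKeep\n' "$prefix" "$i"; done)

masked=$(printf '%s\n' "$sample" | mask_field)
printf '%s\n' "$masked"
```

In this sample only lines 3 and 4 are rewritten; lines 1, 2 and 5 come out unchanged. The helper also accepts file names as arguments, for example mask_field testfile.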

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as few times as possible ('*?' is the non-greedy form of '*'), stopping as soon as the rest of the regex (which starts with \d in our case) can match.
\d+:\d+ - match at least one digit, followed by a colon and at least one more digit
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses will be known as '$1', the second as '$2', etc. In our case:
.*?(\d+):(\d+).*?(\d+)
    $1    $2      $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output. To remove it, append s/\t$// or s/'$'\t''$// respectively to the sed script.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed (GNU sed, which understands \t in the replacement):
sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*/\1\t\2\t\3/' file.txt
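Another option, not taken from the answers above, is to let awk split each line on runs of non-digit characters. This sketch assumes every line starts with non-digit text and contains exactly three numbers, as in the sample; the function name extract_numbers is made up.

```shell
# Hypothetical helper: print the three numbers on each line, tab-separated.
# With a run of non-digits as the field separator, $1 is the empty text
# before the first number, so the numbers are $2, $3 and $4.
extract_numbers() {
  awk -F'[^0-9]+' -v OFS='\t' '{ print $2, $3, $4 }' "$@"
}

extracted=$(printf '%s\n' \
  'Sequences (1:4) Aligned. Score: 4' \
  'Sequences (100:3011) Aligned. Score: 77' \
  'Sequences (12:345) Aligned. Score: 100' | extract_numbers)
printf '%s\n' "$extracted"
```

Because awk joins the fields with OFS, there is no trailing tab to strip afterwards.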
