How to replace extra spaces in text file with comma in linux - linux

my text file has 3 or more than 3 spaces, now I want to replace the 3 or more than 3 spaces with a comma and it should not replace if the file has less than 3 spaces
ex:
input:
a b 3 c d 6 9
output:
a b,3,c,d,6,9

You can do it easily with sed:
$ sed -r 's/ {3,}/,/g' file
a b 3,c,d,6,9
The -r flag instructs the sed to use the extended regular expression syntax which we need for the {min,max} interval operator in the s/// search/replace command. With it we say: for each occurrence (note the g, or global flag in the end) of the space character which is repeated 3 or more times (no upper limit), replace it with ,. Pass through all other characters.

Related

sed replacing first occurence of characters in each line of file only if they are first 2 characters

Is it possible using sed to replace the first occurrence of a character or substring in line of file only if it is the first 2 characters in the line?
For example we have this text file:
15 hello
15 h15llo
1 hello
1 h15loo
Using the following command: sed -i 's/15/0/' file.txt
Will give this output
0 hello
0 h15llo
1 hello
1 h0loo
What I am trying to avoid is it considering the characters past the first 2.
Is this possible?
Desired output:
0 hello
0 h15llo
1 hello
1 h15loo
You can use
sed -i 's/^15 /0 /' file.txt
sed -i 's/^15\([[:space:]]\)/0\1/' file.txt
sed -i 's/^15\(\s\)/0\1/' file.txt
Here, the ^ matches the start of string position, 15 matches the 15 substring and then a space matches a space.
The second and third solutions are the same, instead of a literal space, they capture a whitespace char into Group 1 and the group value is put back into the result using the \1 placeholder.

replace only white spaces (no tabs, no line end) of a tabular file with underscores

I need to replace only white spaces of a tab delimited file with underscores (but keeping the tabulation and the division in lines). The file is composed of 5 million lines and 8 columns, here the first two lines as example:
Contig505_strand1_frame2_coord21-810 sp|Q06605|GRZ1_RAT Granzyme-like protein 1 OS=Rattus norvegicus PE=2 SV=1 32.245 245 153 6 5.15e-33 123
Contig505_strand1_frame2_coord21-810 sp|P36178|CTRB2_LITVA Chymotrypsin BII OS=Litopenaeus vannamei PE=1 SV=1 34.483 232 140 7 1.78e-32 122
For now I am using these commands in sequence, but it's very slow...there is a quicker way to make it?
tr -s '\t' ';' <inputfile.txt >file2.txt
tr -s '[:blank:]' '_' <file2.txt >file3.txt
tr -s ';' '\t' <file3.txt >file4.txt
thank you!
[:blank:] includes tabs, so I think if you want to replace one or spaces with an underscore this may work better:
sed -E 's/ +/_/g' inputfile.txt > file2.txt
The sed (stream edit) command searches for one or more spaces and replaces them with an underscore. The 'g' is for global, meaning do the replacement multiple times on a line if found. The default action is to replace only the first occurrence.

Sed Insert a newline before match

I'm trying to insert a string in multiple text files at a random line number. before adding the string in the text files i want to add a newline.
For example, a text file has 4 paragraphs.
paragraph 1
paragraph 2
paragraph 3
paragraph 4
I want the output to be
paragraph 1
STRING
paragraph 2
paragraph 3
paragraph 4
My code is working fine, but its not adding the empty newline before the string.
$ for i in *.txt; do sed -i "$(shuf -n 1 -e 2 4 6)i \n\rSTRING \n\r" $i ; done
The i command is actually i\, from the GNU manual:
'i\'
'TEXT'
insert TEXT before a line.
So the backslash before the n is "eaten" by the i command. Add an extra backslash and it should work.

Insert space after 3 characters in specific column in CSV file

In the file below I want to separate the month part and the date part of the value in the 5th column with a single space character.
Input File:
22144842,860998142,1001409110,DLY,Jan4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan2 2016,13:00,17:00
Required Output File:
22144842,860998142,1001409110,DLY,Jan 4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan 4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan 3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan 3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan 2 2016,13:00,17:00
How could I do this using the AWK language or the sed command ?
If you can assume a 3 letter month name in all cases and none of the preceding fields ever contain a comma, you should be able to do this using sed:
sed -r 's/([^,]*,){4}[A-Z][a-z]{2}/& /' file
The first four fields are described by zero or more characters that are not a comma [^,]* followed by a comma. The month name is described by an uppercase letter followed by two lowercase ones. The replacement is everything that is matched & with a space added afterwards.
awk -F, -v OFS=, '{sub(/.../, "& ", $5)}1' File
or
awk -F, -v OFS=, '{sub(/[A-Za-z]+/, "& ", $5)}1' File
Output:
22144842,860998142,1001409110,DLY,Jan 4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan 4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan 3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan 3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan 2 2016,13:00,17:00
Replace the first 3 characters(/.../) of the 5th field with the same 3 characters (&) followed by a space. Or, Replace the sequence of characters at the beginning of the 5th field with the sequence (&)followed by space.
This might work for you (GNU sed):
sed -r 's/([^,]{0,3})([^,]*)/\1 \2/5' file
Split the fifth set of non-delimiters into two and arrange as required.

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as many times as possible ('*') as long as this character is not matching the next token in regex (which is \d in our case).
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in regex is in parentheses, it means "save it so I can reference it later". First parenthese will be known as '$1', second as '$2' etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output, append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed:
sed 's/\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]\)*/\1 \2 \3/g' file.txt

Resources