Insert space after 3 characters in specific column in CSV file - string

In the file below I want to separate the month part and the date part of the value in the 5th column with a single space character.
Input File:
22144842,860998142,1001409110,DLY,Jan4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan2 2016,13:00,17:00
Required Output File:
22144842,860998142,1001409110,DLY,Jan 4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan 4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan 3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan 3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan 2 2016,13:00,17:00
How could I do this using the AWK language or the sed command ?

If you can assume a 3 letter month name in all cases and none of the preceding fields ever contain a comma, you should be able to do this using sed:
sed -r 's/([^,]*,){4}[A-Z][a-z]{2}/& /' file
The first four fields are described by zero or more characters that are not a comma [^,]* followed by a comma. The month name is described by an uppercase letter followed by two lowercase ones. The replacement is everything that is matched & with a space added afterwards.

awk -F, -v OFS=, '{sub(/.../, "& ", $5)}1' File
or
awk -F, -v OFS=, '{sub(/[A-Za-z]+/, "& ", $5)}1' File
Output:
22144842,860998142,1001409110,DLY,Jan 4 2016,13:00,17:00
22084015,860902007,29465297,DLY,Jan 4 2016,08:00,12:00
22034081,860845334,1001392391,DLY,Jan 3 2016,13:00,17:00
22159924,861029758,1001411656,DLY,Jan 3 2016,13:00,17:00
22068143,853558982,1001397841,DLY,Jan 2 2016,13:00,17:00
Replace the first 3 characters(/.../) of the 5th field with the same 3 characters (&) followed by a space. Or, Replace the sequence of characters at the beginning of the 5th field with the sequence (&)followed by space.

This might work for you (GNU sed):
sed -r 's/([^,]{0,3})([^,]*)/\1 \2/5' file
Split the fifth set of non-delimiters into two and arrange as required.

Related

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is same as the shown sample then following code may help you in same.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
With the parantheses your store in groups what's inbetween.
First group is everything after name=" till the first ". [^"] means "not a double-quote".
Second group is simply "one or more numbers at the end of the line preceeded with a space".

How to replace extra spaces in text file with comma in linux

my text file has 3 or more than 3 spaces, now I want to replace the 3 or more than 3 spaces with a comma and it should not replace if the file has less than 3 spaces
ex:
input:
a b 3 c d 6 9
output:
a b,3,c,d,6,9
You can do it easily with sed:
$ sed -r 's/ {3,}/,/g' file
a b 3,c,d,6,9
The -r flag instructs the sed to use the extended regular expression syntax which we need for the {min,max} interval operator in the s/// search/replace command. With it we say: for each occurrence (note the g, or global flag in the end) of the space character which is repeated 3 or more times (no upper limit), replace it with ,. Pass through all other characters.

How to remove line if it contains "-" and numbers [0-9]

So if the file is
abcd // no remove because it's has alphabets
a-c // no remove because it has numbers
1-1 // remove because it has both number and hyphen
a-1 // no remove because it contains alphabet
11 // no remove because it has no hypen
Only remove "1-1" because it contains only numbers and "-"
try:
grep -vE "^([0-9]*\\-[0-9]+)|([0-9]+\\-[0-9]*)$" input.txt
or if your input doesn`t contain a line with just "-" character you can simply use:
grep -vE "^[0-9]*\\-[0-9]*$" input.txt
Try this. This is using perl and regex.
perl -lne '{if($_!~/\d{1,}-\d{1,}/){print $_;}}' input_file

How to format decimal space using awk in linux

original file :
a|||a 2 0.111111
a|||book 1 0.0555556
a|||is 2 0.111111
now i need to control third columns with 6 decimal space
after i tried awk {'print $1,$2; printf "%.6f\t",$3'}
but the output is not what I want
result :
a|||a 2
0.111111 a|||book 1
0.055556 a|||is 2
that's weird , how can I do that will just modify third columns
Your print() is adding a newline character. Include your third field inside it, but formatted. Try with sprintf() function, like:
awk '{print $1,$2, sprintf("%.6f", $3)}' infile
That yields:
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
Print adds a newline on the end of printed strings, whereas printf by default doesn't. This means a newline is added after every second field and none is added after the third.
You can use printf for the whole string and manually add a newline.
Also I'm not sure why you are adding a tab to the end of the lines, so i removed that
awk '{printf "%s %d %.6f\n",$1,$2,$3}' file
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as many times as possible ('*') as long as this character is not matching the next token in regex (which is \d in our case).
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in regex is in parentheses, it means "save it so I can reference it later". First parenthese will be known as '$1', second as '$2' etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output, append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed:
sed 's/\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]\)*/\1 \2 \3/g' file.txt

Resources