I have a dirty csv-file containing rows with quoted semicolons. I am trying to clear these semicolons with commands like:
sed -rin 's/(^.*\;.*\;\".*)(\;)(.*\"\;.*$)/\1\3/' file
But somehow this doesn't remove all of the semicolons. Some of the problematic rows look like this:
;0;"One ▒;)";123; ... ; nth-1column;
;0;"Two ▒;)";456; ... ; nthcolumn;
When they should be cleaned to:
;0;"One ▒)";123; ... ; nth-1column;
;0;"Two ▒)";456; ... ; nthcolumn;
There might be some encoding issues, but this should be ignored by the regex. I am only interested in removing the semicolons, the encoding is handled afterwards.
Any ideas on how to aggressively clean all semicolons contained within double-quotes?
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^;"]*"[^"]*)*"[^";]*);/\1/;ta' file
The pattern is anchored at the start of the line and captures, as group 1, any unquoted text, then any number of quoted strings that contain no semicolons (each followed by its unquoted text), then an opening double quote and characters that are neither a double quote nor a semicolon. If the next character is a semicolon, it is removed. The substitution is repeated until it fails, and then the result is printed.
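As a quick check of the loop described above, here is a minimal sketch that runs it on a few made-up rows. The sample data is an assumption (plain ASCII, not taken from the original file), and GNU sed is assumed.

```shell
# Hypothetical sample rows; each has at least one semicolon inside double quotes.
input=';0;"One x;)";123;tail;
a;"b;c;d";e
"a;b";"c;d"'

# Loop: remove the first quoted semicolon, then retry until nothing matches.
cleaned=$(printf '%s\n' "$input" |
  sed -E ':a;s/^([^"]*("[^;"]*"[^"]*)*"[^";]*);/\1/;ta')

printf '%s\n' "$cleaned"
```

Only the semicolons inside the quotes should disappear; the field separators outside the quotes stay in place.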
An alternative:
sed -E '/^([^"]*("[^";]*"[^"]*)*"[^";]*);/{s//\n\1/;D}' file
or:
sed -E 's/^([^"]*("[^";]*"[^"]*)*"[^";]*);/\n\1/;T;D' file
EDIT:
sed -nE '/^([^"]*("[^";]*"[^"]*)*"[^";]*);/{:a;s//\1/;ta;p}' file
You can use
sed ':a;s/^\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\);/\1/;ta' file
It works like this:
:a - sets a label
^\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\); - find:
^ - start of string
\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\) - Group 1:
\([^"]*;\?\|"[^";]*";\?\)* - zero or more occurrences of
[^"]*;\? - zero or more chars other than " and then an optional ;
\| - or
"[^";]*";\? - ", then zero or more chars other than " and ; and then a " and then an optional ;
" - a " char
[^";]* - zero or more chars other than a ; and "
; - a semi-colon
\1 - replace with Group 1 value
ta - if there was a substitution, go back to a label position.
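To sanity-check the expression broken down above, the following sketch feeds it a few hypothetical rows, including one without any quotes that should pass through unchanged. It assumes GNU sed, since \? and \| are GNU extensions to basic regular expressions.

```shell
# Hypothetical rows: the first has no quotes at all, the others have quoted semicolons.
rows='x;y;z
;0;"Two y;)";456;last;
"p;q";"r";"s;t;u"'

fixed=$(printf '%s\n' "$rows" |
  sed ':a;s/^\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\);/\1/;ta')

printf '%s\n' "$fixed"
```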
I have a string:
2021-05-27 10:40:50.678117 PID529270:TID 47545543550720:SID 1673488:TXID 786092740:QID 140: INFO:MEMCONTEXT:MemContext state: mem[cur/hi/max] = 9135 / 96586 / 96576 MB, VM[cur/hi/max] = 9161 / 21841178 / 100663296 MB
I want to get the number 9135, the first number that occurs between '=' and '/'. My current command, shown below, works, but I don't think it's ideal:
sed -r 's/.* = ([0-9]+) .* = .*/\1 /'
I need a neater one; please advise.
You can use
sed -En 's~.*= ([0-9]+) /.*=.*~\1~p'
An awk solution:
awk -F= '{gsub(/\/.*|[^0-9]/,"",$2);print $2}'
Details:
-En - E (or r as in your example) enables the POSIX ERE syntax and n suppresses the default line output
.*= ([0-9]+) /.*=.* - matches any text, = + space, captures one or more digits into Group 1, then matches a space, /, then any text, = and again any text
\1 - replaces with Group 1 value
p - prints the result of the substitution.
Here, ~ are used as regex delimiters in order not to escape / in the pattern.
awk:
-F= - sets the input field separator to =
gsub(/\/.*|[^0-9]/,"",$2) - removes the first / together with the rest of the field, and any other non-digit character
print $2 - prints the modified Field 2 value.
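As a rough check of the two commands explained above, this sketch runs both on the log line from the question, with the leading timestamp and IDs left out for brevity. The variable names are only illustrative.

```shell
# Tail of the log line from the question; the prefix before INFO is omitted.
line='INFO:MEMCONTEXT:MemContext state: mem[cur/hi/max] = 9135 / 96586 / 96576 MB, VM[cur/hi/max] = 9161 / 21841178 / 100663296 MB'

# sed: capture the digits after the first "= " that is followed later by another "=".
from_sed=$(printf '%s\n' "$line" | sed -En 's~.*= ([0-9]+) /.*=.*~\1~p')

# awk: split on "=", then strip everything but the leading number from field 2.
from_awk=$(printf '%s\n' "$line" | awk -F= '{gsub(/\/.*|[^0-9]/,"",$2);print $2}')

printf 'sed: %s\nawk: %s\n' "$from_sed" "$from_awk"
```

Both variables should end up holding 9135.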
You could also get the first match with grep using -P for Perl-compatible regular expressions.
grep -oP "^.*? = \K\d+(?= /)"
^ Start of string
.*? Match as few chars as possible
= Match space = and space
\K\d+ Forget what is matched so far, then match one or more digits
(?= /) Assert a space and / to the right
Output
9135
Since you want the material between the first = and the first /, ignoring the spaces, you could use:
sed -E -e 's%^[^=]*= ([^/]*) /.*$%\1%'
This uses Extended Regular Expressions (ERE) (-E; -r also works with GNU sed), and searches from the start of the line for a sequence of 'not =' characters, the = character, a space, anything that's not a slash (which is remembered), another space, a slash, and anything that follows, replacing it all with what was remembered. The ^ and $ anchors aren't crucial; it will work the same without them. The % symbols are used instead of / because the searched-for pattern includes a /. If you're sure there'll never be any spaces other than the first and last ones between the = and /, you can use [^ /]* in place of [^/]* and there should be some small (probably immeasurable) performance benefit.
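A small sketch of this approach, using a shortened copy of the question's line plus one hypothetical line that has only a single = sign. The variable names are illustrative.

```shell
# Shortened copy of the question's line, and a made-up line with only one "=".
line='mem[cur/hi/max] = 9135 / 96586 / 96576 MB, VM[cur/hi/max] = 9161 / 21841178 / 100663296 MB'
single='x = 42 / 7'

# [^/]* form: everything between "= " and " /" is remembered.
general=$(printf '%s\n' "$line" | sed -E -e 's%^[^=]*= ([^/]*) /.*$%\1%')

# [^ /]* form: same, but the remembered part may not contain spaces.
strict=$(printf '%s\n' "$line" | sed -E -e 's%^[^=]*= ([^ /]*) /.*$%\1%')

# The pattern does not rely on a second "=" being present.
one_eq=$(printf '%s\n' "$single" | sed -E -e 's%^[^=]*= ([^/]*) /.*$%\1%')

printf '%s %s %s\n' "$general" "$strict" "$one_eq"
```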
I want to replace all slashes "/" between alphanumeric with backslash+slash "\/" apart from the last one on each string, e.g.
nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/mno\/pq/r
should become:
nocareNocare abc\/def\/ghi\/mno\/pq/r abc\/def\/ghi\/mno\/pq/r
I use:
sed 's/\(.*\)\([[:alnum:]]\)\/\([[:alnum:]]\)\(\S*\)\(\\\|\/\)/\1\2\\\/\3\4\//g'
Short explanation: match
any string + alnum + / + any non-white + / or \
But it only replaces one case, so I need to run it 3 times to replace all 3 occurrences. It looks like the first time it matches all the way to:
>nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/
instead of
>nocareNocare abc\/def/
sed -e :a -e 's|\([a-z0-9]\)/\([a-z0-9][^ ]*[a-z0-9]/[a-z0-9]\)|\1\\/\2|;ta' filename
Loosely translated, this says "replace a lone slash followed by some other stuff in the string, followed by another lone slash, with backslash-slash and that same stuff (and the second slash). And after making such a replacement, start over again."
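To see the loop in action, here is a minimal sketch that applies the command above to the sample string from the question. Input is read from a pipe rather than from a file, which is the only change.

```shell
# Sample string from the question; single quotes keep the backslashes literal.
s='nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/mno\/pq/r'

# Each pass escapes the leftmost lone slash that still has another lone slash
# after it in the same word; "ta" repeats until no such slash remains.
out=$(printf '%s\n' "$s" |
  sed -e :a -e 's|\([a-z0-9]\)/\([a-z0-9][^ ]*[a-z0-9]/[a-z0-9]\)|\1\\/\2|;ta')

printf '%s\n' "$out"
```

Note that [a-z0-9] only covers lowercase letters and digits; for input with uppercase characters, [[:alnum:]] would likely be the safer class.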
You can use a Perl command-line solution based on the following regex assertions:
(?<!\\)
not preceded by a backslash
(?!\w+\s)
not followed by word characters terminating in whitespace
perl -pe 's;(?<!\\)/(?!\w+\s);\\/;g' file
nocareNocare abc\/def\/ghi\/mno\/pq/r abc\/def\/ghi\/mno\/pq/r
With GNU sed:
sed -E 's:([^\])/:\1\\/:g;s:\\/([^\]*( |$)):/\1:g' file
Two s command here:
s:([^\])/:\1\\/:g replace all / not preceded by a \ with \/
s:\\/([^\]*( |$)):/\1:g replace last \/ before space or end of line with /
I have the following sed command:
sed -n '/^out(/{n;p}' ${filename} | sed -n '/format/ s/.*format=//g; s/),$//gp; s/))$//gp'
I tried to do it as one line as in:
sed -n '/^out(/{n;}; /format/ s/.*format=//g; s/),$//gp; s/))$//gp' ${filename}
But that also displays the lines I don't want (those that do not match).
What I have is a file with some strings as in:
entry(variable=value)),
format(variable=value)),
entry(variable=value)))
out(variable=value)),
format(variable=value)),
...
I just want the format lines that come right after the out entry, with the trailing )) or ), removed.
You can use this sed command:
sed -nr '/^out[(]/ {n ; s/.*[(]([^)]+)[)].*/\1/p}' your_file
Once an out line is found, it advances to the next line (n) and uses the s command with the p flag to extract only what is inside the parentheses.
Explanation:
I used [(] instead of \(. Outside brackets a ( usually means grouping; if you want a literal (, you need to escape it as \( or you can put it inside brackets. Most RE special characters don't need escaping when put inside brackets.
([^)]+) means a group (the "(" here are RE metacharacters not literal parenthesis) that consists of one or more (+) characters that are not (^) ) (literal closing parenthesis), the ^ inverts the character class [ ... ]
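Here is a small sketch of that command on made-up data shaped like the question's file. The values a=1 to e=5 are placeholders chosen so it is obvious which line was picked, and GNU sed is assumed for -r.

```shell
# Hypothetical input: only the last format line directly follows an out line.
data='entry(a=1)),
format(b=2)),
entry(c=3)))
out(d=4)),
format(e=5)),'

# On an out( line, read the next line and print only the text inside its parentheses.
picked=$(printf '%s\n' "$data" |
  sed -nr '/^out[(]/ {n ; s/.*[(]([^)]+)[)].*/\1/p}')

printf '%s\n' "$picked"
```

The command takes whatever line follows out( and does not itself check that the line starts with format.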
I have a text file containing:
+
Code here
+
Code here +
Code+ here
+
Code here
I want to grep this file and show only the lines that contain nothing but a + character. What regex should I construct?
Please advise
Many thanks
If I understand correctly, you want the line with only a +, no whitespace. To do that,
use ^ and $ to match the beginning and end of the line, respectively.
grep '^+$' filename
If you want a line with nothing but the + character, use:
grep '^+$' inputFile
This uses the start and end markers to ensure it only has that one character.
However, if you want lines with only a + character, possibly surrounded by spaces (as seems to be indicated by your title), you would use something like:
grep '^ *+ *$' inputFile
The sections are:
"^", start of line marker.
" *", zero or more spaces.
"+", your plus sign.
" *", zero or more spaces.
"$", end of line marker.
The following transcript shows this in action:
pax> echo ' +
Code here
+
Code here +
Code+ here
+
Code here' | grep '^ *+ *$' | sed -e 's/^/OUTPUT:>/' -e 's/$/</'
OUTPUT:> +<
OUTPUT:> +<
OUTPUT:> + <
The input data has been slightly modified, and a sed filter has been added, to show how it handles spacing on either side.
And if you want general white space rather than just spaces, you can change the space terms from the current " *" into "\s*" for example.
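A minimal sketch of the difference between the two forms, using hypothetical input with a tab before one of the plus signs. It uses the POSIX class [[:space:]] in place of \s, which is an assumption made for portability; \s is a GNU extension.

```shell
# Seven lines: "+", a tab-indented "+", and "+ " are the candidates.
sample=$(printf '+\nCode here\n\t+\nCode here +\nCode+ here\n+ \nCode here')

# Spaces only: the tab-indented line is not counted.
count_spaces=$(printf '%s\n' "$sample" | grep -c '^ *+ *$')

# Any whitespace: the tab-indented line is counted as well.
count_any=$(printf '%s\n' "$sample" | grep -c '^[[:space:]]*+[[:space:]]*$')

printf 'spaces only: %s, any whitespace: %s\n' "$count_spaces" "$count_any"
```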
Before overwriting I have copied /boot/grub/menu.lst to /home/san. I am running sed on the /home/san/menu.lst just for testing.
How can I overwrite
default 0
to
default 1
with the help of sed? I used the following commands but none worked. It's most probably because I don't know how many spaces there are between "default" and "0". I thought there were two spaces and one tab but I was wrong.
sed -e 's/default \t0/default \t1/' /home/san/menu.lst
sed -e 's/default\t0/default\t1/' /home/san/menu.lst
sed -e 's/default \t0/default \t1/' /home/san/menu.lst
sed -e 's/default 0/default 1/' /home/san/menu.lst
I actually want to write a script that may see if 'default 0' is written in menu.lst then replace it with 'default 1' and if 'default 1' is written then replace it with 'default 2'.
Thanks
Update:
How can I use a conditional statement to see if the line starting with 'default' has "0" or "1" written after it? If "0", replace it with "1", and if "1", replace it with "0"?
This works for me (with another file, of course ;)
sed -re 's/^default([ \t]+)0$/default\11/' /home/san/menu.lst
How it works:
Passing -r to sed allows us to use extended regular expressions, thus no need to escape the parentheses and plus sign.
[ \t]+ matches one or more tabs or spaces, in any order.
Putting parentheses around this, that is ([ \t]+), turns this into a group, to which we can refer.
This is the only group in this case, thus \1. That is what happens in the replacement string.
We don't want to replace default 0 as part of a larger string. Thus we use ^ and $, which match the start and end of the line, respectively.
Note that:
Using a group is not strictly necessary. You can also opt to replace all those tabs and spaces by a single tab or space. In that case just omit the parentheses and replace \1 with a space:
sed -re 's/^default[ \t]+0$/default 1/' /home/san/menu.lst
This will fail if there is trailing whitespace after default 0. If that is a concern, then you can match [ \t]* (zero or more tabs and spaces) just before the EOL:
sed -re 's/^default([ \t]+)0[ \t]*$/default\11/' /home/san/menu.lst
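The whitespace handling can be tried without touching a real menu.lst. This sketch uses a hypothetical three-line snippet with a mix of spaces and a tab before the 0 and trailing spaces after it; GNU sed is assumed for -r and for \t inside brackets.

```shell
# Hypothetical menu.lst fragment: "default", space, tab, space, "0", two trailing spaces.
menu=$(printf 'timeout 5\ndefault \t 0  \ntitle Example')

# Group 1 keeps the original separator; trailing whitespace is matched and dropped.
updated=$(printf '%s\n' "$menu" |
  sed -re 's/^default([ \t]+)0[ \t]*$/default\11/')

printf '%s\n' "$updated"
```

In the replacement, \11 is read as backreference \1 followed by a literal 1, because sed backreferences are single digits.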
Firstly, you could use several lines like this:
sed -re 's/default([ \t]*)0/default\11/' /home/san/menu.lst
However, you might be better off with one line like this:
awk '/^default/ { if ($2 == 1) print "default 0" ; else print "default 1" } !/^default/ {print}' /boot/grub/menu.lst
This substitutes 1 for 0, 0 for 1:
sed -r '/^default[ \t]+[01]$/{s/0/1/;t;s/1/0/}'
This substitutes 1 for 0, 2 for 1, 0 for 2:
sed -r '/^default[ \t]+[012]$/{s/0/1/;t;s/1/2/;t;s/2/0/}'
Briefly:
/^default[ \t]+[01]$/ uses addressing to select lines that consist only of "default 0" or "default 1" (or "default 2" in the second form) with any non-zero amount of spaces and/or tabs between "default" and the number
s/0/1/ - substitute a "1" for a "0"
t - branch to the end if a substitution was made, if not then continue with the next command
followed by more substitution(s) (and branches)
Edit:
Here's another way to do it:
sed -r '/^default[ \t]+[012]$/y/012/120/'
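Both three-way forms can be compared side by side. In this sketch, cycle and cycle_y are hypothetical helper names that wrap the two commands above and read a single line from their first argument; GNU sed is assumed.

```shell
# Hypothetical wrappers around the two commands shown above.
cycle() {
  printf '%s\n' "$1" |
    sed -r '/^default[ \t]+[012]$/{s/0/1/;t;s/1/2/;t;s/2/0/}'
}
cycle_y() {
  printf '%s\n' "$1" | sed -r '/^default[ \t]+[012]$/y/012/120/'
}

# Walk through 0 -> 1 -> 2 -> 0 with both variants.
for n in 0 1 2; do
  printf '%s -> %s | %s\n' "default $n" "$(cycle "default $n")" "$(cycle_y "default $n")"
done

# A line that does not end in a single 0, 1 or 2 is left alone.
cycle 'default 10'
```

The t after each substitution matters in the first form: without it, a 0 that was just turned into 1 would be turned into 2 by the next s command.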