Remove double quotes within the column value using Unix - linux

I am processing a 90-column CSV file that is semicolon-separated (;). Case can be ignored. I am aware the file standard is a mess, but I have no control over it.
Input Rows :
"AAAAA";"ABABDBDA";"ASDASDA"asads";"123";"456"
"AAAAA";"ABABDBDA";"12322AAasd"asads";"123";"456"
"Lmnop";"asdasads";"mer";"123;2343;asa"dwd";"456"
Output Expected :
"AAAAA";"ABABDBDA";"ASDASDA asads";"123";"456"
"AAAAA";"ABABDBDA";"12322AAasd asads";"123";"456"
"Lmnop";"asdasads";"mer";"123;2343;asa dwd";"456"
(The double quote can be replaced by a space or removed.) Kindly note: even though this is a ';'-separated file, some rows have ';' within the quoted data for a column.
Issue: in some rows I am getting an extra double quote within the quoted data.
Please advise me on how to handle this in Unix.

One trick you can use is to remove every " that is not at a field boundary. A simple sed script can be
$ sed -E 's/([^;])"([^;])/\1 \2/g' file
Note that if you allow escaped quote marks in your fields, this is going to remove them as well.
Also note the example from the comments, which is not covered by one round of the sed. Because matches cannot overlap, a single character can't be the context for two matches, so "a"b"c"; won't be handled correctly.
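If stray quotes can sit that close together, one way to cover them is to repeat the same substitution until nothing changes. This is only a sketch: the function name fix_inner_quotes is made up for illustration, and it assumes a sed that accepts -E.

```shell
# Hypothetical helper: repeats the single substitution until no stray quote is left.
fix_inner_quotes() {
  sed -E -e ':a' -e 's/([^;])"([^;])/\1 \2/' -e 'ta'
}

# Example with two stray quotes separated by a single character.
printf '%s\n' '"a"b"c";"d"' | fix_inner_quotes
```

Like the one-pass version, this also removes escaped quotes inside fields.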

What would you think of the following solution:
Replace all ";" by ;
Remove all remaining "
Replace all ; back into ";"
Add additional " characters, at the beginning and at the end of every line.
The whole thing can be done with tr or sed or whatever command you prefer.
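The steps above could be sketched with sed as follows. One deviation, stated plainly: instead of collapsing ";" to a bare ;, the sketch uses a placeholder byte (\001), because a bare ; would collide with the semicolons inside quoted data in the third sample row. The placeholder and the name fix_quotes are assumptions made for illustration, and stray quotes become spaces rather than being deleted.

```shell
# Placeholder byte, assumed never to occur in the data (an assumption of this sketch).
ph=$(printf '\001')

# Hypothetical helper implementing the four steps on standard input.
fix_quotes() {
  sed -e "s/\";\"/$ph/g" \
      -e 's/"/ /g' \
      -e "s/$ph/\";\"/g" \
      -e 's/^ /"/' -e 's/ $/"/'
}

printf '%s\n' '"Lmnop";"asdasads";"mer";"123;2343;asa"dwd";"456"' | fix_quotes
```

The last two expressions restore the outer quotes, which step two turned into spaces.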

mawk 'NF*(gsub(__," ",$!(NF=NF))^_ +gsub(OFS,FS) +gsub("^ | $",__))' \
__='\42' FS='\442\73\42' OFS='\31\17'
"AAAAA";"ABABDBDA";"ASDASDA asads";"123";"456"
"AAAAA";"ABABDBDA";"12322AAasd asads";"123";"456"
"Lmnop";"asdasads";"mer";"123;2343;asa dwd";"456"

This transformation is easy to do with a tool that provides regular expressions with zero-length assertions (lookbehind and lookahead). As you applied the unix tag, there is a good chance you have the perl command, so I propose the following solution. Let file.txt content be
"AAAAA";"ABABDBDA";"ASDASDA"asads";"123";"456"
"AAAAA";"ABABDBDA";"12322AAasd"asads";"123";"456"
"Lmnop";"asdasads";"mer";"123;2343;asa"dwd";"456"
then
perl -p -e 's/(?<=[[:alnum:]])"(?=[[:alnum:]])/ /g' file.txt
gives output
"AAAAA";"ABABDBDA";"ASDASDA asads";"123";"456"
"AAAAA";"ABABDBDA";"12322AAasd asads";"123";"456"
"Lmnop";"asdasads";"mer";"123;2343;asa dwd";"456"
Explanation: I tell perl that I want to use it sed-style via -p -e, then I provide a substitution (s): a " that comes after an alphanumeric character (letter or digit) and before an alphanumeric character should be replaced by a space. This is applied to all such " characters, that is, globally (g).
Note: you can port this answer to any other tool that supports regular expressions with zero-length assertions.
(tested in perl 5, version 26, subversion 3)

When you consider the combination ";" as a delimiter, you can use
awk -F '";"' '{
  printf "\"";
  for (i=1;i<NF;i++) {
    gsub("\"","", $i);
    printf("%s\";\"",$i)
  };
  print $NF
}' inputfile

This might work for you (GNU sed):
sed -E ':a;s/^(("[^"]*";)*"[^"]*)"([^;])/\1 \3/;ta' file
Iterate starting from the start of the line, match zero or more correctly double quoted fields followed by an incorrect double quote and replace that double quote by a space.

Related

How to use sed on a special character if it comes inside double quotes in a Linux file

I have a txt file delimited by commas (,), with each column enclosed in double quotes.
What I want to do is:
I need to keep the comma as the delimiter, but I want to remove every comma that occurs inside a pair of double quotes (each column is surrounded by double quotes).
Here is a sample of the input and the output file I want.
input file :
"2022111812160156601777153","","","false","test1",**"here the , issue , that comma comma come inside the column"**
the output as I want :
"2022111812160156601777153","","","false","test1",**"here the issue that comma comma come inside the column"**
what I try :
sed -i ':a' -e 's/\("[^"]*\),\([^"]*"\)/\1~\2/;ta' test.txt
but the above sed command replaces all commas, not only the commas that occur inside a column.
Is there a way to do it?
Using sed
$ sed -Ei.bak ':a;s/((^|,)(\*+)?"[^"]*),/\1/;ta' input_file
"2022111812160156601777153","","","false","test1",**"here the issue that comma comma come inside the column"**
Any time you find yourself using more than s, g, and p (with -n) in sed you'd be better off using awk for some combination of clarity, robustness, efficiency, portability, etc.
Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS="\""} {for (i=2; i<=NF; i+=2) gsub(/,/,"",$i)} 1' file
"2022111812160156601777153","","","false","test1",**"here the issue that comma comma come inside the column"**
Just like GNU sed has -i as in your question to update the input file with the command's output, GNU awk has -i inplace, or just add > tmp && mv tmp file with any awk or any other Unix command.
This might work for you (GNU sed):
sed -E ':a;s/^(("[^",]*"\**,?\**)*"[^",]*),/\1/;ta' file
This iterates through each line removing any commas within paired double quoted fields.
N.B. The solution above also caters for double quoted fields prefixed/suffixed by zero or more *'s. If this should not be catered for, here is a simplified solution:
sed -E ':a;s/^(("[^",]*",?)*"[^",]*),/\1/;ta' file
N.B. Escaped double quotes and commas would need a more involved regexp.

How to extract and replace columns with a multi-character delimiter?

I got a file with ^$ as delimiter, the text is like :
tony^$36^$developer^$20210310^$CA
I want to replace the datetime.
I tried awk -F '\^\$' '{print $4}' file.txt | sed -i '/20210310/20221210/' , but it returns nothing. Then I tried the awk part alone, and it also returns nothing. I guess it still treats the line as a whole and the delimiter doesn't work. Why is that, and how can I solve it?
A simple solution would be:
sed 's/\^\$/\n/g; s/20210310/20221210/g' -i file.txt
which will modify the file to separate each section to a new line.
If you need a different delimiter, change the \n in the command to maybe space or , .. up to you.
And it will also replace the date in the file.
If you want to see the changes without really modifying the file, remove the -i from the command.
When I run your awk command, I get these warnings:
awk: warning: escape sequence `\^' treated as plain `^'
awk: warning: escape sequence `\$' treated as plain `$'
That explains why your output is blank: the field delimiter is interpreted as the regular expression '^$', which matches a completely blank line (only). As a result, each non-blank line of input is without any field separators, and therefore has only a single field. $4 can be non-empty only if there are at least four fields.
You can fix that by escaping the backslashes:
awk -F '\\^\\$' '{print $4}' file.txt
If all you want to do is print the modified datecodes by themselves, then that should get you going. However, the question ...
How to extract and replace columns with a multi-character delimiter?
... sounds like you may want actually to replace the datecode within each line, keeping the rest intact. In that case, it is a non-starter for the awk command to discard the other parts of the line. You have several options here, but two of the more likely would be
instead of sending field 4 out to sed for substitution, do the sub in the awk script, and then reconstitute the input line by printing all fields, with the expected delimiters. (This is left as an exercise.) OR
do the whole thing in sed:
sed -E 's/^((([^^]|\^[^$])*\^\$){3})20210310(\^\$.*)/\120221210\4/' file.txt
If you wanted to modify file.txt in-place then you could add the -i flag (which, on the other hand, is not useful in your original command, where sed's input is coming from a pipe rather than a file).
The -E option engages the POSIX extended regex dialect, which allows the given regex to be more readable (the alternative would require a bunch more \ characters).
Overall, presuming that there are five or more fields delimited by literal '^$' strings, and the fourth contains exactly "20210310", that matches the first three fields, including their trailing delimiters, and captures them all as group 1; matches the leading delimiter of the fifth field and all the remainder of the line and captures it as group 4; and replaces the whole line with group 1 followed by the new datecode followed by group 4.
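For the first option (doing the substitution inside awk and reconstituting the line), a minimal sketch could look like this. The function name replace_date is made up for illustration, and it assumes the fourth field holds exactly the old datecode.

```shell
# Hypothetical helper: splits on the literal ^$ and rejoins with the same string.
replace_date() {
  awk 'BEGIN { FS = "\\^\\$"; OFS = "^$" }   # FS is a regex, OFS is plain text
       $4 == "20210310" { $4 = "20221210" }  # assigning a field rebuilds $0 with OFS
       { print }'
}

printf '%s\n' 'tony^$36^$developer^$20210310^$CA' | replace_date
```

Because FS is set inside the awk program, the backslashes are doubled in the string, as in the -F fix shown earlier.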

How to correctly detect and replace apostrophe (') with sed?

I'm having a directory with many files having special characters and spaces. I want to perform an operation with all these files so I'm trying to store all filenames in a list.txt and then run the command with this list.
The special characters in my list are & []'.
So basically I want to use sed to replace each occurrence with \ + the character in question.
E.g. : filename .txt => filename\ .txt etc...
The thing is I have trouble handling apostrophes.
Here is my command as of now :
ls | sed 's/\ /\\ /g' | sed 's/\&/\\&/g' | sed "s/\'/\\'/g" | sed 's/\[/\\[/g' | sed 's/\]/\\]/g'
At first I had issues with, I believe, the apostrophes in the string command in conflict with the apostrophes surrounding the string. So I used double quotes instead, but it still doesn't work.
I've tried all these and nothing worked :
sed "s/\'/\\'/g" (escaping the apostrophe)
sed "s/'/\'/g" (escaping nothing)
sed "s/'/\\'/g" (escaping the backslash)
sed 's/"'"/\"'"/g' (double quoting single quote)
As a disclaimer, I must say I'm completely new to sed. I just ran my first sed command today, so maybe I'm doing something wrong that I didn't realize.
PS : I've seen these threads, but no answer worked for me :
https://unix.stackexchange.com/questions/157076/how-to-remove-the-apostrophe-and-delete-the-space
How to replace to apostrophe ' inside a file using SED
This may do:
cat file
avbadf
test&rr
more [ yes
this ]
and'df
sed -r 's/(\x27|&|\[|\])/\\\1/g' file
avbadf
test\&rr
more \[ yes
this \]
and\'df
\x27 is equal to single quote '
\x22 is equal to double quote "
Whoops, I found the answer to my question. Here is the working input :
sed "s/'/\\\'/g"
This will effectively replace any ' with \'.
However I'm having trouble understanding exactly what's happening here.
So if I understand correctly, we are escaping the backslash and the apostrophe in the replacement string. Now, if somebody could answer some of these, I would be grateful :
Why don't we need to escape the first quote (the one in the pattern to find) ?
Why do we have to escape the backslash whereas for the other characters, there's no need ?
Why do we need to escape the second quote (the one in the replacement string) ?
I think all of your sed matches actually need that replacement pattern. This one seems to work for all examples:
ls | sed "s/\ /\\\ /g" | sed "s/\&/\\\&/g" | sed "s/\[/\\\[/g" | sed "s/\]/\\\]/g" | sed "s/'/\\\'/g"
So it is s/regex/replacement/flags, and 'regex' and 'replacement' have different sets of special characters.
The apostrophe needs no escape on the regex side because ' is not a special character in a sed regex. On the replacement side the backslashes are consumed in two stages: inside double quotes the shell turns \\ into a single \ and leaves \' as it is, so sed receives \\' and writes a literal backslash followed by '.
For example, \5 is a special character in the replacement expression, so to replace:
filename5.txt -> filename\5.txt
You would also need, as with apostrophe:
sed "s/5/\\\5/g"
This is not a quirk of sed's parser: the backslashes are unescaped twice, once by the shell inside double quotes and once by sed when it reads the replacement.
Please try the following:
sed 's/[][ &'\'']/\\&/g' file
Using the same example as @Jotne, the result will be:
avbadf
test\&rr
more\ \[\ yes
this\ \]
and\'df
[How it works]
The regex part in the sed s command above just defines a character
class of & []', which should be escaped with a backslash.
The right square bracket ] does not need escaping when put
immediately after the left square bracket [.
The confusing part is the handling of the single quote.
We cannot put a single quote within single quotes, even if we escape it.
The workaround is as follows: say we have an assignment str='aaabbb'.
To put a single quote between "aaa" and "bbb", we can write
str='aaa'\''bbb'.
It may look puzzling but it just concatenates the three sequences;
1) to close the single-quoted string as 'aaa'.
2) to put a single quote with an escaping backslash as \'.
3) to restart the single-quoted string as 'bbb'.
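The concatenation trick can be checked directly in the shell. The snippet below is only an illustration: escape_name is a made-up wrapper around the sed command above, and the sample file name is invented.

```shell
# Close the quoted string, add an escaped quote, reopen the quoted string.
str='aaa'\''bbb'
printf '%s\n' "$str"

# Hypothetical wrapper around the sed command from this answer.
escape_name() {
  sed 's/[][ &'\'']/\\&/g'
}

# Invented file name containing every character in the class.
printf '%s\n' "it's [a] b&c.txt" | escape_name
```
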
Hope this helps.

A good way to use sed to find and replace characters with 2 delimiters

I am trying to find and replace items using bash. I was able to use sed to grab out some of the characters, but I think I might be using it in the wrong manner.
I am basically trying to remove the characters after ";" and before "," including removing ","
sed -e 's/\(;\).*\(,\)/\1\2/'
That is what I used to replace it with nothing. However, it ends up replacing everything in the middle so my output came out like this:
cmd2="BMC,./socflash_x64 if=B600G3_BMC_V0207.ima;,reboot -f"
This is the original text of what I need to replace
cmd2="BMC,./socflash_x64 if=B600G3_BMC_V0207.ima;X,sleep 120;after_BMC,./run-after-bmc-update.sh;hba_fw,./hba_fw.sh;X,sleep 5;DB,2;X,reboot -f"
Is there any way to make it look like this output?
./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;reboot -f
If there is any way to make this happen other than bash, I am fine with any type of language.
Non-greedy search can (mostly) be simulated in programs that don't support it by replacing match-any (dot .) with a negated character class.
Your original command is
sed -e 's/\(;\).*\(,\)/\1\2/'
You want to match everything in between the semi-colon and the comma, but not another comma (non-greedy). Replace .* with [^,]*
sed -e 's/\(;\)[^,]*\(,\)/\1\2/'
You may also want to exclude semi-colons themselves, making the expression
sed -e 's/\(;\)[^,;]*\(,\)/\1\2/'
Note this would treat a string like "asdf;zxcv;1234,qwer" differently, since one would match ;zxcv;1234, and the other would match only ;1234,
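To get close to the output asked for, the negated class can be applied to every ;label, pair with the g flag, after first dropping the leading label. This is a sketch under assumptions: it drops the comma as well (the commands above keep it through \2), it expects the raw string without the cmd2="..." wrapper, and the name strip_labels is made up.

```shell
# Hypothetical helper: removes the leading label, then every label after a semicolon.
strip_labels() {
  sed -e 's/^[^,]*,//' -e 's/;[^,;]*,/;/g'
}

printf '%s\n' 'BMC,./socflash_x64 if=B600G3_BMC_V0207.ima;X,sleep 120;DB,2;X,reboot -f' | strip_labels
```

Note that the DB,2 entry is kept as 2, so it would still need separate handling if it has to disappear.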
In perl:
perl -pe 's/;.*?,/;/g;' -pe 's/^[^,]*,//' foo.txt
will output:
./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;2;reboot -f
The .*? is non greedy matching before the comma. The second command is to remove from the beginning to the comma.
Something like:
echo "$cmd2" | tr ';' '\n' | cut -d',' -f2- | tr '\n' ';' ; echo
result is:
./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;2;reboot -f;
However, I think your requirements are a bit more complex, because 'DB,2' seems to be a special case. After the first "tr" command, insert a "grep" or "grep -v" to include/exclude these cases.

Escape file name for use in sed substitution

How can I fix this:
abc="a/b/c"; echo porc | sed -r "s/^/$abc/"
sed: -e expression #1, char 7: unknown option to `s'
The substitution of variable $abc is done correctly, but the problem is that $abc contains slashes, which confuse sed. Can I somehow escape these slashes?
Note that sed(1) allows you to use different characters for your s/// delimiters:
$ abc="a/b/c"
$ echo porc | sed -r "s|^|$abc|"
a/b/cporc
$
Of course, if you go this route, you need to make sure that the delimiters you choose aren't used elsewhere in your input.
The GNU manual for sed states that "The / characters may be uniformly replaced by any other single character within any given s command."
Therefore, just use another character instead of /, for example ::
abc="a/b/c"; echo porc | sed -r "s:^:$abc:"
Do not use a character that can be found in your input. We can use : above, since we know that the input (a/b/c) doesn't contain :.
Be careful of character-escaping.
If using "", Bash will interpret some characters specially, e.g. ` (used for inline execution), ! (used for accessing Bash history), $ (used for accessing variables).
If using '', Bash will take all characters literally, even $.
The two approaches can be combined, depending on whether you need escaping or not, e.g.:
abc="a/b/c"; echo porc | sed 's!^!'"$abc"'!'
You don't have to use / as pattern and replace separator, as others already told you. I'd go with : as it is rather rarely used in paths (it's a separator in PATH environment variable). Stick to one and use shell built-in string replace features to make it bullet-proof, e.g. ${abc//:/\\:} (which means replace all : occurrences with \: in ${abc}) in case of : being the separator.
$ abc="a/b/c"; echo porc | sed -r "s:^:${abc//:/\\:}:"
a/b/cporc
Escape the slashes with backslashes:
abc='a\/b\/c'
As for the escaping part of the question I had the same issue and resolved with a double sed that can possibly be optimized.
escaped_abc=$(echo $abc | sed "s/\//\\\AAA\//g" | sed "s/AAA//g")
The triple A is used because otherwise the forward slash following its escaping backslash is never placed in the output, no matter how many backslashes you put in front of it.
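A single sed call also appears to be enough when the script is single-quoted, so that the shell leaves the backslashes alone. The sketch below escapes the characters that are special in a replacement when / is the delimiter (backslash, slash and &). The function name escape_replacement is made up for illustration.

```shell
# Hypothetical helper: prefixes \, / and & with a backslash.
escape_replacement() {
  printf '%s\n' "$1" | sed 's/[\/&]/\\&/g'
}

abc="a/b/c"
echo porc | sed "s/^/$(escape_replacement "$abc")/"
```

This does not handle newlines inside the value, which would need extra care.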

Resources