How can I escape all non-alphanumeric characters in AWK? - string

I inherited a very large AWK script that matches against .csv files, and I've found it does not match some alphanumeric characters, especially + ( ).
While I realize this would be easy in sed:
sed 's/\([^A-z0-9]\)/\\\1/g'
I can't seem to find a way to call on the matched character the same way in AWK.
For instance a sample input is:
select.awk 'Patient data +(B/U)'
I would like to escape the non-alphanumeric characters, and turn the line into:
Patient\ data\ \+\(B\/U\)
I have seen some people pass very obscure non-alphanumeric characters as well, which I would like to escape.

gsub(/[^[:alnum:]]/, "\\\\&", arg)

the gnu variant has more feature,
awk '{n=gensub(/[^[:alnum:]]/,"\\\\&","g"); print n}' d.csv

Related

how to transpose values two by two using shell?

I have my data in a file store by lines like this :
3.172704445659,50.011996744997,3.1821975358417,50.012335988197,3.2174797791605,50.023182479597
And I would like 2 columns :
3.172704445659 50.011996744997
3.1821975358417 50.012335988197
3.2174797791605 50.023182479597
I know sed command for delete ','(sed "s/,/ /") but I don't know how to "back to line" every two digits ?
Do you have any ideas ?
One in awk:
$ awk -F, '{for(i=1;i<=NF;i++)printf "%s%s",$i,(i%2&&i!=NF?OFS:ORS)}' file
Output:
3.172704445659 50.011996744997
3.1821975358417 50.012335988197
3.2174797791605 50.023182479597
Solution viable for those without knowledge of awk command - simple for loop over an array of numbers.
IFS=',' read -ra NUMBERS < file
NUMBERS_ON_LINE=2
INDEX=0
for NUMBER in "${NUMBERS[#]}"; do
if (($INDEX==$NUMBERS_ON_LINE-1)); then
INDEX=0
echo "$NUMBER"
else
((INDEX++))
echo -n "$NUMBER "
fi
done
Since you already tried sed, here is a solution using sed:
sed -r "s/(([^,]*,){2})/\1\n/g; s/,\n/\n/g" YOURFILE
-r uses sed extended regexp
there are two substitutions used:
the first substitution, with the (([^,]*,){2}) part, captures two comma separated numbers at once and store them into \1 for reuse: \1 holds in your example at the first match: 3.172704445659,50.011996744997,. Notice: both commas are present.
(([^,]*,){2}) means capture a sequence consisting of NOT comma - that is the [^,]* part followed by a ,
we want two such sequences - that is the (...){2} part
and we want to capture it for reuse in \1 - that is the outer pair of parentheses
then substitute with \1\n - that just inserts the newline after the match, in other words a newline after each second comma
as we have now a comma before the newline that we need to get rid of, we do a second substitution to achieve that:
s/,\n/\n/g
a comma followed by newline is replace with only newline - in other words the comma is deleted
awk and sed are powerful tools, and in fact constitute programming languages in their own right. So, they can, of course, handle this task with ease.
But so can bash, which will have the benefits of being more portable (no outside dependencies), and executing faster (as it uses only built-in functions):
IFS=$', \n'
values=($(</path/to/file))
printf '%.13f %.13f\n' "${values[#]}"

How can I remove a newline (\n) at the end of a string?

The problem
I have multiple property lines in a single string separated by \n like this:
LINES2="Abc1.def=$SOME_VAR\nAbc2.def=SOMETHING_ELSE\n"$LINES
The LINES variable
might contain an undefined set of characters
may be empty. If it is empty, I want to avoid the trailing \n.
I am open for any command line utility (sed, tr, awk, ... you name it).
Tryings
I tried this to no avail
sed -z 's/\\n$//g' <<< $LINES2
I also had no luck with tr, since it does not accept regex.
Idea
There might be an approach to convert the \n to something else. But since $LINES can contain arbitrary characters, this might be dangerous.
Sources
I skim read through the following questions
How can I replace a newline (\n) using sed?
sed with literal string--not input file
Here's one solution:
LINES2="Abc1.def=$SOME_VAR"$'\n'"Abc2.def=SOMETHING_ELSE${LINES:+$'\n'$LINES}"
The syntax ${name:+value} means "insert value if the variable name exists and is not empty." So in this case, it inserts a newline followed by $LINES if $LINES is not empty, which seems to be precisely what you want.
I use $'\n' because "\n" is not a newline character. A more readable solution would be to define a shell variable whose value is a single newline.
It is not necessary to quote strings in shell assignment statements, since the right-hand side of an assignment does not undergo word-splitting nor glob expansion. Not quoting would make it easier to interpolate a $'\n'.
It is not usually advisable to use UPPER-CASE for shell variables because the shell and the OS use upper-case names for their own purposes. Your local variables should normally be lower case names.
So if I were not basing the answer on the command in the question, I would have written:
lines2=Abc1.def=$someVar$'\n'Abc2.def=SOMETHING_ELSE${lines:+$'\n'$lines}

Replace control characters and spaces with escape sequences

I want to replace control characters (ASCII 0-31) and spaces (ASCII 32) with hex escape codes. For example:
$ escape 'label=My Disc'
label=My\x20Disc
$ escape $'multi\nline\ttabbed string'
multi\x0Aline\x09tabbed\x20string
$ escape '\'
\\
For context, I'm writing a script which statuses a DVD drive. Its output is designed to be parsed by another program. My idea is to print each piece of info as a separate space-separated word. For example:
$ ./discStatus --monitor
/dev/dvd: no-disc
/dev/dvd: disc blank writable size=0 capacity=2015385600
/dev/dvd: disc not-blank not-writable size=2015385600 capacity=2015385600
I want to add the disc's label to this output. To fit with the parsing scheme I need to escape spaces and newlines. I might as well do all the other control characters as well.
I'd prefer to stick to bash, sed, awk, tr, etc., if possible. I can't think of a really elegant way to do this with those tools, though. I'm willing to use perl or python if there's no good solution with basic shell constructs and tools.
Here's a Perl one-liner I came up with. It uses /e to run code in the replacements.
perl -pe 's/([\x00-\x20\\])/sprintf("\\x%02X", ord($1))/eg'
A slight deviation from the example in my question: it emits \x5C for backslashes instead of \\.
I would use a higher-level language. There are three different types of replacement going on (single character to multicharacter for the control characters and space, identity for other printable characters, and the special case of doubling the backslash), which I think is too much for awk, sed, and the like to handle simply.
Here's my approach for Python
def translate(c):
cp = ord(c)
if cp in range(33):
return '\\x%02x'%(cp,)
elif c == '\\':
return r'\\'
else:
return c
if __name__ == '__main__':
import sys
print ''.join( map(translate, sys.argv[1]) )
If speed is a concern, you can replace the translate function with a prebuilt dictionary mapping each character to its desired string representation.
Wow, it looks like a fairly trivial sed script along the lines of
's|\n|\\n|' for each character you want to substitute.

linux bash replace placeholder with unknown text which can contain any characters

If I want to replace for example the placeholder {{VALUE}} with another string which can contain any characters, what's the best way to do it?
Using sed s/{{VALUE}}/$(value)/g might fail if $(value) contains a slash...
oldValue='{{VALUE}}'
newValue='new/value'
echo "${var//$oldValue/$newValue}"
but oldValue is not a regexp but works like a glob pattern, otherwise :
echo "$var" | sed 's/{{VALUE}}/'"${newValue//\//\/}"'/g'
Sed also works like 's|something|someotherthing|g' (or with other delimiters for that matter), but if you can't control the input string, you'll have to use some function to escape it before passing it to sed..
The question asked basically duplicates How can I escape forward slashes in a user input variable in bash?, Escape a string for sed search pattern, Using sed in a makefile; how to escape variables?, Use slashes in sed replace, and many other questions. “Use a different delimiter” is the usual answer. Pianosaurus's answer and Ben Blank's answer list characters (backslash and ampersand) that need to be escaped in the shell, besides whatever character is used as an alternate delimiter. However, they don't address the quoting-a-quote problem that will occur if your “string which can contain any characters” contains a double quote. The same kind of problem can affect the ${parameter/pattern/string} shell variable expansion mentioned in a previous answer.
Some other questions besides the few mentioned above suggest using awk, and that is usually a good approach to changes that are more complicated than are easy to do with sed. Also consider perl and python. Besides single- and double-quoted strings, python has u'...' unicode quoting, r'...' raw quoting,ur'...' quoting, and triple quoting with ''' or """ delimiters. The question as stated doesn't provide enough context for specific awk/perl/python solutions.

How to tell sed "do not remove some characters"?

I have a text file containing Arabic characters and some other characters (punctuation marks, numbers, English characters, ... ).
How can I tell sed to remove all the characters in the file, except Arabic ones? In short I can say that we typically tell sed to remove/replace some specific characters and print others, but now I am looking for a way to tell sed just print my desired characters, and remove all other characters.
With GNU sed, you should be able to specify characters by their hex code. You can use those in a a character class:
sed 's/[\x00-\x7F]//g' # hex notation
sed 's/[\o000-\o177]//g' # octal notation
You should also be able to achieve the same effect with the tr command:
tr -d '[\000-\177]'
Both methods assume UTF8 encoding of your input file. Multi-byte characters have their highest bit set, so you can simply strip everything that's a standard ASCII (7 bits) character.
To keep everything except some well defined characters, use a negative character classe:
sed 's/[^characters you want to keep]//g'
Using a pattern alike to [^…]\+ might improve performance of the regex.

Resources