Escape dollar sign in dollar-quoted strings query - cassandra

Hello, I am working on a Cassandra query that has an explanation field of data type text. I am using dollar-quoted strings to escape special characters, but I run into a problem when the string in my explanation field ends with a dollar sign.
For example:
INSERT INTO Users (name, explanation) VALUES ($$Tom$$, $$Some'text$$$);
The last two dollar signs are the closing quote, but the third-to-last one is part of the explanation. How can I escape it? Or is there another way to escape all special characters, including the dollar sign?
Thanks in advance

Looking at the grammar, it appears that the lexer simply searches for the next occurrence of $$ and doesn't distinguish between double and triple dollar signs. If any character follows the $, the string is handled correctly (for example, the string $$dwewdewe'adqdq$'$$ works just fine); the $ just must not be the last character before the closing $$. If you want to insert a string containing a ' character, you can escape it (see doc) with another ' character (for example, 'this is string with '' inside' works fine and produces this is string with ' inside, as expected).
That covers writing a one-off statement. But if you're inserting the data from a program, it's better to use prepared statements, which should be supported by all existing drivers. With prepared statements you don't need to worry about escaping: that is the driver's job (in fact, no escaping happens, as the string is sent as-is, not as part of the statement). Besides removing the need for escaping, you should also get better performance, because the statement is parsed only once, when it's prepared, and after that only the statement ID plus parameters are sent to the Cassandra node.
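For the one-off case, the quote-doubling rule can be sketched in Python as below. The helper name cql_text_literal is hypothetical (it is not part of any driver), and it covers only the single-quote rule mentioned in this answer; for application code, prepared statements remain the better choice.

```python
def cql_text_literal(value: str) -> str:
    """Wrap value in single quotes, doubling every embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


# The value from the question: it contains a quote and ends with a dollar sign.
explanation = "Some'text$"
statement = (
    "INSERT INTO Users (name, explanation) VALUES "
    f"({cql_text_literal('Tom')}, {cql_text_literal(explanation)});"
)
print(statement)
```

Because the literal is delimited by single quotes rather than $$, a trailing dollar sign needs no special treatment here.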

Related

Regex remove both apostrophes if they exist in Python

I’m quite new to Regex but almost finished with my text mining script. Only one thing fails: I’m trying to remove the apostrophes around a word if they exist. I’m using re.sub for this.
For instance:
‘Apple’ needs to be Apple
‘apple’ needs to be apple
‘[apple]’ needs to be [apple]
‘(apple)’ needs to be (apple)
However: Apple’s needs to stay Apple’s because there is only one apostrophe.
How do I select both apostrophes when there is a word in between so I can delete them with re.sub? In every try I remove the entire string! Hopefully someone can help.
My code is as follows:
str_o='\'Apple\''
str_o_a = re.sub(r"\'(.*?)\'","", str_o)
I have a simpler idea: split by whitespace, trim leading and trailing apostrophes, join with whitespace. Avoids having to write a regular expression and handles sentences such as "She's 'her' mother's daughter".
text = "She's 'her' mother's daughter"
text = ' '.join([word.strip("'") for word in text.split()])
print(text)
# She's her mother's daughter
The purpose of the parentheses in your regular expression was probably to capture the string you want to keep. The idiom looks like
str_o_a = re.sub(r"'([^']*)'", r"\1", str_o)
You want a raw string around the replacement, too, in order to preserve the backslash in the argument (otherwise you would be replacing with the literal string "\x01").
Notice also the preference for using a negated character class over a non-greedy "match anything" wildcard.
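To compare the two suggestions on the inputs from the question, here is a small sketch. The function names are made up for illustration, and straight apostrophes are assumed, as in the question's code.

```python
import re


def strip_quote_pairs(text: str) -> str:
    """Remove a pair of apostrophes and keep what was between them."""
    return re.sub(r"'([^']*)'", r"\1", text)


def strip_per_word(text: str) -> str:
    """Trim leading and trailing apostrophes from each whitespace-separated word."""
    return " ".join(word.strip("'") for word in text.split())


for sample in ["'Apple'", "'[apple]'", "'(apple)'", "Apple's"]:
    print(sample, "->", strip_quote_pairs(sample))

sentence = "She's 'her' mother's daughter"
print(strip_quote_pairs(sentence))  # Shes her mothers daughter
print(strip_per_word(sentence))     # She's her mother's daughter
```

On single words both behave the same. On a full sentence the regex pairs up unrelated apostrophes, which is the case the split-and-strip version handles.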

How to split strings separated by commas with escapes?

I have a string that looks like this:
(The whole code block is the string, i.e. the string itself contains quotation marks.)
"he\"llo", "world\n", "fro,m"
[update] That is, the "actual" string is this:
"\"he\\\"llo\", \"world\\n\", \"fro,m\""
I want to get an array of strings like this:
[ "\"he\\\"llo\"", "\"world\\n\"", "\"fro,m\"" ]
[update] Commas inside quotation marks should be preserved.
In my opinion, there are several ways to solve this:
build an automaton (DFA or NFA) for this syntax
use several status flags like inQuote and handle the logic with lots of if/else
write a complex but clever regular expression for this
Are there any general solutions to this problem? Or how should I actually go about it using the ideas above?
P.S. It would be even better if syntax errors such as "unclosed quotation mark" could be detected.
You need to first define your grammar. This is a simple grammar for your case:
document = *WS [string *WS *(',' *WS string *WS)]
string = %x22 *char %x22
char = %x20-21 / %x23-5B / escape / %x5D-10FFFF
escape = %x5C (%x5C / %x22 / 't' / 'n' / 'r')
WS = %x9 / %x20
You can read it as:
A document may begin/end with a white space, then may have one or more strings separated by commas. Before and after each comma there may be some white space.
A string is made of characters and begins and ends with double quotes Unicode/ASCII hex code 22.
Each character (char) may be: 1) any non-control Unicode character before the double quote, i.e. hex 20 (space) or hex 21 (exclamation mark); 2) any character after the double quote and before the backslash \ (hex 5C); 3) an escape sequence; 4) any other Unicode character after the backslash (hex 5C).
The escape sequence (rule escape) begins with the backslash \ and is followed by another backslash, a double quote, or one of the characters t for tab, n for line feed and r for carriage return. You may add other escapable characters if you want; for the C++ string syntax, see https://en.cppreference.com/w/cpp/language/escape .
White space (WS) is a tab or a space; you may also add %xA and %xD for line feed and carriage return respectively.
By the use of this grammar you will get this tree for your input:
The screenshot is from the Tunnel Grammar Studio online laboratory, which can run ABNF grammars (such as the one above) and which I work on.
After you have the grammar, you may use tools to generate a parser, or you can write one yourself. If you want to do it by hand (preferable for such a small and simple grammar), you may have one function per grammar rule that reads one character and checks whether it is the expected one. If your input ends while you are parsing the string rule, then you have an input with a started but unfinished string.
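A hand-written parser along those lines might look like the following sketch, with roughly one function per grammar rule. The names ParseError, skip_ws, parse_string and parse_document are hypothetical, and the tokens are returned raw (quotes and escapes intact), which is what the question asked for.

```python
class ParseError(ValueError):
    """Raised when the input does not match the grammar."""


ESCAPABLE = {"\\", '"', "t", "n", "r"}  # characters allowed after a backslash


def skip_ws(text, i):
    """Rule *WS: skip tabs and spaces, return the next index."""
    while i < len(text) and text[i] in " \t":
        i += 1
    return i


def parse_string(text, i):
    """Rule string: return (raw token including quotes, next index)."""
    if i >= len(text) or text[i] != '"':
        raise ParseError(f"expected opening quote at offset {i}")
    start = i
    i += 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            return text[start:i + 1], i + 1
        if ch == "\\":
            if i + 1 >= len(text) or text[i + 1] not in ESCAPABLE:
                raise ParseError(f"bad escape at offset {i}")
            i += 2  # consume the backslash and the escaped character
        elif ch < " ":
            raise ParseError(f"control character at offset {i}")
        else:
            i += 1
    raise ParseError(f"unclosed quotation mark starting at offset {start}")


def parse_document(text):
    """Rule document: comma-separated strings with optional white space."""
    items = []
    i = skip_ws(text, 0)
    if i == len(text):
        return items
    token, i = parse_string(text, i)
    items.append(token)
    i = skip_ws(text, i)
    while i < len(text):
        if text[i] != ",":
            raise ParseError(f"expected ',' at offset {i}")
        i = skip_ws(text, i + 1)
        token, i = parse_string(text, i)
        items.append(token)
        i = skip_ws(text, i)
    return items


print(parse_document(r'"he\"llo", "world\n", "fro,m"'))
```

Input that ends inside a string raises ParseError with an "unclosed quotation mark" message, which covers the P.S. in the question.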
Your actual string syntax tree will look like that:

Decrypt obfuscated perl script

I had some spam issues on my server and, after finding and removing some Perl and PHP scripts, I'm down to checking what they really do. Although I'm a senior PHP programmer, I have little experience with Perl. Can anyone give me a hand with the script here:
http://pastebin.com/MKiN8ifp
(It was one long line of code, script was called list.pl)
The start of the script is:
$??s:;s:s;;$?::s;(.*); ]="&\%[=.*.,-))'-,-#-*.).<.'.+-<-~-#,~-.-,.+,~-{-,.<'`.{'`'<-<--):)++,+#,-.{).+,,~+{+,,<)..})<.{.)-,.+.,.)-#):)++,+#,-.{).+,,~+{+,,<)..})<*{.}'`'<-<--):)++,+#,-.{).+:,+,+,',~+*+~+~+{+<+,)..})<'`'<.{'`'<'<-}.<)'+'.:*}.*.'-|-<.+):)~*{)~)|)++,+#,-.{).+:,+,+,',~+*+~+~+{+<+,)..})
It continues with precious few non-punctuation characters until the very end:
0-9\;\\_rs}&a-h;;s;(.*);$_;see;
Replace the s;(.*);$_;see; with print to get this. Replace s;(.*);$_;see; again with print in the first half of the payload to get this, which is the decryption code. The second half of the payload is the code to decrypt, but I can't go any further with it, because as you see, the decryption code is looking for a key in an envvar or a cookie (so that only the script's creator can control it or decode it, presumably), and I don't have that key. This is actually reasonably cleverly done.
For those interested in the nitty gritty... The first part, when de-tangled looks like this:
$? ? s/;s/s;;$?/ :
s/(.*)/...lots of punctuation.../;
The $? at the beginning of the line is the pre-defined variable containing the child error, which no doubt serves only as obfuscation. It will be false (zero) here, as there can be no child error at this point.
The question mark following it is the start of a ternary operator
CONDITION ? IF_TRUE : IF_FALSE
Which is also added simply to obfuscate. The expression returned for true is a substitution regex, where the / slash delimiter has been replaced with colon s:pattern:replacement:. Above, I have put back slashes. The other expression, which is the one that will be executed is also a substitution regex, albeit an incredibly long one. The delimiter is semi-colon.
This substitution replaces .* in $_ - the default input and pattern-searching space - with a rather large amount of punctuation characters, which represents the bulk of the code. Since .* matches any string, even the empty string, it will simply get inserted into $_, and is for all intents and purposes identical to simply assigning the string to $_, which is what I did:
$_ = q;]="&\%[=.*.,-))'-,-# .......;;
The following lines are a transliteration and another substitution. (I inserted comments to point out the delimiters)
y; -"[%-.:<-#]-`{-}#~\$\\;{\$()*.0-9\;\\_rs}&a-h;;
#^ ^ ^ ^
#1 2 3
(1,2,3 are delimiters, the semi-colon between 2 and 3 is escaped)
The basic gist of it is that various characters and ranges -" (space to double quote), and something that looks like character classes (with ranges) [%-.:<-#], but isn't, get transliterated into more legible characters e.g. curly braces, dollar sign, parentheses,0-9, etc.
s;(.*);$_;see;
The next substitution is where the magic happens. It is also a substitution with obfuscated delimiters, but with three modifiers: see. s does nothing in this case, as it only allows the wildcard character . to match newline. ee, however, means to evaluate the expression twice.
In order to see what I was evaluating, I performed the transliteration and printed the result. I suspect that I somewhere along the line got some characters corrupted, because there were subtle errors, but here's the short (cleaned up) version:
s;(.*);73756220656e6372797074696f6e5f6 .....;; # very long line of alphanumerics
s;(..);chr(hex($1));eg;
s;(.*);$_;see;
s;(.*);704b652318371910023c761a3618265 .....;; # another long line
s;(..);chr(hex($1));eg;
&e_echr(\$_);
s;(.*);$_;see;
The long regexes are once again the data containers, and insert data into $_ to be evaluated as code.
The s/(..)/chr(hex($1))/eg; is starting to look rather legible. It basically reads two characters at a time from $_ and converts each pair from hex to the corresponding character.
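As an illustration of that hex step, here is a rough Python equivalent applied only to the complete digit pairs visible at the start of the first data line above. The rest of the payload is elided in this post and is not reconstructed here; the function name is made up for the example.

```python
import re


def hex_pairs_to_text(hex_digits: str) -> str:
    """Mimic s/(..)/chr(hex($1))/eg: turn each two-digit hex pair into a character."""
    return re.sub(r"(..)", lambda m: chr(int(m.group(1), 16)), hex_digits)


# Only the complete pairs visible in the truncated line (the trailing odd digit is dropped).
visible_prefix = "73756220656e6372797074696f6e5f"
print(hex_pairs_to_text(visible_prefix))  # sub encryption_
```

The decoded prefix starts with sub, which fits the observation that the data lines contain Perl code to be evaluated.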
The next to last line &e_echr(\$_); stumped me for a while, but it is a subroutine that is defined somewhere in this evaluated code, as hobbs so aptly was able to decode. The dollar sign is prefixed by backslash, meaning it is a reference to $_: I.e. that the subroutine can change the global variable.
After quite a few evaluations, $_ is run through this subroutine, after which whatever is contained in $_ is evaluated a last time. Presumably this time executing the code. As hobbs said, a key is required, which is taken from the environment %ENV of the machine where the script runs. Which we do not have.
Ask the B::Deparse module to make it (a little more) readable.

Characters to separate value

I need to create a string to store pairs of key/value data, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may encounter an error if the key or the value happens to contain || or ::
What are common techniques to deal with this situation? Thanks.
A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, it of course gets split improperly and so on.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
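In Python, for instance, the standard csv module already implements this qualifier convention, so a quick sketch of the example above looks like this (the variable names are just for illustration):

```python
import csv
import io

raw = 'Name,City,State\n"John Doe, Jr.",Anytown,CA\n'

# Reading: the quoted field keeps its embedded comma.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)

# Writing: by default only the fields that need a qualifier get one.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

Writing the rows back produces the same text, with qualifiers only around the field that contains the delimiter.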
Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).
If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash-\ character is used as an escape to say that "the next character is a special character", so in this case it should not be used as a delimiter. So the parser would see || and :: as delimiters, but would see \|\| as two pipe characters || in either the key or the value.
The next problem is that we have overloaded the backslash: how do I represent a literal backslash? This is solved by saying that the backslash is also escaped, so to represent a \, you would have to write \\. The parser would then see \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input and say that || and :: are simply banned, and either fail or remove them when the string is encoded.
A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will need to be escaped when it occurs in data as well; in this case, a backslash would become \\.
You can use a non-printing character as the separator (e.g. vertical tab :-) ).
You can escape separator character in your data during serialization. For example: if you use one character as separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colon or two pipe characters you know that this is not your separator but part of your data, and that you have to change it to one character. On the other hand, every single colon or pipe character is your separator.
Use a prefix (say "a") for your special characters (say "b") present in the key and values to store them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.
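A minimal sketch of this prefix scheme, assuming single-character separators (: between key and value, | between pairs) and \ as the prefix, as in the example above. All function names here are hypothetical.

```python
ESCAPE = "\\"
PAIR_SEP = "|"
KV_SEP = ":"
SPECIAL = {ESCAPE, PAIR_SEP, KV_SEP}


def escape(text):
    """Prefix every special character with the escape character."""
    return "".join(ESCAPE + ch if ch in SPECIAL else ch for ch in text)


def unescape(text):
    """Drop each escape prefix and keep the character that follows it."""
    out, i = [], 0
    while i < len(text):
        if text[i] == ESCAPE and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def split_unescaped(text, sep):
    """Split on sep, ignoring any occurrence that is preceded by an escape."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == ESCAPE and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the pair; unescape() handles it later
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def serialize(pairs):
    return PAIR_SEP.join(escape(k) + KV_SEP + escape(v) for k, v in pairs.items())


def deserialize(text):
    result = {}
    if not text:
        return result
    for record in split_unescaped(text, PAIR_SEP):
        key, value = split_unescaped(record, KV_SEP)
        result[unescape(key)] = unescape(value)
    return result


data = {"title": "Slashdot: News for Nerds. Stuff that Matters.", "shortTitle": "\\."}
encoded = serialize(data)
print(encoded)
print(deserialize(encoded) == data)
```

The important detail is the order of operations when decoding: split on unescaped separators first, then remove the prefixes, so that an escaped separator is never mistaken for a real one.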
The common technique is escaping reserved characters, for example:
In URLs you escape some characters using %HEX representation:
http://example.com?aa=a%20b
In programming languages you escape some characters with a backslash prefix:
"\"hello\""

String literals and escape characters in postgresql

Attempting to insert an escape character into a table results in a warning.
For example:
create table EscapeTest (text varchar(50));
insert into EscapeTest (text) values ('This is the first part \n And this is the second');
Produces the warning:
WARNING: nonstandard use of escape in a string literal
(Using PSQL 8.2)
Anyone know how to get around this?
Partially. The text is inserted, but the warning is still generated.
I found a discussion that indicated the text needed to be preceded with 'E', as such:
insert into EscapeTest (text) values (E'This is the first part \n And this is the second');
This suppressed the warning, but the text was still not being returned correctly. When I added the additional slash as Michael suggested, it worked.
As such:
insert into EscapeTest (text) values (E'This is the first part \\n And this is the second');
Cool.
I also found the documentation regarding the E:
http://www.postgresql.org/docs/8.3/interactive/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
PostgreSQL also accepts "escape" string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter E (upper or lower case) just before the opening single quote, e.g. E'foo'. (When continuing an escape string constant across lines, write E only before the first opening quote.) Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represents a special byte value. \b is a backspace, \f is a form feed, \n is a newline, \r is a carriage return, \t is a tab. Also supported are \digits, where digits represents an octal byte value, and \xhexdigits, where hexdigits represents a hexadecimal byte value. (It is your responsibility that the byte sequences you create are valid characters in the server character set encoding.) Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal way of ''.
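Following the rules quoted above, a client-side helper that builds such an escape string constant could be sketched like this in Python. The function name pg_escape_literal is hypothetical and handles only backslashes and single quotes; in real application code, passing values as query parameters through the database driver is usually the safer route.

```python
def pg_escape_literal(value: str) -> str:
    """Build an E'...' constant: double backslashes, then double single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return "E'" + escaped + "'"


# The text from the question, with a literal backslash followed by n.
text = r"This is the first part \n And this is the second"
print("insert into EscapeTest (text) values (" + pg_escape_literal(text) + ");")
```

For the text from the question this produces the same statement as the working insert shown earlier, with the backslash doubled.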
The warning is issued because you are using backslashes in your strings. If you want to avoid the message, run the command "set standard_conforming_strings=on;". Then use "E" before any string that contains backslashes you want PostgreSQL to interpret.
I find it highly unlikely for Postgres to truncate your data on input - it either rejects it or stores it as is.
milen#dev:~$ psql
Welcome to psql 8.2.7, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
milen=> create table EscapeTest (text varchar(50));
CREATE TABLE
milen=> insert into EscapeTest (text) values ('This will be inserted \n This will not be');
WARNING: nonstandard use of escape in a string literal
LINE 1: insert into EscapeTest (text) values ('This will be inserted...
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
INSERT 0 1
milen=> select * from EscapeTest;
text
------------------------
This will be inserted
This will not be
(1 row)
milen=>
Really stupid question: Are you sure the string is being truncated, and not just broken at the linebreak you specify (and possibly not showing in your interface)? Ie, do you expect the field to show as
This will be inserted \n This will not
be
or
This will be inserted
This will not be
Also, what interface are you using? Is it possible that something along the way is eating your backslashes?
