Strange trim function behavior

I wonder why I get an empty string as a result when I'm expecting something else entirely.
I use the trim function to cut a phone number from a string:
select trim(leading '509960405' from '509960405509960404');
Why isn't the result 509960404, as expected?

trim strips out any characters matching a list of characters. All the characters in your string are in your "leading" list of characters. What you wrote could just as easily be written as
select trim(leading '04569' from '509960405509960404');
It removes any 0, 4, 5, 6 or 9 characters from the beginning of your string. Since your string consists of only 0, 4, 5, 6, or 9 characters, it removes them all.
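Python's str.lstrip happens to treat its argument the same way, as a set of characters rather than a prefix, so it makes a convenient illustration of the behaviour described above. This is only a sketch in Python, not SQL; strip_prefix is a hypothetical helper introduced here to show the contrasting "remove this exact prefix" behaviour.

```python
s = "509960405509960404"

# lstrip treats its argument as a set of characters,
# much like trim(leading ... from ...) does
stripped_by_set = s.lstrip("509960405")

# Only the distinct characters matter, not their order or repetition
stripped_by_sorted_set = s.lstrip("04569")


def strip_prefix(text, prefix):
    """Hypothetical helper: remove `prefix` once, only if `text` starts with it."""
    return text[len(prefix):] if text.startswith(prefix) else text


print(repr(stripped_by_set))          # ''
print(repr(stripped_by_sorted_set))   # ''
print(strip_prefix(s, "509960405"))   # 509960404
```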

@Paul clarified the behaviour of trim().
The solution you presented in the comment is potentially treacherous:
SELECT replace('509960405509960404','509960405','')
This replaces all occurrences of '509960405', not just the first. For example:
SELECT replace('509960405509960404','50996040' ,'');
This results in 54. I suspect that's not what you want.
Use a regular expression with regexp_replace() instead:
SELECT regexp_replace('509960405509960404','^509960405' ,'');
^ anchors the pattern to the start of the string ("left-anchored").
regexp_replace() is more expensive than a simple replace() but also more versatile.
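The difference between replacing every occurrence and replacing only a left-anchored one can be reproduced in Python, used here purely as an illustration of the two SQL calls above. In this sketch str.replace stands in for replace() and re.sub with ^ stands in for regexp_replace(); neither is the SQL function itself.

```python
import re

s = "509960405509960404"

# str.replace removes every occurrence, like the SQL replace() call
all_removed = s.replace("50996040", "")

# A pattern anchored with ^ can only match at the start of the string
prefix_removed = re.sub(r"^509960405", "", s)

print(all_removed)     # 54
print(prefix_removed)  # 509960404
```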

Related

Regular expression (^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$) meaning

Given a tweet dataset from this link which has a content column as follows:
I hope to add a new column that identifies whether or not the tweet mentions Trump. The regex pattern (^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$) seems to work, but I don't understand it well. I've tested it with the code below:
Test1 gives the following output since it matches:
import re
txt1 = "anti-Trump protesters"
re.search("(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)", txt1)
Out:
<_sre.SRE_Match object; span=(4, 11), match='-Trump '>
Test2 returns None, as expected, since it does not match:
txt2 = 'I got Trumped'
re.search("(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)", txt2)
Could someone explain this pattern a little? Many thanks in advance.

The (^|[^A-Za-z0-9]) portion has |, which means “or”. The left side, the ^, is the start of the string. The right side, [^A-Za-z0-9], matches any character that is not a letter or a number. In short, it matches when “Trump” is at the start of the string, or is preceded by a non-alphanumeric character.
The ([^A-Za-z0-9]|$) follows a similar pattern, where the left side matches any character that is not a letter or a number. The right side, the $ matches the end of the string. Likewise, it matches when “Trump” is at the end of the string or is followed by a non-alphanumeric character.
So, bottom line, it matches “Trump” when it is either at the start of the string or preceded by a non-alphanumeric character, and is also either at the end of the string or followed by a non-alphanumeric character.
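A short sketch of how the two alternations behave on a few inputs. The helper name mentions_trump and every sample string other than the two from the question are assumptions made for illustration.

```python
import re

PATTERN = re.compile(r"(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)")


def mentions_trump(text):
    """Hypothetical helper: True if 'Trump' appears as a standalone word."""
    return PATTERN.search(text) is not None


samples = [
    "anti-Trump protesters",  # preceded by '-', followed by ' '
    "Trump",                  # at both the start and the end of the string
    "I got Trumped",          # followed by the letter 'e'
    "MrTrump said",           # preceded by the letter 'r'
]
for text in samples:
    print(repr(text), mentions_trump(text))
```

The first two samples print True and the last two print False, because a letter directly before or after the word defeats one of the two groups.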

Keeping only letters and digits in a string

I am recoding some open survey responses in SPSS and want to keep only the usual characters a-z and 1-9.
I have applied rtrim and ltrim, which worked for most strings, but some still have trailing spaces, which I assume are not actually spaces but hidden characters.
I have also removed punctuation such as "?" but I imagine there must be a more straightforward way than going through each one.
e.g. I need
"exam'ple! " or " exam!!--ple?"
to say "example"

The following syntax will create a new clean field and copy to it only the digits and letters (uppercase or lowercase) from the original field.
Note that I used 15 as the new field width and as the number of iterations in the loop - please change 15 to the actual width of the original field
* Declare the new field first; 15 is the assumed width of OrigField.
string CleanField (A15).
do repeat val=1 to 15.
compute #i = number(char.substr(OrigField, val, 1), PIB).
if range(#i, 48, 57) or
   range(#i, 65, 90) or
   range(#i, 97, 122)
   CleanField=concat(rtrim(CleanField), char.substr(OrigField, val, 1)).
end repeat.
exe.
See the link suggested by @user45392 to understand how/why this works.
Also see this list for additional values you can add to the loop if you'd like.
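For readers who want to check the logic outside SPSS, the same character-code filter can be sketched in Python. The function name keep_alnum is hypothetical; the three ranges are the ones used in the syntax above (48-57 for digits, 65-90 for uppercase letters, 97-122 for lowercase letters).

```python
def keep_alnum(text):
    """Hypothetical helper mirroring the loop above: keep only ASCII letters and digits."""
    kept = []
    for ch in text:
        code = ord(ch)  # numeric character code, like the #i scratch variable
        if 48 <= code <= 57 or 65 <= code <= 90 or 97 <= code <= 122:
            kept.append(ch)
    return "".join(kept)


print(keep_alnum("exam'ple! "))     # example
print(keep_alnum(" exam!!--ple?"))  # example
```

Because the test is on character codes rather than on a list of punctuation, hidden characters that only look like spaces are dropped as well.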

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable-length lookbehinds in Python there, and I'm using the regex library to allow for this in my actual implementation.

Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)

The pattern you tried does not match because this part, (\d)(?<!\1), cannot match.
It reads as:
Capture a digit in group 1. Then, at the position right after that captured digit, assert that what was captured is not directly to the left.
That assertion always fails, because the digit just captured is exactly what sits to the left.
You could make the pattern work by adding, for example, a dot after the backreference, (?<!\1.), to assert that the character before the one you have just matched is not the same as group 1.
Pattern
(\d)(?<!\1.)\1(?!\1)
Note that you have selected ECMAScript on regex101.
Python's re module does not support variable-width lookbehind.
The Python example here uses the PyPI regex module.
Example code
import regex  # third-party module from PyPI, not the standard-library re

pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
            "1151223\n\n"
            "112223\n"
            "123544")
matches = regex.finditer(pattern, test_str)
for matchNum, match in enumerate(matches, start=1):
    print(match.group())  # each match is an isolated pair of digits
Output
22
11
22
11
44
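As a cross-check that does not depend on lookbehind support at all, the same pairs can be found without a regular expression. This is a swapped-in technique (run-length grouping with itertools.groupby), not the answer's method; exact_pairs is a hypothetical name, and unlike \d it accepts any character, not only digits.

```python
from itertools import groupby


def exact_pairs(text):
    """Hypothetical helper: return each run of a repeated character whose length is exactly 2."""
    pairs = []
    for char, run in groupby(text):
        if len(list(run)) == 2:
            pairs.append(char * 2)
    return pairs


for line in ["1111223", "1151223", "112223", "123544"]:
    print(line, exact_pairs(line))
```

For the four test strings this yields 22, then 11 and 22, then 11, then 44, in line with the output listed above.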

@Theforthbird has provided a good explanation of why your regular expression does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Python's re regex engine performs the following operations.
^.$        match the first char if it is the only char in the line
|          or
^          match beginning of line
(.)        match a char in capture group 1...
(?!\1)     ...that is not followed by the same character
|          or
(?<=(.))   save the previous char in capture group 2...
(?!\2)     ...that is not equal to the next char
(.)        match a character and save to capture group 3...
(?!\3)     ...that is not equal to the following char
Suppose the string were "cat".
The internal string pointer is initially at the beginning of the line.
"c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
"c" is matched and saved to capture group 1.
The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the alternation, so the third part is considered.
The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
The next character ("a") is matched and saved in capture group 3.
The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".
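The walkthrough above can be reproduced with a few lines of Python using the standard-library re module. The helper name isolated_chars is hypothetical, and the two digit strings are taken from the question.

```python
import re

ISOLATED = re.compile(r"^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)")


def isolated_chars(text):
    """Hypothetical helper: characters with no identical neighbour on either side."""
    # group(0) is the whole match; the numbered groups only support the assertions
    return [m.group(0) for m in ISOLATED.finditer(text)]


for text in ["cat", "1111223", "1151223"]:
    print(text, isolated_chars(text))
```

For "cat" every character is isolated, "1111223" yields only the final 3, and "1151223" yields 5, 1 and 3, since the third 1 has no adjacent 1.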

How to capture only one word from a sentence in Excel?

AWESOME :)
Another QUESTION:
What if I have multiple sentences like:
[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)
[PATCHING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (CCB)
Notice that the last word, which I need to extract, varies from sentence to sentence. How can I make sure that I always extract the last part of the sentence? In this case, (AMB) and (CCB).
I also need to do the same with the words at the beginning:
[PROGRAMMING]
[PATCHING]
Thanks :)

You can use this for the part within []:
=MID(A2,2,FIND("]",A2)-2)
And this for the part within ():
=MID(A2,FIND("(",A2)+1,3)
MID takes 3 parameters:
A text,
A starting position,
The length of the extracted text.
FIND takes 2-3 parameters and returns a position number:
Something it will look for,
The text in which it will look for the something,
The position from where it'll start looking. If not mentioned, looks from the beginning.
=MID(A2,2,FIND("]",A2)-2) with your first example becomes the following after replacing the innermost evaluation:
=MID(A2,2,FIND("]","[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)")-2)
FIND("]","[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)")
] appears at the 13th position, so this FIND() returns 13. The MID becomes:
=MID(A2,2,13-2) => =MID(A2,2,11)
And if you count the characters in PROGRAMMING, there are 11. I subtracted 2 because one position belongs to the opening [ and one to the closing ], and neither should be extracted.
Now, it becomes:
=MID("[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)",2,11)
Which means start (including) at character 2 and take 11 characters, which gives the text you are looking for.
The one for () is just as simple if you got the above.

You can use the MID() function if the data always follows the same pattern.
=MID(A1, FIND("(",A1, 1) + 1, LEN(A1)-FIND("(",A1, 1)-1)
Assuming the string is in A1. The first parameter is the string. The second is the start of the substring to extract. You want to start one character past the first parenthesis. The last parameter is the length of the substring to extract. You want to take the whole string minus all the characters before the parenthesis and also ignore the last parenthesis (thus the -1).
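To check the arithmetic in both formulas, MID and FIND can be emulated in Python. The functions find and mid below are hypothetical stand-ins that mimic the 1-based positions of the spreadsheet functions; they are a sketch, not a full reimplementation.

```python
def find(needle, text, start=1):
    """1-based position of `needle` in `text`, mimicking FIND (sketch only)."""
    index = text.find(needle, start - 1)
    if index == -1:
        raise ValueError("needle not found")  # the spreadsheet shows an error instead
    return index + 1


def mid(text, start, length):
    """`length` characters of `text` beginning at 1-based position `start`, like MID."""
    return text[start - 1:start - 1 + length]


a2 = "[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)"
open_paren = find("(", a2)

print(find("]", a2))                                       # 13
print(mid(a2, 2, find("]", a2) - 2))                       # PROGRAMMING
print(mid(a2, open_paren + 1, 3))                          # AMB
print(mid(a2, open_paren + 1, len(a2) - open_paren - 1))   # AMB
```

The last two lines differ only in how the length is obtained: a fixed 3 works for AMB and CCB, while the LEN-based length also handles a parenthesised code of any size at the end of the string.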

AS3 - "\u2605" NOT the same as "\\u"+"2605"?

I'm trying to make a text field where people type the Unicode escape without the backslash, and I want to add the backslash after they have typed it. So the user types u2605 and the code converts it to "\u2605"; I then convert this to a Unicode character and insert it into the text flow.
My code:
this works:
span.text = publicFunctions.htmlUnescape(he.encode("\u2605"))
this doesn't work:
span.text = publicFunctions.htmlUnescape(he.encode("\\u"+"2605"))
How can I make a string that acts as a Unicode string?
I tried all sorts of things: escape(unescape()), converting to a number, "\\u", "\u" ... nothing helps.
trace("\u2605" == "\\u"+"2605") ... will return false. So will
trace("\u2605" == "\u"+"2605")

"\u2605" is a string with a single character, the character with the code point 2605, while "\\u" + "2605" is a string with 6 characters (the backslash, the u and the four digit number).
If you want to construct a unicode character from just the four digits, you should be able to use String.fromCharCode. The thing is just that the escape sequence uses a hexadecimal number, while the method obviously takes a decimal number. So if the user enters a hexadecimal string, you will have to convert that first:
trace(String.fromCharCode(parseInt('2605', 16)) == '\u2605');
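The same distinction between an escape in a literal and a string assembled at run time exists in Python, which is used here only as an illustration. In this sketch chr and int(..., 16) play the roles of String.fromCharCode and parseInt; the variable names are assumptions.

```python
user_input = "u2605"  # what the user types, without the backslash

# Concatenation builds a six-character string; no escape processing happens
literal_text = "\\u" + user_input[1:]

# Parse the four hex digits and build the character from its code point
star = chr(int(user_input[1:], 16))

print(len(literal_text))   # 6
print(len(star))           # 1
print(star == "\u2605")    # True
```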

That's an interesting issue! I don't think you can concatenate a string literal and achieve what you're trying to do. The relevant character escaping happens when the string literal is originally formed, which means that you need the whole sequence together in the first place.
But you should be able to take the user-supplied number and dynamically generate a Unicode string with String.fromCharCode(...).
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/String.html#fromCharCode()
