Split only when quotes contain 3 characters or more - python-3.x

I'm trying to split a given string when two quotes appear and they contain at least 3 characters(I also have to split whenever . or , appear). So, something like hello"example"hello,cat should return [hello;example;hello;cat].
I came up with:
re.split("\'(...+)\'|\.|,","hello'example'hello,cat")
This works fine with the quotes, but whenever it split for . or , this happens:
['hello', 'example', 'hello', None, 'cat']
I found out the capture group is the one that causes it (the None in the middle of the list), but it is the only way I know to keep the content.
Please keep in mind that I have to do as few as possible computations because the program shall work with huge files, also I'm not very experienced with Python so sorry if I did something obvious wrong.

Try just:
re.split("\'|\.|,", "hello'example'hello,cat")

It's tricky because the open quote and close quote is the exact same character. I think you'd have to use a negative look behind to exclude any single quote that is preceded by 0, 1 or 2 characters and another single quote. In addition, you'd have to use a positive lookahead. This works in javascript.
re.split("(?<!'(.?|..))'(?=[^']{3,})|\.|\,", "hello'example'hello,cat")
But it doesn't look like python supports variable-length lookbehinds. Also, this won't work if there is a lone single quote (apostrophe).

Related

vim Search Replace should use replaced text in following searches

I have a data file (comma separated) that has a lot of NAs (It was generated by R). I opened the file in vim and tried to replace all the NA values to empty strings.
Here is a sample slimmed down version of a record in the file:
1,1,NA,NA,NA,NATIONAL,NA,1,NANA,1,AMERICANA,1
Once I am done with the search-replace, the intended output should be:
1,1,,,,NATIONAL,,1,NANA,1,AMERICANA,1
In other words, all the NAs should be replaced except the words NATIONAL, NANA and AMERICANA.
I used the following command in vim to do this:
1, $ s/\,NA\,/\,\,/g
But, it doesn't seem to work. Here is the output that I get:
1,1,,NA,,NATIONAL,,1,NANA,1,AMERICANA,1
As you can see, there is one ,NA, that is left out of the replacement process.
Does anyone have a good way to fix it? Thanks.
A trivial solution is to run the same command again and it will take care of the remaining ,NA,. However, it is not a feasible solution because my actual data file has 100s of columns and 500K+ rows each with a variable number of NAs.
, doesn't have a special meaning so you don't have to escape it:
:1,$s/,NA,/,,/g
Which doesn't solve your problem.
You can use % as a shorthand for 1,$:
:%s/,NA,/,,/g
Which doesn't solve your problem either.
The best way to match all those NA words to the exclusion of other words containing NA would be to use word boundaries:
:%s/,\<NA\>,/,,/g
Which still doesn't solve your problem.
Which makes those commas, that you used to restrict the match to NA and that are causing the error, useless:
:%s/\<NA\>//g
See :help :range and :help \<.
Use % instead of 1,$ (% means "the buffer" aka the whole file).
You don't need \,. , works fine.
Vim finds discrete, non-overlapping matches. so in ,NA,NA,NA, it only finds the first ,NA, and third ,NA, as the middle one doesn't have its own separate surrounding ,. We can modify the match to not include certain characters of our regex with \zs (start) and \ze (end). These modify our regex to find matches that are surrounded by other characters, but our matches don't actually include them, so we can match all the NA in ,NA,NA,NA,.
TL;DR: %s/,\zsNA\ze,//g

How to capture a string between parentheses?

str = "fa, (captured)[asd] asf, 31"
for word in str:gmatch("\(%a+\)") do
print(word)
end
Hi! I want to capture a word between parentheses.
My Code should print "captured" string.
lua: /home/casey/Desktop/test.lua:3: invalid escape sequence near '\('
And i got this syntax error.
Of course, I can just find position of parentheses and use string.sub function
But I prefer simple code.
Also, brackets gave me a similar error.
The escape character in Lua patterns is %, not \. So use this:
word=str:match("%((%a+)%)")
If you only need one match, there is no need for a gmatch loop.
To capture the string in square brackets, use a similar pattern:
word=str:match("%[(%a+)%]")
If the captured string is not entirely composed of letters, use .- instead of %a+.
lhf's answer likely gives you what you need, but I'd like to mention one more option that I feel is underused and may work for you as well. One issue with using %((%a+)%) is that it doesn't work for nested parentheses: if you apply it to something like "(text(more)text)", you'll get "more" even though you may expect "text(more)text". Note that you can't fix it by asking to match to the first closing parenthesis (%(([^%)]+)%)) as it will give you "text(more".
However, you can use %bxy pattern item, which balances x and y occurrences and will return (text(more)text) in this case (you'd need to use something like (%b()) to capture it). Again, this may be overkill for your case, but useful to keep in mind and may help someone else who comes across this problem.

Add parenthesis to vim string

I'm having a bit of trouble using parenthesis in a vim string. I just need to add a set of parenthesis around 3 digits, but I can't seem to find where I'm suppose to correctly place them. So for example; I would have to place them around a phone number such as: 2015551212.
Right now I have a strings that separates the numbers and puts a hyphen between them. For example; 201 555-1212. So I just need the parenthesis. The final result should look like: (201) 555-1212
The string I have so far is this: s/\(\d\{3}\)\(\d\{3}\)/\1 \2-/g
How might I go about doing this?
Thanks
Just add the parens around the \1 in your replacement.
s/\(\d\{3\}\)\(\d\{3\}\)/(\1) \2-/g
If you want to go in reverse, and change "(800) 555-1212" to "8005551212", you can use something like this:
s/(\(\d\d\d\))\ \(\d\d\d\)-\(\d\d\d\d\)/\1\2\3/g
Instead of the \d\d\d, you could use \d\{3\}, but that is more trouble to type.

replacing part of regex matches

I have several functions that start with get_ in my code:
get_num(...) , get_str(...)
I want to change them to get_*_struct(...).
Can I somehow match the get_* regex and then replace according to the pattern so that:
get_num(...) becomes get_num_struct(...),
get_str(...) becomes get_str_struct(...)
Can you also explain some logic behind it, because the theoretical regex aren't like the ones used in UNIX (or vi, are they different?) and I'm always struggling to figure them out.
This has to be done in the vi editor as this is main work tool.
Thanks!
To transform get_num(...) to get_num_struct(...), you need to capture the correct text in the input. And, you can't put the parentheses in the regular expression because you may need to match pointers to functions too, as in &get_distance, and uses in comments. However, and this depends partially on the fact that you are using vim and partially on how you need to keep the entire input together, I have checked that this works:
%s/get_\w\+/&_struct/g
On every line, find every expression starting with get_ and continuing with at least one letter, number, or underscore, and replace it with the entire matched string followed by _struct.
Darn it; I shouldn't answer these things on spec. Note that other regex engines might use \& instead of &. This depends on having magic set, which is default in vim.
For an alternate way to do it:
%s/get_\(\w*\)(/get_\1_struct(/g
What this does:
\w matches to any "word character"; \w* matches 0 or more word characters.
\(...\) tells vim to remember whatever matches .... So, \(w*\) means "match any number of word characters, and remember what you matched. You can then access it in the replacement with \1 (or \2 for the second, etc.)
So, the overall pattern get_\(\w*\)( looks for get_, followed by any number of word chars, followed by (.
The replacement then just does exactly what you want.
(Sorry if that was too verbose - not sure how comfortable you are with vim regex.)

String compare : Comparing 'zürich' and 'zurich' results in -1

I'm trying to do a string compare for 'zürich' and 'zurich'
Something like this:
int compareResult = String.Compare(zürich, zurich);
So what happens is that it returns -1, which causes a problem as I'm using compareResult for an if-else later.
Can someone point me to the right direction on why does this happen. Do I need to clean this first before comparing "zürich" or is it something else?
you use the method just fine, but the strings are actually different.
so, in order to make this comparison in your way, you need:
decide if this you want every comparison that uses ü and other "special" latin characters to look at them as they were the simple characters.
i.e. in every time you see ü, it will treat it as a "u"
if so, you need to do pre-processing of both the strings, and replace all special chars with regular ones.
there is another thread about it here:
How can I remove accents on a string?
hope it helped.

Resources