Exact strings do not match in Jupyter Notebook

What did I want to do?
I was collecting many files whose names end in various organ names, using glob.glob('filename/**/blabla').
Later, I tried to check whether a particular string is present in a filename using the in operator, like
"ADRENALGLAND(LEFT).NRRD" in "blabla/blabla/blabla/blablabla_ADRENALGLAND(LEFT).NRRD"
It worked for most filenames with the same ending, but not for a few.
To debug, I compared two filename endings that look identical, string to string, and Python says they are not equal. Why?
Can anyone tell me what the difference is here?
'ADRENALGLAND(LEFT).NRRD' == 'АDRENALGLAND(LEFT).NRRD' => False !!!
I narrowed it down to the first character: the 'A's do not match, whereas all the other characters match.

As suggested by @canbax, I checked the underlying integer value of both characters and found that they are different. One gave 65 (the ASCII code of the Latin capital letter 'A'), whereas the other gave 1040, which is not an ASCII value at all but the Unicode code point U+0410, CYRILLIC CAPITAL LETTER A.
You can use ord() to get the integer code point of a character.
Although the values are different, the two letters look the same in most fonts, so this is a look-alike character in the data rather than a problem specific to Jupyter Notebook.
Final solution: replaced the look-alike A with the normal A in the file.
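
For anyone hitting the same thing, here is a minimal sketch of the ord() check described above, extended to report every non-ASCII character in a string together with its Unicode name. The helper name describe_non_ascii is made up for this example, and the look-alike string is written with an explicit escape so the difference is visible in the source.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, char, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


latin = "ADRENALGLAND(LEFT).NRRD"
lookalike = "\u0410DRENALGLAND(LEFT).NRRD"  # first letter is Cyrillic, not Latin

print(latin == lookalike)             # the comparison from the question
print(describe_non_ascii(lookalike))  # points at the offending character
```

Running the helper over the file names returned by glob should show which ones contain the stray letter.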

Related

vim Search Replace should use replaced text in following searches

I have a data file (comma separated) that has a lot of NAs (It was generated by R). I opened the file in vim and tried to replace all the NA values to empty strings.
Here is a sample slimmed down version of a record in the file:
1,1,NA,NA,NA,NATIONAL,NA,1,NANA,1,AMERICANA,1
Once I am done with the search-replace, the intended output should be:
1,1,,,,NATIONAL,,1,NANA,1,AMERICANA,1
In other words, all the NAs should be replaced except the words NATIONAL, NANA and AMERICANA.
I used the following command in vim to do this:
1, $ s/\,NA\,/\,\,/g
But, it doesn't seem to work. Here is the output that I get:
1,1,,NA,,NATIONAL,,1,NANA,1,AMERICANA,1
As you can see, there is one ,NA, that is left out of the replacement process.
Does anyone have a good way to fix it? Thanks.
A trivial solution is to run the same command again and it will take care of the remaining ,NA,. However, it is not a feasible solution because my actual data file has 100s of columns and 500K+ rows each with a variable number of NAs.

, doesn't have a special meaning so you don't have to escape it:
:1,$s/,NA,/,,/g
Which doesn't solve your problem.
You can use % as a shorthand for 1,$:
:%s/,NA,/,,/g
Which doesn't solve your problem either.
The best way to match all those NA words to the exclusion of other words containing NA would be to use word boundaries:
:%s/,\<NA\>,/,,/g
Which still doesn't solve your problem.
With word boundaries in place, the commas you used to restrict the match to NA, and which are causing the problem, become unnecessary:
:%s/\<NA\>//g
See :help :range and :help \<.

Use % instead of 1,$ (% means "the buffer" aka the whole file).
You don't need \,. , works fine.
Vim finds discrete, non-overlapping matches, so in ,NA,NA,NA, it only finds the first and third ,NA,: the middle one has no surrounding commas of its own, because they were already consumed by its neighbours. We can exclude parts of the pattern from the match with \zs (match start) and \ze (match end). The surrounding commas must still be present, but they are no longer part of the match, so every NA in ,NA,NA,NA, is matched.
TL;DR: %s/,\zsNA\ze,//g
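
The same non-overlapping behaviour can be reproduced outside Vim. Below is a rough sketch in Python, where lookbehind and lookahead play the role of \zs and \ze; it is only an analogy, not a claim about how Vim implements its matching.

```python
import re

record = "1,1,NA,NA,NA,NATIONAL,NA,1,NANA,1,AMERICANA,1"

# The commas are part of each match, so adjacent fields compete for them.
consuming = re.sub(r",NA,", ",,", record)

# Lookbehind/lookahead require the commas but leave them out of the match.
lookaround = re.sub(r"(?<=,)NA(?=,)", "", record)

print(consuming)
print(lookaround)
```

The first result still contains one NA field, as in the question; the second leaves only NATIONAL, NANA and AMERICANA.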

How to make a dictionary that contains an Arabic diacritic as a key in python

I am trying to write a program that converts Arabic letters and diacritics into the Latin script. The letters work well, but the diacritics cannot be converted: I get an error every time I run the program.
At first, I put the diacritics alone as keys, but that did not work for me. Please see the last key: it contains َ , which is a diacritic, but it does not work properly like the letters do:
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("الوَ")
However, I tried to fix the problem by using letters attached with diacritics as keys, but the program resulted in the same error:
the dictionary:
ArEn = {'ا':'A', 'ل':'L', "وَ":"Wa"}
the error:
Traceback (most recent call last):
  File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 10, in <module>
    convert("الوَ")
  File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 5, in convert
    end_word.append(ArEn[lit[i]])
KeyError: 'و'

Chances are the bug is in the code editor you are using to write Python rather than in Python itself.
Since you are using Python 3.x, a diacritic is, from the running program's point of view, just a single character like any other, and there should be no issues at all.
From the code editor's point of view, there are issues such as whether or not to advance one character when displaying certain special Unicode characters, and the " character itself may be shown out of place. When one tries to manually correct the position of the ", one can put it in the wrong order, leaving the special character actually outside the quoted string.
The fact that you could solve the issue by re-editing the file suggests that this is indeed what happened.
One way to avoid this is to escape certain special characters, especially ones with different display rules, with the "\uxxxx" Unicode code point escape sequence. This spares you and others trouble when editing the file again in the future: even if you get it working now, an editor may show the characters incorrectly when the file is opened, and by trying to fix them one might break the syntax again.
You can use a table on the web or Python 3's interactive prompt to get the Unicode code point of each character, ensuring the code part of the program is displayed in a deterministic way in any editor. (If you add the diacritical character as a comment on the same line, it will actually enhance the readability of your code, enormously so if it is ever supposed to be edited by non-Arabic speakers.)
So, for your declaration above, I used this snippet to extract the code points:
>>> ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
>>> [print (hex(ord(yy)), yy ) for yy in ArEn.keys()]
0x648 و
0x644 ل
0x64e َ
0x627 ا
Which allows me to declare the dictionary like this:
ArEn = {
    "\u0648": "W",  # و
    "\u0644": "L",  # ل
    "\u064e": "a",  # َ  (fatha)
    "\u0627": "A",  # ا
}
(And yes, I had trouble displaying the characters on my terminal, just as I said you probably had in your editor, while getting these. The fatha ("\u064e", "a") character is tricky! :-) )
An alternative to using the code points in your code is to use Python's unicodedata module (import unicodedata) to discover and then use the actual character names. This can enhance readability further, and by exploring unicodedata you may find that you don't even have to create this dictionary manually but can use that module instead.
In [16]: [print("\\u{:04x} - '{}' - {}".format(ord(yy), unicodedata.name(yy), yy) ) for yy in ArEn.keys()]
\u0648 - 'ARABIC LETTER WAW' - و
\u0644 - 'ARABIC LETTER LAM' - ل
\u064e - 'ARABIC FATHA' - َ
\u0627 - 'ARABIC LETTER ALEF' - ا
And from these full text names, you can get back to the character with the unicodedata.lookup function:
>>> unicodedata.lookup("ARABIC LETTER LAM")
'ل'
notes:
1) This requires Python 3. For Python 2 one might try to prefix each string with u"", but anyone dealing with these characters is far better off using Python 3, since Unicode support is one of its big deals.
2) This also requires a terminal with good support for Unicode characters using the "utf-8" encoding. I am on a Linux system with the "konsole" terminal. On Windows, the IDLE Python prompt might work, but not the cmd Python prompt.
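
To make the unicodedata idea concrete, here is a small sketch that builds the dictionary from character names instead of typing the characters. The names are the ones printed above, plus ARABIC DAMMA (U+064F), which is my addition; the Latin values and the constant names NAME_TO_LATIN and AR_EN are only illustrative choices.

```python
import unicodedata

# Keys are official Unicode character names; values are the Latin output.
NAME_TO_LATIN = {
    "ARABIC LETTER ALEF": "A",
    "ARABIC LETTER LAM": "L",
    "ARABIC LETTER WAW": "W",
    "ARABIC FATHA": "a",
    "ARABIC DAMMA": "w",  # assumption: not among the names printed above
}

# unicodedata.lookup turns each name back into the actual character.
AR_EN = {unicodedata.lookup(name): latin for name, latin in NAME_TO_LATIN.items()}


def convert(text):
    """Transliterate text one code point at a time."""
    return "".join(AR_EN[ch] for ch in text)


# alef, damma, lam, waw, fatha, written as escapes so no editor can mangle them
print(convert("\u0627\u064f\u0644\u0648\u064e"))
```

Since the source file contains only ASCII, the editor display problems described above cannot break the string literals.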

You might need proper indentation in Python:
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("اُلوَ")

Update: I just noticed, years later, that in my first attempt the letters and diacritics were joined together. When I separated them, the program worked.
I just solved the problem!
I am not really sure whether it was a mistake in Python or something else; as far as I know, Python does not support Arabic very well. Or maybe I made a mistake in the program above.
I kept rewriting the same program and suddenly it worked very well.
I even added different diacritics and they worked properly.
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("اُلوَ")
The result is:
AwLWa
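
A short sketch of why the combined key from the question raised KeyError: a letter followed by a diacritic looks like one glyph but is two code points, and the loop indexes the word one code point at a time. The variable names here are illustrative only.

```python
letter_with_fatha = "\u0648\u064e"  # waw followed by a combining fatha

# One visible glyph, but two code points as far as Python is concerned.
print(len(letter_with_fatha))
print([hex(ord(ch)) for ch in letter_with_fatha])

combined_keys = {"\u0627": "A", "\u0644": "L", letter_with_fatha: "Wa"}

# Walking the word character by character yields the bare waw and the bare
# fatha, neither of which is a key in a dictionary that only has the pair.
word = "\u0627\u0644\u0648\u064e"
missing = [ch for ch in word if ch not in combined_keys]
print([hex(ord(ch)) for ch in missing])
```

Keeping letters and diacritics as separate single-character keys, as in the working version, avoids the mismatch.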

Character Encoding interferes with matching Scala strings?

I am currently dealing with a weird problem when trying to match two Scala strings. When trying to determine whether the following two strings are the same:
SM8lz5IEIWs7TUhR3ke27pnY3XsjojxqaMEg+ARCGs1nm3sVkwA+CM+XJfdsUxqzqH7LZdkflvny
z621tYkmXA== and SM8lz5IEIWs7TUhR3ke27pnY3XsjojxqaMEg+ARCGs1nm3sVkwA+CM+XJfdsUxqzqH7LZdkflvny
z621tYkmXA==
Scala returns false. So if I do the following if(hash1 == hash2) it returns false.
I suspect this is either a whitespace or character encoding issue, since hash matching only fails when trying to match a hash that was produced on a computer of a different operating system. I already tried stripping whitespace using regex, but it still failed.
What have I overlooked? And are there better ways to clean and match hashes in Scala?
Update
After comparing the two strings, Scala thinks hash2 is a single character longer than hash1. So I ran the following functions on both hashes: .trim.replaceAll("""(?m)\s+$""", ""). Still, it says they're not the same. What other characters could be interfering?

I have found the cause of this particular problem. Apparently when processing strings on Macintosh, \r is added in addition to any line breaks. Even though line break characters don't print out on a console, they're still inside the string.
The remedy was to do the following: .trim.replaceAll("\r", "")
And now both strings match.
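
The diagnosis carries over to other languages. Here is a rough sketch in Python rather than Scala: printing the repr of each string makes an invisible \r show up, and removing all whitespace before comparing sidesteps the platform difference. The two hash values are shortened, hypothetical stand-ins, and normalize_hash is a made-up helper name.

```python
def normalize_hash(value):
    """Drop every whitespace character (carriage returns, line feeds, spaces)."""
    return "".join(value.split())


hash1 = "SM8lz5IE\nz621tYkmXA=="    # line feed only
hash2 = "SM8lz5IE\r\nz621tYkmXA=="  # carriage return plus line feed

print(repr(hash1))
print(repr(hash2))  # the extra \r is visible here
print(normalize_hash(hash1) == normalize_hash(hash2))
```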

Mocha fails with no string diff, utf-8?

I have a failing mocha test that outputs my string with the "Actual" and "Expected" highlighting... except that nothing's highlighted.
After some head-bashing, I think I've determined that my actual string contains some whacky UTF-8 characters that are completely hidden from me, and Mocha doesn't seem to know to highlight them.
I figured this out by writing out my expected and actual values to raw text files and loading them up in Kaleidoscope, which shows that they differ by highlighting what appears to be empty spaces between words.
I tried loading the utf8 library (on npm) and encoding one of the strings with utf8.encode str, and that still failed, but now the characters appear as something more than blank spaces, and Mocha does highlighting:
But either way, my tests are failing. How can I encode/decode/whatever these strings so that they match and my tests pass?
Btw, the comparison string I'm using in my test looks like this:

Make sure that either your text editor is saving your source code as proper UTF-8, or convert those copy/pasted chars to escaped literals, as @loganfsmyth correctly comments.

Why doesn't Vim's errorformat take regular expressions?

Vim's errorformat (for parsing compile/build errors) uses an arcane format from C for parsing errors.
Trying to set up an errorformat for nant seems almost impossible; I've tried for many hours and can't get it. I also see from my searches that a lot of people seem to be having the same problem. A regex to solve this would take minutes to write.
So why does Vim still use this format? It's quite possible that the C parser is faster, but that hardly seems relevant for something that happens once every few minutes at most. Is there a good reason, or is it just a historical artifact?

It's not that Vim uses an arcane format from C. Rather it uses the ideas from scanf, which is a C function. This means that the string that matches the error message is made up of 3 parts:
whitespace
characters
conversion specifications
Whitespace is your tabs and spaces. Characters are the letters, numbers and other normal stuff. Conversion specifications are sequences that start with a '%' (percent) character. In scanf you would typically match an input string against %d or %f to convert to integers or floats. With Vim's error format, you are searching the input string (error message) for files, lines and other compiler specific information.
If you were using scanf to extract an integer from the string "99 bottles of beer", then you would use:
int i;
scanf("%d bottles of beer", &i); // i would be 99, string read from stdin
Now with Vim's error format it gets a bit trickier but it does try to match more complex patterns easily. Things like multiline error messages, file names, changing directory, etc, etc. One of the examples in the help for errorformat is useful:
Error 275
line 42
column 3
' ' expected after '--'
The appropriate error format string has to look like this:
:set efm=%EError\ %n,%Cline\ %l,%Ccolumn\ %c,%Z%m
Here %E tells Vim that it is the start of a multi-line error message. %n is an error number. %C is the continuation of a multi-line message, with %l being the line number, and %c the column number. %Z marks the end of the multiline message and %m matches the error message that would be shown in the status line. You need to escape spaces with backslashes, which adds a bit of extra weirdness.
While it might initially seem easier with a regex, this mini-language is specifically designed to help with matching compiler errors. It has a lot of shortcuts in there. I mean you don't have to think about things like matching multiple lines, multiple digits, matching path names (just use %f).
Another thought: How would you map numbers to mean line numbers, or strings to mean files or error messages if you were to use just a normal regexp? By group position? That might work, but it wouldn't be very flexible. Another way would be named capture groups, but then this syntax looks a lot like a short hand for that anyway. You can actually use regexp wildcards such as .* - in this language it is written %.%#.
OK, so it is not perfect. But it's not impossible either and makes sense in its own way. Get stuck in, read the help and stop complaining! :-)
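
For comparison, here is roughly what the named-capture-group alternative mentioned above could look like for the same four-line message. This is a Python sketch, not something Vim accepts; the group names are my own choice, picked to mirror %n, %l, %c and %m.

```python
import re

# Verbose pattern with one named group per errorformat item
# (%n -> number, %l -> line, %c -> column, %m -> message).
ERROR_BLOCK = re.compile(
    r"""
    ^Error\ (?P<number>\d+)\n
    line\ (?P<line>\d+)\n
    column\ (?P<column>\d+)\n
    (?P<message>.*)$
    """,
    re.MULTILINE | re.VERBOSE,
)

compiler_output = "Error 275\nline 42\ncolumn 3\n' ' expected after '--'\n"

match = ERROR_BLOCK.search(compiler_output)
fields = match.groupdict()
print(fields)
```

The regex is considerably longer than the errorformat string for the same job, which is more or less the point made above.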

I would recommend writing a post-processing filter for your compiler, that uses regular expressions or whatever, and outputs messages in a simple format that is easy to write an errorformat for it. Why learn some new, baroque, single-purpose language unless you have to?
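
A minimal sketch of such a filter in Python. The input format, path(line,col): message, is purely an assumption made for illustration, since the question does not show what nant actually prints; TOOL_LINE, simplify and simplify_all are made-up names.

```python
import re

# Hypothetical tool output such as:  src/Main.cs(12,5): error CS1002: ; expected
TOOL_LINE = re.compile(
    r"^(?P<file>[^()]+)\((?P<line>\d+),(?P<col>\d+)\):\s*(?P<msg>.*)$"
)


def simplify(line):
    """Rewrite one diagnostic as file:line:col: message; return None otherwise."""
    m = TOOL_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    return "{file}:{line}:{col}: {msg}".format(**m.groupdict())


def simplify_all(lines):
    """Keep only the lines that were recognised as diagnostics."""
    return [out for out in map(simplify, lines) if out is not None]


sample = ["Build started", "src/Main.cs(12,5): error CS1002: ; expected"]
print(simplify_all(sample))
```

Fed from sys.stdin and printed line by line, this becomes a pipe filter whose output can be parsed with a short errorformat such as %f:%l:%c:\ %m.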

According to :help quickfix,
it is also possible to specify (nearly) any Vim supported regular
expression in format strings.
However, the documentation is confusing and I didn't put much time into verifying how well it works and how useful it is. You would still need to use the scanf-like codes to pull out file names, etc.

They are a pain to work with, but to be clear: you can use regular expressions (mostly).
From the docs:
Pattern matching
The scanf()-like "%*[]" notation is supported for backward-compatibility
with previous versions of Vim. However, it is also possible to specify
(nearly) any Vim supported regular expression in format strings.
Since meta characters of the regular expression language can be part of
ordinary matching strings or file names (and therefore internally have to
be escaped), meta symbols have to be written with leading '%':
%\ The single '\' character. Note that this has to be
escaped ("%\\") in ":set errorformat=" definitions.
%. The single '.' character.
%# The single '*'(!) character.
%^ The single '^' character. Note that this is not
useful, the pattern already matches start of line.
%$ The single '$' character. Note that this is not
useful, the pattern already matches end of line.
%[ The single '[' character for a [] character range.
%~ The single '~' character.
When using character classes in expressions (see |/\i| for an overview),
terms containing the "\+" quantifier can be written in the scanf() "%*"
notation. Example: "%\\d%\\+" ("\d\+", "any number") is equivalent to "%*\\d".
Important note: The \(...\) grouping of sub-matches can not be used in format
specifications because it is reserved for internal conversions.

lol try looking at the actual vim source code sometime. It's a nest of C code so old and obscure you'll think you're on an archaeological dig.
As for why vim uses the C parser, there are plenty of good reasons starting with that it's pretty universal. But the real reason is that sometime in the past 20 years someone wrote it to use the C parser and it works. No one changes what works.
If it doesn't work for you the vim community will tell you to write your own. Stupid open source bastards.
