I pasted some text from a text editor (Atom) into IPython, and it was rendered as I saw it in the editor, but some special characters appeared too. These are light-blue caret-capital-I sequences (^I). They seem to represent indentation. Indeed, when I inspect the string by index slices, they show up as tab characters (\t).
What is this symbol's name? I tried to find it using unicodedata.name('^I'), but it raised a ValueError: no such name.
If anyone knows where I can find a table of characters by their string representation, that would save me a lot of time. The unicode.org source cited in the SO post above does not allow that. I am looking for something like this, but with ^I.
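For context, ^I is caret notation: control character number N is displayed as ^ followed by the character with code N + 64, so tab (code 9) becomes ^I. Below is a minimal Python sketch of that mapping. The helper names caret_notation and visible are made up for illustration, and the sketch assumes only ASCII control characters need translating.

```python
import unicodedata


def caret_notation(ch):
    """Return the caret form (e.g. ^I) of an ASCII control character."""
    code = ord(ch)
    if code < 0x20 or code == 0x7F:
        # Flipping bit 6 maps 0x09 (tab) to 0x49 ('I') and 0x7F (DEL) to '?'.
        return "^" + chr(code ^ 0x40)
    return ch


def visible(text):
    """Render every control character in text in caret notation."""
    return "".join(caret_notation(ch) for ch in text)


print(visible("a\tb"))
# Control characters have no name in unicodedata, so pass a default
# instead of letting name() raise ValueError.
print(unicodedata.name("\t", "<no name>"))
```

Passing a default as the second argument to unicodedata.name avoids the ValueError for characters that have no name.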
When I try to paste a Unicode character into the Text entity as text, it comes up with the error:
:text(warning): No definition in for character U+21d5
I have also tried to use the Unicode string itself inside the text string.
Anyone know how to do it?
What did I want to do?
I was reading file names that end in various organ names; there are many such files, and I collect them using glob.glob('filename/**/blabla').
Later, I tried to check whether a particular string is present in the filename using the in operator, like this:
"ADRENALGLAND(LEFT).NRRD" in "blabla/blabla/blabla/blablabla_ADRENALGLAND(LEFT).NRRD"
It worked for some filenames with the same ending but not for a few others.
To debug, I checked whether two visually identical filename endings from two files are also identical programmatically, and they are not! Why?
To debug further, I compared string to string, as below, and saw a peculiar thing while comparing strings in Python.
Can anyone tell me what the difference is here?
'ADRENALGLAND(LEFT).NRRD' == 'АDRENALGLAND(LEFT).NRRD' => False !!!
I narrowed it down to the leading 'A', which does not match, whereas the other characters match properly.
As mentioned by @canbax, I checked the underlying code point of both characters and found that they are different. One gave 65 (the ASCII code for the Latin capital letter 'A'), whereas the other gave 1040.
You can use ord() to get the integer Unicode code point of a character.
Although the int values are different, the two characters look the same visually, which I thought might be an issue on the Jupyter notebook side.
Final Solution: Replaced the fancy A with the normal A in the file.
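Code point 1040 is U+0410, CYRILLIC CAPITAL LETTER A, which most fonts draw identically to the Latin 'A'. Below is a small sketch for spotting such look-alikes. The helper name non_ascii_report is made up for illustration, and the sketch assumes the filenames are expected to be pure ASCII.

```python
import unicodedata


def non_ascii_report(text):
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# \u0410 is written as an escape so the look-alike cannot be lost in copy/paste.
suspect = "\u0410DRENALGLAND(LEFT).NRRD"
print(ord("A"), ord(suspect[0]))
print(non_ascii_report(suspect))
```

Running the report over each result of glob.glob would show which filenames carry the stray character before they are renamed.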
I have an issue with doing a search.
My data has a space like this (between Ground and the Hyphen)
Ground -
but I can't replicate it. It isn't a normal space or a tab, and I have tried all the zero-width space characters I know. Most text styling assumes it is a normal space and converts it to one (including StackOverflow).
The only way I can show that it even exists is through Notepad++.
As you can see, normal spaces have an orange dot but this one does not. Furthermore, Notepad++ treats it as part of the word.
Unfortunately, I can't change the source data, so I can't change these weird spaces to normal spaces. I therefore need a search that includes them in order to return results. I just don't know how to replicate that space.
Any ideas would be most helpful.
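One way to identify such a character is to print the code point of every character in the value, then search with a pattern that accepts any space-like character wherever the query has a space. The sketch below assumes the data can be loaded into Python and uses a no-break space (U+00A0) purely as a stand-in, since the actual character is unknown. The names describe, SPACE_LIKE and loose_pattern are made up for illustration.

```python
import re
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# \s already matches Unicode whitespace such as U+00A0; the zero-width
# characters listed here are not whitespace, so they are added explicitly.
SPACE_LIKE = r"[\s\u200b\u200c\u200d\u2060\ufeff]"


def loose_pattern(query):
    """Compile query so each plain space matches one or more space-like characters."""
    parts = [SPACE_LIKE + "+" if ch == " " else re.escape(ch) for ch in query]
    return re.compile("".join(parts))


sample = "Ground\u00a0-"  # stand-in data: the real character may differ
print(describe(sample)[6])
print("Ground -" in sample)
print(bool(loose_pattern("Ground -").search(sample)))
```

Once describe reveals the actual code point, the same escape (for example \u00a0) can be typed into any search that accepts Unicode escapes.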
I know there are some threads about this on Stack Overflow, but when I type ":set list" in the editor, it displays some hidden characters but not the ones in the code we are having problems with.
For some time now we have had invisible symbols in our code that make if loops break. I don't know how the symbols get there, except that some weird keyboard combination must have been typed accidentally. The code itself looks correct, but the invisible symbol breaks it.
I have searched online about this, but all I can find is the ":set list" command in Vim, plus changing the color of the hidden characters. While this displays some hidden characters, it doesn't display the problematic ones. We are getting two symbols, one of which looks like a cross and the other like a pistol. We have also tried adding the "draw_white_space" setting in Sublime Text, but this only seems to display, well, whitespace, as the name says. It came up on Google as a way to show hidden characters, so I gave it a try.
The only way we have been able to see where the symbols are is with the DiffMerge tool; we have not been able to see them in any other editor. We have been able to copy a symbol into its own file and grep through all the files with grep's -f option, which works, but it would be easier to display the characters in Vim using a keybinding.
Does anyone have any suggestions? This makes us spend a lot more time debugging code when the problem is an invisible symbol.
Try the following search command:
/[^ -~<09>]
(you get the <09> by pressing the tab key). Or, if you also want to catch those nasty tabs, just:
/[^ -~]
That will find and highlight any non-ASCII or control-ASCII character.
If you still have hidden characters out there, you can try this command before the search:
:set enc=latin1
That will prevent any weird Unicode characters from showing up in your code.
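The same character class can be applied outside Vim, for instance to scan many files at once. This is a minimal Python sketch of the first search above (everything except printable ASCII and tab). The function name find_suspicious is made up for illustration, and the sketch assumes the source has already been decoded into a str.

```python
import re

# Same class as the Vim search: anything that is not printable ASCII or a tab.
NON_PRINTABLE = re.compile(r"[^ -~\t]")


def find_suspicious(text):
    """Return (line, column, code point) for each suspicious character, 1-based."""
    hits = []
    # Split on "\n" only: str.splitlines() also splits on some control
    # characters, which would hide exactly the ones being hunted.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in NON_PRINTABLE.finditer(line):
            hits.append((lineno, match.start() + 1, f"U+{ord(match.group()):04X}"))
    return hits


source = "if x:\n    y\u200b = 1\tok\n"
print(find_suspicious(source))
```

Note that a carriage return from a CRLF line ending is reported as well, which may or may not be wanted.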
When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2), a common problem is that special characters are pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
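Both questions can also be explored from a Python prompt. The sketch below prints the code point, Unicode name and UTF-8 bytes of a character, and tries a list of candidate encodings on raw bytes. The helper names char_info and try_decodings and the candidate list are assumptions for illustration. A successful decode only means the bytes are legal in that encoding, not that it is the right one.

```python
import unicodedata


def char_info(ch):
    """Return the code point, Unicode name and UTF-8 bytes (hex) of one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "utf8": " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
    }


def try_decodings(raw, candidates=("utf-8", "utf-16", "cp1252")):
    """Map each candidate encoding that decodes raw without error to its result."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            pass  # raw is not legal in this encoding
    return results


print(char_info("\u2019"))  # the right single quotation mark from the question
print(char_info("\u2665"))  # the heart from the question
# UTF-16 little-endian bytes with a byte order mark: not legal UTF-8.
print(try_decodings(b"\xff\xfeh\x00i\x00"))
```

Once a candidate looks right, calling .encode("utf-8") on the decoded string gives the UTF-8 form.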
At &what I built a tool to focus on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.