Any idea why I am getting a length of 6 instead of 5?
I created a file called björn-100.png and ran the code using python3:
import os
for f in os.listdir("."):
    p = f.find("-")
    name = f[:p]
    print("name")
    print(name)
    length = len(name)
    print(length)
    for a in name:
        print(a)
prints out the following:
name
björn
6
b
j
o
̈
r
n
instead of printing out
name
björn
5
b
j
ö
r
n
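If you're using Python 2.7, you can decode the file name as UTF-8 first:
length = len(name.decode('utf-8'))
But since you're using Python 3, where a str cannot be decoded as if it were a byte string, I recommend using unicodedata to normalize the string. Your output shows the name stored in decomposed form, with "o" followed by a separate combining diaeresis, which is why it counts as 6 characters; NFC normalization combines that pair into a single "ö".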
import unicodedata
length = len(unicodedata.normalize('NFC', name))
To get the string itself with the two dots composed into the o character:
import unicodedata
name = unicodedata.normalize('NFC', name)
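A minimal sketch of the effect, assuming the file name arrives in decomposed form as the question's output suggests. The string is built by hand here (the variable names are only for illustration) so the example does not depend on any file on disk:

```python
import unicodedata

# 'o' followed by U+0308 COMBINING DIAERESIS, as in the question's output
decomposed = "bjo\u0308rn"

# NFC composes the base letter and the combining mark into one code point
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 6
print(len(composed))    # 5
```

Both strings look the same when printed; only the number of code points differs.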
Related
The following works:
import re
text = "I\u2019m happy"
text_p = text
text_p = re.sub("[\u2019]","'",text_p)
print(text_p)
Output: I'm happy
This doesn't work:
import re
import pandas as pd
training_data = pd.read_csv('train.txt')
text = training_data['tweet_text'][0] # Assume that this returns a string "I\u2019m happy"
text_p = text
text_p = re.sub("[\u2019]","'",text_p)
print(text_p)
Output: I\u2019m happy
I tried running your code and got I'm happy back from both the string and the list item when passing each into re.sub(...) as outlined in your question.
If you're just looking to parse (decode) the Unicode characters, you probably don't need re at all. Something like the code below could be used to decode them without running re to check each possibility.
text = training_data['tweet_text'][0]
if isinstance(text, str):  # str: encode to a UTF-8 byte string, then decode back to str
    text = text.encode()
    text = text.decode()
elif isinstance(text, bytes):  # bytes: just decode to str
    text = text.decode()
else:  # neither str nor bytes: report it on the console
    print("Value not recognised as str or bytes!")
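One possible explanation, which is an assumption because the question does not show the raw file: the CSV may hold the six literal characters backslash, u, 2, 0, 1, 9 rather than the apostrophe itself. In that case the pattern "[\u2019]" has nothing to match. A rough sketch that first turns such literal escapes into real characters (decode_u_escapes is a hypothetical helper name, not part of any library):

```python
import re

def decode_u_escapes(text):
    # Replace each literal backslash-u-XXXX sequence with the character it names.
    return re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)

# Raw string: the backslash and the following "u2019" stay as separate characters,
# imitating what might be stored in the file.
raw = r"I\u2019m happy"

decoded = decode_u_escapes(raw)            # now contains the real U+2019 character
fixed = re.sub("[\u2019]", "'", decoded)   # the original substitution now matches
print(fixed)
```

If the column already contains the real character, the helper leaves the text unchanged and the substitution works as in the first snippet.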
I have been stuck on this one for a while and could use some help. I have been trying to fix this code and I keep getting an error about invalid syntax. Here is the code; I need to convert the inputs from str or int to float.
# Input from the command line
import sys
A = sys.argv[1]
B = sys.argv[2]
C = sys.argv[3]
# Your code goes here
num = A * (B + C / 3)
# Outputs
print (float(num))
By default, Python receives command-line arguments as str. So you must convert them to floats before applying any arithmetic to them:
import sys
A = float(sys.argv[1])
B = float(sys.argv[2])
C = float(sys.argv[3])
.....
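Put together, the conversion and the formula from the question might look like the sketch below. The function name compute is hypothetical, and the arguments are passed as a list so the example can be tried without a command line; in a script you would call compute(sys.argv[1:4]) instead.

```python
def compute(args):
    # Convert the three string arguments to float before doing arithmetic.
    a, b, c = (float(x) for x in args[:3])
    return a * (b + c / 3)

# Example with the same kind of values sys.argv would supply (always strings)
print(compute(["2", "3", "3"]))  # 8.0
```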
I read some value from Windows Registry (SAM) with Python3. As far as I can tell it looks like hex encoded bytes:
>>> b = b'A\x00d\x00m\x00i\x00n\x00i\x00s\x00t\x00r\x00a\x00t\x00o\x00r\x00'
>>> print(b)
A d m i n i s t r a t o r
Now how would I convert that to a String (should be "Administrator")? Using "print" just gives me "A d m i n i s t r a t o r". How to do the conversion correctly without using dirty tricks?
b = b'A\x00d\x00m\x00i\x00n\x00i\x00s\x00t\x00r\x00a\x00t\x00o\x00r\x00'
b = b.replace(b'\x00', b'')
print(b)
# b'Administrator'
I probably should have used UTF-16 decoding:
>>> b = b'A\x00d\x00m\x00i\x00n\x00i\x00s\x00t\x00r\x00a\x00t\x00o\x00r\x00'
>>> print(b.decode('utf-16'))
Administrator
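A small variation, offered as a sketch: the bytes shown have the low byte first and no byte order mark, so naming the little-endian codec explicitly avoids relying on whatever default byte order 'utf-16' picks when no mark is present.

```python
raw = b'A\x00d\x00m\x00i\x00n\x00i\x00s\x00t\x00r\x00a\x00t\x00o\x00r\x00'

# 'utf-16-le' states the byte order explicitly and expects no byte order mark
text = raw.decode('utf-16-le')
print(text)  # Administrator
```

Unlike stripping the \x00 bytes, this also handles characters outside ASCII, whose second byte is not zero.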
SORRY!
I am trying to insert every letter of the alphabet at every position in a string, one at a time. This is the code:
from string import ascii_lowercase
var = 'abc'
for i in ascii_lowercase:
    result = [var[:j] + i + var[j:] for j in range(len(var))]
But this is what I am getting:
['zabc', 'azbc', 'abzc']
This is what I am expecting:
['aabc', 'abac', 'abca','babc','abbc','abcb'...]
Does anyone know how to fix this? Thanks.
Your loop assigns a new list to result on every iteration, so only the list for 'z' survives. You can build the whole list at once using a nested list comprehension:
from string import ascii_lowercase
var = 'abc'
result = [var[:n]+c+var[n:] for c in ascii_lowercase for n in range(len(var)+1)]
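If you would rather keep the explicit loop, an equivalent sketch is to create the list once and extend it on each pass instead of reassigning it:

```python
from string import ascii_lowercase

var = 'abc'
result = []                       # created once, outside the loop
for c in ascii_lowercase:
    # append this letter's insertions instead of overwriting the list
    result.extend(var[:n] + c + var[n:] for n in range(len(var) + 1))

print(len(result))   # 104
print(result[:4])    # ['aabc', 'aabc', 'abac', 'abca']
```

Using len(var) + 1 includes the position after the last character. Inserting a letter next to an identical one produces duplicates such as the two 'aabc' entries; wrap the result in set(...) if those are unwanted.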
I am trying to get this spellchecker I came across online to work, but no luck. Any help would be appreciated. Original code from http://norvig.com/spell-correct.html
import re, collections, codecs
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
file = codecs.open('C:\88888\88888\88888\88888\8888\A Word.txt', encoding='utf-8', mode='r')
NWORDS = train(words(file.read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Error:
File "C:\8888\8888\8888\8888\88888\SpellCheck.py", line 11
file = codecs.open('C:\888\888\888\8888\88888\A Word.txt', encoding='utf-8', mode='r')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
OK, try this:
take a string value '\x' and try to do something with it,
or try
string('\x.....')
That returns your error, right?
So if you have a string defined as, say,
x = string('\y\o\u \c\a\n \n\e\v\e\r \c\h\a\n\g\e \t\h\i\s \i\n \p\y\t\h\o\n')
then you are just out of luck.
It will be a bummer if the user decides to type a '\' as any character of the input.
To fix the problem, you could try some looping or recursive code like:
How to remove illegal characters from path and filenames?
C:\88888\88888\88888\88888\8888\A Word.txt - that's the strangest path I've seen this year :)
Try replacing it with C:\\88888\\88888\\88888\\88888\\8888\\A Word.txt
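The underlying reason is that in an ordinary Python 3 string literal a backslash starts an escape sequence, and \U in particular must be followed by eight hex digits. A short sketch of the usual ways around it; the path below is hypothetical, since the original one is anonymised:

```python
# Hypothetical path, for illustration only.
escaped = 'C:\\Users\\example\\A Word.txt'   # every backslash doubled
raw = r'C:\Users\example\A Word.txt'         # raw string literal: backslashes kept as typed
forward = 'C:/Users/example/A Word.txt'      # forward slashes are also accepted on Windows

print(escaped == raw)  # True
```

Any of the three forms can be passed to codecs.open in place of the original literal.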