How to read Greek characters in pandas? - python-3.x

I am dealing with a dataframe that has Greek characters. They appear like that:
The data are here:
toy.to_json()
'{"a_a":{"0":49.0,"1":50.0,"2":52.0,"3":53.0,"4":54.0},"grade":{"0":3.0,"1":5.0,"2":4.0,"3":5.0,"4":4.0},"sex":{"0":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","1":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","2":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","3":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","4":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9"},"age":{"0":122.0,"1":125.0,"2":119.0,"3":122.0,"4":127.0},"fath_job":{"0":2.0,"1":2.0,"2":2.0,"3":2.0,"4":2.0},"phscs":{"0":49.0,"1":73.0,"2":61.0,"3":75.0,"4":59.0},"pcc":{"0":10.0,"1":26.0,"2":19.0,"3":28.0,"4":23.0},"pcg":{"0":21.0,"1":28.0,"2":20.0,"3":25.0,"4":19.0},"tasc":{"0":17.0,"1":5.0,"2":17.0,"3":8.0,"4":11.0},"class":{"0":0.0,"1":0.0,"2":0.0,"3":0.0,"4":0.0},"grade3":{"0":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00ef\\u00f2","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","2":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","4":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2"},"pcc3":{"0":"\\u00f7\\u00e1\\u00ec\\u00e7\\u00eb\\u00de","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"},"tasc3":{"0":3.0,"1":1.0,"2":3.0,"3":2.0,"4":2.0},"pcg3":{"0":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"},"phscs3":{"0":"\\u00f7\\u00e1\\u00ec\\u00e7\\u00eb\\u00de","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"}}'
I tried to import the file with encoding = 'utf_8' but it did not work.
Here are some other approaches I tried:
toy.to_csv('toy.csv', index = False)
import chardet
rawdata = open('toy.csv', 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
pd.read_csv('toy.csv', encoding = charenc)
pd.read_csv('toy.csv', encoding = 'cp737')

Try something like this:
pd.read_csv('toy.csv', encoding = 'iso8859_7')
It solved the problem for me, I hope it's the same for you.

On Mac (macOS Catalina 10.15.7) this works:
gdf = gpd.read_file("./../data/processed/landmarks.shp", encoding="mac_greek")
Apparently, you have to find the right encoding for your system. A list of python encodings can be found in the documentation. However, it is still not perfect. I still have some random characters. I do not know it that is due to the data source or because the list refers to python 2.
Greek encoded geo pandas data

Related

Pass encoding option in Pyspark text method

I have fixed length file encoded in ISO-8859-1.Spark 2.4 is not honoring encoding passed as option.
Below is the sample code(chars got corrupted).
g_df = g_spark.read.option("encoding", "ISO-8859-1").text(loc)
g_df.repartition(1).write.csv(path=loc, header="true", mode="overwrite", encoding="ISO-8859-1")
Howeve, when I read it as csv file ,chars are stored as expected.
g_df = g_spark.read.option("encoding", "ISO-8859-1").csv(loc)
g_df.repartition(1).write.csv(path=loc, header="true", mode="overwrite", encoding="ISO-8859-1")
This looks like spark does not support encoding for text method.
As this is fixed length file, so I cannot use csv method.
Could you please suggest a way out
Text method does not honor encoding, so I have tried RDD APIS, which is working for me.
Below is the sample code
encoded_rdd = sc.textFile(loc, use_unicode=False).map(lambda x: x.decode("iso-8859-1"))
encoded_df = encoded_rdd.map(Row("val")).toDF()

How to get packet.tcp.payload and packet.http.data as string?

The return value for these attributes are in hex format seperated by ':'
Eg : 70:79:f6:2e: something like this.
When I am trying to decode it to plain string ( human readable ) it doesn't work. What encoding is being used? I tried various different methods like codecs.decode(), binascii.unhexlify(), bytes.fromhex() also different encodings ASCII and UTF-8. Nothing worked, any help is appreciated. I am using python 3.6
Thanks for your question! I believe you're wanting to read the payload in chunks of two hex places. The functions you tried are not able to parse the : delimiter out-of-the-box. Something like splitting the string by the : delimiter, converting their values to human-readable characters, and joining the "list" to a string should do the trick.
hex_string = '70:79:f6:2e'
hex_split = hex_string.split(':')
hex_as_chars = map(lambda hex: chr(int(hex, 16)), hex_split)
human_readable = ''.join(hex_as_chars)
print(human_readable)
Is this what you have in mind?

How to use python to convert a backslash in to forward slash for naming the filepaths in windows OS?

I have a problem in converting all the back slashes into forward slashes using Python.
I tried using the os.sep function as well as the string.replace() function to accomplish my task. It wasn't 100% successful in doing that
import os
pathA = 'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Expected Output:
'V:/Gowtham/2019/Python/DailyStandup.txt'
Actual Output:
'V:/Gowtham\x819/Python/DailyStandup.txt'
I am not able to get why the number 2019 is converted in to x819. Could someone help me on this?
Your issue is already in pathA: if you print it out, you'll see that it already as this \x81 since \201 means a character defined by the octal number 201 which is 81 in hexadecimal (\x81). For more information, you can take a look at the definition of string literals.
The quick solution is to use raw strings (r'V:\....'). But you should take a look at the pathlib module.
Using the raw string leads to the correct answer for me.
import os
pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
OutPut:
V:/Gowtham/2019/Python/DailyStandup.txt
Try this, Using raw r'your-string' string format.
>>> import os
>>> pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt' # raw string format
>>> newpathA = pathA.replace(os.sep,'/')
Output:
>>> print(newpathA)
V:/Gowtham/2019/Python/DailyStandup.txt

Problem with multivariables in string formatting

I have several files in a folder named t_000.png, t_001.png, t_002.png and so on.
I have made a for-loop to import them using string formatting. But when I use the for-loop I got the error
No such file or directory: '/file/t_0.png'
This is the code that I have used I think I should use multiple %s but I do not understand how.
for i in range(file.shape[0]):
im = Image.open(dir + 't_%s.png' % str(i))
file[i] = im
You need to pad the string with leading zeroes. With the type of formatting you're currently using, this should work:
im = Image.open(dir + 't_%03d.png' % i)
where the format string %03s means "this should have length 3 characters and empty space should be padded by leading zeroes".
You can also use python's other (more recent) string formatting syntax, which is somewhat more succinct:
im = Image.open(f"{dir}t_{i:03d}")
You are not padding the number with zeros, thus you get t_0.png instead of t_000.png.
The recommended way of doing this in Python 3 is via the str.format function:
for i in range(file.shape[0]):
im = Image.open(dir + 't_{:03d}.png'.format(i))
file[i] = im
You can see more examples in the documentation.
Formatted string literals are also an option if you are using Python 3.6 or a more recent version, see Green Cloak Guy's answer for that.
Try this:
import os
for i in range(file.shape[0]):
im = Image.open(os.path.join(dir, f't_{i:03d}.png'))
file[i] = im
(change: f't_{i:03d}.png' to 't_{:03d}.png'.format(i) or 't_%03d.png' % i for versions of Python prior to 3.6).
The trick was to specify a certain number of leading zeros, take a look at the official docs for more info.
Also, you should replace 'dir + file' with the more robust os.path.join(dir, file), which would work regardless of dir ending with a directory separator (i.e. '/' for your platform) or not.
Note also that both dir and file are reserved names in Python and you may want to rename your variables.
Also check that if file is a NumPy array, file[i] = im may not be working.

Error in reading a ascii encoded csv file?

I have a csv file named Qid-NamedEntityMapping.csv having data like this:
Q1000070 b'Myron V. George'
Q1000296 b'Fred (footballer, born 1979)'
Q1000799 b'Herbert Greenfield'
Q1000841 b'Stephen A. Northway'
Q1001203 b'Buddy Greco'
Q100122 b'Kurt Kreuger'
Q1001240 b'Buddy Lester'
Q1001867 b'Fyodor Stravinsky'
The second column is 'ascii' encoded, and when I am reading the file using the following code, then also it not being read properly:
import chardet
import pandas as pd
def find_encoding(fname):
r_file = open(fname, 'rb').read()
result = chardet.detect(r_file)
charenc = result['encoding']
return charenc
my_encoding = find_encoding('datasets/KGfacts/Qid-
NamedEntityMapping.csv')
df = pd.read_csv('datasets/KGfacts/Qid-
NamedEntityMapping.csv',error_bad_lines=False, encoding=my_encoding)
But the output looks like this:
Also, I tried to use encoding='UTF-8'. but still, the output is the same.
What can be done to read it properly?
Looks like you have an improperly saved TSV file. Once you circumvent the TAB problem (as suggested in my comment), you can convert the column with names to a more suitable representation.
Let's assume that the second column of the dataframe is called "names". The b'XXX' thing is probably a bytes [mis]representation of a string. Convert it to a bytes object with ast.literal_eval and then decode to a string:
import ast
df["names"].apply(ast.literal_eval).apply(bytes.decode)
#0 Myron...
#1 Fred...
Last but not least, your problem has almost nothing to do with encodings or charsets.
Your issue looks like the CSV is actually tab separated; so you need to have sep='\t' in the read_csv function. It's reading everything else as a single column, except "born 1979" in the first row, as that is the only cell with a comma in it.

Resources