Load utf-8 encoded text into H2OFrame - text

I have a utf-8 encoded .csv-file that I load to H2O.ai in Python 3.7 using
h2o.load_dataset("my.csv")
The Scandinavian characters do not display correctly. The same problem persists if I save my H2OFrame to disk and open in an editor using utf-8. How can I make H2O.ai understand utf-8?
Many thanks.

I ran a quick test using the characters you provided and was able to get everything to display correctly on H2O-3 version 3.20.0.8 and Python 3.5, so hopefully newer versions also work.
In [7]: dd = ["Tässä vähän tekstiä åäö"]
In [8]: h2o.H2OFrame(dd)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[8]:
C1
-----------------------
Tässä vähän tekstiä åäö
[1 row x 1 column]
I also created a csv with the string as the first cell and it seemed to display correctly.
In [12]: hhf = h2o.import_file('Scandinavians.csv', header=-1)
Parse progress: |████████████████████████████████████████████████████████████████████████████| 100%
In [13]: hhf
Out[13]:
C1 C2 C3 C4
------ ----- ------- ----
Tässä vähän tekstiä åäö
[1 row x 4 columns].
(If these code snippets don't help, I can try to update my response.)
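If the characters still come out garbled, one thing worth ruling out is that the file on disk is not actually UTF-8. Below is a minimal sketch of such a check in plain Python. The helper name `is_valid_utf8` and the throwaway file names are made up for illustration; nothing here is part of the H2O API.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8 (a BOM is tolerated)."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return False
    return True


# Demo on throwaway files: the same text saved as UTF-8 and as Latin-1.
text = "Tässä vähän tekstiä åäö\n"
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "utf8.csv")
    bad = os.path.join(tmp, "latin1.csv")
    with open(good, "w", encoding="utf-8") as fh:
        fh.write(text)
    with open(bad, "w", encoding="latin-1") as fh:
        fh.write(text)
    utf8_ok = is_valid_utf8(good)
    latin1_ok = is_valid_utf8(bad)

print(utf8_ok, latin1_ok)
```

If the check fails for your own file, re-saving it as UTF-8 in the editor before importing is probably the first thing to try.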

Related

Remove non meaningful characters in pandas dataframe

I am trying to remove all characters of the form
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n\, \xe2\x80\xa6,\xe2\x80\x99t
from the strings below, which are in a pandas column. Although the text starts with b', it is a string.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The above content is a column in a pandas dataframe.
I have tried
string.encode('ascii', errors='ignore')
and regex, but without luck. Any suggestions would be helpful.
Your string looks like a byte string but is actually a regular string, so encode/decode doesn't work. Try something like this:
>>> df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)
0 b'Hello! End Climate Silence is looking for v...
1 b'I doubt if climate emergency 8s real, I thin...
2 b'No, thankfully it doesnt. Cant see how cheap...
3 b'Climate Change Poses a WidelllThreat to Nati...
4 b""This doesn't feel like targeted propaganda ...
5 b'berates climate change activist who confront...
Name: text, dtype: object
Note that you still have to clean up the unbalanced single/double quotes and remove the leading 'b' character.
You could go through your strings and keep only ascii characters:
my_str = "b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6"
new_str = "".join(c for c in my_str if c.isascii())
print(new_str)
Note that .encode('ascii', errors='ignore') doesn't change the string it's applied to; it returns the encoded result (a bytes object), so you have to assign it. This should work:
new_str = my_str.encode('ascii',errors='ignore')
print(new_str)
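One caveat about the two snippets above: in a normal Python string literal, `\xf0` is interpreted as a single non-ASCII character, whereas the dataframe in the question appears to hold a literal backslash followed by `xf0`. The sketch below assumes the latter (hence the raw string) and shows how the approaches behave in that case. The sample text is shortened from the question, and the last step only works when the quotes are balanced.

```python
import ast
import re

# Assumption: the column holds the *text* of a bytes repr, i.e. a real
# backslash followed by "x", "e", "2", ... (hence the raw string here).
literal = r"b'it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see\xe2\x80\xa6'"

# Every character is already ASCII, so filtering on isascii() changes nothing.
filtered = "".join(c for c in literal if c.isascii())
print(filtered == literal)

# Deleting the escape sequences as text, as in the regex answer, does work.
cleaned = re.sub(r"\\x[0-9a-f]{2}", "", literal)
print(cleaned)

# If the quotes are balanced, the repr can instead be parsed back into bytes
# and decoded, which recovers the original punctuation rather than dropping it.
decoded = ast.literal_eval(literal).decode("utf-8")
print(decoded)
```

Note that `str.isascii()` needs Python 3.7 or newer.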

Python 3.8 issue - print of float with 5 digits after dot - error in PyCharm?

I am using PyCharm, Python version is 3.8
I get the error below when I try to print a float (the sum 3.14 + 2.17) with 5 digits after the decimal point:
print(f'{test:.5f}')
^
SyntaxError: invalid syntax
Process finished with exit code 1
The corresponding code is:
test = 3.14 + 2.17
print(test)
print(f'{test:.5f}')
Do you have any idea why this happens? Changing "f" to "format" does not help; the issue persists. I explicitly changed the Python interpreter to 3.8 and removed 2.8 so that the "f" prefix is accepted by the syntax.
Thanks.
P.S. I have checked the code below on www.Repl.it and the print works as it should, so the issue lies within my setup:
test = 3.14+2.17
print(test)
print(f'{test:.5f}')
Try it like this:
test = 3.14+2.17
print(test)
print(f'{round(test, 5)}')
Or, better:
test = 3.14+2.17
print(test)
print(round(test, 5))
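For context, f-strings were added in Python 3.6, so a SyntaxError pointing at an f-string usually means that an older interpreter is actually running the script, whatever the IDE settings say. Here is a small sketch, not specific to PyCharm, that prints the version of the interpreter in use and formats the value with `str.format`, which also works on older versions:

```python
import sys

test = 3.14 + 2.17

# Shows which interpreter is really executing this file.
print(sys.version_info[:3])

# str.format does not need f-string support and keeps the trailing zeros.
fixed = "{:.5f}".format(test)
print(fixed)  # 5.31000

# round() returns a float, so trailing zeros are not printed.
rounded = round(test, 5)
print(rounded)  # 5.31
```

Note that `round(test, 5)` and the `.5f` format are not interchangeable: only the format specifier always prints exactly five digits after the decimal point.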

How to read a column as bytes?

I have a pandas DataFrame whereby a column consists of strings as follows
import pandas as pd
df = pd.DataFrame(...)
df
WORD
0 '0% de mati\xc3\xa8res grasses'
1 '115 apr\xc3\xa8s J.-C.'
For each individual string, I can decode it by writing it as a bytes literal, e.g. b'0% de mati\xc3\xa8res grasses'.decode("utf-8") and b'115 apr\xc3\xa8s J.-C.'.decode("utf-8"). How can I decode the whole column? I tried df['WORD'].astype('bytes').str.decode("utf-8"), but to no avail.
Thank you so much for your help!
It's hard to know what the initial encoding is, but it looks like latin-1:
df['WORD'].str.encode('latin-1').str.decode('utf-8')
0 0% de matières grasses
1 115 après J.-C.
Name: WORD, dtype: object
Since the output looks sensible, I'd say this is correct, but in general there's no surefire way to re-encode text whose original encoding is unknown.
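To see why the Latin-1 round trip works, here is the same idea on a single plain Python string (no pandas needed). This is only a sketch; the second half uses a made-up string to show the case where the trick does not apply.

```python
# The characters \xc3 and \xa8 here are the code points U+00C3 and U+00A8,
# i.e. UTF-8 bytes that were mis-read one byte per character.
garbled = "0% de mati\xc3\xa8res grasses"

# Latin-1 maps code points 0-255 to the same byte values, so encoding with it
# recovers the original bytes, which can then be decoded as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)

# The trick only applies when every character fits in Latin-1.
try:
    "caf\u20ac".encode("latin-1")  # the euro sign is outside Latin-1
    fits = True
except UnicodeEncodeError:
    fits = False
print(fits)
```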

h5py accuracy for matrix storage

I want to use python3-h5py to store a matrix in the .HDF5 format.
My problem is that when I compare the initial data to the data extracted from the HDF5 file, I get surprising differences.
import numpy
import h5py
# Create a vector of float64 values between 0 and 1
A = numpy.array(range(16384+1))/(16384+1)
# Save the corresponding float16 array to a HDF5 file
Fid = h5py.File("Output/Test.hdf5","w")
Group01 = Fid.create_group("Group")
Group01.create_dataset("Data", data=A, dtype='f2')
# Group01.create_dataset("Data", data=A.astype(numpy.float16), dtype='f2')# Use that line to avoid the bug
Fid.flush()
Fid.close()
# Read the HDF5 file
Fid = h5py.File("Output/Test.hdf5",'r')
B = Fid["Group/Data"][:]
Fid.close()
# Compare float64 and float16 Values
print(A[8192])
print(B[8192])
print("")
print(A[8192+1])
print(B[8192+1])
print("")
print(A[16384])
print(B[16384])
This gives:
0.499969484284
0.25
0.500030515716
0.5
0.999938968569
0.5
Sometimes I get a difference of about "0.00003" and sometimes "0.4999".
Normally, I should always get a difference of about "0.00003", which corresponds to float16 rounding for a value between 0 and 1.
But the "0.4999" difference is really unexpected. I have noticed that it happens for values close to a power of 2 (for example, "~1/2" will be stored as "~1/4").
Is this a bug in the h5py package?
Thanks in advance,
Stéphane,
[Xubuntu 17.09 64bits + python3-h5py v2.7.1-2 + python3 v3.6.3-0ubuntu2]
I am not fully sure that this can be considered an answer, but I finally got rid of my problem with a small workaround.
To sum it up, it looks like there is a bug in "h5py v2.7.1-2".
When using h5py to store arrays, don't use a command like this:
`Group01.create_dataset("Data", data=A, dtype='f2')# Buggy command`
But instead:
`Group01.create_dataset("Data", data=A.astype(numpy.float16), dtype='f2')`
Edit (18 Nov 2022): with h5py==3.7.0 the bug is fixed.
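As a cross-check on what correct float16 rounding should give for the three values printed in the question, here is a sketch that uses only the standard library (`struct` format `'e'` is IEEE 754 half precision, available since Python 3.6). The helper name `to_float16` is made up for illustration; this does not touch h5py at all.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


n = 16384 + 1
results = {}
for i in (8192, 8193, 16384):
    exact = i / n
    half = to_float16(exact)
    results[i] = half
    # Print the exact value, its float16 rounding and the absolute error.
    print(exact, half, abs(exact - half))
```

The rounded values come out as 0.5, 0.5 and 1.0, with errors below 1e-4, so the 0.25 and the final 0.5 in the question's output are not explained by float16 precision.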

using split() to split values in an entire column in a python dataframe

I am trying to clean a list of URLs that have garbage appended, as shown.
/gradoffice/index.aspx(
/gradoffice/index.aspx-
/gradoffice/index.aspxjavascript$
/gradoffice/index.aspx~
I have a CSV file with over 190k records of different URLs. I loaded the CSV into a pandas dataframe and took the entire column of URLs using the statement
str = df['csuristem']
This clearly gave me all the values in the column. However, when I use the following code, it prints only about 40k records and starts somewhere in the middle. I don't know where I am going wrong. The program runs without errors but shows only part of the results. Any help would be much appreciated.
import pandas
table = pandas.read_csv("SS3.csv", dtype=object)
df = pandas.DataFrame(table)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print(s)
I am looking to get output like this:
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
Thank you,
Santhosh.
Call .str.split on the column and then use .str[0] to take the first part of each split string:
In [6]:
df['csuristem'].str.split('.').str[0]
Out[6]:
0 /gradoffice/index
1 /gradoffice/index
2 /gradoffice/index
3 /gradoffice/index
Name: csuristem, dtype: object
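For reference, here is what the vectorized call computes per element, sketched in plain Python on the four sample URLs from the question. The second list is a variant that keeps the dot, since the desired output in the question ends with one; `urls`, `stems` and `with_dot` are just illustrative names.

```python
urls = [
    "/gradoffice/index.aspx(",
    "/gradoffice/index.aspx-",
    "/gradoffice/index.aspxjavascript$",
    "/gradoffice/index.aspx~",
]

# Equivalent of .str.split('.').str[0]: everything before the first dot.
stems = [u.split(".")[0] for u in urls]
print(stems)

# Variant that keeps the dot: partition returns (head, separator, tail),
# and joining the first two parts gives the head plus the dot if present.
with_dot = ["".join(u.partition(".")[:2]) for u in urls]
print(with_dot)
```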
