convert ascii to decimal python - python-3.x

I have a pandas DataFrame where one of the columns is filled with ASCII characters encoded as hex. I'm trying to convert this column from hex-encoded ASCII to decimal; for example, the following hex string:
313533313936393239382e323834303638
to:
1531969298.284068
I've tried
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode())
as well as
outf['data'] = outf['data'].map(lambda x: ascii.fromhex(x).decode())
The error that I get is as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 8: invalid start byte
I'm not sure where the problem comes from. I have a txt file, and a sample of its contents is as follows:
data time
313533313936393239382e32373737343800 1531969299.283273000
313533313936393239382e32373838303400 1531969299.284253000
313533313936393239382e32373938353700 1531969299.285359000
When the data was normal integers, the lambda worked fine where I used:
outf['data'] = outf['data'].astype(str)
outf['data'] = outf['data'].str[:-2:]
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode())
outf['data'] = outf['data'].astype(int)
However, now it says there's something wrong with the encoding.
I've looked on Stack Overflow, but I wasn't able to find anything similar that worked. If someone were to help me out, I would very much appreciate it.

You can use map with a lambda function for bytearray.fromhex and astype to convert to float.
outf['data'].map(lambda x: bytearray.fromhex(x).decode()).astype(float)
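Putting that together with the trailing "00" stripping from the question, a runnable sketch (the sample value comes from the txt file above; outf is reconstructed here as a one-row frame):
import pandas as pd

# One row from the question's txt file; the trailing "00" is a null byte.
outf = pd.DataFrame({'data': ['313533313936393239382e32373737343800']})
outf['data'] = outf['data'].str[:-2]  # drop the trailing "00"
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode()).astype(float)
print(outf['data'][0])  # 1531969298.277748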

Such a lambda would do the trick:
>>> f = lambda v: float(bytearray.fromhex(v))
>>> f('313533313936393239382e323834303638')
1531969298.284068
Note that using numpy's astype, as hinted by Scott Boston in the comment section, may be better performance-wise.

Related

python - numpy array gets imported as str using read_csv()

I am importing a df like
dbscan_parameter_search = pd.read_csv('./src/temp/05_dbscan_parameter_search.csv',
                                      index_col=0)
type(dbscan_parameter_search['clusters'][0])
which results in str.
How can I keep the datatype as numpy array? I tried
dbscan_parameter_search = pd.read_csv('./src/temp/05_dbscan_parameter_search.csv',
                                      dtype={'clusters': np.int32},
                                      index_col=0)
which results in ValueError: invalid literal for int() with base 10: '[ 0 1 2 ... 139634 139634 139636]'.
Any hint? Thanks
Thanks to hpaulj, the problem was the export rather than the import: the string was literally saved with dots. For a quick workaround see here:
python - how to to_csv() with an column of arrays
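A minimal sketch of that workaround (the file path follows the question; the round trip itself is an assumption, not code from the thread): store the array column as plain Python lists so to_csv writes them with commas, then parse them back on import.
import ast
import numpy as np
import pandas as pd

# Save: lists serialize as '[0, 1, 2]', which can be parsed back later.
df = pd.DataFrame({'clusters': [np.array([0, 1, 2])]})
df['clusters'] = df['clusters'].map(lambda a: a.tolist())
df.to_csv('./src/temp/05_dbscan_parameter_search.csv')

# Load: parse each string back into a numpy array.
back = pd.read_csv('./src/temp/05_dbscan_parameter_search.csv', index_col=0)
back['clusters'] = back['clusters'].map(lambda s: np.array(ast.literal_eval(s)))
print(type(back['clusters'][0]))  # <class 'numpy.ndarray'>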

Pandas read_csv not working when items missing from last column

I'm having problems reading in the following three seasons of data (all the seasons after these ones load without problem).
import pandas as pd
import itertools
alphabets = ['a','b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 3)]
col_names = keywords[:57]
seasons = [2002, 2003, 2004]
for season in seasons:
    df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names).dropna(how='all')
This gives the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 57 fields in line 337, saw 62
I have looked on Stack Overflow for problems that have a similar error code (see below), but none seem to offer a solution that fits my problem.
Python Pandas Error tokenizing data
I'm pretty sure the error is caused when there is missing data in the last column, but I don't know how to fix it. Can someone please explain how to do this?
Thanks
Baz
UPDATE:
The amended code now works for seasons 2002 and 2003. However 2004 is now producing a new error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Following option 2 from Serge Ballesta's answer at the link below:
UnicodeDecodeError when reading CSV file in Pandas with Python
df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names, encoding = "latin1").dropna(how='all')
With the above amendment the code also works for season=2004.
I still have two questions though:
Q1.) How can I find which character/s were causing the problem in season 2004?
Q2.) Is it safe to use the 'latin1' encoding for every season even though they were originally encoded as 'utf-8'?
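For Q1, a sketch (not from the original thread): fetch the raw bytes and let UTF-8 decoding report where it fails. The URL follows the pattern from the question for season 2004.
import urllib.request

url = "https://www.football-data.co.uk/mmz4281/0405/E0.csv"
raw = urllib.request.urlopen(url).read()
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    # Offset, offending byte, and a little surrounding context.
    print(e.start, hex(raw[e.start]), raw[max(0, e.start - 20):e.start + 20])
As for Q2, latin1 maps every possible byte to a character, so it never raises, and it agrees with utf-8 on all ASCII characters; only the few non-ASCII bytes could come out as a different character.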

Convert All Items in a Dataframe to Float

I am trying to convert all items in my dataframe to a float. The types vary at the moment. The following error persists -> ValueError: could not convert string to float: '116,584.54'
The file can be found at https://www.imf.org/external/pubs/ft/weo/2019/01/weodata/WEOApr2019all.xls
I checked the value in Excel; it is a Number. I tried .replace, .astype, pd.to_numeric.
for i in weo['1980']:
    if i == float:
        print(i)
        i.replace(",", '')
        i.replace("--", np.nan)
    else:
        continue
Also, I have tried:
weo['1980'] = weo['1980'].apply(pd.to_numeric)
You can try using DataFrame.astype to conduct the conversion, which is usually the recommended approach. As you already attempted in your question, you may have to remove all the commas from the strings in column 1980 first, as they cause the error quoted in your question (note that substring replacement on a Series needs .str.replace; Series.replace only matches whole values):
weo['1980'] = weo['1980'].str.replace(',', '', regex=False)
weo['1980'] = weo['1980'].astype(float)
If you're reading your DataFrame from Excel using pandas.read_excel, you can also specify the thousands argument to do this conversion for you, which will likely result in higher performance:
pandas.read_excel(file, thousands=',')
I had type errors all the time while playing with dataframes. I now always use this to convert all the values that can be converted into floats.
# Convert all columns that can be converted into float into float.
# Error were raised because their type was Object
df = df.apply(pd.to_numeric, errors='ignore')
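Putting the pieces above together, a sketch using sample values from the question (the '--' handling mirrors the asker's own attempt):
import numpy as np
import pandas as pd

weo = pd.DataFrame({'1980': ['116,584.54', '--', '3.2']})
weo['1980'] = (weo['1980']
               .str.replace(',', '', regex=False)  # drop thousands separators
               .replace('--', np.nan)              # the file marks missing data as '--'
               .astype(float))
print(weo['1980'].tolist())  # [116584.54, nan, 3.2]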

pandas reading data from column in as float or int and not str despite dtype setting

I have an issue with pandas (0.23.4) on Python 3.7 where the data is being read in as scientific notation instead of just a string, despite the dtype setting. Here is an example of the data that is being read in:
-------------------
codes
-------------------
001234544
00023455
123456789
A1253532
780E9000
00678E10
The problem comes with lines 5 and 6 of the above because they contain, I think, 'E' characters and they are being turned into scientific notation.
My reader is setup as follows.
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', dtype=str)
Despite that dtype=str setting, it appears that pandas is using something called a "sniffer" that detects the data type automatically, so the values are being changed back to what I assume is float or int, and then into scientific notation. One suggestion in another thread says to use a converters statement within read_csv like the following:
pd.read_csv('my.csv', converters = {i: str for i in range(0, 100)})
I am curious if this is a possible solution to my problem, but I also have no idea how long that range should be, as it changes often. Is there any way to query the length of the column and feed that as a variable into that range call?
It looks like I can do something like len(accounts.index), but I can't do this until after the reader has read the file, so something like the below doesn't work:
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', converters={i: str for i in range(0, gym_length)})
gym_length = len(accounts.index)
The length check comes after the, I guess you'd call it, data reader, so it obviously doesn't work.
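One workaround, a sketch not from the original thread (it assumes a pandas version whose read_excel accepts nrows): read just the header row first, then build the converters dict from the actual columns.
import pandas as pd

# Read only the header to learn the columns, then convert every column to str.
header = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', nrows=0)
converters = {col: str for col in header.columns}
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', converters=converters)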

Python3, how to encode this string correctly?

Disclaimer: I've already done a lot of research to solve this on my own, but most of the questions I found here concern Python 2.7 or don't solve my problem.
Let's say I have the following (the example comes from the BeautifulSoup doc; I'm trying to solve a bigger issue):
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(markup)
'Sacré bleu!'
For me, markup should be assigned a bytes object, so I could do:
>>> markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(str(markup, 'utf-8'))
<h1>Sacré bleu!</h1>
Yeah! But how do I make that transition from "<h1>Sacr\xc3\xa9 bleu!</h1>", which is wrong, into b"<h1>Sacr\xc3\xa9 bleu!</h1>"?
Because if I do:
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> bytes(markup, "utf-8")
b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'
You see? It inserted \x83\xc2 for free.
>>> print(bytes(markup))
TypeError: string argument without an encoding
If you have the Unicode string "<h1>Sacr\xc3\xa9 bleu!</h1>", something has already gone wrong. Either your input is broken, or you did something wrong when processing it. For example, here, you've copied a Python 2 example into a Python 3 interpreter.
If you have your broken string because you did something wrong to get it, then you should really fix whatever it was you did wrong. If you need to convert "<h1>Sacr\xc3\xa9 bleu!</h1>" to b"<h1>Sacr\xc3\xa9 bleu!</h1>" anyway, then encode it in latin-1:
bytestring = broken_unicode.encode('latin1')
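A quick demonstration of that round trip, using the markup value from the question:
# The broken str holds UTF-8 byte values as code points; latin-1 maps
# code points 0-255 one-to-one back to bytes 0-255, undoing the damage.
markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
raw = markup.encode('latin1')   # b"<h1>Sacr\xc3\xa9 bleu!</h1>"
print(raw.decode('utf-8'))      # <h1>Sacré bleu!</h1>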
