DF - data conversion issues - python-3.x

I have a question regarding a project I am doing. I am getting a conversion error, "cannot convert the series to <class 'int'>", and I cannot see why. The values I have are int64, while the system apparently tries to convert to base 10.
I have a csv file called "test.csv" and it is structured like this:
date,value
2016-05-09,1201
2016-05-10,2329
2016-05-11,1716
2016-05-12,10539
...
I import the data, parse the dates and set the index column to 'date'.
df = pd.read_csv("test.csv", parse_dates=True)
df = df.set_index('date')
Afterwards I trim off the bottom and top 2.5% of the values:
df = df[(df['value'] >= (df['value'].quantile(0.025))) &(df['value'] <= (df['value'].quantile(0.975)))]
I print the data types that I've got and find only one:
print (df.dtypes)
value int64
dtype: object
If I run it against this code (as part of a test):
actual = int(time_series_visualizer.df.count(numeric_only=True))
I get this error:
TypeError: cannot convert the series to <class 'int'>
I tried converting to another type, to see if it was an issue with int64:
df.value.astype(int)
df.value.astype(float)
but neither worked.
Does anyone have any suggestions that I could try?
Thanks.
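As a side note, one way this exact error can arise: df.count(numeric_only=True) returns a Series (one count per numeric column), and calling int() on a Series holding more than one element raises this TypeError. A minimal sketch, with a hypothetical second numeric column added purely for illustration:

import pandas as pd

# stand-in frame; 'other' is a made-up second numeric column
df = pd.DataFrame({'value': [1201, 2329, 1716, 10539],
                   'other': [5, 6, 7, 8]})

counts = df.count(numeric_only=True)  # Series with one count per numeric column
# int(counts)  # would raise: TypeError: cannot convert the series to <class 'int'>

# selecting a single column first yields a scalar count
actual = int(df['value'].count())
print(actual)  # 4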

Related

convert pandas Series of dtype 'datetime64' into dtype 'np.int' without iterating

Consider the below sample pandas DataFrame:
df = pd.DataFrame({'date':[pd.to_datetime('2016-08-11 14:09:57.00'),pd.to_datetime('2016-08-11 15:09:57.00'),pd.to_datetime('2016-08-11 16:09:57.8700')]})
I can convert single instance into np.int64 type with
print(df.date[0].value)
1470924597000000000
or convert the entire column iteratively with
df.date.apply(lambda x: x.value)
How can I achieve this without using iteration? Something like
df.date.value
I would also want to convert the np.int64 objects back to pd.Timestamp objects without using iteration. I got some insights from solutions posted here and here, but they don't solve my problem.
As per @anky's comments above, the solution was straightforward:
df.date.astype('int64')  # to int64 dtype
pd.to_datetime(df.date)  # from int dtype back to datetime64 dtype
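A small end-to-end sketch of that round trip (variable names are just illustrative):

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2016-08-11 14:09:57',
                                           '2016-08-11 15:09:57',
                                           '2016-08-11 16:09:57.87'])})

as_int = df.date.astype('int64')  # nanoseconds since the Unix epoch, dtype int64
back = pd.to_datetime(as_int)     # back to datetime64[ns], no iteration needed

print(as_int.iloc[0])        # 1470924597000000000
print(back.equals(df.date))  # True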

Changing column datatype from Timestamp to datetime64

I have a database I'm reading from Excel as a pandas DataFrame, and the dates come in as Timestamp objects, but I need them to be np.datetime64 so that I can do calculations.
I am aware that the function pd.to_datetime() and the astype(np.datetime64[ns]) method do work. However, I am unable to update my dataframe to yield this datatype, for whatever reason, using the code mentioned above.
I have also tried creating an accessory dataframe from the original one, containing just the dates whose type I wish to update, converting it to np.datetime64 and plugging it back into the original dataframe:
dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi
But still it doesn't work. I have also tried updating values one by one:
arr_i = df.index
for i in range(len(arr_i)):
    df.at[arr_i[i], 'dates'].to_datetime64()
Edit
The root problem seems to be that the dtype of the column gets updated to np.datetime64, but somehow, when getting single values out of it, they still come back as Timestamp objects.
Does anyone have a suggestion of a workaround that is fairly fast?
Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.
There is a way to prevent this automatic conversion from happening, however: wrap the list of values in a Series of dtype object:
import numpy as np
import pandas as pd
# create some dates, merely for example
dates = pd.date_range('2000-1-1', periods=10)
# convert the dates to a *list* of datetime64s
arr = list(dates.to_numpy())
# wrap the values you wish to protect in a Series of dtype object.
ser = pd.Series(arr, dtype='object')
# assignment with `df['datetime64s'] = ser` would also work
df = pd.DataFrame({'timestamps': dates,
                   'datetime64s': ser})
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 2 columns):
# timestamps 10 non-null datetime64[ns]
# datetime64s 10 non-null object
# dtypes: datetime64[ns](1), object(1)
# memory usage: 240.0+ bytes
print(type(df['timestamps'][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(type(df['datetime64s'][0]))
# <class 'numpy.datetime64'>
But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism, it may not be wise to do so. Converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance, and using object arrays is another bad sign, since operations on object arrays are generally much slower than equivalent operations on arrays of native NumPy dtypes.
You may be looking at an XY problem -- it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s, or (2) work with datetime64 array-likes (e.g. Series or NumPy arrays) instead of handling values individually (which causes the coercion to Timestamps).
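For instance, a sketch of option (2), operating on the whole column as a datetime64 array instead of element by element (the column name is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=5)})

arr = df['dates'].to_numpy()  # ndarray of dtype datetime64[ns]
print(arr.dtype)              # datetime64[ns]

# whole-array arithmetic stays in datetime64/timedelta64 land
elapsed = arr - arr[0]                    # timedelta64[ns] array
days = elapsed / np.timedelta64(1, 'D')   # float array of elapsed days
print(days)                               # [0. 1. 2. 3. 4.]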

Pandas is messing with a high resolution integer on read_csv

EDIT: This was Excel's fault changing the data type, not Pandas.
When I read a CSV using pd.read_csv(file), a column of very long ints gets converted to a low-resolution float. These ints are a datetime in microseconds.
example:
CSV Columns of some values:
15555071095204000
15555071695202000
15555072295218000
15555072895216000
15555073495207000
15555074095206000
15555074695212000
15555075295202000
15555075895210000
15555076495216000
15555077095230000
15555077695206000
15555078295212000
15555078895218000
15555079495209000
15555080095208000
15555080530515000
15555086531880000
15555092531889000
15555098531886000
15555104531886000
15555110531890000
15555116531876000
15555122531873000
15555128531884000
15555134531884000
15555140531887000
15555146531874000
pd.read_csv produces: 1.55551e+16
how do I get it to report the exact int?
I've tried using: float_precision='high'
It's possible that this is caused by the way Pandas handles missing values, meaning that your column is importing as floats, to allow the missing values to be coded as NaN.
A simple solution would be to force the column to import as str, then impute or remove missing values, and then convert to int:
import pandas as pd
df = pd.read_csv(file, dtype={'col1': str})  # Edit to use appropriate column reference
# If you want to just remove rows with missing values, something like:
df = df[df.col1 != '']
# Then convert to integer
df.col1 = df.col1.astype('int64')
With a Minimal, Complete and Verifiable Example we can pinpoint the problem and update the code to accurately solve it.
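As an aside, recent pandas versions also offer a nullable integer dtype ('Int64', capital I), which keeps missing values as <NA> without forcing the column to float; a self-contained sketch on made-up data:

import io
import pandas as pd

# made-up sample with a missing entry in col1
csv = io.StringIO("col1,col2\n15555071095204000,a\n,b\n15555071695202000,c\n")

# 'Int64' is pandas' nullable integer dtype: missing values stay <NA>,
# so the large integers keep their exact values
df = pd.read_csv(csv, dtype={'col1': 'Int64'})
print(df['col1'])
# 0    15555071095204000
# 1                 <NA>
# 2    15555071695202000
# Name: col1, dtype: Int64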

Cannot convert object to np.int64 with Numpy

I have a dataframe with 3 columns with the following dtypes:
df.info()
tconst object
directors object
writers object
Now, I have to change the column tconst to dtype int64. I tried this code but it throws an error:
df = pd.read_csv('title.crew.tsv',
                 header=None, sep='\t',
                 encoding='latin1',
                 names=['tconst', 'directors', 'writers'],
                 dtype={'tconst': np.int64, 'directors': np.int64})
Error 1: ValueError: invalid literal for int() with base 10: 'tconst'
Error 2: TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'
What is going wrong here?
In my opinion the problem here is the parameter header=None, which is used for reading a file with no CSV header.
The solution is to remove it, because the first row of the file is the header, and it is converted to the DataFrame's column names:
df = pd.read_csv('title.crew.tsv',
                 sep='\t',
                 encoding='latin1')
Another problem is the tt and nm prefixes in the columns, which means the values cannot be converted to integers directly.
The solution is:
df['tconst'] = df['tconst'].str[2:].astype(int)
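A self-contained sketch of that prefix stripping on made-up IDs, since the real file is not at hand:

import pandas as pd

# stand-in values mimicking the tt-prefixed IDs
df = pd.DataFrame({'tconst': ['tt0000001', 'tt0000002', 'tt0000003']})

# drop the two-character prefix, then convert the remaining digits to integers
df['tconst'] = df['tconst'].str[2:].astype(int)
print(df['tconst'])
# 0    1
# 1    2
# 2    3
# Name: tconst, dtype: int64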

pandas to_excel TypeError: Unsupported type <type 'list'> in write()

I have a dataframe and I am trying to export it to Excel. I am using the following code:
outcome = merge(sql, emptydf, on='sku', how='left')
writer = pd.ExcelWriter('casafina_json.xlsx', engine='xlsxwriter')
outcome.to_excel(writer, sheet_name='Sheet1')
writer.save()
when I run it I get this error:
raise TypeError("Unsupported type %s in write()" % type(token))
TypeError: Unsupported type <type 'list'> in write()
What's wrong? Thanks.
EDIT:
The column on which the two dataframes were joined does not have unique values. I think this may be the cause of the error. Any idea how to join them differently?
I still don't know what is wrong. However, it is possible to create a CSV out of it with:
outcome.to_csv('/Users/whatever/whatever.csv')
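For what it's worth, this error comes from the xlsxwriter engine refusing to write a Python list into a cell, so if the merged frame ended up with list-valued cells, one workaround is to stringify them before exporting. A sketch on made-up data, not the original frames:

import pandas as pd

# hypothetical frame with a list-valued column, just to show the workaround
outcome = pd.DataFrame({'sku': ['a1', 'a2'],
                        'tags': [['red', 'large'], ['blue']]})

# convert any list cells to plain strings so the Excel writer can handle them
outcome = outcome.applymap(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else x)

with pd.ExcelWriter('casafina_json.xlsx', engine='xlsxwriter') as writer:
    outcome.to_excel(writer, sheet_name='Sheet1')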
