Cannot astype timedelta using pandas? [duplicate]

This question already has answers here:
Python: Convert timedelta to int in a dataframe
(6 answers)
Convert timedelta64[ns] column to seconds in Python Pandas DataFrame
(6 answers)
Closed 4 years ago.
I get the following error when running my code:
import pandas as pd
data = pd.read_csv('file.csv')
data['time'] = pd.to_datetime(data['UNIX time'],unit='s')
data['time_min'] = (data['time'] - data['time'].min()).astype(int)
"cannot astype a timedelta from [timedelta64[ns]] to [int32]"

You can extract the number of days from the result using Timedelta.days, which is already stored as an int:
sol = (data['time'] - data['time'].min()).apply(lambda x: x.days)
print(sol)
0 0
1 0
2 14
3 14
print(sol.apply(type))
0 <class 'int'>
1 <class 'int'>
2 <class 'int'>
3 <class 'int'>
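
As an aside (a sketch, not part of the original answer; days_elapsed and secs_elapsed are hypothetical column names): timedelta Series also expose a vectorized .dt accessor, which avoids the per-element apply and sidesteps the astype(int) error entirely:
import pandas as pd

data = pd.read_csv('file.csv')
data['time'] = pd.to_datetime(data['UNIX time'], unit='s')
delta = data['time'] - data['time'].min()
data['days_elapsed'] = delta.dt.days             # whole days, as int64
data['secs_elapsed'] = delta.dt.total_seconds()  # elapsed seconds, as float64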

Related

Pandas Timestamp: What type is this?

I have a pandas dataframe with a parsed time stamp. What type is this? I have tried matching against it with the following rules:
dtype_dbg = df[col].dtype # debugger shows it as 'datetime64[ns]'
if isinstance(df[col].dtype, np.datetime64):  # no luck
if isinstance(df[col].dtype, pd.Timestamp):  # ditto
if isinstance(df[col].dtype, [all other timestamps I could think of]):  # nothing
How does one match against the timestamp dtype in a pandas dataframe?
Pandas datetime64[ns] is a '<M8[ns]' numpy type, so you can just compare the dtypes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['2019-01-01', '2019-01-02']})
df['col'] = pd.to_datetime(df['col'])
df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 1 columns):
#col 2 non-null datetime64[ns]
#dtypes: datetime64[ns](1)
#memory usage: 144.0 bytes
df['col'].dtype == np.dtype('<M8[ns]')
#True
You can also (and arguably should) use pandas' built-in api.types.is_... functions:
pd.api.types.is_datetime64_ns_dtype(df['col'])
#True
Your isinstance(df[col].dtype, ...) comparisons don't work because you are checking the type of the dtype object itself (which is numpy.dtype, of course) against data types, so the check naturally fails for any data type.
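A quick demonstration of why the isinstance checks fail (a sketch building on the df and imports above):
print(isinstance(df['col'].dtype, np.dtype))           # True: a dtype is a numpy.dtype
print(isinstance(df['col'].dtype, np.datetime64))      # False: this is why the check failed
print(df['col'].dtype == np.dtype('<M8[ns]'))          # True
print(pd.api.types.is_datetime64_ns_dtype(df['col']))  # True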

How to preserve the datatype 'list' of a data frame while reading from csv or writing to csv

I want to preserve the datatype of a DataFrame column whose values are lists while writing it to a CSV file. When I read the file back, the values have to be lists again.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
this is an example
df = pd.DataFrame({'name': ['jack','johnny','stokes'],
                   'letters': [['j','k'],['j','y'],['s','s']]})
print(type(df['letters'][0]))
df
<class 'list'>
name letters
0 jack [j, k]
1 johnny [j, y]
2 stokes [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>
You can use the ast module to turn the strings back into lists:
import ast
df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
In [1]: print(type(df2['letters'][0]))
<class 'list'>
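
Alternatively (a sketch, assuming the same namesss.csv as above), read_csv's converters parameter can run ast.literal_eval while parsing, so no second pass over the column is needed:
import ast
import pandas as pd

# apply ast.literal_eval to each cell of 'letters' during the read
df2 = pd.read_csv('namesss.csv', converters={'letters': ast.literal_eval})
print(type(df2['letters'][0]))
# <class 'list'>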

Python - dataframe url parsing issue

I am trying to extract the domain names from the URLs in one column into another column. It works on a single string-like object, but when I apply it to a DataFrame column it doesn't work. How do I apply this to a DataFrame?
Tried:
from urllib.parse import urlparse
import numpy as np
import pandas as pd
id1 = [1,2,3]
ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
df = pd.DataFrame({'id':id1,'url':ls})
df
# urlparse(df['url']) # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# df['url'].map(urlparse) # AttributeError: 'float' object has no attribute 'decode'
working on string:
string = 'https://google.com/tensoflow'
parsed_uri = urlparse(string)
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
result
looking for a column:
col3
https://google.com/
https://math.com/
nan
Error
You can try something like this.
Here I have used pandas.Series.apply() to solve it.
» Initialization and imports
>>> from urllib.parse import urlparse
>>> import pandas as pd
>>> id1 = [1,2,3]
>>> import numpy as np
>>> ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
>>> ls
['https://google.com/tensoflow', 'https://math.com/some/website', nan]
>>>
» Inspect the newly created DataFrame.
>>> df = pd.DataFrame({'id':id1,'url':ls})
>>> df
id url
0 1 https://google.com/tensoflow
1 2 https://math.com/some/website
2 3 NaN
>>>
>>> df["url"]
0 https://google.com/tensoflow
1 https://math.com/some/website
2 NaN
Name: url, dtype: object
>>>
» Applying a function using pandas.Series.apply(func) on the url column.
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
0 https://google.com/
1 https://math.com/
2 NaN
Name: url, dtype: object
>>>
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
>>>
» Store the above result in a variable (not mandatory, just to simplify).
>>> s = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
>>> s
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
» Finally
>>> df2 = pd.DataFrame({"col3": s})
>>> df2
col3
0 https://google.com/
1 https://math.com/
2 nan
>>>
» To verify what s and df2 are, check their types (again, not mandatory).
>>> type(s)
<class 'pandas.core.series.Series'>
>>>
>>>
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
>>>
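» The same logic, condensed into a named helper (a sketch; get_domain is a hypothetical name, not part of the original answer):
>>> def get_domain(url):
...     """Return 'scheme://netloc/' for a URL, or NaN for missing values."""
...     if pd.isna(url):
...         return np.nan
...     uri = urlparse(url)
...     return '{}://{}/'.format(uri.scheme, uri.netloc)
...
>>> df['col3'] = df['url'].apply(get_domain)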
Reference links:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html

AttributeError: 'NoneType' object has no attribute 'isnull' [duplicate]

This question already has answers here:
Pandas DataFrame: set_index with inplace=True returns a NoneType, why?
(2 answers)
Understanding inplace=True in pandas
(11 answers)
Closed 4 years ago.
I'm trying to remove the empty rows. But when I try to count the empty rows to see if it worked, I get an error:
AttributeError: 'NoneType' object has no attribute 'isnull'
My script:
import pandas as pd
data = pd.read_csv('data.csv', sep=';')
#print('Table Data\n', data)
data_sum_empty = data.isnull().sum()
#print(data_sum_empty)
data_not_empty = data_sum_empty.dropna(how = 'all', inplace = True)
print(data_not_empty.isnull().sum())
Output:
Traceback (most recent call last):
File ".\data_vis.py", line 12, in
print(data_not_empty.isnull().sum())
AttributeError: 'NoneType' object has no attribute 'isnull'
Some data
flightID DepTime ArrTime ActualElapsedTime AirTime ArrDelay
BBYYEUVY67527 1416.0 1514.0 58.0 39.0 64.0
MUPXAQFN40227 2137.0 37.0 120.0 47.0 52.0
LQLYUIMN79169 730.0 916.0 166.0 143.0 -25.0
KTAMHIFO10843 NaN NaN NaN NaN NaN
BOOXJTEY23623 NaN NaN NaN NaN NaN
Why duplicate???? I did not know the problem was because of the inplace. If I'd known, I would not have asked!
When you perform an operation on a DataFrame with inplace=True, the object is modified in place and the operation returns None.
data_sum_empty.dropna(how = 'all', inplace = True)
data_not_empty = data_sum_empty.copy()
print(data_not_empty.isnull().sum())
Or
data_not_empty = data_sum_empty.dropna(how = 'all')
print(data_not_empty.isnull().sum())
Do not reassign the result if you use inplace = True:
data_sum_empty.dropna(how = 'all', inplace = True)
print(data_sum_empty.isnull().sum())
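
A minimal illustration of the pitfall (a sketch, not from the original answer):
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
result = s.dropna(inplace=True)
print(result)  # None -- inplace operations return nothing
print(s)       # the Series itself was modified in place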

Can't seem to use pandas to_csv and read_csv to properly read a numpy array

The problem seems to stem from reading the csv back in with read_csv: there is a type issue when I try to perform operations on the numpy array. The following is a minimal working example.
import numpy as np
import pandas as pd

x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following is the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# ^ ^
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
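As a final check (a sketch continuing the session above, assuming numpy is imported as np), the operation that originally raised the TypeError now returns a plain float:
In [60]: np.array(df5['numpy']).mean()
Out[60]: 0.417980915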
