python pandas read_excel: how to leave a date column as str without parsing

hi, I have an Excel file with dates in dd/mm/yyyy format and the cell datatype set to "general".
I read some posts and tried
df = pd.read_excel(f_in,
                   converters={'Created Date': str},
                   engine='openpyxl')
and
df = pd.read_excel(f_in,
                   dtype={'Created Date': str},
                   engine='openpyxl')
Both give the same wrong result:
for ee in df['Created Date'][0:3]:
    print(ee, type(ee), '\n')
2019-04-11 00:00:00 <class 'str'>  # this was parsed as mm/dd/yyyy and coerced into str, not left as is
18/11/2019 <class 'str'>
25/11/2019 <class 'str'>
Update: I tried parse_dates=False and it still does not work:
df = pd.read_excel(f_in,
                   parse_dates=False,
                   engine='openpyxl')
for ee in df['Created Date'][0:3]:
    print(ee, type(ee), '\n')
2019-04-11 00:00:00 <class 'datetime.datetime'>
18/11/2019 <class 'str'>
25/11/2019 <class 'str'>
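A workaround sketch for the behaviour above: cells that Excel stores as true dates reach pandas as datetime objects before converters/dtype ever run, so the conversion to dd/mm/yyyy strings can instead be done after reading. The DataFrame here simulates what read_excel returns for such a mixed column; the column name matches the question.

```python
import pandas as pd
from datetime import datetime

# Simulate what read_excel returns for a "general" date column:
# true date cells arrive as datetime, text cells stay str.
df = pd.DataFrame({"Created Date": [datetime(2019, 11, 4), "18/11/2019", "25/11/2019"]})

# Normalise everything to dd/mm/yyyy strings after reading.
df["Created Date"] = df["Created Date"].apply(
    lambda v: v.strftime("%d/%m/%Y") if hasattr(v, "strftime") else v
)
print(df["Created Date"].tolist())  # ['04/11/2019', '18/11/2019', '25/11/2019']
```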

Related

How to convert a datetime string to a different format in Pandas?

I have a datetime field like 2017-01-15T02:41:38.466Z and would like to convert it to %Y-%m-%d format. How can this be achieved in pandas or Python?
I tried this
frame['datetime_ordered'] = pd.datetime(frame['datetime_ordered'], format='%Y-%m-%d')
but getting the error
cannot convert the series to <class 'int'>
The following code worked:
from datetime import datetime

d_parser = lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ')
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0,
                     parse_dates=['datetime_ordered'], date_parser=d_parser)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Alternatively, for a single value:
import pandas as pd

date_str = "2017-01-15T02:41:38.466Z"
a_date = pd.to_datetime(date_str)
print("date time value:", a_date)
# datetime to string with the desired format
print(a_date.strftime('%Y-%m-%d'))
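A vectorised variant (a sketch with made-up sample data standing in for the CSV column): parsing the whole column with pd.to_datetime and formatting it back via the .dt.strftime accessor avoids a per-row date_parser.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV column.
frame = pd.DataFrame({"datetime_ordered": ["2017-01-15T02:41:38.466Z",
                                           "2017-02-20T10:05:00.123Z"]})

# Parse the whole column at once, then format it back to %Y-%m-%d.
frame["datetime_ordered"] = pd.to_datetime(frame["datetime_ordered"],
                                           format="%Y-%m-%dT%H:%M:%S.%fZ")
print(frame["datetime_ordered"].dt.strftime("%Y-%m-%d").tolist())
# ['2017-01-15', '2017-02-20']
```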

Pandas set_index() seems to change the types for some rows to <class 'pandas.core.series.Series'>

I'm observing an unexpected behavior of the Pandas set_index() function.
In order to make my results reproducible I provide my DataFrame as a pickle file df_test.pkl.
df_test = pd.read_pickle('./df_test.pkl')
time id avg
0 1554985690182 117455392 4.06300000
1 1554985690288 117455393 0.95800000
2 1554985690641 117455394 2.38400000
...
Now, when I iterate over the rows and print the type of each "id" value I get <class 'numpy.int64'> for all cells.
for i in df_test.index:
    print(type(df_test.at[i,'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
...
Now I set the index to the "time" column and everything looks fine.
df_test = df_test.set_index(keys='time', drop=True)
id avg
time
1554985690182 117455392 4.06300000
1554985690288 117455393 0.95800000
1554985690641 117455394 2.38400000
...
But when I iterate again over the rows and print the type of each "id" value I get <class 'pandas.core.series.Series'> for some cells.
for i in df_test.index:
    print(type(df_test.at[i,'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
...
Does anyone know what is going on here?
UPDATE:
I have removed the "id_type" column from the df_test DataFrame because it was not helpful. Thanks to @Let'stry for making me aware!
I think I found the answer myself.
There were duplicate timestamps in the "time" column, and it seems that Pandas cannot set_index() properly when the selected column contains duplicate values. That makes sense, because duplicate labels make scalar lookups on the index ambiguous.
By the way, I found this issue by using the argument verify_integrity=True in the set_index() call, so I recommend using that argument to avoid this kind of trouble.
df_test = df_test.set_index(keys='time', drop=True, verify_integrity=True)
Everything works fine now after I've removed the duplicate rows.
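A minimal reproduction of the duplicate-label behaviour, with made-up data (shown here with .loc, which reliably returns a Series for a duplicated label; the question observed the same effect through .at in its pandas version), plus verify_integrity=True surfacing the duplicates up front:

```python
import pandas as pd

# Made-up frame with a duplicated "time" value.
df = pd.DataFrame({"time": [1, 2, 2], "id": [10, 20, 30]}).set_index("time")

print(type(df.loc[1, "id"]))  # a scalar (numpy.int64)
print(type(df.loc[2, "id"]))  # <class 'pandas.core.series.Series'>

# verify_integrity=True raises immediately instead of silently building
# an index with duplicate keys.
try:
    pd.DataFrame({"time": [1, 2, 2], "id": [10, 20, 30]}).set_index(
        "time", verify_integrity=True)
except ValueError as err:
    print(err)  # Index has duplicate keys: ...
```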

How to preserve the 'list' datatype of a DataFrame column while writing to or reading from CSV

I want to preserve the datatype of a DataFrame column whose values are lists while writing it to a CSV file; when I read the file back, the values should still be lists.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
this is an example
df = pd.DataFrame({'name': ['jack', 'johnny', 'stokes'],
                   'letters': [['j', 'k'], ['j', 'y'], ['s', 's']]})
print(type(df['letters'][0]))
df
<class 'list'>
name letters
0 jack [j, k]
1 johnny [j, y]
2 stokes [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>
You can use the ast module to turn the strings back into lists:
import ast

df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
print(type(df2['letters'][0]))
<class 'list'>
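An alternative sketch: do the conversion at read time with the converters argument, so the column arrives as lists directly. To keep the example self-contained, the file contents are inlined with io.StringIO, matching what df.to_csv('namesss.csv') would have written (an assumption about the file on disk).

```python
import ast
import io
import pandas as pd

# Inlined CSV matching what df.to_csv('namesss.csv') would write.
csv_data = io.StringIO(
    ",name,letters\n"
    "0,jack,\"['j', 'k']\"\n"
    "1,johnny,\"['j', 'y']\"\n"
    "2,stokes,\"['s', 's']\"\n"
)

# converters runs ast.literal_eval on every cell of 'letters' while reading.
df2 = pd.read_csv(csv_data, index_col=0, converters={"letters": ast.literal_eval})
print(type(df2["letters"][0]))  # <class 'list'>
print(df2["letters"][0])        # ['j', 'k']
```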

Cannot astype timedelta using pandas? [duplicate]

This question already has answers here:
Python: Convert timedelta to int in a dataframe
(6 answers)
Convert timedelta64[ns] column to seconds in Python Pandas DataFrame
(6 answers)
Closed 4 years ago.
I get the following error running my code:
import pandas as pd

data = pd.read_csv('file.csv')
data['time'] = pd.to_datetime(data['UNIX time'], unit='s')
data['time_min'] = (data['time'] - data['time'].min()).astype(int)
"cannot astype a timedelta from [timedelta64[ns]] to [int32]"
You can extract the number of days from the result using Timedelta.days, which is a plain int:
sol = (data['time'] - data['time'].min()).apply(lambda x: x.days)
print(sol)
0 0
1 0
2 14
3 14
print(sol.apply(type))
0 <class 'int'>
1 <class 'int'>
2 <class 'int'>
3 <class 'int'>
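A vectorised alternative (with made-up timestamps standing in for the CSV): the .dt accessor exposes days and total_seconds() on a timedelta column, avoiding both the per-row lambda and the failing astype(int).

```python
import pandas as pd

# Hypothetical UNIX timestamps: two rows at t=0, two rows fourteen days later.
data = pd.DataFrame({"UNIX time": [0, 0, 1209600, 1209600]})
data["time"] = pd.to_datetime(data["UNIX time"], unit="s")

# The subtraction yields a timedelta64[ns] column; use the .dt accessor on it.
delta = data["time"] - data["time"].min()
print(delta.dt.days.tolist())             # [0, 0, 14, 14]
print(delta.dt.total_seconds().tolist())  # [0.0, 0.0, 1209600.0, 1209600.0]
```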

Spacy - Convert Token type into list

I have a few elements that I got after performing a spaCy operation, and they have the Token type.
Input:
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
I want all elements in the list to have str type for iteration.
Expected output:
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output
<class 'str'>
<class 'str'>
<class 'str'>
Please suggest an optimized way.
A spaCy Token has an attribute called text.
Here's a complete example:
import spacy

nlp = spacy.load('en_core_web_sm')
t = u"India Australia Brazil"
li = nlp(t)
for i in li:
    print(i.text)
Or, if you want the tokens as a list of strings:
list_of_strings = [i.text for i in li]
Thanks for the solution and for sharing your knowledge. It works very well for converting a spaCy doc/span to a string or a list of strings for further string operations.
You can also use this:
for i in li:
    print(str(i))
