Pandas set_index() seems to change the types for some rows to <class 'pandas.core.series.Series'> - python-3.x

I'm observing an unexpected behavior of the Pandas set_index() function.
In order to make my results reproducible I provide my DataFrame as a pickle file df_test.pkl.
df_test = pd.read_pickle('./df_test.pkl')
time id avg
0 1554985690182 117455392 4.06300000
1 1554985690288 117455393 0.95800000
2 1554985690641 117455394 2.38400000
...
Now, when I iterate over the rows and print the type of each "id" value I get <class 'numpy.int64'> for all cells.
for i in df_test.index:
    print(type(df_test.at[i, 'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
...
Now I set the index to the "time" column and everything looks fine.
df_test = df_test.set_index(keys='time', drop=True)
id avg
time
1554985690182 117455392 4.06300000
1554985690288 117455393 0.95800000
1554985690641 117455394 2.38400000
...
But when I iterate again over the rows and print the type of each "id" value I get <class 'pandas.core.series.Series'> for some cells.
for i in df_test.index:
    print(type(df_test.at[i, 'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
...
Does anyone know what is going on here?
UPDATE:
I have removed the "id_type" column from the df_test DataFrame because it was not helpful. Thanks to #Let'stry for pointing that out!

I think I found the answer myself.
There were duplicate timestamps in the "time" column, and it seems that Pandas cannot set_index() cleanly if there are duplicate values in the selected column. That makes sense: with a non-unique index, a label lookup like df_test.at[i, 'id'] returns a Series containing every row that shares that timestamp instead of a single value, which is exactly what I was seeing.
By the way, I found this issue by using the argument verify_integrity=True in the set_index() function. So I recommend using that argument to avoid this kind of trouble.
df_test = df_test.set_index(keys='time', drop=True, verify_integrity=True)
Everything works fine now that I've removed the duplicate rows.
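For completeness, here is a minimal sketch (using the column names from the example above) of how one might locate and drop the duplicate timestamps before setting the index:
import pandas as pd

df_test = pd.read_pickle('./df_test.pkl')

# Show every row whose timestamp occurs more than once
print(df_test[df_test['time'].duplicated(keep=False)])

# Keep only the first row per timestamp (aggregating instead may be more appropriate,
# depending on the data), then set the now-unique index
df_test = df_test.drop_duplicates(subset='time', keep='first')
df_test = df_test.set_index(keys='time', drop=True, verify_integrity=True)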

Related

How to create an empty Series on pandas?

I'm learning about Series and I'm using VS Code to do exercises to learn its usage, but when I typed this
current_series_add = pd.Series()
in the terminal it shows me a message telling me that "The default dtype for empty Series will be object instead of float64 in a future version"
How can I specify a dtype?
As the docs say:
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
...
dtype : str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be
inferred from data. See the user guide for more usages.
Example:
In [1]: import pandas as pd
In [2]: pd.Series(dtype=int)
Out[2]: Series([], dtype: int64)
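If you want to keep the previous float64 default explicitly (and silence the warning), pass dtype='float64':
In [3]: pd.Series(dtype='float64')
Out[3]: Series([], dtype: float64)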

Pandas Timestamp: What type is this?

I have a pandas dataframe with a parsed time stamp. What type is this? I have tried matching against it with the following rules:
dtype_dbg = df[col].dtype # debugger shows it as 'datetime64[ns]'
if isinstance(df[col].dtype, np.datetime64):  # no luck
if isinstance(df[col].dtype, pd.Timestamp):  # ditto
if isinstance(df[col].dtype, [all other timestamps I could think of]):  # nothing
How does one match against the timestamp dtype in a pandas dataframe?
Pandas datetime64[ns] is a '<M8[ns]' numpy type, so you can just compare the dtypes:
df = pd.DataFrame( {'col': ['2019-01-01', '2019-01-02']})
df.col = pd.to_datetime(df.col)
df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 1 columns):
#col 2 non-null datetime64[ns]
#dtypes: datetime64[ns](1)
#memory usage: 144.0 bytes
df['col'].dtype == np.dtype('<M8[ns]')
#True
You can also (or perhaps better) use pandas' built-in api.types.is_... functions:
pd.api.types.is_datetime64_ns_dtype(df['col'])
#True
Your isinstance(df[col].dtype, ...) comparisons don't work because they check the type of the dtype object itself (which is always numpy.dtype), not the type of the data, so they fail no matter what the actual data dtype is.
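A small illustration of that distinction, reusing the df defined above: isinstance works on the individual values, just not on the dtype object.
isinstance(df['col'].dtype, np.dtype)        # True  - the dtype is a numpy.dtype instance
isinstance(df['col'].iloc[0], pd.Timestamp)  # True  - the values are pandas Timestamps
isinstance(df['col'].dtype, np.datetime64)   # False - the dtype object is not a datetime64 value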

Is there a way to list the rows and columns in a pandas DataFrame that are empty strings?

I have a 1650x40 dataframe that is a matrix of people who worked on projects each day. It looks like this:
import pandas as pd
df = pd.DataFrame([['bob', '11/1/19', 'X', '', '', ''],
                   ['pete', '11/1/19', 'X', '', '', ''],
                   ['wendy', '11/1/19', '', '', 'X', ''],
                   ['sam', '11/1/19', '', '', '', ''],
                   ['cara', '11/1/19', '', '', 'X', '']],
                  columns=['person', 'date', 'project1', 'project2', 'project3', 'project4'])
I am trying to sanity check the data by:
- listing any columns that do not have an X in them (in this case 'project2' and 'project4')
- listing any rows that do not have an X in them (in this case 'sam')
Desired outcome:
Something like df.show_empty(columns) returns ['project2','project4'] and df.show_empty(rows) returns ['sam']
Obviously this method would need some way to tell it that the first two columns are not expected to be empty and should be ignored.
My desired outcome above would return lists of column headings (or row indexes) so that I could go back and check my data and application to find out why there's no entry in the relevant cell (I'm guessing there's a good chance that more than one row or column is affected). This seems like it should be trivial, but I'm really stuck on figuring it out.
Thanks for any help offered!
For me, it is easier to use apply to accomplish this task. The working code is shown below:
import pandas as pd
import numpy as np

df = pd.DataFrame([['bob', '11/1/19', 'X', '', '', ''],
                   ['pete', '11/1/19', 'X', '', '', ''],
                   ['wendy', '11/1/19', '', '', 'X', ''],
                   ['sam', '11/1/19', '', '', '', ''],
                   ['cara', '11/1/19', '', '', 'X', '']],
                  columns=['person', 'date', 'project1', 'project2', 'project3', 'project4'])

df = df.replace('', np.NaN)
colmns = df.apply(lambda x: x.count() == 0, axis=0)
df[colmns.index[colmns]]
df[df.apply(lambda x: x[2:].count() == 0, axis=1)]
df = df.replace('', np.NaN) replaces the '' with NaN, so that we can use the count() function.
colmns = df.apply(lambda x: x.count()==0, axis=0): this will find the columns that are all NaN.
df[df.apply(lambda x: x[2:].count()==0, axis=1)]: this will ignore the first two columns.
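If you want the results as plain lists, as in the desired outcome above, one possible follow-up (reusing colmns and df from the code above) is:
empty_columns = colmns.index[colmns].tolist()
empty_people = df.loc[df.apply(lambda x: x[2:].count() == 0, axis=1), 'person'].tolist()
print(empty_columns)  # ['project2', 'project4']
print(empty_people)   # ['sam']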

How to use pandas DataFrame or Series with seaborn compliantly?

I import this dataset and select an interval with
data_frame = pd.read_csv('household_power_consumption.txt',
                         sep=';',
                         parse_dates={'dt': ['Date', 'Time']},
                         infer_datetime_format=True,
                         low_memory=False,
                         na_values=['nan', '?'],
                         index_col='dt')
df_08_09 = data_frame.truncate(before='2008-01-01', after='2010-01-01')
df_08_09.info()
to get
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1052641 entries, 2008-01-01 00:00:00 to 2010-01-01 00:00:00
Data columns (total 7 columns):
Global_active_power 1052641 non-null float64
Global_reactive_power 1052641 non-null float64
Voltage 1052641 non-null float64
Global_intensity 1052641 non-null float64
Sub_metering_1 1052641 non-null float64
Sub_metering_2 1052641 non-null float64
Sub_metering_3 1052641 non-null float64
dtypes: float64(7)
memory usage: 64.2 MB
I just wanted to know how I can treat the DatetimeIndex dt as a data column as well, to make use of lmplot() or regplot(), e.g.:
seaborn.regplot(x="dt", y="Global_active_power", data=df_08_09)
The dt column always causes problems, because seaborn is not able to access it for some reason. I tried to access the DatetimeIndex, but I found no way to extract it and make it a data column, since I'm not very used to pandas yet.
I expect seaborn to find dt in the data, but it doesn't and throws an error accordingly. I can see why, but I don't know how to handle this in an efficient python/pandas/seaborn fashion. So please help me out! :)
...Another question, by the way: I'm also wondering why df_08_09.Global_active_power.values returns an (n,)-shaped np.array and not (n,1). I'm always forced to do values = np.array([values]).transpose() to recover (n,1).
You can use a workaround: convert the datetime column to numbers first and replace matplotlib's axis tick labels with the datetime values afterwards, e.g.
import pandas as pd
import numpy as np
import seaborn
from datetime import datetime
data_frame = pd.read_csv('household_power_consumption.txt',
                         sep=';',
                         parse_dates={'dt': ['Date', 'Time']},
                         infer_datetime_format=True,
                         low_memory=False,
                         na_values=['nan', '?'])
# index_col='dt' is not needed here, as we are not working with indexes
df_08_09 = data_frame.truncate(before='2008-01-01', after='2010-01-01')
# Convert to integers
df_08_09['date_ordinal'] = pd.to_datetime(df_08_09['dt']).apply(lambda date: date.timestamp())
# Plotting as integers
ax = seaborn.regplot(data=df_08_09, x="date_ordinal", y="Global_active_power")
# Adjust axis
ax.set_xlim(df_08_09['date_ordinal'].min() - 1, df_08_09['date_ordinal'].max() + 1)
ax.set_ylim(0, df_08_09['Global_active_power'].max() + 1)
# Set x-axis-tick-labels to datetime
new_labels = [datetime.utcfromtimestamp(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels, rotation = 45)
Reference: This SO answer by waterproof
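Regarding the side question: Series.values returns a 1-D array by design; if an (n, 1) column vector is needed, reshape is the usual idiom instead of wrapping and transposing:
values = df_08_09['Global_active_power'].values  # shape (n,)
values_2d = values.reshape(-1, 1)                # shape (n, 1)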

How to preserve the datatype 'list' of a data frame while reading from csv or writing to csv

I want to preserve the datatype of a data frame column whose values are lists while writing it to a csv file, so that when I read it back I get the values as lists again.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
this is an example
df = pd.DataFrame({'name': ['jack', 'johnny', 'stokes'],
                   'letters': [['j', 'k'], ['j', 'y'], ['s', 's']]})
print(type(df['letters'][0]))
df
<class 'list'>
name letters
0 jack [j, k]
1 johnny [j, y]
2 stokes [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>
You can use the ast module to turn the strings back into lists:
import ast
df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
In [1]: print(type(df2['letters'][0]))
Out[1]: <class 'list'>
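An alternative is to do the conversion while reading the file, via the converters argument of read_csv (a sketch, assuming the same namesss.csv as above):
import ast
import pandas as pd

# converters applies the given function to every cell of the column while parsing
df2 = pd.read_csv('namesss.csv', converters={'letters': ast.literal_eval})
print(type(df2['letters'][0]))  # <class 'list'>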
