TypeError: unorderable types: int() > str(): Series - python-3.x

From a search it seems like one can get this error in a whole host of different situations. Here is mine:
testDf['pA'] = priorDf.loc[testDf['Period']]['a'] + testDf['TotalPlays']
--> 743 sorter = uniques.argsort()
744
745 reverse_indexer = np.empty(len(sorter), dtype=np.int64)
TypeError: unorderable types: int() > str()
where priorDf.loc[testDf['Period']]['a'] is:
Period
2-17-1 1.120947
1-14-1 1.181726
7-19-1 1.935126
4-08-1 3.828184
3-14-1 0.668255
and testDf['TotalPlays'] is:
0 1
1 1
2 1
3 1
4 1
Both are of length 48.
----Additional Info-----
print (priorDf.dtypes)
mean float64
var float64
a float64
b float64
dtype: object
print (testDf.dtypes)
UserID int64
Period object
PlayCount int64
TotalPlays int64
TotalWks int64
Prob float64
pA int64
dtype: object
----- More Info ---------
print (priorDf['a'].head())
Period
1-00-1 0.889164
1-01-1 2.304074
1-02-1 0.281502
1-03-1 1.137781
1-04-1 2.335650
Name: a, dtype: float64
print (testDf[['Period','TotalPlays']].head())
Period TotalPlays
0 2-17-1 1
1 1-14-1 1
2 7-19-1 1
3 4-08-1 1
4 3-14-1 1
I also tried converting priorDf.loc[testDf['Period']]['a'] to type int (it was a float), but I still got the same error.

I think you need to map by priorDf['a'], or by a dict created from it.
The problem was that the two DataFrames had different indexes, so the data could not align.
# changed the Period data to match the sample data
print (testDf)
Period TotalPlays
0 1-00-1 1
1 1-14-1 1
2 7-19-1 1
3 4-08-1 1
4 1-04-1 1
testDf['pA'] = testDf['Period'].map(priorDf['a']) + testDf['TotalPlays']
print (testDf)
Period TotalPlays pA
0 1-00-1 1 1.889164
1 1-14-1 1 NaN
2 7-19-1 1 NaN
3 4-08-1 1 NaN
4 1-04-1 1 3.335650
print (priorDf['a'].to_dict())
{'1-02-1': 0.28150199999999997, '1-01-1': 2.304074,
'1-00-1': 0.88916399999999995, '1-03-1': 1.1377809999999999,
'1-04-1': 2.3356499999999998}
testDf['pA'] = testDf['Period'].map(priorDf['a'].to_dict()) + testDf['TotalPlays']
print (testDf)
Period TotalPlays pA
0 1-00-1 1 1.889164
1 1-14-1 1 NaN
2 7-19-1 1 NaN
3 4-08-1 1 NaN
4 1-04-1 1 3.335650

So my conclusion, after testing with randomly generated values S = pd.Series(np.random.randn(48)), is that when adding two Series together, they have to have the same index: pandas aligns them on the index behind the scenes. In my case, Period was the index of one but only a column (not the index) of the other.
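To illustrate the alignment behaviour with made-up data (a minimal sketch, not the question's actual DataFrames):
import pandas as pd

# Addition aligns on the index: labels present in only one Series
# produce NaN rather than a positional sum.
s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
print(s1 + s2)
# a     NaN
# b    12.0
# c     NaN
# dtype: float64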
My re-written solution was:
testDf.set_index('Period', inplace=True)
testDf['pA'] = priorDf.loc[testDf.index, 'a'] + testDf['TotalPlays']
testDf['pB'] = priorDf.loc[testDf.index, 'b'] + testDf['TotalWks'] - testDf['TotalPlays']
Thanks to jezrael for helping me get to the bottom of it.

Related

How to add unique date values of a datetime64[ns] Series object

I have a column of type datetime64[ns] (df.timeframe).
df has columns ['id', 'timeframe', 'type']
df['type'] can be 'A' or 'B'
I want to get the total number of unique dates per df.type == 'A' and per df.id
I tried this:
df = df.groupby(['id', 'type']).timeframe.apply(lambda x: x.dt.date()).unique().rename('test').reset_index()
But got error:
TypeError: 'Series' object is not callable
What should I do?
You could use value_counts:
(df[df['type']=='A']
   .assign(timeframe=df['timeframe'].dt.date)
   .value_counts(['id','type','timeframe'], sort=False)
   .reset_index()
   .rename(columns={0:'count'}))
id type timeframe count
0 1 A 2022-06-06 2
1 1 A 2022-06-08 1
2 1 A 2022-06-10 2
3 2 A 2022-06-07 1
4 2 A 2022-06-09 1
5 2 A 2022-06-10 1
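As an aside, the original TypeError comes from calling .dt.date as if it were a method: .dt.date is an accessor attribute that already returns a Series, so the trailing parentheses try to call that Series. If what you want is literally the number of unique dates per id, a sketch (assuming the same df):
out = (df[df['type'] == 'A']
       .assign(date=lambda d: d['timeframe'].dt.date)  # attribute, no parentheses
       .groupby('id')['date']
       .nunique()
       .rename('unique_dates')
       .reset_index())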

How to add text element to series data in Python

I have a series data in python defined as:
scores_data = (pd.Series([F1[0], auc, ACC[0], FPR[0], FNR[0], TPR[0], TNR[0]])).round(4)
I want to insert the text 'Features' at location 0 of the series.
I tried scores_data.loc[0] but that replaced the data at location 0.
Thanks for your help.
You can't directly insert a value in a Series (like you could in a DataFrame with insert).
You can use concat:
s = pd.Series([1,2,3,4])
s2 = pd.concat([pd.Series([0], index=[-1]), s])
output:
-1 0
0 1
1 2
2 3
3 4
dtype: int64
Or create a new Series from the values:
pd.Series([0]+s.to_list())
output:
0 0
1 1
2 2
3 3
4 4
dtype: int64
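Applied to the question's series, a sketch (assuming scores_data from above; note the result dtype becomes object once text and numbers are mixed):
scores_data = pd.concat([pd.Series(['Features']), scores_data], ignore_index=True)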

Pandas csv reader - how to force a column to be a specific data type (and replace NaN with null)

I am just getting started with Pandas and I am reading a csv file using the read_csv() method. The difficulty I am having is setting a column to a specific data type.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False)
my df looks like this:
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_code 4 non-null object
1 store_code 4 non-null int64
2 cost1 4 non-null float64
3 start_date 4 non-null int64
4 end_date 4 non-null int64
5 quote_reference 0 non-null float64
6 min1 4 non-null int64
7 cost2 2 non-null float64
8 min2 2 non-null float64
9 cost3 1 non-null float64
10 min3 1 non-null float64
dtypes: float64(6), int64(4), object(1)
memory usage: 480.0+ bytes
You can see that I have multiple 'min' columns: min1, min2, min3.
min1 is correctly detected as int64, but min2 and min3 are float64.
This is because min1 is fully populated, whereas min2 and min3 are sparsely populated; as the summary above shows, min2 has 2 NaN values.
Trying to change the data type using
df['min2'] = df['min2'].astype('int')
I get this error:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
Ideally I want to change the data type to Int and have NaN replaced by a NULL (i.e. I don't want a 0).
I have tried a variety of methods (e.g. fillna) but can't crack this.
All help greatly appreciated.
Since pandas 1.0, you can use the generic pandas.NA to replace numpy.nan; it can serve as an integer NA.
To perform the conversion, use the "Int64" dtype (note the capital I).
df['min2'] = df['min2'].astype('Int64')
Example:
s = pd.Series([1, 2, None, 3])
s.astype('Int64')
Or:
pd.Series([1, 2, None, 3], dtype='Int64')
Output:
0 1
1 2
2 <NA>
3 3
dtype: Int64
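You can also request the nullable dtype at read time, so the sparse columns never pass through float64 at all; a sketch assuming the file and column names from the question:
df = pd.read_csv('test_data.csv', dtype={'min2': 'Int64', 'min3': 'Int64'})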

Adding NaN changes dtype of column in Pandas dataframe

I have an int dataframe:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
But if I set a value to NaN, the whole column is cast to floats! Apparently int columns can't have NaN values. But why is that?
>>> df.iloc[2,1] = np.nan
>>> df
0 1 2
0 0 1.0 2
1 3 4.0 5
2 6 NaN 8
3 9 10.0 11
For performance reasons (which make a big impact in this case), pandas wants each column to hold a single type, and will do its best to keep it that way. NaN is a float value, and all your integers can be harmlessly converted to floats, so that's what happens.
If a harmless conversion isn't possible, the column falls back to the general object dtype:
>>> x = pd.DataFrame(np.arange(4).reshape(2,2))
>>> x
0 1
0 0 1
1 2 3
>>> x[1].dtype
dtype('int64')
>>> x.iloc[1, 1] = 'string'
>>> x
0 1
0 0 1
1 2 string
>>> x[1].dtype
dtype('O')
Since 1 can't be converted to a string in a reasonable manner (without guessing what the user wants), the dtype is converted to object, which is general and doesn't allow for any optimizations. It does, however, give you what you need for a multi-type column:
>>> x[1] = x[1].astype('O') # Alternatively use a non-float NaN object
>>> x.iloc[1, 1] = np.nan # or float('nan')
>>> x
0 1
0 0 1
1 2 NaN
This is usually not recommended, though, unless you really have to.
Not ideal, but visually nicer, is to use pd.NA rather than np.nan:
>>> df.iloc[2,1] = pd.NA
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 <NA> 8
3 9 10 11
This seems fine, but:
>>> df.dtypes
0 int64
1 object # <- not float, but object
2 int64
dtype: object
You can read this page from the documentation.
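If you want missing values without giving up integer semantics altogether, the nullable Int64 extension type (see the previous question) avoids both the float upcast and the object fallback; a minimal sketch:
df = df.astype('Int64')  # nullable integer extension type
df.iloc[2, 1] = pd.NA
print(df.dtypes)
# 0    Int64
# 1    Int64
# 2    Int64
# dtype: object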

Unable to write function for df.columns to factorize()

I have a dataframe df:
age 45211 non-null int64
job 45211 non-null object
marital 45211 non-null object
default 45211 non-null object
balance 45211 non-null int64
housing 45211 non-null object
loan 45211 non-null object
contact 45211 non-null object
day 45211 non-null int64
month 45211 non-null object
duration 45211 non-null int64
campaign 45211 non-null int64
pdays 45211 non-null int64
previous 45211 non-null int64
poutcome 45211 non-null object
conversion 45211 non-null int64
I want to do two things:
(1) I want to create two sub-dataframes which will be automatically separated by dtype=object and dtype=int64. I thought of something like this:
object_df = []
int_df = []
for i in df.columns:
    if df[i].dtype == object:
        object_df.append(i)  # add column to object_df
    else:
        int_df.append(i)     # add column to int_df
(2) Next, I want to take the columns ['job','marital','default','housing','loan','contact','month','poutcome'] from object_df and write a function that factorizes each column, so that categories are converted to numbers. I thought of something like this:
job_values, job_labels = df['job'].factorize()
df['job_fac'] = job_values
Since I would have to copy and paste those for all columns in the object_df, is there a way to write a neat dynamic function?
Use DataFrame.select_dtypes first:
object_df = df.select_dtypes(object)
int_df = df.select_dtypes(np.number)
Then apply a lambda that calls factorize on each column, add a suffix with DataFrame.add_suffix, and join the result back to the original DataFrame with DataFrame.join:
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
df = df.join(object_df[cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))
Sample:
import numpy as np
import pandas as pd

np.random.seed(2020)
c = ['age', 'job', 'marital', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'conversion']
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
cols1 = np.setdiff1d(c, cols)
df1 = pd.DataFrame(np.random.choice(list('abcde'), size=(3, len(cols))), columns=cols)
df2 = pd.DataFrame(np.random.randint(10, size=(3, len(cols1))), columns=cols1)
df = pd.concat([df1, df2], axis=1).reindex(columns=c)
print (df)
age job marital default balance housing loan contact day month duration \
0 9 a a d 5 d d d 6 a 5
1 4 a a c 2 b d d 7 c 1
2 3 a e e 2 a e b 1 b 2
campaign pdays previous poutcome conversion
0 6 4 6 a 6
1 3 4 9 d 4
2 0 7 1 c 9
object_df = df.select_dtypes(object)
print (object_df)
job marital default housing loan contact month poutcome
0 a a d d d d a a
1 a a c b d d c d
2 a e e a e b b c
int_df = df.select_dtypes(np.number)
print (int_df)
age balance day duration campaign pdays previous conversion
0 9 5 6 5 6 4 6 6
1 4 2 7 1 3 4 9 4
2 3 2 1 2 0 7 1 9
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
df = df.join(object_df[cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))
print (df)
age job marital default balance housing loan contact day month ... \
0 9 a a d 5 d d d 6 a ...
1 4 a a c 2 b d d 7 c ...
2 3 a e e 2 a e b 1 b ...
poutcome conversion job_fac marital_fac default_fac housing_fac \
0 a 6 0 0 0 0
1 d 4 0 0 1 1
2 c 9 0 1 2 2
loan_fac contact_fac month_fac poutcome_fac
0 0 0 0 0
1 0 0 1 1
2 1 1 2 2
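If you don't want to hard-code the column list at all, the same idea applies to every object column dynamically; a sketch starting from the original df:
obj_cols = df.select_dtypes(object).columns
df = df.join(df[obj_cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))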
