Unable to write function for df.columns to factorize() - python-3.x

I have a dataframe df:
age 45211 non-null int64
job 45211 non-null object
marital 45211 non-null object
default 45211 non-null object
balance 45211 non-null int64
housing 45211 non-null object
loan 45211 non-null object
contact 45211 non-null object
day 45211 non-null int64
month 45211 non-null object
duration 45211 non-null int64
campaign 45211 non-null int64
pdays 45211 non-null int64
previous 45211 non-null int64
poutcome 45211 non-null object
conversion 45211 non-null int64
I want to do two things:
(1) I want to create two sub-dataframes which will be automatically separated by dtype=object and dtype=int64. I thought of something like this:
object_df = []
int_df = []
for i in df.columns:
    if dtype == object:
        *add column to object_df*
    else:
        *add column to int_df*
(2) Next, I want to use the columns from object_df['job','marital','default','housing','loan','contact','month','poutcome'] and write a function which factorizes each column, so that categories will be converted to numbers. I thought of something like this:
job_values, job_labels = df['job'].factorize()
df['job_fac'] = job_values
Since I would have to copy and paste those for all columns in the object_df, is there a way to write a neat dynamic function?

Use DataFrame.select_dtypes first:
import numpy as np
import pandas as pd

object_df = df.select_dtypes(object)
int_df = df.select_dtypes(np.number)
Then apply a lambda function that calls pd.factorize on each column, rename the new columns with DataFrame.add_suffix, and DataFrame.join them back to the original DataFrame:
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
df = df.join(object_df[cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))
Sample:
np.random.seed(2020)
c = ['age', 'job', 'marital', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'conversion']
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
cols1 = np.setdiff1d(c, cols)
df1 = pd.DataFrame(np.random.choice(list('abcde'), size=(3, len(cols))), columns=cols)
df2 = pd.DataFrame(np.random.randint(10, size=(3, len(cols1))), columns=cols1)
df = pd.concat([df1, df2], axis=1).reindex(columns=c)
print (df)
age job marital default balance housing loan contact day month duration \
0 9 a a d 5 d d d 6 a 5
1 4 a a c 2 b d d 7 c 1
2 3 a e e 2 a e b 1 b 2
campaign pdays previous poutcome conversion
0 6 4 6 a 6
1 3 4 9 d 4
2 0 7 1 c 9
object_df = df.select_dtypes(object)
print (object_df)
job marital default housing loan contact month poutcome
0 a a d d d d a a
1 a a c b d d c d
2 a e e a e b b c
int_df = df.select_dtypes(np.number)
print (int_df)
age balance day duration campaign pdays previous conversion
0 9 5 6 5 6 4 6 6
1 4 2 7 1 3 4 9 4
2 3 2 1 2 0 7 1 9
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
df = df.join(object_df[cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))
print (df)
age job marital default balance housing loan contact day month ... \
0 9 a a d 5 d d d 6 a ...
1 4 a a c 2 b d d 7 c ...
2 3 a e e 2 a e b 1 b ...
poutcome conversion job_fac marital_fac default_fac housing_fac \
0 a 6 0 0 0 0
1 d 4 0 0 1 1
2 c 9 0 1 2 2
loan_fac contact_fac month_fac poutcome_fac
0 0 0 0 0
1 0 0 1 1
2 1 1 2 2
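If you prefer an explicit, reusable function over the lambda, a minimal sketch could look like this (the helper name factorize_object_columns and the '_fac' suffix are my own choices, not part of the original answer):
import pandas as pd

def factorize_object_columns(df, suffix='_fac'):
    """Return a copy of df with an integer-coded column added for every object column."""
    out = df.copy()
    for col in df.select_dtypes(object).columns:
        # pd.factorize returns (codes, uniques); keep only the integer codes
        codes, _ = pd.factorize(df[col])
        out[col + suffix] = codes
    return out

# usage: df = factorize_object_columns(df)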

Related

How to add unique date values of a datetime64[ns] Series object

I have a column of type datetime64[ns] (df.timeframe).
df has columns ['id', 'timeframe', 'type']
df['type'] can be 'A' or 'B'
I want to get the total number of unique dates per df.type == 'A' and per df.id
I tried this:
df = df.groupby(['id', 'type']).timeframe.apply(lambda x: x.dt.date()).unique().rename('test').reset_index()
But got error:
TypeError: 'Series' object is not callable
What should I do?
You could use value_counts:
(df[df['type']=='A'].assign(timeframe=df['timeframe'].dt.date)
   .value_counts(['id','type','timeframe'], sort=False)
   .reset_index().rename(columns={0:'count'}))
id type timeframe count
0 1 A 2022-06-06 2
1 1 A 2022-06-08 1
2 1 A 2022-06-10 2
3 2 A 2022-06-07 1
4 2 A 2022-06-09 1
5 2 A 2022-06-10 1
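If what you need is just the total number of unique dates per id rather than the per-date counts, a minimal sketch under the same column-name assumptions could be:
# keep only type 'A' rows, then count distinct calendar dates per id
unique_dates = (
    df[df['type'] == 'A']
    .groupby('id')['timeframe']
    .apply(lambda s: s.dt.date.nunique())
    .rename('unique_dates')
    .reset_index()
)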

Convert string column to int pandas DataFrame

I have a DataFrame that has a column of unique strings, like below:
id customerId ...
1 vqUkxUDuEmB7gHWQvcYrBn
2 KaLEhwzZxCQ7GjPmVwBVav
3 pybDYgTiCUv3Pv3WLgxKCM
4 zqPiDV33KwrMBZoyeQXMJW
5 CR8z3ThPyzBKXFqqzemQAS
.
I want to replace the customerIds with ints, by a method like
# replace dataFrame.customerId[from start to end]
dataFrame.customerId.replace(sum(map(ord, ???)))
How can I do that?
Given something like
import pandas as pd
df = pd.DataFrame(columns=['UID'], index=range(7))
df.iloc[0,0] = 'vqUkxUDuEmB7gHWQvcYrBn'
df.iloc[1,0] = 'KaLEhwzZxCQ7GjPmVwBVav'
df.iloc[2,0] = 'pybDYgTiCUv3Pv3WLgxKCM'
df.iloc[3,0] = 'zqPiDV33KwrMBZoyeQXMJW'
df.iloc[4,0] = 'CR8z3ThPyzBKXFqqzemQAS'
df.iloc[5,0] = 'zqPiDV33KwrMBZoyeQXMJW' # equal to 3
df.iloc[6,0] = 'vqUkxUDuEmB7gHWQvcYrBn' # equal to 0
PS: I added 2 UIDs equal to previous ones to see that they'll be correctly categorized
You can use a categorical type:
df['UID_categorical'] = df.UID.astype('category').cat.codes
output
UID UID_categorical
0 vqUkxUDuEmB7gHWQvcYrBn 3
1 KaLEhwzZxCQ7GjPmVwBVav 1
2 pybDYgTiCUv3Pv3WLgxKCM 2
3 zqPiDV33KwrMBZoyeQXMJW 4
4 CR8z3ThPyzBKXFqqzemQAS 0
5 zqPiDV33KwrMBZoyeQXMJW 4
6 vqUkxUDuEmB7gHWQvcYrBn 3
where UID_categorical is int
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UID 7 non-null object
1 UID_categorical 7 non-null int8
dtypes: int8(1), object(1)
memory usage: 191.0+ bytes
If you want to replace the original column in place, just do
df['UID'] = df.UID.astype('category').cat.codes
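A pd.factorize-based alternative (my own sketch; note its codes follow order of first appearance, whereas .cat.codes follows the sorted category order):
import pandas as pd

# codes are assigned in order of first appearance; uniques holds the original strings
codes, uniques = pd.factorize(df['UID'])
df['UID_int'] = codes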

Drop whole groups by multiple columns if a specific value does not exist in another column in Pandas

How can I drop a whole city and district group if the date value 2018/11/1 does not exist for that group in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare the dates, test whether at least one value per group is True with GroupBy.transform('any'), and then filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If you get an error caused by missing values in the mask, one possible idea is to replace the missing values in the columns used for grouping:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add the possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
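If date is still a plain string column, as in the question's sample, a sketch of the same GroupBy.filter idea without the datetime conversion (my own variant, not from the original answer) could be:
# keep only the (city, district) groups that contain at least one '2018/11/1' row
df = df.groupby(['city', 'district']).filter(lambda g: g['date'].eq('2018/11/1').any())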

Pandas: How to extract only the latest date in a pivot table dataframe

How do I create a new dataframe which only includes, as index, the latest 'txn_date' for each 'day', based on the pivot table in the picture?
Thank you
d1 = pd.to_datetime(['2016-06-25'] *2 + ['2016-06-28']*4)
df = pd.DataFrame({'txn_date':pd.date_range('2012-03-05 10:20:03', periods=6),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'day':d1}).set_index(['day','txn_date'])
print (df)
B C D E
day txn_date
2016-06-25 2012-03-05 10:20:03 4 7 1 5
2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-07 10:20:03 4 9 5 6
2012-03-08 10:20:03 5 4 7 9
2012-03-09 10:20:03 5 2 1 2
2012-03-10 10:20:03 4 3 0 4
1.
I think you need to sort_index first if necessary, then groupby by the level day and aggregate with last:
df1 = df.sort_index().reset_index(level=1).groupby(level='day').last()
print (df1)
txn_date B C D E
day
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
2.
Filter by boolean indexing with duplicated:
#if necessary
df = df.sort_index()
df2 = df[~df.index.get_level_values('day').duplicated(keep='last')]
print(df2)
B C D E
day txn_date
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
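A third sketch along the same lines, using GroupBy.tail after sorting (assuming the same MultiIndex df as above; this keeps the full (day, txn_date) index just like the duplicated approach):
# after sorting, keep the last (latest txn_date) row within each 'day' level
df3 = df.sort_index().groupby(level='day').tail(1)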

TypeError: unorderable types: int() > str(): Series

From a search it seems like one can get this error in a whole host of different situations. Here is mine:
testDf['pA'] = priorDf.loc[testDf['Period']]['a'] + testDf['TotalPlays']
--> 743 sorter = uniques.argsort()
744
745 reverse_indexer = np.empty(len(sorter), dtype=np.int64)
TypeError: unorderable types: int() > str()
where priorDf.loc[testDf['Period']]['a'] is:
Period
2-17-1 1.120947
1-14-1 1.181726
7-19-1 1.935126
4-08-1 3.828184
3-14-1 0.668255
and testDf['TotalPlays'] is:
0 1
1 1
2 1
3 1
4 1
Both are of length 48.
----Additional Info-----
print (priorDf.dtypes)
mean float64
var float64
a float64
b float64
dtype: object
print (testDf.dtypes)
UserID int64
Period object
PlayCount int64
TotalPlays int64
TotalWks int64
Prob float64
pA int64
dtype: object
----- More Info ---------
print (priorDf['a'].head())
Period
1-00-1 0.889164
1-01-1 2.304074
1-02-1 0.281502
1-03-1 1.137781
1-04-1 2.335650
Name: a, dtype: float64
print (testDf[['Period','TotalPlays']].head())
Period TotalPlays
0 2-17-1 1
1 1-14-1 1
2 7-19-1 1
3 4-08-1 1
4 3-14-1 1
I also tried converting priorDf.loc[testDf['Period']]['a'] to type int (as it was a float) but still the same error.
I think you need to map with priorDf['a'] or with a dict created from it.
The problem was the different indexes of the DataFrames, so the data could not align.
#changed data of Period for match sample data
print (testDf)
Period TotalPlays
0 1-00-1 1
1 1-14-1 1
2 7-19-1 1
3 4-08-1 1
4 1-04-1 1
testDf['pA'] = testDf['Period'].map(priorDf['a']) + testDf['TotalPlays']
print (testDf)
Period TotalPlays pA
0 1-00-1 1 1.889164
1 1-14-1 1 NaN
2 7-19-1 1 NaN
3 4-08-1 1 NaN
4 1-04-1 1 3.335650
print (priorDf['a'].to_dict())
{'1-02-1': 0.28150199999999997, '1-01-1': 2.304074,
'1-00-1': 0.88916399999999995, '1-03-1': 1.1377809999999999,
'1-04-1': 2.3356499999999998}
testDf['pA'] = testDf['Period'].map(priorDf['a'].to_dict()) + testDf['TotalPlays']
print (testDf)
Period TotalPlays pA
0 1-00-1 1 1.889164
1 1-14-1 1 NaN
2 7-19-1 1 NaN
3 4-08-1 1 NaN
4 1-04-1 1 3.335650
So my conclusion after testing with randomly generated values S = pd.Series(np.random.randn(48)) is that when adding columns from two different Series together, they have to have the same index. Pandas automatically aligns them on the index behind the scenes. So in my case I had Period as the index for one, and Period as a column, not the index, for the other.
My re-written solution was:
testDf.set_index('Period', inplace=True)
testDf['pA'] = priorDf.loc[testDf.index]['a'] + testDf['TotalPlays']
testDf['pB'] = priorDf.loc[testDf.index]['b'] + testDf['TotalWks']-testDf['TotalPlays']
Thanks to jezrael for helping me get to the bottom of it.
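As a small illustration of the alignment behaviour behind those NaNs (made-up values, not the question's data): arithmetic between two Series aligns on index labels, not on position.
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['x', 'y'])
s2 = pd.Series([10, 20], index=['y', 'z'])

# only label 'y' exists in both Series, so 'x' and 'z' become NaN
print(s1 + s2)
# x     NaN
# y    12.0
# z     NaN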
