How to count unique date values of a datetime64[ns] Series object - python-3.x

I have a column of type datetime64[ns] (df.timeframe).
df has columns ['id', 'timeframe', 'type']
df['type'] can be 'A' or 'B'
I want to get the total number of unique dates per df.id, for rows where df.type == 'A'
I tried this:
df = df.groupby(['id', 'type']).timeframe.apply(lambda x: x.dt.date()).unique().rename('test').reset_index()
But got error:
TypeError: 'Series' object is not callable
What should I do?

You could use value_counts:
out = (df[df['type'] == 'A']
       .assign(timeframe=df['timeframe'].dt.date)
       .value_counts(['id', 'type', 'timeframe'], sort=False)
       .reset_index()
       .rename(columns={0: 'count'}))
id type timeframe count
0 1 A 2022-06-06 2
1 1 A 2022-06-08 1
2 1 A 2022-06-10 2
3 2 A 2022-06-07 1
4 2 A 2022-06-09 1
5 2 A 2022-06-10 1
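If what you need is literally the total number of unique dates per id for type 'A' (rather than a count per date), a minimal sketch of the direct fix is below. Note that .dt.date is an attribute, not a method; calling it with parentheses is what raised TypeError: 'Series' object is not callable.
# A minimal sketch, assuming the column names from the question.
# .dt.date (no parentheses) extracts the calendar date; nunique() then
# counts the distinct dates per id among rows of type 'A'.
unique_dates = (
    df[df['type'] == 'A']
    .groupby('id')['timeframe']
    .apply(lambda s: s.dt.date.nunique())
    .rename('unique_dates')
    .reset_index()
)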

Related

How to build a null notnull matrix in pandas dataframe

Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
Here's my expected output
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
Assuming Null is NaN, here's one option: use isna + sum to count the NaNs, then subtract those counts from the DataFrame length to get the Notnull counts, and construct a DataFrame from the two.
nulls = df.drop(columns='Id').isna().sum()
notnulls = nulls.rsub(len(df))
out = pd.DataFrame.from_dict({'Null':nulls, 'Notnull':notnulls}, orient='index')
out['Total'] = out.sum(axis=1)
If you're into one liners, we could also do:
out = (df.drop(columns='Id').isna().sum().to_frame(name='Nulls')
.assign(Notnull=df.drop(columns='Id').notna().sum()).T
.assign(Total=lambda x: x.sum(axis=1)))
Output:
Column_A Column_B Column_C Total
Nulls 2 1 2 5
Notnull 2 3 2 7
Use value_counts on the boolean notna mask of each column:
df = (df.replace('Null', np.nan)
        .set_index('Id')
        .notna()
        .apply(pd.value_counts)
        .rename({True: 'Notnull', False: 'Null'}))
df['Total'] = df.sum(axis=1)
print(df)
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
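Another way to arrange the same counts, as a sketch (again assuming Null means NaN), is to stack the two count Series with pd.concat:
# Count NaNs and non-NaNs per column, stack them as rows, and add a Total column.
counts = df.drop(columns='Id')
out = pd.concat({'Null': counts.isna().sum(), 'Notnull': counts.notna().sum()}, axis=1).T
out['Total'] = out.sum(axis=1)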

Complex group by using Pandas

I am facing a situation where I need to group a dataframe by a column 'ID' and also calculate the total time taken for that particular ID to complete. I only want the difference between Date_Open and Date_Closed for each ID, together with the count of rows for that ID.
We only need to focus on the Date_Open and Date_Closed fields, so it needs to take the max closing date and the min open date per ID and subtract the two.
The dataframe looks as follows:
ID Date_Open Date_Closed
1 01/01/2019 02/01/2019
1 07/01/2019 09/01/2019
2 10/01/2019 11/01/2019
2 13/01/2019 19/01/2019
3 10/01/2019 11/01/2019
The output should look like this :
ID Count_of_ID Total_Time_In_Days
1 2 8
2 2 9
3 1 1
How should I achieve this ?
Using GroupBy.agg with named aggregation and the min and max of the dates:
df[['Date_Open', 'Date_Closed']] = (
    df[['Date_Open', 'Date_Closed']].apply(lambda x: pd.to_datetime(x, format='%d/%m/%Y'))
)
dfg = df.groupby('ID').agg(
    Count_of_ID=('ID', 'size'),
    Date_Open=('Date_Open', 'min'),
    Date_Closed=('Date_Closed', 'max')
)
dfg['Total_Time_In_Days'] = dfg['Date_Closed'].sub(dfg['Date_Open']).dt.days
dfg = dfg.drop(columns=['Date_Closed', 'Date_Open']).reset_index()
ID Count_of_ID Total_Time_In_Days
0 1 2 8
1 2 2 9
2 3 1 1
Now we have Total_Time_In_Days as int:
print(dfg.dtypes)
ID int64
Count_of_ID int64
Total_Time_In_Days int64
dtype: object
This can also be used:
df['Date_Open'] = pd.to_datetime(df['Date_Open'], dayfirst=True)
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'], dayfirst=True)
df_grouped = df.groupby(by='ID').count()
df_grouped['Total_Time_In_Days'] = df.groupby(by='ID')['Date_Closed'].max() - df.groupby(by='ID')['Date_Open'].min()
df_grouped = df_grouped.drop(columns=['Date_Open'])
df_grouped.columns=['Count', 'Total_Time_In_Days']
print(df_grouped)
Count Total_Time_In_Days
ID
1 2 8 days
2 2 9 days
3 1 1 days
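If you want an integer day count rather than the Timedelta shown above, a small follow-up (a sketch reusing df_grouped from this answer) is:
# Convert the Timedelta column to whole days so it matches the expected output.
df_grouped['Total_Time_In_Days'] = df_grouped['Total_Time_In_Days'].dt.days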
I'd first create a column showing how much time passed from Date_Open to Date_Closed for each row of the dataframe, like this:
df['Total_Time_In_Days'] = df['Date_Closed'] - df['Date_Open']
Then you can use groupby:
df.groupby('ID').agg({'ID': 'count', 'Total_Time_In_Days': 'sum'})
If you need any help with the .agg function, you can refer to its official documentation here.

Unable to write function for df.columns to factorize()

I have a dataframe df:
age 45211 non-null int64
job 45211 non-null object
marital 45211 non-null object
default 45211 non-null object
balance 45211 non-null int64
housing 45211 non-null object
loan 45211 non-null object
contact 45211 non-null object
day 45211 non-null int64
month 45211 non-null object
duration 45211 non-null int64
campaign 45211 non-null int64
pdays 45211 non-null int64
previous 45211 non-null int64
poutcome 45211 non-null object
conversion 45211 non-null int64
I want to do two things:
(1) I want to create two sub-dataframes which will be automatically separated by dtype=object and dtype=int64. I thought of something like this:
object_df = []
int_df = []
for col in df.columns:
    if df[col].dtype == object:
        object_df.append(col)   # add column to object_df
    else:
        int_df.append(col)      # add column to int_df
(2) Next, I want to take the columns ['job','marital','default','housing','loan','contact','month','poutcome'] from object_df and write a function that factorizes each one, so that categories are converted to numbers. I thought of something like this:
job_values, job_labels = df['job'].factorize()
df['job_fac'] = job_values
Since I would have to copy and paste those for all columns in the object_df, is there a way to write a neat dynamic function?
Use DataFrame.select_dtypes first:
object_df = df.select_dtypes(object)
int_df = df.select_dtypes(np.number)
Then apply a lambda that factorizes each column, add a suffix with DataFrame.add_suffix, and DataFrame.join the result to the original DataFrame:
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
df = df.join(object_df[cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))
Sample:
np.random.seed(2020)
c = ['age', 'job', 'marital', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'conversion']
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
cols1 = np.setdiff1d(c, cols)
df1 = pd.DataFrame(np.random.choice(list('abcde'), size=(3, len(cols))), columns=cols)
df2 = pd.DataFrame(np.random.randint(10, size=(3, len(cols1))), columns=cols1)
df = pd.concat([df1, df2], axis=1).reindex(columns=c)
print (df)
age job marital default balance housing loan contact day month duration \
0 9 a a d 5 d d d 6 a 5
1 4 a a c 2 b d d 7 c 1
2 3 a e e 2 a e b 1 b 2
campaign pdays previous poutcome conversion
0 6 4 6 a 6
1 3 4 9 d 4
2 0 7 1 c 9
object_df = df.select_dtypes(object)
print (object_df)
job marital default housing loan contact month poutcome
0 a a d d d d a a
1 a a c b d d c d
2 a e e a e b b c
int_df = df.select_dtypes(np.number)
print (int_df)
age balance day duration campaign pdays previous conversion
0 9 5 6 5 6 4 6 6
1 4 2 7 1 3 4 9 4
2 3 2 1 2 0 7 1 9
cols = ['job','marital','default','housing','loan','contact','month','poutcome']
df = df.join(object_df[cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))
print (df)
age job marital default balance housing loan contact day month ... \
0 9 a a d 5 d d d 6 a ...
1 4 a a c 2 b d d 7 c ...
2 3 a e e 2 a e b 1 b ...
poutcome conversion job_fac marital_fac default_fac housing_fac \
0 a 6 0 0 0 0
1 d 4 0 0 1 1
2 c 9 0 1 2 2
loan_fac contact_fac month_fac poutcome_fac
0 0 0 0 0
1 0 0 1 1
2 1 1 2 2
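If you'd rather not hard-code the column list at all, a small variation (a sketch, not part of the original answer) factorizes every object-dtype column automatically:
# Factorize every object column and join the encoded copies back with a '_fac' suffix.
object_cols = df.select_dtypes(object).columns
df = df.join(df[object_cols].apply(lambda x: pd.factorize(x)[0]).add_suffix('_fac'))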

TypeError: unorderable types: int() > str(): Series

From a search it seems like one can get this error in a whole host of different situations. Here is mine:
testDf['pA'] = priorDf.loc[testDf['Period']]['a'] + testDf['TotalPlays']
--> 743 sorter = uniques.argsort()
744
745 reverse_indexer = np.empty(len(sorter), dtype=np.int64)
TypeError: unorderable types: int() > str()
where priorDf.loc[testDf['Period']]['a'] is:
Period
2-17-1 1.120947
1-14-1 1.181726
7-19-1 1.935126
4-08-1 3.828184
3-14-1 0.668255
and testDf['TotalPlays'] is:
0 1
1 1
2 1
3 1
4 1
Both are of length 48.
----Additional Info-----
print (priorDf.dtypes)
mean float64
var float64
a float64
b float64
dtype: object
print (testDf.dtypes)
UserID int64
Period object
PlayCount int64
TotalPlays int64
TotalWks int64
Prob float64
pA int64
dtype: object
----- More Info ---------
print (priorDf['a'].head())
Period
1-00-1 0.889164
1-01-1 2.304074
1-02-1 0.281502
1-03-1 1.137781
1-04-1 2.335650
Name: a, dtype: float64
print (testDf[['Period','TotalPlays']].head())
Period TotalPlays
0 2-17-1 1
1 1-14-1 1
2 7-19-1 1
3 4-08-1 1
4 3-14-1 1
I also tried converting priorDf.loc[testDf['Period']]['a'] to type int (it was a float), but I still got the same error.
I think you need to map using priorDf['a'], or a dict created from it.
The problem was that the two DataFrames have different indexes, so the data cannot align.
# changed the Period data so it matches the sample data
print (testDf)
Period TotalPlays
0 1-00-1 1
1 1-14-1 1
2 7-19-1 1
3 4-08-1 1
4 1-04-1 1
testDf['pA'] = testDf['Period'].map(priorDf['a']) + testDf['TotalPlays']
print (testDf)
Period TotalPlays pA
0 1-00-1 1 1.889164
1 1-14-1 1 NaN
2 7-19-1 1 NaN
3 4-08-1 1 NaN
4 1-04-1 1 3.335650
print (priorDf['a'].to_dict())
{'1-02-1': 0.28150199999999997, '1-01-1': 2.304074,
'1-00-1': 0.88916399999999995, '1-03-1': 1.1377809999999999,
'1-04-1': 2.3356499999999998}
testDf['pA'] = testDf['Period'].map(priorDf['a'].to_dict()) + testDf['TotalPlays']
print (testDf)
Period TotalPlays pA
0 1-00-1 1 1.889164
1 1-14-1 1 NaN
2 7-19-1 1 NaN
3 4-08-1 1 NaN
4 1-04-1 1 3.335650
So my conclusion, after testing with randomly generated values (S = pd.Series(np.random.randn(48))), is that when adding two Series together they have to share the same index; pandas aligns them on the index behind the scenes. In my case, Period was the index of one but only a column, not the index, of the other.
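A tiny illustration of that alignment behaviour (toy data, not from the question): adding two Series matches index labels, not positions, and unmatched labels produce NaN.
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
print(s1 + s2)   # a: NaN, b: 12.0, c: NaN -- only the shared label 'b' lines up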
My re-written solution was:
testDf.set_index('Period', inplace=True)
testDf['pA'] = priorDf.loc[testDf.index]['a'] + testDf['TotalPlays']
testDf['pB'] = priorDf.loc[testDf.index]['b'] + testDf['TotalWks']-testDf['TotalPlays']
Thanks to jezrael for helping me get to the bottom of it.

How do I reformat dates in a CSV to just show MM/YYYY

Using Python 3 and pandas, I am spending an embarrassing amount of time trying to figure out how to take a column of dates from a CSV and make a new column with just MM/YYYY or YYYY/MM/01.
The data looks like Col1 but I am trying to produce Col2:
Col1 Col2
2/12/2017 2/1/2017
2/16/2017 2/1/2017
2/28/2017 2/1/2017
3/2/2017 3/1/2017
3/13/2017 3/1/2017
I am able to parse the year and month out:
df['Month'] = pd.DatetimeIndex(df['File_Processed_Date']).month
df['Year'] = pd.DatetimeIndex(df['File_Processed_Date']).year
df['Period'] = df['Month'] + '/' + df['Year']
That last line is wrong. Is there a clever Python way to just show 2/2017?
I get the error: "TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('
Update, answer by piRsquared:
d = pd.to_datetime(df.File_Processed_Date)
df['Period'] = d.dt.strftime('%m/1/%Y')
This will create a pandas column in a dataframe that converts Col1 into Col2 successfully. Thanks!
Let d be just 'Col1' converted to Timestamp:
d = pd.to_datetime(df.Col1)
then
d.dt.strftime('%m/1/%Y')
0 02/1/2017
1 02/1/2017
2 02/1/2017
3 03/1/2017
4 03/1/2017
Name: Col1, dtype: object
d.dt.strftime('%m/%Y')
0 02/2017
1 02/2017
2 02/2017
3 03/2017
4 03/2017
Name: Col1, dtype: object
d.dt.strftime('%Y/%m/01')
0 2017/02/01
1 2017/02/01
2 2017/02/01
3 2017/03/01
4 2017/03/01
Name: Col1, dtype: object
d - pd.offsets.MonthBegin()
0 2017-02-01
1 2017-02-01
2 2017-02-01
3 2017-03-01
4 2017-03-01
Name: Col1, dtype: datetime64[ns]
The function you are looking for is strftime.
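If you want an actual datetime (the first of the month) rather than a formatted string, an alternative sketch (assuming Col1 parses cleanly with pd.to_datetime) is:
# Snap each date to the first day of its month, keeping datetime64 dtype.
df['Col2'] = pd.to_datetime(df['Col1']).dt.to_period('M').dt.to_timestamp()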
