convert date to week and count the dependencies from different columns of a dataframe - python-3.x

I have a dataframe like this:
date        Company  Email
2019-10-07  abc      mr1#abc.com
2019-10-07  def      mr1#def.com
2019-10-07  abc      mr1#abc.com
2019-10-08  xyz      mr1#xyz.com
2019-10-08  abc      mr2#abc.com
2019-10-15  xyz      mr2#xyz.com
2019-10-15  def      mr1#def.com
2019-10-17  xyz      mr1#xyz.com
2019-10-17  abc      mr2#abc.com
I have to create 2 dataframes like this:
dataframe 1:
Weeks                abc  def  xyz
october7-october14     3    1    1
october15-october22    1    1    2
and dataframe2: unique counts for Emails, also week-wise:
Weeks                Company  Email_ID     count
october7-october14   abc      mr1#abc.com  2
                              mr2#abc.com  1
                     def      mr1#def.com  1
                     xyz      mr1#xyz.com  1
october15-october22  abc      mr2#abc.com  1
                     def      mr1#def.com  1
                     xyz      mr1#xyz.com  1
                              mr2#xyz.com  1
Below is the code I tried for creating dataframe1:
df1['Date'] = pd.to_datetime(df1['date']) - pd.to_timedelta(7, unit='d')
df1 = df1.groupby(['Company', pd.Grouper(key='Date', freq='W-MON')])['Email_ID'].count().sum().reset_index().sort_values('Date')
Company  Date        Email_ID
abc      2019-10-07  mr1#abc.com.mr1#abc.com.mr2#abc.com
def      2019-10-07  mr1#def.com
xyz      2019-10-07  mr1#xyz.com
abc      2019-10-15  mr2#abc.com
def      2019-10-15  mr1#def.com
xyz      2019-10-15  mr1#xyz.com.mr2#xyz.com
Here the sum concatenates the Email_ID strings instead of producing numerical counts, so I am not able to represent my data as I want in dataframe1 and dataframe2.
Please provide insights on how I can represent my data as dataframe1 and dataframe2.
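For reproducibility, the sample data above can be constructed like this (this construction is mine, added for convenience; it is not part of the original question):
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2019-10-07', '2019-10-07', '2019-10-07', '2019-10-08', '2019-10-08',
             '2019-10-15', '2019-10-15', '2019-10-17', '2019-10-17'],
    'Company': ['abc', 'def', 'abc', 'xyz', 'abc', 'xyz', 'def', 'xyz', 'abc'],
    'Email': ['mr1#abc.com', 'mr1#def.com', 'mr1#abc.com', 'mr1#xyz.com', 'mr2#abc.com',
              'mr2#xyz.com', 'mr1#def.com', 'mr1#xyz.com', 'mr2#abc.com'],
})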

Grouper needs datetimes, so the format of the datetimes is changed with MultiIndex.set_levels after the aggregation; closed='left' is also added for left-closed bins:
df1['date'] = pd.to_datetime(df1['date'])
df2 = df1.groupby([pd.Grouper(key='date', freq='W-MON', closed='left'),
                   'Company',
                   'Email'])['Email'].count()

new = ((df2.index.levels[0] - pd.to_timedelta(7, unit='d')).strftime('%B%d') + ' - ' +
       df2.index.levels[0].strftime('%B%d'))
df2.index = df2.index.set_levels(new, level=0)
print (df2)
date                   Company  Email
October07 - October14  abc      mr1#abc.com    2
                                mr2#abc.com    1
                       def      mr1#def.com    1
                       xyz      mr1#xyz.com    1
October14 - October21  abc      mr2#abc.com    1
                       def      mr1#def.com    1
                       xyz      mr1#xyz.com    1
                                mr2#xyz.com    1
Name: Email, dtype: int64
For the first DataFrame, sum over the first and second index levels and reshape with Series.unstack:
df3 = df2.sum(level=[0,1]).unstack(fill_value=0)
print (df3)
Company                abc  def  xyz
date
October07 - October14    3    1    1
October14 - October21    1    1    2
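Note: Series.sum(level=...) was removed in pandas 2.0; with a recent pandas the same reshape can be written as (equivalent form, not from the answer above):
df3 = df2.groupby(level=[0, 1]).sum().unstack(fill_value=0)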

Add a column that maps the date to a week.
Then group on it, e.g. df.groupby(df.week).count().
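A minimal sketch of that idea (my own elaboration, not the answerer's code), assuming df is the question's dataframe with columns date, Company and Email:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
week_start = df['date'].dt.to_period('W-SUN').dt.start_time   # the Monday that starts each week
df['week'] = (week_start.dt.strftime('%B%d') + ' - '
              + (week_start + pd.Timedelta(days=7)).dt.strftime('%B%d'))

# dataframe1: one row per week, one column per Company, cell = number of rows
df_1 = df.groupby(['week', 'Company']).size().unstack(fill_value=0)

# dataframe2: counts per week, Company and Email
df_2 = df.groupby(['week', 'Company', 'Email']).size().rename('count')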

Related

How to count unique values in one column based on value in another column by group in Pandas

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI TE203 0
GHI TE203 0
I want to count the number of unique IDs by GroupID, but only when the Value column is > 0. When all values for a GroupID are 0, it should simply show 0. For example, the result dataset would look like this:
GroupID UniqueNum
ABC 1
DEF 1
GHI 0
I've tried the following, but it simply returns the unique number of IDs regardless of the value. How do I add the condition Value > 0?
count_df = df.groupby(['GroupID'])['ID'].nunique()
positive counts only
You can use pre-filtering with loc and named aggregation with groupby.agg(UniqueNum='nunique'):
(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
all counts (including zero)
If you want to count the groups with no match as zero, you can reindex:
(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reindex(df['GroupID'].unique(), fill_value=0)
   .reset_index()
)
Or mask:
(df['ID'].where(df['Value'].gt(0))
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
2 GHI 0
Used input:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI AB123 0
If you need 0 for non-matched values, use Series.where to set NaN where the condition does not hold, then aggregate with GroupBy.nunique:
df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})

df = (df['ID'].where(df["Value"].gt(0))
              .groupby(df['GroupID'])
              .nunique()
              .reset_index(name='nunique'))
print (df)
GroupID nunique
0 ABC 1
1 DEF 1
2 NEW 0
How it works:
print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
GroupID ID Value new
0 ABC TX123 0 NaN
1 ABC TX678 1 TX678
2 ABC TX678 2 TX678
3 DEF AG123 1 AG123
4 DEF AG123 1 AG123
5 NEW AG123 0 NaN <- set NaN for non matched condition

How to find how many times a given value repeats in a pandas Series if the Series contains lists of string values?

Example 1:
suppose we have a record data series
record
['ABC' ,'GHI']
['ABC' , 'XYZ']
['XYZ','PQR']
I want to calculate how many times each value repeats in the record series, like:
value Count
'ABC' 2
'XYZ' 2
'GHI' 1
'PQR' 1
In the record series, 'ABC' and 'XYZ' each repeat 2 times; 'GHI' and 'PQR' each repeat 1 time.
Example 2:
below is the new dataframe.
teams
0 ['Australia', 'Sri Lanka']
1 ['Australia', 'Sri Lanka']
2 ['Australia', 'Sri Lanka']
3 ['Ireland', 'Hong Kong']
4 ['Zimbabwe', 'India']
... ...
1412 ['Pakistan', 'Sri Lanka']
1413 ['Bangladesh', 'India']
1414 ['United Arab Emirates', 'Netherlands']
1415 ['Sri Lanka', 'Australia']
1416 ['Sri Lanka', 'Australia']
Now if I apply
print(new_df.explode('teams').value_counts())
it gives me
teams
['England', 'Pakistan'] 29
['Australia', 'Pakistan'] 26
['England', 'Australia'] 25
['Australia', 'India'] 24
['England', 'West Indies'] 23
... ..
['Namibia', 'Sierra Leone'] 1
['Namibia', 'Scotland'] 1
['Namibia', 'Oman'] 1
['Mozambique', 'Rwanda'] 1
['Afghanistan', 'Bangladesh'] 1
Length: 399, dtype: int64
But I want
team occurrence of team
India ?
England ?
Australia ?
... ...
I want the occurrence of each team from the dataframe.
How to perform this task?
Try explode and value_counts
On Series:
import pandas as pd
s = pd.Series({0: ['ABC', 'GHI'],
               1: ['ABC', 'XYZ'],
               2: ['XYZ', 'PQR']})
r = s.explode().value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
dtype: int64
On DataFrame:
import pandas as pd
df = pd.DataFrame({'record': {0: ['ABC', 'GHI'],
                              1: ['ABC', 'XYZ'],
                              2: ['XYZ', 'PQR']}})
r = df.explode('record')['record'].value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
Name: record, dtype: int64
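For the teams dataframe in Example 2, the same pattern applied to the column (rather than the whole frame) should give per-team counts. A minimal sketch, assuming new_df['teams'] holds real Python lists; if the column actually stores string representations such as "['Australia', 'Sri Lanka']", it would need to be parsed first, e.g. with ast.literal_eval (that parsing step is an assumption, not part of the original answer):
import ast

# only needed if the values are strings rather than real lists
# new_df['teams'] = new_df['teams'].apply(ast.literal_eval)

team_counts = (new_df['teams'].explode()
                              .value_counts()
                              .rename_axis('team')
                              .reset_index(name='occurrence of team'))
print(team_counts)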

How to get the top values within each group?

I am new to Pandas and I have a dataset that looks something like this.
s_name Time p_name qty
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0
I am trying to group by s_name and find the sum of the qty of each unique p_name in a month, but only display the p_names with the top 2 quantities. Below is an example of how I want the final output to look.
s_name Time p_name qty
A 01 DEF 4
A 01 ABC 2
B 02 ABC 3
B 02 DEF 2
B 03 ABC 2
B 03 FGH 0
Do you have any ideas? I have been stuck here for quite a while, so any help is appreciated.
Create a month column using dt, group by s_name and month, then apply a function to each group: group it by p_name, sum qty, sort_values descending, and keep only the first two rows with head:
df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y')
df['month'] = df.Time.dt.month
df_f = df.groupby(['s_name', 'month']).apply(
    lambda df: df.groupby('p_name').qty.sum()
                 .sort_values(ascending=False)
                 .head(2)
).reset_index()
df_f
# s_name month p_name qty
# 0 A 1 DEF 4
# 1 A 1 ABC 2
# 2 B 2 ABC 3
# 3 B 2 DEF 2
# 4 B 3 ABC 3
# 5 B 3 FGH 0
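An equivalent without apply (my own variant, not from the answer above), reusing the month column created above: aggregate first, sort, then keep the top two rows per (s_name, month) group with GroupBy.head:
out = (df.groupby(['s_name', 'month', 'p_name'], as_index=False)['qty'].sum()
         .sort_values(['s_name', 'month', 'qty'], ascending=[True, True, False])
         .groupby(['s_name', 'month'])
         .head(2)
         .reset_index(drop=True))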
I am new to Pandas myself. I am going to attempt to answer your question.
See this code.
from io import StringIO
import pandas as pd
columns = "s_name Time p_name qty"
# Create dataframe from text.
df = pd.read_csv(
    StringIO(
        f"""{columns}
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0"""
    ),
    sep=" ",
)
S_NAME, TIME, P_NAME, QTY = columns.split()
MONTH = "month"
# Convert the TIME col to datetime types.
df.Time = pd.to_datetime(df.Time, dayfirst=True)
# Create a month column with zfilled strings.
df[MONTH] = df.Time.apply(lambda x: str(x.month).zfill(2))
# Group
group = df.groupby(by=[S_NAME, P_NAME, MONTH])
gdf = (
    group.sum()
    .sort_index()
    .sort_values(by=[S_NAME, MONTH, QTY], ascending=False)
    .reset_index()
)
gdf.groupby([S_NAME, MONTH]).head(2).sort_values(by=[S_NAME, MONTH]).reset_index()
Is this the result you expected?

Restructure dataframe based on given keys

I'm working on a dataset and after all the cleaning and restructuring I have arrived at a situation where the dataset looks like below.
import pandas as pd
df = pd.read_csv('data.csv', dtype={'freq_no': object, 'sequence': object, 'field': object})
print(df)
CSV URL: https://pastebin.com/raw/nkDHEXQC
id year period freq_no sequence file_date data_date field \
0 abcdefghi 2018 A 001 001 20180605 20180331 05210
1 abcdefghi 2018 A 001 001 20180605 20180331 05210
2 abcdefghi 2018 A 001 001 20180605 20180331 05210
3 abcdefghi 2018 A 001 001 20180605 20180330 05220
4 abcdefghi 2018 A 001 001 20180605 20180330 05220
5 abcdefghi 2018 A 001 001 20180605 20180330 05230
6 abcdefghi 2018 A 001 001 20180605 20180330 05230
value note_type note transaction_type
0 200.0 NaN NaN A
1 NaN B {05210_B:ABC} A
2 NaN U {05210_U:DEFF} D
3 200.0 NaN NaN U
4 NaN U {05220_U:xyz} D
5 100.0 NaN NaN D
6 NaN U {05230_U:lmn} A
I want to restructure above so that it looks like below.
Logic:
Use id, year, period, freq_no, sequence, data_date as key (groupby?)
Pivot so that each field value becomes a column, filled with the corresponding value entries
Create a combined_note column by concatenating note (for same key)
Create a deleted column which will show which note or value was deleted based on transaction_type D.
Output:
id year period freq_no sequence file_date data_date 05210 \
0 abcdefghi 2018 A 001 001 20180605 20180331 200.0
1 abcdefghi 2018 A 001 001 20180605 20180330 NaN
05220 05230 combined_note deleted
0 NaN NaN {05210_B:ABC}{05210_U:DEFF} note{05210_U:DEFF} #because for note 05210_U:DEFF the trans_type was D
1 200.0 100.0 {05220_U:xyz}{05230_U:lmn} note{05220_U:xyz}|05230 #because for note {05220_U:xyz} trans_type is D, we also show field (05230) here separated by pipe because for that row the trans_type is D
I think this can be done by using set_index on the key and then restructuring the other columns, but I wasn't able to get the desired output.
So I ended up having to do this with a merge.
Logical Steps:
Group the DataFrame by all fields except note and value, so that the field and transaction_type columns are not affected by the aggregation.
Add a deleted column.
First DataFrame that contains the aggregation of the notes(deleted as well).
Second DataFrame to transform field and value to multiple columns.
Merge first and second data frame on the index.
Code:
import pandas as pd
import io
# pd.set_option('display.height', 1000)  # 'display.height' no longer exists in current pandas
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# url = "https://pastebin.com/raw/nkDHEXQC"
csv_string = b"""id,year,period,freq_no,sequence,file_date,data_date,field,value,note_type,note,transaction_type
abcdefghi,2018,A,001,001,20180605,20180331,05210,200,,,A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,B,{05210_B:ABC},A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,U,{05210_U:DEFF},D
abcdefghi,2018,A,001,001,20180605,20180330,05220,200,,,U
abcdefghi,2018,A,001,001,20180605,20180330,05220,,U,{05220_U:xyz},D
abcdefghi,2018,A,001,001,20180605,20180330,05230,100,,,D
abcdefghi,2018,A,001,001,20180605,20180330,05230,,U,{05230_U:lmn},A
"""
data = io.BytesIO(csv_string)
df = pd.read_csv(data, dtype={'freq_no': object, 'sequence': object, 'field': object})
# so the aggregation function will work
df['note'] = df['note'].fillna('')
grouped = df.groupby(
    ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date',
     'field', 'transaction_type']).agg(['sum'])
grouped.columns = grouped.columns.droplevel(1)
grouped.reset_index(['field', 'transaction_type'], inplace=True)
gcolumns = ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date']
def is_deleted(note, trans_type, field):
    """Determines if a note is deleted"""
    deleted = []
    for val, val2 in zip(note, trans_type):
        if val != "":
            if val2 == 'D':
                deleted.append(val)
            else:
                deleted.append('')
        else:
            deleted.append('')
    return pd.Series(deleted, index=note.index)
# This function will add the deleted notes
# I am not sure about the pipe separator, I will leave that to you
grouped['deleted'] = is_deleted(grouped['note'], grouped['transaction_type'], grouped['field'])
# This will obtain all agg of all the notes and deleted
notes = grouped.drop(['field', 'transaction_type', 'value'], axis=1).reset_index().groupby(gcolumns).agg(sum)
# converts two columns into new columns using specified table
# using pivot table to take advantage of the multi index
stacked_values = grouped.pivot_table(index=gcolumns, columns='field', values='value')
# finally merge the notes and stacked_value on their index
final = stacked_values.merge(notes, left_index=True, right_index=True).rename(columns={'note': 'combined_note'}).reset_index()
Output:
final
id year period freq_no sequence data_date file_date 05210 05220 05230 combined_note deleted
0 abcdefghi 2018 A 001 001 20180330 20180605 NaN 200.0 100.0 {05220_U:xyz}{05230_U:lmn} {05220_U:xyz}
1 abcdefghi 2018 A 001 001 20180331 20180605 200.0 NaN NaN {05210_B:ABC}{05210_U:DEFF} {05210_U:DEFF}

New column within a Pandas DataFrame with respect to duplicates in a given column

Hi, I have a dataframe with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to the id, and the number should be incremented for every duplicate. The output should be like below:
id id1
abc abc1
def def1
ghi ghi1
abc abc2
abc abc3
xyz xyz1
def def2
Can anyone suggest a solution for this?
Use groupby.cumcount to count the ids per group, add 1, convert to strings, and append to id:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
id id1
0 abc abc1
1 def def1
2 ghi ghi1
3 abc abc2
4 abc abc3
5 xyz xyz1
6 def def2
Detail:
print (df.groupby('id').cumcount())
0 0
1 0
2 0
3 1
4 2
5 0
6 1
dtype: int64
