Map Pandas Series Containing key/value pairs to new columns with data

I have a dataframe containing a pandas series (column 2) as below:
column 1 | column 2 | column 3
1123 | Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request | INC29192
1251 | NaN | INC18217
1918 | Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request | INC19281
I'm struggling to extract, split and map the column 2 data to a series of new columns, named after the keys and holding the appropriate data for each record (where possible, that is, where data is available, since some rows are NaN).
The desired output is something like this (where I've dropped the column 2 data for legibility):
column 1 | column 3 | Requested By | Requested On | Comments
1123 | INC29192 | John Doe 1 | 12 October 2021 | This is a generic request
1251 | INC18217 | NaN | NaN | NaN
1918 | INC19281 | John Doe 2 | 2 September 2021 | This is another generic request
I have spent quite some time trying various approaches, from lambda functions to comprehensions to explode methods, but haven't quite found a solution that provides the desired output.

First I would convert column 2 values to dictionaries and then convert them to Dataframes and join them to your df:
df['column 2'] = df['column 2'].apply(
    lambda x: {y.split(' = ', 1)[0]: y.split(' = ', 1)[1]
               for y in x.split(r'\n ')}
    if not pd.isna(x) else {})
df = df.join(pd.DataFrame(df['column 2'].values.tolist())).drop('column 2', axis=1)
print(df)
Output:
column 1 column 3 Requested By Requested On Comments
0 1123 INC29192 John Doe 1 12 October 2021 This is a generic request
1 1251 INC18217 NaN NaN NaN
2 1918 INC19281 John Doe 2 2 September 2021 This is another generic request
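For a self-contained run of the same idea, the sketch below rebuilds the sample frame and parses column 2 with a small helper. The helper name parse_pairs is hypothetical, and the sample strings are assumed to contain a literal backslash followed by n (as the r'\n ' split above suggests) rather than real newline characters. Chunks without an " = " separator are skipped instead of raising.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column 1": [1123, 1251, 1918],
    "column 2": [
        r"Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request",
        np.nan,
        r"Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request",
    ],
    "column 3": ["INC29192", "INC18217", "INC19281"],
})

def parse_pairs(text):
    """Turn 'key = value' chunks into a dict; missing input gives an empty dict."""
    if pd.isna(text):
        return {}
    result = {}
    # Assumption: records are separated by a literal backslash-n plus a space.
    for chunk in text.split(r"\n "):
        key, sep, value = chunk.partition(" = ")
        if sep:  # ignore chunks that have no ' = ' separator
            result[key.strip()] = value.strip()
    return result

# One dict per row becomes one row of new columns; empty dicts become all-NaN rows.
parsed = pd.DataFrame(df["column 2"].map(parse_pairs).tolist(), index=df.index)
out = df.drop(columns="column 2").join(parsed)
print(out)
```

Passing index=df.index when building the parsed frame keeps the join aligned even if df does not have a default 0..n-1 index.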

Related

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the combination of rows from the df2 column Sales whose sum is closest to the target value in the df1 column Total Sales. The columns Name and Date must be the same in both dataframes when joining (as shown in the expected output).
For example, row 0 in df1 should be matched only with rows 0 and 1 in df2, since they share the same Name and Date (Name: John, Date: 2021-10-01).
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it to pandas dataframes to get the expected output.
The variable numbers in the script below represents the Sales column in df2, and the variable target represents the Total Sales column in df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0

# Try every combination size, from the empty set up to all numbers
for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        total = sum(combination)
        result = target - total
        # Keep the combination whose sum is closest to the target
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = total

print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote to find the best sum and turn it into a function (let's call it opt) with parameters for the target and a dataframe (which will be a subset of df2). It needs to return a list of the IDs that correspond to the optimal combination.
Write another function (let's call it calc) that takes three arguments: name, date and target. This function filters df2 by name and date, passes the result along with the target to opt, and returns what opt returns. Finally, iterate through the rows of df1 and call calc with each row's values (or, alternatively, use pandas.DataFrame.apply).
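The two-function approach described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: opt here returns the best sum alongside the IDs (so the Comb Total column can be filled), and calc takes df2 as an explicit fourth argument instead of reading a global; both are choices made for this sketch. Only part of the sample data is used to keep it short. The search is brute force, so it grows exponentially with the number of rows per Name/Date group.

```python
import itertools
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "John", "Jack"],
    "Date": ["2021-10-01", "2021-11-01", "2021-10-10"],
    "Total Sales": [15500, 5500, 17600],
})
df2 = pd.DataFrame({
    "ID": ["JO1", "JO2", "JO3", "JO4", "JA1", "JA2"],
    "Name": ["John", "John", "John", "John", "Jack", "Jack"],
    "Date": ["2021-10-01", "2021-10-01", "2021-11-01", "2021-11-01",
             "2021-10-10", "2021-10-10"],
    "Sales": [10000, 5000, 1000, 5500, 10000, 7000],
})

def opt(target, subset):
    """Return (ids, total) for the row combination whose Sales sum is closest to target."""
    rows = list(zip(subset["ID"], subset["Sales"]))
    best_ids, best_sum, best_diff = [], 0, float("inf")
    for size in range(1, len(rows) + 1):
        for combo in itertools.combinations(rows, size):
            total = sum(sales for _, sales in combo)
            diff = abs(target - total)
            if diff < best_diff:  # strict '<' keeps the first of any tied combinations
                best_ids = [row_id for row_id, _ in combo]
                best_sum, best_diff = total, diff
    return best_ids, best_sum

def calc(name, date, target, df2):
    """Filter df2 to the matching Name/Date and delegate to opt."""
    subset = df2[(df2["Name"] == name) & (df2["Date"] == date)]
    return opt(target, subset)

# Call calc once per row of df1 and attach the results as new columns.
results = [calc(n, d, t, df2)
           for n, d, t in zip(df1["Name"], df1["Date"], df1["Total Sales"])]
out = df1.copy()
out["Comb IDs"] = [", ".join(ids) for ids, _ in results]
out["Comb Total"] = [total for _, total in results]
print(out)
```

A plain list comprehension is used instead of DataFrame.apply so that the tuple returned by calc does not get reshaped by pandas.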

Last Visited Interval for different people

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-4', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-04 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to calculate the last visited interval in days for each person. In this example, the outcome should be
last_visited_interval name
0 1 John
1 2 Mary
since '2015-3-5' and '2015-3-6' are 1 day apart, and '2016-3-6' and '2016-3-8' are 2 days apart.
I tried
df.groupby('name').agg(last_visited_interval=('visited',lambda x: x.diff().dt.days.last()))
but got the exception
last() missing 1 required positional argument: 'offset'
How should I do it?
Series.last works differently from what you expect: it selects the final rows of a time series by a date offset and requires a DatetimeIndex. It is also not GroupBy.last, because inside the lambda function you are working with a plain Series. To take the last value, use Series.iloc or Series.iat instead:
df.groupby('name').agg(last_visited_interval=('visited',lambda x:x.diff().dt.days.iat[-1]))
last_visited_interval
name
John 1.0
Mary 2.0
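A runnable version of the above, as a sketch using the question's data. The sort_values and reset_index steps are additions made here: sorting guards against visits that are not already in date order, and reset_index gives the flat layout shown in the question. A person with only one visit would get NaN, since there is no interval to compute.

```python
import pandas as pd

df = pd.DataFrame({
    "visited": ["2015-3-4", "2015-3-5", "2015-3-6",
                "2016-3-4", "2016-3-6", "2016-3-8"],
    "name": ["John", "John", "John", "Mary", "Mary", "Mary"],
})
df["visited"] = pd.to_datetime(df["visited"])

result = (
    df.sort_values(["name", "visited"])   # make sure visits are in date order
      .groupby("name")
      # diff() gives the gap to the previous visit; iat[-1] picks the last gap
      .agg(last_visited_interval=("visited", lambda x: x.diff().dt.days.iat[-1]))
      .reset_index()
)
print(result)
```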

Text data massaging to conduct distance calculations in python

I am trying to get the text data from dataframe "A" converted to columns and the text data from dataframe "B" placed in rows of a new dataframe "C", in order to perform distance calculations.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
I want the output in dataframe C as
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should be the actual content of data frame C? Do you only want to initialise it to some value (i.e. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
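If the goal is to fill C with distances straight away, the sketch below builds the same grid from a function of each row/column pair. The question does not say which distance is wanted, so the distance function here is purely a placeholder (one minus the difflib.SequenceMatcher similarity ratio); swap in whatever metric you actually need.

```python
from difflib import SequenceMatcher

import pandas as pd

A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])

def distance(a, b):
    # Placeholder metric (an assumption): 0.0 for identical strings, up to 1.0
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Rows come from A, columns from B; each cell is distance(row word, column word)
C = pd.DataFrame(
    [[distance(a, b) for b in B[0]] for a in A[0]],
    index=A[0].tolist(),
    columns=B[0].tolist(),
)
print(C)
```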
Use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values).
Dummy data:
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you want dfC without NaNs, use:
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead

How to split a Dataframe column whose data is not unique

I have a column called users in dataframe which doesn't have a unique format. I am doing a data cleanup project as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below, which broke the dataframe down as follows:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
Further splitting the above dataframe on "," with the same query, I got this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want to get the output below. I feel the best way to name each column is to use the key from the value itself (for example "Name" from "Name":"Martin"), because if we hard-code the names using df.rename, the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com "HR"
2 John name_2#email.com" Full Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast
import pandas as pd

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in Users column, then use DataFrame.explode on column Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
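To get closer to the layout asked for in the question (columns grouped per user, and no all-empty columns such as EmpType_2), two extra steps can follow the flattening. The sketch below is one possible way to do that on the sample data; the dropna and column reordering are additions made here, not part of the answer above.

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "company": ["A", "B"],
    "Users": [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})

# Same pipeline as above: parse, explode, expand dicts, number users, unstack
df["Users"] = df["Users"].apply(ast.literal_eval)
d = df.explode("Users").reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop("Users").tolist()))
d = d.set_index(["company", d.groupby("company").cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map("_".join)

# Extra step 1: drop columns that are empty for every company
d = d.dropna(axis=1, how="all")
# Extra step 2: group columns by user number (sorted() is stable, so the
# field order within each user is left as it was)
d = d[sorted(d.columns, key=lambda col: int(col.rsplit("_", 1)[1]))]
print(d)
```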

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd
emp_trips = {'Name': ['Bob','Joe','Sue','Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
'Return' : ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
}
df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart', 'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be of the earliest date of any overlapping trip dates.
The 'Return' date will be of the latest date of any overlapping trip dates.
Any company trips that do not have overlapping dates will be its own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that all this really does is create a hierarchical index but that's based on the specific columns. Also, this method won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not entirely sure how those would be used to combine rows into a single row based on overlapping dates and company. Also, I don't think groupby('Company') alone works, since a company can have several non-overlapping trips that each require their own row.
from datetime import timedelta

df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
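One way to approach the requirements listed in the question is classic interval merging per company: sort each company's trips by departure and fold a trip into the current block whenever it departs on or before the block's latest return. The sketch below is an illustration under that assumption (trips that merely touch on the same day count as overlapping); the helper name merge_trips is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Joe", "Sue", "Jack", "Henry", "Frank", "Lee", "Jack"],
    "Company": ["ABC", "ABC", "ABC", "HIJ", "HIJ", "DEF", "DEF", "DEF"],
    "Depart": ["01/01/2020", "01/01/2020", "01/06/2020", "01/01/2020",
               "05/01/2020", "01/13/2020", "01/12/2020", "01/14/2020"],
    "Return": ["01/31/2020", "02/15/2020", "02/20/2020", "03/01/2020",
               "05/05/2020", "01/15/2020", "01/30/2020", "02/02/2020"],
    "Charges": [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87],
})
for col in ("Depart", "Return"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

def merge_trips(group):
    """Merge one company's trips into blocks of overlapping date ranges."""
    merged = []
    group = group.sort_values("Depart")
    for depart, ret, charge in zip(group["Depart"], group["Return"], group["Charges"]):
        if merged and depart <= merged[-1]["Return"]:
            # Overlaps the current block: extend its end date and add the charge
            merged[-1]["Return"] = max(merged[-1]["Return"], ret)
            merged[-1]["Charges"] += charge
        else:
            # No overlap: start a new block
            merged.append({"Depart": depart, "Return": ret, "Charges": charge})
    return merged

# sort=False keeps companies in their order of first appearance
records = [
    {"Company": company, **trip}
    for company, group in df.groupby("Company", sort=False)
    for trip in merge_trips(group)
]
summary = pd.DataFrame(records)
summary["Charges"] = summary["Charges"].round(2)
print(summary)
```

Tracking the latest return seen so far (rather than only the previous trip's return) matters for cases like DEF, where a short trip sits inside a longer one.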

Resources