Optimize groupby->pd.DataFrame->.reset_index->.rename(columns) - python-3.x

I am very new at this, so bear with me please.
I have this example DataFrame:
example =
index Date Column_1 Column_2
1 2019-06-17 Car Red
2 2019-08-10 Car Yellow
3 2019-08-15 Truck Yellow
4 2020-08-12 Truck Yellow
and I do this:
data = example.groupby([pd.Grouper(freq='Y', key='Date'), 'Column_1']).nunique()
df1 = pd.DataFrame(data)
df2 = df1.reset_index(level=['Column_1', 'Date'])
df2 = df2.rename(columns={'Date': 'interval_year', 'Column_2': 'Sum'})
In order to get this:
df2=
index interval_year Column_1 Sum
1 2019-12-31 Car 2
2 2019-12-31 Truck 1
3 2020-12-31 Truck 1
I get the expected result, but my code gives me a lot of headaches. I create two additional DataFrames, and sometimes, when I end up with two columns with the same name (one as an index), the code becomes even more complicated.
Any solution how to make this more efficient?
Thank you

You can use named aggregation (the pd.NamedAgg syntax) to do the renaming for you inside the groupby, like this:
example.groupby([pd.Grouper(key='Date', freq='Y'), 'Column_1']).agg(Sum=('Column_2', 'nunique')).reset_index()
Output:
Date Column_1 Sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1

To reduce visual noise and avoid the intermediate DataFrames, I suggest method chaining.
Try this:
df2 = (
    example
    .assign(Date=lambda d: pd.to_datetime(d["Date"]))
    .groupby([pd.Grouper(freq='Y', key='Date'), 'Column_1'])
    .nunique()
    .reset_index()
    .rename(columns={'Date': 'interval_year', 'Column_2': 'Sum'})
)
# Output :
print(df2)
interval_year Column_1 Sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1
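For reference, here is a fully runnable version of the chained approach, rebuilding the example frame inline (values taken from the question):

```python
import pandas as pd

example = pd.DataFrame({
    "Date": ["2019-06-17", "2019-08-10", "2019-08-15", "2020-08-12"],
    "Column_1": ["Car", "Car", "Truck", "Truck"],
    "Column_2": ["Red", "Yellow", "Yellow", "Yellow"],
})

df2 = (
    example
    .assign(Date=lambda d: pd.to_datetime(d["Date"]))         # parse strings to Timestamps
    .groupby([pd.Grouper(freq="Y", key="Date"), "Column_1"])  # bucket by year, then Column_1
    .nunique()                                                # count distinct Column_2 values
    .reset_index()
    .rename(columns={"Date": "interval_year", "Column_2": "Sum"})
)
print(df2)
```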

Related

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the closest combination of rows from df2 column Sales that sums up to the target value in df1 column Total Sales; the Name & Date columns must match in both dataframes when joining (as shown in the expected output).
For example: row 0 in df1 should be matched only with df2 rows 0 & 1, since Name & Date are the same, which is Name: John and Date: 2021-10-01.
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it to the pandas dataframes to get the expected output.
The numbers variable in the script below represents the Sales column in df2, and the target variable represents the Total Sales column in df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000

best_combination = (None,)
best_result = math.inf
best_sum = 0
for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        total = sum(combination)  # avoid shadowing the built-in sum
        result = target - total
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = total
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) which has parameters for the target and a dataframe (a subset of df2). It needs to return the list of IDs which correspond to the optimal combination.
Write another function (let's call it calc) which takes three arguments: name, date and target. This function will filter df2 based on name and date, pass the result along with the target to the opt function, and return that function's result. Finally, iterate through the rows of df1 and call calc with the row's arguments (or alternatively use pandas.DataFrame.apply).
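A possible sketch of those two functions, following the description above (the names opt and calc come from this answer; the brute-force search is the asker's own approach wrapped into a function):

```python
import itertools
import math

import pandas as pd

def opt(target, group):
    """Return (comma-joined IDs, total) for the subset of rows in `group`
    whose Sales sum is closest to `target`."""
    rows = list(zip(group["ID"], group["Sales"]))
    best_ids, best_total, best_diff = "", 0.0, math.inf
    for r in range(len(rows) + 1):
        for comb in itertools.combinations(rows, r):
            total = sum(sales for _, sales in comb)
            if abs(target - total) < best_diff:
                best_diff = abs(target - total)
                best_ids = ", ".join(i for i, _ in comb)
                best_total = float(total)
    return best_ids, best_total

def calc(row, df2):
    """Filter df2 to this row's Name/Date, then delegate to opt."""
    subset = df2[(df2["Name"] == row["Name"]) & (df2["Date"] == row["Date"])]
    return pd.Series(opt(row["Total Sales"], subset),
                     index=["Comb IDs", "Comb Total"])

# Usage: df1[["Comb IDs", "Comb Total"]] = df1.apply(calc, axis=1, args=(df2,))
```

Note the search is exponential in the group size, which is fine for a handful of rows per Name/Date but will not scale to large groups.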

How to compare two data frames based on difference in date

I have two data frames; each has an #id column and a date column.
I want to find rows in both data frames that have the same id with a date difference of more than 2 days.
Normally it's helpful to include a dataframe so that the responder doesn't need to create it. :)
import pandas as pd
from datetime import timedelta
Create two dataframes:
df1 = pd.DataFrame(data={"id":[0,1,2,3,4], "date":["2019-01-01","2019-01-03","2019-01-05","2019-01-07","2019-01-09"]})
df1["date"] = pd.to_datetime(df1["date"])
df2 = pd.DataFrame(data={"id":[0,1,2,8,4], "date":["2019-01-02","2019-01-06","2019-01-09","2019-01-07","2019-01-10"]})
df2["date"] = pd.to_datetime(df2["date"])
They will look like this:
DF1
id date
0 0 2019-01-01
1 1 2019-01-03
2 2 2019-01-05
3 3 2019-01-07
4 4 2019-01-09
DF2
id date
0 0 2019-01-02
1 1 2019-01-06
2 2 2019-01-09
3 8 2019-01-07
4 4 2019-01-10
Merge the two dataframes on 'id' columns:
df_result = df1.merge(df2, on="id")
Resulting in:
id date_x date_y
0 0 2019-01-01 2019-01-02
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
3 4 2019-01-09 2019-01-10
Then subtract the two date columns and filter for a difference greater than two days:
df_result[(df_result["date_y"] - df_result["date_x"]) > timedelta(days=2)]
id date_x date_y
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
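One caveat: the filter above only catches rows where df2's date is later. If the difference can go either way, taking the absolute value covers both directions (a small variation on the same merge):

```python
import pandas as pd
from datetime import timedelta

df1 = pd.DataFrame({"id": [0, 1, 2, 3, 4],
                    "date": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-05",
                                            "2019-01-07", "2019-01-09"])})
df2 = pd.DataFrame({"id": [0, 1, 2, 8, 4],
                    "date": pd.to_datetime(["2019-01-02", "2019-01-06", "2019-01-09",
                                            "2019-01-07", "2019-01-10"])})

merged = df1.merge(df2, on="id")
# .abs() makes the filter symmetric in date_x and date_y
out = merged[(merged["date_y"] - merged["date_x"]).abs() > timedelta(days=2)]
```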

Calculate the sum of the numbers separated by a comma in a dataframe column

I am trying to calculate the sum of all the numbers separated by a comma in a dataframe column, however I keep getting an error. This is what the dataframe looks like:
Description scores
logo
graphics
eyewear 0.360740,-0.000758
glasses 0.360740,-0.000758
picture -0.000646
tutorial 0.001007,0.000968,0.000929,0.000889
computer 0.852264,0.001007,0.000968,0.000929,0.000889
This is what the code looks like
test['Sum'] = test['scores'].apply(lambda x: sum(map(float, x.split(','))))
However I keep getting the following error:
ValueError: could not convert string to float:
I thought it could be because of the missing values at the start of the dataframe, but I subset the dataframe to exclude the missing values and still get the same error.
Output
Description scores SUM
logo
graphics
eyewear 0.360740,-0.000758 0.359982
glasses 0.360740,-0.000758 0.359982
picture -0.000646 -0.000646
tutorial 0.001007,0.000968,0.000929,0.000889 0.003793
computer 0.852264 0.001007,0.000968,0.000929,0.000889 0.856057
How can I resolve it?
There are times when using plain Python seems to be very effective; this might be one of those (this assumes the blank cells are empty strings and that numpy is imported as np):
df['scores'].apply(lambda x: sum(float(i) if len(x) > 0 else np.nan for i in x.split(',')))
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
You can just use str.split:
df.scores.str.split(',',expand=True).astype(float).sum(1).mask(df.scores.isnull())
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
dtype: float64
Another solution using explode, groupby and sum functions:
df.scores.str.split(',').explode().astype(float).groupby(level=0).sum(min_count=1)
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
Name: scores, dtype: float64
Or, to make #WeNYoBen's answer slightly shorter:
df.scores.str.split(',',expand=True).astype(float).sum(1, min_count=1)
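For completeness, a runnable sketch of the str.split approach, rebuilding the frame from the question (this assumes the blank scores come in as NaN and that all values within a cell are comma-separated):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Description": ["logo", "graphics", "eyewear", "glasses",
                    "picture", "tutorial", "computer"],
    "scores": [np.nan, np.nan,
               "0.360740,-0.000758", "0.360740,-0.000758", "-0.000646",
               "0.001007,0.000968,0.000929,0.000889",
               "0.852264,0.001007,0.000968,0.000929,0.000889"],
})

# expand=True pads short rows with NaN; min_count=1 keeps all-NaN rows as NaN
df["Sum"] = df["scores"].str.split(",", expand=True).astype(float).sum(1, min_count=1)
```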

How do I reformat dates in a CSV to just show MM/YYYY

Using Python 3 Pandas, I am spending an embarrassing amount of time trying to figure out how to take a column of dates from a CSV and make a new column with just MM/YYYY or YYYY/MM/01.
The data looks like Col1 but I am trying to produce Col2:
Col1 Col2
2/12/2017 2/1/2017
2/16/2017 2/1/2017
2/28/2017 2/1/2017
3/2/2017 3/1/2017
3/13/2017 3/1/2017
I am able to parse the year and month out:
df['Month'] = pd.DatetimeIndex(df['File_Processed_Date']).month
df['Year'] = pd.DatetimeIndex(df['File_Processed_Date']).year
df['Period'] = df['Month'] + '/' + df['Year']
That last line is wrong; is there a clever Python way to just show 2/2017?
I get the error: "TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('
Update, answer by piRsquared:
d = pd.to_datetime(df.File_Processed_Date)
df['Period'] = d.dt.strftime('%m/1/%Y')
This will create a pandas column in a dataframe that converts Col1 into Col2 successfully. Thanks!
Let d be 'Col1' converted to Timestamps:
d = pd.to_datetime(df.Col1)
then
d.dt.strftime('%m/1/%Y')
0 02/1/2017
1 02/1/2017
2 02/1/2017
3 03/1/2017
4 03/1/2017
Name: Col1, dtype: object
d.dt.strftime('%m/%Y')
0 02/2017
1 02/2017
2 02/2017
3 03/2017
4 03/2017
Name: Col1, dtype: object
d.dt.strftime('%Y/%m/01')
0 2017/02/01
1 2017/02/01
2 2017/02/01
3 2017/03/01
4 2017/03/01
Name: Col1, dtype: object
d - pd.offsets.MonthBegin()
0 2017-02-01
1 2017-02-01
2 2017-02-01
3 2017-03-01
4 2017-03-01
Name: Col1, dtype: datetime64[ns]
The function you are looking for is strftime.
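If you prefer to stay close to the question's Month/Year approach, the TypeError goes away once the integer parts are cast to strings before concatenating (a minimal sketch, with the unpadded "2/2017" form the question asked about):

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["2/12/2017", "2/16/2017", "3/2/2017"]})
d = pd.to_datetime(df["Col1"])

# month/year are integers; cast to str so '+' means string concatenation
df["Period"] = d.dt.month.astype(str) + "/" + d.dt.year.astype(str)
```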

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort and group a pandas data frame column alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" so that it is in alphabetical order and grouped as well, i.e. the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column a, and the rows will end up grouped as well as a side effect of the sorting. See ya!
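Note that sort_values alone keeps every row's label; if you also want the repeated labels blanked out for display, as in the expected output, one possible sketch (the blanking trick is an addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({"a": ["sales", "purchase", "purchase", "sales", "purchase"],
                   "b": [2, 130, 10, 122, 103],
                   "c": [None, 230.0, 20.0, 245.0, 320.0]})

result = df.sort_values("a", kind="stable")
# blank consecutive repeats of "a" so each group label prints once
result["a"] = result["a"].mask(result["a"].eq(result["a"].shift()), "")
```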