Lookup the Most Recent Past Date Based on Criteria - excel-formula

Here is my sample data:

Date      Vendor  Func  Name
1/1/2023  A       1     AB
1/1/2023  A       2     AC
1/1/2023  B       3     AD
1/2/2023  A       1     AB
1/2/2023  A       2     AC
1/2/2023  B       3     AD
1/4/2023  A       1     AB
1/4/2023  A       2     AC
1/4/2023  B       3     AD
1/5/2023  A       1     AB
1/5/2023  A       2     AC
1/5/2023  B       3     AD
Output:

Date      Vendor  Func  Name  Recent_Date
1/1/2023  A       1     AB    Null
1/1/2023  A       2     AC    Null
1/1/2023  B       3     AD    Null
1/2/2023  A       1     AB    1/1/2023
1/2/2023  A       2     AC    1/1/2023
1/2/2023  B       3     AD    1/1/2023
1/4/2023  A       1     AB    1/2/2023
1/4/2023  A       2     AC    1/2/2023
1/4/2023  B       3     AD    1/2/2023
1/5/2023  A       1     AB    1/4/2023
1/5/2023  A       2     AC    1/4/2023
1/5/2023  B       3     AD    1/4/2023
What makes a row unique: the Date + Func columns.
For date 1/1/2023: the previous past date is not listed in the data, so it should return Null.
For date 1/2/2023: similarly, it should return 1/1/2023.
For date 1/4/2023: similarly, it should return 1/2/2023.
For date 1/5/2023: similarly, it should return 1/4/2023.
Can someone help me with which Excel function to use to get the expected output (Recent_Date) as shown above? I tried using INDEX and MATCH for this, but it is still not working.

As you mentioned INDEX/MATCH, here's a solution using that (argument separators are shown as semicolons; your locale may use commas):
=INDEX(DateColumn;MATCH(DateValue-1;DateColumn;1))
where DateValue refers to the date cell of the same row. With match type 1, MATCH finds the position of the largest date less than or equal to DateValue-1, i.e. the most recent strictly earlier date; this requires the dates to be sorted ascending, as in your sample. The earliest date has no such match, so wrap the formula in IFERROR(…;"") if you want a blank instead of #N/A.
Of course, there's also the XLOOKUP option, which is maybe a bit more legible:
=XLOOKUP(DateValue-1;DateColumn;DateColumn;;-1)
Here the match mode -1 means "exact match or next smaller item", and the empty fourth argument (if_not_found) can likewise be set to "" to return a blank for the earliest date.
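As an aside, Recent_Date depends only on the ordered list of distinct dates, so the same lookup is short to express in pandas if the data ever leaves Excel. A minimal sketch, with the DataFrame construction assumed from the sample data above:

import pandas as pd

# Distinct dates, sorted ascending, as in the sample data
df = pd.DataFrame({"Date": pd.to_datetime(
    ["1/1/2023", "1/2/2023", "1/4/2023", "1/5/2023"]).repeat(3)})
dates = df["Date"].drop_duplicates().sort_values()
# Map each distinct date to the one immediately before it; the earliest
# date has no predecessor and gets NaT (the "Null" in the output above)
prev = pd.Series(dates.shift().values, index=dates.values)
df["Recent_Date"] = df["Date"].map(prev)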

Related

How to recalculate DataFrame column values based on condition dict (Pandas Python)

Let's say I have the following DataFrame:

   A      B
0  aa   4.32
1  aa   7.00
2  bb   8.00
3  dd  74.00
4  cc  30.00
5  bb   2.00
And let's say I have the following dict, whose keys determine the condition for column A and whose values determine the multiplier for column B:
dict1={'aa':-1, 'bb':2}
All I want is to multiply the values in column B by the values from dict1, on the condition that the values in column A are equal to dict1's keys.
So the output should be:

   A      B
0  aa  -4.32
1  aa  -7.00
2  bb  16.00
3  dd  74.00
4  cc  30.00
5  bb   4.00
Thanks
Use pd.Series.map:
print (df["A"].map(dict1).fillna(1)*df["B"])
0    -4.32
1    -7.00
2    16.00
3    74.00
4    30.00
5     4.00
dtype: float64
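To write the result back into the frame rather than just printing it, the same mapping can feed an assignment. A minimal runnable sketch (the frame construction is assumed from the question):

import pandas as pd

df = pd.DataFrame({"A": ["aa", "aa", "bb", "dd", "cc", "bb"],
                   "B": [4.32, 7.00, 8.00, 74.00, 30.00, 2.00]})
dict1 = {'aa': -1, 'bb': 2}

# Map each key in A to its multiplier; rows whose key is absent fall back to 1
df["B"] = df["A"].map(dict1).fillna(1) * df["B"]
print(df)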

Nesting INDEX function within a SUMPRODUCT to aggregate values within the correct column

This is an adaptation of my previous question regarding how to aggregate values based on lookup IDs and multiple criteria.
This time I would like to index for the correct year. My goal is to get to sheet 3, where a formula contained in cells C4:D6 will reference the product name in column A, the quarter in cell A1 (which the user will input), and the year in cells C3 and D3, and aggregate the relevant sales figures under each region in column B.
In my previous question, I was provided a solution that nests a SUMIFS within a SUMPRODUCT. I am trying to build on this by adding an INDEX & MATCH formula to index for the correct year's column in sheet 1. I have tried the following in the report but am getting a #N/A error.
=SUMPRODUCT(SUMIFS(INDEX(Sheet1!$D:$F,0,MATCH(C$3,Sheet1!$D$1:$F$1,0)),Sheet1!C:C,IF(Sheet2!$B$2:$B$8=$B4,Sheet2!$A$2:$A$8),Sheet1!A:A,$A$1,Sheet1!B:B,A4))
UPDATE: It has been discovered that the above formula does work. The issue was that Sheet 1 was a pivot table, so the column header for each year was in text format and differed from the formatting of the lookup cell in the report, meaning there was no link to reference the data. (A common workaround is to coerce the lookup value to text inside MATCH, e.g. MATCH(C$3&"",Sheet1!$D$1:$F$1,0).)
Sheet 1 (Raw data)

     A        B        C   D     E     F
1    Quarter  Product  ID  2021  2020  2019
2    Q1       A        1   $12   $12   $9
3    Q1       A        3   $4    $30   $50
4    Q1       A        7   $48   $15   $39
5    Q1       A        14  $42   $7    $26
6    Q1       A        25  $36   $50   $20
7    Q1       A        27  $45   $8    $9
8    Q1       A        44  $12   $10   $2
9    Q1       B        1   $40   $32   $23
10   Q1       B        3   $15   $14   $30
11   Q1       B        7   $21   $4    $42
12   Q1       B        14  $38   $26   $13
13   Q1       B        25  $31   $45   $9
14   Q1       B        27  $32   $46   $30
15   Q1       B        44  $21   $40   $30
16   Q2       A        1   $6    $1    $43
17   Q2       A        3   $12   $16   $44
and so forth…
Sheet 2 (lookup table)

    A   B
1   ID  Region
2   1   East
3   3   East
4   7   Central
5   14  Central
6   25  Central
7   27  West
8   44  West
Sheet 3 (Report)

    A        B        C     D
1   Q1
2
3   Product  Region   2021  2020
4   A        East     $16   $42
5   A        Central  $126  $45
6   A        West     $57   $22
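For anyone more comfortable checking the aggregation logic outside Excel, here is a small pandas sketch of what the SUMPRODUCT/SUMIFS/INDEX combination computes; the frames are abbreviated from the sheets above (East IDs only), and this is an illustration rather than part of the Excel answer:

import pandas as pd

sheet1 = pd.DataFrame({
    "Quarter": ["Q1"] * 4,
    "Product": ["A", "A", "B", "B"],
    "ID":      [1, 3, 1, 3],
    "2021":    [12, 4, 40, 15],
    "2020":    [12, 30, 32, 14],
})
sheet2 = pd.DataFrame({"ID": [1, 3], "Region": ["East", "East"]})

# Join sales to regions, filter on the quarter, then sum per region and year
merged = sheet1.merge(sheet2, on="ID")
report = (merged[merged["Quarter"] == "Q1"]
          .groupby(["Product", "Region"])[["2021", "2020"]].sum())
print(report)  # Product A / East / 2021 -> 16, matching the $16 in the report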

Select two or more consecutive rows based on a criteria using python

I have a data set like this:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
A 2019-01-01 11.18 TX 234567 3
B 2019-01-02 12.19 WA 456789 4
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
B 2019-01-02 12.50 DC 157890 7
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
A 2019-01-04 09:40 CA 234567 11
In this data set I want to compare and select two or more consecutive rows which fit the following criteria:
User should be same
Time difference should be less than 15 mins
Cookie should be different
So if I apply the filter I should get the following data:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
So, in the above, the first two rows (index 1 and 2) satisfy all the conditions above. The next two (index 2 and 3) have the same cookie, index 3 and 4 have different users, index 5 and 6 are selected and displayed, and index 6 and 7 have a time difference of more than 15 minutes. Index 8, 9 and 10 fit the criteria, but 11 doesn't, as the date is 24 hours apart.
How can I solve this using a pandas DataFrame? All help is appreciated.
What I have tried:
I tried creating flags using shift():
cookiediff=pd.DataFrame(df.Cookie==df.Cookie.shift())
cookiediff.columns=['Cookiediffs']
timediff=pd.DataFrame(pd.to_datetime(df.time) - pd.to_datetime(df.time.shift()))
timediff.columns=['timediff']
mask = df.user != df.user.shift(1)
timediff.timediff[mask] = np.nan
cookiediff['Cookiediffs'][mask] = np.nan
This will do the trick:

import numpy as np
import pandas as pd

# The time column mixes "." and ":" delimiters - normalise it per your sample data
df["time"] = df["time"].str.replace(":", ".")
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H.%M")

cond_ = np.logical_or(
    # row matches the previous one on all three conditions...
    df["time"].sub(df["time"].shift()).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift())
    & df["cookie"].ne(df["cookie"].shift()),
    # ...or matches the next one
    df["time"].sub(df["time"].shift(-1)).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift(-1))
    & df["cookie"].ne(df["cookie"].shift(-1)),
)
res = df.loc[cond_]
A few points: you need to ensure your time column is datetime in order to make the 15-minute condition verifiable.
Then, the final filter (cond_) is obtained by comparing each row to the previous one, checking all 3 conditions, OR by doing the same against the next one (otherwise you would just get all the consecutive matching rows except the first one).
Outputs:
user time city cookie index
0 A 2019-01-01 11:00:00 NYC 123456 1
1 A 2019-01-01 11:12:00 CA 234567 2
4 B 2019-01-02 12:21:00 FL 456789 5
5 B 2019-01-02 12:31:00 VT 987654 6
7 A 2019-01-03 09:12:00 CA 123456 8
8 A 2019-01-03 09:27:00 NYC 345678 9
9 A 2019-01-03 09:34:00 TX 123456 10
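One portability note: on recent pandas versions the .astype('timedelta64[m]') conversion no longer yields float minutes, so if the above raises, the same condition can be spelled with dt.total_seconds() instead (a sketch, using df as prepared above):

# Minutes since the previous row, portable across pandas versions
minutes_prev = df["time"].sub(df["time"].shift()).dt.total_seconds().div(60)
prev_match = (minutes_prev.lt(15)
              & df["user"].eq(df["user"].shift())
              & df["cookie"].ne(df["cookie"].shift()))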
You could use regular expressions to isolate the fields, using named groups and groupdict() to store the value of each field in a dictionary. Then iterate through the dataset line by line with two dictionaries, the current one and the last one: perform a re.search() on each line with the regex pattern to separate it into named fields, and compare the values of the two dictionaries.
So, something like:

import re

# [.:] accepts both time delimiters that appear in the sample data
pattern = r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)'
c_dict = re.search(pattern, s).groupdict()

for each line s of your dataset. For the first line of your dataset, this would create the dictionary {'user': 'A', 'time': '2019-01-01 11.00', 'city': 'NYC', 'cookie': '123456', 'index': '1'}. With the fields isolated, you could easily compare the values of the fields to previous lines if you stored those in another dictionary.
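A minimal sketch of that loop, assuming the rows live in a plain-text file (the name data.txt is hypothetical):

import re

PATTERN = (r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +'
           r'(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)')

last = None
with open("data.txt") as fh:  # hypothetical file holding the rows shown above
    for line in fh:
        m = re.search(PATTERN, line)
        if m is None:  # skip the header or any malformed line
            continue
        current = m.groupdict()
        if (last is not None
                and current["user"] == last["user"]
                and current["cookie"] != last["cookie"]):
            # the 15-minute time comparison would go here, e.g. via datetime.strptime
            print(last["index"], current["index"])
        last = current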

Finding unique ids in lines of dataframe

Input: a dataframe with more than 50k rows.
Expected result: find unique IDs by multiple columns.
For example, there is this dataframe:
id par1 par2 par3
1 a 1 AA
2 b 2 AB
3 c 3 AC
4 a 4 AD
5 d 3 AE
6 e 5 AD
7 d 1 AF
So the logic is: if any rows share a common parameter, they get the same unique id. The result should be something like this, built up by iterations:
First by par1:
id par1 par2 par3 uniq_id
1 a 1 AA 1
2 b 2 AB 2
3 c 3 AC 3
4 a 4 AD 1
5 d 3 AE 4
6 e 5 AD 5
7 d 1 AF 4
Then by par2:
id par1 par2 par3 uniq_id
1 a 1 AA 1
2 b 2 AB 2
3 c 3 AC 3
4 a 4 AD 1
5 d 3 AE 3
6 e 5 AD 5
7 d 1 AF 1
Then by par3:
id par1 par2 par3 uniq_id
1 a 1 AA 1
2 b 2 AB 2
3 c 3 AC 3
4 a 4 AD 1
5 d 3 AE 3
6 e 5 AD 1
7 d 1 AF 1
Then it should be checked whether there are still any mismatches:
e.g. id=5 and id=3 should get uniq_id = 1, because id=7 is uniq_id = 1 and id=7 shares par1 with id=5, and because of that id=3 also changes.
I hope it is clear what I am trying to explain.
At the moment the only working solution I have made creates multiple for loops and compares values manually, but since there are lots of observations, it can take forever to execute.
Use factorize first and then Series.map with DataFrame.drop_duplicates:
df['uniq_id'] = pd.factorize(df['par1'])[0] + 1
df['uniq_id'] = df['par2'].map(df.drop_duplicates('par2').set_index('par2')['uniq_id'])
df['uniq_id'] = df['par3'].map(df.drop_duplicates('par3').set_index('par3')['uniq_id'])
print (df)
id par1 par2 par3 uniq_id
0 1 a 1 AA 1
1 2 b 2 AB 2
2 3 c 3 AC 3
3 4 a 4 AD 1
4 5 d 3 AE 3
5 6 e 5 AD 1
6 7 d 1 AF 1
If possible more columns is possible create loop:
df['uniq_id'] = pd.factorize(df['par1'])[0] + 1
for col in ['par2','par3']:
    df['uniq_id'] = df[col].map(df.drop_duplicates(col).set_index(col)['uniq_id'])
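For reference, a self-contained run on the sample frame (the DataFrame construction here is assumed from the table in the question):

import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6, 7],
    "par1": ["a", "b", "c", "a", "d", "e", "d"],
    "par2": [1, 2, 3, 4, 3, 5, 1],
    "par3": ["AA", "AB", "AC", "AD", "AE", "AD", "AF"],
})

# First pass labels rows by par1; each later pass propagates labels
# through values shared in the next column
df["uniq_id"] = pd.factorize(df["par1"])[0] + 1
for col in ["par2", "par3"]:
    df["uniq_id"] = df[col].map(df.drop_duplicates(col).set_index(col)["uniq_id"])
print(df)  # reproduces the uniq_id column 1, 2, 3, 1, 3, 1, 1 shown above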

Find common values in multiple files in awk

I would like to find common values from multiple files, and the corresponding counts of appearance, using awk. I have, say, four files: input1, input2, input3, input4:
input1:  input2:  input3:  input4:
AA       AB       AA       AC
AB       AC       AC       AF
AC       AF       AF       AD
AD       AG       AH       AH
AF       AH       AK       AK
AI
I would like the answer to be:
Variable: Count
AA 2
AB 2
AC 4
AD 2
AF 4
AH 3
AK 2
AI 1
Any comments, please!
awk '{a[$0]++}END{for(x in a)print x,a[x]}' input*
Here a[$0]++ counts every whole line across all the input files, and the END block prints each distinct value with its count. Note that the iteration order of for (x in a) is unspecified, so pipe the result through sort if you need it ordered. With your inputs, the output would be:
AA 2
AB 2
AC 4
AD 2
AF 4
AG 1
AH 3
AI 1
AK 2
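For comparison, a minimal Python sketch of the same counting, assuming the four file names from the question:

from collections import Counter
import fileinput

# Count each whole line across all input files, mirroring awk's a[$0]++
counts = Counter(
    line.strip()
    for line in fileinput.input(["input1", "input2", "input3", "input4"])
    if line.strip()
)
for value, n in sorted(counts.items()):
    print(value, n)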
