Backschedule calculation based on previous row's calculated column

We have a table named "Job" with a list of jobs and their operation steps. Only the job's due date is available, so we need to back-schedule each operation's end date based on its lead time.
Here is an example of what the end results should look like:
Job#  Oper#  JobDueDate  OperLT  OperEndDate
123   50     3/15/2019   5       3/15/2019
123   40     3/15/2019   3       3/10/2019
123   30     3/15/2019   2       3/7/2019
123   20     3/15/2019   10      3/5/2019
123   10     3/15/2019   3       2/23/2019
456   30     2/10/2019   15      2/10/2019
456   20     2/10/2019   5       1/26/2019
456   10     2/10/2019   4       1/21/2019
I used the window function LAG() OVER(), but it only carries the value from the immediately preceding row, so the subtraction never accumulates.
The SQL statement is as follows:
SELECT Job#, Oper#, JobDueDate, OperLT,
       LAG(DATEADD(d, OperLT * -1, JobDueDate), 1, JobDueDate)
           OVER (PARTITION BY Job# ORDER BY Oper# DESC) AS OperEndDate
FROM Job
ORDER BY Job#, Oper# DESC;
Could someone please point out what is missing in my code or a better way to accomplish this if Lag() Over() is not the right solution?
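LAG() reads a value a fixed number of rows back and cannot see the column being computed, so the subtraction never chains beyond one step. What the expected output implies is a running total of the lead times of all later operations subtracted from the due date (in SQL, a windowed SUM rather than LAG). A minimal sketch of that logic in pandas, using the first job from the sample data (column names are adapted from the post):

```python
import pandas as pd

# First job from the question's sample data.
df = pd.DataFrame({
    "Job": [123, 123, 123, 123, 123],
    "Oper": [50, 40, 30, 20, 10],
    "JobDueDate": pd.to_datetime(["2019-03-15"] * 5),
    "OperLT": [5, 3, 2, 10, 3],
})

# Schedule backwards: later operations first, then subtract the running
# total of lead times of the *preceding* (later-numbered) operations.
df = df.sort_values(["Job", "Oper"], ascending=[True, False])
offset = df.groupby("Job")["OperLT"].cumsum() - df["OperLT"]  # exclude current row
df["OperEndDate"] = df["JobDueDate"] - pd.to_timedelta(offset, unit="D")
print(df)
```

The same frame trick in SQL would be a SUM(OperLT) OVER (...) with a window ending one row before the current one, defaulting to zero for the first operation.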

Related

Changing multiindex in a pandas series?

I have a dataframe like this:
mainid  pidl      pidw  score
0       Austria   1     533
1       Canada    2     754
2       Canada    3     267
3       Austria   4     852
4       Taiwan    5     124
5       Slovakia  6     344
6       Spain     7     1556
7       Taiwan    8     127
I want to select the top 5 rows by score for each pidl.
When I grouped by column 'pidl' and sorted the scores in descending order within each group, I got the following series, s:
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5)
pidl pidl pidw score
Austria Austria 49 948
47 859
48 855
50 807
46 727
Belgium Belgium 15 2339
14 1861
45 1692
16 1626
46 1423
Name: score, dtype: float64
The result looks correct, but I wish I could remove the second 'pidl' from this series.
I have tried
s.reset_index('pidl')
which raises 'ValueError: The name location occurs multiple times, use a level number',
and
s.to_frame().reset_index()
which raises 'ValueError: cannot insert pidl, already exists',
so I am not sure how to proceed.
Use the group_keys=False parameter in DataFrame.groupby:
s= df.set_index(['pidl', 'pidw']).groupby('pidl', group_keys=False)['score'].nlargest(5)
print (s)
pidl pidw
Austria 4 852
1 533
Canada 2 754
3 267
Slovakia 6 344
Spain 7 1556
Taiwan 8 127
5 124
Name: score, dtype: int64
Or add Series.droplevel to remove the first level (pandas counts from 0, so 0 is used):
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5).droplevel(0)
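For reference, a self-contained sketch of the droplevel variant against the sample frame from the question. nlargest prepends the group key as an extra index level, so droplevel(0) strips it and leaves the original (pidl, pidw) index:

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "pidl": ["Austria", "Canada", "Canada", "Austria",
             "Taiwan", "Slovakia", "Spain", "Taiwan"],
    "pidw": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [533, 754, 267, 852, 124, 344, 1556, 127],
})

# Group by the pidl index level, take the 5 largest scores per group,
# then drop the duplicated group-key level that nlargest prepends.
s = (df.set_index(["pidl", "pidw"])
       .groupby("pidl")["score"]
       .nlargest(5)
       .droplevel(0))
print(s)
```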

How do I create a new column in pandas which is the sum of another column based on a condition?

I am trying to get the result column to be the sum of the value column for all rows in the data frame where the country is equal to the country in that row, and the date is on or before the date in that row.
Date       Country Value Result
01/01/2019 France 10 10
03/01/2019 England 9 9
03/01/2019 Germany 7 7
22/01/2019 Italy 2 2
07/02/2019 Germany 10 17
17/02/2019 England 6 15
25/02/2019 England 5 20
07/03/2019 France 3 13
17/03/2019 England 3 23
27/03/2019 Germany 3 20
15/04/2019 France 6 19
04/05/2019 England 3 26
07/05/2019 Germany 5 25
21/05/2019 Italy 5 7
05/06/2019 Germany 8 33
21/06/2019 England 3 29
24/06/2019 England 7 36
14/07/2019 France 1 20
16/07/2019 England 5 41
30/07/2019 Germany 6 39
18/08/2019 France 6 26
04/09/2019 England 3 44
08/09/2019 Germany 9 48
15/09/2019 Italy 7 14
05/10/2019 Germany 2 50
I have tried the below code but it sums up the entire column
df['result'] = df.loc[(df['Country'] == df['Country']) & (df['Date'] >= df['Date']), 'Value'].sum()
As your dates are ordered, you could do:
df['Result'] = df.groupby('Country')['Value'].cumsum()
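A minimal check of the cumulative-sum approach on the first few rows of the sample (dates are day-first in the question):

```python
import pandas as pd

# First few rows of the sample data from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01/01/2019", "03/01/2019", "03/01/2019",
         "22/01/2019", "07/02/2019", "17/02/2019"],
        dayfirst=True,
    ),
    "Country": ["France", "England", "Germany", "Italy", "Germany", "England"],
    "Value": [10, 9, 7, 2, 10, 6],
})

# Because rows are already in date order, a per-country cumulative sum
# gives "sum of Value for this country up to and including this row".
df["Result"] = df.groupby("Country")["Value"].cumsum()
print(df)
```

Note this relies on the rows being sorted by date; with unsorted data you would sort_values("Date") first.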

Convert column headers into a new column and keep the values of each column

UPD: A dataframe has been pasted as an example at the bottom of the page.
My original xls file looks like this:
and I need two actions in order to make it look like the desired reshaped output:
Firstly, I need to fill in the empty row values with the values shown in the cell above them. That has been achieved with the following function:
def get_csv():
    # Read the Excel file
    df = pd.read_excel('test.xls')
    df = df.fillna(method='ffill')
    return df
Secondly, I have used stack with set_index:
df = (df.set_index(['Country', 'Gender', 'Arr-Dep'])
        .stack()
        .reset_index(name='Value')
        .rename(columns={'level_3':'Year'}))
and I was wondering whether there is an easier way. Is there a library that transforms a dataframe, Excel sheet, etc. into the wanted format?
Original dataframe after excel import:
Country Gender Direction 1974 1975 1976
0 Austria Male IN 13728 8754 9695
1 NaN NaN OUT 17977 12271 9899
2 NaN Female IN 8541 6465 6447
3 NaN NaN OUT 8450 7190 6288
4 NaN Total IN 22269 15219 16142
5 NaN NaN OUT 26427 19461 16187
6 Belgium Male IN 2412 2245 2296
7 NaN NaN OUT 2800 2490 2413
8 NaN Female IN 2105 2022 2057
9 NaN NaN OUT 2100 2113 2004
10 NaN Total IN 4517 4267 4353
11 NaN NaN OUT 4900 4603 4417
An alternative solution is to use melt, but if you need the same row ordering as the stacked DataFrame, it is necessary to add sort_values:
df1 = (df.ffill()
         .melt(id_vars=['Country','Gender','Direction'],
               var_name='Date', value_name='Value'))
print (df1)
Country Gender Direction Date Value
0 Austria Male IN 1974 13728
1 Austria Male OUT 1974 17977
2 Austria Female IN 1974 8541
3 Austria Female OUT 1974 8450
4 Austria Total IN 1974 22269
5 Austria Total OUT 1974 26427
6 Belgium Male IN 1974 2412
7 Belgium Male OUT 1974 2800
8 Belgium Female IN 1974 2105
9 Belgium Female OUT 1974 2100
10 Belgium Total IN 1974 4517
11 Belgium Total OUT 1974 4900
12 Austria Male IN 1975 8754
13 Austria Male OUT 1975 12271
14 Austria Female IN 1975 6465
15 Austria Female OUT 1975 7190
16 Austria Total IN 1975 15219
17 Austria Total OUT 1975 19461
18 Belgium Male IN 1975 2245
19 Belgium Male OUT 1975 2490
20 Belgium Female IN 1975 2022
21 Belgium Female OUT 1975 2113
22 Belgium Total IN 1975 4267
23 Belgium Total OUT 1975 4603
24 Austria Male IN 1976 9695
25 Austria Male OUT 1976 9899
26 Austria Female IN 1976 6447
27 Austria Female OUT 1976 6288
28 Austria Total IN 1976 16142
29 Austria Total OUT 1976 16187
30 Belgium Male IN 1976 2296
...
...
df1 = (df.ffill()
         .melt(id_vars=['Country','Gender','Direction'],
               var_name='Date', value_name='Value')
         .sort_values(['Country', 'Gender', 'Direction'])
         .reset_index(drop=True))
print (df1)
Country Gender Direction Date Value
0 Austria Female IN 1974 8541
1 Austria Female IN 1975 6465
2 Austria Female IN 1976 6447
3 Austria Female OUT 1974 8450
4 Austria Female OUT 1975 7190
5 Austria Female OUT 1976 6288
6 Austria Male IN 1974 13728
7 Austria Male IN 1975 8754
8 Austria Male IN 1976 9695
9 Austria Male OUT 1974 17977
10 Austria Male OUT 1975 12271
11 Austria Male OUT 1976 9899
12 Austria Total IN 1974 22269
13 Austria Total IN 1975 15219
14 Austria Total IN 1976 16142
15 Austria Total OUT 1974 26427
16 Austria Total OUT 1975 19461
17 Austria Total OUT 1976 16187
18 Belgium Female IN 1974 2105
19 Belgium Female IN 1975 2022
20 Belgium Female IN 1976 2057
21 Belgium Female OUT 1974 2100
22 Belgium Female OUT 1975 2113
23 Belgium Female OUT 1976 2004
24 Belgium Male IN 1974 2412
25 Belgium Male IN 1975 2245
26 Belgium Male IN 1976 2296
27 Belgium Male OUT 1974 2800
28 Belgium Male OUT 1975 2490
29 Belgium Male OUT 1976 2413
30 Belgium Total IN 1974 4517
...
...
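As a compact, runnable check of the ffill-then-melt pattern, here is a two-row slice of the sample with the NaN gaps the question describes:

```python
import pandas as pd

# Two-row slice of the sample; the second row's Country/Gender are blank
# and must be forward-filled before melting.
df = pd.DataFrame({
    "Country": ["Austria", None],
    "Gender": ["Male", None],
    "Direction": ["IN", "OUT"],
    1974: [13728, 17977],
    1975: [8754, 12271],
})

df1 = df.ffill().melt(
    id_vars=["Country", "Gender", "Direction"],
    var_name="Year", value_name="Value",
)
print(df1)
```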
stack
I like your approach. I'd change it in a couple of ways:
- use the specific method for forward filling, ffill
- rename the column axis prior to stacking to avoid renaming the column later (personal preference)
df.ffill().set_index(
    ['Country', 'Gender', 'Direction']
).rename_axis('Year', axis=1).stack().reset_index(name='Value')
Country Gender Direction Year Value
0 Austria Male IN 1974 13728
1 Austria Male IN 1975 8754
2 Austria Male IN 1976 9695
3 Austria Male OUT 1974 17977
4 Austria Male OUT 1975 12271
5 Austria Male OUT 1976 9899
...
Numpy
I wanted to put together a custom approach. This should be very fast.
import numpy as np
import pandas as pd

def cstm_ffill(s):
    # Indices of non-null values, padded so every position maps to
    # the most recent non-null value at or before it.
    i = np.flatnonzero(s.notna())
    i = np.concatenate([[0], i, [len(s)]])
    d = np.diff(i)
    a = s.values[i[:-1].repeat(d)]
    return a

def cstm_melt(df):
    c = cstm_ffill(df.Country)
    g = cstm_ffill(df.Gender)
    d = cstm_ffill(df.Direction)
    y = df.columns[3:].values
    k = len(y)
    i = np.column_stack([c, g, d])
    v = np.column_stack([*map(df.get, y)]).ravel()
    # Year must come before Value to match the column labels below.
    df_ = pd.DataFrame(
        np.column_stack([i.repeat(k, axis=0), np.tile(y, len(i)), v]),
        columns=['Country', 'Gender', 'Direction', 'Year', 'Value']
    )
    return df_

cstm_melt(df)

Country Gender Direction Year Value
0 Austria Male IN 1974 13728
1 Austria Male IN 1975 8754
2 Austria Male IN 1976 9695
3 Austria Male OUT 1974 17977
4 Austria Male OUT 1975 12271
5 Austria Male OUT 1976 9899
...

Merge matching rows in excel & summarizing matching columns

Looking to merge some data and summarize the results. I've been poking around Google but haven't found anything that will match up duplicates and summarize them.
The first table below is what I'm starting with; I would like the output in the second table.

Starting data:
Street          Name            Widgets  Sprockets  Nuts  Bolts
123 Any street  ACB Co          10       248        2     50
123 Any street  Bob's plumbing  25       22         2     7
456 Another st  Bill's cars     55                  5
123 Any street  ACB Co                   54         4     6
456 Another st  Bill's cars     7        878        8     55
789 Ave         Shelley and co  5        2          2     78
123 Any street  ACB Co                   544        4     22
456 Another st  ACB Co          6        50         5
789 Ave         Divers down     6        90         9     4
789 Ave         Divers down     1                   1     7

Desired output:
Street          Name            Widgets  Sprockets  Nuts  Bolts
123 Any street  ACB Co          10       846        10    78
123 Any street  Bob's plumbing  25       22         2     7
456 Another st  Bill's cars     62       878        13    55
789 Ave         Shelley and co  5        2          2     78
789 Ave         Divers down     7        90         10    11
456 Another st  ACB Co          6        50         5
Use Pivot Tables and set the layout to tabular.
Details can be found here: https://www.youtube.com/watch?v=LkFPBn7sgEc
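If the data can be loaded into pandas instead, the same merge-and-sum is a single groupby. A sketch with a few of the rows above (blank cells read in as NaN and are treated as zero):

```python
import pandas as pd

# A few of the sample rows; None stands in for blank cells.
df = pd.DataFrame({
    "Street": ["123 Any street", "123 Any street",
               "456 Another st", "456 Another st"],
    "Name": ["ACB Co", "ACB Co", "Bill's cars", "Bill's cars"],
    "Widgets": [10, None, 55, 7],
    "Sprockets": [248, 54, None, 878],
})

# Collapse duplicate (Street, Name) rows and sum the quantity columns.
summary = df.fillna(0).groupby(["Street", "Name"], as_index=False).sum()
print(summary)
```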

Counting duplicate data in total for Excel

A B
1 8 Tiffney, Jennifer
2 8 Tiffney, Jennifer
3 8 Tiffney, Jennifer
4 8 Tiffney, Jennifer
5 8 Tiffney, Jennifer
6 8 Tiffney, Jennifer
7 9 Allen, Larry
8 9 Allen, Larry
9 9 Allen, Larry
10 9 Allen, Larry
11 9 Allen, Larry
12 10 Reid, Brian
13 10 Reid, Brian
14 10 Reid, Brian
15 10 Reid, Brian
16 10 Reid, Brian
17 10 Reid, Brian
18 10 Reid, Brian
19 10 Reid, Brian
20 10 Reid, Brian
21 10 Reid, Brian
22 10 Reid, Brian
23 11 Edington, Bruce
24 11 Edington, Bruce
25 11 Edington, Bruce
26 12 Almond, David
27 12 Almond, David
28 12 Almond, David
29 12 Almond, David
30 12 Almond, David
31 12 Almond, David
32 13 Mittal, Charu
33 13 Mittal, Charu
34 13 Mittal, Charu
35 13 Mittal, Charu
36 13 Mittal, Charu
37 13 Mittal, Charu
There are tons of duplicate rows in Excel. Is there any way to count how many people there are in total? I tried the "Count" and "Countif" formulas, but they count the duplicate data too.
There should be 6 people in total, as above. Any solution to do this?
Use the following formula:
=COUNT(IF(FREQUENCY(MATCH(A1:A37,A1:A37,0),MATCH(A1:A37,A1:A37,0))>0,1))
or
In Excel, click the Data tab and you will find Remove Duplicates. Select your column and click Remove Duplicates, and all the duplicates will be removed. You then have distinct data and will get only the 6 records.
Try this:
=SUMPRODUCT(1/COUNTIF($A$1:$A$100,$A$1:$A$100))
Where $A$1:$A$100 is your data range; limit it to the populated cells, since a blank cell makes its COUNTIF return 0 and the division produce a #DIV/0! error.
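For comparison, outside Excel the same distinct count is a single call in pandas (using a sample of the names from the question):

```python
import pandas as pd

# Sample of the duplicated names from the question's column B.
names = pd.Series([
    "Tiffney, Jennifer", "Tiffney, Jennifer",
    "Allen, Larry", "Allen, Larry",
    "Reid, Brian", "Edington, Bruce",
    "Almond, David", "Mittal, Charu",
])

# nunique counts distinct values, ignoring duplicates.
print(names.nunique())
```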