I'm running into some issues with the use of apply in Pandas.
I have a dataframe containing multiple measurements made on certain days at different measurement sites. To give an example, Site 1 has 2 measurements to make every 7 or so days.
We know it has to be 2 measurements. What I'm trying to do now is find the days on which not enough measurements were made.
   site  measurement  expected  date
0  1     2            1         01-01-2020
1  2     3            2         01-01-2020
2  3     4            2         01-01-2020
3  3     5            2         01-01-2020
4  2     1            2         08-01-2020
5  2     4            2         08-01-2020
I've made a sorted DataFrame that aggregates the measurements, so that I can iterate over it without visiting the same day twice when there are multiple measurements.
   site  measurement  expected  date
0  1     2            1         01-01-2020
1  2     3            2         01-01-2020
2  3     9            2         01-01-2020
3  2     5            2         08-01-2020
For the check itself, I'm currently using the function filter_amount(df_sorted, df, group).
group counts the number of measurements made: df.groupby(["site_id", "date"]).count()
   site  measurement  date
0  1     2            01-01-2020
1  2     1            01-01-2020
2  3     2            01-01-2020
3  2     2            08-01-2020
The current function basically goes like this:
def filter_amount(df_sorted, df, group):
    for i in df_sorted.index:
        # Locate the number of measurements actually made for this day and site, in group.
        # Check how many measurements are expected.
        # If there are not enough measurements:
        #     find all measurements for that day and site in the normal df and drop them.
So in this example, the measurements from site 2 on 1-1-20 have to be dropped, because there are not enough measurements. The ones from 8-1-20 are valid, because 2 are expected and 2 were made.
The problem is that this is extremely slow with over 500k rows.
I get the variables I need by using .at[]. I'm trying to make it faster by using apply so I can parallelize the operations, but I can't figure it out. I'm doing the apply on df_sorted and passing in the arguments needed, but it's not actually dropping the measurements from the original df.
I have a feeling that it's possible to do this with some sort of groupby on the original df to save operations.
I hope it's clear enough; I'm happy to elaborate on any questions.
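A minimal sketch of the groupby idea mentioned above, assuming the column names site, measurement, expected and date from the example table (the real frame may use site_id instead). It counts the measurements per site and day with transform, so the counts line up row by row with the original df, and then keeps only the rows whose group reached the expected count. No explicit loop or apply is needed.

```python
import pandas as pd

# Example data copied from the first table in the question.
df = pd.DataFrame({
    "site": [1, 2, 3, 3, 2, 2],
    "measurement": [2, 3, 4, 5, 1, 4],
    "expected": [1, 2, 2, 2, 2, 2],
    "date": ["01-01-2020"] * 4 + ["08-01-2020"] * 2,
})

# Number of measurements actually made for each (site, date) group,
# broadcast back onto every row of that group.
made = df.groupby(["site", "date"])["measurement"].transform("count")

# Keep only rows whose group has at least as many measurements as expected.
filtered = df[made >= df["expected"]]
print(filtered)
```

With the example data this drops only the single measurement from site 2 on 01-01-2020 and keeps everything else.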
Note, I have edited my original question to clarify my problem:
As the title suggests, I am looking for a way to combine the SUMPRODUCT functionalities with an INDEX and MATCH formula, but if a better approach exists to help solve the problem below I am also open to it.
In the example below, imagine that the tables are on different sheets. I have a report that has the sales of each ID in the rows and each month in the columns (first table). Unfortunately, the report only has IDs and not the region they belong to, but I do have a lookup table which labels each ID with its respective region (second table):
   A    B        C         D
1  ID   January  February  March
2  1    10       5         20
3  3    5        5         10
4  7    0        10        5
5  14   10       25        5
6  25   5        10        10
7  27   10       10        10
8  44   5        5         5

   A    B
1  ID   Region
2  1    East
3  3    East
4  7    Central
5  14   Central
6  25   Central
7  27   West
8  44   West
My goal is to be able to aggregate the sales by region as per the result below. However I would only like to show sales data that belong to the month that is shown in cell D2.
Goal:
   A        B      C  D
1  Region   Sales     February
2  East     10
3  Central  45
4  West     15
I have used the INDEX and MATCH combination to return a single value, but not sure how I can return multiple values with it and aggregate them at the same time. Any insight would be appreciated!
You may just use:
=SUMPRODUCT((Sheet1!B$1:D$1=D$1)*(Sheet1!H$2:H$8=A2),Sheet1!B$2:D$8)
Remember that SUMPRODUCT() can be quite heavy when processing large amounts of data, so combining INDEX() and MATCH() is not a bad idea. But let's do it the other way around and nest the latter two inside SUMPRODUCT() instead:
=SUMPRODUCT(INDEX(Sheet1!B$2:D$8,0,MATCH(D$2,Sheet1!B$1:D$1,0))*(Sheet1!H$2:H$8=A2))
Another option is to use SUMIF together with INDEX and MATCH.
In "Sheet2" B2, copied down:
=SUMIF(Sheet1!H:H,A2,INDEX(Sheet1!B$1:D$1,MATCH(D$2,Sheet1!B$1:D$1,0)))
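For readers who want to cross-check the formulas outside Excel, here is a rough pandas sketch of the same lookup-then-aggregate idea. It is not part of the answers above; the frame and variable names (sales, regions, month) are made up for illustration, and the data is copied from the two tables in the question.

```python
import pandas as pd

# Sales report: one row per ID, one column per month.
sales = pd.DataFrame({
    "ID": [1, 3, 7, 14, 25, 27, 44],
    "January": [10, 5, 0, 10, 5, 10, 5],
    "February": [5, 5, 10, 25, 10, 10, 5],
    "March": [20, 10, 5, 5, 10, 10, 5],
})

# Lookup table: ID -> Region.
regions = pd.DataFrame({
    "ID": [1, 3, 7, 14, 25, 27, 44],
    "Region": ["East", "East", "Central", "Central", "Central", "West", "West"],
})

month = "February"  # plays the role of the month cell in the goal table

# Attach the region to each ID (the MATCH step), then sum the chosen
# month per region (the SUMPRODUCT / SUMIF step).
merged = sales.merge(regions, on="ID", how="left")
by_region = merged.groupby("Region", sort=False)[month].sum()
print(by_region)
```

For February this gives East 10, Central 45 and West 15, which matches the goal table.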
I'm looking for a way to compare multiple rows of data with each other, trying to find the best possible match. Each number in every column must approximately match the other numbers in the same column.
Example:
Customer #1: 1 5 10 9 7 7 8 2 3
Customer #2: 10 5 9 3 5 7 4 3 2
Customer #3: 1 4 10 9 8 7 6 2 2
Customer #4: 9 5 6 7 2 1 10 5 6
In this example customers #1 and #3 are quite similar, and I need to find a way to highlight or sort the rows so I can easily find the best match.
I've tried using conditional formatting to highlight the numbers that are similar, but that is quite confusing, because the amount of data is quite big.
Any ideas of how I could solve this?
Thanks!
The following formula, entered in (say) L1 and pulled down, gives the score of the best match for the current row, based on the sum of the absolute differences between corresponding cells:
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(ABS($C1:$K1-$C$1:$K$4),TRANSPOSE(COLUMN($C$1:$K$4))^0))))
It is an array formula and must be entered with Ctrl+Shift+Enter.
You can then sort on column L to bring the customers with the lowest scores (the closest matches) to the top, or use conditional formatting to highlight rows with a certain score.
EDIT
If you wanted to penalise large differences in individual columns more heavily than small ones, to avoid pairing customers who are fairly similar overall but very different in a few columns, you could try something like the square of the differences:
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(($C1:$K1-$C$1:$K$4)^2,TRANSPOSE(COLUMN($C$1:$K$4))^0))))
then the scores for your test data would come out as 7,127,7,127.
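As a cross-check of the two array formulas above, here is a small NumPy sketch of the same calculation. The function name best_match_scores and its power parameter are illustrative, not from the answer: power=1 corresponds to the sum of absolute differences and power=2 to the sum of squared differences.

```python
import numpy as np

# The four example customers from the question, one row each.
customers = np.array([
    [1, 5, 10, 9, 7, 7, 8, 2, 3],
    [10, 5, 9, 3, 5, 7, 4, 3, 2],
    [1, 4, 10, 9, 8, 7, 6, 2, 2],
    [9, 5, 6, 7, 2, 1, 10, 5, 6],
])

def best_match_scores(data, power=1):
    """Return (score, partner index) of the closest other row for each row."""
    # Pairwise differences between every row and every other row.
    diffs = np.abs(data[:, None, :] - data[None, :, :]) ** power
    dist = diffs.sum(axis=2).astype(float)
    # A row must not be matched with itself (the ROW()<>ROW() test in Excel).
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1), dist.argmin(axis=1)

abs_scores, abs_partner = best_match_scores(customers, power=1)
sq_scores, sq_partner = best_match_scores(customers, power=2)
print(abs_scores, abs_partner)
print(sq_scores, sq_partner)
```

The squared version reproduces the scores 7, 127, 7, 127 quoted above, and the partner indices show that customers #1 and #3 are each other's closest match.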
I'm assuming you want to compare customers 2-4 with customer 1 and that you are comparing only within each column. In this case, you could implement a 'scoring system' using multiple IFs. For example, with
   A           B  C  D  E
1  Customer 1  1  1  2
2  Customer 2  1  2  2
3  Customer 3  0  1  0
you could use this in E2:
=if(B2=$B$1,1,0)+if(C2=$C$1,1,0)+if(D2=$D$1,1,0)
Each IF returns a 'score' of 1 when the cell matches and 0 when it doesn't. The formula then adds up the scores, and the highest total is your best match. Copying down would then give
   A           B  C  D  E
1  Customer 1  1  1  2
2  Customer 2  1  2  2  2
3  Customer 3  0  1  0  1
so customer 2 is the best match.
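The same exact-match scoring can be sketched in a few lines of NumPy, shown here only to make the counting explicit. The array names reference and others are illustrative; the values are taken from the small table above.

```python
import numpy as np

# Customer 1 is the reference row; customers 2 and 3 are compared against it.
reference = np.array([1, 1, 2])
others = np.array([
    [1, 2, 2],  # Customer 2
    [0, 1, 0],  # Customer 3
])

# One point per column that equals the reference, summed per row.
scores = (others == reference).sum(axis=1)
best = int(scores.argmax())  # index into `others` of the best match
print(scores, best)
```

The scores come out as 2 and 1, so the first row of others (customer 2) is the best match, as in the table.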
I need to create something that is used to track and report class ratings. The class evaluation form has eight questions and the ratings are from 1-4. We need to show the average rating for each question per class. It will be an ongoing list, so it needs to allow users to keep adding new class ratings without re-formatting the spreadsheet.
For example, Class PD100 has a total of 5 for Q1 across 3 ratings, so I need to show an average score of about 1.67 for that question. Then the same for all the other questions, grouped by Class.
Class # Q1 Q2 Q3 Q4
PD100 1 2 3 1
PD100 3 2 3 4
PD100 1 2 3 1
PD200 2 1 2 2
PD200 1 2 3 4
PD200 1 4 1 4
PD300 1 4 4 1
This is a bit unwieldy because the data is already laid out like a pivot, but if you create a pivot table from it, the total average columns give you what you're looking for.
If the data were laid out in long form (class, question, answer), the pivot could be cleaner.
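To illustrate the difference between the two layouts, here is a hedged pandas sketch using the sample rows from the question (only Q1 to Q4 are shown there). The names ratings, avg_wide, long_form and avg_long are made up for this example. It computes the per-class averages directly from the wide layout, then reshapes to class / question / answer and pivots that instead.

```python
import pandas as pd

# Sample ratings from the question, in the original wide layout.
ratings = pd.DataFrame({
    "Class": ["PD100", "PD100", "PD100", "PD200", "PD200", "PD200", "PD300"],
    "Q1": [1, 3, 1, 2, 1, 1, 1],
    "Q2": [2, 2, 2, 1, 2, 4, 4],
    "Q3": [3, 3, 3, 2, 3, 1, 4],
    "Q4": [1, 4, 1, 2, 4, 4, 1],
})

# Wide layout: average of every question column per class.
avg_wide = ratings.groupby("Class").mean()

# Long layout: one row per (class, question, answer), then pivot.
long_form = ratings.melt(id_vars="Class", var_name="Question", value_name="Answer")
avg_long = long_form.pivot_table(
    index="Class", columns="Question", values="Answer", aggfunc="mean"
)

print(avg_wide.round(2))
```

Both routes give the same averages; new rows appended to ratings are picked up the next time the calculation runs, which mirrors refreshing a pivot table.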
I'm using Excel pivot tables to produce a report. The pivot table connects to an SSAS cube. I have 2 measures: measure 1 is a 'real' measure, and measure 2 is calculated based upon measure 1. Measure 1 must be shown broken out by dimB members across the columns. With measure 2 I just want the totals column.
I've hidden the measure 2 columns as a workaround, but this is less than ideal: when users expand or collapse the dimension B members, the pivot table moves relative to the hidden columns and the report becomes a mess. It's also returning extra data, which can't help performance.
Here is what I have:
               Measure 1   Measure 2   Measure 1 Total   Measure 2 Total
               a  b  c     a  b  c
DimA- member1  2  3  4     2  3  4     9                 9
DimA- member2  1  4  5     1  2  5     10                8
This is what I want:
               Measure 1   Measure 1 Total   Measure 2 Total
               a  b  c
DimA- member1  2  3  4     9                 9
DimA- member2  1  4  5     10                8
Is there a way to achieve the second option? Either with perhaps some mdx on the calculated measure (scope/custom rollup etc) or with the pivottable itself?
Basically I want the total without the dimension B breakdown for measure 2.
I don't know of a technique for that, and I'm afraid it may be impossible.
What about setting all members in the scope to NULL, at least for the lowest levels?
Something like this:
SCOPE ([Measures].[Measure 2 Hide]);
THIS = IIF(Axis(1).Item(0).Item(0).Hierarchy.Level.Ordinal=1
,[Measures].[Measure 2],null);
END SCOPE;
You could also use <2 instead of =1. Measure 2 will then at least be empty for users at those levels.
Here is the definition from my Excel example (the measure is called Measure 2 Hide):
CREATE MEMBER CURRENTCUBE.[Measures].[Measure 2 Hide]
AS
[Measures].[Count]-1,
VISIBLE = 1;
I understand that it's not a full solution, but maybe it will help somehow.
I am retired, so I have a lot of free time on my hands, and I like playing the greyhounds. Using Excel, I attempted to save time by sorting two columns at a time (Post and B-SPD, etc.), because some columns have to be sorted low to high and others high to low. The columns not marked Post are formula columns (=ABS(A3), etc.). When I try to sort, the Post column sorts, but the column I am sorting with it does not follow, so the values no longer match the post numbers they belong to.
Can this be done, or does the =ABS formula prevent it? Also, even with =ABS some of the numbers are negative, but the minus sign does not appear. I have tried everything: formatting as Number, as Currency, as General, but nothing works. One person sent me to the Control Panel's Clock, Language, and Region settings, to go to the additional settings and set the negative number format there. But that did not make the minus sign show up on negative numbers either.
Is any of the above even possible?
POST  B-SPD  POST  A-SPD  POST  8TH    POST  L-SPD  POST  A-FIN
1     31.43  1     31.84  1     0      1     3      1     5.83
2     31.43  2     31.67  2     35.14  2     0      2     4.67
3     31.79  3     31.9   3     59.11  3     6      3     5.67
4     31.32  4     31.73  4     65.5   4     3      4     3.83
5     31.47  5     31.68  5     29.71  5     4      5     3.33
6     31.76  6     32.18  6     100    6     9      6     5.01
7     31.48  7     31.99  7     41.13  7     1      7     5.67
8     31.69  8     31.99  8     75.79  8     6      8     4.83
      LOW          LOW          HIGH         HIGH         LOW
The requirement is not clear, but using ABS should not cause sorting issues. The purpose of the =ABS function (for absolute value) is to strip out minus signs, so it seems to be working properly.
OP may be advised to delete all but one of the POST columns, select headers and data (not including 'LOW' or 'HIGH') and Insert > Tables - Table, check My table has headers, OK, then use the filter buttons to choose whether to Sort Smallest to Largest or Sort Largest to Smallest for whichever attribute is to be ordered (the other columns will 'follow suit'). The filter button will show which button is determining the ordering and whether ascending or descending (eg table below is sorted descending on '8TH'):