Ranking Dates Based on Another Column - Spotfire - spotfire

Does anyone know of way to circumvent the Spotfire limitation for using the OVER function to RANK or order dates when using a custom expression?
Providing a little background, I am trying to identify or mark a lease based on the below data as 1, 2, 3 etc. For example, since we see twice 63 in the left column, I would like to return a 1 and a 2 to identify the two different leases, starting on 1/1/2016 and 8/1/2016. Then a 1 and 2 for 72, a 1 for 140 and so one. Unfortunately, OVER functions can only be used with aggregation methods and I don't know of another method to produce the result that I am looking for.
Tenant Lease_From Lease_To Tenant_status
63 1/1/2016 1/31/2017 Current
63 8/1/2017 7/31/2018 Current
72 10/1/2016 7/31/2017 Current
72 8/1/2017 7/31/2018 Current
140 2/1/2017 7/31/2018 Current
149 8/1/2016 7/31/2017 Current
149 8/1/2017 7/31/2018 Current
156 1/15/2017 3/31/2018 Current
156 4/1/2018 3/31/2019 Current

Use this:
Rank([Lease_From], [Tenant])
Gives this as the result:
Tenant Lease_From Lease_To Tenant_status Rank([Lease_From], [Tenant])
63 1/1/2016 1/31/2017 Current 1
63 8/1/2017 7/31/2018 Current 2
72 10/1/2016 7/31/2017 Current 1
72 8/1/2017 7/31/2018 Current 2
140 2/1/2017 7/31/2018 Current 1
149 8/1/2016 7/31/2017 Current 1
149 8/1/2017 7/31/2018 Current 2
156 1/15/2017 3/31/2018 Current 1
156 4/1/2018 3/31/2019 Current 2

please consider #blakeoft's answer as the correct one!
that said, as an FYI, First() is considered an aggregation method, and OVER statements can be included inside of an If()! so you can accomplish the same thing with an expression like:
If([Lease_From] = First([Lease_From]) OVER ([Tenant]), 1, 2)
when you combine If() and OVER in this way, you can get some really cool and powerful visualizations, BUT you do lose the ability to mark data effectively. this is because the expression is evaluated from the context of the If() rather than the OVER; in other words, all rows are considered instead of only the ones selected.
you can get around this with some black magic (AKA data functions) but it's a bit contrived.
again, in this situation, Rank() is absolutely the correct solution.

Related

DAX help: % monthly share of another table

I have a DAX formula for my Powerpivot I cannot get to solve and was hoping for help.
I have two pivot tables connected already
Showing a cohort of actions taken within Month 1,….X on the sign up month
Total Sign Ups on monthly basis
I have tried to attached the sheet here but somehow I cant so I have add a screenshot of the sheet.1
What I have so far is:
=DIVIDE(
SUM(Range[conversion to KYC completed]),
SUM('Range 1'[Sum of signups]))
But this does not give me what I want as I think I’m missing the monthly grouping somehow.
Question 1:
What I want is to get the share of actions completed within 1,...,X months out of the total sign up that given month (e.g. Jan) (so the data from Table 2)
Question 2:
In best case I would also like to show total sign ups in the beginning of the cohort to make the cohort easier to understand, so having the monthly total sign up (which the cohort is calculated based on). But now I cannot get just the totals month by month. Is there anyways just to add in a monthly total column in the pivot without applying these number as a value across all columns?
Something like this is the ultimate outcome for me 2
UPDATED WITH SAMPLE DATA
Signup month, KYC completed month, Age by month, signups, conversion to KYC completed
Jan-17 Jul-18 18 97 75
Jan-17 Jul-18 18 99 79
Jan-17 Dec-18 23 95 80
Feb-17 May-18 15 99 74
Feb-17 Jul-18 17 90 75
Feb-17 Jul-18 17 95 76
Feb-17 Aug-18 18 92 71
Mar-17 May-18 14 94 73
Apr-17 Jul-18 15 93 75
May-17 Sep-18 16 94 70
May-17 Oct-18 17 98 72
Jun-17 May-18 11 95 79
Jul-17 Oct-18 15 97 74
Jul-17 Jul-18 12 94 78
Aug-17 Sep-18 13 96 74
Sep-17 Nov-18 14 95 80
Sep-17 Oct-18 13 94 79
DESIRED OUTCOME
The % for Month 1....X is calculated KYC Completed / Monthly Sign up
OUTPUT WITH THIS CODE
=VAR SignUpMonth = IF(HASONEVALUE('Range 1'[Row Labels]), BLANK())
RETURN
DIVIDE(CALCULATE(SUM([conversion to KYC completed])),
CALCULATE(SUM('Range 1'[Sum of signups]),
FILTER(ALL(Range), Range[Signup month (Month Index)] = SignUpMonth)))
[
Thanks for the sample data Franzi. Still not too clear what you're asking for, but perhaps this will help a little.
Signed Up to Signed In Ratio =
VAR SignUpMonth = SELECTEDVALUE(Table1[Signup month], BLANK())
RETURN
DIVIDE(CALCULATE(SUM([conversion to KYC completed])),
CALCULATE(SUM(Table1[ signups]),
FILTER(ALL(Table1), Table1[Signup month] = SignUpMonth)))
So. Let's break it down.
If I understand correct, you want to see the cross section of number of signins for a given month ( x axis ) signup combo ( y axis ) and divide that number by the total signups ( y axis ) per signup month.
number of signins for a given month ( x axis ) signup combo ( y axis ):
CALCULATE(SUM([conversion to KYC completed]))
TOTAL signups ( y axis ) per signup month
CALCULATE(SUM(Table1[ signups]),
FILTER(ALL(Table1), Table1[Signup month] = SignUpMonth))

Adding B where A is same [in excel]

I want to add two rows where a variable is same.
sample.csv:
A B
corn 56
apple 43
banana 54
corn 87
mango 63
apple 67
corn 30
I want to add values of B where A is same and want to store answer in another column as follows:
corn 162
apple 110
banana 54
mango 63
Can I do this in excel? If yes, then what is the formula for that? I searched a-lot but unable to reach solution.
Try,
Select an empty cell.
Go to Data - Data Tools - Consolidate
Select the range (both columns)
Press add
Tick Left column
Press ok

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderby(['id','time'])
But I don't need to sort "id".
I have two questions:
Can I just sort "time" within same "id"? and How?
Will be more efficient if I just sort "time" than using orderby() to sort both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window over any function:
For e.g.: let's say you want to create a column of the time delta between each row within the same group
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
If you just want to view your result, you could find the row number and sort by that as well.
df.withColumn("order", f.row_number().over(w)).sort("order").show()

merging two files based on two columns

I have a question very similar to a previous post:
Merging two files by a single column in unix
but i want to merge my data based on two columns (The orders are the same, so no need to sort).
Example,
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And the output to look like
subjectid SubID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joanna 31
when i use join it only considers the first column(subjectid) and repeats the SubID2 column.
Is there a way of doing this with join or some other way please? Thank you
join command doesn't have an option to scan more than one field as a joining criteria. Hence, you will have to add some intelligence into the mix. Assuming your files has a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$3" "$4" "$6}'
provided the the field counts are as given in your examples. Otherwise, you need to adjust the scope of print in the awk command, by adding or taking away some fields.
If the orders are identical, you could still merge by a single column and specify the format of which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).

Excel date/product count to specified limit

Column A "Sales Dates", Column B "=A2-A1" for "Date Diff", Column C "Customer Name", Column D "Item", Column E "Items Ordered Count"
My issue is I have to do a running 30 day total for each customer to see that specific items are not being ordered above "x" number within any 30-day period.
Does anyone have any ideas?
I may not be fully understanding your question, but I don't think you can do what you ask in excel. This might be a situation where a database that can do SQL might come in handy.
The best I can come up with in excel is a Pivot Table, with the customers as rows, dates as columns (group by month), and sum of Items Ordered in the data area. Then conditional format the data area to highlight values > your limit.
Perhaps if you provide some sample data & output I can come up with something more like what you need.
The formula would look something like this:
{=SUM(IF((A$2:A2>=A2-29)*(D$2:D2=D2),E$2:E2,0))}
It should be entered into cell F2 and copied down to the last row of your data. I pasted in a test spreadsheet below so you can see where things go (sorry for the formatting--hopefully it will look better if you paste it into Excel).
IMPORTANT: This is an array formula, so after you type in the formula (and don't type in the braces {} when you do), you must press Ctrl-Shift-Enter instead of just Enter (see this link for more details).
What does the formula do? It does two loops:
First, it loops through all the Sales Dates from the beginning of the log to the current row and checks if each date is between the date of the current row and 29 days earlier (which makes a 30-day window). (By "current row" I mean the row where the formula is located.)
Second, it loops through all the Items from the beginning of the log to the current row and checks if there is a match with the Item of the current row.
For any row where both checks are true (the "*" in the formula does an "and" operation), Items Ordered Count is added to the sum, otherwise zero is added to the sum. So, when it's finished, you have a count for each row of how many orders there were in the past 30 days for that item.
HTH,
-Dan
Sales Dates Date Diff Customer Name Item Items Ordered Count 30-Day Count
1/1/2009 0 dfsadf 11336 70 70
1/2/2009 1 asdfd 10218 121 121
1/3/2009 1 fsdfjkfl 10942 101 101
1/6/2009 3 slkdjflsk 13710 80 80
1/7/2009 1 slkdjls 10480 127 127
1/9/2009 2 sdjjf 11336 143 213
1/11/2009 2 woieuriwe 11501 84 84
1/14/2009 3 owqieyurtn 10191 78 78
1/15/2009 1 weisd 10480 113 240
1/16/2009 1 woieuriwe 12024 133 133
1/17/2009 1 vkcjl 13818 125 125
1/20/2009 3 sdflkj 11336 128 341
1/23/2009 3 jnbkdl 10480 141 381
1/25/2009 2 pqcvnlz 10480 137 518
1/27/2009 2 hwodkjgfh 12878 80 80
1/28/2009 1 zjdnfg;pwlkd 10942 123 224
1/31/2009 3 zlkdjnf;psod 13173 93 93
2/2/2009 2 zlknpdodfg 11336 119 390
2/4/2009 2 zjhdfpwskjh 12004 57 57
2/5/2009 1 asdfd 10218 121 121
2/8/2009 3 fsdfjkfl 10942 101 224
2/11/2009 3 slkdjflsk 13710 80 80
2/14/2009 3 slkdjls 10480 127 405
2/16/2009 2 sdjjf 11336 143 390
2/18/2009 2 woieuriwe 11501 84 84
2/21/2009 3 owqieyurtn 10191 78 78
2/24/2009 3 weisd 10480 113 240
2/25/2009 1 woieuriwe 12024 133 133
2/27/2009 2 vkcjl 13818 125 125
2/28/2009 1 sdflkj 11336 128 390

Resources