Sort a PySpark DataFrame within groups - apache-spark

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id','time'])
But I don't need to sort "id".
I have two questions:
Can I sort just "time" within the same "id", and how?
Would it be more efficient to sort only "time" rather than using orderBy() on both columns?

This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window with any function you like.
For example, say you want a column with the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
If you just want to view your result, you could find the row number and sort by that as well.
df.withColumn("order", f.row_number().over(w)).sort("order").show()
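Putting the pieces together, a minimal self-contained sketch might look like the following (assuming a local SparkSession; the rows are the sample data from the question, and the final repartition/sortWithinPartitions line is an alternative I'm adding, not something from the answer above):

# A minimal, runnable sketch tying the steps above together.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
     (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"),
     (245, 22, "Tom")],
    ["id", "time", "name"])

# Window partitioned by "id" and ordered by "time", as in the answer.
w = Window.partitionBy(df.id).orderBy(df.time)
df.withColumn("order", f.row_number().over(w)).sort("order").show()

# If you only need each "id" group ordered by "time" (no global order),
# repartitioning by "id" and sorting within partitions avoids a full sort.
# Whether this is actually faster depends on your data and partitioning.
df.repartition("id").sortWithinPartitions("time").show()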

Related

Ranking Dates Based on Another Column - Spotfire

Does anyone know of a way to circumvent the Spotfire limitation on using the OVER function to RANK or order dates in a custom expression?
To provide a little background: I am trying to identify or mark each lease in the data below as 1, 2, 3, etc. For example, since 63 appears twice in the left column, I would like to return a 1 and a 2 to identify the two different leases, starting on 1/1/2016 and 8/1/2017. Then a 1 and 2 for 72, a 1 for 140, and so on. Unfortunately, OVER functions can only be used with aggregation methods, and I don't know of another way to produce the result I am looking for.
Tenant Lease_From Lease_To Tenant_status
63 1/1/2016 1/31/2017 Current
63 8/1/2017 7/31/2018 Current
72 10/1/2016 7/31/2017 Current
72 8/1/2017 7/31/2018 Current
140 2/1/2017 7/31/2018 Current
149 8/1/2016 7/31/2017 Current
149 8/1/2017 7/31/2018 Current
156 1/15/2017 3/31/2018 Current
156 4/1/2018 3/31/2019 Current
Use this:
Rank([Lease_From], [Tenant])
Gives this as the result:
Tenant Lease_From Lease_To Tenant_status Rank([Lease_From], [Tenant])
63 1/1/2016 1/31/2017 Current 1
63 8/1/2017 7/31/2018 Current 2
72 10/1/2016 7/31/2017 Current 1
72 8/1/2017 7/31/2018 Current 2
140 2/1/2017 7/31/2018 Current 1
149 8/1/2016 7/31/2017 Current 1
149 8/1/2017 7/31/2018 Current 2
156 1/15/2017 3/31/2018 Current 1
156 4/1/2018 3/31/2019 Current 2
Please consider #blakeoft's answer as the correct one!
That said, as an FYI: First() is considered an aggregation method, and OVER statements can be included inside an If(), so you can accomplish the same thing with an expression like:
If([Lease_From] = First([Lease_From]) OVER ([Tenant]), 1, 2)
When you combine If() and OVER in this way you can get some really cool and powerful visualizations, BUT you do lose the ability to mark data effectively. This is because the expression is evaluated from the context of the If() rather than the OVER; in other words, all rows are considered instead of only the ones selected.
You can get around this with some black magic (a.k.a. data functions), but it's a bit contrived.
Again, in this situation, Rank() is absolutely the correct solution.
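As a side note, "number each row within its group" is exactly the windowing technique from the PySpark answer at the top of this page. Purely as a hedged illustration (this is not Spotfire syntax, and the rows below are just a subset of the example), the lease numbering could be sketched like this:

# Not Spotfire: a PySpark sketch of the same per-Tenant numbering,
# using the window approach from the first answer on this page.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
leases = spark.createDataFrame(
    [(63, "2016-01-01"), (63, "2017-08-01"),
     (72, "2016-10-01"), (72, "2017-08-01"),
     (140, "2017-02-01")],
    ["Tenant", "Lease_From"])

# Rank each lease within its Tenant, ordered by start date.
w = Window.partitionBy("Tenant").orderBy("Lease_From")
leases.withColumn("LeaseNo", f.rank().over(w)).show()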

Pivot table calculate field based on grouped values

After years of quietly learning from this site, I've finally hit a question whose answer I cannot seem to find on Stack Overflow...
I have a pivot table that needs to calculate Net Promoter Score from several groups within a population.
Net promoter score is calculated like so:
[% of Population that give 9 or 10/10] - [% of Population that give 1 to 6/10]
Each individual record in my source data can only have a single Score between 1 and 10:
RAW DATA:
Date (dd/mm) Country Type Score (1-10) NPS Category
01/05 US Order enq. 9 Promoter
13/05 US Check-out 5 Detractor
28/05 US Order enq. 7 Passive
So, with help from the answers below I've added a column to categorise each individual into the Promoter (9 or 10), Passive (7 or 8) and Detractor (1 to 6) groups based on that score: screenshot of raw data (with sensitive items hidden).
All that remains now is:
How can I create a calculated 'NPS' column like the one shown in my (rudimentary) representation of a pivot table below that subtracts the Detractor value from the Promoter value?
D = Detractor group
Pa = Passive group
Pr = Promoter group
    | Order enquiry   | Check-out       |
    |  D Pa  Pr  NPS  |  D Pa  Pr  NPS  |
----+-----------------+-----------------+
GB  |                 |                 |
May |  0  0 100  100  | 30 20  50   20  |
Jun | 10 30  60   50  | 35 35  60   25  |
Jul | 30 20  50   20  | 40 10  40    0  |
US  |                 |                 |
May | 45 15  40   -5  | 50 10  40  -10  |
Jun | 40 30  30  -10  | 40 30  30  -10  |
Jul |  5 35  60   55  | 20 40  40   20  |
My attempt at a calculated column can be seen in this screenshot. This results in an error and of course I haven't managed to convert the NPS counts into percentages yet.
My suggestion would be to create a new column in the source that calculates D, Pa, Pr with a formula.
You can then create the % for these values in the pivot. The NPS column can either be calculated after pivoting the output field, or by creating a pivot-column formula in Excel.
It's not clear from your question how your data is laid out, or what exactly you're asking. From what I can see, you need to add a column in your raw data table, which says something like
=COUNTIFS(UniqueID,MyUniqueID,Score,">=9")-COUNTIFS(UniqueID,MyUniqueID,Score,"<=6")
Then another column that says
=IF(NetPromoter>=9,"Pr",IF(NetPromoter>=7,"Pa","D"))
And then in your pivot table you add the Classification as a new subcolumn, and add the NPS as the Average of your NPS column, or something like that.
Please show your data if you want the formulas changed to meet your actual range/variable terms.
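As a quick sanity check of the arithmetic outside Excel, here is a tiny plain-Python sketch of the NPS definition quoted above (the scores are made up):

# NPS = %promoters (9-10) - %detractors (1-6); passives (7-8) only add to the base.
scores = [9, 5, 7, 10, 6, 8, 9]                 # hypothetical survey scores
promoters = sum(1 for s in scores if s >= 9)    # 3
detractors = sum(1 for s in scores if s <= 6)   # 2
nps = 100.0 * (promoters - detractors) / len(scores)
print(round(nps, 1))                            # 100 * (3 - 2) / 7 -> ~14.3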

merging two files based on two columns

I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the orders are the same, so there is no need to sort).
Example,
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And the output to look like
subjectid SubID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joann 31
When I use join, it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join or some other tool? Thank you.
The join command doesn't have an option to use more than one field as the join key, so you will have to add some intelligence into the mix. Assuming your files have a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$3" "$4" "$6}'
provided that the field counts are as given in your examples. Otherwise, you need to adjust the print in the awk command by adding or removing fields.
If the orders are identical, you could still merge by a single column and specify the format of which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).
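If neither join invocation quite fits, the same two-column merge is simple to script. Below is a hedged Python sketch; the file names f1 and f2 and the whitespace-delimited layout with one header line are assumptions based on the samples above:

# Merge two whitespace-delimited files on the (subjectid, subID2) pair.
lookup = {}
with open("f2") as fh:                      # subjectid subID2 prob_disease
    next(fh)                                # skip header
    for line in fh:
        subjectid, subid2, prob = line.split()
        lookup[(subjectid, subid2)] = prob

with open("f1") as fh:                      # subjectid subID2 name age
    next(fh)                                # skip header
    print("subjectid subID2 prob_disease name age")
    for line in fh:
        subjectid, subid2, name, age = line.split()
        print(subjectid, subid2, lookup.get((subjectid, subid2), "NA"), name, age)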

Modifying A SAS Data set after it was created using PROC IMPORT

I have a dataset like this
Obs MinNo EurNo MinLav EurLav
1 103 15.9 92 21.9
2 68 18.5 126 18.5
3 79 15.9 114 22.3
My goal is to create a data set like this from the dataset above:
Obs Min Eur Lav
1 103 15.9 No
2 92 21.9 Yes
3 68 18.5 No
4 126 18.5 Yes
5 79 15.9 No
6 114 22.3 Yes
Basically I'm stacking the 4 columns into 2 columns, plus a categorical variable indicating which pair of columns each value came from.
Here's what I have so far
PROC IMPORT DATAFILE='f:\data\order_effect.xls' DBMS=XLS OUT=orderEffect;
RUN;
DATA temp;
INFILE orderEffect;
INPUT minutes euros ##;
IF MOD(_N_,2)^=0 THEN lav='Yes';
ELSE lav='No';
RUN;
My question, though, is: how can I import an Excel sheet and then modify the SAS data set it creates, so that I can stack the second two columns below the first two and add a third column based on which columns each value came from?
I know how to do this by splitting the data set into two data sets and then appending one onto the other, but with the MOD function above it would be a lot faster.
You were very close, but misunderstanding what PROC IMPORT does.
When PROC IMPORT completes, it will have created a SAS data set named orderEffect containing SAS variables from the columns in your worksheet. You just need a small DATA step to produce the result you want. Try this:
data want;
/* Define the SAS variables you want to keep */
format Min 8. Eur 8.1;
length Lav $3;
keep Min Eur Lav;
set orderEffect;
Min = MinNo;
Eur = EurNo;
Lav = 'No';
output;
Min = MinLav;
Eur = EurLav;
Lav = 'Yes';
output;
run;
This assumes that the PROC IMPORT step created a data set with those names. Run that step first to be sure and revise the program if necessary.

How can I align columns where the biggest number or greatest string is the align indicator?

How can I right align (and left align?) a block of numbers or text in vim like this:
from:
45 209 25 1
2 4 2 3
34 5 300 5
34 120 34 12
to this:
45 209  25  1
 2   4   2  3
34   5 300  5
34 120  34 12
That means the biggest number or greatest string in every column doesn't move.
In the first column those are 45 and 34, in the second column 209 and 120, in the third column 300, and in the last column 12.
Have a look at the Align plugin; it can do this and much more. A great tool for your utility belt!
Found here
After some serious vimhelp/reading I found the correct AlignCtrl mapping...
Visually select the table, e.g. by using ggVG, then do a \Tsp i.e. <leader>Tsp
Then I get this:
45 209  25  1
 2   4   2  3
34   5 300  5
34 120  34 12
From vimhelp:
\Tsp : use Align to make a table separated by blanks |alignmap-Tsp|
(right justified)
You can look into the Tabularize plugin. So if you have something like
45 209 25 1
2 4 2 3
34 5 300 5
34 120 34 12
just select those lines in visual mode and type :Tab / and it will format them as
45 209 25  1
2  4   2   3
34 5   300 5
34 120 34  12
Also, it looks like you don't have an equal number of spaces separating the numbers at the moment. So before you use the plugin, replace all the multiple spaces with a single space with the following regex:
%s![^ ]\zs \+! !g
With the Align plugin you can select the rows you want to align and hit :
<Leader>Tsp
From Align.txt
\Tsp : use Align to make a table separated by blanks |alignmap-Tsp|
(right justified)
(The help mentions \ because it is the default leader; if you have changed it to something else, adapt accordingly.)
Just trying on my install, I got the following result :
45 209  25  1
 2   4   2  3
34   5 300  5
34 120  34 12
In my opinion the Align plugin is great, but the "align maps" and various commands are not really easy to remember.
With the Align and AlignMaps plugins: select using V, then \anum (AlignMaps comes with Align). One advantage of \anum is that it also handles decimal points (commas) and scientific notation.
I think the best thing to do is to first eat all multiple spaces with
:{range}s/ \+/ /g
And then call Tabularize
:Tab / /r1
Or change that r to l.
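If you would rather not install a plugin at all, you can also filter the selected lines through an external script from visual mode (e.g. :'<,'>!python3 align.py). The sketch below is only an illustration and assumes every line has the same number of whitespace-separated columns:

# align.py - right-align whitespace-separated columns read from stdin.
import sys

rows = [line.split() for line in sys.stdin if line.strip()]
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
for row in rows:
    print(" ".join(cell.rjust(width) for cell, width in zip(row, widths)))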
