I have the below dataframe,
|id   |lat      |lng        |timestamp          |
+-----+---------+-----------+-------------------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|
+-----+---------+-----------+-------------------+
I can find the count (number of occurrences), precount (previous count) and pretsp (previous timestamp) using windows partitioned by id and by (id, timestamp).
val specDevicePartiton = Window.partitionBy("id").orderBy("timestamp")
val specDevicePartitonTimeStamp = Window.partitionBy("id", "timestamp").orderBy("timestamp")
val userProfileDF = deviceDF.withColumn("prelatitude", lag(deviceDF("lat"), 1).over(specDevicePartiton))
  .withColumn("prelongitude", lag(deviceDF("lng"), 1).over(specDevicePartiton))
  .withColumn("pretimestamp", lag(deviceDF("timestamp"), 1).over(specDevicePartiton))
  .withColumn("pretsp", when((col("timestamp") === col("pretimestamp")), first(col("pretimestamp"))
    .over(specDevicePartitonTimeStamp)).otherwise(col("pretimestamp")))
  .withColumn("count", count("timestamp").over(specDevicePartitonTimeStamp))
  .withColumn("previousCount", lag(col("count"), 1).over(specDevicePartiton))
  .withColumn("precount", when((col("timestamp") === col("pretimestamp")), first(col("previousCount"))
    .over(specDevicePartitonTimeStamp)).otherwise(col("previousCount")))
  .withColumn("preFirstLat", when((col("precount").>(1)) && (col("count") === 1), first(col("lat")).over(specDevicePartitonPreTimeStamp.rowsBetween(-2, -1))))
  .withColumn("preFirstLng", when((col("precount").>(1)) && (col("count") === 1), first(col("lng")).over(specDevicePartitonPreTimeStamp)))
  .drop("prelatitude", "prelongitude", "nxtlatitude", "nxtlongitude", "pretimestamp")
You can find the output dataframe below:
|id   |lat      |lng        |timestamp          |pretsp             |count|precount|preFirstLat|preFirstLng|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|2017-11-06 18:44:12|1 |1 |null |null |
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|2017-11-06 19:33:16|1 |1 |null |null |
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |null |null |
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |null |null |
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|2017-11-07 01:39:07|1 |2 |3.1357369 |101.686727 |
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |null |null |
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |null |null |
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |null |null |
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |null |null |
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
I want to find the first and last lat/lng values from the previous timestamp group for the current row. The expected output would be like this:
|id   |lat      |lng        |timestamp          |pretsp             |count|precount|preFirstLat|preFirstLng|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|2017-11-06 18:44:12|1    |null    |null       |null       |
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|2017-11-06 19:33:16|1 |1 |3.1357369 |101.6863713|
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |3.1360323 |101.6874385|
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |3.1360323 |101.6874385|
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|2017-11-07 01:39:07|1 |2 |3.1357369 |101.686727 |
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |3.1357369 |101.6863713|
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |3.1357369 |101.6863713|
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |3.1357369 |101.6863713|
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |3.1357369 |101.6863713|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
Logic:
Find the first lat and lng values from the previous rows for the current row. Here "previous rows" means the rows of the previous timestamp, which share one timestamp but can have different lat and lng values.
Example: check the timestamps 2017-11-07 04:16:30 and 2017-11-07 05:05:03, and so on, in the above output.
I have tried to build rowsBetween(start, end) dynamically by taking precount as start and -1 as end, but I don't know how to achieve this.
I also have to do the same for figuring out the last value; if I get a solution for finding the first value, I think it will be the same for the last value.
Here is a simple example:
Id val1 date prefirstVal preLastVal
1 10 2017-11-06 null null
1 20 2017-11-07 10 10
1 25 2017-11-07 10 10
**1 30 2017-11-08 20 25**
1 35 2017-11-09 30 30
1 40 2017-11-09 30 30
Here, in the highlighted row (2017-11-08), prefirstVal 20 comes from the first entry of 2017-11-07 and preLastVal 25 comes from the last entry of 2017-11-07.
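For the simple example, one possible direction is to aggregate each date group and shift the result with lag (an untested sketch; the ord column is a hypothetical ordering column, e.g. a timestamp or a monotonically increasing id, that fixes the row order inside each date):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// collapse each (Id, date) group to its first and last val1,
// picking first/last by the assumed ordering column "ord"
val grouped = df
  .groupBy("Id", "date")
  .agg(
    min(struct(col("ord"), col("val1"))).getField("val1").as("firstVal"),
    max(struct(col("ord"), col("val1"))).getField("val1").as("lastVal")
  )

// shift the per-date values to the next date and join them back to the detail rows
val byDate = Window.partitionBy("Id").orderBy("date")
val shifted = grouped
  .withColumn("prefirstVal", lag("firstVal", 1).over(byDate))
  .withColumn("preLastVal", lag("lastVal", 1).over(byDate))
  .select("Id", "date", "prefirstVal", "preLastVal")

val result = df.join(shifted, Seq("Id", "date"), "left")
The same shape should carry over to the real data by grouping on (id, timestamp) instead of (Id, date).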
Thanks,
This is my first post so let me know if I need to give more details.
I am trying to create a boolean column, "immediate", that shows true when at least one of the columns has some data in it. If all are null then the column should be false. I am using when().otherwise in Spark but I'm not getting the result I would expect.
Below is the code I'm using:
val evaluation = evaluation_raw
  .withColumn("immediate",
    when(col("intended_outcome_review").isNull
      && col("outcome").isNull
      && col("impact").isNull
      && col("impact_self").isNull
      && col("next_step").isNull,
      lit(false))
    .otherwise(lit(true)))
  .select(
    col("id"),
    col("intended_outcome_review"),
    col("outcome"),
    col("impact"),
    col("impact_self"),
    col("next_step"),
    col("immediate"))
Desired outcome:
+--------+------------------------+-------------+-------+------------+----------+----------+
|id |intended_outcome_review |outcome |impact |impact_self |next_step |immediate |
+--------+------------------------+-------------+-------+------------+----------+----------+
|1568 |null |null |4 |3 |null |true |
|1569 |null |null |null |null |null |false |
|1570 |null |null |null |null |null |false |
|1571 |1 |improved coms|3 |3 |email prof|true |
+--------+------------------------+-------------+-------+------------+----------+----------+
Actual outcome:
+--------+------------------------+-------------+-------+------------+----------+----------+
|id |intended_outcome_review |outcome |impact |impact_self |next_step |immediate |
+--------+------------------------+-------------+-------+------------+----------+----------+
|1568 |null |null |4 |3 |null |true |
|1569 |null |null |null |null |null |true |
|1570 |null |null |null |null |null |false |
|1571 |1 |improved coms|3 |3 |email prof|true |
+--------+------------------------+-------------+-------+------------+----------+----------+
If anyone knows what I may be doing wrong please let me know.
Thanks!
You can use a trick: cast each column's isNotNull() to int and sum them; if the sum is above 0 then at least one column has data, so immediate should be true (assuming import pyspark.sql.functions as F):
.withColumn(
    'immediate',
    (
        F.col('intended_outcome_review').isNotNull().cast('int') +
        F.col('outcome').isNotNull().cast('int') +
        F.col('impact').isNotNull().cast('int') +
        F.col('impact_self').isNotNull().cast('int') +
        F.col('next_step').isNotNull().cast('int')
    ) > 0
)
It turns out some of the columns are converted from null to "" when other parts of the form are filled out.
The answer below considers both empty strings and null values:
.withColumn("immediate",
  when((col("intended_outcome_review").isNull || col("intended_outcome_review") === "")
    && (col("outcome").isNull || col("outcome") === "")
    && (col("impact").isNull || col("impact") === "")
    && (col("impact_self").isNull || col("impact_self") === "")
    && (col("next_step").isNull || col("next_step") === ""),
    lit(false))
  .otherwise(lit(true)))
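For reference, the same null-or-empty check can also be built by folding over the column names, which scales better if more columns are added later (a sketch, assuming the same evaluation_raw dataframe and column names):
import org.apache.spark.sql.functions._

// columns whose emptiness decides the flag
val checkCols = Seq("intended_outcome_review", "outcome", "impact", "impact_self", "next_step")

// true if any of the columns is neither null nor an empty string
val hasData = checkCols
  .map(c => col(c).isNotNull && col(c) =!= "")
  .reduce(_ || _)

val evaluation = evaluation_raw.withColumn("immediate", hasData)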
I have the below data in a dataframe:
+----------+--------------+-------------------+---------------+
|id        |mid           |ppp                |qq             |
+----------+--------------+-------------------+---------------+
|A         |4             |[{P}]              |null           |
|B         |4             |[{P}]              |null           |
|A         |4             |null               |[{P}]          |
|A         |4             |null               |[{Q}]          |
|C         |4             |null               |[{Q}]          |
|D         |4             |null               |[{Q}]          |
|A         |4             |null               |[{R}]          |
+----------+--------------+-------------------+---------------+
I have the below code:
String[] array = {"id", "mid", "ppp", "qq"};
List<String> columnNames = Arrays.asList(array);
Column[] columns = columnNames
    .stream()
    .filter(field -> !field.equals("id") && !field.equals("mid"))
    .map(column -> flatten(when(size(collect_list(column)).equalTo(0), null)
        .otherwise(collect_list(column)))
        .as(column))
    .collect(Collectors.toList())
    .toArray(new Column[0]);
Dataset<Row> output = df
    .groupBy(functions.col("id"), functions.col("mid"))
    .agg(columns[0], Arrays.copyOfRange(columns, 1, columns.length));
The above code groups by id and mid, and collect_list then collects the elements of ppp and qq into arrays in both columns.
Output:
+----------+--------------+-------------------+----------------+
|id        |mid           |ppp                |qq              |
+----------+--------------+-------------------+----------------+
|A         |4             |[[P]]              |[[R], [P], [Q]] |
|B         |4             |null               |[[Q]]           |
|C         |4             |[[P]]              |null            |
|D         |4             |null               |[[Q]]           |
+----------+--------------+-------------------+----------------+
The code works exactly as required: if collect_list creates an empty list, I replace it with null.
Is there a way to avoid calling collect_list twice (in when and otherwise) and still achieve the same result, i.e. replace an empty list produced by collect_list with null?
Of course you can do that: just call size on the array and set it to null if it is 0, something like
df
.groupBy()
.agg(
collect_list($"mycol").as("arr_mycol")
)
// set empty arrays to null
.withColumn("arr_mycol",when(size($"arr_mycol")>0,$"arr_mycol"))
Suppose we want to track the hops made by a package from the warehouse to the customer.
We have a table which stores the data, but the whole route is in a single column, say Route.
The package starts at one of the warehouses – YYY, TTT, MMM.
The hops end when the package is delivered to the CUSTOMER.
The values in the Route column are separated by spaces.
ID Route
1 TTT A B X Y Z CUSTOMER
2 YYY E Y F G I P B X Q CUSTOMER
3 MMM R T K L CUSTOMER
Expected Output
ID START END
1 TTT A
1 A B
1 B X
.
.
.
1 Z CUSTOMER
2 YYY E
2 E Y
2 Y F
.
.
2 Q CUSTOMER
3 MMM R
.
.
3 L CUSTOMER
Is there any way to achieve this in PySpark?
Add an index to the split route using posexplode, and get the location at the next index for each starting location using lead. If you want to remove the index simply add .drop('index') at the end.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df2 = df.select(
'ID',
F.posexplode(F.split('Route', ' ')).alias('index', 'start')
).withColumn(
'end',
F.lead('start').over(Window.partitionBy('ID').orderBy('index'))
).orderBy('ID', 'index').dropna()
df2.show(99,0)
+---+-----+-----+--------+
|ID |index|start|end |
+---+-----+-----+--------+
|1 |0 |TTT |A |
|1 |1 |A |B |
|1 |2 |B |X |
|1 |3 |X |Y |
|1 |4 |Y |Z |
|1 |5 |Z |CUSTOMER|
|2 |0 |YYY |E |
|2 |1 |E |Y |
|2 |2 |Y |F |
|2 |3 |F |G |
|2 |4 |G |I |
|2 |5 |I |P |
|2 |6 |P |B |
|2 |7 |B |X |
|2 |8 |X |Q |
|2 |9 |Q |CUSTOMER|
|3 |0 |MMM |R |
|3 |1 |R |T |
|3 |2 |T |K |
|3 |3 |K |L |
|3 |4 |L |CUSTOMER|
+---+-----+-----+--------+
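An alternative that avoids the explode-plus-window combination is to build the (start, end) pairs inside the array with Spark SQL higher-order functions (Spark 2.4+) and explode the pairs once. A sketch in Scala, assuming the same df with columns ID and Route:
import org.apache.spark.sql.functions._

val hops = df
  .withColumn("route", split(col("Route"), " "))
  // pair every hop except the last one with its successor (array indexing is 0-based)
  .withColumn("pair", expr(
    "transform(slice(route, 1, size(route) - 1), (x, i) -> named_struct('start', x, 'end', route[i + 1]))"))
  .select(col("ID"), explode(col("pair")).as("hop"))
  .select(col("ID"), col("hop.start").as("START"), col("hop.end").as("END"))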
Current DF (filtered to a single userId; flag is 1 when RealLoss is negative, i.e. a win, and -1 otherwise, i.e. a loss):
display(df):
+------+----------+---------+----+
| user|Date |RealLoss |flag|
+------+----------+---------+----+
|100364|2019-02-01| -16.5| 1|
|100364|2019-02-02| 73.5| -1|
|100364|2019-02-03| 31| -1|
|100364|2019-02-09| -5.2| 1|
|100364|2019-02-10| -34.5| 1|
|100364|2019-02-13| -8.1| 1|
|100364|2019-02-18| 5.68| -1|
|100364|2019-02-19| 5.76| -1|
|100364|2019-02-20| 9.12| -1|
|100364|2019-02-26| 9.4| -1|
|100364|2019-02-27| -30.6| 1|
+------+----------+---------+----+
The desired outcome df should show the number of days since the last win ('RecencyLastWin') and since the last loss ('RecencyLastLoss'):
display(df):
+------+----------+---------+----+--------------+---------------+
| user|Date |RealLoss |flag|RecencyLastWin|RecencyLastLoss|
+------+----------+---------+----+--------------+---------------+
|100364|2019-02-01| -16.5| 1| null| null|
|100364|2019-02-02| 73.5| -1| 1| null|
|100364|2019-02-03| 31| -1| 2| 1|
|100364|2019-02-09| -5.2| 1| 8| 6|
|100364|2019-02-10| -34.5| 1| 1| 7|
|100364|2019-02-13| -8.1| 1| 1| 10|
|100364|2019-02-18| 5.68| -1| 5| 15|
|100364|2019-02-19| 5.76| -1| 6| 1|
|100364|2019-02-20| 9.12| -1| 7| 1|
|100364|2019-02-26| 9.4| -1| 13| 6|
|100364|2019-02-27| -30.6| 1| 14| 1|
+------+----------+---------+----+--------------+---------------+
My approach was the following:
from pyspark.sql.window import Window
w = Window.partitionBy("userId", 'PlayerSiteCode').orderBy("EventDate")
last_positive = check.filter('flag = "1"').withColumn('last_positive_day' , F.lag('EventDate').over(w))
last_negative = check.filter('flag = "-1"').withColumn('last_negative_day' , F.lag('EventDate').over(w))
finalcheck = check.join(last_positive.select('userId', 'PlayerSiteCode', 'EventDate', 'last_positive_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.join(last_negative.select('userId', 'PlayerSiteCode', 'EventDate', 'last_negative_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.withColumn('previous_date_played' , F.lag('EventDate').over(w))\
.withColumn('last_positive_day_count', F.datediff(F.col('EventDate'), F.col('last_positive_day')))\
.withColumn('last_negative_day_count', F.datediff(F.col('EventDate'), F.col('last_negative_day')))
Then I tried to add the following (multiple attempts) but failed to 'perfectly' return what I want:
finalcheck = finalcheck.withColumn('previous_last_pos' , F.last('last_positive_day_count', True).over(w2))\
.withColumn('previous_last_neg' , F.last('last_negative_day_count', True).over(w2))\
.withColumn('previous_last_pos_date' , F.last('last_positive_day', True).over(w2))\
.withColumn('previous_last_neg_date' , F.last('last_negative_day', True).over(w2))\
.withColumn('recency_last_positive' , F.datediff(F.col('EventDate'), F.col('previous_last_pos_date')))\
.withColumn('day_since_last_negative_v1' , F.datediff(F.col('EventDate'), F.col('previous_last_neg_date')))\
.withColumn('days_off' , F.datediff(F.col('EventDate'), F.col('previous_date_played')))\
.withColumn('recency_last_negative' , F.when((F.col('day_since_last_negative_v1').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_negative_v2' , F.when((F.col('last_negative_day').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_positive_v2' , F.when((F.col('last_positive_day').isNull()), F.col('days_off')).otherwise(F.col('recency_last_positive')))
Any suggestions/tips?
(I found a similar question but didn't figure out how to apply it to my specific case):
How to calculate days between when last condition was met?
Here is my try.
There are two parts to the calculation. The first is that while the wins or losses keep going, the date differences should be summed up. To achieve this, I mark the rows where a streak of losses or wins continues with 1, and split the rows into partition groups by cumulatively summing that marker up to the current row. Then I can calculate the cumulative days from the last loss or win once the streak ends.
The second is that when the result flips, I simply take the date difference between this match and the previous one.
Finally, I merge those results with coalesce into the final columns.
from pyspark.sql.functions import lag, col, sum, when, expr, coalesce
from pyspark.sql import Window
w1 = Window.orderBy('Date')
w2 = Window.partitionBy('groupLossCheck').orderBy('Date')
w3 = Window.partitionBy('groupWinCheck').orderBy('Date')
df2 = df.withColumn('lastFlag', lag('flag', 1).over(w1)) \
.withColumn('lastDate', lag('Date', 1).over(w1)) \
.withColumn('dateDiff', expr('datediff(Date, lastDate)')) \
.withColumn('consecutiveLoss', expr('if(flag = 1 or lastFlag = 1, 0, 1)')) \
.withColumn('consecutiveWin' , expr('if(flag = -1 or lastFlag = -1, 0, 1)')) \
.withColumn('groupLossCheck', sum('consecutiveLoss').over(w1)) \
.withColumn('groupWinCheck' , sum('consecutiveWin' ).over(w1)) \
.withColumn('daysLastLoss', sum(when((col('consecutiveLoss') == 0) & (col('groupLossCheck') != 0), col('dateDiff'))).over(w2)) \
.withColumn('daysLastwin' , sum(when((col('consecutiveWin' ) == 0) & (col('groupWinCheck' ) != 0), col('dateDiff'))).over(w3)) \
.withColumn('lastLoss', expr('if(lastFlag = -1, dateDiff, null)')) \
.withColumn('lastWin' , expr('if(lastFlag = 1, dateDiff, null)')) \
.withColumn('RecencyLastLoss', coalesce('lastLoss', 'daysLastLoss')) \
.withColumn('RecencyLastWin', coalesce('lastWin' , 'daysLastwin' )) \
.orderBy('Date')
df2.show(11, False)
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|user |Date |RealLoss|flag|lastFlag|lastDate |dateDiff|consecutiveLoss|consecutiveWin|groupLossCheck|groupWinCheck|daysLastLoss|daysLastwin|lastLoss|lastWin|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |null |0 |1 |0 |1 |null |null |null |null |null |null |
|100364|2019-02-02|73.5 |-1 |1 |2019-02-01|1 |0 |0 |0 |1 |null |1 |null |1 |null |1 |
|100364|2019-02-03|31.0 |-1 |-1 |2019-02-02|1 |1 |0 |1 |1 |null |2 |1 |null |1 |2 |
|100364|2019-02-09|-5.2 |1 |-1 |2019-02-03|6 |0 |0 |1 |1 |6 |8 |6 |null |6 |8 |
|100364|2019-02-10|-34.5 |1 |1 |2019-02-09|1 |0 |1 |1 |2 |7 |null |null |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |1 |2019-02-10|3 |0 |1 |1 |3 |10 |null |null |3 |10 |3 |
|100364|2019-02-18|5.68 |-1 |1 |2019-02-13|5 |0 |0 |1 |3 |15 |5 |null |5 |15 |5 |
|100364|2019-02-19|5.76 |-1 |-1 |2019-02-18|1 |1 |0 |2 |3 |null |6 |1 |null |1 |6 |
|100364|2019-02-20|9.12 |-1 |-1 |2019-02-19|1 |1 |0 |3 |3 |null |7 |1 |null |1 |7 |
|100364|2019-02-26|9.4 |-1 |-1 |2019-02-20|6 |1 |0 |4 |3 |null |13 |6 |null |6 |13 |
|100364|2019-02-27|-30.6 |1 |-1 |2019-02-26|1 |0 |0 |4 |3 |1 |14 |1 |null |1 |14 |
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
df2.select(*df.columns, 'RecencyLastLoss', 'RecencyLastWin').show(11, False)
+------+----------+--------+----+---------------+--------------+
|user |Date |RealLoss|flag|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |
|100364|2019-02-02|73.5 |-1 |null |1 |
|100364|2019-02-03|31.0 |-1 |1 |2 |
|100364|2019-02-09|-5.2 |1 |6 |8 |
|100364|2019-02-10|-34.5 |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |10 |3 |
|100364|2019-02-18|5.68 |-1 |15 |5 |
|100364|2019-02-19|5.76 |-1 |1 |6 |
|100364|2019-02-20|9.12 |-1 |1 |7 |
|100364|2019-02-26|9.4 |-1 |6 |13 |
|100364|2019-02-27|-30.6 |1 |1 |14 |
+------+----------+--------+----+---------------+--------------+
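For comparison, a more compact route to the same numbers is to keep only the dates of wins (or losses) in a conditional column and take the last non-null value over the preceding rows (a Scala sketch; it assumes the df from the question with columns user, Date and flag, and should reproduce the RecencyLastWin/RecencyLastLoss values above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// look only at rows strictly before the current one
val w = Window.partitionBy("user").orderBy("Date")
  .rowsBetween(Window.unboundedPreceding, -1)

val out = df
  .withColumn("lastWinDate", last(when(col("flag") === 1, col("Date")), true).over(w))    // true = ignore nulls
  .withColumn("lastLossDate", last(when(col("flag") === -1, col("Date")), true).over(w))
  .withColumn("RecencyLastWin", datediff(col("Date"), col("lastWinDate")))
  .withColumn("RecencyLastLoss", datediff(col("Date"), col("lastLossDate")))
  .drop("lastWinDate", "lastLossDate")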
I am creating a dataframe as per a given schema; after that I want to create a new dataframe by re-ordering the columns of the existing dataframe.
Is it possible to re-order the columns of a Spark dataframe?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
      (2,"Rose",1,"2010","20","M",4000),
      (3,"Williams",1,"2010","10","M",1000),
      (4,"Jones",2,"2005","10","F",2000),
      (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
    )
    val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
      "emp_dept_id","gender","salary")

    import sparkSession.sqlContext.implicits._
    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output as the following df, where the gender and salary columns are re-ordered.
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will be shown according to the ordering you pass as select() arguments.
Scala way of doing it:
import org.apache.spark.sql.functions.col

// Order the column names as you want
val columns = Array("emp_id","name","gender","salary","superior_emp_id","year_joined","emp_dept_id")
  .map(col)

// Pass it to select
df.select(columns: _*)
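If there are many columns, the new ordering can also be built programmatically instead of typing every name (a sketch, assuming the empDF from the question):
import org.apache.spark.sql.functions.col

// put these columns first, keep every other column in its original order
val front = Seq("emp_id", "name", "gender", "salary")
val rest  = empDF.columns.filterNot(front.contains)
val newDF = empDF.select((front ++ rest).map(col): _*)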