Split numerical count in Spark DataFrame column into several columns - apache-spark

Let's say I have a Spark DataFrame like this:
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Barack |2018-03-26|Action2 | 3|
|Barack |2018-03-26|Action3 | 1|
|Donald |2018-03-26|Action3 | 29|
|Hillary |2018-03-24|Action1 | 4|
|Hillary |2018-03-26|Action2 | 2|
+------------------+----------+--------------+-----+
and I'd like to have the counts for Action1/Action2/Action3 in separate columns, i.e. to convert it into another DataFrame like this:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Barack |2018-03-26| 0| 3| 0|
|Barack |2018-03-26| 0| 0| 1|
|Donald |2018-03-26| 0| 0| 29|
|Hillary |2018-03-24| 4| 0| 0|
|Hillary |2018-03-26| 0| 2| 0|
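+------------------+----------+-------------+-------------+-------------+
As I'm a newbie to Spark, my attempt to get there was quite dull and straightforward:
1. Get 3 new DataFrames by filtering on each "action".
2. Join the original DataFrame with each of the new ones, using the filtered DataFrame's "count" as the new column.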
The code I tried looked like this:
val a1 = originalDf.filter("action = 'Action1'")
val df1 = originalDf.as('o)
.join(a1,
($"o.user" === $"a1.user" && $"o.dt" === $"a1.dt"),
"left_outer")
.select($"o.user", $"o.dt", $"a1.count".as("action1_count"))
Then do the same with Action2/Action3, then join those.
However, even at this stage I've already hit several problems with this approach:
1. It doesn't work at all: it fails with an error whose cause I don't understand: org.apache.spark.sql.AnalysisException: cannot resolve 'o.user' given input columns: [user, dt, action, count, user, dt, action, count];
2. Even if it succeeded, I assume I would get nulls where I need zeros.
I feel there should be a better way to do this, like some map construct or something. But at the moment I don't feel able to construct the transform required to convert the first DataFrame into the second one.
Since I don't have a working solution at all right now, I'd be very thankful for any suggestions.
UPD: I might also get DataFrames that don't contain all 3 possible "action" values, for instance:
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Hillary |2018-03-24|Action1 | 4|
+------------------+----------+--------------+-----+
For those, I still need the resulting DataFrame to have all 3 columns:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Hillary |2018-03-24| 4| 0| 0|
+------------------+----------+-------------+-------------+-------------+

You can avoid the multiple joins by using when to pick the appropriate value for each new column.
As for your join, I don't really think it would fail with an exception like cannot resolve 'o.user'; you may want to check your code again.
val df = Seq(("Albert","2018-03-24","Action1",19),
("Albert","2018-03-25","Action1",1),
("Albert","2018-03-26","Action1",6),
("Barack","2018-03-26","Action2",3),
("Barack","2018-03-26","Action3",1),
("Donald","2018-03-26","Action3",29),
("Hillary","2018-03-24","Action1",4),
("Hillary","2018-03-26","Action2",2)).toDF("user", "dt", "action", "count")
val df2 = df.withColumn("count1", when($"action" === "Action1", $"count").otherwise(lit(0))).
withColumn("count2", when($"action" === "Action2", $"count").otherwise(lit(0))).
withColumn("count3", when($"action" === "Action3", $"count").otherwise(lit(0)))
+-------+----------+-------+-----+------+------+------+
|user |dt |action |count|count1|count2|count3|
+-------+----------+-------+-----+------+------+------+
|Albert |2018-03-24|Action1|19 |19 |0 |0 |
|Albert |2018-03-25|Action1|1 |1 |0 |0 |
|Albert |2018-03-26|Action1|6 |6 |0 |0 |
|Barack |2018-03-26|Action2|3 |0 |3 |0 |
|Barack |2018-03-26|Action3|1 |0 |0 |1 |
|Donald |2018-03-26|Action3|29 |0 |0 |29 |
|Hillary|2018-03-24|Action1|4 |4 |0 |0 |
|Hillary|2018-03-26|Action2|2 |0 |2 |0 |
+-------+----------+-------+-----+------+------+------+

Here's one approach using pivot and first, with the advantage of not having to know what the action values are:
val df = Seq(
("Albert", "2018-03-24", "Action1", 19),
("Albert", "2018-03-25", "Action1", 1),
("Albert", "2018-03-26", "Action1", 6),
("Barack", "2018-03-26", "Action2", 3),
("Barack", "2018-03-26", "Action3", 1),
("Donald", "2018-03-26", "Action3", 29),
("Hillary", "2018-03-24", "Action1", 4),
("Hillary", "2018-03-26", "Action2", 2)
).toDF("user", "dt", "action", "count")
val pivotDF = df.groupBy("user", "dt", "action").pivot("action").agg(first($"count")).
na.fill(0).
orderBy("user", "dt", "action")
// +-------+----------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|
// +-------+----------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1|
// | Donald|2018-03-26|Action3| 0| 0| 29|
// |Hillary|2018-03-24|Action1| 4| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0|
// +-------+----------+-------+-------+-------+-------+
[UPDATE]
Per the comments: if you need columns for more Action? values than those present in the pivot column, you can iterate over the missing Action? values and add each one as a zero-filled column:
val fullActionList = List("Action1", "Action2", "Action3", "Action4", "Action5")
val missingActions = fullActionList.diff(
pivotDF.select($"action").as[String].collect.toList.distinct
)
// missingActions: List[String] = List(Action4, Action5)
missingActions.foldLeft( pivotDF )( _.withColumn(_, lit(0)) ).
show
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|Action4|Action5|
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0| 0| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1| 0| 0|
// | Donald|2018-03-26|Action3| 0| 0| 29| 0| 0|
// |Hillary|2018-03-24|Action1| 4| 0| 0| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0| 0| 0|
// +-------+----------+-------+-------+-------+-------+-------+-------+
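The per-row logic behind both answers (copy the count into the column matching the row's action, zero everywhere else, with a fixed list of expected actions) can be sketched outside Spark in plain Python. This is only an illustration of the reshaping, not Spark code; the names pivot_counts and ALL_ACTIONS are made up for the sketch, and the action list is assumed to be known up front.

```python
# Hypothetical fixed list of actions that must always appear as columns,
# even when the input contains only some of them.
ALL_ACTIONS = ["Action1", "Action2", "Action3"]


def pivot_counts(rows, actions=ALL_ACTIONS):
    """Turn (user, dt, action, count) rows into dicts with one
    zero-filled '<action>_count' entry per expected action."""
    out = []
    for user, dt, action, count in rows:
        record = {"user": user, "dt": dt}
        for a in actions:
            # Same idea as when(action === a, count).otherwise(0)
            record[a.lower() + "_count"] = count if a == action else 0
        out.append(record)
    return out


rows = [
    ("Albert", "2018-03-24", "Action1", 19),
    ("Barack", "2018-03-26", "Action2", 3),
    ("Barack", "2018-03-26", "Action3", 1),
]
result = pivot_counts(rows)
for record in result:
    print(record)
```

Because the column list comes from ALL_ACTIONS rather than from the data, an input that only contains Action1 still yields all three count columns, which is the case raised in the UPD.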

Related

Spark SQL orderBy and global ordering across partitions

I want to sort the Dataframe, so that the different partitions are sorted internally (and also across each other, i.e ALL elements of one partition are gonna be either <= or >= than ALL elements of another partition). This is important because I want to use Window functions with the Window.partitionBy("partitionID"). However, there is something wrong with my understanding of how Spark works.
I run the following sample code:
val df = sc.parallelize(List((10),(8),(5),(9),(1),(6),(4),(7),(3),(2)),5)
.toDF("val")
.withColumn("partitionID",spark_partition_id)
df.show
+---+-----------+
|val|partitionID|
+---+-----------+
| 10| 0|
| 8| 0|
| 5| 1|
| 9| 1|
| 1| 2|
| 6| 2|
| 4| 3|
| 7| 3|
| 3| 4|
| 2| 4|
+---+-----------+
So far so good: 5 partitions are expected, with no ordering within or across them.
To fix that I do:
scala> val df2 = df.orderBy("val").withColumn("partitionID2",spark_partition_id)
df2: org.apache.spark.sql.DataFrame = [val: int, partitionID: int, partitionID2: int]
scala> df2.show
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 4|
| 3| 4| 4|
| 4| 3| 3|
| 5| 1| 1|
| 6| 2| 2|
| 7| 3| 3|
| 8| 0| 0|
| 9| 1| 1|
| 10| 0| 0|
+---+-----------+------------+
Now the val column is sorted, as expected, but the partitions themselves are not "sorted". My expected result is something along these lines:
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 2|
| 3| 4| 4|
| 4| 3| 4|
| 5| 1| 1|
| 6| 2| 1|
| 7| 3| 3|
| 8| 0| 3|
| 9| 1| 0|
| 10| 0| 0|
+---+-----------+------------+
or something equivalent, i.e. consecutive sorted elements belong to the same partition.
Can you point out which part of my logic is flawed and how to get the intended behavior in this example? Any help is appreciated.
I run the above using Scala and Spark 1.6, if that is relevant.
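You can range-partition the data on val with repartitionByRange (added in Spark 2.3, so not available in 1.6):
val df2 = df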
.orderBy("val")
.repartitionByRange(5, col("val"))
.withColumn("partitionID2", spark_partition_id)
df2.show(false)
// +---+-----------+------------+
// |val|partitionID|partitionID2|
// +---+-----------+------------+
// |1 |2 |0 |
// |2 |4 |0 |
// |3 |4 |1 |
// |4 |3 |1 |
// |5 |1 |2 |
// |6 |2 |2 |
// |7 |3 |3 |
// |8 |0 |3 |
// |9 |1 |4 |
// |10 |0 |4 |
// +---+-----------+------------+
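To make "range partitioning" concrete, here is a small plain-Python sketch of the idea: sort the values and cut them into contiguous, non-overlapping ranges, one per partition. It is only an illustration under simplifying assumptions; Spark chooses its range boundaries by sampling the data, so real partition sizes and boundaries can differ. The function name range_partition is made up for this sketch.

```python
def range_partition(values, num_partitions):
    """Pair each value with a partition id so that every partition
    holds a contiguous range of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    assignment = []
    for idx, v in enumerate(ordered):
        # Maps positions 0..n-1 onto ids 0..num_partitions-1 in order
        assignment.append((v, idx * num_partitions // n))
    return assignment


pairs = range_partition([10, 8, 5, 9, 1, 6, 4, 7, 3, 2], 5)
for value, partition_id in pairs:
    print(value, partition_id)
```

With ten values and five partitions this gives two consecutive values per partition, the same layout as the partitionID2 column above.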

Pyspark groupby for all column with unpivot

I have 101 columns from a pipe-delimited file and am looking to get value counts for all columns, with the data untransposed (unpivoted).
Sample data:
+----------------+------------+------------+------------+------------+------------+------------+
|rm_ky|flag_010961|flag_011622|flag_009670|flag_009708|flag_009890|flag_009893|
+----------------+------------+------------+------------+------------+------------+------------+
| 193012020044| 0| 0| 0| 0| 0| 0|
| 115012030044| 0| 0| 1| 1| 1| 1|
| 140012220044| 0| 0| 0| 0| 0| 0|
| 189012240044| 0| 0| 0| 0| 0| 0|
| 151012350044| 0| 0| 0| 0| 0| 0|
+----------------+------------+------------+------------+------------+------------+------------+
I have tried it one column at a time, like this:
df.groupBy("flag_011622").count().show()
+------------+--------+
|flag_011622| count|
+------------+--------+
| 1| 192289|
| 0|69861980|
+------------+--------+
Instead I'm looking for something like the table below. Any suggestions for handling this without looping over each column?
+----------------+------------+------------+
| flag_name| value| counts|
+----------------+------------+------------+
| flag_011622| 1| 192289|
| flag_011622| 0| 69861980|
| flag_009670| 1| 120011800|
| flag_009670| 0| 240507|
| flag_009708| 1| 119049838|
| flag_009708| 0| 1202469|
+----------------+------------+------------+
You could use the Spark SQL stack function, which takes a row count n followed by (label, value) expressions and spreads them over n output rows per input row. Passing each flag column's name as the label and the column itself as the value unpivots the wide flag columns into (flag_name, value) rows.
Using your sample as df:
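from pyspark.sql import functions as F

# stack(5, ...) unpivots only the five columns listed here; list every
# flag column (and raise the count to match) to cover all of them.
df = df.select(
    "rm_ky",
    F.expr(
        """stack(5,
        'flag_010961', flag_010961,
        'flag_009670', flag_009670,
        'flag_009708', flag_009708,
        'flag_009890', flag_009890,
        'flag_009893', flag_009893
        ) AS (flag_name, value)"""
    ),
)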
gives:
+------------+-----------+-----+
|rm_ky |flag_name |value|
+------------+-----------+-----+
|193012020044|flag_010961|0 |
|193012020044|flag_009670|0 |
|193012020044|flag_009708|0 |
|193012020044|flag_009890|0 |
|193012020044|flag_009893|0 |
|115012030044|flag_010961|0 |
|115012030044|flag_009670|0 |
|115012030044|flag_009708|1 |
|115012030044|flag_009890|1 |
|115012030044|flag_009893|1 |
|140012220044|flag_010961|0 |
|140012220044|flag_009670|0 |
|140012220044|flag_009708|0 |
|140012220044|flag_009890|0 |
|140012220044|flag_009893|0 |
|189012240044|flag_010961|0 |
|189012240044|flag_009670|0 |
|189012240044|flag_009708|0 |
|189012240044|flag_009890|0 |
|189012240044|flag_009893|0 |
|151012350044|flag_010961|0 |
|151012350044|flag_009670|0 |
|151012350044|flag_009708|0 |
|151012350044|flag_009890|0 |
|151012350044|flag_009893|0 |
+------------+-----------+-----+
Which you can then group and order:
df = (
df.groupBy("flag_name", "value")
.agg(F.count("*").alias("counts"))
.orderBy("flag_name", "value")
)
to get:
+-----------+-----+------+
|flag_name |value|counts|
+-----------+-----+------+
|flag_009670|0 |5 |
|flag_009708|0 |4 |
|flag_009708|1 |1 |
|flag_009890|0 |4 |
|flag_009890|1 |1 |
|flag_009893|0 |4 |
|flag_009893|1 |1 |
|flag_010961|0 |5 |
+-----------+-----+------+
Example:
data = [ ("193012020044",0, 0, 0, 0, 0, 1)
,("115012030044",0, 0, 1, 1, 1, 1)
,("140012220044",0, 0, 0, 0, 0, 0)
,("189012240044",0, 1, 0, 0, 0, 0)
,("151012350044",0, 0, 0, 1, 1, 0)]
columns= ["rm_ky","flag_010961","flag_011622","flag_009670","flag_009708","flag_009890","flag_009893"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
| rm_ky|flag_010961|flag_011622|flag_009670|flag_009708|flag_009890|flag_009893|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|193012020044| 0| 0| 0| 0| 0| 1|
|115012030044| 0| 0| 1| 1| 1| 1|
|140012220044| 0| 0| 0| 0| 0| 0|
|189012240044| 0| 1| 0| 0| 0| 0|
|151012350044| 0| 0| 0| 1| 1| 0|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
Creating an expression to unpivot:
x = ""
cnt = 0
for col in df.columns:
    if col != 'rm_ky':
        cnt += 1
        # Append "'name', name, " for each flag column
        x += "'" + str(col) + "', " + str(col) + ", "
x = x[:-2]  # drop the trailing ", "
xpr = """stack({}, {}) as (Type,Value)""".format(cnt, x)
print(xpr)
>> stack(6, 'flag_010961', flag_010961, 'flag_011622', flag_011622, 'flag_009670', flag_009670, 'flag_009708', flag_009708, 'flag_009890', flag_009890, 'flag_009893', flag_009893) as (Type,Value)
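The same expression can be built a little more compactly with str.join, which avoids the counter and the trailing-comma trimming. This is a sketch that needs no Spark session; build_stack_expr and its parameters are names made up for the illustration, and the column list is copied from the example above.

```python
def build_stack_expr(columns, id_col="rm_ky", names=("Type", "Value")):
    """Build a Spark SQL stack(...) expression that unpivots every
    column except id_col into (label, value) rows."""
    flag_cols = [c for c in columns if c != id_col]
    # One "'name', name" pair per flag column
    pairs = ", ".join("'{0}', {0}".format(c) for c in flag_cols)
    return "stack({}, {}) as ({},{})".format(
        len(flag_cols), pairs, names[0], names[1]
    )


columns = ["rm_ky", "flag_010961", "flag_011622", "flag_009670",
           "flag_009708", "flag_009890", "flag_009893"]
xpr = build_stack_expr(columns)
print(xpr)
```

In a real job the list would come from df.columns, so the expression keeps up with all 101 columns without being edited by hand.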
Then, using expr and pivot:
from pyspark.sql import functions as F
df\
    .drop('rm_ky')\
    .select(F.lit('dummy'), F.expr(xpr))\
    .drop('dummy')\
    .groupBy('Type')\
    .pivot('Value')\
    .agg(F.count('Value'))\
    .fillna(0)\
    .show()
+-----------+---+---+
| Type| 0| 1|
+-----------+---+---+
|flag_009890| 3| 2|
|flag_009893| 3| 2|
|flag_011622| 4| 1|
|flag_010961| 5| 0|
|flag_009708| 3| 2|
|flag_009670| 4| 1|
+-----------+---+---+
I think this is what you are looking for:
>>> df2.show()
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
| rm_ky|flag_010961|flag_011622|flag_009670|flag_009708|flag_009890|flag_009893|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|193012020044| 0| 0| 0| 0| 0| 0|
|115012030044| 0| 0| 1| 1| 1| 1|
|140012220044| 0| 0| 0| 0| 0| 0|
|189012240044| 0| 0| 0| 0| 0| 0|
|151012350044| 0| 0| 0| 0| 0| 0|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
>>> unpivotExpr = "stack(6, 'flag_010961',flag_010961,'flag_011622',flag_011622,'flag_009670',flag_009670, 'flag_009708',flag_009708, 'flag_009890',flag_009890, 'flag_009893',flag_009893) as (flag,flag_val)"
>>> unPivotDF = df2.select("rm_ky", expr(unpivotExpr))
>>> unPivotDF.show()
+------------+-----------+--------+
| rm_ky| flag|flag_val|
+------------+-----------+--------+
|193012020044|flag_010961| 0|
|193012020044|flag_011622| 0|
|193012020044|flag_009670| 0|
|193012020044|flag_009708| 0|
|193012020044|flag_009890| 0|
|193012020044|flag_009893| 0|
|115012030044|flag_010961| 0|
|115012030044|flag_011622| 0|
|115012030044|flag_009670| 1|
|115012030044|flag_009708| 1|
|115012030044|flag_009890| 1|
|115012030044|flag_009893| 1|
|140012220044|flag_010961| 0|
|140012220044|flag_011622| 0|
|140012220044|flag_009670| 0|
|140012220044|flag_009708| 0|
|140012220044|flag_009890| 0|
|140012220044|flag_009893| 0|
|189012240044|flag_010961| 0|
|189012240044|flag_011622| 0|
+------------+-----------+--------+
only showing top 20 rows
>>> unPivotDF.groupBy("flag","flag_val").count().show()
+-----------+--------+-----+
| flag|flag_val|count|
+-----------+--------+-----+
|flag_009670| 0| 4|
|flag_009708| 0| 4|
|flag_009893| 0| 4|
|flag_009890| 0| 4|
|flag_009670| 1| 1|
|flag_009893| 1| 1|
|flag_011622| 0| 5|
|flag_010961| 0| 5|
|flag_009890| 1| 1|
|flag_009708| 1| 1|
+-----------+--------+-----+
>>> unPivotDF.groupBy("rm_ky","flag","flag_val").count().show()
+------------+-----------+--------+-----+
| rm_ky| flag|flag_val|count|
+------------+-----------+--------+-----+
|151012350044|flag_009708| 0| 1|
|115012030044|flag_010961| 0| 1|
|140012220044|flag_009670| 0| 1|
|189012240044|flag_010961| 0| 1|
|151012350044|flag_009670| 0| 1|
|115012030044|flag_009890| 1| 1|
|151012350044|flag_009890| 0| 1|
|189012240044|flag_009890| 0| 1|
|193012020044|flag_011622| 0| 1|
|193012020044|flag_009670| 0| 1|
|115012030044|flag_009670| 1| 1|
|140012220044|flag_011622| 0| 1|
|151012350044|flag_009893| 0| 1|
|140012220044|flag_009893| 0| 1|
|189012240044|flag_011622| 0| 1|
|189012240044|flag_009893| 0| 1|
|115012030044|flag_009893| 1| 1|
|140012220044|flag_009708| 0| 1|
|189012240044|flag_009708| 0| 1|
|193012020044|flag_010961| 0| 1|
+------------+-----------+--------+-----+

Get all possible combinations recursively in an RDD in pyspark

I wrote this algorithm, but with larger numbers it either doesn't work or is very slow. It will run on a big-data cluster (Cloudera), so I think I have to move the function into PySpark. Any tips on how to improve it, please?
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
You can use crossJoin with DataFrame:
Here we have a simple example trying to compute the cross-product of three arrays,
i.e. [1,0], [2,1,0], [3,2,1,0]. Their cross-product should have 2*3*4 = 24 elements.
The code below shows how to achieve this.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df1 = spark.createDataFrame([(1,),(0,)], ['v1'])
df2 = spark.createDataFrame([(2,), (1,),(0,)], ['v2'])
df3 = spark.createDataFrame([(3,), (2,),(1,),(0,)], ['v3'])
df1.show()
df2.show()
df3.show()
+---+
| v1|
+---+
| 1|
| 0|
+---+
+---+
| v2|
+---+
| 2|
| 1|
| 0|
+---+
+---+
| v3|
+---+
| 3|
| 2|
| 1|
| 0|
+---+
df = df1.crossJoin(df2).crossJoin(df3)
print('----------- Total rows: ', df.count())
df.show(30)
----------- Total rows: 24
+---+---+---+
| v1| v2| v3|
+---+---+---+
| 1| 2| 3|
| 1| 2| 2|
| 1| 2| 1|
| 1| 2| 0|
| 1| 1| 3|
| 1| 1| 2|
| 1| 1| 1|
| 1| 1| 0|
| 1| 0| 3|
| 1| 0| 2|
| 1| 0| 1|
| 1| 0| 0|
| 0| 2| 3|
| 0| 2| 2|
| 0| 2| 1|
| 0| 2| 0|
| 0| 1| 3|
| 0| 1| 2|
| 0| 1| 1|
| 0| 1| 0|
| 0| 0| 3|
| 0| 0| 2|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Your computation is pretty big:
(10953+1)*(10423+1)*(10053+1) = 1148010922784, about 1.15 trillion rows. I would suggest increasing the numbers gradually; Spark is not as fast as you might think when table joins are involved.
Also, try using broadcast on all your initial DataFrames, i.e. df1, df2, df3, and see if it helps.
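The size estimate can be checked before launching anything, and the combinations can be consumed lazily instead of being collected into a list as the original reducer does. This is a minimal plain-Python sketch (math.prod needs Python 3.8+); the helper name combinations is made up here, and it only shows why list(product(...)) cannot fit in memory, not how to distribute the work.

```python
import itertools
import math

number_list = [10953, 10423, 10053]

# Each number n contributes the n + 1 values n, n-1, ..., 0
total_rows = math.prod(n + 1 for n in number_list)
print(total_rows)


def combinations(nums):
    """Lazily yield the cross-product without materializing it."""
    return itertools.product(*(range(n, -1, -1) for n in nums))


# Peek at the first few tuples only
first_three = list(itertools.islice(combinations(number_list), 3))
print(first_three)
```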

Convert Dataframe to multiple 2D arrays

I have this dataset:
+----+-----+-------+-----+
|code|code2|machine|value|
+----+-----+-------+-----+
| 1| 2| A| 42|
| 2| 1| A| 11|
| 1| 4| A| 55|
| 1| 1| B| 2|
| 3| 3| B| 34|
| 3| 2| B| 111|
+----+-----+-------+-----+
For each machine I want a kind of matrix like the following:
code and code2 are the two axes, and the cell at their intersection should hold the value.
Machine A
+----+----+----+----+----+
| A| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0| 11| 0| 0|
| 2| 42| 0| 0| 0|
| 3| 0| 0| 0| 0|
| 4| 55| 0| 0| 0|
+----+----+----+----+----+
Machine B
+----+----+----+----+----+
| B| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 2| 0| 0| 0|
| 2| 0| 0| 111| 0|
| 3| 0| 0| 34| 0|
| 4| 0| 0| 0| 0|
+----+----+----+----+----+
I have multiple machines (an unknown number) and the codes can only be 0-255.
So my problem is how to build that matrix...
My first naive idea was to make a hashmap with the machine name as key and a 256x256 2D array as value. But I don't think it would be efficient, and I also don't know how to achieve that.
Or should I have a dataset for each machine?
If someone has an idea, I would like to hear it.
Btw, I'm using Scala.
For maximum coding flexibility, you could switch to the RDD API. An example solution gives you an RDD that maps each machine to its matrix, represented as a Scala two-dimensional array. Note that Array.ofDim[Int](n, m) creates a two-dimensional array of size n*m filled with zeros.
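df.rdd
  .map(x => x.getAs[String]("machine") -> (x.getAs[Int]("code"), x.getAs[Int]("code2"), x.getAs[Int]("value")))
  .groupByKey
  .mapValues { rows =>
    // 256x256 matrix of zeros, indexed as result(code)(code2);
    // swap i and j if code2 should index the first dimension instead
    val result = Array.ofDim[Int](256, 256)
    rows.foreach { case (i, j, value) => result(i)(j) = value }
    result
  }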
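The grouping step itself is easy to see without Spark. The following plain-Python sketch builds one zero-filled matrix per machine from the sample rows in the question; build_matrices is a name made up for the illustration, and like the Scala code above it indexes each matrix as [code][code2].

```python
def build_matrices(rows, size=256):
    """Map each machine to a size x size matrix of zeros, then fill
    matrix[code][code2] with the row's value."""
    matrices = {}
    for code, code2, machine, value in rows:
        if machine not in matrices:
            # Independent inner lists, so rows do not alias each other
            matrices[machine] = [[0] * size for _ in range(size)]
        matrices[machine][code][code2] = value
    return matrices


rows = [
    (1, 2, "A", 42),
    (2, 1, "A", 11),
    (1, 4, "A", 55),
    (1, 1, "B", 2),
    (3, 3, "B", 34),
    (3, 2, "B", 111),
]
matrices = build_matrices(rows)
print(sorted(matrices))      # machine names
print(matrices["A"][1][2])   # value stored for code=1, code2=2
```

A 256x256 matrix of Ints is small, so one dense array per machine is reasonable as long as the number of machines stays moderate.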

SPARK : Set a column value based on multiple row conditions

I have a dataframe of the below format:
+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
| 30| 0| 0| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 44| 0| 0| 0| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 29| 0| 0| 0| 0| 0|
+----+---+-----+------+-----+------+
Now, based on the values of columns AGEF and SEX, I need to assign 1 to the corresponding column. The column names are self-explanatory: F0_34 is female aged 0 to 34, and similarly for the other columns.
Expected output is
+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
| 30| 0| 1| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 44| 0| 0| 1| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 29| 0| 1| 0| 0| 0|
+----+---+-----+------+-----+------+
Thanks in advance!
Typically the most efficient approach is to operate directly on SQL expressions. For example:
def categorize(ageRanges: Seq[(Int, Int)], sexValues: Seq[(Int, String)]) = for {
  (ageL, ageH) <- ageRanges
  (sexV, sexL) <- sexValues
  // Compare SEX with the numeric code sexV; sexL is only the label used
  // in the column name. cast("int") turns the boolean into 1/0.
} yield ($"SEX" === sexV && $"AGEF".between(ageL, ageH)).cast("int").alias(
  s"$sexL-$ageL-$ageH"
)
df.select(
  $"*" +: categorize(Seq((0, 34), (35, 44)), Seq((0, "F"), (1, "M"))): _*
)
The simplest way is to make a UDF that takes 5 parameters (e.g. actual_age, actual_sex, target_sex, target_min_age, target_max_age) and returns either 1 or 0. Something like this:
val ageRanger = udf[Int,Int,Int,Int,Int,Int]((age: Int, sex: Int, targetSex: Int, targetMinAge: Int, targetMaxAge: Int) => {
if (age >= targetMinAge && age <= targetMaxAge && sex == targetSex) 1 else 0
})
Then if you had this DataFrame:
val df = Seq((30,0),(94,1),(94,0),(44,0)).toDF("AGEF", "SEX")
// +----+---+
// |AGEF|SEX|
// +----+---+
// | 30| 0|
// | 94| 1|
// | 94| 0|
// | 44| 0|
// +----+---+
df.withColumn("F0_34", ageRanger($"AGEF", $"SEX", lit(0), lit(0), lit(34)))
.withColumn("F35_44", ageRanger($"AGEF", $"SEX", lit(0), lit(35), lit(44)))
.show
// +----+---+-----+------+
// |AGEF|SEX|F0_34|F35_44|
// +----+---+-----+------+
// | 30| 0| 1| 0|
// | 94| 1| 0| 0|
// | 94| 0| 0| 0|
// | 44| 0| 0| 1|
// +----+---+-----+------+
Note that you have to pass values into the UDF as Columns, so I use lit(...) to wrap my Int values for the hard-coded values. There could be a slicker way to do that, but it works fine this way.
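Both answers compute the same indicator: 1 when the row's sex code and age fall in the column's target group, 0 otherwise. A plain-Python sketch of that rule for a single row follows; indicator_columns is a made-up name, and the mapping SEX 0 = "F", 1 = "M" is an assumption taken from the expected output and the lit(0) arguments used for the F columns above.

```python
def indicator_columns(age, sex,
                      age_ranges=((0, 34), (35, 44)),
                      sexes=((0, "F"), (1, "M"))):
    """Return {column_name: 0 or 1} for one row, with names such as
    F0_34 built from the sex label and the inclusive age range."""
    return {
        "{}{}_{}".format(label, lo, hi): int(sex == code and lo <= age <= hi)
        for code, label in sexes
        for lo, hi in age_ranges
    }


for age, sex in [(30, 0), (94, 1), (44, 0)]:
    print(age, sex, indicator_columns(age, sex))
```

Generating the names from the ranges keeps the column list and the conditions in one place, which is the same design choice as the categorize helper in the first answer.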
