I have a dataframe like this
data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])),
(("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
+---+----------------------------+
|ID |MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May] |
|ID3|[October, June] |
+---+----------------------------+
I want to compare every row against a default list: for each month in the default list, assign 1 if it is present in the row's MonthList, else 0
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
Hence my expected output is this
+---+----------------------------+------------------+
|ID |MonthList |Binary_MonthList |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May] |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June] |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+
I am able to do this in python, but don't know how to do this in pyspark
You can try a udf like this:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
def_month_list_func = udf(lambda x: [1 if i in x else 0 for i in default_month_list], ArrayType(IntegerType()))
df = df.withColumn("Binary_MonthList", def_month_list_func(col("MonthList")))
df.show()
# output
+---+--------------------+------------------+
| ID| MonthList| Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3| [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+
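The lambda passed to the udf is ordinary Python, so its logic can be checked without a Spark session. A minimal sketch, assuming the same default_month_list as above (the helper name encode_months and the rows dict are made up for illustration):

```python
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def encode_months(months):
    """Return 1/0 flags in the order of default_month_list (hypothetical helper)."""
    present = set(months)  # set lookup avoids rescanning the row's list for every month
    return [1 if m in present else 0 for m in default_month_list]

# Same rows as the question's dataframe, keyed by ID
rows = {
    "ID1": ['October', 'September', 'August'],
    "ID2": ['August', 'June', 'May'],
    "ID3": ['October', 'June'],
}
encoded = {row_id: encode_months(months) for row_id, months in rows.items()}
print(encoded["ID2"])  # [0, 0, 1, 0, 1, 1]
```

The flags always follow the order of default_month_list, not the order of the months inside each row.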
How about using array_contains():
from pyspark.sql.functions import array, array_contains
df.withColumn('Binary_MonthList', array([array_contains('MonthList', c).astype('int') for c in default_month_list])).show()
+---+--------------------+------------------+
| ID| MonthList| Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3| [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+
pissall's answer is completely fine. I'm just posting a more general solution that works without a udf and doesn't require you to know the possible values in advance.
A CountVectorizer does exactly what you want. This algorithm adds all distinct values to its vocabulary as long as they fulfil certain criteria (e.g. minimum or maximum occurrence). You can apply the fitted model to a dataframe and it will return a sparse vector column (which can be converted to a dense vector column) that encodes which vocabulary items are present in the given input column.
from pyspark.ml.feature import CountVectorizer
data = [(("ID1", ['October', 'September', 'August']))
, (("ID2", ['August', 'June', 'May', 'August']))
, (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
# binary=True only records whether a vocabulary item is present, not how often it occurs
# vocabSize sets the maximum size of the vocabulary
# minDF=1.0 sets the number of rows a value has to appear in to be added to the vocabulary (1.0 means one row is enough)
cv = CountVectorizer(inputCol="MonthList", outputCol="Binary_MonthList", vocabSize=12, minDF=1.0, binary=True)
cvModel = cv.fit(df)
df = cvModel.transform(df)
df.show(truncate=False)
cvModel.vocabulary
Output:
+---+----------------------------+
|ID | MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2| [August, June, May, August]|
|ID3| [October, June] |
+---+----------------------------+
+---+----------------------------+-------------------------+
|ID | MonthList | Binary_MonthList |
+---+----------------------------+-------------------------+
|ID1|[October, September, August]|(5,[1,2,3],[1.0,1.0,1.0])|
|ID2|[August, June, May, August] |(5,[0,1,4],[1.0,1.0,1.0])|
|ID3|[October, June] | (5,[0,2],[1.0,1.0]) |
+---+----------------------------+-------------------------+
['June', 'August', 'October', 'September', 'May']
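To read the sparse output, note that each vector is printed as (size, indices, values), and the indices refer to positions in cvModel.vocabulary, not in default_month_list. A small plain-Python sketch of that lookup, using the vocabulary printed above (the helpers to_dense and present_terms are illustrative names, not part of pyspark):

```python
# Vocabulary as printed by cvModel.vocabulary in the output above
vocabulary = ['June', 'August', 'October', 'September', 'May']

def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def present_terms(indices):
    """Map sparse indices back to the vocabulary terms they stand for."""
    return [vocabulary[i] for i in indices]

# Row ID3 from the output above: (5,[0,2],[1.0,1.0])
dense_id3 = to_dense(5, [0, 2], [1.0, 1.0])
terms_id3 = present_terms([0, 2])
print(dense_id3)  # [1.0, 0.0, 1.0, 0.0, 0.0]
print(terms_id3)  # ['June', 'October']
```

Because the vocabulary is learned from the data, a month that never occurs (July here) gets no slot at all, so the vector length is 5 rather than 6.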
My Pyspark data frame looks like this. I have to remove group by function from pyspark code to increase the performance of the code. I have to perform operations on 100k data.
[Initial Data]
To create Dataframe
df = spark.createDataFrame([
(0, ['-9.53', '-9.35', '0.18']),
(1, ['-7.77', '-7.61', '0.16']),
(2, ['-5.80', '-5.71', '0.10']),
(0, ['1', '2', '3']),
(1, ['4', '5', '6']),
(2, ['8', '98', '32'])
], ["id", "Array"])
And the expected output is produced using this code.
import pyspark.sql.functions as f
df.groupBy('id').agg(f.collect_list(f.col("Array")).alias('Array')).\
select("id",f.flatten("Array")).show()
The code above gives me the output below. I need to get the same result without the groupBy.
+---+-------------------------------+
|id |flatten(Array) |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
If you don't want to do group by, you can use window functions:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df2 = df.select(
"id",
f.flatten(f.collect_list(f.col("Array")).over(Window.partitionBy("id"))).alias("Array")
).distinct()
df2.show(truncate=False)
+---+-------------------------------+
|id |Array |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
You can also try
df.select(
'id',
f.explode('Array').alias('Array')
).groupBy('id').agg(f.collect_list('Array').alias('Array'))
Although I'm not sure if it'll be faster.
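For reference, what all of these variants compute is "concatenate the arrays that share an id". A plain-Python sketch of that result on the sample rows, handy for checking the Spark output (the names rows and flattened are illustrative):

```python
# Sample rows from the question: (id, Array)
rows = [
    (0, ['-9.53', '-9.35', '0.18']),
    (1, ['-7.77', '-7.61', '0.16']),
    (2, ['-5.80', '-5.71', '0.10']),
    (0, ['1', '2', '3']),
    (1, ['4', '5', '6']),
    (2, ['8', '98', '32']),
]

flattened = {}
for key, arr in rows:
    # extend() concatenates in encounter order, like flatten(collect_list(...))
    flattened.setdefault(key, []).extend(arr)

print(flattened[0])  # ['-9.53', '-9.35', '0.18', '1', '2', '3']
```

Note that in Spark the order of elements from collect_list is not guaranteed after a shuffle, so the order inside each array may differ from this sketch on real data.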
Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?
Here is the code to create my dataframe:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
data1 = [
('George', datetime(2010, 3, 24, 3, 19, 58), 13),
('George', datetime(2020, 9, 24, 3, 19, 6), 8),
('George', datetime(2009, 12, 12, 17, 21, 30), 5),
('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]
df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)
Here is the dataframe:
+-------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13 |
|George |2020-09-24 07:19:06|8 |
|George |2009-12-12 22:21:30|5 |
|Micheal|2010-11-22 18:29:40|12 |
|Maggie |2010-02-08 08:31:23|8 |
|Ravi |2009-01-01 09:19:47|2 |
|Xien |2010-03-02 09:33:51|3 |
+-------+-------------------+-------------+
Here is a working example to replace one string:
from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()
Here is the output:
+------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+------------+-------------------+-------------+
| George|2010-03-24 07:19:58| 13|
| George|2020-09-24 07:19:06| 8|
| George|2009-12-12 22:21:30| 5|
| Micheal|2010-11-22 18:29:40| 12|
| Maggie|2010-02-08 08:31:23| 8|
|Ravi_renamed|2009-01-01 09:19:47| 2|
| Xien|2010-03-02 09:33:51| 3|
+------------+-------------------+-------------+
In pandas I could replace multiple strings in one line of code with a lambda expression:
df1['name'].apply(lambda x: x.replace('George', 'George_renamed1').replace('Ravi', 'Ravi_renamed2'))
I am not sure if this can be done in pyspark with regexp_replace. Perhaps another alternative? When I read about using lambda expressions in pyspark it seems I have to create udf functions (which seem to get a little long). But I am curious if I can simply run some type of regex expression on multiple strings like above in one line of code.
This is what you're looking for:
Using when() (most readable)
from pyspark.sql.functions import when, col
df1.withColumn('name',
    when(col('name') == 'George', 'George_renamed1')
    .when(col('name') == 'Ravi', 'Ravi_renamed2')
    .otherwise(col('name'))
)
With a mapping expr (less explicit, but handy if there are many values to replace)
import pyspark.sql.functions as F
from pyspark.sql.functions import expr, create_map, lit, coalesce
df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))
Or, if you already have the replacements in a list, e.g.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
# str()[1:-1] converts the list to a string and strips the [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
The same as above, but using only imported pyspark functions
mapping_expr = create_map([lit(x) for x in name_changes])
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))
Result
df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
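The coalesce(map(...)[name], name) pattern behaves like a dictionary lookup with the original value as the fallback. A plain-Python sketch of the same idea, assuming the flat name_changes list from above (the names mapping and rename, and the extra value 'Georgette', are only for illustration):

```python
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']

# Pair up the flat list: even positions are keys, odd positions are replacements
mapping = dict(zip(name_changes[0::2], name_changes[1::2]))

def rename(name):
    """Return the replacement if the whole value matches a key, else the value unchanged."""
    return mapping.get(name, name)

names = ['George', 'Micheal', 'Ravi', 'Georgette']
renamed = [rename(n) for n in names]
print(renamed)  # ['George_renamed1', 'Micheal', 'Ravi_renamed2', 'Georgette']

# Substring replacement, as in the pandas one-liner, behaves differently:
print('Georgette'.replace('George', 'George_renamed1'))  # George_renamed1tte
```

One difference worth keeping in mind: when() and the map lookup match whole values, whereas regexp_replace and str.replace also rewrite matches inside longer strings.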
I want to calculate the time spent per SeqID for each user. I have a dataframe like this.
However, the time is split between two actions for every user, Action_A and Action_B.
The total time per user, per SeqID would be the sum across all such pairs.
For first user, it is 5 + 3 [(2019-12-10 10:00:00 - 2019-12-10 10:05:00) + (2019-12-10 10:20:00 - 2019-12-10 10:23:00)]
So first user has ideally spent 8 mins for SeqID 1 (and not 23 mins).
Similarly user 2 has spent 1 + 5 = 6 mins
How can I calculate this using pyspark?
data = [(("ID1", 15, "2019-12-10 10:00:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:05:00", "Action_B")),
(("ID1", 15, "2019-12-10 10:20:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:23:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:10:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:11:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:30:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:35:00", "Action_B"))]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.show()
+---+-----+-------------------+--------+
| ID|SeqID| Timestamp| Action|
+---+-----+-------------------+--------+
|ID1| 15|2019-12-10 10:00:00|Action_A|
|ID1| 15|2019-12-10 10:05:00|Action_B|
|ID1| 15|2019-12-10 10:20:00|Action_A|
|ID1| 15|2019-12-10 10:23:00|Action_B|
|ID2| 23|2019-12-10 11:10:00|Action_A|
|ID2| 23|2019-12-10 11:11:00|Action_B|
|ID2| 23|2019-12-10 11:30:00|Action_A|
|ID2| 23|2019-12-10 11:35:00|Action_B|
+---+-----+-------------------+--------+
Once I have the data for each pair, I can sum across the group (ID, SeqID)
Expected output (could be seconds also)
+---+-----+--------+
| ID|SeqID|Dur_Mins|
+---+-----+--------+
|ID1| 15| 8|
|ID2| 23| 6|
+---+-----+--------+
Here is a possible solution using Higher-Order Functions (Spark >=2.4):
from pyspark.sql.functions import array_sort, collect_list, col, expr
transform_expr = "transform(ts_array, (x,i) -> (unix_timestamp(ts_array[i+1]) - unix_timestamp(x))/60 * ((i+1)%2))"
df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
.withColumn("transformed_ts_array", expr(transform_expr)) \
.withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
.drop("transformed_ts_array", "ts_array") \
.show(truncate=False)
Steps:
Collect all timestamps into an array for each (ID, SeqID) group and sort them in ascending order.
Apply a transform to the array with a lambda function (x, i) => Double, where x is the current element and i is its index. For each timestamp in the array, we calculate the difference to the next timestamp, then multiply by (i+1)%2 so that only the differences within each pair are kept (first with second, third with fourth, ...), as there are always 2 actions.
Finally, aggregate the transformed array to sum all its elements.
Output:
+---+-----+--------+
|ID |SeqID|Dur_Mins|
+---+-----+--------+
|ID1|15 |8.0 |
|ID2|23 |6.0 |
+---+-----+--------+
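The transform and aggregate steps above can be mirrored in plain Python to check the arithmetic. This is only a sketch of the pairing logic, not the Spark code itself: instead of multiplying every difference by (i+1)%2, it steps through the sorted timestamps two at a time, which keeps the same differences (the name duration_minutes is made up for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def duration_minutes(timestamps):
    """Sum the gaps between the 1st and 2nd, 3rd and 4th, ... sorted timestamps, in minutes."""
    ts = sorted(datetime.strptime(t, FMT) for t in timestamps)
    total = 0.0
    # Step by 2 so only within-pair differences count; a trailing unpaired timestamp is ignored
    for i in range(0, len(ts) - 1, 2):
        total += (ts[i + 1] - ts[i]).total_seconds() / 60
    return total

id1 = ["2019-12-10 10:00:00", "2019-12-10 10:05:00",
       "2019-12-10 10:20:00", "2019-12-10 10:23:00"]
id2 = ["2019-12-10 11:10:00", "2019-12-10 11:11:00",
       "2019-12-10 11:30:00", "2019-12-10 11:35:00"]
print(duration_minutes(id1), duration_minutes(id2))  # 8.0 6.0
```

Like the Spark version, this relies on the actions strictly alternating A, B, A, B within each group once sorted by time.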
Another possible (though perhaps more complicated) way to do it uses flatMapValues on the rdd
Using your data variable
from pyspark.sql import functions as func
df = spark.createDataFrame(data, ["id", "seq_id", "ts", "action"]). \
    withColumn('ts', func.col('ts').cast('timestamp'))
# func to calculate the duration | applied on each row
def getDur(groupedrows):
    """
    Append to each row the seconds elapsed since the preceding Action_A
    (0 for the Action_A rows themselves).
    """
    res = []
    for row in groupedrows:
        if row.action == 'Action_A':
            frst_ts = row.ts
            dur = 0
        elif row.action == 'Action_B':
            dur = (row.ts - frst_ts).total_seconds()
        res.append([val for val in row] + [float(dur)])
    return res
# run the rules on the base df | row by row
# grouped on ID, SeqID - sorted on timestamp
dur_rdd = df.rdd. \
    groupBy(lambda k: (k.id, k.seq_id)). \
    flatMapValues(lambda r: getDur(sorted(r, key=lambda ok: ok.ts))). \
    values()
# specify final schema
dur_schema = df.schema. \
    add('dur', 'float')
# convert to DataFrame
dur_sdf = spark.createDataFrame(dur_rdd, dur_schema)
dur_sdf.orderBy('id', 'seq_id', 'ts').show()
+---+------+-------------------+--------+-----+
| id|seq_id| ts| action| dur|
+---+------+-------------------+--------+-----+
|ID1| 15|2019-12-10 10:00:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:05:00|Action_B|300.0|
|ID1| 15|2019-12-10 10:20:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:23:00|Action_B|180.0|
|ID2| 23|2019-12-10 11:10:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:11:00|Action_B| 60.0|
|ID2| 23|2019-12-10 11:30:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:35:00|Action_B|300.0|
+---+------+-------------------+--------+-----+
# Your required data
dur_sdf.groupBy('id', 'seq_id'). \
agg((func.sum('dur') / func.lit(60)).alias('dur_mins')). \
show()
+---+------+--------+
| id|seq_id|dur_mins|
+---+------+--------+
|ID1| 15| 8.0|
|ID2| 23| 6.0|
+---+------+--------+
This fits the data you've described, but check whether it fits all your cases.
Another possible solution using Window Functions
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.master("local[3]").appName("TestApp").enableHiveSupport().getOrCreate()
data = [(("ID1", 15, "2019-12-10 10:00:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:05:00", "Action_B")),
(("ID1", 15, "2019-12-10 10:20:00", "Action_A")),
(("ID1", 15, "2019-12-10 10:23:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:10:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:11:00", "Action_B")),
(("ID2", 23, "2019-12-10 11:30:00", "Action_A")),
(("ID2", 23, "2019-12-10 11:35:00", "Action_B"))]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.createOrReplaceTempView("tmpTbl")
df = spark.sql("select *, lead(Timestamp,1) over (partition by ID,SeqID order by Timestamp) Next_Timestamp from tmpTbl")
updated_df = df.filter("Action != 'Action_B'")
final_df = updated_df.withColumn("diff", (F.unix_timestamp('Next_Timestamp') - F.unix_timestamp('Timestamp'))/F.lit(60))
final_df.groupBy("ID","SeqID").agg(F.sum(F.col("diff")).alias("Duration")).show()
Output
I have the following sample dataframe
fruit_list = ['apple', 'apple', 'orange', 'apple']
qty_list = [16, 2, 3, 1]
spark_df = spark.createDataFrame([(101, 'Mark', fruit_list, qty_list)], ['ID', 'name', 'fruit', 'qty'])
and I would like to create another column which contains a result similar to what I would achieve with a pandas groupby('fruit').sum()
        qty
fruits
apple    19
orange    3
The above result could be stored in the new column in any form (either a string, dictionary, list of tuples...).
I've tried an approach similar to the following one which does not work
sum_cols = udf(lambda x: pd.DataFrame({'fruits': x[0], 'qty': x[1]}).groupby('fruits').sum())
spark_df.withColumn('Result', sum_cols(F.struct('fruit', 'qty'))).show()
One example of result dataframe could be
+---+----+--------------------+-------------+-------------------------+
| ID|name| fruit| qty| Result|
+---+----+--------------------+-------------+-------------------------+
|101|Mark|[apple, apple, or...|[16, 2, 3, 1]|[(apple,19), (orange,3)] |
+---+----+--------------------+-------------+-------------------------+
Do you have any suggestion on how I could achieve that?
Thanks
Edit: running on Spark 2.4.3
As @pault mentioned, as of Spark 2.4+ you can use Spark SQL built-in functions to handle your task. Here is one way with array_distinct + transform + aggregate:
from pyspark.sql.functions import expr
# set up data
spark_df = spark.createDataFrame([
(101, 'Mark', ['apple', 'apple', 'orange', 'apple'], [16, 2, 3, 1])
, (102, 'Twin', ['apple', 'banana', 'avocado', 'banana', 'avocado'], [5, 2, 11, 3, 1])
, (103, 'Smith', ['avocado'], [10])
], ['ID', 'name', 'fruit', 'qty']
)
>>> spark_df.show(5,0)
+---+-----+-----------------------------------------+----------------+
|ID |name |fruit |qty |
+---+-----+-----------------------------------------+----------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|
|103|Smith|[avocado] |[10] |
+---+-----+-----------------------------------------+----------------+
>>> spark_df.printSchema()
root
|-- ID: long (nullable = true)
|-- name: string (nullable = true)
|-- fruit: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qty: array (nullable = true)
| |-- element: long (containsNull = true)
Set up the SQL statement:
stmt = '''
transform(array_distinct(fruit), x -> (x, aggregate(
transform(sequence(0,size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
, 0
, (y,z) -> int(y + z)
))) AS sum_fruit
'''
>>> spark_df.withColumn('sum_fruit', expr(stmt)).show(10,0)
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|ID |name |fruit |qty |sum_fruit |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |[[apple, 19], [orange, 3]] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|[[apple, 5], [banana, 5], [avocado, 12]]|
|103|Smith|[avocado] |[10] |[[avocado, 10]] |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
Explanation:
Use array_distinct(fruit) to find all distinct entries in the array fruit.
Transform this new array (with element x) from x to (x, aggregate(..x..)).
The function aggregate(..x..) above simply sums all elements of array_T:
aggregate(array_T, 0, (y,z) -> y + z)
where array_T comes from the following transformation:
transform(sequence(0,size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
This iterates through the array fruit: if fruit[i] = x, it returns the corresponding qty[i], otherwise 0. For example, for ID=101 and x = 'orange', it returns the array [0, 0, 3, 0].
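The explanation above can be followed step by step in plain Python. This is a rough equivalent of the SQL expression for checking the logic, not something to run inside Spark (the function name sum_by_fruit is made up, and dict.fromkeys is used as a stand-in for array_distinct, matching the first-seen order shown in the output above):

```python
def sum_by_fruit(fruit, qty):
    """Mirror array_distinct + transform + aggregate on two parallel lists."""
    # Distinct entries in first-seen order
    distinct = list(dict.fromkeys(fruit))
    result = []
    for x in distinct:
        # transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
        masked = [qty[i] if fruit[i] == x else 0 for i in range(len(fruit))]
        # aggregate(array_T, 0, (y, z) -> y + z)
        result.append((x, sum(masked)))
    return result

mark = sum_by_fruit(['apple', 'apple', 'orange', 'apple'], [16, 2, 3, 1])
twin = sum_by_fruit(['apple', 'banana', 'avocado', 'banana', 'avocado'], [5, 2, 11, 3, 1])
print(mark)  # [('apple', 19), ('orange', 3)]
print(twin)  # [('apple', 5), ('banana', 5), ('avocado', 12)]
```

For each distinct fruit this scans the whole array once, so the work grows with (number of distinct fruits) x (array length), which is fine for short arrays like these.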
There may be a fancy way to do this using only the API functions on Spark 2.4+, perhaps with some combination of arrays_zip and aggregate, but I can't think of any that don't involve an explode step followed by a groupBy. With that in mind, using a udf may actually be better for you in this case.
I think creating a pandas DataFrame just for the purpose of calling .groupby().sum() is overkill. Furthermore, even if you did do it that way, you'd need to convert the final output to a different data structure because a udf can't return a pandas DataFrame.
Here's one way with a udf using collections.defaultdict:
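from collections import defaultdict
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
def sum_cols_func(frt, qty):
    # accumulate the total quantity per fruit
    d = defaultdict(int)
    for x, y in zip(frt, map(int, qty)):
        d[x] += y
    # return a plain list of (fruit, qty) pairs
    return list(d.items())
sum_cols = udf(
    lambda x: sum_cols_func(*x),
    ArrayType(
        StructType([StructField("fruit", StringType()), StructField("qty", IntegerType())])
    )
)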
Then call this by passing in the fruit and qty columns:
from pyspark.sql.functions import array, col
spark_df.withColumn(
"Result",
sum_cols(array([col("fruit"), col("qty")]))
).show(truncate=False)
#+---+----+-----------------------------+-------------+--------------------------+
#|ID |name|fruit |qty |Result |
#+---+----+-----------------------------+-------------+--------------------------+
#|101|Mark|[apple, apple, orange, apple]|[16, 2, 3, 1]|[[orange, 3], [apple, 19]]|
#+---+----+-----------------------------+-------------+--------------------------+
If you have Spark < 2.4, use the following to explode (otherwise check this answer):
df_split = (spark_df.rdd.flatMap(lambda row: [(row.ID, row.name, f, q) for f, q in zip(row.fruit, row.qty)]).toDF(["ID", "name", "fruit", "qty"]))
df_split.show()
Output:
+---+----+------+---+
| ID|name| fruit|qty|
+---+----+------+---+
|101|Mark| apple| 16|
|101|Mark| apple| 2|
|101|Mark|orange| 3|
|101|Mark| apple| 1|
+---+----+------+---+
Then prepare the result you want. First find the aggregated dataframe:
import pyspark.sql.functions as F
df_aggregated = df_split.groupby('ID', 'fruit').agg(F.sum('qty').alias('qty'))
df_aggregated.show()
Output:
+---+------+---+
| ID| fruit|qty|
+---+------+---+
|101|orange| 3|
|101| apple| 19|
+---+------+---+
And finally change it to the desired format:
df_aggregated.groupby('ID').agg(F.collect_list(F.struct(F.col('fruit'), F.col('qty'))).alias('Result')).show()
Output:
+---+--------------------------+
|ID |Result |
+---+--------------------------+
|101|[[orange, 3], [apple, 19]]|
+---+--------------------------+
I have a spark Dataset of rows in Java that looks like this.
+-------+-------------------+---------------+----------+--------------------+-----+
|item_id| date_time|horizon_minutes|last_value| values|label|
+-------+-------------------+---------------+----------+--------------------+-----+
| 8|2019-04-30 09:55:00| 15| 0.0|[0.0,0.0,0.0,0.0,...| 0.0|
| 8|2019-04-30 10:00:00| 15| 0.0|[0.0,0.0,0.0,0.0,...| 0.0|
| 8|2019-04-30 10:05:00| 15| 0.0|[0.0,0.0,0.0,0.0,...| 0.0|
I want to filter the Dataframe to take only those rows whose month is inside a list of integers (e.g. 1,2,5,12)
I have tried the filter function based on strings
rowsDS.filter("month(date_time)" ???)
But I don't know how to include the "isin list" of integers condition.
I have also tried to filter through a lambda function with no luck.
rowsDS.filter(row -> listofints.contains(row.getDate(1).getMonth()))
Evaluation failed. Reason(s):
Lambda expressions cannot be used in an evaluation expression
Is there any simple way to do this? I would prefer to use lambda functions, as I am not fond of SparkSQL's string-based filters such as the first example.
For a DataFrame (Scala):
val result = df.where(month($"date_time").isin(2, 3, 4))
In Java:
Dataset<Row> result = df.where(month(col("date_time")).isin(2, 3, 4));
To get the "col" and "month" functions in Java:
import static org.apache.spark.sql.functions.*;
You can define a UDF as described here and here
My example:
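// assumes ss is an existing SparkSession
import ss.implicits._
import org.apache.spark.sql.functions.callUDF
val seq1 = Seq(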
("A", "abc", 0.1, 0.0, 0),
("B", "def", 0.15, 0.5, 0),
("C", "ghi", 0.2, 0.2, 1),
("D", "jkl", 1.1, 0.1, 0),
("E", "mno", 0.1, 0.1, 0)
)
val ls = List("A", "B")
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
def rawFilterFunc(r: String) = ls.contains(r)
ss.udf.register("ff", rawFilterFunc _)
df1.filter(callUDF("ff", df1("cA"))).show()
Gives output:
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| A|abc| 0.1|0.0| 0|
| B|def|0.15|0.5| 0|
+---+---+----+---+---+