Spark join by values - apache-spark

I have two pair RDDs in Spark like this
rdd1 = (1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])
I would like to combine them to look like this.
joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
(2 -> [1000,1001,1002,1003,1004,1006,1007])
(3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
Can someone suggest how to do this?
Thanks,
Dilip

With the Scala Spark API, this would be:
import org.apache.spark.SparkContext._ // enable PairRDDFunctions
val rdd1Flat = rdd1.flatMapValues(identity).map(_.swap)
val rdd2Flat = rdd2.flatMapValues(identity)
rdd1Flat.join(rdd2Flat).values.distinct.groupByKey.collect
The result of this operation is:
Array[(Int, Iterable[Int])] = Array(
(1,CompactBuffer(1001, 1011, 1006, 1002, 1003, 1013, 1005, 1007, 1009, 1000, 1012, 1008, 1010, 1004)),
(2,CompactBuffer(1003, 1004, 1007, 1000, 1002, 1001, 1006)),
(3,CompactBuffer(1008, 1009, 1007, 1011, 1005, 1010, 1013, 1012)))
The approach proposed by Gabor will not work, since Spark doesn't support RDD operations performed inside other RDD operations. You'll get a Java NPE thrown by a worker when it tries to access the SparkContext, which is available only on the driver.
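For reference, a minimal PySpark sketch of the same approach (assuming rdd1 and rdd2 are pair RDDs of (key, list) as in the question):
# Flatten rdd1 so its values (4..7) become the join keys, keeping the original key as the value
rdd1_flat = rdd1.flatMapValues(lambda v: v).map(lambda kv: (kv[1], kv[0]))
# Flatten rdd2 so each (4..7, id) pair is a separate record
rdd2_flat = rdd2.flatMapValues(lambda v: v)
# Join on the intermediate key, keep the (original_key, id) pairs,
# drop duplicates, and regroup by the original key
joined_rdd = rdd1_flat.join(rdd2_flat).values().distinct().groupByKey().mapValues(sorted)
joined_rdd.collect()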

Related

Delta lake stream-stream left outer join using watermarks

My aim here is to:
Read two delta lake tables as streams.
Add a timestamp column to the data frames.
Create a watermark using the timestamp columns.
Use the watermarks for a left outer join.
Write the join as a stream to a delta lake table.
Here is my code:
from delta import *
from pyspark.sql.functions import *
import pyspark
my_working_dir = "/working"
spark = (
    pyspark.sql.SparkSession.builder.appName("test_delta_stream").master("yarn").getOrCreate()
)
df_stream_1 = spark.readStream.format("delta").load(f"{my_working_dir}/table_1").alias("table_1")
df_stream_2 = spark.readStream.format("delta").load(f"{my_working_dir}/table_2").alias("table_2")
df_stream_1 = df_stream_1.withColumn("inserted_stream_1", current_timestamp())
df_stream_2 = df_stream_2.withColumn("inserted_stream_2", current_timestamp())
stream_1_watermark = df_stream_1.selectExpr(
    "ID AS STREAM_1_ID", "inserted_stream_1"
).withWatermark("inserted_stream_1", "10 seconds")
stream_2_watermark = df_stream_2.selectExpr(
    "ID AS STREAM_2_ID", "inserted_stream_2"
).withWatermark("inserted_stream_2", "10 seconds")
left_join_df = stream_1_watermark.join(
    stream_2_watermark,
    expr(
        """
        STREAM_1_ID = STREAM_2_ID AND
        inserted_stream_1 <= inserted_stream_2 + interval 1 hour
        """
    ),
    "leftOuter",
).select("*")
left_join_df.writeStream.format("delta").outputMode("append").option(
    "checkpointLocation", f"{my_working_dir}/output_delta_table/mycheckpoint"
).start(f"{my_working_dir}/output_delta_table")
Unfortunately, I get the error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/some-dir/python/pyspark/sql/streaming.py", line 1493, in start
return self._sq(self._jwrite.start(path))
File "/some-dir/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/some-dir/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Stream-stream LeftOuter join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;
It seems that even when I create watermarks, I still cannot do a left outer join. Any ideas why I am getting that error even though I clearly have defined a watermark on the joining columns?
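For reference (not a verified fix for the code above), the stream-stream outer join pattern in the Spark Structured Streaming programming guide puts a watermark on each side and uses a range condition that bounds the right-hand event time from both directions relative to the left-hand one. A sketch of that shape, where impressions and clicks stand in for two streaming DataFrames and expr is pyspark.sql.functions.expr:
impressions_wm = impressions.withWatermark("impressionTime", "2 hours")
clicks_wm = clicks.withWatermark("clickTime", "3 hours")
joined = impressions_wm.join(
    clicks_wm,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """),
    "leftOuter",
)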

Delta Lake Spark compaction after merge operation gives 'DeltaTable' object has no attribute '_get_object_id' error

I am doing a Delta Lake merge operation using the Python API and PySpark. After the merge operation I run a compaction, but the compaction gives the following error:
Error:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 170, in load
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1212, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1199, in _get_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_collections.py", line 501, in convert
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Code
delta_table = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv",header=True)
delta_table = DeltaTable.forPath(spark, delta_table)
delta_table.merge(df, "df.id = delta_table.id" ).whenNotMatchedInsertAll().execute()
#compaction
spark.read.format("delta").load(delta_table).repartition(1).write.option("dataChange",
"False").format("delta").mode("overwrite").save(delta_table)
Can anyone suggest why the Spark session is not able to create another Delta table instance?
I need to perform both the merge and the compaction in the same script, since I want to run the compaction only on the partitions in which the merge was performed. The partitions are derived from the unique values present in the dataframe df created from input_file.csv.
I think your problem lies with the delta_table variable: at first it is a string containing the Delta Lake path, but then you overwrite it with a DeltaTable object and pass that object into the .load() method, which expects a path. Separating those variables should help:
delta_table_path = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv",header=True)
delta_table = DeltaTable.forPath(spark, delta_table_path)
delta_table.merge(df, "df.id = delta_table.id" ).whenNotMatchedInsertAll().execute()
#compaction
spark.read.format("delta").load(delta_table_path).repartition(1).write.option("dataChange",
"False").format("delta").mode("overwrite").save(delta_table_path )

How to handle timestamp in Pyspark Structured Streaming

I'm trying to parse the datetime so that I can later group by certain hours in Structured Streaming.
Currently I have code like this:
distinct_table = service_table\
    .select(psf.col('crime_id'),
            psf.col('original_crime_type_name'),
            psf.to_timestamp(psf.col('call_date_time')).alias('call_datetime'),
            psf.col('address'),
            psf.col('disposition'))
Which gives output in console:
+---------+------------------------+-------------------+--------------------+------------+
| crime_id|original_crime_type_name| call_datetime| address| disposition|
+---------+------------------------+-------------------+--------------------+------------+
|183652852| Burglary|2018-12-31 18:52:00|600 Block Of Mont...| HAN|
|183652839| Passing Call|2018-12-31 18:51:00|500 Block Of Clem...| HAN|
|183652841| 22500e|2018-12-31 18:51:00|2600 Block Of Ale...| CIT|
When I try to apply this UDF to convert the timestamp (the call_datetime column):
import pyspark.sql.functions as psf
from pyspark.sql.types import StringType
from dateutil.parser import parse as parse_date

@psf.udf(StringType())
def udf_convert_time(timestamp):
    d = parse_date(timestamp)
    return str(d.strftime('%y%m%d%H'))
I get a NoneType error:
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 149, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 74, in <lambda>
return lambda *a: f(*a)
File "/Users/PycharmProjects/data-streaming-project/solution/streaming/data_stream.py", line 29, in udf_convert_time
d = parse_date(timestamp)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 301, in parse
res = self._parse(timestr, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 349, in _parse
l = _timelex.split(timestr)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 143, in split
return list(cls(s))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 137, in next
token = self.get_token()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 68, in get_token
nextchar = self.instream.read(1)
AttributeError: 'NoneType' object has no attribute 'read'
This is the query plan:
pyspark.sql.utils.StreamingQueryException: u'Writing job aborted.\n=== Streaming Query ===\nIdentifier: [id = 958a6a46-f718-49c4-999a-661fea2dc564, runId = fc9a7a78-c311-42b7-bbed-7718b4cc1150]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {KafkaSource[Subscribe[service-calls]]: {"service-calls":{"0":200}}}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [crime_id#25, original_crime_type_name#26, call_datetime#53, address#33, disposition#32, udf_convert_time(call_datetime#53) AS parsed_time#59]\n+- Project [crime_id#25, original_crime_type_name#26, to_timestamp(\'call_date_time, None) AS call_datetime#53, address#33, disposition#32]\n +- Project [SERVICE_CALLS#23.crime_id AS crime_id#25, SERVICE_CALLS#23.original_crime_type_name AS original_crime_type_name#26, SERVICE_CALLS#23.report_date AS report_date#27, SERVICE_CALLS#23.call_date AS call_date#28, SERVICE_CALLS#23.offense_date AS offense_date#29, SERVICE_CALLS#23.call_time AS call_time#30, SERVICE_CALLS#23.call_date_time AS call_date_time#31, SERVICE_CALLS#23.disposition AS disposition#32, SERVICE_CALLS#23.address AS address#33, SERVICE_CALLS#23.city AS city#34, SERVICE_CALLS#23.state AS state#35, SERVICE_CALLS#23.agency_id AS agency_id#36, SERVICE_CALLS#23.address_type AS address_type#37, SERVICE_CALLS#23.common_location AS common_location#38]\n +- Project [jsontostructs(StructField(crime_id,StringType,true), StructField(original_crime_type_name,StringType,true), StructField(report_date,StringType,true), StructField(call_date,StringType,true), StructField(offense_date,StringType,true), StructField(call_time,StringType,true), StructField(call_date_time,StringType,true), StructField(disposition,StringType,true), StructField(address,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(agency_id,StringType,true), StructField(address_type,StringType,true), StructField(common_location,StringType,true), value#21, Some(America/Los_Angeles)) AS SERVICE_CALLS#23]\n +- Project [cast(value#8 as string) AS value#21]\n +- StreamingExecutionRelation KafkaSource[Subscribe[service-calls]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'
I'm using StringType for all columns and to_timestamp for the timestamp column (which seems to work).
I verified that all the data I'm using (only about 100 rows) has values. Any idea how to debug this?
EDIT
Input is coming from Kafka - Schema is shown above in the error log (all StringType())
It's best not to use UDFs, because they bypass the Spark Catalyst optimizer, especially when the pyspark.sql.functions module already provides the functions you need. This code will transform your timestamp:
import pyspark.sql.functions as F
import pyspark.sql.types as T
rawData = [(183652852, "Burglary", "2018-12-31 18:52:00", "600 Block Of Mont", "HAN"),
           (183652839, "Passing Call", "2018-12-31 18:51:00", "500 Block Of Clem", "HAN"),
           (183652841, "22500e", "2018-12-31 18:51:00", "2600 Block Of Ale", "CIT")]
df = spark.createDataFrame(rawData).toDF("crime_id",
                                         "original_crime_type_name",
                                         "call_datetime",
                                         "address",
                                         "disposition")
date_format_source = "yyyy-MM-dd HH:mm:ss"
date_format_target = "yyyy-MM-dd HH"
df.select("*")\
  .withColumn("new_time_format",
              F.from_unixtime(F.unix_timestamp(F.col("call_datetime"), date_format_source),
                              date_format_target)
              .cast(T.TimestampType()))\
  .withColumn("time_string", F.date_format(F.col("new_time_format"), "yyyyMMddHH"))\
  .select("call_datetime", "new_time_format", "time_string")\
  .show(truncate=True)
+-------------------+-------------------+-----------+
| call_datetime| new_time_format|time_string|
+-------------------+-------------------+-----------+
|2018-12-31 18:52:00|2018-12-31 18:00:00| 2018123118|
|2018-12-31 18:51:00|2018-12-31 18:00:00| 2018123118|
|2018-12-31 18:51:00|2018-12-31 18:00:00| 2018123118|
+-------------------+-------------------+-----------+
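Since the original goal is to group by hour in Structured Streaming, here is a minimal sketch of how the parsed timestamp could feed a windowed aggregation (assuming distinct_table is the streaming DataFrame from the question and psf is pyspark.sql.functions):
# Count records per 1-hour window; the watermark lets Spark drop old aggregation state
hourly_counts = distinct_table\
    .withWatermark("call_datetime", "1 hour")\
    .groupBy(psf.window("call_datetime", "1 hour"), psf.col("original_crime_type_name"))\
    .count()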

Using broadcasted dataframe in pyspark UDF

Is it possible to use a broadcast DataFrame in a UDF of a PySpark SQL application?
My code calls the broadcast DataFrame inside a PySpark UDF, like below.
fact_ent_df_data = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    fact_ent_df_count = fact_ent_df_data.\
        select(fact_ent_df_br.TheDate.between(col1, col2),
               fact_ent_df_br.Ent.isin('col3')).count()
    return fact_ent_df_count

sparkSession.udf.register("generate_lookup_code", generate_lookup_code)
sparkSession.sql('select sample4, generate_lookup_code(sample1, sample2, sample3) as count_hol from table_t')
I am getting a "local variable referenced before assignment" error when I use the broadcasted df_bc. Any help is appreciated.
And the error I am getting is:
Traceback (most recent call last):
File "C:/Users/Vignesh/PycharmProjects/gettingstarted/aramex_transit/spark_driver.py", line 46, in <module>
sparkSession.udf.register("generate_lookup_code" , generate_lookup_code )
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 323, in register
self.sparkSession._jsparkSession.udf().registerPython(name, register_udf._judf)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 148, in _judf
self._judf_placeholder = self._create_judf()
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 157, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 33, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\rdd.py", line 2391, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\serializers.py", line 575, in dumps
return cloudpickle.dumps(obj, 2)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 918, in dumps
cp.dump(obj)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 249, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o24.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Think of a Spark broadcast variable as a simple Python data type such as a list, so the problem reduces to how to pass a variable to the UDF. Here is an example:
Suppose we have an ages list l and a data frame with columns name and age, and we want to check whether each person's age is in the ages list.
from pyspark.sql.functions import udf, col

l = [13, 21, 34]                  # ages list
d = [('Alice', 10), ('bob', 21)]  # data frame rows
rdd = sc.parallelize(l)
b_rdd = sc.broadcast(rdd.collect())  # define broadcast variable
df = spark.createDataFrame(d, ["name", "age"])

def check_age(age, age_list):
    # check against the broadcast value that was passed in
    if age in age_list:
        return "true"
    return "false"

def udf_check_age(age_list):
    return udf(lambda x: check_age(x, age_list))

df.withColumn("is_age_in_list", udf_check_age(b_rdd.value)(col("age"))).show()
Output:
+-----+---+--------------+
| name|age|is_age_in_list|
+-----+---+--------------+
|Alice| 10| false|
| bob| 21| true|
+-----+---+--------------+
Just trying to contribute with a simpler example based on Soheil's answer.
from pyspark.sql.functions import udf, col

def check_age(_age):
    return _age > 18

dict_source = {"alice": 10, "bob": 21}
broadcast_dict = sc.broadcast(dict_source)  # define broadcast variable
rdd = sc.parallelize(list(dict_source.keys()))
result = rdd.map(
    lambda _name: check_age(broadcast_dict.value.get(_name))  # access the broadcasted var via `.value`
)
print(result.collect())
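And since the original question is about using the broadcast value inside a DataFrame UDF, here is a minimal sketch of that variant (the names is_adult and the sample data are illustrative, not from the question):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

broadcast_dict = sc.broadcast({"alice": 10, "bob": 21})  # broadcast plain Python data, not a DataFrame

@udf(BooleanType())
def is_adult(name):
    # Look up the age in the broadcast dict on the executor; None-safe
    age = broadcast_dict.value.get(name)
    return age is not None and age > 18

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df.withColumn("is_adult", is_adult(col("name"))).show()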

Pyspark: Using lambda function and .withColumn produces a none-type error I'm having trouble understanding

I have the following code below. Essentially what I'm trying to do is generate some new columns from the values in existing ones. After I do that, I save the dataframe with the new columns as a table in the cluster. Sorry, I'm still new to PySpark.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DecimalType
import numpy as np
import math
df = sqlContext.sql('select * from db.mytable')
angle_av = udf(lambda (x, y): -10 if x == 0 else math.atan2(y/x)*180/np.pi, DecimalType(20,10))
df = df.withColumn('a_v_angle', angle_av(array('a_v_real', 'a_v_imag')))
df.createOrReplaceTempView('temp')
sqlContext.sql('create table new_table as select * from temp')
These operations don't actually produce any errors. I then attempt to store the df as a table and get the following error (I'm guessing because this is when the operations are actually executed):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 103, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 70, in <lambda>
return lambda *a: f(*a)
File "<stdin>", line 14, in <lambda>
TypeError: unsupported operand type(s) for /: 'NoneType' and 'NoneType'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
This happens because the input values are null / None. The function should check its input and proceed accordingly:
if x == 0 or x is None
or just
if not x
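For completeness, a minimal None-safe sketch of the UDF. Switching to DoubleType and passing the two columns directly instead of an array are my assumptions, not part of the original code:
import math
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

def angle_av_fn(x, y):
    # Guard against null inputs and the x == 0 case before computing the angle
    if x is None or y is None or x == 0:
        return -10.0
    return math.degrees(math.atan2(y, x))  # atan2 takes two arguments: (y, x)

angle_av = udf(angle_av_fn, DoubleType())
df = df.withColumn('a_v_angle', angle_av(col('a_v_real'), col('a_v_imag')))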
