Insert/Upsert/Delete (CDC) PySpark Structured Streaming - apache-spark

Let's suppose that we have an initial file like this:
Id | Number | ChangeMode
1  | 10     | insert
2  | 20     | insert
3  | 30     | insert
4  | 40     | insert
5  | 50     | insert
My table in MariaDB should then look something like this:
Id | Number
1  | 10
2  | 20
3  | 30
4  | 40
5  | 50
Then another file like this arrives in the folder:
Id | Number | ChangeMode
1  | 123    | upsert
2  | 456    | upsert
3  | 30     | remove
And the table should end up like this:
Id | Number
1  | 123
2  | 456
4  | 40
5  | 50
How can I use the "ChangeMode" column to tell Spark when to insert, update, or delete?
I already wrote this part of the code, but I don't know how to proceed from here, and I also don't know how to implement the delete.
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = (SparkSession
    .builder
    .appName("Spark Structured Streaming CDC")
    .config("spark.driver.extraClassPath", "E:\\pyspark_projects\\mariadb-java-client-2.7.1.jar")
    .getOrCreate())

streamingSchema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Number", IntegerType(), True),
    StructField("ChangeMode", StringType(), True),
])

streamingDF = (spark.readStream
    .format("csv")
    .option("sep", "|")
    .schema(streamingSchema)
    .csv("E:\\pyspark_projects\\stream_cdc\\files\\input\\"))

db_target_properties = {"user": "root", "password": "root", "driver": "org.mariadb.jdbc.Driver"}
db_target_url = "jdbc:mariadb://127.0.0.1:3306/projects"

streamingInsert = streamingDF.where("ChangeMode == 'insert'")
streamingUpsert = streamingDF.where("ChangeMode == 'upsert'")

def insert(df, epoch_id):
    streamingInsert.write.jdbc(url=db_target_url, table="cdc", mode="append", properties=db_target_properties)
    pass

def upsert(df, epoch_id):
    streamingUpsert.write.jdbc(url=db_target_url, table="cdc", mode="update", properties=db_target_properties)
    pass

queryInsert = streamingInsert.writeStream.foreachBatch(insert).start()
queryUpdate = streamingUpsert.writeStream.foreachBatch(upsert).start()

spark.streams.awaitAnyTermination()
I'm having the following error:
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
File "C:\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "C:\Spark\python\pyspark\sql\utils.py", line 207, in call
raise e
File "C:\Spark\python\pyspark\sql\utils.py", line 204, in call
self.func(DataFrame(jdf, self.sql_ctx), batch_id)
File "main.py", line 32, in insert
streamingInsert.write.jdbc(url=db_target_url, table="cdc", mode="append", properties=db_target_properties)
File "C:\Spark\python\pyspark\sql\dataframe.py", line 231, in write
return DataFrameWriter(self)
File "C:\Spark\python\pyspark\sql\readwriter.py", line 645, in __init__
self._jwrite = df._jdf.write()
File "C:\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Spark\python\pyspark\sql\utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
at py4j.Protocol.getReturnValue(Protocol.java:476)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
at com.sun.proxy.$Proxy17.call(Unknown Source)
at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$.$anonfun$callForeachBatch$1(ForeachBatchSink.scala:56)
at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$.$anonfun$callForeachBatch$1$adapted(ForeachBatchSink.scala:56)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:36)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:572)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:570)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:570)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:223)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:191)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245)
If anyone knows another method of doing the same, please let me know.

I found a way to do it, using another module to write to MariaDB: for insert/update I use a single command, and for delete I use a separate one.
Hope it helps someone in the future!
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import mariadb

spark = (SparkSession
    .builder
    .appName("Spark Structured Streaming CDC")
    .getOrCreate())

streamingSchema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Number", IntegerType(), True),
    StructField("ChangeMode", StringType(), True)
])

streamingDF = (spark.readStream
    .format("csv")
    .option("sep", "|")
    .schema(streamingSchema)
    .csv("E:\\pyspark_projects\\stream_cdc\\files\\input\\"))

class RowWriter:
    def open(self, partition_id, epoch_id):
        print("Opened %d, %d" % (partition_id, epoch_id))
        return True

    def process(self, row):
        conn = mariadb.connect(
            user="root",
            password="root",
            host="127.0.0.1",
            port=3306,
            database="projects"
        )
        cur = conn.cursor()
        # Insert or update in one statement (Id must be a primary/unique key).
        if row[2] in ('insert', 'upsert'):
            cur.execute("INSERT INTO cdc (Id, Number) VALUES (" + str(row[0]) + ", " + str(row[1]) + ") ON DUPLICATE KEY UPDATE Number = " + str(row[1]))
        # Delete rows flagged for removal.
        if row[2] == 'remove':
            cur.execute("DELETE FROM cdc WHERE Id = " + str(row[0]))
        conn.commit()
        conn.close()

    def close(self, error):
        print("Closed with error: %s" % str(error))

query = (streamingDF.writeStream
    .foreach(RowWriter())
    .option("checkpointLocation", "E:\\pyspark_projects\\stream_cdc\\files\\checkpoint")
    .start())

query.awaitTermination()
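As a side note (an addition, not part of the original answer): foreach opens a new database connection for every single row. The foreachBatch approach from the question can also be made to work once the function uses the batch DataFrame it receives instead of the streaming DataFrame (which is exactly what triggered the "'write' can not be called on streaming Dataset/DataFrame" error), and the upsert/delete statements go through the MariaDB connector, since DataFrameWriter.jdbc has no upsert or delete mode. A minimal sketch, assuming the same streamingDF, the same cdc table with Id as a primary/unique key, and the same connection settings (the checkpoint path here is just a placeholder):

import mariadb

def write_cdc_batch(batch_df, epoch_id):
    # Use the batch DataFrame handed to foreachBatch, never the streaming DataFrame.
    # Collecting assumes each micro-batch is small enough to fit on the driver.
    rows = batch_df.collect()
    conn = mariadb.connect(user="root", password="root",
                           host="127.0.0.1", port=3306, database="projects")
    cur = conn.cursor()
    for row in rows:
        if row["ChangeMode"] in ("insert", "upsert"):
            cur.execute("INSERT INTO cdc (Id, Number) VALUES (?, ?) "
                        "ON DUPLICATE KEY UPDATE Number = ?",
                        (row["Id"], row["Number"], row["Number"]))
        elif row["ChangeMode"] == "remove":
            cur.execute("DELETE FROM cdc WHERE Id = ?", (row["Id"],))
    conn.commit()
    conn.close()

query = (streamingDF.writeStream
    .foreachBatch(write_cdc_batch)
    .option("checkpointLocation", "E:\\pyspark_projects\\stream_cdc\\files\\checkpoint_fb")
    .start())

query.awaitTermination()

This opens one connection per micro-batch instead of one per row, and the parameterized queries avoid building SQL by string concatenation.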

Related

Unable to send Pyspark data frame to Kafka topic

I am trying to send data from a daily batch to a Kafka topic using pyspark, but I currently receive the following error:
Traceback (most recent call last):
  File "", line 5, in
  File "/usr/local/rms/lib/hdp26_c5000/spark2/python/pyspark/sql/readwriter.py", line 548, in save
    self._jwrite.save()
  File "/usr/local/rms/lib/hdp26_c5000/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
  File "/usr/local/rms/lib/hdp26_c5000/spark2/python/pyspark/sql/utils.py", line 71, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Invalid call to toAttribute on unresolved object, tree: unresolvedalias('shop_id, None)"
The code I am using is as follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.debug.maxToStringFields", 100000) \
.getOrCreate()
df = spark.sql('''select distinct shop_id, item_id
from sale.data
''')
df.selectExpr("shop_id", "item_id") \
.write \
.format("kafka") \
.option("kafka.bootstrap.servers", "myserver.local:443") \
.option("topic","test_topic_01") \
.save()
Currently used versions are:
-Spark 2.1.1.2.6.2.0-205
-Kafka Broker 0.11
Kafka expects a key and a value to be written into its topic (the key is optional). Spark determines them by looking at the DataFrame column names, which must be "key" and "value".
Your query only produces the columns "shop_id" and "item_id", so there is no column that can be interpreted as the mandatory value; the "unresolvedalias('shop_id, None)" in the error message points at the first column Spark failed to map to the expected ones.
You can solve the issue by adding a column named "value", something like:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

df = spark.sql('''select distinct shop_id, item_id
                  from sale.data
               ''')

df.withColumn("value", col("shop_id").cast(StringType())) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "myserver.local:443") \
    .option("topic", "test_topic_01") \
    .save()
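If both columns need to reach Kafka, a common pattern is to serialize the whole row into the value column, for example as JSON. A sketch of that, reusing the broker and topic from the question (the key column here is an assumption, not something the question requires):

from pyspark.sql.functions import col, struct, to_json

df.select(
        col("shop_id").cast("string").alias("key"),           # optional key
        to_json(struct("shop_id", "item_id")).alias("value")  # mandatory value
    ) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "myserver.local:443") \
    .option("topic", "test_topic_01") \
    .save()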

Errors when creating a Value and Timestamp dataframe with Azure Databricks

I'm not too familiar with Spark but I'm forced to use it to consume some data. I've tried basically every syntax I could find to make a dataframe with a value and a timestamp that I can put into a database to track when I get updates from the datasource. The errors are endless and I'm out of ideas and short on reasons for why I can't make something this simple. Below is the sample of code I'm trying to get working
sc = spark.sparkContext
df = sc.parallelize([[1,pyspark.sql.functions.current_timestamp()]]).toDF(("Value","CreatedAt"))
and this error doesn't really help
py4j.Py4JException: Method __getstate__([]) does not exist
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<command-1699228214903488> in <module>
29
30 sc = spark.sparkContext
---> 31 df = sc.parallelize([[1,pyspark.sql.functions.current_timestamp()]]).toDF(("Value","CreatedAt"))
/databricks/spark/python/pyspark/context.py in parallelize(self, c, numSlices)
557 return self._jvm.PythonParallelizeServer(self._jsc.sc(), numSlices)
558
--> 559 jrdd = self._serialize_to_jvm(c, serializer, reader_func, createRDDServer)
560
561 return RDD(jrdd, self, serializer)
/databricks/spark/python/pyspark/context.py in _serialize_to_jvm(self, data, serializer, reader_func, createRDDServer)
590 try:
591 try:
--> 592 serializer.dump_stream(data, tempFile)
593 finally:
594 tempFile.close()
I've also tried this
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc) # sc is the spark context
df = sqlContext.createDataFrame(
[( current_timestamp(), '12a345')],
['CreatedAt','Value'] # the row header/column labels should be entered here
)
With the error
AssertionError: dataType <py4j.java_gateway.JavaMember object at 0x7f43d97c6ba8> should be an instance of <class 'pyspark.sql.types.DataType'>
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<command-2294571935273349> in <module>
33 df = sqlContext.createDataFrame(
34 [( current_timestamp(), '12a345')],
---> 35 ['CreatedAt','Value'] # the row header/column labels should be entered here
36 )
37
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
305 Py4JJavaError: ...
306 """
--> 307 return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
308
309 #since(1.3)
/databricks/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
815 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
816 else:
--> 817 rdd, schema = self._createFromLocal(map(prepare, data), schema)
818 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
Well, I got something to work eventually. I couldn't get it to work with TimestampType() though; Spark would flip out when inserting the data. I think that may be a runtime error and not a coding issue, though.
import adal
import datetime
from pyspark.sql.types import *

# Set Access Token
access_token = token["accessToken"]

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the spark context

schema = StructType([
    StructField("CreatedAt", StringType(), True),
    StructField("value", StringType(), True)
])

da = datetime.datetime.now().strftime("%m/%d/%Y %H:%M:%S")

df = sqlContext.createDataFrame(
    [(da, '12a345')], schema
)

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", "dbo.RunStart") \
    .option("accessToken", access_token) \
    .option("databaseName", database_name) \
    .option("encrypt", "true") \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .option("applicationintent", "ReadWrite") \
    .mode("append") \
    .save()
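For completeness, and as an addition to the answer above: the original errors come from putting current_timestamp() inside the row data. It is a Column expression that only has meaning inside a query plan, not a driver-side value, so it cannot be pickled into an RDD or placed in a local row; it has to be added as a column instead. A minimal sketch, assuming a single string value per row, which also keeps CreatedAt as a real TimestampType column:

from pyspark.sql.functions import current_timestamp

# Build the row with plain Python values only, then add the timestamp as a column.
df = (spark.createDataFrame([("12a345",)], ["Value"])
      .withColumn("CreatedAt", current_timestamp()))

df.show(truncate=False)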

How to connect to a secured Kafka cluster from Zeppelin ("Failed to construct kafka consumer")?

I am trying to read some data from a Kafka broker using structured streaming to display it in a Zeppelin note. I am using Spark 2.4.3, Scala 2.11, Python 2.7, Java 9 and Kafka 2.2 with SSL enabled hosted on Heroku, but get the StreamingQueryException: 'Failed to construct kafka consumer'.
I am using the following dependencies (set in the Spark interpreter settings):
org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.3
org.apache.spark:spark-streaming_2.11:2.4.3
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
I have tried older and newer versions, but these should match the Spark/Scala versions I am using.
I have successfully written and read from Kafka using simple Python producer and consumer.
The code I am using:
%pyspark
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
from pyspark.sql.functions import col, expr, when
schema = StructType().add("power", IntegerType()).add("colorR", IntegerType()).add("colorG",IntegerType()).add("colorB",IntegerType()).add("colorW",IntegerType())
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", brokers) \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.truststore.location", "/home/ubuntu/kafka/truststore.jks") \
.option("kafka.ssl.keystore.location", "/home/ubuntu/kafka/keystore.jks") \
.option("kafka.ssl.keystore.password", password) \
.option("kafka.ssl.truststore.password", password) \
.option("kafka.ssl.endpoint.identification.algorithm", "") \
.option("startingOffsets", "earliest") \
.option("subscribe", topic) \
.load()
schema = ArrayType(
StructType([StructField("power", IntegerType()),
StructField("colorR", IntegerType()),
StructField("colorG", IntegerType()),
StructField("colorB", IntegerType()),
StructField("colorW", IntegerType())]))
readDF = df.select( \
col("key").cast("string"),
from_json(col("value").cast("string"), schema))
query = readDF.writeStream.format("console").start()
query.awaitTermination()
And the error I get:
Fail to execute line 43: query.awaitTermination()
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2171412221151055324.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 43, in <module>
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 103, in awaitTermination
return self._jsq.awaitTermination()
File "/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
StreamingQueryException: u'Failed to construct kafka consumer\n=== Streaming Query ===\nIdentifier: [id = 2ee20c47-8293-469a-bc0b-ef71a1f118bc, runId = 72422290-090a-4b6d-bd66-088a5a534240]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [cast(key#7 as string) AS key#22, jsontostructs(ArrayType(StructType(StructField(power,IntegerType,true), StructField(colorR,IntegerType,true), StructField(colorG,IntegerType,true), StructField(colorB,IntegerType,true), StructField(colorW,IntegerType,true)),true), cast(value#8 as string), Some(Etc/UTC)) AS jsontostructs(CAST(value AS STRING))#21]\n+- StreamingExecutionRelation KafkaV2[Subscribe[tanana-44614.lightbulb]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'
When I use read and write instead of readStream and writeStream I do not get any errors, but nothing appears on the console when I send some data to Kafka.
What else should I try?
It looks like the Kafka Consumer cannot access ~/kafka/truststore.jks and hence the exception. Replace ~ with the fully-specified path (without the tilde) and the issue should go away.
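One quick way to confirm that diagnosis from the same Zeppelin interpreter (a suggestion, not part of the original answer) is to check that the JKS files are actually readable at the exact absolute paths passed to the Kafka options:

import os

for p in ("/home/ubuntu/kafka/truststore.jks", "/home/ubuntu/kafka/keystore.jks"):
    print(p, "readable:", os.access(p, os.R_OK))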

Using broadcasted dataframe in pyspark UDF

Is it possible to use a broadcast DataFrame in the UDF of a PySpark SQL application?
My code uses the broadcast DataFrame inside a PySpark UDF like below.
fact_ent_df_data = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    fact_ent_df_count = fact_ent_df_data.select(
        fact_ent_df_br.TheDate.between(col1, col2),
        fact_ent_df_br.Ent.isin('col3')).count()
    return fact_ent_df_count

sparkSession.udf.register("generate_lookup_code", generate_lookup_code)
sparkSession.sql('select sample4, generate_lookup_code(sample1, sample2, sample3) as count_hol from table_t')
I am getting a "local variable used before assignment" error when I use the broadcast df_bc. Any help is appreciated.
The error I am getting is:
Traceback (most recent call last):
File "C:/Users/Vignesh/PycharmProjects/gettingstarted/aramex_transit/spark_driver.py", line 46, in <module>
sparkSession.udf.register("generate_lookup_code" , generate_lookup_code )
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 323, in register
self.sparkSession._jsparkSession.udf().registerPython(name, register_udf._judf)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 148, in _judf
self._judf_placeholder = self._create_judf()
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 157, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 33, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\rdd.py", line 2391, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\serializers.py", line 575, in dumps
return cloudpickle.dumps(obj, 2)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 918, in dumps
cp.dump(obj)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 249, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o24.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Think of a Spark broadcast variable as a simple Python data type, like a list; the problem is then just how to pass that variable to the UDF. Here is an example:
Suppose we have an ages list l and a data frame with columns name and age, and we want to check whether each person's age is in the ages list.
from pyspark.sql.functions import udf, col

l = [13, 21, 34]                  # ages list
d = [('Alice', 10), ('bob', 21)]  # data frame rows

rdd = sc.parallelize(l)
b_rdd = sc.broadcast(rdd.collect())  # define broadcast variable
df = spark.createDataFrame(d, ["name", "age"])

def check_age(age, age_list):
    if age in age_list:
        return "true"
    return "false"

def udf_check_age(age_list):
    return udf(lambda x: check_age(x, age_list))

df.withColumn("is_age_in_list", udf_check_age(b_rdd.value)(col("age"))).show()
Output:
+-----+---+--------------+
| name|age|is_age_in_list|
+-----+---+--------------+
|Alice| 10| false|
| bob| 21| true|
+-----+---+--------------+
Just trying to contribute with a simpler example based on Soheil's answer.
from pyspark.sql.functions import udf, col

def check_age(_age):
    return _age > 18

dict_source = {"alice": 10, "bob": 21}
broadcast_dict = sc.broadcast(dict_source)  # define broadcast variable
rdd = sc.parallelize(list(dict_source.keys()))

result = rdd.map(
    lambda _name: check_age(broadcast_dict.value.get(_name))  # access the broadcasted var via `.value`
)
print(result.collect())
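Coming back to the original question, the same idea applies to a broadcast DataFrame: broadcast the collected rows (a plain Python list of Row objects) and loop over them inside the UDF, instead of calling DataFrame methods such as select() there. A hedged sketch reusing the names from the question (fact_rows and the IntegerType return type are assumptions):

from pyspark.sql.types import IntegerType

# Broadcast the collected rows of the (small) fact DataFrame.
fact_rows = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    # Plain Python over the broadcast list; no DataFrame/Column API inside the UDF.
    return sum(1 for r in fact_rows.value
               if col1 <= r["TheDate"] <= col2 and r["Ent"] == col3)

sparkSession.udf.register("generate_lookup_code", generate_lookup_code, IntegerType())
sparkSession.sql("select sample4, generate_lookup_code(sample1, sample2, sample3) as count_hol from table_t")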

Querying a spark streaming application from spark-shell (pyspark)

I am following this example in the pyspark console and everything works perfectly.
After that I wrote it as a PySpark application as follows:
# -*- coding: utf-8 -*-
import sys
import click
import logging

from pyspark.sql import SparkSession
from pyspark.sql.types import *

@click.command()
@click.option('--master')
def most_idiotic_bi_query(master):
    spark = SparkSession \
        .builder \
        .master(master)\
        .appName("stream-test")\
        .getOrCreate()

    spark.sparkContext.setLogLevel('ERROR')

    some_schema = ....  # Schema removed

    some_stream = spark\
        .readStream\
        .option("sep", ",")\
        .schema(some_schema)\
        .option("maxFilesPerTrigger", 1)\
        .csv("/data/some_stream", header=True)

    streaming_counts = (
        some_stream.groupBy(some_stream.field_1).count()
    )

    query = streaming_counts.writeStream\
        .format("memory")\
        .queryName("counts")\
        .outputMode("complete")\
        .start()

    query.awaitTermination()

if __name__ == "__main__":
    logging.getLogger("py4j").setLevel(logging.ERROR)
    most_idiotic_bi_query()
The app is executed as:
spark-submit test_stream.py --master spark://master:7077
Now, If I open a new spark driver in another terminal:
pyspark --master spark://master:7077
And try to run:
spark.sql("select * from counts")
It fails with:
During handling of the above exception, another exception occurred:
AnalysisExceptionTraceback (most recent call last)
<ipython-input-3-732b22f02ef6> in <module>()
----> 1 spark.sql("select * from id_counts").show()
/usr/spark-2.0.2/python/pyspark/sql/session.py in sql(self, sqlQuery)
541 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
542 """
--> 543 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
544
545 #since(2.0)
/usr/local/lib/python3.4/dist-packages/py4j-0.10.4-py3.4.egg/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/spark-2.0.2/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'Table or view not found: counts; line 1 pos 14'
I don't understand what is happening.
This is an expected behavior. If you check the documentation for memory sink:
The output is stored in memory as an in-memory table. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory. Hence, use it with caution.
As you can see, the memory sink doesn't create a persistent table or a global temporary view, but a local structure limited to the driver. Hence it cannot be queried from another Spark application.
So the memory output has to be queried from the driver in which it is written. For example, you can mimic console mode as shown below.
A dummy writer:
import pandas as pd
import numpy as np
import tempfile
import shutil

def producer(path):
    temp_path = tempfile.mkdtemp()

    def producer(i):
        df = pd.DataFrame({
            "group": np.random.randint(10, size=1000)
        })
        df["val"] = (
            np.random.randn(1000) +
            np.random.random(1000) * df["group"] +
            np.random.random(1000) * i % 7
        )
        f = tempfile.mktemp(dir=temp_path)
        df.to_csv(f, index=False)
        shutil.move(f, path)
    return producer
Spark application:
from pyspark.sql.types import IntegerType, DoubleType, StructType, StructField

schema = StructType([
    StructField("group", IntegerType()),
    StructField("val", DoubleType())
])

path = tempfile.mkdtemp()
query_name = "foo"

stream = (spark.readStream
    .schema(schema)
    .format("csv")
    .option("header", "true")
    .load(path))

query = (stream
    .groupBy("group")
    .avg("val")
    .writeStream
    .format("memory")
    .queryName(query_name)
    .outputMode("complete")
    .start())
And some events:
from rx import Observable
timer = Observable.timer(5000, 5000)
timer.subscribe(producer(path))
timer.skip(1).subscribe(lambda *_: spark.table(query_name).show())
query.awaitTermination()
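If the results really must be visible to a different Spark application, the memory sink is the wrong tool altogether. On newer Spark versions (2.4+, not the 2.0.2 used in the question) one option is foreachBatch, which can publish every micro-batch of the aggregation to a location other applications can read; a rough sketch under those assumptions (the output path is a placeholder):

output_path = "/tmp/streaming_counts"  # assumed location readable by the other application

def publish(batch_df, epoch_id):
    # Overwrite so that readers always see the latest complete aggregate.
    batch_df.write.mode("overwrite").parquet(output_path)

query = (stream
    .groupBy("group")
    .avg("val")
    .writeStream
    .outputMode("complete")
    .option("checkpointLocation", tempfile.mkdtemp())
    .foreachBatch(publish)
    .start())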
