Is it possible to output the whole value content of a PySpark structured stream? - python-3.x

I am fetching pyspark stream data:
spark = SparkSession \
    .builder \
    .getOrCreate()

raw_stream = spark \
    .readStream \
    .option("endpoint", conf.get('config', 'endpoint')) \
    .option("app.name", conf.get('config', 'app_name')) \
    .option("app.secret", conf.get('config', 'app_key')) \
    .option("dc", conf.get('config', 'dc')) \
    .option("source.topic", conf.get('config', 'topic')) \
    .option("group.name", conf.get('config', 'group')) \
    .option("source.value.binaryType", 'false') \
    .load()

raw_stream_str = raw_stream \
    .selectExpr("CAST(value AS STRING)")

value_batch = raw_stream_str \
    .writeStream \
    .queryName("value_query") \
    .format("memory") \
    .start()
spark.sql("select * from value_query").show()
The output is as below:
+--------------------+
| value|
+--------------------+
|{"message":"DGc6K...|
+--------------------+
How can I expand the content and print the whole line without this truncated format?
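One way to see the full value is a sketch like the following, assuming the memory-sink query has had a moment to process its first micro-batch: show() truncates string columns to 20 characters by default, so pass truncate=False, or collect the rows and print the raw strings yourself.

import time
time.sleep(5)  # optional: give the streaming query a moment to fill the in-memory table

# Print the full column values instead of the truncated table cells
spark.sql("select * from value_query").show(truncate=False)

# Or print each raw line without any table formatting
for row in spark.sql("select value from value_query").collect():
    print(row.value)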

Related

Reading from Azure Event hub with Kafka driver doesn't seem to get any data

I'm running the following code in an Azure Databricks python notebook:
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhubns.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhubns.servicebus.windows.net/;SharedAccessKeyName=MyKeyName;SharedAccessKey=myaccesskey;\";"
df = spark.readStream \
    .format("kafka") \
    .option("subscribe", TOPIC) \
    .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.request.timeout.ms", "60000") \
    .option("kafka.session.timeout.ms", "60000") \
    .option("failOnDataLoss", "false") \
    .option("startingOffsets", "earliest") \
    .load()

df_write = df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start() \
    .awaitTermination()
This shows no output in the notebook. How could I debug what the problem is?
If you use .format("console"), the output won't appear in the notebook; it goes to the driver and executor logs. That is a difference between plain Apache Spark and Databricks.
If you want to see the data, just use the display function:
display(df)
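Since key and value from the Kafka source are binary, it can also help to decode them before displaying (a sketch extending the same idea; the extra columns are just the standard Kafka source metadata):

decoded = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic", "partition", "offset")
display(decoded)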
This code is now writing data with quite low latency; the newest data point is around 10 seconds old when I do a select in a SQL warehouse. The remaining problem is that foreachBatch is not run, but otherwise it's working.
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhub.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=mykeyname;SharedAccessKey=mykey;EntityPath=myentitypath;\";"
df = spark.readStream \
    .format("kafka") \
    .option("subscribe", TOPIC) \
    .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.request.timeout.ms", "60000") \
    .option("kafka.session.timeout.ms", "60000") \
    .option("failOnDataLoss", "false") \
    .option("startingOffsets", "earliest") \
    .load()

n = 100
count = 0

def run_command(batchDF, epoch_id):
    global count
    count += 1
    if count % n == 0:
        spark.sql("OPTIMIZE firstcatalog.bronze.factorydatas3 ZORDER BY (readtimestamp)")

...Omitted code where I transform the data in the value column to strongly typed data...

myTypedDF.writeStream \
    .foreachBatch(run_command) \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
    .partitionBy("somecolumn") \
    .toTable("myunitycatalog.bronze.mytable")
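For reference, the Structured Streaming documentation shows foreachBatch being used with start(), with every write (including the table write) done inside the batch function rather than through a separate sink. A sketch of that pattern, adapted to the names above (an assumption, not the poster's actual setup):

def write_and_optimize(batchDF, epoch_id):
    global count
    count += 1
    # Do the table write inside the batch function...
    batchDF.write \
        .format("delta") \
        .mode("append") \
        .partitionBy("somecolumn") \
        .saveAsTable("myunitycatalog.bronze.mytable")
    # ...and run the maintenance command every n batches
    if count % n == 0:
        spark.sql("OPTIMIZE firstcatalog.bronze.factorydatas3 ZORDER BY (readtimestamp)")

myTypedDF.writeStream \
    .foreachBatch(write_and_optimize) \
    .option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
    .start()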

Why can't PySpark show any data?

When I use local Spark on Windows as below, it works and I can see df.count():
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, col

spark = SparkSession \
    .builder \
    .appName("Structured Streaming") \
    .master("local[*]") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()

flower_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")

flower_schema_string = "sepal_length DOUBLE,sepal_width DOUBLE,petal_length DOUBLE,petal_width DOUBLE,species STRING"

flower_df2 = flower_df1.select(from_csv(col("value"), flower_schema_string).alias("flower"), "timestamp").select("flower.*", "timestamp")

flower_df2.createOrReplaceTempView("flower_find")
song_find_text = spark.sql("SELECT * FROM flower_find")

flower_agg_write_stream = song_find_text \
    .writeStream \
    .option("truncate", "false") \
    .format("memory") \
    .outputMode("update") \
    .queryName("testedTable") \
    .start()

while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)
But when I use Spark on my VirtualBox Ubuntu machine, I NEVER see any data.
Below are the modifications I made when using the Ubuntu Spark cluster:
SparkSession's master URL: "spark://192.168.15.2:7077"
Inserted flower_agg_write_stream.awaitTermination() above the "while True:" loop
Did I do something wrong?
ADDED: When I run the modified code, the following appears in the log:
...
org.apache.spark.sql.AnalysisException: Table or view not found: testedTable;
...
Unfortunately, I already tried createOrReplaceGlobalTempView(), but that doesn't work either.
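One way to narrow this down (a debugging sketch, not a fix) is to check, before polling the memory sink, whether the streaming query actually started and is making progress on the cluster, and whether the memory-sink temp view was registered in the current session:

# Inspect the streaming query before querying the memory sink
print(flower_agg_write_stream.status)        # is it active / waiting for data?
print(flower_agg_write_stream.lastProgress)  # rows processed in the last trigger
print(flower_agg_write_stream.exception())   # None unless the query has failed

# The memory sink registers a temp view in this SparkSession only
print([t.name for t in spark.catalog.listTables()])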

Spark Structured Streaming PySpark CSV Sink Doesn't Append

I write JSON to a Kafka topic and read JSON from that topic. I can subscribe to the topic and write to the console line by line, but I need to sink/write the data to CSV files, and I can't: the CSV is written once but never appended to.
You can see my code below.
Thank you!
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as func

spark = SparkSession.builder \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0') \
    .appName('kafka_stream_test') \
    .getOrCreate()

ordersSchema = StructType() \
    .add("a", StringType()) \
    .add("b", StringType()) \
    .add("c", StringType()) \
    .add("d", StringType()) \
    .add("e", StringType()) \
    .add("f", StringType())

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "product-views") \
    .load()

df_query = df \
    .selectExpr("cast(value as string)") \
    .select(func.from_json(func.col("value").cast("string"), ordersSchema).alias("parsed")) \
    .select("parsed.a", "parsed.b", "parsed.c", "parsed.d", "parsed.e", "parsed.f")

df = df_query \
    .writeStream \
    .format("csv") \
    .trigger(processingTime="5 seconds") \
    .option("path", "/var/kafka_stream_test_out/") \
    .option("checkpointLocation", "/user/kafka_stream_test_out/chk") \
    .start()

df.awaitTermination()
Yes, because you need this extra option, .option("format", "append"):
aa = df_query \
    .writeStream \
    .format("csv") \
    .option("format", "append") \
    .trigger(processingTime="5 seconds") \
    .option("path", "/var/kafka_stream_test_out/") \
    .option("checkpointLocation", "/user/kafka_stream_test_out/chk") \
    .outputMode("append") \
    .start()
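To confirm that new part files really are appended on each trigger, one option is to read the output directory back in batch mode and watch the row count grow (a sketch; it reuses ordersSchema from the question):

# The count should increase after each 5-second trigger that had data
written = spark.read.schema(ordersSchema).csv("/var/kafka_stream_test_out/")
print(written.count())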

PySpark Structured Streaming data writing into Cassandra not populating data

I want to write Spark Structured Streaming data into Cassandra. My Spark version is 2.4.0.
My input source is Kafka with JSON data, so writing to the console works fine, but when I query in cqlsh, no records are appended to the Cassandra table. Can you tell me what is wrong?
schema = StructType() \
    .add("humidity", IntegerType(), True) \
    .add("time", TimestampType(), True) \
    .add("temperature", IntegerType(), True) \
    .add("ph", IntegerType(), True) \
    .add("sensor", StringType(), True) \
    .add("id", StringType(), True)

def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options("spark.cassandra.connection.host", "cassnode1, cassnode2") \
        .options(table="sensor", keyspace="sensordb") \
        .save()

# Load json format to dataframe
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafkanode") \
    .option("subscribe", "iot-data-sensor") \
    .load() \
    .select([
        get_json_object(col("value").cast("string"), "$.{}".format(c)).alias(c)
        for c in ["humidity", "time", "temperature", "ph", "sensor", "id"]
    ])

df.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
I had the same issue in PySpark. Try the steps below.
First, validate that it is actually connecting to Cassandra. For example, point it at a table that does not exist and check whether it fails with "table not found".
Then try writeStream as below (include the trigger and output mode before calling the Cassandra update):
df.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
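For the first step, a simple connectivity check is a plain batch read through the same connector, to see whether the keyspace and table can be reached at all (a sketch, assuming the Spark Cassandra connector is on the classpath and spark.cassandra.connection.host is set in the Spark config):

# If this fails with "table not found" or a connection error, fix that first
spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="sensor", keyspace="sensordb") \
    .load() \
    .show(5)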

Spark Structured Streaming window issue

I have a problem regarding windows in Spark Structured Streaming. I want to group the data I'm receiving continuously from a Kafka source into sliding windows and count the number of records. The issue is that writeStream streams the window dataframe every time data arrives and updates the count of the current window.
I'm using the following code to create the window:
# Define schema of the topic to be consumed
jsonSchema = StructType([
    StructField("State", StringType(), True),
    StructField("Value", StringType(), True),
    StructField("SourceTimestamp", StringType(), True),
    StructField("Tag", StringType(), True)
])

spark = SparkSession \
    .builder \
    .appName("StructuredStreaming") \
    .config("spark.default.parallelism", "100") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("subscribe", "SIMULATOR.SUPERMAN.TOTO") \
    .load() \
    .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
    .select("data.*")

df = df.withColumn("time", current_timestamp())

Window = df \
    .withColumn("window", window("time", "4 seconds", "1 seconds")).groupBy("window").count() \
    .withColumn("time", current_timestamp())

# Write back to kafka
query = Window.select(to_json(struct("count", "window", "time")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .outputMode("update") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .start()
The windows are not sorted and are updated each time the count changes. How can I wait for the end of a window and stream its final count only once? Instead of this output:
{"count":21,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:18.000Z","end":"2019-05-13T09:39:22.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":37,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:21.939Z"}
{"count":18,"window":{"start":"2019-05-13T09:39:21.000Z","end":"2019-05-13T09:39:25.000Z"},"time":"2019-05-13T09:39:21.939Z"}
I would like this:
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
The expected output waits for the window to be closed, based on a comparison between the window end timestamp and the current time.
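A common way to get each window emitted only once is to declare a watermark on the time column and switch the sink to append mode: Spark then holds a window back until the watermark passes its end and outputs its final count a single time. A sketch based on the code above (not a tested drop-in; the 10-second watermark delay and the new checkpoint location are assumptions):

windowed = df \
    .withWatermark("time", "10 seconds") \
    .groupBy(window("time", "4 seconds", "1 seconds")) \
    .count()

# Append mode only emits a window once the watermark has passed its end
query = windowed.select(to_json(struct("count", "window")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint_append") \
    .outputMode("append") \
    .start()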
