Processing JSON tabular data incoming from Kafka topics in Python - apache-spark

I have events streaming into multiple Kafka topics as flat key:value JSON objects (no nested structure), for example:
event_1: {"name": "Alex", "age": 27, "hobby": "pc games"},
event_2: {"name": "Bob", "age": 33, "hobby": "swimming"},
event_3: {"name": "Charlie", "age": 12, "hobby": "collecting stamps"}
I am working in Python 3.7, and wish to consume a batch of events from those topics, let's say, every 5 minutes, transform it into a dataframe, do some processing and enrichment with this data and save the result to a csv file.
I'm new to Spark and searched for documentation to help me with this task but did not find any.
Is there an up-to-date source of information you would recommend?
Also, if there is any other recommended Big Data framework that would suit this task, I'd love to hear about it.

See the Triggers section of the Structured Streaming Programming Guide. Structured Streaming supports several trigger types. The default is micro-batch mode, where each micro-batch starts as soon as the previous one has finished processing.
In your case you need fixed-interval micro-batches, where you specify the interval at which the query is triggered. The following snippet shows how to do that.
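# Fixed-interval trigger: start a micro-batch every 5 minutes.
# File sinks need a checkpoint location; the path below is a placeholder.
df.writeStream \
    .format("csv") \
    .option("header", True) \
    .option("path", "path/to/destination/dir") \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .trigger(processingTime='5 minutes') \
    .start()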
A brief end-to-end example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define the schema of the Kafka message value
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("hobby", StringType(), True),
])
# Initialize the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Read the Kafka topic and parse the JSON value using the schema.
# kafka.bootstrap.servers must point at a Kafka broker (commonly port 9092), not at ZooKeeper (2181).
# "subscribe" also accepts a comma-separated list of topics.
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "x.x.x.x:9092") \
    .option("startingOffsets", "latest") \
    .option("subscribe", "testdata") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select(col("data.*"))
# Do some transformation
df1 = df  # ... your transformations go here
# Write the resulting DataFrame as CSV files, one micro-batch every 5 minutes.
# The checkpoint path is a placeholder; file sinks require one.
df1.writeStream \
    .format("csv") \
    .option("header", True) \
    .option("path", "path/to/destination/dir") \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .trigger(processingTime='5 minutes') \
    .start()
If you need to control how many CSV files each batch produces, you can also repartition the final DataFrame before writing it.
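To make the per-batch work concrete, here is a minimal plain-Python sketch (no Spark involved) of what the pipeline above does to each batch: parse flat JSON strings against a fixed set of fields and write the rows out as CSV. The helper names parse_event and batch_to_csv are made up for illustration, and turning unparseable input into an empty row is an assumption meant to mirror how malformed records show up as nulls instead of failing the job.

```python
import csv
import io
import json

FIELDS = ["name", "age", "hobby"]  # columns taken from the example events


def parse_event(raw):
    """Parse one JSON string into a row dict; unparseable input gives an all-None row."""
    try:
        obj = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        obj = None
    if not isinstance(obj, dict):
        return {field: None for field in FIELDS}
    return {field: obj.get(field) for field in FIELDS}


def batch_to_csv(raw_events):
    """Turn one batch of raw JSON strings into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for raw in raw_events:
        writer.writerow(parse_event(raw))
    return out.getvalue()


batch = [
    '{"name": "Alex", "age": 27, "hobby": "pc games"}',
    '{"name": "Bob", "age": 33, "hobby: "swimming"}',  # malformed: missing quote
]
csv_text = batch_to_csv(batch)
```

Here csv_text holds a header line, one row for Alex, and one empty row for the malformed event, which makes bad input easy to spot and filter before enrichment.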

Related

How to use multiple input and multiple output streams in a single pyspark session?

I am using Spark v2.4.0 and reading two separate streams from Kafka, applying different transformations to each. I want to persist both streaming DataFrames, but only one of them gets persisted; the other does not seem to run at the same time. I would be grateful for any help.
Below is my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import from_json, col, to_date
# Created a SparkSession here, as it is an entry point to underlying Spark functionality
spark = SparkSession.builder \
.master('spark://yash-tech:7077') \
.appName('Streaming') \
.getOrCreate()
# Defined a schema for our data being streamed from kafka
schema = StructType([
StructField("messageId", StringType(), True),
StructField("type", StringType(), True),
StructField("userId", StringType(), True),
StructField('data', StringType(), True),
StructField("timestamp", StringType(), True),
])
profileDF = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", 'test') \
.option("startingOffsets", "latest") \
.load() \
.select(from_json(col("value").cast("string"), schema).alias("value"))
# Using readStream on SparkSession to load a streaming Dataset from Kafka
clickStreamDF = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", 'test_new') \
.option("startingOffsets", "latest") \
.load() \
.select(from_json(col("value").cast("string"), schema).alias("value"))
# Selecting every column from the DF
clickStreamDFToPersist = clickStreamDF.select("value.*")
profileDFToPersist = profileDF.select("value.*")
# Added a new column containing date(yyyy-MM-dd) parsed from timestamp column for day wise partitioning
clickStreamDFToPersist = clickStreamDFToPersist.withColumn(
"date", to_date(col("timestamp"), "yyyy-MM-dd"))
# Writing data on local disk as json files, partitioned by userId.
clickStream_writing_sink = clickStreamDFToPersist.repartition(1) \
.writeStream \
.partitionBy('userId', 'date') \
.format("json") \
.option("path", "/home/spark/data/") \
.outputMode("append") \
.option("checkpointLocation", "/home/spark/event_checkpoint/") \
.trigger(processingTime='20 seconds') \
.start()
profile_writing_sink = profileDFToPersist.repartition(1) \
.writeStream \
.partitionBy('userId') \
.format("json") \
.option("path", "/home/spark/data/") \
.outputMode("append") \
.option("checkpointLocation", "/home/spark/profile_checkpoint/") \
.trigger(processingTime='30 seconds') \
.start()
clickStream_writing_sink.awaitTermination()
profile_writing_sink.awaitTermination()
NOTE:
I want both writeStreams to write to the same path.
If I give the two writeStreams different data paths, the code seems to work, but the data is persisted in different locations. Is there a way to persist both streams to the same location, or to do both transformations and persist the data with a single stream, since the location is the same for both?
In one stream I partition only by userId; in the other I partition by userId and date.

Because both sinks are given the same output directory, their outputs overwrite each other.
You cannot change the "part" file-name prefix when using any of the standard output formats.
It might be possible if you override recordWriter().

How to check if n consecutive events from kafka stream is greater or less than threshold limit

I am new to PySpark. I have written a PySpark program that reads a Kafka stream using a window operation. Every second I publish messages like the ones below to Kafka, with different sources and temperatures along with a timestamp.
{"temperature":34,"time":"2019-04-17 12:53:02","source":"1010101"}
{"temperature":29,"time":"2019-04-17 12:53:03","source":"1010101"}
{"temperature":28,"time":"2019-04-17 12:53:04","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:05","source":"1010101"}
{"temperature":45,"time":"2019-04-17 12:53:06","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:07","source":"1010102"}
{"temperature":29,"time":"2019-04-17 12:53:08","source":"1010102"}
{"temperature":28,"time":"2019-04-17 12:53:09","source":"1010102"}
{"temperature":34,"time":"2019-04-17 12:53:10","source":"1010102"}
{"temperature":45,"time":"2019-04-17 12:53:11","source":"1010102"}
How do I check whether n consecutive temperature records for a source cross the threshold limits (below 30 or above 40) and then publish alerts to Kafka? Also, please let me know whether the program below reads the Kafka stream efficiently or needs any changes.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid
schema = StructType([
StructField("source", StringType(), True),
StructField("temperature", FloatType(), True),
StructField("time", StringType(), True)
])
spark = SparkSession \
.builder.master("local[8]") \
.appName("test-app") \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 5)
df1 = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.load() \
.selectExpr("CAST(value AS STRING)")
df2 = df1.select(from_json("value", schema).alias(
"sensors")).select("sensors.*")
df3 = df2.select(df2.source, df2.temperature, from_unixtime(
unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))
df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"),
df3.source).agg(avg("temperature"))
query1 = df4.writeStream \
.outputMode("complete") \
.format("console") \
.option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
.start()
query1.awaitTermination()
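The windowed average above does not look at consecutive records. As a minimal plain-Python sketch of the check itself (not wired into Spark), the function below keeps a per-source count of out-of-range readings in a row and emits an alert when that count reaches n. The name consecutive_alerts is made up for illustration, and it assumes that "crossing the threshold" means a temperature below 30 or above 40 and that events arrive in time order per source. In a streaming job this logic would need to run somewhere that keeps state across micro-batches.

```python
from collections import defaultdict


def consecutive_alerts(events, n, low=30, high=40):
    """Return (source, time) each time a source reaches n out-of-range readings in a row."""
    run_length = defaultdict(int)  # per-source count of consecutive violations
    alerts = []
    for event in events:  # assumed to be ordered by time within each source
        source = event["source"]
        if event["temperature"] < low or event["temperature"] > high:
            run_length[source] += 1
            if run_length[source] == n:  # alert once per run, when it first reaches n
                alerts.append((source, event["time"]))
        else:
            run_length[source] = 0  # an in-range reading breaks the run
    return alerts


events = [
    {"temperature": 34, "time": "2019-04-17 12:53:02", "source": "1010101"},
    {"temperature": 29, "time": "2019-04-17 12:53:03", "source": "1010101"},
    {"temperature": 28, "time": "2019-04-17 12:53:04", "source": "1010101"},
    {"temperature": 34, "time": "2019-04-17 12:53:05", "source": "1010101"},
    {"temperature": 45, "time": "2019-04-17 12:53:06", "source": "1010101"},
]
alerts = consecutive_alerts(events, n=2)
```

With n=2 the sample data produces a single alert at 12:53:04, the second of the two readings below 30; the reading of 45 starts a new run of length one and does not alert.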

Spark Streaming Job is running very slow

I am running a Spark streaming job locally and it takes approximately 4 to 5 minutes per batch. Can someone suggest what could be wrong with the code below?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid
schema = StructType([
StructField("source", StringType(), True),
StructField("temperature", FloatType(), True),
StructField("time", StringType(), True)
])
spark = SparkSession \
.builder.master("local[8]") \
.appName("poc-app") \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 5)
df1 = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "poc") \
.load() \
.selectExpr("CAST(value AS STRING)")
df2 = df1.select(from_json("value", schema).alias(
"sensors")).select("sensors.*")
df3 = df2.select(df2.source, df2.temperature, from_unixtime(
    unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))
df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"),
    df3.source).count()
query1 = df4.writeStream \
.outputMode("complete") \
.format("console") \
.option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
.start()
query1.awaitTermination()

With micro-batch streaming you usually want to reduce the number of output partitions. Because you are doing an aggregation (a wide transformation), every shuffle defaults to 200 partitions, as set by
spark.conf.get("spark.sql.shuffle.partitions")
Try lowering this setting, and place it at the beginning of your code so that the aggregation produces 5 partitions:
spark.conf.set("spark.sql.shuffle.partitions", 5)
You can also get a feel for this by looking at the number of files in the write stream's output directory and by checking the number of partitions in your aggregated DataFrame:
df3.rdd.getNumPartitions()
Also, since you are testing in local mode, try local[8] instead of local[4] to increase parallelism across your CPU cores (I assume you have 4).

How to calculate lag difference in Spark Structured Streaming?

I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference.
To reproduce my issue, here is a code snippet. It consumes the data.json file stored in the data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("Test") \
.master("local[2]") \
.getOrCreate()
schema = StructType([
StructField("id", IntegerType()),
StructField("type", StringType()),
StructField("timestamp", LongType())
])
ds = spark \
.readStream \
.format("json") \
.schema(schema) \
.load("data/")
diff_window = Window.partitionBy("id").orderBy("timestamp")
ds = ds.withColumn("prev_timestamp", func.lag(ds.timestamp).over(diff_window))
query = ds \
.writeStream \
.format('console') \
.start()
query.awaitTermination()
I get this error:
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not
supported on streaming DataFrames/Datasets;;\nWindow
[lag(timestamp#71L, 1, null) windowspecdefinition(host_id#68,
timestamp#71L ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING) AS prev_timestamp#129L]

pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets'
This means that a window on a streaming DataFrame must be based on a timestamp column. So if you have a data point every second and you define a 30-second window with a 10-second slide, the aggregation adds a window column whose start and end fields are timestamps 30 seconds apart.
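As a small worked example of how such sliding windows overlap, the plain-Python sketch below lists every window that contains a given timestamp. The function name sliding_windows is made up for illustration, and it assumes window starts are aligned to multiples of the slide counted from the Unix epoch, with windows closed at the start and open at the end.

```python
def sliding_windows(ts, size, slide):
    """Return the [start, end) windows of length size, starting every slide seconds, that contain ts."""
    windows = []
    start = ts - (ts % slide)  # latest aligned window start at or before ts
    while start > ts - size:   # the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# One of the sample timestamps from the question, with a 30 s window and a 10 s slide.
windows = sliding_windows(1532609003, size=30, slide=10)
```

Each event falls into size / slide = 3 windows here, which is why a sliding-window aggregation counts the same row several times, once per overlapping window.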
You should use the window in this way:
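from pyspark.sql import functions as F
# words is assumed to be a streaming DataFrame with date_time and value columns
words = words.withColumn('date_time', F.col('date_time').cast('timestamp'))
w = F.window('date_time', '30 seconds', '10 seconds')
# The watermark must be set on the same event-time column the window uses
words = words \
    .withWatermark('date_time', '1 minutes') \
    .groupBy(w).agg(F.mean('value'))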
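A time window aggregates rows per window; it does not return the previous row the way lag does. What the question asks for can be sketched in plain Python as a per-id lookup of the last timestamp seen. This is only an illustration of the logic, not a Spark API: the name add_prev_timestamp and the diff field are made up, and in a streaming job the last_seen mapping would have to be kept somewhere that survives across micro-batches.

```python
def add_prev_timestamp(rows):
    """Attach each row's previous timestamp for the same id (None for the first row of an id)."""
    last_seen = {}  # id -> most recent timestamp seen so far
    result = []
    for row in sorted(rows, key=lambda r: (r["id"], r["timestamp"])):
        prev = last_seen.get(row["id"])
        diff = None if prev is None else row["timestamp"] - prev
        result.append({**row, "prev_timestamp": prev, "diff": diff})
        last_seen[row["id"]] = row["timestamp"]
    return result


# The sample records from data.json in the question.
rows = [
    {"id": 77, "type": "person", "timestamp": 1532609003},
    {"id": 77, "type": "person", "timestamp": 1532609005},
    {"id": 78, "type": "crane", "timestamp": 1532609005},
]
lagged = add_prev_timestamp(rows)
```

For the sample data, the second record of id 77 gets a previous timestamp of 1532609003 and a difference of 2 seconds, while the first record of each id has no previous value.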

Convert csv files to parquet on s3 using Spark structured streaming

I'm trying to create a Spark application that reads my CSV files from S3, converts them to Parquet files, and writes the results back to S3.
Every minute I get 8 new gzip-compressed CSV files (~60 MB each). Each row has ~200 columns, and ~99% of the rows have the same date (my partition column).
The cluster has 3 workers, each with 10 cores and 20 GB of memory.
Here is my code:
val spark = SparkSession
.builder()
.appName("Csv2Parquet")
.config("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("fs.s3a.access.key", "MY ACCESS KEY")
.config("fs.s3a.secret.key", "MY SECRET")
.config("spark.executor.memory", "15G")
.config("spark.driver.memory", "5G")
.getOrCreate()
import spark.implicits._
val schema= StructType(Array(
StructField("myDate", DateType, nullable=false),
StructField("myTimestamp", TimestampType, nullable=true),
...
...
...
StructField("myColumn200", StringType, nullable=true)
))
val df = spark.readStream
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "false")
.option("mode", "DROPMALFORMED")
.option("delimiter","\t")
.load("s3a://my-bucket/raw-data/*.gz")
.withColumn("myPartitionDate", $"myDate")
val query = df.repartition($"myPartitionDate").writeStream
.option("checkpointLocation", "/shared/checkpoints/csv2parquet")
.trigger(Trigger.ProcessingTime(60000))
.format("parquet")
.option("path", "s3a://my-bucket/parquet-data")
.partitionBy("myPartitionDate")
.start("s3a://my-bucket/parquet-data")
query.awaitTermination()
The problem is that only one task is responsible for writing the "main" partition (which holds 99% of the events) to S3, and that task takes ~4 minutes. How can I improve it?
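The skew follows from repartitioning on the date alone: all rows with the same key value go to the same task, so the largest key bounds the largest task. One commonly used mitigation, offered here as an assumption and not something from the original post, is to repartition on the date plus a small "salt" value so the dominant date is spread over several tasks. The plain-Python sketch below only illustrates the arithmetic; the function name, the sample dates, and the round-robin salt are hypothetical.

```python
from collections import Counter


def rows_per_shuffle_key(dates, salt_buckets=1):
    """Count rows per (date, salt) key; all rows sharing a key land in the same task."""
    counts = Counter()
    for index, date in enumerate(dates):
        salt = index % salt_buckets  # round-robin salt; a random salt would spread rows similarly
        counts[(date, salt)] += 1
    return counts


# Hypothetical batch in which 99% of the rows share one date, as described in the question.
dates = ["2019-01-01"] * 990 + ["2019-01-02"] * 10
largest_unsalted = max(rows_per_shuffle_key(dates).values())
largest_salted = max(rows_per_shuffle_key(dates, salt_buckets=10).values())
```

With ten salt values the largest key shrinks from 990 rows to 99. The trade-off is that each date directory then receives up to ten output files per batch instead of one.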
