Convert csv files to parquet on s3 using Spark structured streaming - apache-spark

I'm trying to create a Spark application that will read my csv files from s3, convert them to parquet files and write the results back to s3.
I receive 8 new csv files every minute, compressed with gzip (~60MB per gzip file); each row has ~200 columns, and ~99% of the rows share the same date (my partition column).
The cluster has 3 workers with 10 cores and 20 GB of memory each.
Here is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder()
  .appName("Csv2Parquet")
  .config("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("fs.s3a.access.key", "MY ACCESS KEY")
  .config("fs.s3a.secret.key", "MY SECRET")
  .config("spark.executor.memory", "15G")
  .config("spark.driver.memory", "5G")
  .getOrCreate()

import spark.implicits._

val schema = StructType(Array(
  StructField("myDate", DateType, nullable = false),
  StructField("myTimestamp", TimestampType, nullable = true),
  ...
  ...
  ...
  StructField("myColumn200", StringType, nullable = true)
))

val df = spark.readStream
  .format("com.databricks.spark.csv")
  .schema(schema)
  .option("header", "false")
  .option("mode", "DROPMALFORMED")
  .option("delimiter", "\t")
  .load("s3a://my-bucket/raw-data/*.gz")
  .withColumn("myPartitionDate", $"myDate")

val query = df.repartition($"myPartitionDate").writeStream
  .option("checkpointLocation", "/shared/checkpoints/csv2parquet")
  .trigger(Trigger.ProcessingTime(60000))
  .format("parquet")
  .option("path", "s3a://my-bucket/parquet-data")
  .partitionBy("myPartitionDate")
  .start()

query.awaitTermination()
The problem is that only one task ends up writing the "main" partition (the one with ~99% of the events) to S3, and that task alone takes ~4 minutes. How can I improve this?
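The single slow task comes from repartition($"myPartitionDate"): every row of the dominant date is hashed into the same shuffle partition, so one writer does ~99% of the work. A minimal sketch of one common workaround (not part of the original post; the partition count of 30 is an assumed value to tune against your target file size) is to repartition to a fixed number of partitions and let partitionBy handle the directory layout, so each task writes its own file into the date directory:

// Sketch: spread the write across N tasks instead of one task per date.
// N = 30 is an assumption; tune it against your data volume / file-size target.
val query = df
  .repartition(30)                        // round-robin over 30 shuffle partitions
  .writeStream
  .option("checkpointLocation", "/shared/checkpoints/csv2parquet")
  .trigger(Trigger.ProcessingTime(60000))
  .format("parquet")
  .option("path", "s3a://my-bucket/parquet-data")
  .partitionBy("myPartitionDate")         // still one directory per date
  .start()

The trade-off is more (smaller) Parquet files per date in each micro-batch.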

Related

How to create a stream using pyspark and kafka and read it row by row

I'm trying to use pyspark to read a kafka stream, and in later stages I will process each row and store it in InfluxDB. The problem is that pyspark does not appear to be reading the stream, and no errors are shown.
Nothing is printed, although in my code the foreach(show_data) is supposed to print 'test' for each row.
An example row of the stream sent by kafka is attached in the second picture.
Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("Kafka Pyspark Streaming")
    .master("local[*]")
    .getOrCreate()
)
spark.sparkContext.setLogLevel('ERROR')

# Read stream from json and fit schema
inputStream = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "SWAT")\
    .option("startingOffsets", "latest")\
    .load()
inputStream = inputStream.select(col("value").cast("string").alias("data"))
# inputStream.printSchema()
# inputStream = inputStream.selectExpr("CAST(value AS STRING)")
print(inputStream)

# Read stream and process
def show_data(row):
    print("test")

print("> Reading the stream and storing ...")
query = (inputStream
    .writeStream
    .outputMode("append")
    .foreach(show_data)
    .option("checkpointLocation", "checkpoints")
    .start())
query.awaitTermination()
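One thing worth noting as a general hint rather than a confirmed diagnosis: print() inside a foreach writer runs on the executors, so its output goes to the executor logs rather than the driver console, which can look like nothing is being read. A small sketch that surfaces the rows on the driver instead, using foreachBatch (names reused from the question; the handler body is illustrative):

# Sketch: foreachBatch hands each micro-batch to a function on the driver,
# so count()/show() output is visible in the console where the job started.
def show_batch(batch_df, batch_id):
    print(f"batch {batch_id}: {batch_df.count()} rows")
    batch_df.show(5, truncate=False)
    # ... process rows / write to InfluxDB here ...

query = (inputStream
    .writeStream
    .outputMode("append")
    .foreachBatch(show_batch)
    .option("checkpointLocation", "checkpoints")
    .start())
query.awaitTermination()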

Fetch dbfs files as a stream dataframe in databricks

I have a problem where I need to create an external table in Databricks for each CSV file that lands in an ADLS Gen2 storage account.
I thought about a solution where I would get a streaming dataframe from the dbutils.fs.ls() output and then call a function that creates a table inside forEachBatch().
I have the function ready, but I can't figure out a way to stream directory information into a streaming DataFrame. Does anyone have an idea how this could be achieved?
Kindly check the below code block.
package com.sparkbyexamples.spark.streaming

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SparkStreamingFromDirectory {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Schema of the incoming files
    val schema = StructType(
      List(
        StructField("Zipcode", IntegerType, true)
      )
    )

    // Stream every new file dropped into the directory
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")

    df.printSchema()

    val groupDF = df.select("Zipcode")
      .groupBy("Zipcode").count()
    groupDF.printSchema()

    // Print the running counts to the console on every trigger
    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
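Since the question is about CSV files rather than JSON, here is a rough sketch of the same idea adapted to that use case; the input path, checkpoint path and the createExternalTable helper are placeholders/assumptions, not an existing API:

// Sketch only: the paths and createExternalTable(...) are hypothetical.
import org.apache.spark.sql.DataFrame

val csvStream = spark.readStream
  .schema(schema)                          // schema of the incoming CSV files
  .option("header", "true")
  .csv("abfss://container@account.dfs.core.windows.net/landing/")

// Passing a typed function value avoids the foreachBatch overload ambiguity in Scala 2.12
val handleBatch: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  createExternalTable(batchDF, batchId)    // your existing table-creation logic
}

csvStream.writeStream
  .foreachBatch(handleBatch)
  .option("checkpointLocation", "dbfs:/mnt/checkpoints/csv-tables")
  .start()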

Processing json tabular data incoming from Kafka topics in Python

I have events streaming into multiple Kafka topics in the form of key:value jsons (without nested structure), for example:
event_1: {"name": "Alex", "age": 27, "hobby": "pc games"},
event_2: {"name": "Bob", "age": 33, "hobby": "swimming"},
event_3: {"name": "Charlie", "age": 12, "hobby": "collecting stamps"}
I am working in Python 3.7 and wish to consume a batch of events from those topics every 5 minutes, transform them into a dataframe, do some processing and enrichment with this data, and save the result to a csv file.
I'm new to Spark and searched for documentation to help me with this task but did not find any.
Is there any updated source of information recommended?
Also, if there is any other recommended Big Data framework that would suit this task, I'd love to hear about it.
Refer to the Triggers section of the Structured Streaming Programming Guide. There are 3 different types of trigger; the default is micro-batch, where a new micro-batch is generated as soon as the previous one has completed processing.
In your case you need fixed-interval micro-batches, where you specify the interval at which the query is triggered. The following code snippet does that.
# fixed interval trigger of 5 minutes
df.writeStream \
    .format("csv") \
    .option("header", True) \
    .option("path", "path/to/destination/dir") \
    .trigger(processingTime='5 minutes') \
    .start()
Brief code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema of the Kafka message
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("hobby", StringType(), True),
])

# Initialize spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Read Kafka topic and parse the JSON value using the schema
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "x.x.x.x:9092") \
    .option("startingOffsets", "latest") \
    .option("subscribe", "testdata") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select(col("data.*"))

# Do some transformation
df1 = df...

# Write the resultant dataframe as CSV files
df1.writeStream \
    .format("csv") \
    .option("header", True) \
    .option("path", "path/to/destination/dir") \
    .trigger(processingTime='5 minutes') \
    .start()
You can also repartition the final dataframe before writing it as CSV files, if needed, as shown below.
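A minimal sketch of that (the partition count and the checkpoint path are illustrative assumptions; the file sink also needs a checkpoint location):

# Sketch: cap the number of CSV files written per micro-batch.
df1.repartition(4) \
    .writeStream \
    .format("csv") \
    .option("header", True) \
    .option("path", "path/to/destination/dir") \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .trigger(processingTime='5 minutes') \
    .start()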

Spark Streaming Job is running very slow

I am running a spark streaming job locally and it is taking approximately 4 to 5 minutes for one batch. Can someone suggest what could be the issue with the below code?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType
from pyspark.sql.functions import window, from_json, from_unixtime, unix_timestamp
import uuid

schema = StructType([
    StructField("source", StringType(), True),
    StructField("temperature", FloatType(), True),
    StructField("time", StringType(), True)
])

spark = SparkSession \
    .builder.master("local[8]") \
    .appName("poc-app") \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)

df1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "poc") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

df2 = df1.select(from_json("value", schema).alias("sensors")).select("sensors.*")

df3 = df2.select(df2.source, df2.temperature, from_unixtime(unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))

df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"), df3.source).count()

query1 = df4.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
    .start()

query1.awaitTermination()
With micro-batch streaming you usually want to reduce the number of output partitions. Since you are doing an aggregation (a wide transformation), every shuffle will default to 200 partitions because of
spark.conf.get("spark.sql.shuffle.partitions")
Try lowering this config to a smaller number of output partitions and set it at the beginning of your code, so that when the aggregation is performed it only writes 5 partitions to disk:
spark.conf.set("spark.sql.shuffle.partitions", 5)
You can also get a feel for this by looking at the number of files in the output write stream directory, as well as by checking the number of partitions in your aggregated dataframe:
df3.rdd.getNumPartitions()
By the way, since you are using local mode for testing, make sure the master is set to something like local[8] rather than local[4] so you increase the parallelism across your CPU cores.

Get max, min of offset from Kafka dataframe

Below is how I'm reading data from Kafka.
val inputDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topic)
  .option("startingOffsets", """{"topic1":{"1":-1}}""")
  .load()
val df = inputDf.selectExpr("CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST (partition AS INT)", "CAST (offset AS INT)", "CAST (timestamp AS STRING)")
How can I get the max & min offsets and timestamp from the above dataframe? I want to save them to some external source for future reference. I cannot use the 'agg' function, as I'm writing the same dataframe to writeStream (as shown below).
val kafkaOutput = df.writeStream
  .outputMode("append")
  .option("path", "/warehouse/download/data1")
  .format("console")
  .option("checkpointLocation", checkpoint_loc)
  .start()
  .awaitTermination()
If you can upgrade your Spark version to 2.4.0, you will be able to solve this issue.
In Spark 2.4.0 you have the foreachBatch API, through which you can write the same DataFrame to multiple sinks.
df.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => someFun(batchDF) }.start()
where someFun(batchDF) persists the DataFrame and performs the aggregation.
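A rough sketch of that idea for this question, where the offset-summary path, output formats and aggregation details are assumptions rather than part of the original answer:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{max, min}

// Sketch: per micro-batch, write the payload once and an offset summary once.
val handleBatch: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.persist()

  // min/max offset and timestamp per topic/partition in this micro-batch
  batchDF.groupBy("topic", "partition")
    .agg(min("offset"), max("offset"), min("timestamp"), max("timestamp"))
    .write.mode("append")
    .json("/warehouse/download/offset-ranges")   // assumed location for the summary

  // the original data
  batchDF.write.mode("append").parquet("/warehouse/download/data1")

  batchDF.unpersist()
}

df.writeStream
  .foreachBatch(handleBatch)
  .option("checkpointLocation", checkpoint_loc)
  .start()
  .awaitTermination()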
