How to read data from a csv file as a stream - apache-spark

I have the following table:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I am able to sort the entries as a batch process.
scala> dataDS.sort(col("count")).show(100);
I now want to see whether I can do the same using streaming. To do this, I suppose I will have to read the file as a stream.
scala> val staticSchema = dataDS.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.
| readStream.
| schema(staticSchema).
| option("header","true").
| csv("data/flight-data/csv/2015-summary.csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> dataStream.isStreaming;
res245: Boolean = true
But I am not able to progress further w.r.t. how to read the data as a stream.
I have executed the sort transformation:
scala> dataStream.sort(col("count"));
res246: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I suppose now I should use Dataset's writeStream method. I ran the following two commands but both returned errors.
scala> dataStream.sort(col("count")).writeStream.
| format("memory").
| queryName("sorted_data").
| outputMode("complete").
| start();
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
and this one
scala> dataStream.sort(col("count")).writeStream.
| format("memory").
| queryName("sorted_data").
| outputMode("append").
| start();
org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
From the errors, it seems I should be aggregating (group) data but I thought I don't need to do it as I can run any batch operation as a stream.
How can I understand how to sort data which arrives as a stream?

Unfortunately, what the error messages tell you is accurate:
Sorting is supported only in Complete output mode (i.e. when every trigger emits the complete result table).
Complete mode requires an aggregation (otherwise it would require unbounded memory; see "Why does Complete output mode require aggregation?").
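To make the two constraints above concrete, here is a minimal plain-Scala model. It is not Spark code: each inner Seq stands in for one micro-batch, and the names SortModes, appendSorted and completeSorted are made up for illustration. It shows that sorting each batch and appending cannot give a globally sorted output, while re-emitting an aggregated table after every batch can.

```scala
// Plain-Scala model, not Spark code: each inner Seq stands for one micro-batch
// of (country, count) rows. All names here are illustrative assumptions.
object SortModes {
  type Row = (String, Int)

  // Append-style output: every batch is sorted on its own and emitted once,
  // so rows that were already written are never reordered.
  def appendSorted(batches: Seq[Seq[Row]]): Seq[Row] =
    batches.flatMap(_.sortBy(_._2))

  // Complete-style output: keep a running total per key and, after each batch,
  // re-sort and re-emit the whole result table (one row per key).
  def completeSorted(batches: Seq[Seq[Row]]): Seq[Seq[Row]] =
    batches
      .scanLeft(Map.empty[String, Int]) { (table, batch) =>
        batch.foldLeft(table) { case (t, (k, v)) => t.updated(k, t.getOrElse(k, 0) + v) }
      }
      .tail // drop the initial empty table
      .map(_.toSeq.sortBy(_._2))

  val batches: Seq[Seq[Row]] = Seq(
    Seq("Romania" -> 15, "Ireland" -> 344),
    Seq("Croatia" -> 1, "Romania" -> 5)
  )
}
```

In this model the appended output ends with Croatia (1) after Ireland (344), because the first batch was already emitted when the second one arrived. The last table from completeSorted is ordered, but only because the whole table is rewritten each time, which is affordable only when the table is bounded by the number of keys.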
The point you make:
but I thought I don't need to do it as I can run any batch operation as a stream.
is not without merit, but it misses a fundamental point, that Structured Streaming is not tightly bound to micro-batching.
One could easily come up with some unscalable hack
import org.apache.spark.sql.functions._
dataStream
.withColumn("time", window(current_timestamp, "5 minute")) // Some time window
.withWatermark("time", "0 seconds") // Immediate watermark
.groupBy("time")
.agg(sort_array(collect_list(struct($"count", $"DEST_COUNTRY_NAME", $"ORIGIN_COUNTRY_NAME"))).as("data"))
.withColumn("data", explode($"data"))
.select($"data.*")
.select(dataStream.columns.map(col): _*) // restore the original column order
.writeStream
.outputMode("append")
...
.start()
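For comparison with the hack above, this is a sketch of the combination the error messages do allow: aggregate first, then sort, and write in Complete mode. It is only an illustration under assumptions: the object name, the temporary input directory, the sum-per-destination aggregation and the sorted_totals query name are all made up here, and the four rows are the ones from the question.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SortedAggregateStream {
  // Returns (destination, total) pairs in the order the memory sink holds them.
  def run(): Seq[(String, Long)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sorted-aggregate-stream")
      .config("spark.sql.shuffle.partitions", "2") // keep the local run light
      .getOrCreate()

    // Hypothetical input: a temp directory holding the question's four rows.
    val inputDir = Files.createTempDirectory("flight-data")
    val csv =
      """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
        |United States,Romania,15
        |United States,Croatia,1
        |United States,Ireland,344
        |Egypt,United States,15
        |""".stripMargin
    Files.write(inputDir.resolve("2015-summary.csv"), csv.getBytes(StandardCharsets.UTF_8))

    val schema = StructType(Seq(
      StructField("DEST_COUNTRY_NAME", StringType),
      StructField("ORIGIN_COUNTRY_NAME", StringType),
      StructField("count", IntegerType)))

    // Aggregate first; sorting an aggregated stream is allowed in Complete mode.
    val totals = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv(inputDir.toString) // the streaming file source reads a directory
      .groupBy("DEST_COUNTRY_NAME")
      .agg(sum("count").as("total"))
      .sort(col("total").desc)

    val query = totals.writeStream
      .format("memory")
      .queryName("sorted_totals")
      .outputMode("complete")
      .start()

    query.processAllAvailable() // block until the available file has been processed
    val rows = spark.table("sorted_totals").collect()
      .map(r => (r.getString(0), r.getLong(1))).toSeq
    query.stop()
    rows
  }
}
```

Note that this sorts the aggregated totals, not the raw rows, so it answers a slightly different question than the batch sort. That difference is exactly the restriction the error messages describe.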

Related

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql version 2.4.1. I am able to process the data using Structured Streaming. Once the data is received and processed, I need to save it into the respective parquet files in an HDFS store.
I am able to store and read the parquet files. I kept a trigger time of 15 seconds to 1 minute. These files are very small, which results in many files.
These parquet files need to be read later by Hive queries.
So:
1) Does this strategy work in a production environment, or does it lead to a small file problem later?
2) What are the best practices (i.e. the industry standard) for handling/designing this kind of scenario?
3) How are these kinds of things generally handled in production?
Thank you.
I know this question is old, but I had a similar problem and used Spark Structured Streaming query listeners to solve it.
My use case is fetching data from Kafka and storing it in HDFS with year, month, day and hour partitions.
The code below takes the previous hour's partition data, repartitions it and overwrites the data in the existing partition.
val session = SparkSession.builder().master("local[2]").enableHiveSupport().getOrCreate()
session.streams.addListener(AppListener(config,session))
class AppListener(config: Config,spark: SparkSession) extends StreamingQueryListener {
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
this.synchronized {AppListener.mergeFiles(event.progress.timestamp,spark,config)}
}
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
}
object AppListener {
def mergeFiles(currentTs: String,spark: SparkSession,config:Config):Unit = {
val configs = config.kafka(config.key.get)
if(currentTs.datetime.isAfter(Processed.ts.plusMinutes(5))) {
println(
s"""
|Current Timestamp : ${currentTs}
|Merge Files : ${Processed.ts.minusHours(1)}
|
|""".stripMargin)
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val ts = Processed.ts.minusHours(1)
val hdfsPath = s"${configs.hdfsLocation}/year=${ts.getYear}/month=${ts.getMonthOfYear}/day=${ts.getDayOfMonth}/hour=${ts.getHourOfDay}"
val path = new Path(hdfsPath)
if(fs.exists(path)) {
val hdfsFiles = fs.listLocatedStatus(path)
.filter(lfs => lfs.isFile && !lfs.getPath.getName.contains("_SUCCESS"))
.map(_.getPath).toList
println(
s"""
|Total files in HDFS location : ${hdfsFiles.length}
| ${hdfsFiles.length > 1}
|""".stripMargin)
if(hdfsFiles.length > 1) {
println(
s"""
|Merge Small Files
|==============================================
|HDFS Path : ${hdfsPath}
|Total Available files : ${hdfsFiles.length}
|Status : Running
|
|""".stripMargin)
val df = spark.read.format(configs.writeFormat).load(hdfsPath).cache()
df.repartition(1)
.write
.format(configs.writeFormat)
.mode("overwrite")
.save(s"/tmp${hdfsPath}")
df.cache().unpersist()
spark
.read
.format(configs.writeFormat)
.load(s"/tmp${hdfsPath}")
.write
.format(configs.writeFormat)
.mode("overwrite")
.save(hdfsPath)
Processed.ts = Processed.ts.plusHours(1).toDateTime("yyyy-MM-dd'T'HH:00:00")
println(
s"""
|Merge Small Files
|==============================================
|HDFS Path : ${hdfsPath}
|Total files : ${hdfsFiles.length}
|Status : Completed
|
|""".stripMargin)
}
}
}
}
def apply(config: Config,spark: SparkSession): AppListener = new AppListener(config,spark)
}
object Processed {
var ts: DateTime = DateTime.now(DateTimeZone.forID("UTC")).toDateTime("yyyy-MM-dd'T'HH:00:00")
}
Sometimes the data is huge, so I divide it into multiple files using the logic below. The file size will be around 160 MB.
val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
val dataSize = bytes.toLong
val numPartitions = (bytes.toLong./(1024.0)./(1024.0)./(10240)).ceil.toInt
df.repartition(if(numPartitions == 0) 1 else numPartitions)
.[...]
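As a side note on the arithmetic: as far as I can tell, the divisor in the snippet above (1024 * 1024 * 10240 bytes) works out to about 10 GiB per partition rather than 160 MB. The sketch below takes the target file size as an explicit parameter instead. PartitionSizing and numPartitions are hypothetical names, and the 160 MiB default is only taken from the file size mentioned above. Keep in mind that sizeInBytes from the plan statistics is an estimate, so the resulting files will not match the target exactly.

```scala
object PartitionSizing {
  // Number of partitions needed so that each holds at most `targetBytes`
  // (by the size estimate); never fewer than one.
  def numPartitions(sizeInBytes: BigInt, targetBytes: Long = 160L * 1024 * 1024): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    val n = (sizeInBytes + targetBytes - 1) / targetBytes // ceiling division
    n.max(1).toInt
  }
}
```

With a helper like this, the repartition call would read df.repartition(PartitionSizing.numPartitions(bytes)), where bytes is the BigInt computed from the plan statistics as in the answer.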
Edit-1
Using spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes, we can get the size of the actual DataFrame once it is loaded into memory. For example, see the code below.
scala> val df = spark.read.format("orc").load("/tmp/srinivas/")
df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string ... 75 more fields]
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils
scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
bytes: BigInt = 763275709
scala> FileUtils.byteCountToDisplaySize(bytes.toLong)
res5: String = 727 MB
scala> import sys.process._
import sys.process._
scala> "hdfs dfs -ls -h /tmp/srinivas/".!
Found 2 items
-rw-r----- 3 svcmxns hdfs 0 2020-04-20 01:46 /tmp/srinivas/_SUCCESS
-rw-r----- 3 svcmxns hdfs 727.4 M 2020-04-20 01:46 /tmp/srinivas/part-00000-9d0b72ea-f617-4092-ae27-d36400c17917-c000.snappy.orc
res6: Int = 0
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is what we now do.
As an aside: there is a limit to what you can do here anyway as the more parallelism you have, the greater the number of files because each executor thread writes to its own file. They never write to a shared file. This appears to be the nature of the beast that is parallel processing.
This is a common, burning question in Spark streaming with no fixed answer.
I took an unconventional approach based on the idea of append.
As you are using Spark 2.4.1, this solution will be helpful.
If append were supported in columnar file formats like parquet or orc, it would be easier: new data could be appended to the same file, and the file would grow bigger after every micro-batch.
However, as it is not supported, I took a versioning approach to achieve this. After every micro-batch, the data is produced with a version partition.
e.g.
/prod/mobility/cdr_data/date=01-01-2010/version=12345/file1.parquet
/prod/mobility/cdr_data/date=01-01-2010/version=23456/file1.parquet
In every micro-batch, we read the old version's data, union it with the new streaming data and write it again at the same path with a new version. Then we delete the old versions. This way, after every micro-batch there is a single version and a single file in every partition. The files in each partition keep growing.
As a union of a streaming dataset and a static dataset isn't allowed, we can use the foreachBatch sink (available in Spark >= 2.4.0) to convert the streaming dataset to a static dataset.
I have described how to achieve this optimally in the link below. You might want to have a look.
https://medium.com/#kumar.rahul.nitk/solving-small-file-problem-in-spark-structured-streaming-a-versioning-approach-73a0153a0a
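The versioning idea can be sketched roughly as follows. This is not the code from the linked article: VersionedSink, mergeBatch and the single partition path are hypothetical, and the sketch ignores things a production job would need, such as a batch being retried after a failure or readers querying the directory while a version is being swapped.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

object VersionedSink {
  // Hypothetical helper: fold one micro-batch into a fresh version directory
  // under `partitionPath`, then delete the versions it replaces.
  def mergeBatch(spark: SparkSession, partitionPath: String)(batch: DataFrame, batchId: Long): Unit = {
    val root = new Path(partitionPath)
    val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val current = s"version=$batchId"

    // Version directories written by earlier micro-batches, if any.
    val previous: Seq[Path] =
      if (fs.exists(root))
        fs.listStatus(root).map(_.getPath)
          .filter(p => p.getName.startsWith("version=") && p.getName != current)
          .toSeq
      else Seq.empty

    // Inside foreachBatch the batch is a static DataFrame, so a union with
    // previously written (static) data is allowed.
    val merged = previous.foldLeft(batch) { (acc, p) =>
      acc.unionByName(spark.read.parquet(p.toString))
    }

    // Write one file for the new version first; remove old versions afterwards.
    merged.coalesce(1).write.mode("overwrite").parquet(s"$partitionPath/$current")
    previous.foreach(p => fs.delete(p, true))
  }
}

// Usage sketch (`stream` is a hypothetical streaming DataFrame):
//   val sink: (DataFrame, Long) => Unit = VersionedSink.mergeBatch(spark, "path/to/partition") _
//   stream.writeStream.foreachBatch(sink).option("checkpointLocation", "path/to/checkpoint/dir").start()
```

The function is passed as a typed value instead of an inline lambda because foreachBatch is overloaded for Scala and Java callers, and an inline lambda can be ambiguous on some Scala versions. Note that every micro-batch rewrites the whole partition, so the cost per batch grows with the partition size.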
You can set a trigger.
df.writeStream
.format("parquet")
.option("checkpointLocation", "path/to/checkpoint/dir")
.option("path", "path/to/destination/dir")
.trigger(Trigger.ProcessingTime("30 seconds"))
.start()
The longer the trigger interval, the larger the files.
Alternatively, you could run the job with a scheduler (e.g. Airflow) and the trigger Trigger.Once() or, better, Trigger.AvailableNow(). The job then runs only once per period and processes all the available data with an appropriate file size.

what is the use of _spark_metadata directory

I am trying to get my head around how streaming works in spark.
I have a file in a /data/flight-data/csv/ directory. It has the following data:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
I thought to test what will happen if I read the file as a stream instead of as a batch. I first created a Dataframe using read
scala> val dataDF = spark.read.option("inferSchema","true").option("header","true").csv("data/flight-data/csv/2015-summary.csv");
[Stage 0:> dataDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
then took the schema from it and created a new Dataframe
scala> val staticSchema = dataDF.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.readStream.schema(staticSchema).option("header","true").csv("data/flight-data/csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
Then I started the stream. The path for the checkpoint and the output (I suppose) is the "/home/manu/test" directory, which is initially empty.
scala> dataStream.writeStream.option("checkpointLocation","home/manu/test").start("/home/manu/test");
res5: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5c7df5f1
The return value of start is a StreamingQuery, which, from what I read, is "a handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe."
I notice that the directory now contains a _spark_metadata directory, but there is nothing else.
Question 1 - What is the _spark_metadata directory? I notice it is empty. What is it used for?
Question 2 - I don't see anything else happening. Is it because I am not running any query on the Dataframe dataStream (or shall I say that the query isn't doing anything useful)?

Splitting Kafka Message Line by line in Spark Structured Streaming

I want to read messages from a Kafka topic into a data frame in my Spark Structured Streaming job, but I am getting the entire message in one offset, so in the data frame the message comes in as one row instead of multiple rows (in my case, 3 rows).
When I print this message I get the output below:
I want the messages "Text1", "Text2" and "Text3" in 3 rows in the data frame so that I can process them further.
Please help me.
You can use a user-defined function (UDF) to convert the message string into a sequence of strings, and then apply the explode function on that column to create a new row for each element in the sequence,
as illustrated below (in Scala; the same principle applies to PySpark):
case class KafkaMessage(offset: Long, message: String)
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.explode
val df = sc.parallelize(List(KafkaMessage(1000, "Text1\nText2\nText3"))).toDF()
val splitString = udf { s: String => s.split('\n') }
df.withColumn("splitMsg", explode(splitString($"message")))
.select("offset", "splitMsg")
.show()
This will yield the following output:
+------+--------+
|offset|splitMsg|
+------+--------+
| 1000| Text1|
| 1000| Text2|
| 1000| Text3|
+------+--------+

Sequential processing within a Spark batch

I have a question about sequential processing within a Spark batch. To keep it simple, here is a stylized version of the question I am trying to get an answer to.
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Simple Dataframe Processing")
.config("spark.some.config.option", "some-value")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val df = spark.read.json("devices.json")
// Displays the content of the DataFrame to stdout
df.show()
// +-------------------------+
// | device-guid| Operation|
// +----+-------+-------------
// |1234 | Add 3 |
// |1234 | Sub 3 |
// |1234 | Add 2 |
// |1234 | Sub 2 |
// |1234 | Add 1 |
// |1234 | Sub 1 |
// +----+-------+------------+
//I have a Database with one table with following columns
// device-guid (primary key) result
//I would like to take df and for each row in the df do a update operation to a single DB row, Adding or removing number as described in Operation column
//So the result I am expecting at the end of this in the DB is a single row with
// device-guid result
// 1234 0
df.foreach { row =>
UpdateDB(row) //Update the DB with the row's Operation.
//Actual method not shown
}
Let us say I run this in a Spark cluster on YARN with 5 executors with 2 cores each across 5 worker nodes.
What in Spark guarantees that the UpdateDB operation is scheduled and executed in the sequence of the rows in the dataframe and is never scheduled and executed in parallel?
i.e. I always want to get an answer of 0 in the result column in my DB.
The question in the larger sense is 'what guarantees sequential processing of operations on a dataframe even with multiple executors and cores'?
Can you point me to the Spark documentation that indicates that these tasks will be processed in sequence?
Is there any Spark property that needs to be set for this to work?
Regards,
Venkat
The question in the larger sense is 'what guarantees sequential processing of operations on a dataframe even with multiple executors and cores'?
Nothing, except having no parallelism at all, for example by having only a single partition.
A single core might have a similar effect, but it does not guarantee a specific order of chunks.
If you really need sequential processing, then you're using the wrong tool for the job.

Why does Complete output mode require aggregation?

I work with the latest Structured Streaming in Apache Spark 2.2 and got the following exception:
org.apache.spark.sql.AnalysisException: Complete output mode not
supported when there are no streaming aggregations on streaming
DataFrames/Datasets;;
Why does Complete output mode require a streaming aggregation? What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?
scala> spark.version
res0: String = 2.2.0
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]
val ids = source.toDS.toDF("time", "id").
withColumn("time", $"time" cast "timestamp"). // <-- convert time column from Int to Timestamp
dropDuplicates("id").
withColumn("time", $"time" cast "long") // <-- convert time column back from Timestamp to Int
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
scala> val q = ids.
| writeStream.
| format("memory").
| queryName("dups").
| outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
| trigger(Trigger.ProcessingTime(30.seconds)).
| option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
| start
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
Project [cast(time#10 as bigint) AS time#15L, id#6]
+- Deduplicate [id#6], true
+- Project [cast(time#5 as timestamp) AS time#10, id#6]
+- Project [_1#2 AS time#5, _2#3 AS id#6]
+- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:247)
... 57 elided
From the Structured Streaming Programming Guide - other queries (excluding aggregations, mapGroupsWithState and flatMapGroupsWithState):
Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.
To answer the question:
What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?
Probably OOM.
The puzzling part is why dropDuplicates("id") is not marked as aggregation.
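The memory argument can be illustrated with a small plain-Scala model. This is not Spark code, and ResultTableGrowth and its fields are made-up names. Without an aggregation, a "complete" result table has to retain every input row, whereas with an aggregation it holds one row per key however many events arrive.

```scala
// Plain-Scala model, not Spark code; all names are illustrative assumptions.
object ResultTableGrowth {
  // One (time, id) row per event; only 3 distinct ids ever appear.
  val events: Seq[(Int, Int)] = (1 to 1000).map(t => (t, t % 3))

  // No aggregation: the full result table is every row seen so far,
  // so it grows with the stream.
  val unaggregated: Seq[(Int, Int)] =
    events.foldLeft(Vector.empty[(Int, Int)])(_ :+ _)

  // Aggregation (count per id): the table has one row per key.
  val countsById: Map[Int, Int] =
    events.foldLeft(Map.empty[Int, Int]) { case (table, (_, id)) =>
      table.updated(id, table.getOrElse(id, 0) + 1)
    }
}
```

In this model the unaggregated table ends up as large as the input, while the aggregated one stays at three rows. That bounded table is what makes re-emitting the complete result on every trigger feasible.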
I think the problem is the output mode. Instead of using OutputMode.Complete, use OutputMode.Append as shown below.
scala> val q = ids
.writeStream
.format("memory")
.queryName("dups")
.outputMode(OutputMode.Append)
.trigger(Trigger.ProcessingTime(30.seconds))
.option("checkpointLocation", "checkpoint-dir")
.start
