In my spark application I read ONCE a directory with many CSVs.
But, in the DAG I see multiple CSV reads.
Why the spark reads multiple times the CSVs or it's not a real representation; and actually Spark reads them once.
Spark UI Screenshot:
Spark will read them multiple times if the DataFrame is not cached.
val df1 = spark.read.csv("path")
val df2_result = df1.filter(.......).save(......)
val df3_result = df1.map(....).groupBy(...).save(......)
Here df2_result and df3_result both will cause df1 to be rebuilt from csv files.
To avoid this you can cache like this. DF1 will built once from csv and the 2nd time it will not be build from files.
val df1 = spark.read.csv("path")
df1.cache()
val df2_result = df1.filter(.......).save(......)
val df3_result = df1.map(....).groupBy(...).save(......)
Related
I want to read the RDD[String] using the spark CSV reader. The reason I am doing this is, I need to filter some records before using the CSV reader.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file")
I need to read the fileRDD using the spark CSV reader. I wish not to commit the file as it increases the IO of the HDFS. I have looked into the options we have in the spark CSV, but didn't found any.
spark.read.csv(file)
Sample Data
PHM|MERC|PHARMA|BLUEDRUG|50
CLM|BSH|CLAIM|VISIT|HSA|EMPLOYER|PAID|250
PHM|GSK|PHARMA|PARAC|70
CLM|UHC|CLAIM|VISIT|HSA|PERSONAL|PAID|72
As you can see all the records starts with PHM has different number of columns and clm has different number of columns. That is the reason i am filtering and then applying schema. PHM and CLM records has different schema.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file").filter(_.startWith("PHM"))
spark.read.option(schema,"phcschema").csv(fileRDD.toDS())
Since Spark 2.2, method ".csv" can read dataset of strings. Can be implemented in this way:
val rdd: RDD[String] = spark.sparkContext.textFile("csv.txt")
// ... do filtering
spark.read.csv(rdd.toDS())
I have a scenario in my project , where I am reading the kafka topic messages using spark-sql-2.4.1 version. I am able to process the day using structured streaming. Once the data is received and after processed I need to save the data into respective parquet files in hdfs store.
I am able to store and read parquet files, I kept a trigger time of 15 seconds to 1 minutes. These files are very small in size hence resulting into many files.
These parquet files need to be read latter by hive queries.
So
1) Is this strategy works in production environment ? or does it lead to any small file problem later ?
2) What are the best practices to handle/design this kind of scenario i.e. industry standard ?
3) How these kind of things generally handled in Production?
Thank you.
I know this question is too old. I had similar problem & I have used spark structured streaming query listeners to solve this problem.
My use case is fetching data from kafka & storing in hdfs with year, month, day & hour partitions.
Below code will take previous hour partition data, apply repartitioning & overwrite data in existing partition.
val session = SparkSession.builder().master("local[2]").enableHiveSupport().getOrCreate()
session.streams.addListener(AppListener(config,session))
class AppListener(config: Config,spark: SparkSession) extends StreamingQueryListener {
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
this.synchronized {AppListener.mergeFiles(event.progress.timestamp,spark,config)}
}
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
}
object AppListener {
def mergeFiles(currentTs: String,spark: SparkSession,config:Config):Unit = {
val configs = config.kafka(config.key.get)
if(currentTs.datetime.isAfter(Processed.ts.plusMinutes(5))) {
println(
s"""
|Current Timestamp : ${currentTs}
|Merge Files : ${Processed.ts.minusHours(1)}
|
|""".stripMargin)
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val ts = Processed.ts.minusHours(1)
val hdfsPath = s"${configs.hdfsLocation}/year=${ts.getYear}/month=${ts.getMonthOfYear}/day=${ts.getDayOfMonth}/hour=${ts.getHourOfDay}"
val path = new Path(hdfsPath)
if(fs.exists(path)) {
val hdfsFiles = fs.listLocatedStatus(path)
.filter(lfs => lfs.isFile && !lfs.getPath.getName.contains("_SUCCESS"))
.map(_.getPath).toList
println(
s"""
|Total files in HDFS location : ${hdfsFiles.length}
| ${hdfsFiles.length > 1}
|""".stripMargin)
if(hdfsFiles.length > 1) {
println(
s"""
|Merge Small Files
|==============================================
|HDFS Path : ${hdfsPath}
|Total Available files : ${hdfsFiles.length}
|Status : Running
|
|""".stripMargin)
val df = spark.read.format(configs.writeFormat).load(hdfsPath).cache()
df.repartition(1)
.write
.format(configs.writeFormat)
.mode("overwrite")
.save(s"/tmp${hdfsPath}")
df.cache().unpersist()
spark
.read
.format(configs.writeFormat)
.load(s"/tmp${hdfsPath}")
.write
.format(configs.writeFormat)
.mode("overwrite")
.save(hdfsPath)
Processed.ts = Processed.ts.plusHours(1).toDateTime("yyyy-MM-dd'T'HH:00:00")
println(
s"""
|Merge Small Files
|==============================================
|HDFS Path : ${hdfsPath}
|Total files : ${hdfsFiles.length}
|Status : Completed
|
|""".stripMargin)
}
}
}
}
def apply(config: Config,spark: SparkSession): AppListener = new AppListener(config,spark)
}
object Processed {
var ts: DateTime = DateTime.now(DateTimeZone.forID("UTC")).toDateTime("yyyy-MM-dd'T'HH:00:00")
}
Sometime data is huge & I have divided data into multiple files using below logic. File size will be around ~160 MB
val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
val dataSize = bytes.toLong
val numPartitions = (bytes.toLong./(1024.0)./(1024.0)./(10240)).ceil.toInt
df.repartition(if(numPartitions == 0) 1 else numPartitions)
.[...]
Edit-1
Using this - spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of actual Dataframe once its loaded into memory, for example you can check below code.
scala> val df = spark.read.format("orc").load("/tmp/srinivas/")
df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string ... 75 more fields]
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils
scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
bytes: BigInt = 763275709
scala> FileUtils.byteCountToDisplaySize(bytes.toLong)
res5: String = 727 MB
scala> import sys.process._
import sys.process._
scala> "hdfs dfs -ls -h /tmp/srinivas/".!
Found 2 items
-rw-r----- 3 svcmxns hdfs 0 2020-04-20 01:46 /tmp/srinivas/_SUCCESS
-rw-r----- 3 svcmxns hdfs 727.4 M 2020-04-20 01:46 /tmp/srinivas/part-00000-9d0b72ea-f617-4092-ae27-d36400c17917-c000.snappy.orc
res6: Int = 0
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is what we now do.
As an aside: there is a limit to what you can do here anyway as the more parallelism you have, the greater the number of files because each executor thread writes to its own file. They never write to a shared file. This appears to be the nature of the beast that is parallel processing.
This is a common burning question of spark streaming with no any fixed answer.
I took an unconventional approach which is based on idea of append.
As you are using spark 2.4.1, this solution will be helpful.
So, if append were supported in columnar file format like parquet or orc, it would have been just easier as the new data could be appended in same file and file size can get on bigger and bigger after every micro-batch.
However, as it is not supported, I took versioning approach to achieve this. After every micro-batch, the data is produced with a version partition.
e.g.
/prod/mobility/cdr_data/date=01–01–2010/version=12345/file1.parquet
/prod/mobility/cdr_data/date=01–01–2010/version=23456/file1.parquet
What we can do is that, in every micro-batch, read the old version data, union it with the new streaming data and write it again at the same path with new version. Then, delete old versions. In this way after every micro-batch, there will be a single version and single file in every partition. The size of files in each partition will keep on growing and get bigger.
As union of streaming dataset and static dataset isn't allowed, we can use forEachBatch sink (available in spark >=2.4.0) to convert streaming dataset to static dataset.
I have described how to achieve this optimally in the link. You might want to have a look.
https://medium.com/#kumar.rahul.nitk/solving-small-file-problem-in-spark-structured-streaming-a-versioning-approach-73a0153a0a
You can set a trigger.
df.writeStream
.format("parquet")
.option("checkpointLocation", "path/to/checkpoint/dir")
.option("path", "path/to/destination/dir")
.trigger(Trigger.ProcessingTime("30 seconds"))
.start()
The larger the trigger size, the larger the file size.
Or optionally you could run the job with a scheduler(e.g. Airflow) and a trigger Trigger.Once() or better Trigger.AvailableNow(). It runs a the job only once a period and process all data with appropriate file size.
I have Spark Streaming job which reads data from kafka partitions (one executor per partition).
I need to save transformed values to HDFS, but need to avoid empty files creation.
I tried to use isEmpty, but this doesn't help when not all partitions are empty.
P.S. repartition is not an acceptable solution due to perfomance degradation.
The code works for PairRDD only.
Code for text:
val conf = ssc.sparkContext.hadoopConfiguration
conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
classOf[TextOutputFormat[Text, NullWritable]]
classOf[OutputFormat[Text, NullWritable]])
kafkaRdd.map(_.value -> NullWritable.get)
.saveAsNewAPIHadoopFile(basePath,
classOf[Text],
classOf[NullWritable],
classOf[LazyOutputFormat[Text, NullWritable]],
conf)
Code for avro:
val avro: RDD[(AvroKey[MyEvent], NullWritable)]) = ....
val conf = ssc.sparkContext.hadoopConfiguration
conf.set("avro.schema.output.key", MyEvent.SCHEMA$.toString)
conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
classOf[AvroKeyOutputFormat[MyEvent]],
classOf[OutputFormat[AvroKey[MyEvent], NullWritable]])
avro.saveAsNewAPIHadoopFile(basePath,
classOf[AvroKey[MyEvent]],
classOf[NullWritable],
classOf[LazyOutputFormat[AvroKey[MyEvent], NullWritable]],
conf)
I am running a Spark (2.0.1) job with multiple stages. I noticed that when I insert a cache() in one of later stages it changes the execution time of earlier stages. Why? I've never encountered such a case in literature when reading about caching().
Here is my DAG with cache():
And here is my DAG without cache(). All remaining code is the same.
I have a cache() after a sort merge join in Stage10. If the cache() is used in Stage10 then Stage8 is nearly twice longer (20 min vs 11 min) then if there were no cache() in Stage10. Why?
My Stage8 contains two broadcast joins with small DataFrames and a shuffle on a large DataFrame in preparation for the merge join. Stages8 and 9 are independent and operate on two different DataFrames.
Let me know if you need more details to answer this question.
UPDATE 8/2/1018
Here are the details of my Spark script:
I am running my job on a cluster via spark-submit. Here is my spark session.
val spark = SparkSession.builder
.appName("myJob")
.config("spark.executor.cores", 5)
.config("spark.driver.memory", "300g")
.config("spark.executor.memory", "15g")
.getOrCreate()
This creates a job with 21 executors with 5 cpu each.
Load 4 DataFrames from parquet files:
val dfT = spark.read.format("parquet").load(filePath1) // 3 Tb in 3185 partitions
val dfO = spark.read.format("parquet").load(filePath2) // ~ 700 Mb
val dfF = spark.read.format("parquet").load(filePath3) // ~ 800 Mb
val dfP = spark.read.format("parquet").load(filePath4) // 38 Gb
Preprocessing on each of the DataFrames is composed of column selection and dropDuplicates and possible filter like this:
val dfT1 = dfT.filter(...)
val dfO1 = dfO.select(columnsToSelect2).dropDuplicates(Array("someColumn2"))
val dfF1 = dfF.select(columnsToSelect3).dropDuplicates(Array("someColumn3"))
val dfP1 = dfP.select(columnsToSelect4).dropDuplicates(Array("someColumn4"))
Then I left-broadcast-join together first three DataFrames:
val dfTO = dfT1.join(broadcast(dfO1), Seq("someColumn5"), "left_outer")
val dfTOF = dfTO.join(broadcast(dfF1), Seq("someColumn6"), "left_outer")
Since the dfP1 is large I need to do a merge join, I can't afford it to do it now. I need to limit the size of dfTOF first. To do that I add a new timestamp column which is a withColumn with a UDF which transforms a string into a timestamp
val dfTOF1 = dfTOF.withColumn("TransactionTimestamp", myStringToTimestampUDF)
Next I filter on a new timestamp column:
val dfTrain = dfTOF1.filter(dfTOF1("TransactionTimestamp").between("2016-01-01 00:00:00+000", "2016-05-30 00:00:00+000"))
Now I am joining the last DataFrame:
val dfTrain2 = dfTrain.join(dfP1, Seq("someColumn7"), "left_outer")
And lastly the column selection with a cache() that is puzzling me.
val dfTrain3 = dfTrain.select("columnsToSelect5").cache()
dfTrain3.agg(sum(col("someColumn7"))).show()
It looks like the cache() is useless here but there will be some further processing and modelling of the DataFrame and the cache() will be necessary.
Should I give more details? Would you like to see execution plan for dfTrain3?
Apache Spark or Spark-Cassandra-Connector doesnt look like it is reading multiple partitions in parallel.
Here is my code using spark-shell
import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType
spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
val df4 = spark.read.json(rdd) // This line takes forever
I have about 700 million rows each row is about 1KB and this line
val df4 = spark.read.json(rdd) takes forever as I get the following output.
Stage 1:==========> (4866 + 24) / 25256]
so at this rate it will probably take roughly 3hrs.
I measured the network throughput rate of spark worker nodes using iftop and it is about 75MB/s (Megabytes per second) which is pretty good but I am not sure if it is reading partitions in parallel. Any ideas on how to make it faster?
Here is my DAG.