Writing to Elasticsearch from Spark is very slow - apache-spark

I am processing a text file and writing transformed rows from a Spark application to Elasticsearch as below
input.write.format("org.elasticsearch.spark.sql")
.mode(SaveMode.Append)
.option("es.resource", "{date}/" + dir).save()
This runs very slowly and takes around 8 minutes to write 287.9 MB / 1,513,789 records.
How can I tune Spark and Elasticsearch settings to make this faster, given that network latency will always be there?
I am using Spark in local mode and have 16 cores and 64 GB RAM.
My Elasticsearch cluster has one master and 3 data nodes, with 16 cores and 64 GB each.
I am reading the text file as below
val readOptions: Map[String, String] = Map("ignoreLeadingWhiteSpace" -> "true",
"ignoreTrailingWhiteSpace" -> "true",
"inferSchema" -> "false",
"header" -> "false",
"delimiter" -> "\t",
"comment" -> "#",
"mode" -> "PERMISSIVE")
....
val input = sqlContext.read.options(readOptions).csv(inputFile.getAbsolutePath)

First, let's start with what's happening in your application. Apache Spark is reading one (not so big) CSV file, which is compressed. Spark will therefore first spend time decompressing the data and scanning it before writing it to Elasticsearch.
This creates a Dataset/DataFrame with one partition (confirmed by the result of your df.rdd.getNumPartitions mentioned in the comments).
One straightforward solution would be to repartition your data on read and cache it before writing it to Elasticsearch. Now, I'm not sure what your data looks like, so deciding on the number of partitions is subject to benchmarking on your side.
val input = sqlContext.read.options(readOptions)
.csv(inputFile.getAbsolutePath)
.repartition(100) // 100 is just an example
.cache
I'm not sure how much benefit this will bring to your application, because I believe there might be other bottlenecks (network IO, disk type on the ES side).
PS: I would recommend converting the CSV to Parquet files before building ETL on top of them; there is a real performance gain here. (Personal opinion and benchmarks.)
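For what it's worth, a one-off conversion could look like the sketch below; the output path is hypothetical, and readOptions is the map from the question:
// One-off CSV-to-Parquet conversion sketch; the output path is a hypothetical placeholder
val parquetPath = "/data/parquet/" + dir
sqlContext.read.options(readOptions)
  .csv(inputFile.getAbsolutePath)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(parquetPath)
// Later ETL jobs would then read the Parquet copy instead of re-parsing the CSV:
// val input = sqlContext.read.parquet(parquetPath)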
Another possible optimization would be to tweak the es.batch.size.entries setting for the elasticsearch-spark connector. The default value is 1000.
You need to be careful when setting this parameter because you might overload Elasticsearch. I strongly advise you to take a look at the available configurations here.
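As a rough sketch (the values below are only examples to benchmark against, not recommendations), these connector options can be passed alongside es.resource on the same writer:
input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.batch.size.entries", "5000")   // example: bulk requests of up to 5000 docs instead of the default 1000
  .option("es.batch.size.bytes", "4mb")      // usually tuned together with the entry count (default is 1mb)
  .option("es.batch.write.refresh", "false") // skip the index refresh after each bulk request while loading
  .option("es.resource", "{date}/" + dir)
  .save()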
I hope this helps!

Related

Spark AQE post-shuffle partition coalescing doesn't work as expected, and even makes data skew worse in some partitions. Why?

I use a global sort on my Spark DataFrame, and when I enable AQE and post-shuffle coalescing, my partitions after the sort operation become even more unevenly distributed than before.
"spark.sql.adaptive.enabled" -> "true",
"spark.sql.adaptive.coalescePartitions.enabled" -> "true",
"spark.sql.adaptive.advisoryPartitionSizeInBytes" -> "256mb",
"spark.sql.adaptive.coalescePartitions.minPartitionNum" -> "1",
"spark.sql.adaptive.coalescePartitions.initialPartitionNum" -> "20000"
My query, at a high level, looks like this:
.readFromKafka
.deserializeJsonToRow
.cache
.sort(const_part, column which can cause skew, some salt columns)
.writeToS3
column which can cause skew -> yes, my data is not well distributed, that's why I use salts.
I read the data from Kafka, so I use the Kafka partition and offset columns as salt.
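For context, here is a minimal batch sketch of what such a pipeline could look like; the broker, topic, output path, and the JSON columns (const_part, skewed_col) are hypothetical stand-ins, while partition and offset are the Kafka source columns used as salt:
// Minimal sketch, not the exact job: batch read from Kafka, salted global sort, write to S3
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.getOrCreate()

val jsonSchema = StructType(Seq(                       // hypothetical schema of the JSON payload
  StructField("const_part", StringType),
  StructField("skewed_col", StringType)))

val rows = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")    // hypothetical broker
  .option("subscribe", "input-topic")                  // hypothetical topic
  .option("minPartitions", "2160")                     // 720 topic partitions * 3, as mentioned in the P.S. below
  .load()
  .select(from_json(col("value").cast("string"), jsonSchema).as("j"),
          col("partition"), col("offset"))             // keep Kafka partition + offset as salt columns
  .select("j.*", "partition", "offset")
  .cache()                                             // avoid a second Kafka read: range partitioning samples the data

rows.sort(col("const_part"), col("skewed_col"), col("partition"), col("offset"))
  .write
  .mode("overwrite")
  .parquet("s3a://bucket/output/")                     // hypothetical output path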
Why doesn't the sort, which uses repartitionByRange under the hood, help me, and why do I want to enable AQE? -> Right now I see that my Kafka messages can differ too much in size. So my partitions after the range repartition have nearly the same number of records, but are still very uneven in bytes.
Why do I think AQE must help me? -> I want to create many small ranges which, even with my data skew, will not be more than ~50 MB, so the post-shuffle coalesce will be able to coalesce them to the target size (256 MB). In my case, spikes up to 320 MB are ok.
My first assumption was that even a small range could contain a too-big spike.
But I checked and confirmed that the repartition by range gives me a good distribution in records, but a bad one in size. I have nearly 200 partitions with nearly the same number of records and size differences of up to 9x, from ~100 MB to ~900 MB.
But with AQE and a repartition into 18000 small ranges, the smallest partition was 18 MiB and the biggest 1.8 GiB.
This state of things is much worse than without AQE.
It's important to highlight that I use metrics from the Spark UI -> Details for Stage tab to identify partition sizes in bytes, and I have my own logs for record counts.
So I started to debug the issue, but AQE doesn't have enough logging on the input and output of ShufflePartitionsUtil.coalescePartitions.
That's why I rewrote my query as repartitionByRange.sortWithinPartitions and forked the physical plan optimization with additional logging.
My logs show me that my initial idea was right:
The input partitions after the map and write shuffle stage were split small enough.
The coalesce algorithm collects them into the correct number of partitions, well distributed in bytes.
Input shuffleId:2 partitions:17999
Max partition size :27362117
Min partition size :8758435
And
Number of shuffle stages to coalesce 1
Reduce number of partitions from 17999 to 188
Output partition maxsize :312832323
Output partition min size :103832323
The min size is so different because of the size of the last partition; that's expected. The TRACE log level shows that 99% of the partitions are near 290 MiB.
But why does the Spark UI show such different results? ->
Could the Spark UI be wrong? ->
Maybe, but besides task size, the duration of a task is also too big, which makes me think the Spark UI is ok.
So my assumption is that the issue is with MapOutputStatistics in my stage. But is it always broken, or only in my case? ->
Only in my case? -> I made a few checks to confirm it.
I read the same dataset from S3 (Parquet files with a 120 MB block size) -> and AQE works as expected. The post-shuffle coalesce returns 188 partitions, well distributed by size. It's important to note that the data on S3 is not well distributed, but Spark splits it during reading into 259 partitions of roughly 120 MB, mostly because of the 120 MB Parquet block size.
I read the same dataset from Kafka, but exclude the skewed column from the partition function -> and AQE works as expected. The post-shuffle coalesce returns 203 partitions, well distributed by size.
I tried to disable the cache -> this did not change anything. I use the cache only to avoid reading from Kafka twice, because repartition by range uses sampling.
I tried to disable AQE and write the 18000 partitions to S3 -> the result was as expected and matched what my log on the coalesce input shows: 17999 files, the smallest near 8 MiB and the biggest 56 MiB.
All these checks make me think that MapOutputStatistics is wrong only in my case. Maybe the issue is related to the Kafka source, or to my Kafka input data being very unevenly distributed.
Questions:
So, does anyone have an idea what I am doing wrong?
And what can I do with the input data to make the post-shuffle coalesce work in my case?
If you think that I am right, please leave a comment about it.
P.S.
I also want to mention that my input Kafka DataFrame has 2160 unevenly distributed partitions -> some partitions can be 2x bigger than others. I read from a Kafka topic with 720 partitions and the minPartitions option set to 3x that.
https://www.mail-archive.com/dev@spark.apache.org/msg26851.html
Here is the answer:
The worst case of enabling AQE in cached data is not losing the opportunity of using/reusing the cache, but rather just an extra shuffle if the outputPartitioning happens to match without AQE and not match after AQE. The chance of this happening is rather low.

How to control size of Parquet files in Glue?

I'm loading a data set into a DynamicFrame, performing a transformation, and then writing it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set"
    },
    format = "parquet"
)
The result is 12 Parquet files with an average size of about 3MB.
First of all, I don't get why Glue/Spark won't by default create a single file about 36 MB large instead, given that almost all consuming software (Presto/Athena, Spark) prefers a file size of about 100 MB rather than a pile of small files. If somebody has an insight here, I'd appreciate hearing about it.
But practically speaking, I'm wondering if it is possible to make Glue/Spark produce a large file, or at least larger files. Is that possible?
Using coalesce(1) will deteriorate the performance of Glue in the long run. While it may work for small files, it will take a ridiculously long time for larger files.
coalesce(1) makes only one Spark executor write the file, whereas without coalesce() all the Spark executors would write the file in parallel.
Also, using coalesce(1) has a bigger cost: one executor running for a long time costs more than all executors running for a fraction of the time taken by that one executor.
Coalesce(1) took 4 hrs 48 minutes to process 1GB of Parquet Snappy Compressed Data.
Coalesce(9) took 48 minutes for the same.
No Coalesce() did the same job in 25 minutes.
I haven't tried it yet, but you can set accumulator_size in write_from_options.
Check https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass the value.
Alternatively, you can use a PySpark DataFrame with 1 partition before the write in order to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to 1 file will not take advantage of parallel writing and hence will increase the time to write.
You could try repartition(1) before writing the dynamic frame to S3. Refer here to understand why coalesce(1) is a bad choice for merging. It might also cause Out Of Memory (OOM) exceptions if a single node cannot hold all the data to be written.
I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files.
The number of output files is directly linked to the number of partitions. Spark cannot assume a default size for output files, as it is application dependent. The only way to control the size of the output files is to act on the number of partitions.
I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using the AWS Glue API [1], you can control how small files are grouped into a single partition while you read the data:
# grouping applies when reading from S3, so the options go on the read side
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://the-bucket/some-data-set"],
        "groupFiles": "inPartition",
        "groupSize": "10485760"  # 10485760 bytes (10 MB)
    },
    format = "parquet"
)
If your transformation code does not impact the data distribution too much (no filtering, no joins, etc.), you should expect the output to have almost the same total size as the input it read (not considering the compression rate). In general, though, Spark transformations are pretty complex, with joins, aggregates and filtering; these change the data distribution and the number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect; a small sizing sketch follows the reference below.
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/
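For illustration, here is a rough way to derive a partition count from a target file size (Scala Spark shown; the same repartition call exists on PySpark DataFrames, and the sizes are assumptions to replace with your own numbers):
// Rough sizing sketch: pick a partition count from an estimated output size and a target file size
val targetFileBytes   = 128L * 1024 * 1024      // assumed target of ~128 MB per file
val estimatedOutBytes = 36L * 1024 * 1024       // assumed total output size (e.g. the ~36 MB from the question)
val numFiles = math.max(1, math.ceil(estimatedOutBytes.toDouble / targetFileBytes).toInt)

df.repartition(numFiles)                        // df: the Spark DataFrame behind the DynamicFrame
  .write
  .mode("overwrite")
  .parquet("s3://the-bucket/some-data-set")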

Spark reading orc file in driver not in executors

I have 30 GB of ORC files (24 parts * 1.3 GB) in S3. I am using Spark to read these ORC files and do some operations. But from the logs what I observed was that even before doing any operation, Spark opens and reads all 24 parts from S3 (taking 12 minutes just to read the files). My concern here is that all these read operations happen only in the driver, and the executors are all idle at this time.
Can someone explain to me why this is happening? Is there any way I can utilize all executors for reading as well?
Does the same apply to Parquet as well?
Thanks in advance.
Have you provided the schema of your data ?
If not, Spark tries to get the schema of all the files, and then proceeds with the execution.
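For example, a minimal sketch of passing an explicit schema, so the driver does not have to open every file to infer it (the column names and path are hypothetical):
import org.apache.spark.sql.types._

val orcSchema = StructType(Seq(
  StructField("id", LongType),             // hypothetical columns
  StructField("event_time", TimestampType),
  StructField("payload", StringType)))

val df = spark.read
  .schema(orcSchema)                       // skip schema inference over all 24 parts
  .orc("s3a://bucket/orc-data/")           // hypothetical path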
Both ORC and Parquet can do checks for summary data in the footers of files, and, depending on the S3 client and its configuration, this may cause some very inefficient IO. This may be the cause.
If you are using the s3a:// connector and the underlying JARs of Hadoop 2.8+, then you can tell it to use the random IO needed for maximum performance on columnar data, and tune some other things.
val OPTIONS = Map(
  "spark.hadoop.fs.s3a.experimental.fadvise" -> "random",
  "spark.hadoop.orc.splits.include.file.footer" -> "true",
  "spark.hadoop.orc.cache.stripe.details.size" -> "1000",
  "spark.hadoop.orc.filterPushdown" -> "true",
  "spark.sql.parquet.mergeSchema" -> "false",
  "spark.sql.parquet.filterPushdown" -> "true"
)
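Here is a sketch of one way such options might be applied when building the session (assuming a Spark 2.x+ SparkSession; the map is the OPTIONS value above):
import org.apache.spark.sql.SparkSession

// Fold each key/value from OPTIONS into the session builder
val spark = OPTIONS
  .foldLeft(SparkSession.builder.appName("orc-read")) {
    case (builder, (key, value)) => builder.config(key, value)
  }
  .getOrCreate()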

Spark skewing data to few executors

I'm running Spark in standalone mode with 21 executors, and when I load my first SQL table using my sqlContext, I partition it in such a way that the data is perfectly distributed among all blocks, by partitioning on a column of sequential integers:
val brDF = sqlContext.load("jdbc", Map("url" -> srcurl, "dbtable" -> "basereading", "partitionColumn" -> "timeperiod", "lowerBound" ->"2", "upperBound" -> "35037", "numPartitions" -> "100"))
Additionally, the blocks are nicely distributed across the nodes, so that each node has similar memory usage.
Unfortunately, when I join it with a much smaller table idoM like so:
val mrDF = idoM.as('idom).join(brS1DF.as('br), $"idom.idoid" === $"br.meter")
where idoM is a one-column table, and cache the result, the distribution of the RDD blocks across the cluster changes:
(screenshot of the Spark UI executors sorted by number of RDD blocks)
Now there are suddenly more RDD blocks on my fourth node and it uses more memory. Upon checking each RDD, their blocks still seem to be distributed nicely, so my partitioning is still fine; it is just that all the blocks seem to want to be written to one node, defeating the purpose of having multiple nodes to begin with.
I suspect that my problem is similar to
this question on the Apache mail list
but there is no answer, so anything would be greatly appreciated.
Not knowing your data, I assume that the distribution of the key you are joining on is the cause of the data skew.
Running idoM.groupBy("idoid").count.orderBy(desc("count")).show or brS1DF.groupBy("meter").count.orderBy(desc("count")).show will probably show you that a few values have a lot of occurrences.
The issue was that idoM was loaded onto one machine and Spark tried to keep data locality, doing the whole join on that machine; in this case it was resolved by broadcasting the smaller table to the larger one. I made sure that the keys of idoM were perfectly distributed on the column being joined, but unfortunately repartitioning does not solve the issue, as Spark still tries to keep the locality and the whole DataFrame still ends up on one machine.
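A minimal sketch of what forcing that broadcast could look like with the standard broadcast hint (the DataFrame names follow the question):
import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship the small one-column table to every executor instead of pulling the big table to one node
val mrDF = brS1DF.as("br")
  .join(broadcast(idoM.as("idom")), $"idom.idoid" === $"br.meter")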

What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

I am using Spark SQL (actually hiveContext.sql()) with group-by queries, and I am running into OOM issues. I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping.
I believe these partitions share the data shuffle load, so the more partitions there are, the less data each holds. I am new to Spark. I am using Spark 1.4.0 and I have around 1 TB of uncompressed data to process with hiveContext.sql() group-by queries.
If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions to 2001.
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
I really wish they would let you configure this independently.
By the way, I found this information in a Cloudera slide deck.
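For reference, a minimal sketch of applying that workaround before the group-by (the query itself is a hypothetical placeholder):
// Go just past the 2000-partition threshold so HighlyCompressedMapStatus is used for shuffle book-keeping
hiveContext.setConf("spark.sql.shuffle.partitions", "2001")
val grouped = hiveContext.sql("SELECT some_key, count(*) AS cnt FROM some_table GROUP BY some_key") // hypothetical query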
OK, so I think your issue is more general. It's not specific to Spark SQL; it's a general problem with Spark, where it ignores the number of partitions you tell it to use when the files are few. Spark seems to use the same number of partitions as the number of files on HDFS, unless you call repartition. So calling repartition ought to work, but it has the caveat of causing a shuffle somewhat unnecessarily.
I raised this question a while ago and have still yet to get a good answer :(
Spark: increase number of partitions without causing a shuffle?
It actually depends on your data and your query; if Spark must load 1 TB, there is something wrong with your design.
Use the superb web UI to see the DAG, i.e. how Spark translates your SQL query into jobs/stages and tasks.
Useful metrics are "Input" and "Shuffle".
Partition your data (Hive / directory layout like /year=X/month=X)
Use Spark's CLUSTER BY feature to work per data partition
Use the ORC / Parquet file formats because they provide push-down filters, so useless data is not loaded into Spark
Analyze the Spark history to see how Spark is reading data
Also, could the OOM happen on your driver?
-> That is another issue: the driver will collect, at the end, the data you ask for. If you ask for too much data, the driver will OOM; try limiting your query, or write to another table (Spark syntax: CREATE TABLE ... AS).
I came across this post from Cloudera about Hive partitioning. Check out the "Pointers" section, which talks about the number of partitions and the number of files in each partition overloading the NameNode, which might cause OOM.
