I have 30GB of ORC files (24 parts * 1.3G) in S3. I am using Spark to read these ORC files and do some operations. But from the logs what I observed was that even before doing any operation, Spark opens and reads all 24 parts from S3 (taking 12 min just to read the files). My concern here is that all these read operations happen only on the driver, and the executors are all idle at this time.
Can someone explain why this is happening? Is there any way I can utilize all executors for reading as well?
Does the same apply to Parquet as well?
Thanks in advance.
Have you provided the schema of your data?
If not, Spark tries to get the schema of all the files, and then proceeds with the execution.
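As a minimal sketch, supplying the schema up front lets Spark skip the footer-reading pass on the driver (the column names and S3 path here are hypothetical, and a SparkSession named spark is assumed):
import org.apache.spark.sql.types._

// Hypothetical schema matching the ORC files
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("value", StringType, nullable = true)
))

// With an explicit schema, Spark does not need to open every part file to infer one
val df = spark.read.schema(schema).orc("s3a://my-bucket/path/to/orc/")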
Both ORC and Parquet check for summary data in the footers of files, and depending on the S3 client and its configuration, this may cause some very inefficient IO. This may be the cause.
If you are using the s3a:// connector and the underlying Hadoop 2.8+ JARs, then you can tell it to use the random IO needed for maximum performance on columnar data, and tune some other things.
val OPTIONS = Map(
  "spark.hadoop.fs.s3a.experimental.fadvise" -> "random",
  "spark.hadoop.orc.splits.include.file.footer" -> "true",
  "spark.hadoop.orc.cache.stripe.details.size" -> "1000",
  "spark.hadoop.orc.filterPushdown" -> "true",
  "spark.sql.parquet.mergeSchema" -> "false",
  "spark.sql.parquet.filterPushdown" -> "true"
)
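As a rough sketch of how such a map might be applied (assuming Spark 2.x and its SparkSession builder; the app name is arbitrary):
import org.apache.spark.sql.SparkSession

val builder = SparkSession.builder().appName("orc-reader")
OPTIONS.foreach { case (k, v) => builder.config(k, v) } // apply each tuning option
val spark = builder.getOrCreate()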
When I use Spark to read multiple files from S3 (e.g. a directory with many Parquet files) -
Does the logical partitioning happen at the beginning, then each executor downloads the data directly (on the worker node)?
Or does the driver download the data (partially or fully) and only then partitions and sends the data to the executors?
Also, will the partitioning default to the same partitions that were used for write (i.e. each file = 1 partition)?
Data on S3 is external to HDFS obviously.
You can read from S3 by providing a path or paths, or by using the Hive Metastore - if you have updated it by creating DDL for an external S3 table, and registering partitions with MSCK REPAIR TABLE (or ALTER TABLE table_name RECOVER PARTITIONS for Hive on EMR).
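As a sketch, registering such partitions might look like this (the table name, columns and location are all hypothetical):
spark.sql("""CREATE EXTERNAL TABLE my_table (value STRING)
             PARTITIONED BY (year INT, month INT)
             STORED AS PARQUET
             LOCATION 's3a://my-bucket/my_table/'""")
spark.sql("MSCK REPAIR TABLE my_table") // registers existing S3 partitions
// or, for Hive on EMR: ALTER TABLE my_table RECOVER PARTITIONS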
If you use:
val df = spark.read.parquet("/path/to/parquet/file.../...")
then there is no guarantee on partitioning - it depends on various settings; see Does Spark maintain parquet partitioning on read?, noting that the APIs evolve and get better.
But, this:
val df = spark.read.parquet("/path/to/parquet/file.../.../partitioncolumn=*")
will distribute partitions over the executors according to your saved partition structure, a bit like Spark's bucketBy.
The Driver only gets the metadata if specifying S3 directly.
In your terms:
"... each executor downloads the data directly (on the worker node)? " YES
Metadata is obtained via Driver coordination with other system components for file / directory locations on S3, but the data itself is not first downloaded to the Driver - that would be a serious design flaw. How the APIs respond also depends on the form of the statement.
I am processing a text file and writing transformed rows from a Spark application to Elasticsearch as below:
input.write.format("org.elasticsearch.spark.sql")
.mode(SaveMode.Append)
.option("es.resource", "{date}/" + dir).save()
This runs very slowly and takes around 8 minutes to write 287.9 MB / 1513789 records.
How can I tune Spark and Elasticsearch settings to make this faster, given that network latency will always be there?
I am using spark in local mode and have 16 cores and 64GB RAM.
My elasticsearch cluster has one master and 3 data nodes with 16 cores and 64GB each.
I am reading the text file as below:
val readOptions: Map[String, String] = Map("ignoreLeadingWhiteSpace" -> "true",
"ignoreTrailingWhiteSpace" -> "true",
"inferSchema" -> "false",
"header" -> "false",
"delimiter" -> "\t",
"comment" -> "#",
"mode" -> "PERMISSIVE")
....
val input = sqlContext.read.options(readOptions).csv(inputFile.getAbsolutePath)
First, let's start with what's happening in your application. Apache Spark is reading one (not so big) compressed csv file. Thus Spark will first spend time decompressing the data and scanning it before writing it to Elasticsearch.
This will create a Dataset/DataFrame with one partition (confirmed by the result of your df.rdd.getNumPartitions mentioned in the comments).
One straightforward solution would be to repartition your data on read and cache it before writing it to Elasticsearch. Now I'm not sure what your data looks like, so deciding the number of partitions is a subject for benchmarking on your side.
val input = sqlContext.read.options(readOptions)
.csv(inputFile.getAbsolutePath)
.repartition(100) // 100 is just an example
.cache
I'm not sure how much benefit this will bring to your application, because I believe there may be other bottlenecks (network IO, disk type for ES).
PS: I would recommend converting csv to parquet files before building ETL over them. There is a real performance gain here. (personal opinion and benchmarks)
Another possible optimization would be to tweak the es.batch.size.entries setting of the elasticsearch-spark connector. The default value is 1000.
You need to be careful when setting this parameter because you might overload Elasticsearch. I strongly advise you to take a look at the available configurations here.
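As a hedged sketch, reusing the write from the question with a smaller batch size (the value 500 is arbitrary and should be benchmarked):
import org.apache.spark.sql.SaveMode

input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + dir)
  .option("es.batch.size.entries", "500") // default is 1000; lower it if ES is overloaded
  .save()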
I hope this helps!
I am writing this code
val inputData = spark.read.parquet(inputFile)
spark.conf.set("spark.sql.shuffle.partitions",6)
val outputData = inputData.sort($"colname")
outputData.write.parquet(outputFile) //write on HDFS
If I want to read the content of the file "outputFile" from HDFS, I don't find the same number of partitions and the data is not sorted. Is this normal?
I am using Spark 2.0
This is an unfortunate deficiency of Spark. While write.parquet saves files as part-00000.parquet, part-00001.parquet, ... , it saves no partition information, and does not guarantee that part-00000 on disk is read back as the first partition.
We have added functionality for our project to a) read back partitions in the same order (this involves doing some somewhat-unsafe partition casting and sorting based on the contained filename), and b) serialize partitioners to disk and read them back.
As far as I know, there is nothing you can do in stock Spark at the moment to solve this problem. I look forward to seeing a resolution in future versions of Spark!
Edit: My experience is in Spark 1.5.x and 1.6.x. If there is a way to do this in native Spark with 2.0, please let me know!
You should make use of repartition() instead. This would write the parquet file the way you want it:
outputData.repartition(6).write.parquet("outputFile")
Then it would be the same if you try to read it back.
Parquet preserves the order of rows. You should use take() instead of show() to check the contents. take(n) returns the first n rows; it works by first reading the first partition to get an idea of the partition size and then fetching the rest of the data in batches.
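For example (a sketch reusing the outputFile path from the question):
val readBack = spark.read.parquet(outputFile)
readBack.take(10).foreach(println) // returns the first 10 rows in stored order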
I am working with a large dataset, that is partitioned by two columns - plant_name and tag_id. The second partition - tag_id has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast response as this resolves to a single partition. In Hive and Presto this takes seconds, however in Spark it runs for hours.
The actual data is held in a S3 bucket, and when I submit the sql query, Spark goes off and first gets all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status list of all these files in the S3 object store (actually calling listLeafFilesInParallel).
It is these two operations that are so expensive. Are there any settings that can get Spark to prune the partitions earlier - either during the call to the metastore, or immediately afterwards?
Yes, Spark supports partition pruning.
Spark does a listing of partition directories (sequentially or in parallel, via listLeafFilesInParallel) to build a cache of all partitions the first time around. Queries in the same application that scan data take advantage of this cache. So the slowness that you see could be due to this cache building. Subsequent queries that scan data make use of the cache to prune partitions.
These are the logs which shows partitions being listed to populate the cache.
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver
These are the logs showing pruning is happening.
App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
Refer convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
Just a thought:
Spark API documentation for HadoopFsRelation says,
( https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html )
"...when reading from Hive style partitioned tables stored in file
systems, it's able to discover partitioning information from the paths
of input directories, and perform partition pruning before start
reading the data..."
So, I guess listLeafFilesInParallel could not be the problem.
A similar issue is already in spark jira: https://issues.apache.org/jira/browse/SPARK-10673
Given that setting "spark.sql.hive.verifyPartitionPath" to false has no effect on performance, I suspect that the issue might have been caused by unregistered partitions. Please list out the partitions of the table and verify that all the partitions are registered. Otherwise, recover your partitions as shown in this link:
Hive doesn't read partitioned parquet files generated by Spark
Update:
I guess appropriate parquet block size and page size were set while writing the data.
Create a fresh Hive table with the partitions mentioned and Parquet as the file format, then load it from the non-partitioned table using the dynamic partition approach.
( https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions )
Run a plain hive query and then compare by running a spark program.
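A rough sketch of that dynamic-partition load (all table and column names are hypothetical):
sqlContext.sql("SET hive.exec.dynamic.partition = true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
sqlContext.sql("""INSERT OVERWRITE TABLE my_table_part PARTITION (plant_name, tag_id)
                  SELECT value, plant_name, tag_id FROM my_table_flat""")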
Disclaimer: I am not a spark/parquet expert. The problem sounded interesting, and hence responded.
A similar question popped up here recently:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reads-all-leaf-directories-on-a-partitioned-Hive-table-td35997.html#a36007
This question is old but I thought I'd post the solution here as well.
spark.sql.hive.convertMetastoreParquet=false
will make Spark use the Hive Parquet serde instead of the Spark built-in Parquet serde. Hive's Parquet serde will not do a listLeafFiles on all partitions, but will read directly and only from the selected partitions. On tables with many partitions and files, this is much faster (and cheaper, too). Feel free to try it out! :)
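For example (a sketch assuming a Spark 2.x session; the query is the one from the question above):
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
val df = spark.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")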
I am using Spark SQL (actually hiveContext.sql()), which uses group by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping.
I believe this partitioning will spread the data shuffle load, so the more partitions, the less data each holds. I am new to Spark. I am using Spark 1.4.0 and I have around 1TB of uncompressed data to process using hiveContext.sql() group by queries.
If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions to 2001.
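For example, with the HiveContext from the question:
hiveContext.setConf("spark.sql.shuffle.partitions", "2001")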
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
I really wish they would let you configure this independently.
By the way, I found this information in a Cloudera slide deck.
OK, so I think your issue is more general. It's not specific to Spark SQL; it's a general problem with Spark where it ignores the number of partitions you tell it to use when the files are few. Spark seems to use the same number of partitions as the number of files on HDFS, unless you call repartition. So calling repartition ought to work, but it has the caveat of causing a somewhat unnecessary shuffle (illustrated in the sketch below).
I raised this question a while ago and have yet to get a good answer :(
Spark: increase number of partitions without causing a shuffle?
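To illustrate the caveat (a sketch with a hypothetical DataFrame df): repartition always shuffles, while coalesce avoids a shuffle but can only decrease the partition count:
val morePartitions  = df.repartition(1000) // full shuffle; can increase partition count
val fewerPartitions = df.coalesce(10)      // no shuffle; can only decrease it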
It actually depends on your data and your query; if Spark must load 1TB, there is something wrong with your design.
Use the superb web UI to see the DAG, i.e. how Spark translates your SQL query into jobs/stages and tasks.
Useful metrics are "Input" and "Shuffle".
Partition your data (Hive / directory layout like /year=X/month=X)
Use Spark's CLUSTER BY feature to work per data partition (see the sketch after this list)
Use the ORC / Parquet file formats, because their push-down filters mean useless data is not loaded into Spark
Analyze Spark History to see how Spark is reading data
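For instance, the CLUSTER BY item above might look like this (a sketch with hypothetical table and column names):
hiveContext.sql("""SELECT plant, tag, value
                   FROM sensor_data
                   CLUSTER BY plant""")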
Also, could the OOM be happening on your driver?
-> This is another issue: the driver will collect the data you want at the end. If you ask for too much data, the driver will OOM; try limiting your query, or write to another table (Spark syntax: CREATE TABLE ... AS, sketched below).
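A minimal sketch of that write-to-table alternative (the table, columns and query are hypothetical):
hiveContext.sql("""CREATE TABLE grouped_results AS
                   SELECT key_col, count(*) AS cnt
                   FROM source_table
                   GROUP BY key_col""")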
I came across this post from Cloudera about Hive partitioning. Check out the "Pointers" section, which talks about the number of partitions and the number of files in each partition; too many can overload the NameNode, which might cause OOM.