Writing data from Spark SQL vs RDD api - apache-spark

I recently performed ETL on a dataset using Spark 2.3.0 on EMR 5.19, in which I included a new sorting column. I used the following to do this and noticed that the output was much bigger than the original dataset (both are compressed Parquet).
spark.sql("select * from schema.table where column = 'value'").write.bucketBy(1,"column1").sortBy("column2","column3").option("path","/mypath").saveAsTable("table")
I then reran this using the method below and got the expected data size (the same as the original).
spark.read.load("/originaldata").filter("column='value'").write.bucketBy(1,"column1").sortBy("column2","column3").option("path","/mypath").saveAsTable("table")
My write method is identical, but the way I'm bringing the data in is different. Why is the first result about 4x bigger than the second? Am I not doing the exact same thing either way? I tried to look up the differences between Spark SQL and RDD but can't find anything specifically about writing the data. Note that the original dataset and both results are partitioned the same way (200 parts in all 3).

After getting the same larger-than-expected result with these approaches, I switched to this instead:
spark.read.load("/originaldata").filter("column='value'").sort("column1","column2").write.save("/location")
This works as expected and does not fail. It also avoids the Hive saveAsTable features I don't need, which makes it a better option than sortBy, since sortBy also requires bucketBy and saveAsTable.
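The thread does not establish why the two outputs differ in size. One factor worth checking, offered here as an assumption rather than a diagnosis, is the row order inside the written files: columnar formats such as Parquet rely on run-length and dictionary encoding plus a general-purpose compressor, so the same rows can compress very differently depending on how they are ordered. The plain-Python sketch below uses zlib as a stand-in for a Parquet codec, with made-up data, to show the effect.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical column: 100,000 values drawn from 100 distinct keys.
values = [random.randrange(100) for _ in range(100_000)]


def compressed_size(column):
    """Size in bytes of the column after zlib compression (one byte per value)."""
    return len(zlib.compress(bytes(column)))


unsorted_size = compressed_size(values)
sorted_size = compressed_size(sorted(values))

print("unsorted:", unsorted_size, "bytes")
print("sorted:  ", sorted_size, "bytes")
```

Reading back a single part file from each job and comparing the order of its rows would show whether this applies to the case above.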

Related

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited to MapReduce and Spark (SQL?) for "ETL" and HDFS for storage, with no other limitations for other things.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling to around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough data space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not for production use). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool to use to extract data from JSON files the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it best to extract IDs from DS1 and put exploded data next to them, and write them to new files? Or is it better to combine everything? How to achieve this - Hadoop, Spark?
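As a rough illustration of what exploding an array column produces, here is a plain-Python sketch with made-up records; the field names id, title and words are hypothetical. Each array element becomes its own row, keyed by the record's ID.

```python
import json

# Hypothetical DS1 records; the field names are assumptions for illustration.
raw = [
    '{"id": "a1", "title": "first", "words": ["spark", "hadoop"]}',
    '{"id": "a2", "title": "second", "words": ["spark"]}',
    '{"id": "a3", "title": "third", "words": []}',
]


def explode(records, id_field, array_field):
    """Yield one (id, element) pair per array element."""
    for record in records:
        for element in record.get(array_field, []):
            yield record[id_field], element


records = [json.loads(line) for line in raw]
exploded = list(explode(records, "id", "words"))
print(exploded)
```

A record with an empty array yields no rows in this sketch, so the ID alone is not enough to recover such records from the exploded output.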
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
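The linking step described above amounts to two key lookups through the mapping. A minimal sketch with in-memory dictionaries follows; all IDs and attribute names are invented, and at around 1TB this would be done with joins in a distributed engine rather than with Python dicts.

```python
# Hypothetical extracts; every ID and attribute here is made up.
ds1 = {"id1-a": {"text": "hello world"}, "id1-b": {"text": "big data"}}
ds2 = {"id2-x": {"author": "Ann", "year": 2017}}
ds3 = [("id1-a", "id2-x"), ("id1-b", "id2-missing")]  # (ID1, ID2) pairs


def enrich(ds1, ds2, mapping):
    """Inner-join DS1 rows to DS2 metadata through the (ID1, ID2) mapping."""
    joined = []
    for id1, id2 in mapping:
        if id1 in ds1 and id2 in ds2:
            joined.append({"id1": id1, **ds1[id1], **ds2[id2]})
    return joined


enriched = enrich(ds1, ds2, ds3)
print(enriched)
```

Pairs whose ID2 has no match in DS2 are dropped here; whether to keep them (an outer join) depends on the queries to be run afterwards.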
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking! Having worked as a big data developer for the last 1.5 years, with experience in both MapReduce and Spark, I think I can point you in the right direction.
The final goals you want to achieve can be reached with either MapReduce or Spark. For visualization you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-intensive: most of the computation runs in memory (RAM), and typically only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and longer run times.
You can use Spark's DataFrame feature. You can load data directly into a DataFrame from structured data (it can also be a plain-text file), which gives you the required data in tabular form. You can write the DataFrame to a plain-text file, or store it in a Hive table from which you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it, and store the final data in another Hive table. Writing native MapReduce jobs can be very tedious, so I would suggest avoiding that option.
In the end, I would suggest using Spark as the processing engine (128GB and 16 cores are enough for Spark) to get your final result as quickly as possible.
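The queries listed in the question are all grouped counts. A plain-Python sketch of the logic is below, with invented rows and column names; in Spark the same shape would typically be an explode followed by a groupBy and a count.

```python
from collections import Counter, defaultdict

# Hypothetical joined rows (author, year, text); the values are made up.
rows = [
    ("Ann", 2016, "spark is fast"),
    ("Ann", 2017, "spark and hadoop"),
    ("Bob", 2017, "hadoop is stable and hadoop is old"),
]


def top_words(rows, key, n=2):
    """Return the n most common words for each group chosen by key(row)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[key(row)].update(row[2].lower().split())
    return {group: counter.most_common(n) for group, counter in counts.items()}


by_year = top_words(rows, key=lambda r: r[1])
by_author_year = top_words(rows, key=lambda r: (r[0], r[1]))
print(by_year)
print(by_author_year)
```

Changing the key function is all it takes to switch between "by year", "by author" and "by author and year", which is why a single extracted table with those columns is enough for all three queries.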

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark DataFrame code by adding Ignite on top of it. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE) // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person") // Table to read
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but performance is much slower than with Spark alone (i.e. without Ignite, querying the Spark DataFrame directly). A SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" but a smaller result is processed faster. Overall, the Ignite DataFrame support seems to be a simple wrapper on top of Spark, so in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark". So I couldn't change any cache configuration in XML (because I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the case of the Ignite DataFrame you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
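The effect of an index can be sketched without Ignite at all. The toy below, in plain Python with invented rows, contrasts a full scan with a lookup through a prebuilt hash index on x. It only illustrates the idea and says nothing about how Ignite implements its indexes.

```python
from collections import defaultdict

# Hypothetical table rows (a, b, c, x); the values are made up.
table = [(i, i * 2, i * 3, i % 1000) for i in range(100_000)]


def full_scan(rows, x_value):
    """Check every row, as a query without a usable index must."""
    return [row for row in rows if row[3] == x_value]


def build_index(rows):
    """Map each x value to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[3]].append(position)
    return index


def indexed_lookup(rows, index, x_value):
    """Touch only the rows the index points at."""
    return [rows[position] for position in index.get(x_value, [])]


index_on_x = build_index(table)  # built once, reused by every query

scanned = full_scan(table, 42)
looked_up = indexed_lookup(table, index_on_x, 42)
print(len(scanned), len(looked_up))
```

Both paths return the same rows; the difference is that the scan inspects all 100,000 rows per query while the lookup inspects only the matches.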

Spark: Most efficient way to sort and partition data to be written as parquet

My data is in principle a table, which contains a column ID and a column GROUP_ID, besides other 'data'.
In the first step I am reading CSVs into Spark, doing some processing to prepare the data for the second step, and writing the data as Parquet.
The second step does a lot of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID').
The goal now is -- in order to avoid shuffling in the second step -- to efficiently load the data in the first step, as this is a one-timer.
Question Part 1: AFAIK, Spark preserves the partitioning when loading from parquet (which is actually the basis of any "optimized write consideration" to be made) - correct?
I came up with three possibilities:
df.orderBy('ID').write.partitionBy('GROUP_ID').parquet('/path/to/parquet')
df.orderBy('ID').repartition(n, 'GROUP_ID').write.parquet('/path/to/parquet')
df.repartition(n, 'GROUP_ID').sortWithinPartitions('ID').write.parquet('/path/to/parquet')
I would set n such that the individual parquet files would be ~100MB.
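Choosing n for a target file size is simple arithmetic. A sketch, assuming the total Parquet output size can be estimated up front (the 25 GiB figure below is invented):

```python
import math


def partitions_for_target(total_bytes, target_file_bytes=100 * 1024**2):
    """Number of partitions so each output file is about target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


estimated_output = 25 * 1024**3  # hypothetical: 25 GiB of Parquet output
n = partitions_for_target(estimated_output)
print(n)  # 256
```

This gives an average only. With repartition(n, 'GROUP_ID') rows are distributed by a hash of the key, so individual files can be noticeably larger or smaller than the target when group sizes are skewed.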
Question Part 2: Is it correct that the three options produce "the same"/similar results in regard of the goal (avoid shuffling in the 2nd step)? If not, what is the difference? And which one is 'better'?
Question Part 3: Which of the three options performs better regarding step 1?
Thanks for sharing your knowledge!
EDIT 2017-07-24
After doing some tests (writing to and reading from parquet) it seems that Spark is not able to recover partitionBy and orderBy information by default in the second step. The number of partitions (as obtained from df.rdd.getNumPartitions()) seems to be determined by the number of cores and/or by spark.default.parallelism (if set), but not by the number of parquet partitions. So the answer to question 1 would be WRONG, and questions 2 and 3 would be irrelevant.
So it turns out the REAL QUESTION is: is there a way to tell Spark, that the data is already partitioned by column X and sorted by column Y?
You will probably be interested in bucketing support in Spark.
See details here
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html
large.write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable(bucketedTableName)
Note that Spark 2.4 added support for bucket pruning (similar to partition pruning).
The more direct functionality you're looking for is Hive's bucketed-sorted tables:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
This is not yet available in Spark (see the PS section below).
Also note that Spark will not load the sorting information automatically. But since the data is already sorted, a sort operation on it will be much faster because there is little work to do, e.g. one pass over the data just to confirm that it is already sorted.
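Conceptually, bucketBy(4, "id") combined with sortBy("id") assigns each row to one of four buckets by hashing the key and keeps each bucket sorted. The sketch below models that in plain Python. zlib.crc32 is a stand-in hash chosen for illustration, not the hash function Spark uses, so the bucket numbers will not match Spark's.

```python
import zlib


def bucket_of(key, num_buckets):
    """Stable bucket number for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucket_and_sort(rows, num_buckets, key):
    """Group rows into buckets by hashed key, then sort each bucket by key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(key(row), num_buckets)].append(row)
    return [sorted(bucket, key=key) for bucket in buckets]


# Hypothetical rows of (id, payload).
rows = [(i % 10, f"row-{i}") for i in range(30)]
buckets = bucket_and_sort(rows, 4, key=lambda row: row[0])

for number, bucket in enumerate(buckets):
    print(number, sorted({row[0] for row in bucket}))
```

Because all rows with the same id land in the same bucket, a later join or aggregation on id can, in principle, be planned without a shuffle when the table's bucketing metadata is available to the engine.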
PS.
Spark and Hive bucketing are slightly different.
This is the umbrella ticket for providing compatibility in Spark with bucketed tables created in Hive:
https://issues.apache.org/jira/browse/SPARK-19256
As far as I know, no, there is no way to read data from Parquet and tell Spark that it is already partitioned by some expression and ordered.
In short, one file on HDFS etc. is too big for one Spark partition. And even if you read a whole file into one partition by playing with Parquet properties such as parquet.split.files=false, parquet.task.side.metadata=true etc., the cost would be higher than that of a single shuffle.
Try bucketBy. Also, partition discovery can help.

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151).
I tried this in spark-shell:
sqlContext.read.load("x.parquet").count
And Spark ran two stages, showing various aggregation steps in the DAG. I figure this means it reads through the file normally instead of using the row counts. (I could be wrong.)
The question is: Is Spark already using the row count fields when I run count? Is there another API to use those fields? Is relying on those fields a bad idea for some reason?
That is correct: Spark is already using the row count fields when you run count.
Diving into the details a bit, SpecificParquetRecordReaderBase.java references the "Improve Parquet scan performance when using flat schemas" commit, made as part of [SPARK-11787] Speed up parquet reader for flat schemas. Note that this commit was included in the Spark 1.6 branch.
If the query is a row count, it pretty much works the way you described it (i.e. by reading the metadata). If the predicates are fully satisfied by the min/max values, that should work as well, though that is not as fully verified. It's not a bad idea to use those Parquet fields but, as implied in the previous statement, the key issue is to ensure that the predicate filtering matches the metadata so that you get an accurate count.
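The idea of answering a count from metadata can be modelled with a toy structure. The sketch below is not the Parquet format or any Parquet library's API; the FileFooter and RowGroup classes are hypothetical stand-ins for the per-block row counts and min/max statistics kept in a file footer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    num_rows: int   # row count recorded when the block was written
    min_value: int  # per-block statistics for one column
    max_value: int


@dataclass
class FileFooter:
    row_groups: List[RowGroup]


def count_from_footers(footers):
    """Total row count using metadata only; no column data is read."""
    return sum(group.num_rows for footer in footers for group in footer.row_groups)


def groups_fully_matching(footers, low, high):
    """Row groups whose min/max prove every row satisfies low <= value <= high."""
    return [
        group
        for footer in footers
        for group in footer.row_groups
        if low <= group.min_value and group.max_value <= high
    ]


footers = [
    FileFooter([RowGroup(1000, 0, 99), RowGroup(500, 100, 149)]),
    FileFooter([RowGroup(250, 150, 300)]),
]
total = count_from_footers(footers)
print(total)  # 1750
```

A filtered count can use the recorded num_rows only for row groups that groups_fully_matching returns; any group that overlaps the predicate range partially still has to be read, which is the accuracy caveat mentioned above.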
To help understand why there are two stages, here's the DAG created when running the count() statement.
When digging into the two stages, notice that the first one (Stage 25) is running the file scan while the second stage (Stage 26) runs the shuffle for the count.
Thanks to Nong Li (the author of the SpecificParquetRecordReaderBase.java commit) for validating!
Updated
To provide additional context on the bridge between Dataset.count and Parquet, the flow of the internal logic is:
Spark does not read any Parquet columns to calculate the count.
The Parquet schema passed to the VectorizedParquetRecordReader is actually an empty Parquet message.
The count is computed from the metadata stored in the Parquet file footers. This involves wrapping the above in an iterator that returns an InternalRow, per InternalRow.scala.
To work with the Parquet file format, Apache Spark internally wraps the logic in an iterator that returns an InternalRow; more information can be found in InternalRow.scala. Ultimately, the count() aggregate function interacts with the underlying Parquet data source through this iterator. BTW, this is true for both the vectorized and the non-vectorized Parquet readers.
Therefore, to bridge the Dataset.count() with the Parquet reader, the path is:
The Dataset.count() call is planned into an aggregate operator with a single count() aggregate function.
Java code is generated at planning time for the aggregate operator as well as the count() aggregate function.
The generated Java code interacts with the underlying data source ParquetFileFormat through a RecordReaderIterator, which is used internally by the Spark data source API.
For more information, please refer to Parquet Count Metadata Explanation.
To print the count with locale-aware thousands separators, we can also use
java.text.NumberFormat.getIntegerInstance.format(sparkdf.count)

Spark dataframe saveAsTable vs save

I am using Spark 1.6.1 and I am trying to save a DataFrame in ORC format.
The problem I am facing is that the save method is very slow: it takes about 6 minutes for a 50MB ORC file on each executor.
This is how I am saving the DataFrame:
dt.write.format("orc").mode("append").partitionBy("dt").save(path)
I tried using saveAsTable to a Hive table which also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems: it seems that when a task fails, retries always fail because the file already exists.
This is how I am saving the DataFrame with saveAsTable:
dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)
Is there a reason the save method is so slow?
Am I doing something wrong?
The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data by each value of the partition column.
Try saving without partitionBy; there should be a significant performance difference.
See my previous comments above regarding cardinality and partitionBy.
If you really want to partition it, and it's just one 50MB file, then use something like
dt.repartition(4).write.format("orc").mode("append").saveAsTable(tableName)
repartition will create 4 roughly even partitions, rather than partitioning on the dt column as you are doing now, which could end up writing a lot of ORC files.
The choice of 4 partitions is a bit arbitrary. You're not going to get much performance or parallelism benefit from partitioning tiny files like that. The overhead of reading more files is not worth it.
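The file-count effect of partitionBy can be estimated with a small simulation. It assumes, as a simplification, that each write task emits one file per distinct partition value it holds; the task count and dt values below are invented.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable


def files_written(task_rows, partition_key=None):
    """Count output files: one per task, or one per (task, partition value)."""
    if partition_key is None:
        return sum(1 for rows in task_rows if rows)
    return sum(len({partition_key(row) for row in rows}) for rows in task_rows)


# Hypothetical input: 200 tasks, each holding 1,000 rows spread over 30 dt values.
dates = [f"2016-06-{day:02d}" for day in range(1, 31)]
tasks = [[(random.choice(dates), i) for i in range(1000)] for _ in range(200)]

without_partition_by = files_written(tasks)
with_partition_by = files_written(tasks, partition_key=lambda row: row[0])
print(without_partition_by, with_partition_by)
```

Under these assumptions the partitioned write produces up to 200 x 30 = 6,000 small files instead of 200, which is the cardinality cost referred to above.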
Use save() to write to a particular location, for example a blob storage location.
Use saveAsTable() to save the DataFrame as a Spark SQL table.
