I'm attempting to generate aggregate HLL sketches in a Scala Spark job and push the data into a varbinary column in Trino for dashboard aggregations.
I'm using the spark-alchemy library to generate the sketches in Spark, but I keep running into compatibility issues when running the cardinality function in Trino. Specifically, the error is:
Cannot deserialize HyperLogLog
Trino uses the HLL implementation from airlift, but bringing that library in and writing UDFs around it seems awfully cumbersome. Is there a more streamlined way to achieve interoperability between Spark HLL and Trino HLL?
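For context, this is roughly how the sketches are being generated on the Spark side with spark-alchemy (a minimal sketch: the table, column, and error rate are placeholders, and the exact overloads may differ between spark-alchemy versions). The resulting binary column is what ends up in the Trino varbinary column that cardinality() then rejects.

    import org.apache.spark.sql.SparkSession
    import com.swoop.alchemy.spark.expressions.hll.functions._

    val spark = SparkSession.builder().appName("hll-sketches").getOrCreate()

    // One sketch per dashboard key; hll_init_agg aggregates values into a binary HLL sketch.
    val sketches = spark.table("events")                     // hypothetical source table
      .groupBy("dashboard_key")
      .agg(hll_init_agg("user_id", 0.02).as("user_hll"))     // ~2% relative error

    // The binary sketch column is what gets exported to Trino as varbinary.
    sketches.write.mode("overwrite").saveAsTable("event_sketches")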
Related
All: I am looking for someone with more knowledge to check my understanding of Hive and Spark
I have been researching different large scale database solutions and I am trying to understand the difference in execution between Hive and Spark. I attempted to install Hadoop, Hive, and Spark to see how they perform. I was able to get Hadoop and Spark to work. I was unable to get Hive to work.
When I ran queries in Spark, after they passed through the optimizer, it seemed that the biggest advantage is that only the relevant table data is selected from the source as early as possible. So if I only need Table1 columns (A, B, C) in the final answer, but tell the system to JOIN Table1 and Table2 on Table1.A = Table2.B, it immediately reduces the carried table to only the relevant columns (as in the sketch below). I do not think Hive performs that way; I believe it will do the full join and perform the reduction later.
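As a rough way to check that behaviour in Spark (a sketch for spark-shell; the table and column names are made up), you can look at the physical plan:

    val table1 = spark.table("Table1")
    val table2 = spark.table("Table2")

    // Only A, B and C from Table1 are needed in the final result.
    val result = table1
      .join(table2, table1("A") === table2("B"))
      .select(table1("A"), table1("B"), table1("C"))

    // The plan shows column pruning pushed below the join: the scan of Table1
    // reads only A, B, C and the scan of Table2 reads only B.
    result.explain()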
There are also differences in memory usage (Hive going back to HDFS frequently vs. Spark keeping things in RAM). This has both advantages and disadvantages depending on the data set and query.
Unfortunately, because I cannot get Hive to run, my theory is based on reading the output of other people running things in Hive.
I think Hive and Spark originally had different goals, and their execution styles are based on those goals.
Apache Spark is a framework that allows you to do computations on big datasets stored on HDFS.
Hive is a SQL interface for retrieving data stored in HDFS, and in other clustered and object-store filesystems (S3, for example), in a structured way.
Spark keeps things in RAM because it is more focused on doing computations with the data sets. Hive is more focused on retrieving data in a structured way, so it does not focus as much on speed (that said, there have been improvements in Hive, like LLAP, that are meant to improve performance).
I like to use analogies with traditional software tools. On one side you have a relational database, and on the other side a programming language. They overlap in some functionality (you can write to and read from disk with the programming language, and you can do some calculations with the SQL engine). However, if the task at hand requires intensive and complex calculations, you would probably use the programming language. If you are looking for a system that lets you store data in a structured way, you would go for the SQL engine.
Hive on Tez and Spark both use RAM (memory) for operating on data, but the number of partitions computed, each of which is treated as an individual task, can be quite different between Hive on Tez and Spark. Hive on Tez by default tries to use a combiner to merge certain splits into a single partition, and it also seems to handle autoscaling of clusters better than Spark and works most of the time. Spark does not cope well with autoscaling: it hits a lot of shuffle errors and fails when there are multiple stages (see the sketch below for the Spark-side setting involved). Given a fixed-size cluster, however, Spark seems to perform better than Hive on Tez, which can be attributed to some of its optimizations and to how shuffle, serialization, etc. are implemented.
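For the Spark side of that partition-count difference, here is a minimal sketch of the setting involved (run in spark-shell; the value 400 is arbitrary):

    // Number of partitions Spark SQL uses for shuffles (joins, aggregations).
    // The default is 200 regardless of input size, which is one reason task
    // counts differ from what Hive on Tez computes when it merges splits.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    println(spark.conf.get("spark.sql.shuffle.partitions"))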
I am running a Spark application that has a series of Spark SQL statements that are executed one after the other. The SQL queries are quite complex and the application is working (generating output). These days, I am working towards improving the performance of processing within Spark.
Please suggest whether Tungsten encoding has to be enabled separately, or whether it kicks in automatically when running Spark SQL.
I am using Cloudera 5.13 for my cluster (2 nodes).
It is enabled by default in Spark 2.x (and maybe 1.6, but I'm not sure about that).
In any case, you can set it explicitly:

    spark.sql.tungsten.enabled=true

which can be passed to spark-submit as follows:

    spark-submit --conf spark.sql.tungsten.enabled=true
Tungsten should be enabled if you see a * next to operators in the physical plan, as in the example below.
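A minimal way to check this from spark-shell (Spark 2.x; the query itself is a throwaway example):

    import spark.implicits._

    val df = spark.range(1000).filter($"id" % 2 === 0).groupBy().count()
    df.explain()
    // Operators prefixed with * in the printed physical plan (e.g. *HashAggregate,
    // *Filter, *Range) are the ones compiled together by Tungsten's whole-stage
    // code generation.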
Also see: How to enable Tungsten optimization in Spark 2?
Tungsten became the default in Spark 1.5 and can be enabled in an earlier version by setting spark.sql.tungsten.enabled to true.
Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.
To make sure your code benefits as much as possible from Tungsten optimizations, try to use the Dataset API with Scala (instead of RDDs).
Dataset brings the best of both worlds, with a mix of relational (DataFrame) and functional (RDD) transformations. The Dataset APIs are the most up to date and add type safety, better error handling, and far more readable unit tests.
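As a small illustration of that mix (a sketch with a made-up case class, run in spark-shell):

    import spark.implicits._

    case class Sale(region: String, amount: Double)   // hypothetical schema

    val sales = Seq(Sale("EU", 10.0), Sale("US", 25.0), Sale("EU", 5.0)).toDS()

    // Functional (RDD-style) transformation, still planned by Catalyst/Tungsten:
    val bigSales = sales.filter(_.amount > 7.5)

    // Relational (DataFrame-style) transformation on the same Dataset:
    val byRegion = sales.groupBy("region").sum("amount")

    // A typo such as `_.amout` fails at compile time here, not at runtime.
    bigSales.show()
    byRegion.show()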
I am using a SparkSession to connect to a Hive database, and I'm trying to decide on the best way to enrich the data. I was going to use Spark SQL, but I am wary of using it.
Does Spark SQL just call Hive SQL? Would that mean there is no performance improvement from using Spark?
If not, should I just send one large SQL query to Spark, or should I grab the table I want, convert it to a DataFrame, and manipulate it using Spark's functions?
No, Spark will read the data from Hive but use its own execution engine. Performance and capabilities will differ. How much depends on the execution engine you are using for Hive (M/R, Tez, Spark, LLAP?).
Those are the same thing. I would stick to SQL queries and A/B-test against Hive in the beginning, but SQL is notoriously difficult to maintain, whereas Scala/Python code using Spark's Dataset API is more user-friendly in the long term (see the sketch below).
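A minimal sketch of the DataFrame-style enrichment being suggested (the database, table, and column names are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Spark reads the Hive metastore and table data, but runs the query on its own engine.
    val spark = SparkSession.builder()
      .appName("hive-enrichment")
      .enableHiveSupport()
      .getOrCreate()

    val customers = spark.table("warehouse.customers")

    // Enrichment with DataFrame functions instead of one large SQL string:
    val enriched = customers
      .withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))
      .filter(col("active") === true)

    enriched.write.mode("overwrite").saveAsTable("warehouse.customers_enriched")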
While learning Spark SQL, a question came to my mind:
As documented, the result of SQL execution is a SchemaRDD, but what happens behind the scenes? How many transformations or actions are invoked in the optimized execution plan, and is that plan equivalent to hand-written RDD code?
If we write the code by hand instead of using SQL, it may generate some intermediate RDDs, e.g. a series of map() and filter() operations on the source RDD. But the SQL version would not generate intermediate RDDs, correct?
Depending on the SQL content, the generated bytecode also involves partitioning and shuffling, correct? But without intermediate RDDs, how can Spark schedule and execute them on worker machines?
In fact, I still cannot understand the relationship between Spark SQL and Spark Core. How do they interact with each other?
To understand how Spark SQL or the DataFrame/Dataset DSL maps to RDD operations, look at the physical plan Spark generates using explain:
    spark.sql("<your SQL here>").explain()
    myDataframe.explain()
At the very core of Spark, RDD[_] is the underlying data type that is manipulated using distributed operations. In Spark versions <= 1.6.x, DataFrame is essentially RDD[Row] with a schema, and Dataset is a separate class. In Spark versions >= 2.x, DataFrame becomes an alias for Dataset[Row]. None of that changes the fact that, underneath it all, Spark executes RDD operations.
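A quick way to see that relationship from spark-shell (a throwaway example):

    val df = spark.range(100).toDF("id")   // a DataFrame, i.e. Dataset[Row] in 2.x

    // The same data exposed as an RDD[Row]; DataFrame/Dataset operations are
    // ultimately scheduled and executed as RDD stages on the workers.
    val asRdd = df.rdd
    println(asRdd.getNumPartitions)

    // The physical plan printed here is what gets translated into those RDD operations.
    df.explain(true)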
For a deeper dive into understanding Spark execution, read Understanding Spark Through Visualization.
While fetching and manipulating data from HBase using Spark, which is faster: a Spark SQL join or a Spark DataFrame join?
RDDs always outperform DataFrames and Spark SQL, but in my experience DataFrames perform well compared to Spark SQL. The link below gives some insight into this.
Spark RDDs vs DataFrames vs SparkSQL
As far as I can tell, they should behave the same with regard to performance; SQL is internally executed as a DataFrame.
I don't have access to a cluster to test this properly, but I imagine that Spark SQL just compiles down to the native DataFrame code.
The rule of thumb I've heard is that SQL should be used for exploration and DataFrame operations for production code.
Spark SQL brings a powerful new optimization framework called Catalyst. Using Catalyst, Spark can automatically transform SQL queries so that they execute more efficiently.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations; it provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) along with the benefits of Spark SQL's optimized execution engine.
The execution speed will be the same, because they use the same optimization algorithms.
If the join might be shared across queries, a carefully implemented join with RDDs might be a good option. If that is not the case, however, let Spark/Catalyst do its job and join within Spark SQL. It will do all the optimization, so you don't have to maintain your own join logic.
A Spark SQL join and a Spark DataFrame join are almost the same thing. The join is actually delegated to RDD operations under the hood; on top of the RDD operations we have convenience layers such as Spark SQL, DataFrames, and Datasets. In the case of Spark SQL, it just spends a tiny amount of extra time parsing the SQL (see the sketch after this answer).
It should be evaluated more in terms of good programming practice. I like Datasets because you can catch syntax errors at compile time, and the encoders behind the scenes take care of compacting the data and executing the query.
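To see that equivalence, and that the SQL parse step is the only real difference, you can compare the physical plans yourself (a sketch with hypothetical table and column names):

    val left  = spark.table("hbase_orders")
    val right = spark.table("hbase_customers")
    left.createOrReplaceTempView("orders")
    right.createOrReplaceTempView("customers")

    // Spark SQL join:
    val sqlJoin = spark.sql(
      "SELECT o.order_id, c.name FROM orders o JOIN customers c ON o.cust_id = c.id")

    // DataFrame join:
    val dfJoin = left
      .join(right, left("cust_id") === right("id"))
      .select(left("order_id"), right("name"))

    // Both go through Catalyst and yield essentially the same physical plan.
    sqlJoin.explain()
    dfJoin.explain()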
I did some performance analysis of SQL vs. DataFrame on Cassandra using Spark, and I think it will be the same for HBase as well.
In my tests, SQL works faster than the DataFrame approach. The reason might be that in the DataFrame approach a lot of Java objects are involved, whereas in the SQL approach everything is done in memory.