Large Query or mutate Dataframe? - apache-spark

I am using a SparkSession to connect to a Hive database, and I'm trying to decide on the best way to enrich the data. I was using Spark SQL, but I am wary of using it.
Does Spark SQL just call Hive SQL? Would that mean there is no performance improvement from using Spark?
If not, should I just submit one large SQL query to Spark, or should I grab the table I want, convert it to a DataFrame, and manipulate it using Spark's functions?

No, Spark will read the data from Hive but use its own execution engine, so performance and capabilities will differ. How much depends on the execution engine you are using for Hive (M/R, Tez, Spark, LLAP?).
That's essentially the same thing. I would stick to SQL queries and A/B-test against Hive in the beginning, but SQL is notoriously difficult to maintain, whereas Scala/Python code using Spark's Dataset API is more user-friendly in the long term.
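As a minimal sketch of the two options (the sales table and its columns here are hypothetical), the same enrichment can be written either as a SQL string or with the DataFrame API; both run on Spark's engine and go through the same optimizer:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("enrichment-example")
      .enableHiveSupport()              // read tables through the Hive metastore
      .getOrCreate()

    // Option 1: push one SQL query through spark.sql (hypothetical "sales" table)
    val bySql = spark.sql(
      """SELECT customer_id, amount, amount * 0.1 AS loyalty_points
        |FROM sales
        |WHERE amount > 0""".stripMargin)

    // Option 2: the same enrichment with the DataFrame API
    val byApi = spark.table("sales")
      .where(col("amount") > 0)
      .withColumn("loyalty_points", col("amount") * 0.1)
      .select("customer_id", "amount", "loyalty_points")

    // Both go through Catalyst; compare the plans before A/B-testing against Hive
    bySql.explain()
    byApi.explain()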

Related

Is there a way to read data without SQL in Spark?

I am a beginner in Spark and was given an assignment to read data from a CSV file and run some queries over it using Spark Core.
However, every online resource I search uses some form of SQL from the pyspark.sql module.
Is there any way to read data and query it (select, count, group by) using only Spark Core?
The central concept of Spark Core is the RDD. The Spark documentation has more information and examples of processing text files with it.
Still, it is good practice to use Spark DataFrames instead of Spark RDDs.
Spark DataFrames use the Catalyst optimizer, which automatically rewrites your code internally in the best way to improve performance.
https://blog.bi-geek.com/en/spark-sql-optimizador-catalyst/
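For the "only Spark Core" part of the question, here is a minimal Scala sketch (the file path and column layout are assumptions; the same RDD calls exist in PySpark) that reads a CSV as text and does a select, a count, and a group-by purely with RDD operations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark Core only: no SparkSession, no pyspark.sql / DataFrames
    val sc = new SparkContext(new SparkConf().setAppName("rdd-only"))

    // Hypothetical CSV with a header line: id,category,amount
    val lines  = sc.textFile("hdfs:///data/sample.csv")
    val header = lines.first()
    val rows   = lines.filter(_ != header).map(_.split(","))

    // "select" one column, count all rows, group by category with a count
    val categories  = rows.map(cols => cols(1))
    val totalRows   = rows.count()
    val countsByCat = categories.map(c => (c, 1L)).reduceByKey(_ + _)

    countsByCat.collect().foreach { case (cat, n) => println(s"$cat: $n") }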

Spark SQL vs Spark DataFrame API

Can anyone explain when to use Spark SQL (plain SQL queries) and when to use the Spark DataFrame methods? I see that we can do every operation with Spark SQL.
Which is better in terms of performance?
They are both equally performant.
Using the DataFrame APIs gives you type safety, and the queries can be further optimized by the SQL engine/query planner.
From a usage perspective, it's hard to catch a syntax error before runtime in Spark SQL, while with the DataFrame APIs we can catch those at compile time.
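A small illustration of that last point (the table name below is made up): a misspelled keyword inside a SQL string only surfaces when the string is parsed at runtime, whereas a misspelled DataFrame method is rejected by the compiler.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("syntax-errors").getOrCreate()
    val df = spark.table("orders")   // hypothetical table

    // Spark SQL: the typo "SELCT" compiles fine and only fails with a
    // ParseException when the job actually runs
    // spark.sql("SELCT order_id FROM orders")

    // DataFrame API: a misspelled method does not even compile
    // df.selct("order_id")
    df.select(col("order_id"))       // compiles and runs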

Spark connecting to Hive on HDFS vs Spark connecting to HDFS directly with Hive on top of it?

Summary of the problem:
I have a particular use case where I need to write >10 GB of data per day to HDFS via Spark Streaming. We are currently in the design phase. We want to write the data to HDFS (a constraint) using Spark Streaming. The data is columnar.
We have 2 options (so far):
Naturally, I would like to use a Hive context to feed the data to HDFS. The schema is defined, and the data is fed in batches or row-wise.
There is another option: we can write the data directly to HDFS using the Spark Streaming API. We are also considering this because in this use case we can then query the data from HDFS through Hive. That would leave options open to use other technologies in the future for new use cases that may come up.
What is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive, or other technologies.
Thanks.
So far I have not found a discussion of this topic; my research may have been too brief. If there is any article you can suggest, I would be most happy to read it.
I have a particular use case to write >10 GB of data per day and the data is columnar
That means you are storing day-wise data. If that's the case, Hive can have the date as a partition column, so that you can query each day's data easily. You can query the raw data from BI tools like Looker or Presto or any other BI tool, and if you are querying from Spark you can use Hive features/properties. Moreover, if you store the data in a columnar format such as Parquet, Impala can query the data using the Hive metastore.
If your data is columnar, consider Parquet or ORC.
Regarding option 2:
If Hive is an option, there is no need to feed the data through Hive; write it to HDFS, create an external table in Hive over it, and access it from there.
Conclusion:
I feel both are the same, but Hive is preferred considering direct querying of the raw data using BI tools or Spark. We can also query data on HDFS using Spark, so if it is stored in a format like JSON, Parquet, or XML there is no added advantage to option 2.
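As a rough sketch of option 2 under assumed names (the Kafka source, paths, and schema here are all hypothetical, and this uses Structured Streaming rather than the older DStream API), Spark can stream date-partitioned Parquet straight to HDFS, and Hive can then expose it through an external table:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("stream-to-hdfs").getOrCreate()

    // Hypothetical source: a Kafka topic carrying the records
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload")
      .withColumn("dt", current_date())        // partition column

    // Write date-partitioned Parquet directly to HDFS
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .partitionBy("dt")
      .start()

    // Hive side (run once): an external table over the same location, e.g.
    //   CREATE EXTERNAL TABLE events (payload STRING)
    //   PARTITIONED BY (dt DATE)
    //   STORED AS PARQUET
    //   LOCATION 'hdfs:///data/events';
    // then MSCK REPAIR TABLE events (or ALTER TABLE ... ADD PARTITION) as new days arrive.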
It depends on your final use cases. Please consider the two scenarios below while taking the decision:
If you have an RT/NRT case and all of your data is a full refresh, then I would suggest going with the second approach, Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach, Spark Streaming -> Hive -> HDFS -> Consumed by Hive, since there is one less layer in it.
If your data is incremental and you also have many update and delete operations, then it will be difficult to use HDFS, or Hive over HDFS, with Spark, since Spark does not let you update or delete data on HDFS. In that case both of your approaches will be difficult to implement. Either you can go with Hive managed tables and do updates/deletes using HQL (only supported in the Hortonworks Hive distribution), or you can go with a NoSQL database like HBase or Cassandra so that Spark can do upserts and deletes easily. From a programming perspective it will also be easier than either of your approaches.
If you dump the data into NoSQL, you can still use Hive on top of it for normal SQL or reporting purposes.
There are so many tools and approaches available, but go with the one that fits all of your cases. :)

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw, and application layers in HDFS and does CDC (change data capture); it is currently written in Hive queries and executed via Oozie. This needs to migrate into a Spark application (currently version 1.6). The other sections of the code will migrate later on.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as-is (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame APIs and rewrite the HQL that way.
What is the difference between these two approaches?
Is there any performance gain from using the DataFrame APIs?
Some people have suggested that there is an extra layer of SQL that the Spark core engine has to go through when using "SQL" queries directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame APIs, but when I have all my HQL queries handy, is it really worth writing everything out in the DataFrame API?
Thank You.
Question: What is the difference between these two approaches? Is there any performance gain from using the DataFrame APIs?
Answer:
There is a comparative study done by Hortonworks (source...).
The gist is that each one is right depending on the situation/scenario; there is no hard and fast rule to decide this. Please go through the points below.
RDDs, DataFrames, and SparkSQL (in fact three approaches, not just two):
At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDD’s:
Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
Dataset - the initial data can come from files, be created programmatically, from data in memory, or from another RDD
DataFrames API is a data abstraction framework that organizes your data into named columns:
Create a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL like data manipulations and aggregations
Under the hood, it is an RDD of Row objects
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:
SQL
DataFrames API
Datasets API
Test results:
RDDs outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed almost the same, although with analyses involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDDs
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
Random lookup against 1 order ID from 9 million unique order IDs
GROUP all the different products with their total COUNTS and SORT DESCENDING by product name (sketched below)
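As a rough sketch of what that second test looks like in the two higher-level APIs (the products table and its column names are assumptions), both express the same aggregation:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("groupby-test").getOrCreate()
    val products = spark.table("products")   // hypothetical table with a product_name column

    // DataFrame API version of the GROUP / COUNT / SORT DESCENDING test
    val byDf = products
      .groupBy("product_name")
      .count()                                // adds a "count" column
      .orderBy(desc("product_name"))

    // SparkSQL version of the same test
    products.createOrReplaceTempView("products_v")
    val bySql = spark.sql(
      """SELECT product_name, COUNT(*) AS cnt
        |FROM products_v
        |GROUP BY product_name
        |ORDER BY product_name DESC""".stripMargin)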
In your Spark SQL string queries, you won't know a syntax error until runtime (which could be costly), whereas in DataFrames syntax errors can be caught at compile time.
A couple more additions: DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with SparkSQL.
If a query is lengthy, writing and running it efficiently may not be possible.
On the other hand, the DataFrame and Column APIs help the developer write compact code, which is ideal for ETL applications.
Also, every operation run through a DataFrame (e.g. greater than, less than, select, where, etc.) builds an abstract syntax tree (AST), which is then passed to Catalyst for further optimization. (Source: the Spark SQL whitepaper, Section 3.3)
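A quick way to see that both paths end up in Catalyst (the tiny example below is self-contained and hypothetical) is to compare the plans that explain() prints for an equivalent SQL string and DataFrame expression:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("explain-demo").getOrCreate()
    val df = spark.range(1000).withColumn("even", col("id") % 2 === 0)
    df.createOrReplaceTempView("numbers")

    // explain(true) prints the parsed, analyzed, and optimized logical plans plus the
    // physical plan; the optimized plans for the two queries come out the same
    spark.sql("SELECT id FROM numbers WHERE even").explain(true)
    df.where(col("even")).select("id").explain(true)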

Which query to use for better performance, join in SQL or using Dataset API?

While fetching and manipulating data from HBase using Spark, which one is faster: a Spark SQL join or a Spark DataFrame join?
RDDs always outperform DataFrames and SparkSQL, but from my experience DataFrames perform well compared to SparkSQL; DataFrame functions perform well compared to Spark SQL. The link below will give some insight into this.
Spark RDDs vs DataFrames vs SparkSQL
As far as I can tell, they should behave the same with regard to performance. Internally, the SQL will work as a DataFrame.
I don't have access to a cluster to test this properly, but I imagine that Spark SQL just compiles down to the native DataFrame code.
The rule of thumb I've heard is that SQL should be used for exploration and DataFrame operations for production code.
Spark SQL brings a powerful new optimization framework called Catalyst. Using Catalyst, Spark can automatically transform SQL queries so that they execute more efficiently.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations, that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The execution speed will be the same, because they use same optimization algorithms.
If the join might be shared across queries, a carefully implemented join with RDDs might be a good option. However, if this is not the case, let Spark/Catalyst do its job and join within Spark SQL; it will do all the optimization, so you won't have to maintain your join logic yourself.
A Spark SQL join and a Spark DataFrame join are almost the same thing. The join is actually delegated to RDD operations under the hood; on top of the RDD operations we have convenience layers like Spark SQL, DataFrames, and Datasets. In the case of Spark SQL, a tiny amount of extra time is spent parsing the SQL.
It should be evaluated more in terms of good programming practice. I like Datasets because you can catch syntax errors at compile time, and the encoders behind the scenes take care of compacting the data and executing the query.
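To make the "almost the same thing" point concrete (the tables and key names below are made up; in the HBase case they would come from whichever connector you use), the two join styles below should produce the same physical plan, which you can check with explain():

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-compare").getOrCreate()

    // Hypothetical inputs: users(user_id, name) and orders(user_id, amount)
    val users  = spark.table("users")
    val orders = spark.table("orders")

    // DataFrame join
    val dfJoin = orders.join(users, Seq("user_id"))

    // Spark SQL join over the same data
    users.createOrReplaceTempView("users_v")
    orders.createOrReplaceTempView("orders_v")
    val sqlJoin = spark.sql(
      """SELECT o.user_id, o.amount, u.name
        |FROM orders_v o JOIN users_v u ON o.user_id = u.user_id""".stripMargin)

    // Apart from column ordering, the physical plans should match
    dfJoin.explain()
    sqlJoin.explain()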
I did some performance analysis of SQL vs DataFrames on Cassandra using Spark; I think it will be the same for HBase as well.
According to my results, SQL works faster than the DataFrame approach. The reason behind this might be that in the DataFrame approach there are a lot of Java objects involved, while in the SQL approach everything is done in memory.
Attaching results.
