Is there a way to read data without SQL in Spark? - apache-spark

I am a beginner in Spark and was given an assignment to read data from a CSV file and query it using Spark Core.
However, every online resource that I find uses some form of SQL from the pyspark.sql module.
Is there any way to read data and run queries (select, count, group by) using only Spark Core?

Spark Core is built around the RDD concept. Here you can find more information and examples of processing text files.
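As a rough sketch of what the question asks for, the function below reads a CSV file and does a select / group by / count using only RDD operations (textFile, filter, map, reduceByKey). The helper names parse_csv_line and count_by_column are made up for illustration, and sc is assumed to be an existing SparkContext. It also assumes the first line is a header and that no quoted field spans several lines.

```python
import csv
from operator import add


def parse_csv_line(line):
    """Split one CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader([line]))


def count_by_column(sc, path, column):
    """Count rows per distinct value of `column` using only RDD operations.

    `sc` is assumed to be a pyspark SparkContext; `path` and `column`
    are whatever file and header name apply to your data.
    """
    lines = sc.textFile(path)                    # RDD of raw text lines
    header = lines.first()                       # first line holds the column names
    idx = parse_csv_line(header).index(column)   # position of the wanted column

    rows = (lines
            .filter(lambda line: bool(line.strip()) and line != header)  # skip blanks and header
            .map(parse_csv_line))                # RDD of field lists

    return (rows
            .map(lambda fields: (fields[idx], 1))  # "select" the column, pair each value with 1
            .reduceByKey(add)                      # "group by" the value and "count"
            .collect())                            # list of (value, count) tuples on the driver
```

With a context created as sc = SparkContext("local[*]", "rdd-only-csv") from the pyspark package, calling count_by_column(sc, "people.csv", "city") would return one (city, count) pair per distinct city. The file name and column here are hypothetical.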

It's good practice to use Spark DataFrames instead of Spark RDDs.
Spark DataFrames use the Catalyst optimizer, which automatically rewrites the query plan internally to improve performance.
https://blog.bi-geek.com/en/spark-sql-optimizador-catalyst/

Related

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw and application layers in HDFS and performs CDC (change data capture). It is currently written as Hive queries and executed via Oozie. This needs to migrate to a Spark application (current version 1.6). The other sections of the code will migrate later on.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as is (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame APIs and rewrite the HQL that way.
What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Some people suggested that there is an extra layer of SQL that the Spark core engine has to go through when using "SQL" queries directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame APIs, but when I have my HQL queries all handy, would it really be worth rewriting the complete code with the DataFrame API?
Thank You.
Question : What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Answer :
There is a comparative study done by Hortonworks. source...
The gist is that, depending on the situation, each one is right. There is no
hard and fast rule to decide this. Please go through the points below.
RDDs, DataFrames, and SparkSQL (in fact 3 approaches, not just 2):
At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs:
Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
Dataset - initial data can come from files, be created programmatically, from data in memory, or from another RDD
DataFrames API is a data abstraction framework that organizes your data into named columns:
Create a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL like data manipulations and aggregations
Under the hood, it is an RDD of Rows
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:
SQL
DataFrames API
Datasets API
Test results:
RDDs outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDDs
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
Random lookup against 1 order ID from 9 Million unique order ID's
GROUP all the different products with their total COUNTS and SORT DESCENDING by product name
In your Spark SQL string queries, you won't know a syntax error until runtime (which could be costly), whereas in DataFrames syntax errors can be caught at compile time.
A couple more additions: DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with SparkSQL.
If a query is lengthy, writing and running it efficiently as a single SQL string becomes difficult.
On the other hand, the DataFrame API, along with the Column API, helps developers write compact code, which is ideal for ETL applications.
Also, all operations (e.g. greater than, less than, select, where etc.) run using a "DataFrame" build an "Abstract Syntax Tree (AST)", which is then passed to "Catalyst" for further optimizations. (Source: Spark SQL Whitepaper, Section#3.3)

Large Query or mutate Dataframe?

I am using a SparkSession to connect to a Hive database. I'm trying to decide what the best way to enrich the data is. I was using Spark SQL, but I am wary of using it.
Does Spark SQL just call Hive SQL? Would that mean there is no improved performance from using Spark?
If not, should I just pass a large SQL query to Spark, or should I grab the table I want, convert it to a DataFrame and manipulate it using Spark's functions?
No, Spark will read the data from Hive, but use its own execution engine. Performance and capabilities will differ. How much depends on the execution engine you are using for Hive. (M/R, Tez, Spark, LLAP?)
That's the same thing. I would stick to SQL queries and A/B test against Hive in the beginning, but SQL is notoriously difficult to maintain, whereas Scala/Python code using Spark's Dataset API is more user friendly in the long term.

Spark DataFrame vs sqlContext

For the purposes of comparison, suppose we have a table "T" with two columns "A","B". We also have a hiveContext operating in some HDFS database. We make a data frame:
In theory, which of the following is faster:
sqlContext.sql("SELECT A,SUM(B) FROM T GROUP BY A")
or
df.groupBy("A").sum("B")
where "df" is a dataframe referring to T. For these simple kinds of aggregate operations, is there any reason why one should prefer one method over the other?
No, these should boil down to the same execution plan. Underneath, the Spark SQL engine uses the same optimization engine, the Catalyst optimizer. You can always check this yourself by looking at the Spark UI, or even by calling explain on the resulting DataFrame.
Spark developers have made a great effort to optimise this. The performance of DataFrame code in Scala and of SQL is indistinguishable. Even for DataFrame code in Python, the difference only appears when collecting data to the driver.
It opens a new world
It doesn't have to be one vs. another
We can just choose whichever way we are comfortable with
The performance comparison published by Databricks
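A minimal sketch of that check in PySpark, assuming a Spark 2.x or later SparkSession passed in as spark (the question itself uses the older sqlContext). The function name compare_plans and the three sample rows standing in for table T are made up for illustration.

```python
def compare_plans(spark):
    """Run the same aggregation through SQL and the DataFrame API and print both plans.

    `spark` is assumed to be a pyspark.sql.SparkSession.
    """
    # Hypothetical sample data standing in for table T with columns A and B.
    df = spark.createDataFrame([("x", 1), ("x", 2), ("y", 5)], ["A", "B"])
    df.createOrReplaceTempView("T")

    via_sql = spark.sql("SELECT A, SUM(B) FROM T GROUP BY A")
    via_api = df.groupBy("A").sum("B")

    # Each call prints the physical plan chosen for that query.
    via_sql.explain()
    via_api.explain()

    # Collect both results as sorted (A, sum) tuples so they can be compared directly.
    def to_pairs(result):
        return sorted(tuple(row) for row in result.collect())

    return to_pairs(via_sql), to_pairs(via_api)
```

A session for trying this locally can be created with SparkSession.builder.master("local[*]").getOrCreate() from pyspark.sql. The two printed plans can then be read side by side, and the two returned lists should hold the same rows.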

Spark SQL: how does it map to RDD operations?

While learning Spark SQL, I have a question in mind:
As said, the result of executing SQL is a SchemaRDD, but what happens behind the scenes? How many transformations or actions are invoked in the optimized execution plan, which should be equivalent to plain hand-written RDD code?
If we write the code by hand instead of using SQL, it may generate some intermediate RDDs, e.g. a series of map(), filter() operations upon the source RDD. But the SQL version would not generate intermediate RDDs, correct?
Depending on the SQL content, the generated VM bytecode also involves partitioning and shuffling, correct? But without intermediate RDDs, how could Spark schedule and execute them on worker machines?
In fact, I still cannot understand the relationship between Spark SQL and Spark Core. How do they interact with each other?
To understand how SparkSQL or the dataframe/dataset DSL maps to RDD operations, look at the physical plan Spark generates using explain.
sql(/* your SQL here */).explain
myDataframe.explain
At the very core of Spark, RDD[_] is the underlying datatype that is manipulated using distributed operations. In Spark versions <= 1.6.x DataFrame is RDD[Row] and Dataset is separate. In Spark versions >= 2.x DataFrame becomes Dataset[Row]. That doesn't change the fact that underneath it all Spark uses RDD operations.
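In PySpark, that underlying RDD is reachable from a DataFrame through its rdd attribute, which yields an RDD of Row objects. The sketch below uses a hypothetical helper name, column_values_via_rdd, and assumes df is an existing DataFrame.

```python
def column_values_via_rdd(df, column):
    """Read one column by dropping from the DataFrame down to its underlying RDD.

    `df` is assumed to be a pyspark.sql.DataFrame and `column` one of its column names.
    """
    rows = df.rdd                                         # RDD whose elements are Row objects
    return rows.map(lambda row: row[column]).collect()    # plain RDD map, then gather on the driver
```

For a DataFrame with a column named A, column_values_via_rdd(df, "A") would return that column's values as a Python list.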
For a deeper dive into understanding Spark execution, read Understanding Spark Through Visualization.

Which query to use for better performance, join in SQL or using Dataset API?

When fetching and manipulating data from HBase using Spark, which is faster: a Spark SQL join or a Spark DataFrame join?
RDDs always outperform DataFrames and SparkSQL, but in my experience DataFrame functions perform well compared to Spark SQL. The link below gives some insight into this.
Spark RDDs vs DataFrames vs SparkSQL
As far as I can tell, they should behave the same with regard to performance. SQL internally works as a DataFrame.
I don't have access to a cluster to properly test but I imagine that the Spark SQL just compiles down to the native data frame code.
The rule of thumb I've heard is that the SQL code should be used for exploration and dataframe operations for production code.
Spark SQL brings a powerful new optimization framework called Catalyst. Using Catalyst, Spark can automatically transform SQL queries so that they execute more efficiently.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. The Dataset interface provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine.
The execution speed will be the same, because they use the same optimization algorithms.
If the join might be shared across queries, a carefully implemented join with RDDs might be a good option. However, if this is not the case, let Spark/Catalyst do its job and join within Spark SQL. It will do all the optimization, so you won't have to maintain your own join logic.
A Spark SQL join and a Spark DataFrame join are almost the same thing. The join is actually delegated to RDD operations under the hood. On top of the RDD operations we have convenience layers like Spark SQL, DataFrames or Datasets. In the case of Spark SQL, it needs to spend a tiny amount of extra time parsing the SQL.
It should be evaluated more in terms of good programming practice. I like Datasets because you can catch syntax errors while compiling, and the encoders behind the scenes take care of compacting the data and executing the query.
I did some performance analysis of SQL vs DataFrame on Cassandra using Spark, and I think it will be the same for HBase as well.
In my opinion, SQL works faster than the DataFrame approach. The reason might be that the DataFrame approach involves a lot of Java objects, whereas in the SQL approach everything is done in memory.
Attaching results.
