How to execute CQL query using pyspark - apache-spark

I want to execute Cassandra CQL query using PySpark.But I am not finding the way to execute it.I can load whole table to dataframe and create Tempview and query it.
df = spark.read.format("org.apache.spark.sql.cassandra").
options(table="country_production2",keyspace="country").load()
df.createOrReplaceTempView("Test")
Please suggest any better way to so that I can execute CQL query in PySpark.

Spark SQL doesn't support Cassandra's cql dialects directly. It only allows you to load the table as a Dataframe and operate on it.
If you are concerned about reading a whole table to query it, then you may use the filters as given below to let Spark push the predicates the load only the data you need.
from pyspark.sql.functions import *
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table=table_name, keyspace=keys_space_name)\
.load()\
.filter(col("id")=="A")
df.createOrReplaceTempView("Test")

In pyspark you're using SQL, not CQL. If the SQL query somehow matches the CQL, i.e., you're querying by partition or primary key, then Spark Cassandra Connector (SCC) will transform query into that CQL, and execute (so-called predicates pushdown). If it doesn't match, then Spark will load all data via SCC, and perform filtering on the Spark level.
So after you're registered temporary view, you can do:
val result = spark.sql("select ... from Test where ...")
and work with results in result variable. To check if predicates pushdown happens, execute result.explain(), and check for the * marker in the conditions in the PushedFilters section.

Related

How to paginate hive table in spark?

I wanted to do a pagination on a hive table having ~1.5 billion rows using pyspark. I came across one solution using ROW_NUMBER(). When I tried it, I am running out memory. Not sure whether spark is trying to bring in the complete table to it's memory and then doing a pagination.
After that, I came across this LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it. But it failed in spark, the reason which I figured out was that hiveQL is not completely supported in spark.sql(). Spark SQL limit does not support multiple arguments for offset -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
Is there a good approach where in I can do pagination using spark?
PS: The hive table does not have an ID column, with which I can sort and do a pagination. :)
basic use of spark :
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want

Should we create separate dataframe for each table in a join query in SparkSQL

We need to convert and execute execute hive queries in Spark SQL.The query involves a join between 2 tables.We will create a dataframe and then sparksql queries on top of it.Please find samples hive query along with converted query.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6.b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1=hivecontext.sql("select col1,col2,col3,col4 from table1");
val q2=hivecontext.sql("select col3,col5,col6,col7 from table2");
val q3=q1.join(q2,q1("col3")===q2("col3"));
But it is also possible for us to execute the entire query in a single data frame as below
**
val q5=hivecontext.sql("select
a.col1,a.col2,a.col3,b.col4,b.col5,b.col6.b.col7
from table1 a left outer join table2 b
on a.col3=b.col3")**
I would like to know which of the 2 approach(single vs multiple dataframe) we is better to use in such situation and the advantages over the other in various parameters like performance and readability.
Second approach seems to be wise in all aspects
When you run SQL on top of Hive data, HiveContext will run the query in hive and returns the result metadata to Spark. So spark just need to store the resultant metadata set.But in the above case it has to store all the data in hive into its RDD's.
Maintaining a single RDD helps in optimizing DAG as well.
If you run as a single query even Spark catalyst will optimize it more.
It looks even better for Readability.
Both the approaches are identical. It doesn't matter really from the performance standpoint. Catalyst optimizer will create the same physical plan for both the queries.
Now however there are other aspects to consider. Writing SQL query is generally easy however you loose the compile time type check. If you have a typo or incorrect column name in the SQL it is impossible to find unless you run that on the cluster. However, if you are using dataframe operation the code won't compile. So it helps faster coding speed.
But again writing complex SQL with dataframe APIs is not trivial tasks. So generally I use Dataframe APIs where the operations are relatively easy and use SQL for complex queries.

Spark HiveContext : Spark Engine OR Hive Engine?

I am trying to understand spark hiveContext.
when we write query using hiveContext like
sqlContext=new HiveContext(sc)
sqlContext.sql("select * from TableA inner join TableB on ( a=b) ")
Is it using Spark Engine OR Hive Engine?? I believe above query get executed with Spark Engine. But if thats the case why we need dataframes?
We can blindly copy all hive queries in sqlContext.sql("") and run without using dataframes.
By DataFrames, I mean like this TableA.join(TableB, a === b)
We can even perform aggregation using SQL commands. Could any one Please clarify the concept? If there is any advantage of using dataframe joins rather that sqlContext.sql() join?
join is just an example. :)
The Spark HiveContext uses Spark execution engine underneath see the spark code.
Parser support in spark is pluggable, HiveContext uses spark's HiveQuery parser.
Functionally you can do everything with sql and Dataframes are not needed. But dataframes provided a convenient way to achieve the same results. The user doesn't need to write a SQL statement.

How to translate HiveQL query to corresponding DataFrame operation?

In my usecase, I was using hivecontext inside myRDD.map() function. I got error that java.lang nullpointerexception. I realized, it is not possible to use hiveContext inside the map logic. The hivecontext was used to fire a hiveql query to another table (conf). hiveContext.sql(). The query is like this
select config_date, filefilter, family, id from mydb.conf where
id == 178 and config_date < cast("2011-02-04 13:05:41.0" as
timestamp) and family == "drf" order by config_date desc limit 1
I have decided to create a dataframe of this table before the start of the map process in the driver code. And perform dataframe operations inside the map logic. Basically, want to do method calls over dataframe instead of using hivecontext to query.
Is it possible? Can someone help me out here how to replicate this query over dataframe?
Yes, translating your Hive query to Dataset is perfectly possible.
You can just spark.sql(yourQueryHere) or rewrite the query to use Spark SQL's Dataset API.
Just load your Hie table using spark.read.table("mydb.conf") and do the filtering and ordering.
val conf = spark.
read.
table("mydb.conf").
select("config_date", "filefilter", "family", "id").
... // you know the rest
You can then join this Dataset with the other and apply joined transformation that will will avoid using hiveContext inside map.

Spark DataFrames: registerTempTable vs not

I just started with DataFrame yesterday and am really liking it so far.
I dont understand one thing though...
(Referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the dataframe is registered as a table (I am guessing to provide access to SQL queries..?) but the exact same information that is being accessed can also be done by peopleDataFrame.select("name").
So question is.. When would you want to register a dataframe as a table instead of just using the given dataframe functions? And is one option more efficient than the other?
The reason to use the registerTempTable( tableName ) method for a DataFrame, is so that in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method, that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext( sc )
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable( "cust" )
val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql( query )
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, that I suspect would have been very difficult to do in Spark.
If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than available via a plain SQLContext.
It's convenient to load the dataframe into a temp view in a notebook for example, where you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another notebook you can run a sql query and get all the nice integration features that come out of the box e.g. table and graph visualisation etc.
%sql
SELECT * FROM myTempView

Resources