Spark SQL - Options for deploying SQL queries on Spark Streams - apache-spark

I'm new to Spark and would like to run a Spark SQL query over Spark streams.
My current understanding is that I would need to define my SQL query in the code of my Spark job, as this snippet lifted from the Spark SQL home page shows:
spark.read.json("s3n://...")
  .registerTempTable("json")
results = spark.sql(
  """SELECT *
     FROM people
     JOIN json ...""")
What I want to do is define my query on its own somewhere - e.g. in a .sql file - and then deploy it over a Spark cluster.
Can anyone tell me if Spark currently has any support for this architecture? E.g. some API?

You can use Python's open to read the query from a file:
with open('filepath/filename.sql') as fr:
    query = fr.read()

x = spark.sql(query)
x.show(5)
You could pass filename.sql as an argument when submitting your job and pick it up with sys.argv.
Please refer to this link for more help: Spark SQL question
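For example, a minimal sketch of wiring this together with sys.argv (the script name, app name, and file path below are illustrative, not from the original post):
import sys
from pyspark.sql import SparkSession

# Sketch: run as  spark-submit run_query.py /path/to/filename.sql
spark = SparkSession.builder.appName("run-sql-file").getOrCreate()

# Read the query text from the file passed as the first argument.
with open(sys.argv[1]) as fr:
    query = fr.read()

spark.sql(query).show(5)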

Related

spark predicate push down is not working with phoenix hbase table

I am working on Spark-Hive-HBase integration. Here I am using a Phoenix HBase table for the integration.
Phoenix: apache-phoenix-4.14
HBase: hbase-1.4
Spark: spark-2.3
Hive: 1.2.1
I am using spark thrift server and accessing the table using jdbc.
Almost all the basic features I tested work fine, but when I submit a query from Spark with a WHERE condition, it is submitted to Phoenix without the WHERE condition and all the filtering happens on the Spark side.
If the table has billions of rows, we can't go with this.
example:
Input-query: select * from hive_hbase where rowid=0;
Query-submitted: PhoenixQueryBuilder: Input query : select /*+ NO_CACHE */ "rowid","load_date","cluster_id","status" from hive_hbase
Is it a bug?
Please suggest if there is any way to force the query to be submitted with the WHERE condition (filter), using JDBC only.
Thanks & Regards
Rahul
The above-mentioned behavior is not a bug but rather a feature of Spark: it makes sure that filtering on a non-rowkey column happens on Spark's side rather than on the DB side, so execution can finish quickly. If you still want to push the predicates down for all intents and purposes, you can use phoenix-spark, or else edit Spark's predicate pushdown code on your own. Below are links for your reference:
https://community.hortonworks.com/questions/87551/predicate-pushdown-support-in-hortonworks-hbase-co.html
http://www.waitingforcode.com/apache-spark-sql/predicate-pushdown-spark-sql/read
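If you go the phoenix-spark route, a minimal sketch of reading the table through the connector looks like the following (the table name and ZooKeeper quorum are placeholders, and the phoenix-spark jar must be on the classpath); simple filters on the resulting DataFrame can then be pushed down to Phoenix by the connector:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phoenix-read").getOrCreate()

# Read the Phoenix table through the phoenix-spark data source
# (table name and zkUrl below are placeholders).
df = (spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "HIVE_HBASE")
      .option("zkUrl", "zkhost:2181")
      .load())

# A simple rowkey filter like this can be pushed down by the connector.
df.filter(df["rowid"] == 0).show()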

How to query and join csv data with Hbase data in Spark Cluster in Azure

In Microsoft Azure, we can create a Spark cluster in Azure HDInsight and an HBase cluster in Azure HDInsight. I have now created these 2 kinds of clusters. For the Spark cluster, I can create a dataframe from a csv file and run an SQL query like this (the query below is executed in a Jupyter notebook):
%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"
Meanwhile, in the Spark shell, I can create a connector to the other HBase cluster and query a data table in that HBase cluster like this:
val query = spark.sqlContext.sql("select personalName, officeAddress from contacts")
query.show()
So, my question is: is there a way to do a join operation against these two tables? For example:
select * from hvac a inner join contacts b on a.id = b.id
I have just referenced these 2 documents from Microsoft Azure:
Run queries on Spark Cluster
Use Spark to read and write HBase data
Any ideas or suggestions for this?
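One possible approach, sketched below on the assumption that the Spark cluster can reach the HBase cluster through the connector so that both DataFrames exist in the same Spark session (the variable names hvac_df and contacts_df are illustrative), is to register both as temporary views and join them with Spark SQL:
# Sketch: hvac_df comes from the csv file, contacts_df from the HBase
# connector; both must live in the same Spark session.
hvac_df.createOrReplaceTempView("hvac")
contacts_df.createOrReplaceTempView("contacts")

joined = spark.sql("""
    SELECT *
    FROM hvac a
    INNER JOIN contacts b ON a.id = b.id
""")
joined.show()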

How can Hive on Spark read data from jdbc?

We are using Hive on Spark, and we want to do everything in Hive, using Spark to do the computation. That means we don't need to write map/reduce code, only SQL-like code.
And now we have a problem: we want to read a data source like PostgreSQL and control it with simple SQL code, and we want it to run on the cluster.
I had an idea: I could write some Hive UDFs that connect over JDBC and produce table-like data, but I found that they don't run as a Spark job, so they would be useless.
What we want is to type something like this in Hive:
hive>select myfunc('jdbc:***://***','root','pw','some sql here');
Then I could get a table in Hive and join it with others. Put another way, no matter what engine Hive uses, we want to read other data sources in Hive.
I don't know what to do now; maybe someone can give me some advice.
Is there any way to do something like this:
hive> select * from hive_table where hive_table.id in
(select myfunc('jdbcUrl','user','pw','sql'));
I know that Hive is used to compile the SQL into MapReduce jobs; what I want to know is how to make my SQL/UDF compile into a job that reads over JDBC the way spark.read().jdbc(...) does.
I think it's easier to load the data from the database into a dataframe, and then you can dump it to Hive if necessary.
Read this: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases
See the property named dbtable; you can load just part of a table by defining it as an SQL subquery.
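For example, a minimal sketch of what the answer above suggests (the connection details are placeholders, and the PostgreSQL JDBC driver must be on the classpath):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# dbtable may be a subquery, so only part of the source table is loaded.
jdbc_df = (spark.read
           .format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/mydb")
           .option("dbtable", "(SELECT id, name FROM people WHERE active) AS src")
           .option("user", "root")
           .option("password", "pw")
           .load())

# Dump it into Hive so it can be joined with existing Hive tables.
jdbc_df.write.mode("overwrite").saveAsTable("people_from_jdbc")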

Spark HiveContext : Spark Engine OR Hive Engine?

I am trying to understand Spark's HiveContext.
When we write a query using HiveContext, like
val sqlContext = new HiveContext(sc)
sqlContext.sql("select * from TableA inner join TableB on (a = b)")
is it using the Spark engine or the Hive engine? I believe the above query gets executed by the Spark engine. But if that's the case, why do we need DataFrames?
We could blindly copy all our Hive queries into sqlContext.sql("") and run them without using DataFrames.
By DataFrames, I mean something like TableA.join(TableB, a === b).
We can even perform aggregation using SQL commands. Could anyone please clarify the concept? Is there any advantage to using DataFrame joins rather than sqlContext.sql() joins?
join is just an example. :)
The Spark HiveContext uses the Spark execution engine underneath; see the Spark code.
Parser support in Spark is pluggable: HiveContext uses Spark's HiveQL parser.
Functionally you can do everything with SQL, and DataFrames are not needed, but DataFrames provide a convenient way to achieve the same results without having to write a SQL statement.
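As an illustration (the table and column names are hypothetical, and the SparkSession API is used for brevity), the SQL form and the DataFrame form below produce equivalent plans, both executed by Spark:
# Both forms are planned and executed by Spark's Catalyst engine, not Hive.
table_a = spark.table("TableA")
table_b = spark.table("TableB")

sql_join = spark.sql("select * from TableA inner join TableB on (TableA.a = TableB.b)")
df_join = table_a.join(table_b, table_a["a"] == table_b["b"], "inner")

# explain() shows that both produce equivalent physical plans.
sql_join.explain()
df_join.explain()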

How to use Apache spark as Query Engine?

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.
So I have to run the job again and again for new search terms, and every time I have to submit the jar files using spark-submit to get the results. As the size of the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame every time to get results for different queries.
So what I need is:
a way to load the data into a DataFrame once and query it multiple times without submitting the jar multiple times;
a way to use Spark as a search engine / query engine;
a way to load the data into a DataFrame once and query it remotely using a REST API.
The current configuration of my Spark deployment is a 5-node cluster running on YARN as the resource manager.
I have tried spark-jobserver, but it also runs the job every time.
You might be interested in HiveThriftServer and Spark integration.
Basically you start a Hive Thrift Server and inject your HiveContext built from the SparkContext:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
...
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10001")
...
dataFrame.registerTempTable("myTable")
HiveThriftServer2.startWithContext(sql)
...
There are several client libraries and tools to query the server:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
including the CLI tool beeline.
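For example, a small sketch of querying the Thrift server from Python, assuming the PyHive client library is installed (host, port, and table name are placeholders):
from pyhive import hive

# Connect to the HiveThriftServer2 started above.
conn = hive.Connection(host="localhost", port=10001)
cursor = conn.cursor()
cursor.execute("SELECT * FROM myTable LIMIT 10")
for row in cursor.fetchall():
    print(row)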
Reference:
https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr
You can also use the Spark + Kafka streaming integration; you just have to send your queries over Kafka for the streaming APIs to pick up. That is one design pattern picking up quickly in the market because of its simplicity:
Create Datasets over your lookup data.
Start a Spark streaming query over Kafka.
Get the SQL from your Kafka topic.
Execute the query over the already created Datasets.
This should take care of your use case (a sketch follows below).
Hope this helps!
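A sketch of the pattern described above, assuming Structured Streaming (Spark 2.4+ for foreachBatch); the topic name, broker address, and data path are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-sql-gateway").getOrCreate()

# Pre-load the lookup data once and expose it to SQL.
lookup_df = spark.read.parquet("/data/lookup")
lookup_df.createOrReplaceTempView("lookup")

# Each Kafka message is treated as a SQL query to execute.
queries = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "sql-queries")
           .load()
           .selectExpr("CAST(value AS STRING) AS query"))

def run_queries(batch_df, batch_id):
    # Runs on the driver for every micro-batch of incoming queries.
    for row in batch_df.collect():
        spark.sql(row.query).show()

queries.writeStream.foreachBatch(run_queries).start().awaitTermination()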
For a Spark search engine, if you require full-text search capabilities and/or document-level scoring, and you do not have an Elasticsearch infrastructure, you can give Spark Search a try; it brings Apache Lucene support to Spark.
df.rdd.searchRDD().save("/tmp/hdfs-pathname")
val restoredSearchRDD: SearchRDD[Person] = SearchRDD.load[Person](sc, "/tmp/hdfs-pathname")
restoredSearchRDD.searchList("(fistName:Mikey~0.8) OR (lastName:Wiliam~0.4) OR (lastName:jonh~0.2)",
    topKByPartition = 10)
  .map(doc => s"${doc.source.firstName}=${doc.score}")
  .foreach(println)
