spark predicate push down is not working with phoenix hbase table - apache-spark

I am working on Spark-Hive-HBase integration. I am using a Phoenix HBase table for the integration.
Phoenix : **apache-phoenix-4.14**
HBase : **hbase-1.4**
spark : **spark-2.3**
hive : **1.2.1**
I am using spark thrift server and accessing the table using jdbc.
Almost all the basic features I tested work fine, but when I submit a query from Spark with a WHERE condition, it is submitted to Phoenix without the WHERE condition and all the filtering happens on the Spark side.
If the table has billions of rows we can't go with this.
example:
Input-query: select * from hive_hbase where rowid=0;
Query-submitted: PhoenixQueryBuilder: Input query : select /*+ NO_CACHE */ "rowid","load_date","cluster_id","status" from hive_hbase
Is it a bug?
Please suggest if there is any way to force the query to be submitted with the WHERE condition (filter), using JDBC only.
Thanks & Regards
Rahul

The above-mentioned behavior is not a bug but rather a feature of Spark, which makes sure that the filter does not run on the DB side but on Spark's end; for a non-rowkey filter this ensures good performance and lets execution finish fast. If you still want to push the predicates down for all intents and purposes, you can use phoenix-spark, or else edit Spark's predicate pushdown code on your own. Below are links for your reference:
https://community.hortonworks.com/questions/87551/predicate-pushdown-support-in-hortonworks-hbase-co.html
http://www.waitingforcode.com/apache-spark-sql/predicate-pushdown-spark-sql/read
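If you go the phoenix-spark route, a rough sketch of what that looks like is below (assuming the phoenix-spark jar for your Phoenix version is on the Spark classpath; the table name and ZooKeeper quorum are placeholders). The rowkey filter is then pushed into the Phoenix scan instead of being applied on the Spark side:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("phoenix-pushdown").getOrCreate()

val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "HIVE_HBASE")          // Phoenix table name (placeholder)
  .option("zkUrl", "zk-host:2181")        // ZooKeeper quorum (placeholder)
  .load()

// Check the plan, then run the filtered query
df.filter(col("rowid") === 0).explain(true)
df.filter(col("rowid") === 0).show()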

Related

How can I implement pyspark Cassandra "keybased" connector?

I am using Spark 2.4.7 and have implemented the normal PySpark Cassandra connector, but there is a use case where I need to implement a key-based connector. I am not finding useful blogs/tutorials around it, so please help me with it.
I have tried the normal pyspark-cassandra connector and it works well.
Now I want to implement a key-based connector, which I am unable to find.
Normally the connector loads the entire Cassandra table, but I don't want to load the entire table; I want to run a query on the source and fetch only the required data.
By key-based I mean getting data using some keys, i.e. using a WHERE condition like
Select *
From <table_name>
Where <column_name>!=0
should run on the source and load only the data that satisfies this condition.
To have this functionality you need to understand how Spark and Cassandra work, separately and together:
When you do spark.read, Spark doesn't load all the data - it just fetches metadata: table structure, column names and types, partitioning schema, etc.
When you perform a query with a condition (where or filter), the Spark Cassandra Connector tries to perform a so-called predicate pushdown - converting the Spark SQL query into the corresponding CQL query - but this really depends on the condition. If pushdown isn't possible, it goes through all the data and performs the filtering on the Spark side. For example, if you have a condition on a column that is the partition key, it will be converted into the CQL expression SELECT ... FROM table WHERE pk = XXX. Similarly, there are some optimizations for queries on the clustering columns - Spark will still need to go through all partitions, but it will be more efficient because it can filter the data based on the clustering columns. See the Spark Cassandra Connector documentation to understand which conditions can be pushed down into Cassandra and which can't. The rule of thumb is: if you can execute the query in cqlsh without ALLOW FILTERING, then it will be pushed down.
In your specific example, you're using an inequality predicate (<> or !=) that isn't supported by Cassandra, so the Spark Cassandra Connector will need to go through all the data and the filtering will happen on the Spark side.
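One way to verify what actually gets pushed down is to look at the physical plan: predicates the connector hands to Cassandra show up under PushedFilters, while everything else stays as a Spark-side Filter. A minimal sketch in Scala (keyspace, table, and column names are placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("pushdown-check").getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "key1", "table" -> "my_table"))
  .load()

// Equality on the partition key: expect it listed under PushedFilters in the plan
df.filter(col("pk") === 1).explain(true)

// Inequality: not supported by Cassandra, so it remains a Spark-side Filter
df.filter(col("some_column") =!= 0).explain(true)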

Apache Spark + Cassandra + Java + Spark session to display all records

I am working on a Spring Java project and integrating Apache Spark and Cassandra using the DataStax connector.
I have autowired sparkSession and the below lines of code seem to work.
Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());

Dataset<Row> ds = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .options(configMap)
        .load();
ds.show();
But this always gives me 20 records. I want to select all the records of the table. Can someone tell me how to do this?
Thanks in advance.
show always outputs 20 rows by default, although you can pass an argument to specify how many rows you need. But show is usually used just to briefly examine the data, especially when working interactively.
In your case, everything really depends on what you want to do with the data - you have already successfully loaded it with the load function; after that you can just start using the normal Spark functions - select, filter, groupBy, etc.
P.S. You can find here more examples of using the Spark Cassandra Connector (SCC) from Java, although it's more cumbersome than using Scala... And I recommend making sure that you're using SCC 2.5.0 or higher because of the many new features there.
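For illustration, a short sketch (shown in Scala; the same Dataset methods exist in the Java API) of going beyond the default 20 rows - the keyspace and table options mirror the ones in the question:
val ds = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "key1", "table" -> "mytable"))
  .load()

ds.show(100, false)        // print the first 100 rows without truncating column values
println(ds.count())        // count every record without pulling them all to the driver
val rows = ds.collect()    // materialize all rows on the driver - only safe for small tables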

How can Hive on Spark read data from jdbc?

We are using Hive on Spark, and we want to do everything in Hive, using Spark to do the computation. That means we don't need to write map/reduce code, only SQL-like code.
And now we have a problem: we want to read a data source like PostgreSQL and control it with simple SQL code, and we want it to run on the cluster.
I have an idea: I can write some Hive UDFs to connect to a JDBC source and produce table-like data, but I've found that it doesn't run as a Spark job, so it would be useless.
What we want is to type something like this in Hive:
hive>select myfunc('jdbc:***://***','root','pw','some sql here');
Then I can get a table in Hive and join it with others. Put another way, no matter what engine Hive uses, we want to be able to read other data sources from Hive.
I don't know what to do now; maybe someone can give me some advice.
Is there any way to do something like this:
hive> select * from hive_table where hive_table.id in
(select myfunc('jdbcUrl','user','pw','sql'));
I know that Hive compiles the SQL into a MapReduce job; what I want to know is how to make my SQL/UDF compile into a MapReduce/Spark job the way spark.read().jdbc(...) does.
I think it's easier to load the data from the DB into a DataFrame; then you can dump it to Hive if necessary.
Read this: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases
See the dbtable property: you can load part of a table by supplying a SQL query in place of the table name.
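A minimal sketch of that approach, assuming a PostgreSQL source; the URL, credentials, and subquery are placeholders:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(select id, name from source_table where id > 100) as src")  // a query instead of a table name
  .option("user", "root")
  .option("password", "pw")
  .load()

// Optionally persist it as a Hive table so it can be joined from Hive
jdbcDF.write.mode("overwrite").saveAsTable("jdbc_import")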

SPARK KUDU Complex Update statements directly or via Impala JDBC Driver possible?

If I look at the Impala shell or Hue, I can write complicated enough Impala UPDATE statements for Kudu, e.g. an update with a sub-select and whatnot. Fine.
Looking at the old JDBC connection methods for, say, MySQL via Spark/Scala, there is not a lot of possibility to do a complicated update via such a connection, and that is understandable. However, with Kudu, I think the situation changes.
Looking at the documentation on Kudu - Apache Kudu - Developing Applications with Apache Kudu - the following questions:
It is unclear whether I can issue a complex UPDATE SQL statement from a Spark/Scala environment via an Impala JDBC driver (due to security issues with Kudu).
In Spark Kudu native mode, DML seems tied to a DataFrame approach with INSERT and UPSERT. What if I just want to write a free-format SQL DML statement like an UPDATE? I see that we can use Spark SQL to INSERT (treated as UPSERT by default) into a Kudu table, e.g.
sqlContext.sql(s"INSERT INTO TABLE $kuduTableName SELECT * FROM source_table")
My understanding of the Spark SQL INSERT ... above is that the Kudu table must be registered as a temporary table as well; I cannot approach it directly. So, taking this all in: how can we approach a Kudu table directly in Spark? We cannot in Spark/Kudu, and complicated UPDATE statements via Spark Scala/Kudu, or from Spark Scala to Kudu over an Impala JDBC connection, do not seem to be possible either. I can do some things via shell scripting with saved env vars in some cases, I note.
The documentation is quite poor in this regard.
DML - insert, update, ... - is possible via the approach below; some examples:
stmt.execute("update KUDU_1 set v = 'same value' where k in ('1', '4') ;")
stmt.execute("insert into KUDU_1 select concat(k, 'ABCDEF'), 'MASS INSERT' from KUDU_1 ;")
The only catch is that if you use the corresponding stmt.executeQuery, a Java ResultSet is returned, which differs from the more standard approach of reading from JDBC sources and persisting the results. A little surprise here for me. Maybe two approaches are needed: one for the more regular SELECTs and one for non-SELECT DML. Not sure if that can all be in the same program module. For another time. Yes, it can.
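For context, a rough sketch of that approach - opening a plain JDBC connection to Impala from Spark/Scala driver code and issuing free-form DML against the Kudu-backed table. The driver class, host, and port are assumptions and depend on the Impala JDBC driver you ship:
import java.sql.DriverManager

Class.forName("com.cloudera.impala.jdbc41.Driver")        // assumed driver class
val conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/default")
val stmt = conn.createStatement()

// Free-form DML against the Kudu-backed table, executed by Impala
stmt.execute("update KUDU_1 set v = 'same value' where k in ('1', '4')")

// Regular SELECTs come back as a java.sql.ResultSet rather than a DataFrame
val rs = stmt.executeQuery("select k, v from KUDU_1")
while (rs.next()) println(rs.getString("k") + " -> " + rs.getString("v"))

rs.close(); stmt.close(); conn.close()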

How to use Apache spark as Query Engine?

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat-file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.
So I have to run the job again and again for new search terms, and every time I have to submit the jar using spark-submit to get the results. As the size of the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame every time just to answer a different query.
So what I need is:
a way to load the data into a DataFrame once and query it multiple times without submitting the jar multiple times;
a way to use Spark as a search engine / query engine;
a way to load the data into a DataFrame once and query it remotely using a REST API.
The current configuration of my Spark deployment is:
a 5-node cluster,
running on YARN RM.
I have tried to use spark-jobserver but it also runs the job every time.
You might be interested in HiveThriftServer and Spark integration.
Basically you start a Hive Thrift Server and inject your HiveContext built from the SparkContext:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

...
val sql = new HiveContext(sc)                      // sc is your existing SparkContext
sql.setConf("hive.server2.thrift.port", "10001")   // port the Thrift/JDBC server listens on
...
dataFrame.registerTempTable("myTable")             // expose the DataFrame as a queryable table
HiveThriftServer2.startWithContext(sql)            // serve it to JDBC/ODBC clients
...
There are several client libraries and tools to query the server:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
Including the CLI tool beeline.
Reference:
https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr
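As an illustration, a minimal sketch of querying the registered table from another JVM over JDBC, assuming the Hive JDBC driver is on the classpath and the Thrift server is listening on localhost:10001 as configured above:
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10001", "", "")
val rs = conn.createStatement().executeQuery("SELECT count(*) FROM myTable")
while (rs.next()) println(rs.getLong(1))
conn.close()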
You can also use Spark + Kafka streaming integration: you just have to send your queries over Kafka for the streaming API to pick up. That's a design pattern that is quickly gaining traction because of its simplicity (a sketch follows the steps below):
Create Datasets over your lookup data.
Start a Spark streaming query over Kafka.
Get the sql from your Kafka topic
Execute the query over the already created Datasets
This should take care of your use case.
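A rough sketch of that pattern with Structured Streaming (this assumes Spark 2.4+ for foreachBatch, the spark-sql-kafka package on the classpath, and placeholder broker, topic, and path names):
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("sql-over-kafka").getOrCreate()

// 1. Create the Dataset over your lookup data once and register it as a view
spark.read.parquet("/data/lookup").createOrReplaceTempView("lookup")

// 2. Stream raw SQL strings from the Kafka topic
val queries = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "query-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS sqlText")

// 3. Execute each incoming query against the already-registered view
queries.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.collect().foreach(row => spark.sql(row.getString(0)).show())
  }
  .start()
  .awaitTermination()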
Hope this helps!
For the Spark search engine part: if you require full-text search capabilities and/or document-level scoring - and you do not have an Elasticsearch infrastructure - you can give Spark Search a try; it brings Apache Lucene support to Spark.
df.rdd.searchRDD().save("/tmp/hdfs-pathname")
val restoredSearchRDD: SearchRDD[Person] = SearchRDD.load[Person](sc, "/tmp/hdfs-pathname")
restoredSearchRDD.searchList("(firstName:Mikey~0.8) OR (lastName:Wiliam~0.4) OR (lastName:jonh~0.2)",
    topKByPartition = 10)
  .map(doc => s"${doc.source.firstName}=${doc.score}")
  .foreach(println)
