SPARK KUDU: Complex UPDATE statements directly or via Impala JDBC Driver possible?

If I look at the Impala Shell or Hue, I can write complicated enough Impala UPDATE statements for KUDU, e.g. an UPDATE with a sub-select and what not. Fine.
Looking at the old JDBC connection methods for, say, MySQL via SPARK / SCALA, there is not much scope for doing a complicated update over such a connection, and that is understandable. However, with KUDU, I think the situation changes.
Looking at the KUDU documentation (Apache KUDU - Developing Applications with Apache KUDU), I have the following questions:
It is unclear whether I can issue a complex UPDATE SQL statement from a SPARK / SCALA environment via an Impala JDBC Driver (due to security issues with KUDU).
In SPARK KUDU native mode, DML seems tied to a DataFrame approach with INSERT and UPSERT. What if I just want to write a free-format SQL DML statement such as an UPDATE? I see that we can use Spark SQL to INSERT (treated as UPSERT by default) into a Kudu table, e.g.
sqlContext.sql(s"INSERT INTO TABLE $kuduTableName SELECT * FROM source_table")
My understanding of the SPARK SQL INSERT ... as per above is that the KUDU table must be registered as a temporary table as well; I cannot approach it directly. So, taking this all in, how can we approach a KUDU table directly in SPARK? We cannot in SPARK / KUDU, and complicated UPDATE statements via SPARK SCALA / KUDU, or from SPARK SCALA to KUDU via an Impala JDBC connection, do not seem to be allowed either. I note that in some cases I can do some things via shell scripting with saved environment variables.
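For context, a minimal sketch of how a Kudu table is usually exposed to Spark SQL, assuming the kudu-spark connector and a Spark 2.x-style session; the master address, table and view names are placeholders. The table is loaded as a DataFrame and registered as a temporary view, and only then can statements like the INSERT above refer to it by name.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-sql-sketch").getOrCreate()

// Hypothetical Kudu table name as exposed through Impala.
val kuduTableName = "impala::default.kudu_1"

val kuduDF = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "kudu-master:7051")   // placeholder master address
  .option("kudu.table", kuduTableName)
  .load()

// Register the DataFrame so Spark SQL can reference it by name.
kuduDF.createOrReplaceTempView("kudu_1_view")

// INSERT (treated as an upsert by the connector) can then target the view;
// whether this works depends on the kudu-spark version, and free-form UPDATE
// statements are not supported this way.
spark.sql("INSERT INTO TABLE kudu_1_view SELECT * FROM source_table")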

The documentation is poor in this regard.
DML (INSERT, UPDATE, ...) is possible via the "approach" below; some examples:
stmt.execute("update KUDU_1 set v = 'same value' where k in ('1', '4') ;")
stmt.execute("insert into KUDU_1 select concat(k, 'ABCDEF'), 'MASS INSERT' from KUDU_1 ;")
The only thing is that, if the corresponding stmt.executeQuery is used, a Java ResultSet is returned, which differs from the more standard approach of reading from JDBC sources and persisting the results. A little surprise here for me. Maybe two approaches are needed, one for regular SELECTs and one for non-SELECT DML. I was not sure whether that can all live in the same programme module, but yes, it can.
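A minimal sketch of that approach, assuming Cloudera's Impala JDBC41 driver is on the classpath; the driver class name, URL, host and table are placeholders to adjust to your own security setup.

import java.sql.{Connection, DriverManager, ResultSet, Statement}

object ImpalaKuduDml {
  def main(args: Array[String]): Unit = {
    // Assumed driver class name for Cloudera's Impala JDBC41 driver.
    Class.forName("com.cloudera.impala.jdbc41.Driver")

    val url = "jdbc:impala://impala-host:21050/default"   // placeholder URL
    var conn: Connection = null
    try {
      conn = DriverManager.getConnection(url)
      val stmt: Statement = conn.createStatement()

      // Non-SELECT DML goes through execute() / executeUpdate() ...
      stmt.execute("update KUDU_1 set v = 'same value' where k in ('1', '4')")
      stmt.execute("insert into KUDU_1 select concat(k, 'ABCDEF'), 'MASS INSERT' from KUDU_1")

      // ... while executeQuery() hands back a plain java.sql.ResultSet,
      // which is iterated manually rather than arriving as a DataFrame.
      val rs: ResultSet = stmt.executeQuery("select k, v from KUDU_1")
      while (rs.next()) {
        println(s"${rs.getString(1)} -> ${rs.getString(2)}")
      }
      rs.close()
      stmt.close()
    } finally {
      if (conn != null) conn.close()
    }
  }
}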

Related

Spark + write to Hive table + workaround

I am trying to understand the pros and cons of an approach that I am hearing about frequently in my workplace.
Spark, while writing data to a Hive table (insertInto), does the following:
Write to a staging folder
Move the data to the Hive table using the output committer.
Now I see people complaining that the above two-step approach is time-consuming and hence resorting to:
1) Write files directly to HDFS
2) Refresh metastore for Hive
And I see people reporting a lot of improvement with this approach.
But somehow I am not yet convinced that it is a safe and correct method. Is it not a tradeoff of atomicity (either the full table is written or nothing is)?
What if the executor that is writing a file to HDFS crashes? I don't see a way to completely revert the half-done writes.
I also think Spark would have done it this way if it were the right way to do it, wouldn't it?
Are my questions valid? Do you see any good in the above approach? Please comment.
It is not 100% right because in Hive 3 you can only access Hive data through the Hive driver, so as not to break the new transactional machinery.
Even if you are using Hive 2, you should at least keep in mind that you will not be able to access the data directly as soon as you upgrade.
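For illustration, a hedged sketch of the two approaches being compared; the table, database and path names are placeholders, and the direct-write variant assumes an external table whose data lives under the given HDFS path.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hive-write-sketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("staging_db.events")   // hypothetical source table

// Approach 1: let Spark's output committer do the staging-folder move.
// Slower, but the new data only becomes visible once the commit succeeds.
df.write.mode(SaveMode.Append).insertInto("warehouse_db.events")

// Approach 2: write files straight to the table location, then tell the
// metastore about them. Faster, but a crashed executor can leave partial
// files behind that nothing cleans up automatically.
df.write.mode(SaveMode.Append).parquet("hdfs:///warehouse/warehouse_db.db/events")
spark.sql("MSCK REPAIR TABLE warehouse_db.events")   // or spark.catalog.refreshTable(...)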

Can Apache Spark be used in place of Sqoop

I have tried connecting Spark over JDBC to fetch data from MySQL / Teradata or similar RDBMSs and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop at these activities?
Looking forward to your valuable answers and explanations.
There are two main things to know about Sqoop and Spark. The main difference is that Sqoop will read the data from your RDBMS no matter what you have, and you don't need to worry much about how your table is configured.
With Spark, using a JDBC connection is a little bit different in how you need to load the data. If your database doesn't have a column like a numeric ID or a timestamp, Spark will load ALL the data into one single partition and then try to process and save it. If you do have a column to use for partitioning, then Spark can sometimes be even faster than Sqoop (see the sketch after this answer).
I would recommend you take a look at the documentation on this.
The conclusion is: if you are going to do a simple export that needs to run daily with no transformation, I would recommend Sqoop, as it is simple to use and will not impact your database that much. Spark will work well IF your table is ready for that; otherwise, go with Sqoop.
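To illustrate the partitioning point above, a hedged sketch (host, credentials, table and column names are placeholders) of a single-partition JDBC read versus a partitioned one, followed by landing the result on HDFS.

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-vs-sqoop-sketch").getOrCreate()

val url = "jdbc:mysql://db-host:3306/sales"   // placeholder connection URL
val props = new Properties()
props.setProperty("user", "etl_user")         // placeholder credentials
props.setProperty("password", "secret")

// Without a partition column, the whole table arrives through one connection
// into a single partition.
val slowDF = spark.read.jdbc(url, "orders", props)

// With a numeric column and bounds, Spark issues one query per partition.
val fastDF = spark.read.jdbc(
  url,
  "orders",
  columnName = "order_id",   // numeric column to split on
  lowerBound = 1L,
  upperBound = 10000000L,
  numPartitions = 8,
  connectionProperties = props
)

// Spark can then store the data on HDFS directly, e.g. as Parquet.
fastDF.write.mode("overwrite").parquet("hdfs:///landing/sales/orders")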

spark predicate push down is not working with phoenix hbase table

I am working on a Spark-Hive-HBase integration, using a Phoenix HBase table for the integration.
Phoenix: apache-phoenix-4.14
HBase: hbase-1.4
Spark: spark-2.3
Hive: 1.2.1
I am using the Spark Thrift Server and accessing the table using JDBC.
Almost all the basic features I tested are working fine, but when I submit a query from Spark with a WHERE condition, it is submitted to Phoenix without the WHERE condition and all the filtering happens on the Spark side.
If the table has billions of rows, we can't go with this.
example:
Input-query: select * from hive_hbase where rowid=0;
Query-submitted: PhoenixQueryBuilder: Input query : select /*+ NO_CACHE */ "rowid","load_date","cluster_id","status" from hive_hbase
Is this a bug?
Please suggest if there is any way to force the query to be submitted with the WHERE condition (filter), using JDBC only.
Thanks & Regards
Rahul
The above-mentioned behavior is not a bug but rather a feature of Spark: it makes sure the filter does not happen on the DB side but at Spark's end, which keeps non-row-key filters performant and lets execution finish quickly. If you still want to push the predicates down, for all intents and purposes you can use phoenix-spark, or edit Spark's predicate pushdown code yourself. Below are links for your reference:
https://community.hortonworks.com/questions/87551/predicate-pushdown-support-in-hortonworks-hbase-co.html
http://www.waitingforcode.com/apache-spark-sql/predicate-pushdown-spark-sql/read
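As an illustration of the phoenix-spark route, a hedged sketch; the table name and ZooKeeper quorum are placeholders, and the exact format string and options vary between Phoenix releases.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-pushdown-sketch").getOrCreate()

val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "HIVE_HBASE")     // Phoenix table name (placeholder)
  .option("zkUrl", "zk-host:2181")   // ZooKeeper quorum (placeholder)
  .load()

// Read through the connector, the row-key filter should show up in the query
// sent to Phoenix rather than being applied only on the Spark side.
df.filter("ROWID = 0").show()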

How can Hive on Spark read data from jdbc?

We are using Hive on Spark, and we want to do everything in Hive, using Spark for the computation. That means we don't need to write map/reduce code, just SQL-like code.
And now we have a problem: we want to read a data source like PostgreSQL and control it with simple SQL code, and we want it to run on the cluster.
I had an idea: I could write some Hive UDFs to connect over JDBC and produce table-like data, but I've found that this doesn't run as a Spark job, so it would be useless.
What we want is to type something like this in Hive:
hive>select myfunc('jdbc:***://***','root','pw','some sql here');
Then I could get a table in Hive and join it with others. Put another way, no matter what engine Hive uses, we want to read other data sources from within Hive.
I don't know what to do now; maybe someone can give me some advice.
Is there any way to do something like this:
hive> select * from hive_table where hive_table.id in
(select myfunc('jdbcUrl','user','pw','sql'));
I know that Hive compiles the SQL into MapReduce jobs; what I want to know is how to make my SQL/UDF compile into a job that behaves like spark.read().jdbc(...).
I think it's easier to load the data from the DB into a DataFrame; then you could dump it into Hive if necessary.
Read this: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases
See the property named dbtable; you can load part of a table defined by a SQL query.
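A hedged sketch of that suggestion; the URL, credentials, query and table names are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jdbc-to-hive-sketch")
  .enableHiveSupport()
  .getOrCreate()

val pgDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://pg-host:5432/appdb")   // placeholder
  .option("user", "reader")                                // placeholder
  .option("password", "secret")                            // placeholder
  // dbtable accepts a parenthesised subquery with an alias, so only the
  // selected slice is pulled over JDBC.
  .option("dbtable", "(select id, name from customers where active) as src")
  .load()

// Persist the result in Hive so HiveQL can join hive_table against it.
pgDF.write.mode("overwrite").saveAsTable("default.jdbc_customers")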

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts and experiences on the usage of CQL versus the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. Also, DataStax has the DSE version of Cassandra, which allows deploying Hadoop/Hive. The question is: in which use cases would we pick one specific solution over the others?
I will share a few thoughts based on my experience. But, if possible, please let us know about your use case; it will help us answer your questions better.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from a SQL background and planning to use Cassandra, then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, then even though CQL covers primitive GROUP BY use cases through write-time and compaction-time sorts and supports one-to-many relationships, CQL is not the answer.
2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and planned data pipelines. In-memory processing makes it roughly 100x faster than Hive. Like Hive, Spark SQL handles larger-than-memory data very well, and it can still be up to 10x faster thanks to its planned pipelines. The advantage shifts further towards Spark SQL when multiple pipeline stages such as filter and groupBy are present. Go for it when you need ad-hoc, near-real-time querying; it is not the right fit for long-running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides a SQL-like interface to your data. But Hive is not suitable for real-time needs; it is best suited to offline batch processing. It doesn't need any additional infrastructure, as it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN and GROUP BY on large datasets and for OLAP.
Note : Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features but potentially faster. It supports the existing Hive Query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S.: The above answer is based solely on my experience. Comments/corrections are welcome.
There is a very good benchmarking effort documented here: https://amplab.cs.berkeley.edu/benchmark/
