I'm currently driving an R&D project testing heavily against Azure's HDInsight Hadoop service. We use SQL Server Integration Services (SSIS) to manage ETL workflows, so making HDInsight work with SSIS is a must.
I've had good success with a few of the Azure Feature Pack tasks, but there is no native HDInsight/Hadoop destination task for use with Data Flow Tasks (DFTs).
Problem With Microsoft's Hive ODBC Driver Within An SSIS DFT
I created a DFT with a simple SQL Server "OLE DB Source" feeding an "ODBC Destination" that points to the cluster via the Microsoft Hive ODBC Driver. (Ignore the red error; it has only detected that the cluster has since been destroyed.)
I tested the cluster ODBC connection after entering all parameters, and it tests "OK". It can even read the Hive table and map all of the columns. The problem arrives at run time: the task generally just locks up with no rows on the counter, or it gets a handful of rows into the buffer and then freezes.
Steps I've taken to troubleshoot:
Verified connection string and Hadoop cluster username/password.
Recreated cluster and task several times.
Confirmed the source is SQL Server and runs fine if I point it at just a flat file or recordset destination instead.
Tested a smaller number of rows to see if it is a simple performance issue (SELECT TOP 100 * FROM stupidTable). Also tested with only 4 columns.
Tested on a separate workstation to make sure it wasn't related to the machine.
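For anyone wanting to reproduce this outside SSIS, a minimal standalone insert test might look like the sketch below (pyodbc; the DSN name and the table's columns are placeholder assumptions):

import pyodbc

# Placeholder DSN; point it at the same system DSN the SSIS
# connection manager uses.
conn = pyodbc.connect("DSN=HDInsightHive", autocommit=True)
cur = conn.cursor()

# Insert a handful of rows one at a time to see whether the driver
# itself stalls on writes, independent of SSIS buffer handling.
# (INSERT ... VALUES requires Hive 0.14 or later; columns assumed.)
for i in range(10):
    cur.execute(f"INSERT INTO TABLE stupidTable VALUES ({i}, 'row {i}')")
print("done")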
All that said, I can't figure out what else to try. I'm not doing much different from examples on the web like this one, except that I'm using ODBC as a destination and not a source.
Has anyone had success using the Hive driver, or any other, within an SSIS destination task? Thanks in advance.
Use case: I have two databases, one for prod and one for dev. Prod uses an SAP JDBC driver and dev uses an Oracle JDBC driver, as they are built on different databases. I have to fetch data from the prod DB, perform a few operations, and save the results in the dev DB for some project needs.
Issue: Currently I am loading these third-party drivers by setting "spark.driver.extraClassPath" in the Spark context, but it takes only one value, so I am able to connect to only one of the DBs at a time.
Is there any way I can configure two different JDBC classpaths? If not, how can I approach this issue? Any guidance is much appreciated!
Solution:
Instead of pointing to a single driver JAR, providing the folder path (with a wildcard) loads all of the drivers in that folder. So, in my case, I placed both the SAP and Oracle JDBC drivers in the same folder and referenced it in the Spark context configuration, as shown in the snippet below.
.set("spark.driver.extraClassPath", r"<folder_path_jdbc_drivers>\*")
I'm pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a Docker image for some tests I want to write. Is it possible to get a Databricks/Spark Docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). Theoretically you could use OSS Spark with a Thrift Server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (IMHO). The real solution would depend on the type of tests that you want to run.
Regarding bootstrapping the database and creating a bunch of tables - just issue commands such as CREATE DATABASE IF NOT EXISTS or CREATE TABLE IF NOT EXISTS when your application starts up (see the documentation for exact syntax).
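For example, an idempotent bootstrap run on every application start might look like this (database and table names are illustrative placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# IF NOT EXISTS makes the bootstrap safe to run repeatedly.
spark.sql("CREATE DATABASE IF NOT EXISTS testdb")
spark.sql("""
    CREATE TABLE IF NOT EXISTS testdb.customers (
        id   BIGINT,
        name STRING
    )
""")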
I have multiple terabyte-sized files that need to be loaded into a database sitting on top of a high-performance Azure SQL server in the cloud.
For now I'm trying to load these files via an SSIS package, and it is taking more than 12 hours to complete for 5 files.
I believe HDInsight and Databricks exist in Azure to do big data ETL and to analyze data using Ambari and other UIs. But is it possible to use the same (HDInsight or Databricks) to load the huge data files into a SQL table/database? (For example, using clusters to load multiple files in parallel.)
Any suggestion/help is much appreciated
Since you mentioned SSIS, I was wondering if you have considered the option of using Azure Data Factory (personally I consider it to be the next version of SSIS in the cloud). Its Copy activity should do the trick, and it does support parallel execution. Since you are considering Azure SQL, we also need to consider the congestion issue on the sink side, i.e., the scenario where all the terabytes of files try to write to the SQL table at the same time.
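If you do go the Databricks/HDInsight route instead, a rough sketch of a parallel load might look like the following. Every path, connection detail, and name here is a placeholder assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The cluster reads the large files in parallel across executors.
df = (
    spark.read.option("header", "true")
    .csv("abfss://data@myaccount.dfs.core.windows.net/landing/*.csv")
)

# Write to Azure SQL over JDBC. repartition() caps the number of
# concurrent connections so the sink is not overwhelmed (the
# congestion issue mentioned above); batchsize controls insert size.
(
    df.repartition(8)
    .write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.target_table")
    .option("user", "loader")
    .option("password", "...")
    .option("batchsize", 10000)
    .mode("append")
    .save()
)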
I have created a table, ztest7, in the default database in Hive. I am able to query it using Beeline, and in Tableau I can query it using custom SQL.
However, the table does NOT show up when I search for it.
Am I missing something here?
Tableau Desktop Version = v10.1.1
Hive = v2.0.1
Spark = v2.1.0
Best Regards
I have the same issue with Tableau Desktop 10 (Mac) connecting to Hive (2.1.1) via Spark SQL 2.1 (on a CentOS 7 server).
This is what I got from Tableau Support:
In Tableau Desktop, the ability to connect to Spark SQL without defining a default schema is not currently built into the product.
As a preliminary step, to define a default schema, configure the Spark SQL Hive metastore to utilize a SchemaRDD or DataFrame. This must be defined in the Hive metastore for Tableau Desktop to be able to access it. Pure schema-less Spark RDDs cannot be queried by Spark SQL because of the lack of a schema; RDDs can be converted into SchemaRDDs, which carry additional schema metadata, since Spark SQL provides access to SchemaRDDs. When a SchemaRDD is created, it is only available in the local namespace or context, and is unavailable to external services accessing Spark through ODBC and the Spark Thrift Server. For Tableau to have access, the SchemaRDD needs to be registered in a catalog that is available outside of just the local context; the Hive metastore is currently the only supported service.
I don't know how to check/implement this.
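If it helps, the registration step they describe roughly corresponds to persisting the DataFrame into the Hive metastore rather than a temp view. A minimal sketch (the source path is a placeholder; the table name is taken from the question above):

from pyspark.sql import SparkSession

# enableHiveSupport() makes saveAsTable register the table in the
# Hive metastore, where the Spark Thrift Server (and thus Tableau)
# can see it.
spark = (
    SparkSession.builder
    .appName("register-table")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.parquet("/data/ztest7")  # placeholder source

# A temp view (createOrReplaceTempView) would live only in the local
# context and stay invisible to external ODBC clients; saveAsTable
# persists the table in the metastore instead.
df.write.mode("overwrite").saveAsTable("default.ztest7")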
In the field labeled Table on the left side of the screen, try selecting Contains, entering part of your table name, and hitting Enter.
I ran into a similar issue. In my case, I had loaded the tables using Hive, but the Tableau connection to the data source was made using Impala, as shown in the image below.
To fix the issue of the tables not showing in the Tableau dropdown, try running INVALIDATE METADATA database.table_name in the Impala interface. This fixed the problem for me.
To understand why this fixes the issue, refer to this link.
I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this into an engineering exercise the better. What I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower to use with our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds, while the same query run on the Hive query console and over the ODBC connection took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally, since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I asking Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but that sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure that is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and using Tez. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should be much more performant then.
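As a rough illustration, those two changes could be applied from Python over an ODBC connection to the cluster, as sketched below. The DSN and table names are placeholders, and the SET statement may instead be configured cluster-side (on a Tez-enabled cluster it is unnecessary):

import pyodbc

# Placeholder DSN for the Microsoft Hive ODBC driver.
conn = pyodbc.connect("DSN=HDInsightHive", autocommit=True)
cur = conn.cursor()

# Run this session's queries on Tez rather than classic MapReduce.
cur.execute("SET hive.execution.engine=tez")

# Materialize the reporting data as an ORC table; the columnar ORC
# format plus Tez is where most of the query speedup comes from.
cur.execute(
    "CREATE TABLE report_data_orc STORED AS ORC "
    "AS SELECT * FROM report_data"
)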
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it, and while the performance is not comparable to what you'll get with MSSQL, another RDBMS, or something like Elasticsearch, it does work pretty reliably.