I have a server with a Hadoop instance running on it.
Basically, I'd like to connect to an HDFS table from Excel on my local machine. I know the Power Query add-in handles this and can establish a connection to HDFS. But here is the thing: I have Excel 2016, and according to the Microsoft documentation, Power Query is already built into Excel. Yet when I go to Data > Get Data > From Other Sources, there is simply no option like "From Hadoop File (HDFS)".
What am I doing wrong and what exact steps do I need to take to get access to HDFS from Excel?
For me, HDFS shows up in the first dialog but not in the second (screenshots omitted).
What does the first New Query > From Other Sources look like for you?
Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the comparison in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can I read a table such as DB.TableName?
As far as I am aware, there is no way to read the table from the Databricks REST API unless you run it as a job, as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your Databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up, you can use pyodbc to connect to Databricks and run a query. At this time the ODBC driver will only allow you to run Spark SQL commands.
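For example, a minimal pyodbc sketch of that route (the DSN name and the table are placeholders, not values from the question):

    import pyodbc

    # "Databricks_Cluster" is a hypothetical DSN configured for the
    # Databricks/Simba Spark ODBC driver; replace it with your own DSN.
    conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
    cursor = conn.cursor()

    # Only Spark SQL statements are accepted through this driver.
    cursor.execute("SELECT * FROM DB.TableName")
    rows = cursor.fetchall()
    print(len(rows), "rows fetched")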
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
I can recommend writing PySpark code in a notebook, calling that notebook from a previously defined job, and establishing a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark, or convert the DataFrames to pandas if you wish. When the notebook finishes the comparison, it can return a result from that job run. Sending whole Databricks tables back is probably impossible because of API limits; you have a Spark cluster to perform the complex operations, and the API should be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task returns a value through the dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage service.
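As a rough sketch of that flow (the workspace URL, token, and run ID are placeholders; the response fields follow the linked Jobs API documentation):

    import json
    import requests

    # Inside the Databricks notebook, return a small result (< 5 MB), e.g.:
    #   dbutils.notebook.exit(json.dumps({"rows_mismatched": 0}))

    # On the local machine, fetch that value through the Jobs API.
    HOST = "https://<workspace>.azuredatabricks.net"  # placeholder
    TOKEN = "<personal-access-token>"                 # placeholder
    RUN_ID = 12345                                    # placeholder run id

    resp = requests.get(
        f"{HOST}/api/2.0/jobs/runs/get-output",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"run_id": RUN_ID},
    )
    resp.raise_for_status()
    result = json.loads(resp.json()["notebook_output"]["result"])
    print(result)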
I am facing a weird issue while working with the HDP 2.4.0.0-169 sandbox.
I have HDP with hostname sandbox.hortonworks.com and IP 192.168.159.129, with all the default Hadoop and other services up and running on it.
I have written Spark code that creates a table in Hive and reads the contents of any existing Hive table on HDP. I also have code for writing/inserting data into the newly created Hive table.
As soon as I run this code from Eclipse using the "Run as Scala Application" option, it creates the table. It also reads the table, but it is not able to write anything into any new or existing table. This seems very strange to me, since I can create a table but cannot write anything to it.
It gives me the following error:
Exception while executing hive query.java.net.UnknownHostException:
sandbox.hortonworks.com
I have an entry for sandbox.hortonworks.com in my Windows hosts file as well, but I am unable to figure out why it will not let me write any data to the Hive table when I can create the table.
Is there a user read/write permission issue?
If so, why does it allow me to create and read data from Hive using the same user from Eclipse?
It only refuses to insert data into those Hive tables.
Any quick pointer/reference would be appreciated.
Regards,
Bhupesh
Got it.
The entry I had made in the C:\Windows\System32\drivers\etc\hosts file was wrong.
It should be:
192.168.159.129 sandbox.hortonworks.com
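As a quick sanity check (a generic sketch, nothing HDP-specific), you can confirm from the client machine that the name now resolves to the sandbox IP:

    import socket

    # Should print 192.168.159.129 once the hosts entry is correct.
    print(socket.gethostbyname("sandbox.hortonworks.com"))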
I'm currently driving an R&D project testing hard against Azure's HDInsight Hadoop service. We use SQL Server Integration Services to manage ETL workflows, so making HDInsight work with SSIS is a must.
I've had good success with a few of the Azure Feature Pack tasks, but there is no native HDInsight/Hadoop destination task for use with data flow tasks (DFTs).
Problem With Microsoft's Hive ODBC Driver Within An SSIS DFT
I created a DFT with a simple SQL Server "OLE DB Source" pointing to the cluster through an "ODBC Destination" that uses the Microsoft Hive ODBC driver. (Ignore the red error; it has detected that the cluster has been destroyed.)
I've tested the cluster ODBC connection after entering all parameters, and it tests "OK". It is even able to read the Hive table and map all of the columns. The problem arrives at run time: it generally just locks up with no rows in the counter, or it gets a handful of rows into the buffer and freezes.
My troubleshooting so far:
Verified connection string and Hadoop cluster username/password.
Recreated cluster and task several times.
The source is SQL Server, and it runs fine if I point it at only a file destination or a Recordset destination.
Tested a smaller number of rows to see if it is a simple performance issue (SELECT TOP 100 * FROM stupidTable). Also tested with only 4 columns.
Tested on a separate workstation to make sure it wasn't related to the machine.
All that said, I can't figure out what else to try. I'm not doing much different from examples on the web like this one, except that I'm using ODBC as a destination rather than a source.
Has anyone had success using the Hive driver, or another driver, within an SSIS destination task? Thanks in advance.
I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with SQL Server Reporting Services, since that is what our application already uses. We are very short on developer resources, so the more I can make this into an engineering exercise the better. What I have tried so far:
Using an ODBC connection from MSSQL mapped to Hive on HDInsight.
Using an ODBC connection from MSSQL to HBase on HDInsight.
Using Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower for our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds. I ran the query in the Hive query console and over the ODBC connection and found that it took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I asking Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but that sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure it is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and using Tez. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should be much more performant then.
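For instance, a rough sketch of applying both suggestions through a Hive ODBC connection (the DSN and table names are hypothetical):

    import pyodbc

    # "HDInsightHive" is a hypothetical DSN for the Microsoft Hive ODBC driver.
    conn = pyodbc.connect("DSN=HDInsightHive", autocommit=True)
    cur = conn.cursor()

    # Run queries on Tez rather than classic MapReduce.
    cur.execute("SET hive.execution.engine=tez")

    # Store the reporting data in ORC so Hive can prune and read it efficiently.
    cur.execute(
        "CREATE TABLE report_data_orc STORED AS ORC AS SELECT * FROM report_data"
    )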
If going through Spark is fast enough, you should consider the Microsoft Spark ODBC driver. I am using it, and while the performance is not comparable to what you'll get with MSSQL, another RDBMS, or something like Elasticsearch, it does work pretty reliably.
Is it possible to use Amazon Redshift as the data source for an Excel pivot table? Googling this question didn't yield any obvious answers. Thanks.
Yes, I have done this.
However, since the other answers were written, Amazon has released customized Redshift drivers; use those rather than the generic Postgres drivers.
The answers you are looking for are here:
http://docs.aws.amazon.com/redshift/latest/mgmt/configure-odbc-connection.html
You can consume Amazon Redshift databases with the PostgreSQL ODBC drivers.
Download and install the driver.
Set up a DSN on the box pointing to your Redshift server with your AWS credentials (you can find the ODBC connection string in the settings area of your cluster).
Use that connection in Excel or any other product that can connect to ODBC data sources.
Alternatively, you can convert the Excel file to CSV and upload it to S3. Once the files are uploaded to S3, you can run the COPY command to load the data from S3 into the Redshift cluster. You can run the COPY command via a PostgreSQL JDBC connection or a tool like SQL Workbench/J.
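A rough sketch of that path in Python (the bucket, file, table, cluster endpoint, credentials, and IAM role are all placeholders):

    import boto3
    import psycopg2

    # Upload the CSV exported from Excel to S3.
    boto3.client("s3").upload_file("report.csv", "my-bucket", "uploads/report.csv")

    # Connect to the Redshift cluster.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="<password>",
    )
    conn.autocommit = True

    # COPY loads the S3 file into an existing Redshift table.
    conn.cursor().execute("""
        COPY my_table
        FROM 's3://my-bucket/uploads/report.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV
        IGNOREHEADER 1
    """)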