I am new to Azure HDInsight.
I am trying to install Presto on an HDInsight cluster.
As a test, I want to run the TPC-H queries over it. Here is what I have done so far.
I loaded the TPC-H tables into Hive.
I am able to run queries over the Hive CLI.
I am able to run a show tables query on the Presto CLI.
I am not able to run queries such as select count(*) from region; these fail with the error message Query 20200605_074052_00011_6etih failed: cannot create caching file system.
When I submit a show tables query on the Presto CLI, I get the messages below.
Query 20200605_074050_00010_6etih, FINISHED, 5 nodes
Splits: 70 total, 70 done (100.00%)
0:00 [8 rows, 326B] [27 rows/s, 1.08KB/s]
I have barely touched Hadoop settings such as hdfs-site.xml or core-site.xml, and Presto's configuration is nothing but memory settings.
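For context, here is a sanitized sketch of the kind of Hive catalog file I am using (etc/catalog/hive.properties); the metastore host name is a placeholder, and I have not pointed Presto at the cluster's Hadoop config files:

    # etc/catalog/hive.properties (sketch; the host name is a placeholder)
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://hn0-mycluster.internal.cloudapp.net:9083
    # Not set on my cluster, though I gather WASB access may need it:
    # hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml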
Any help would be appreciated. Thanks for reading.
You can install Starburst Presto from the HDInsight marketplace.
Read more: https://azure.microsoft.com/pl-pl/blog/azure-hdinsight-and-starburst-brings-presto-to-microsoft-azure-customers/
However, Starburst no longer provides an updated version of this solution and recommends a Kubernetes-based deployment (e.g. on Azure AKS) instead. See https://docs.starburstdata.com/latest/installation/azure.html
Disclaimer: I am from Starburst.
I've got an existing Cassandra application that I want to try in DataStax Astra. I've created my Astra database and am able to connect to it from NodeJS successfully. The next task is to apply my existing schema.cql to the keyspace.
It should be easy enough, right? Nevertheless, I can't see any obvious way to do this. Can someone walk me through it, please? I think I can then use the dsbulk tool to upload a dataset.
Thanks in advance, Astra experts.
Rod
The easiest way to do it is to cut and paste the CQL statements one by one into the CQL console and run them one at a time. This is what I would recommend, since you then have a guarantee that each DDL statement has completed successfully before you execute the next one.
Alternatively, you can download the Standalone CQL Shell to your laptop/desktop and run the cqlsh commands from there. You will need to configure it with your app token + secure connect bundle to be able to connect to your database.
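For example, assuming the app token and secure connect bundle have already been downloaded (the paths and token below are placeholders), the whole schema file can be applied in one go:

    # Hypothetical paths and token; -b points at the secure connect bundle
    # and -f runs every statement in the file sequentially.
    cqlsh -u token -p 'AstraCS:<placeholder>' \
          -b /path/to/secure-connect-mydb.zip \
          -f schema.cql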
For more info, see the instructions for Installing standalone cqlsh. Cheers!
Converting comment to answer:
Unauthorized('Error from server: code=2100 [Unauthorized]
message="No SELECT permission on <table system_virtual_schema.keyspaces>
Which role did you create your token with in Astra DB? If it is not a privileged one, create a new token as "Database Administrator"; that role is able to SELECT FROM system_virtual_schema.keyspaces.
While creating narrowly scoped roles is a good idea, only privileged roles can run data definition language (DDL) commands.
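As a quick sanity check after switching to a privileged token, the read below is the kind of statement that fails with the code=2100 error under an under-privileged role:

    -- Should list the database's keyspaces once the token's role
    -- has read access to the system virtual schema.
    SELECT keyspace_name FROM system_virtual_schema.keyspaces;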
I am working on a project on Kubernetes where I use Spark SQL to create tables, and I would like to add partitions and schemas to a Hive Metastore. However, I have not found any proper documentation on installing a Hive Metastore on Kubernetes. Is this possible, given that I already have a PostgreSQL database installed? If yes, could you please point me to official documentation?
Thanks in advance.
Hive on MR3 allows the user to run the Metastore in a Pod on Kubernetes. The instructions may look complicated, but once the Pod is properly configured, it is easy to start the Metastore on Kubernetes. You can also find a pre-built Docker image on Docker Hub, and a Helm chart is provided as well.
https://mr3docs.datamonad.com/docs/k8s/guide/run-metastore/
https://mr3docs.datamonad.com/docs/k8s/helm/run-metastore/
The documentation assumes MySQL, but we have tested it with PostgreSQL as well.
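For reference, pointing the Metastore at an existing PostgreSQL database comes down to the standard JDO properties in hive-site.xml; below is a sketch with placeholder host, database name, and credentials (the PostgreSQL JDBC driver must also be on the Metastore classpath):

    <!-- hive-site.xml fragment; all values are placeholders -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://postgres.default.svc.cluster.local:5432/hivemetastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>changeme</value>
    </property>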
I am currently driving an R&D project testing hard against Azure's HDInsight Hadoop service. We use SQL Server Integration Services (SSIS) to manage ETL workflows, and so making HDInsight work with SSIS is a must.
I've had good success with a few of the Azure Feature Pack tasks, but there is no native HDInsight/Hadoop destination task for use with Data Flow Tasks (DFTs).
Problem With Microsoft's Hive ODBC Driver Within An SSIS DFT
I created a DFT with a simple SQL Server "OLE DB Source" pointing to the cluster through an "ODBC Destination" that uses the Microsoft Hive ODBC Driver. (Ignore the red error shown; it has simply detected that the cluster has since been destroyed.)
I've tested the cluster's ODBC connection after entering all the parameters, and it tests "OK". It is even able to read the Hive table and map all the columns. The problem arrives at run time: it generally just locks up with no rows on the counter, or it gets a handful of rows into the buffer and then freezes.
Here is what I have tried while troubleshooting:
Verified the connection string and the Hadoop cluster username/password.
Recreated the cluster and the task several times.
Confirmed that the SQL Server source runs fine if I point it to only a file destination or a recordset destination.
Tested a smaller number of rows to rule out a simple performance issue (SELECT TOP 100 * FROM stupidTable), and also tested with only 4 columns.
Tested on a separate workstation to make sure it wasn't related to the machine.
All that said, I can't figure out what else to try. I'm not doing much different from examples on the web like this one, except that I'm using ODBC as a destination and not a source.
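For reference, the connection I am testing is equivalent to the DSN-less string below (the host and credentials are sanitized placeholders; AuthMech=6 is, as far as I can tell from the driver documentation, the Windows Azure HDInsight Service mode):

    Driver={Microsoft Hive ODBC Driver};Host=mycluster.azurehdinsight.net;
    Port=443;HiveServerType=2;AuthMech=6;UID=admin;PWD=<placeholder>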
Has anyone had success using the Hive driver, or another one, within an SSIS destination task? Thanks in advance.
I have created a Spark HDInsight Cluster on Azure. The cluster was used to run different jobs (either Spark or Hive).
Until a month ago, the history of the jobs could be seen in the Spark History Server dashboard. It seems that following the update that introduced Spark 1.6.0, this dashboard no longer shows any applications.
I have also tried to work around this issue by executing the Get-AzureHDInsightJob PowerShell cmdlet, as suggested here. The output is again an empty list of applications.
I would appreciate any help as this dashboard used to work and now all my experiments are stalled.
I managed to solve the issue by deleting everything inside wasb:///hdp/spark-events. The issue may have been related to the size of the folder, as no further log files could be appended.
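In case it helps someone, the cleanup amounted to something like the following, run from the cluster head node (check your spark.eventLog.dir setting first; the path below is the default one named above, and the deleted logs are not recoverable):

    # Deletes all accumulated Spark event logs.
    hdfs dfs -rm -r -skipTrash wasb:///hdp/spark-events/*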
All subsequent jobs now appear successfully in the Spark History Server dashboard.
I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this an engineering exercise the better. Here is what I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower for our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds, while the same query on the Hive query console and over the ODBC connection took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally, since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I asking Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but that sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure it is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and at using Tez. I would create a cluster that has Tez on by default, and then store your data in ORC format. Your queries should be much more performant then.
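As a rough sketch (the table names below are placeholders, not from the question), switching a session to Tez and rewriting a table into ORC looks like this:

    -- Enable Tez for the session (Tez-enabled clusters have it on by default).
    SET hive.execution.engine=tez;

    -- Rewrite the existing table into ORC format.
    CREATE TABLE report_data_orc STORED AS ORC
    AS SELECT * FROM report_data;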
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it, and while the performance is not comparable to what you'll get with MSSQL, another RDBMS, or something like Elasticsearch, it does work pretty reliably.