HDInsight and Talend Open Studio for Big Data - Azure

I am currently working on a project in which I need to connect Talend Open Studio for Big Data (v6.3.1) to an Azure HDInsight (3.5) Hadoop cluster. So far, I am trying a simple example that consists of creating a Hive table.
For that, I am using the following diagram:
The Hive connection was configured as follows:
… and please find below the specifications of the tHiveCreateTable_1 node:
By running this process:
· The specified container and deployment blob are created (see image below), which makes me believe that everything is OK with the Windows Azure Storage configuration;
· However, the tHiveCreateTable_1 node fails with an error (see image below);
· I strongly believe that it's something related to the host name and port;
· I tried using the host name of the cluster and the host name of the Hive server that can be found in Ambari (see image below);
· But neither of them worked as expected.
Has anyone tried something similar to this?
Note: it seems worth mentioning that the HDInsight version supported by Talend is 3.4, whereas I am using 3.5; that might be the cause.
Many thanks for your help in advance.

According to the official documentation about the differences between Hadoop components and versions available with HDInsight, HDInsight 3.5 is based on Hortonworks Data Platform (HDP) 2.5, while HDI 3.4 is based on HDP 2.4. However, there is no major version difference in their Hive or other components. So my suggestion is to create an HDI 3.4 cluster that uses the same Azure Storage account as your current HDI 3.5 cluster; that should cover your needs without further side effects.
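As a side note on the host name and port issue raised above: HDInsight does not expose HiveServer2 directly on port 10000 to the outside world; external clients go through the cluster gateway on port 443 (the JDBC form is jdbc:hive2://<cluster>.azurehdinsight.net:443/;ssl=true;transportMode=http;httpPath=/hive2). Below is a minimal Python sketch, assuming a hypothetical cluster name and the cluster login credentials, that sanity-checks the gateway through the WebHCat status API:

# Hypothetical connectivity check for an HDInsight cluster gateway; the
# cluster name and credentials below are placeholders (cluster login, not SSH).
import requests

CLUSTER = "mycluster"
AUTH = ("admin", "cluster-login-password")

# WebHCat status endpoint behind the gateway; a 200 response means the
# host name and port (443) are reachable and the credentials are accepted.
url = "https://{}.azurehdinsight.net/templeton/v1/status".format(CLUSTER)
resp = requests.get(url, params={"user.name": AUTH[0]}, auth=AUTH)
print(resp.status_code, resp.text)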

Related

Compatible libraries for Azure Cosmos GraphDB and Azure Databricks

I am trying to offload data from Azure Databricks onto Azure Cosmos DB (Graph API) as vertices and edges.
I keep encountering a java.lang.ClassNotFoundException. I have tried just about every combination of library versions and corresponding Databricks Runtime versions, but no luck. I have tried most of the compatible library versions mentioned under https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-2_2-12/README.md#download
I will be using DBR 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12), so any guidance on the right Maven libraries for Azure Cosmos DB Graph, please?
java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.azure.cosmosdb.spark.
Please find packages at http://spark.apache.org/third-party-projects.html
The library below, together with GraphFrames, did the trick. I am able to ingest data into Azure Cosmos DB, even quicker than with gremlin-python.
com.azure.cosmos.spark:azure-cosmos-spark_3-2_2-12:4.11.1
I had to use the cosmos.oltp (SQL API) format along with the above library.
cosmos_edges.write.format("cosmos.oltp").options(**cfg).mode("APPEND").save()
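For completeness, here is a minimal sketch of what the cfg dictionary used above could look like; the endpoint, key, database, and container values are placeholders, and the option keys are the standard ones for the azure-cosmos-spark OLTP connector:

# Hypothetical connector configuration; replace the placeholders with your
# Cosmos DB account endpoint, key, database, and container.
cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<account-key>",
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
}

# cosmos_edges is the edges DataFrame from the question.
cosmos_edges.write.format("cosmos.oltp").options(**cfg).mode("APPEND").save()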

Apache Spark Connector - where to install on Databricks

The Apache Spark connector: SQL Server & Azure SQL article from the Azure team describes how to use this connector.
Question: If you want to use the above connector in Azure Databricks, where will you install it?
Remarks: The above article tells you to install the connector from here and import it in, say, your notebook using the Maven coordinate com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, but it does not say where to install it. I'm probably not understanding the article correctly. I need to use it in Azure Databricks and would like to know where to install the connector's (compiled) jar file.
You can do this in the cluster setup. See this documentation: https://databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html
In short, when setting up the cluster, you can add third-party libraries by their Maven coordinates; "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0" is an example of such a coordinate.
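Once the library is attached to the cluster by that Maven coordinate, any notebook running on the cluster can use it directly. A minimal sketch, with placeholder server, database, table, and credentials:

# Hypothetical read through the attached connector; the server, database,
# table, and credentials are placeholders. `spark` is the SparkSession that
# Databricks provides in every notebook.
df = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
      .option("dbtable", "dbo.my_table")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())
df.show()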

What happens exactly when setting spark.databricks.service.server.enabled to true on Databricks?

Can anyone explain what spark.databricks.service.server.enabled does exactly when it's set to true?
The only thing I can find in the documentation (https://docs.databricks.com/dev-tools/databricks-connect.html) is that it should be set to true when using Databricks Runtime 5.3 or below, but I can't find an explanation of what exactly happens under the hood.
I would be grateful for any helpful response.
Thanks,
Note: setting spark.databricks.service.server.enabled to true lets you work on a Databricks cluster from a remote machine.
Setting "spark.databricks.service.server.enabled true" as part of the cluster setup enables Databricks Connect, which allows you to connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (Zeppelin, Jupyter), and other custom applications to Azure Databricks clusters and run Apache Spark code.
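In other words, with the flag set on the cluster and the databricks-connect client package configured locally (via databricks-connect configure), a plain Spark script on your own machine runs against the remote cluster. A minimal sketch of the client side, assuming the client is already configured:

# Hypothetical local script; assumes `pip install databricks-connect` and
# `databricks-connect configure` have already pointed the client at the cluster.
from pyspark.sql import SparkSession

# With Databricks Connect, this session is backed by the remote Databricks
# cluster rather than a local Spark installation.
spark = SparkSession.builder.getOrCreate()

print(spark.range(100).count())  # executed on the remote cluster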

Cassandra (Datastax v3.5) using Stratio Lucene Index plugin - Windows

I'm trying to use the Stratio Lucene index plugin on a Windows installation of Cassandra (DataStax v3.5), but I can't get Cassandra to recognize it.
I'm aware that you must use the plugin version corresponding to your Cassandra version, and I have tried 3.0.5 and 3.5, both with the same result. The service is stopped, the index .jar file is copied to the lib directory, and the service is restarted. Then, using CQLSH, I can create the relevant keyspace and table (as described in the Stratio documentation), but attempting to create the index fails with the following message:
Query invalid because of configuration issue: message="Unable to find custom indexer class 'com.stratio.cassandra.lucene.Index'"
https://github.com/Stratio/cassandra-lucene-index/tree/branch-3.5
Does anyone have any idea how to get this implemented & working?
Is there a central forum or a point of contact for Stratio's Lucene index support?
This resource (https://github.com/Stratio/cassandra-lucene-index/issues/118#issuecomment-211796434) suggests that only open-source Apache Cassandra is officially supported by this plugin; it might work with DSE, or it might not. I checked: version 3.5.0 works on Linux with Apache Cassandra but does not work on Windows with DSE :( According to the DataStax docs, DSE should support custom secondary indexes, so it may be that the plugin simply does not run on Windows.
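For reference, once the jar is actually picked up by Cassandra, the index is created along these lines. A sketch using the Python cassandra-driver, where the contact point, keyspace, table, and field schema are placeholders modeled on the Stratio documentation:

# Hypothetical index creation against a local node; keyspace and table names
# and the field schema are placeholders.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")
session.execute("""
    CREATE CUSTOM INDEX tweets_index ON tweets ()
    USING 'com.stratio.cassandra.lucene.Index'
    WITH OPTIONS = {
        'refresh_seconds': '1',
        'schema': '{ fields: { body: {type: "text", analyzer: "english"} } }'
    }
""")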

Upgrade Pig on HDInsight Emulator

I am currently using the HDInsight Hadoop Emulator, which comes with Pig version 0.12. Our problem involves parsing XML files, and I'd like to use the XPath command from PiggyBank, but it is only available as of Pig version 0.13.
a. Can I upgrade Pig in the emulator? How would I go about doing that?
b. Is the version of Pig really critical, or could I just get the latest version of the PiggyBank.jar file and use that?
Currently there is no way to update component versions for the HDInsight emulator (or at least it is very hard to do).
I have never used PiggyBank, but from the introduction page (https://cwiki.apache.org/confluence/display/PIG/PiggyBank) it seems that it is a collection of UDFs which should work with Pig 0.12. So I guess using the jar directly (and of course registering it in Pig) should work; see the sketch below.
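A sketch of what that could look like, driving Pig from Python with a hypothetical path to the newer piggybank jar (the XMLLoader and XPath class names are the ones shipped in the Pig 0.13 PiggyBank); whether every 0.13 UDF actually runs on Pig 0.12 is not guaranteed:

# Hypothetical sketch: run a Pig script that registers a downloaded Pig 0.13
# piggybank.jar and calls its XPath UDF; jar path and input are placeholders.
import os
import subprocess
import tempfile

script = """
REGISTER '/path/to/piggybank-0.13.jar';  -- assumption: downloaded jar location
docs   = LOAD 'input.xml'
         USING org.apache.pig.piggybank.storage.XMLLoader('record') AS (doc:chararray);
titles = FOREACH docs
         GENERATE org.apache.pig.piggybank.evaluation.xml.XPath(doc, 'record/title');
DUMP titles;
"""

# Write the script to a temp file and hand it to the pig launcher on PATH.
path = os.path.join(tempfile.mkdtemp(), "xpath_test.pig")
with open(path, "w") as f:
    f.write(script)
subprocess.run(["pig", path], check=True)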
Also, we are looking into an updated story for the HDInsight emulator - so feel free to reach us at hdivstool at microsoft dot com if you have any thoughts, comments, or requirements.
Xiaoyong Zhu from HDInsight team