Compatible libraries for Azure Cosmos GraphDB and Azure Databricks - azure

I am trying to load data from Azure Databricks into an Azure Cosmos DB graph database as vertices and edges.
I keep hitting a java.lang.ClassNotFoundException. I have tried most combinations of library versions and Databricks Runtime versions, including most of the compatible versions listed at https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-2_2-12/README.md#download, without success.
I will be using DBR 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12). Which Maven libraries are the right ones for Azure Cosmos DB graph?
java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.azure.cosmosdb.spark.
Please find packages at http://spark.apache.org/third-party-projects.html

The library below, together with GraphFrames, did the trick. I can now ingest data into Azure Cosmos DB, and it is even quicker than with gremlin-python.
com.azure.cosmos.spark:azure-cosmos-spark_3-2_2-12:4.11.1
I had to write through the cosmos.oltp format (the SQL API) with the above library:
cosmos_edges.write.format("cosmos.oltp").options(**cfg).mode("APPEND").save()
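The answer does not show what cfg contains. As a minimal sketch, assuming the standard option names of the azure-cosmos-spark connector, it could be built as below. The helper name build_cosmos_cfg and all placeholder values are hypothetical, and the endpoint is assumed to be the account's SQL (documents) endpoint rather than its Gremlin endpoint.

```python
def build_cosmos_cfg(endpoint, key, database, container):
    """Return the option dict passed to .options(**cfg) for the cosmos.oltp format."""
    return {
        "spark.cosmos.accountEndpoint": endpoint,  # SQL (documents) endpoint of the account
        "spark.cosmos.accountKey": key,
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,       # the graph container to write into
    }


# Placeholder values (assumptions); substitute your own account details.
cfg = build_cosmos_cfg(
    endpoint="https://<account>.documents.azure.com:443/",
    key="<account-key>",
    database="<database>",
    container="<graph-container>",
)
```

The resulting dict is what the write call above unpacks with **cfg.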

Related

Apache Spark Connector - where to install on Databricks

This article from the Azure team, "Apache Spark connector: SQL Server & Azure SQL", describes how to use this connector.
Question: If you want to use the above connector in Azure Databricks, where do you install it?
Remarks: The article tells you to install it from here and import it in, say, your notebook using com.microsoft.azure:spark-mssql-connector_2.12:1.2.0. But it does not say where to install it. I'm probably not understanding the article correctly. I need to use it in Azure Databricks and would like to know where to install the compiled connector jar file.
You can do this in the cluster setup. See this documentation: https://databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html
In short, when setting up the cluster, you can add third party libraries by their Maven coordinates - "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0" is an example of a Maven coordinate.
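If you prefer to script this instead of using the cluster UI, the same Maven coordinate can be sent to the Databricks Libraries API (POST /api/2.0/libraries/install). The sketch below only builds the request body. The helper name maven_install_payload and the cluster ID placeholder are assumptions, and sending the request (host, token) is left out.

```python
import json


def maven_install_payload(cluster_id, coordinates):
    """Build the JSON body for installing Maven libraries on one cluster."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": c}} for c in coordinates],
    }


# "<cluster-id>" is a placeholder; the coordinate is the one from the answer above.
payload = maven_install_payload(
    "<cluster-id>",
    ["com.microsoft.azure:spark-mssql-connector_2.12:1.2.0"],
)
print(json.dumps(payload, indent=2))
```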

.Netcore alternative for Microsoft.Azure.Management.HDInsight.Job?

I'm working on converting a library from full .NetFramework to .NetCore
I'm looking for a replacement for Microsoft.Azure.Management.HDInsight.Job, which hasn't been updated in over a year and is not compatible with .NetCore. I was hoping that the functionality would be rolled up into the much-more-recently-updated and netcore-compatible Microsoft.Azure.Management.HDInsight, but that doesn't appear to be the case.
I'm down to use the REST API, but I haven't been able to find the same functionality there. Any guidance would be appreciated.
You could try installing a prerelease version of Microsoft.Azure.Management.HDInsight.Job through the Package Manager, so that its dependencies do not conflict with ASP.NET Core.
I tested them: even though they are previews, they provide the functionality you want.
In the Package Manager Console, run for example:
Install-Package Microsoft.Azure.Management.HDInsight.Job -Version 1.0.7-preview
You can only install versions up to and including 1.0.7-preview. Later versions may fail to install.
For more detail, you could refer to this article.
I found the REST API I was looking for. It is the WebHCat API, not an Azure API.
MapReduce Job creation: https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+MapReduceJar
Pig Job creation: https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig
Hive Job creation: https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Hive
Sqoop Job creation: https://learn.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-use-sqoop-curl and https://sqoop.apache.org/docs/1.99.3/RESTAPI.html
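As a rough sketch of what a WebHCat Hive submission looks like, the snippet below builds the URL and form body for a POST to the templeton/v1/hive endpoint. It assumes the usual HDInsight gateway address (https://CLUSTERNAME.azurehdinsight.net) and the user.name, execute and statusdir parameters from the WebHCat reference. The helper name and all values are placeholders, and authentication and the actual HTTP call are omitted.

```python
from urllib.parse import urlencode


def webhcat_hive_request(cluster_name, user, hive_query, status_dir):
    """Return (url, form_body) for submitting a Hive query through WebHCat."""
    url = f"https://{cluster_name}.azurehdinsight.net/templeton/v1/hive"
    body = urlencode({
        "user.name": user,        # user the job runs as
        "execute": hive_query,    # HiveQL to run
        "statusdir": status_dir,  # where WebHCat writes job status and output
    })
    return url, body


# Placeholder values (assumptions), not taken from the question.
url, body = webhcat_hive_request("<cluster>", "admin", "SHOW TABLES;", "/example/status")
print(url)
print(body)
```

The body is sent as application/x-www-form-urlencoded with HTTP basic authentication for the cluster login. The MapReduce and Pig endpoints follow the same pattern with their own parameters.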
Hopefully they will release 3.0.0 soon
https://github.com/Azure/azure-sdk-for-net/issues/9219

HDInsight and Talend Open Studio for Big Data

I am currently working on a project in which I need to connect Talend Open Studio for Big Data (v 6.3.1) to an Azure HDInsight (3.5) Hadoop cluster. So far, I am trying a simple example that consists of creating a Hive table.
For that, I am using the following diagram:
The Hive connection was configured as follows:
… and please find below the specifications of the tHiveCreateTable_1 node:
By running this process:
· The specified container and deployment blob are created (see image below), which makes me believe that everything is OK with the Windows Storage configuration
· However, the tHiveCreateTable_1 node has an error (see image below)
· I strongly believe that it's something related to the hostname and port;
· I tried to use the hostname of the cluster and the hostname of the Hive server that we can find in Ambari (see image below)
· But neither of them worked as expected.
Has anyone tried something similar to this?
Note: It seems important to mention that the HDInsight version supported by Talend is 3.4, whereas I am using 3.5, which might be the cause.
Many thanks for your help in advance.
According to the official document about the differences between Hadoop components and versions available with HDInsight, HDInsight 3.5 is based on Hortonworks Data Platform (HDP) 2.5, while HDI 3.4 is based on HDP 2.4. However, there is no big version difference between their Hive components or other components. So my suggestion is to create an HDI 3.4 cluster that uses the same Azure Storage account as your current HDI 3.5 cluster, which should not otherwise affect what you need.

Where is the Azure Storage library changelog?

I'm using the official nuget package of Windows Azure Storage Client Library to retrieve items of my Azure tables.
Recently I updated the package from version 2.0.2.0 to 2.0.5.0 and my app stopped working because the results returned by my storage query are different with the new version.
I'm looking for the library changelog in order to understand how to fix the issue.
Do you know where I can find it?
The link provided on the nuget page seems to be outdated (it's a changelog between 1.x and 2.x, not between 2.0.2 and 2.0.5!). Also, the Windows Azure Storage team's blog is not updated.
Please refer to the changelog.txt that is always updated with the respective source code changes.
The changelog is always up to date at https://github.com/Azure/azure-storage-net/blob/master/changelog.txt

Microsoft.WindowsAzure.Storage vs Microsoft.WindowsAzure.StorageClient

What's the difference between these two assemblies and when should I use each? I find that there are class name collisions between them so I imagine that I should only use one.
Example
Microsoft.WindowsAzure.Storage has Microsoft.WindowsAzure.Storage.Table.CloudTableClient
Microsoft.WindowsAzure.StorageClient has Microsoft.WindowsAzure.StorageClient.CloudTableClient
This seems very confusing. I can't imagine that Microsoft intends these to both be used in the same project.
Microsoft.WindowsAzure.Storage is version 2.0 of storage client library while Microsoft.WindowsAzure.StorageClient is the older version. There have been many changes in version 2.0 of the library (some of them are breaking). If you're starting new, I would actually recommend using 2.0 of the library as I found it more intuitive and easy to use than the older version. If you have an application which makes use of 1.7 version of the library, before you decide to upgrade, I would actually recommend reading the following blog posts by Windows Azure Storage Team:
http://blogs.msdn.com/b/windowsazurestorage/archive/2012/10/29/introducing-windows-azure-storage-client-library-2-0-for-net-and-windows-runtime.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2012/10/29/windows-azure-storage-client-library-2-0-breaking-changes-amp-migration-guide.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/06/windows-azure-storage-client-library-2-0-tables-deep-dive.aspx
However, please note that there are still some components your application might be using that depend on storage client library 1.7. Windows Azure Diagnostics is one of them. So for some time you will need to use both versions. The good thing is that you can use both versions simultaneously in your project.
Hope this helps.
EDIT:
I also wrote a few blog posts about migrating code from storage client library 1.7 to 2.0 where I covered some basic scenarios. You can read those posts here:
Migrating blob storage code: http://gauravmantri.com/2012/11/28/storage-client-library-2-0-migrating-blob-storage-code/
Migrating queue code: http://gauravmantri.com/2012/11/24/storage-client-library-2-0-migrating-queue-storage-code/
Migrating table storage code: http://gauravmantri.com/2012/11/17/storage-client-library-2-0-migrating-table-storage-code/

Resources