Create External Data Source with HDInsight - azure-hdinsight

I'm trying to create an external data source with my HDInsight Cluster. While doing so, I need to provide location as Hadoop, Name, Node, IP Address, and port number.
So, where could I find the Name, Node, IP Address, Resource Manager location, IP Address and port numbers for both on HDInsight cluster?
I already browsed through Core-site.xml & yarn-site.xml and found nothing for HDInsight.
--- 3: syntax for Creating an external data source.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
TYPE = HADOOP,
LOCATION ='hdfs://10.xxx.xx.xxx:xxxx',
RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
CREDENTIAL = HadoopUser1
);
-- LOCATION (Required) : Hadoop Name Node IP address and port.
-- RESOURCE MANAGER LOCATION (Optional): Hadoop Resource Manager location to enable pushdown computation.
-- CREDENTIAL (Optional): the database scoped credential, created above.
Thanks.

If I understand your question correctly you already have a HDInsight cluster and are trying to get Azure SQL DW to talk to it via an external table. If you search the Syntax section of the documentation for CREATE EXTERNAL DATA SOURCE for "Azure SQL Data Warehouse" you will see the only way Polybase in Azure SQL DW works at the moment is by talking to Azure Blob Storage and Azure Data Lake Store. (Stay tuned to that documentation page as Polybase in Azure SQL DW will get more flexible over time as they continue to enhance it.)
So for now you should have HDInsight write to an external table defined in Hive and then have Azure SQL DW point at the same folder in blob storage and declare its own external table that reads those blobs.

Related

Azure Data Factory Exception while reading table from Synapse and using staging for Polybase

I'm using Data Flow in Data Factory and I need to join a table from Synapse with my flow of data.
When I added the new source in Azure Data Flow I had to add a Staging linked service (as the label said: "For SQL DW, please specify a staging location for PolyBase.")
So I specified a path in Azure Data Lake Gen2 in which Polybase can create its tem dir.
Nevertheless I'm getting this error:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Source 'keyMapCliente': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://MyContainerName#mystorgaename.dfs.core.windows.net/Raw/Tmp/e3e71c102e0a46cea0b286f17cc5b945/' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.","Details":"shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://MyContainerName#mystorgaename.dfs.core.windows.net/Raw/Tmp/e3e71c102e0a46cea0b286f17cc5b945/' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:872)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:767)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jd"}
The following are the Azure Data Flow Settings:
this the added source inside the data flow:
Any help is appreciated
I have reproed and was able to enable stagging location as azure data lake Gen2 storage account for polybase and connected synapse table data successfully.
Create your database scooped credentials with azure storage account key as secret.
Create an external data source and an external table with the scooped credentials created.
In Azure data factory:
Enable staging and connect to azure data lake Gen2 storage account with Account key authentication type.
In the data flow, connect your source to the synapse table and enable staging property in the source option

Databricks - transfer data from one databricks workspace to another

How can I transform my data in databricks workspace 1 (DBW1) and then push it (send/save the table) to another databricks workspace (DBW2)?
On the DBW1 I installed this JDBC driver.
Then I tried:
(df.write
.format("jdbc")
.options(
url="jdbc:spark://<DBW2-url>:443/default;transportMode=http;ssl=1;httpPath=<http-path-of-cluster>;AuthMech=3;UID=<uid>;PWD=<pat>",
driver="com.simba.spark.jdbc.Driver",
dbtable="default.fromDBW1"
)
.save()
)
However, when I run it I get:
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
How to do this correctly?
Note: each DBW is in different subscription.
From my point of view, the more scalable way would be to write directly into ADLS instead of using JDBC. But this needs to be done as following:
You need to have a separate storage account for your data. Anyway, use of DBFS Root for storage of the actual data isn't recommended as it's not accessible from outside - that makes things, like, migration, more complicated.
You need to have a way to access that storage account (ADLS or Blob storage). You can use access data directly (via abfss:// or wasbs:// URLs)
In the target workspace you just create a table for your data written - so called unmanaged table. Just do (see doc):
create table <name>
using delta
location 'path_or_url_to data'

How to access one databricks delta tables from other databricks

I want to access one Databricks environment delta tables from other Databricks environment by creating global Hive meta store in one of the Databricks. Let me know if it is possible or not.
Thanks in advance.
There are two aspects here:
The data itself - they should be available to other workspaces - this is done by having a shared storage account/container, and writing data into it. You can either mount that storage account, or use direct access (via service principal or AAD passtrough) - you shouldn't write data to built-in DBFS Root that isn't available to other workspaces. After you write the data using dataframe.write.format("delta").save("some_path_on_adls"), you can read these data from another workspace that has access to that shared workspace - this could be done either
via Spark API: spark.read.format("delta").load("some_path_on_adls")
via SQL using following syntax instead of table name (see docs):
delta.`some_path_on_adls`
The metadata - if you want to represent saved data as SQL tables with database & table names instead of path, then you can use following choices:
Use the built-in metastore to save data into location on ADLS, and then create so-called external table in another workspace inside its own metastore. In the source workspace do:
dataframe.write.format("delta").option("path", "some_path_on_adls")\
.saveAsTable("db_name.table_name")
and in another workspace execute following SQL (either via %sql in notebook or via spark.sql function:
CREATE TABLE db_name.table_name USING DELTA LOCATION 'some_path_on_adls'
Use external metastore that is shared by multiple workspaces - in this case you just need to save data correctly:
dataframe.write.format("delta").option("path", "some_path_on_adls")\
.saveAsTable("db_name.table_name")
you still need to save it into shared location, so the data is accessible from another workspace, but you don't need to register the table explicitly, as another workspace will read the metadata from the same database.

Hive external tables map to azure blob storage

Is there a way to create a Hive external table using with location pointing to Azure Storage? We actually want to connect SAP HANA (SDA) to blob storage, so it seems the only way is to create an external hive table first which points to Azure blob storage and then use Hive ODBC connector/spark connectorto connect it toHANA SAP`. Does anyone have any idea how to achieve that?
You can create external tables in Hive or Spark on Azure. There are several options available:
Azure HDInsight
Azure Databricks (via Spark)
Hadoop distros supporting Azure Blob Storage (e. g. HDP)
External table creation would reference the data in the Blob storage account. See the following example for a Hive table created in HDInsight (wasb is used in the location):
CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
lines terminated by '<line separator>' STORED AS TEXTFILE
LOCATION 'wasb:///<directory in Azure blob>'
TBLPROPERTIES("skip.header.line.count"="1");
or in Azure Databricks:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'wasbs://<storage-account#<containername>.blob.core.windows.net/<directory>'
See also:
HDInsight Documentation
Azure Databricks Documentation
I don' t know what SAP supports. ODBC-Access is possible to all of the solutions.

How to create/access Hive tables with external Metastore on additional Azure Blob Storage?

I want to perform some data transformation in Hive with Azure Data Factory (v1) running a Azure HDInsight On Demand cluster (3.6).
Since the HDInsight On Demand cluster gets destroyed after some idle time and I want/need to keep the metadata about the Hive tables (e.g. partitions), I also configured an external Hive metastore, using a Azure SQL Server database.
Now I want to store all production data on a separate storage account than the one "default" account, where Data Factory and HDInsight also create containers for logging and other runtime data.
So I have the following resources:
Data Factory with HDInsight On Demand (as a linked service)
SQL Server and database for Hive metastore (configured in HDInsight On Demand)
Default storage account to be used by Data Factory and HDInsight On Demand cluster (blob storage, general purpose v1)
Additional storage account for data ingress and Hive tables (blob storage, general purpose v1)
Except the Data Factory, which is in location North Europe, all resources are in the same location West Europe, which should be fine (the HDInsight cluster must be in the same location as any storage accounts you want to use). All Data Factory related deployment is done using the DataFactoryManagementClient API.
An example Hive script (deployed as a HiveActivity in Data Factory) looks like this:
CREATE TABLE IF NOT EXISTS example_table (
deviceId string,
createdAt timestamp,
batteryVoltage double,
hardwareVersion string,
softwareVersion string,
)
PARTITIONED BY (year string, month string) -- year and month from createdAt
CLUSTERED BY (deviceId) INTO 256 BUCKETS
STORED AS ORC
LOCATION 'wasb://container#additionalstorage.blob.core.windows.net/example_table'
TBLPROPERTIES ('transactional'='true');
INSERT INTO TABLE example_table PARTITIONS (year, month) VALUES ("device1", timestamp "2018-01-22 08:57:00", 2.7, "hw1.32.2", "sw0.12.3");
Following the documentation here and here, this should be rather straightforward: Simply add the new storage account as an additional linked service (using the additionalLinkedServiceNames property).
However, this resulted in the following exceptions when a Hive script tried to access a table stored on this account:
IllegalStateException Error getting FileSystem for wasb : org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
139827842123416:error:0D06B08E:asn1 encoding routines:ASN1_D2I_READ_BIO:not enough data:a_d2i_fp.c:247:
139827842123416:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
139827842123416:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
Some googling told me that this happens, when the key provider is not configured correctly (i.e. the exceptions is thrown because it tries to decrypt the key even though it is not encrypted). After manually setting fs.azure.account.keyprovider.<storage_name>.blob.core.windows.net to org.apache.hadoop.fs.azure.SimpleKeyProvider it seemed to work for reading and "simple" writing of data to tables, but failed again when the metastore got involved (creating a table, adding new partitions, ...):
ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:783)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4434)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:316)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
[...]
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:38593)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:38561)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:38487)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1103)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1089)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2203)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:99)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:736)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:724)
[...]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:178)
at com.sun.proxy.$Proxy5.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:777)
... 24 more
I tried googling that again, but had no luck finding something usable. I think it may have to do something with the fact, that the metastore service is running separately from Hive and for some reason does not have access to the configured storage account keys... but to be honest, I think this should all just work without manually tinkering with the Hadoop/Hive configuration.
So, my question is: What am I doing wrong and how is this supposed to work?
You need to make sure you also add the hadoop-azure.jar and the azure-storage-5.4.0.jar to your Hadoop Classpath export in your hadoop-env.sh.
export HADOOP_CLASSPATH=/usr/lib/hadoop-client/hadoop-azure.jar:/usr/lib/hadoop-client/lib/azure-storage-5.4.0.jar:$HADOOP_CLASSPATH
And you will need to add the storage key via the following parameter in your core-site.
fs.azure.account.key.{storageaccount}.blob.core.windows.net
When you create your DB and table you need to specify the location using your storage account and the user id
Create table {Tablename}
...
LOCATION 'wasbs://{container}#{storageaccount}.blob.core.windows.net/{filepath}'
If you still have problems after trying the above check to see whether the storage account is a V1 or V2. We had an issue where the V2 storage account did not work with our version of HDP.

Resources