Glue Catalog w/ Delta Tables Connected to Databricks SQL Engine - apache-spark

I am trying to query Delta tables from my AWS Glue Catalog with the Databricks SQL engine. The tables are stored in Delta Lake format, and Glue crawlers keep their schemas up to date. The catalog is set up and working for non-Delta tables. Databricks lists the available tables per database via the catalog, but the query fails because Databricks reads them with Hive instead of Delta:
Incompatible format detected.
A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
SQL Warehouse settings => Data Access Configuration
spark.databricks.hive.metastore.glueCatalog.enabled : true
The crawler, created with the Delta Lake setup from AWS, produces the following table metadata:
{
"StorageDescriptor": {
"cols": {
"FieldSchema": [
{
"name": "id",
"type": "string",
"comment": ""
},
{
"name": "media",
"type": "string",
"comment": ""
},
{
"name": "media_type",
"type": "string",
"comment": ""
},
{
"name": "title",
"type": "string",
"comment": ""
},
{
"name": "type",
"type": "smallint",
"comment": ""
},
{
"name": "clubmessage_id",
"type": "string",
"comment": ""
}
]
},
"location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
"inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"compressed": "false",
"numBuckets": "-1",
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"parameters": {}
},
"bucketCols": [],
"sortCols": [],
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
},
"SkewedInfo": {},
"storedAsSubDirectories": "false"
},
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
}
}

I am facing the same problem. It seems you cannot use Spark SQL to query a Delta table in Glue, because setting
spark.databricks.hive.metastore.glueCatalog.enabled : true
implies that the table will be treated as a Hive table.
You would need to access the table in S3 directly, losing the advantages of the metadata catalog.
You can read from it, though, by blocking your cluster from accessing the _delta_log folder with the following IAM policy:
{ "Sid": "BlockDeltaLog", "Effect": "Deny", "Action": "s3:*", "Resource": [ "arn:aws:s3:::BUCKET" ], "Condition": { "StringLike": { "s3:prefix": [ "_delta_log/" ] } } }

I was able to query a delta table created by glue crawlers after updating the location. In your case it would need to be changed from:
s3://COMPANY/club/attachment/_symlink_format_manifest
to
s3://COMPANY/club/attachment
This is because Delta on Spark doesn't use _symlink_format_manifest the way Hive and Presto do; it only needs to know the table's root directory.
The command in databricks to update the location is something like this:
ALTER TABLE my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"
Note: your database location has to be set as well in order for that command to work
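To confirm the location change took effect, something like the following can be run from a notebook. This is a minimal sketch, assuming the table name from the example above.
# Show the table's registered location after the ALTER TABLE.
detail = spark.sql("DESCRIBE FORMATTED my_db.my_table")
detail.filter(detail.col_name == "Location").show(truncate=False)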

Related

Azure Data Factory V2 - Input and Output

I'm trying to reproduce the following architecture based on this GitHub repo: https://github.com/Azure/cortana-intelligence-price-optimization
The problem is the part related to ADF, since the guide uses the old version of ADF: I don't know how to map the "input" and "output" properties of a single activity in ADF v2 so that they point to a dataset.
The pipeline performs a Spark activity that does nothing more than execute a Python script, and then, I think, it should write data into the dataset I have already defined.
Here is the JSON of the ADF v1 pipeline from the guide, which I cannot replicate:
"activities": [
{
"type": "HDInsightSpark",
"typeProperties": {
"rootPath": "adflibs",
"entryFilePath": "Sales_Data_Aggregation_2.0_blob.py",
"arguments": [ "modelsample" ],
"getDebugInfo": "Always"
},
"outputs": [
{
"name": "BlobStoreAggOutput"
}
],
"policy": {
"timeout": "00:30:00",
"concurrency": 1,
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AggDataSparkJob",
"description": "Submits a Spark Job",
"linkedServiceName": "HDInsightLinkedService"
},
The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities. When you use an on-demand Spark linked service, Data Factory automatically creates a Spark cluster for you just-in-time to process the data and then deletes the cluster once the processing is complete.
Upload "Sales_Data_Aggregation_2.0_blob.py" to storage account attached to the HDInsight cluster and the modify the sample definition of a spark activity and create a schedule trigger and run the code:
Here is the sample JSON definition of a Spark activity:
{
"name": "Spark Activity",
"description": "Description",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"sparkJobLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"rootPath": "adfspark",
"entryFilePath": "test.py",
"sparkConfig": {
"ConfigItem1": "Value"
},
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
]
}
}
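As a side note, one way to get the .py file into the attached storage account is the azure-storage-blob Python package. This is a minimal sketch, assuming the "adflibs" container from the original pipeline's rootPath and a placeholder connection string for the storage account.
from azure.storage.blob import BlobServiceClient

# Connection string for the storage account attached to the HDInsight cluster (placeholder).
conn_str = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
service = BlobServiceClient.from_connection_string(conn_str)

# Upload the Spark entry script into the container used as rootPath.
blob = service.get_blob_client(container="adflibs", blob="Sales_Data_Aggregation_2.0_blob.py")
with open("Sales_Data_Aggregation_2.0_blob.py", "rb") as f:
    blob.upload_blob(f, overwrite=True)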
Hope this helps.

kafka-connect : Error in sink cassandra connector

I get a runtime error from a Cassandra sink connector. I am trying to pick data from Kafka and store it in Cassandra. You can find the error stack below:
{
"name": "cassandraSinkConnector2",
"connector": {
"state": "RUNNING",
"worker_id": "localhost:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "localhost:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table. Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.\n\tat io.confluent.connect.cassandra.CassandraSinkTask.put(CassandraSinkTask.java:94)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)\n\t... 10 more\n"
}
],
"type": "sink"
}
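For context, the status document above is what the Kafka Connect REST API returns from the connector status endpoint. A minimal sketch of fetching it with Python requests, assuming the worker address from the output (localhost:8083):
import requests

# Ask the Connect worker for the connector's current status.
status = requests.get("http://localhost:8083/connectors/cassandraSinkConnector2/status").json()
print(status["connector"]["state"])
for task in status["tasks"]:
    print(task["id"], task["state"])  # surfaces the FAILED task and its trace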
I used the distributed configuration below for my connector:
{
"name": "cassandraSinkConnector2",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "1",
"topics": "appartenance_de",
"cassandra.contact.points": "localhost",
"cassandra.kcql": "INSERT INTO app_test SELECT * FROM app_de",
"cassandra.port": "9042",
"cassandra.keyspace": "dev_dkks",
"cassandra.username": "superuser",
"cassandra.password": "superuser",
"cassandra.write.mode": "upsert",
"value.converter.schemas.enable": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"transforms": "createKey,extractInt",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id",
"name": "cassandraSinkConnector2"
},
"tasks": [
{
"connector": "cassandraSinkConnector2",
"task": 0
}
],
"type": "sink"
Per my answer here, the error you're seeing is
org.apache.kafka.connect.errors.DataException:
Record with a null key was encountered. This connector requires that records from Kafka contain the keys for the Cassandra table.
Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.
I'd suggest using a Single Message Transform as suggested in the error to correctly key your data. You can see an example of doing this here and the documentation for the transform here.
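If you adjust the transforms, the updated configuration can be pushed to the running connector with a PUT against the Connect REST API instead of recreating it. A minimal sketch with Python requests, reusing keys from the question's config and keeping only the ValueToKey step for brevity; the omitted cassandra.* and converter settings would need to be included as well.
import requests

# Partial config for illustration: ValueToKey builds a struct key from the value's "id" field.
config = {
    "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
    "topics": "appartenance_de",
    "transforms": "createKey",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    # ...plus the cassandra.* and converter settings from the question
}
requests.put("http://localhost:8083/connectors/cassandraSinkConnector2/config", json=config)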

Auto-Table generation in Cassandra with kafka connect cassandra sink

I am using io.confluent.connect.cassandra.CassandraSinkConnector for the Kafka Connect Cassandra sink.
I wanted to know if it is possible to auto-generate Cassandra tables from the Kafka topic using io.confluent.connect.cassandra.CassandraSinkConnector as the connector.
If it is possible, can you please suggest what configuration to set to enable this feature? I have tried all the configurations mentioned in the documentation, but I was not successful in creating a table.
This is the configuration I am using:
{
"name": "cassandra-test4",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "3",
"topics": "orders-topic2",
"cassandra.contact.points": "my_ip",
"cassandra.keyspace": "test_cas",
"cassandra.write.mode": "Insert",
"cassandra.table.manage.enabled": "true",
"cassandra.sink.route": "test_cas.orders",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter.schema.registry.url": "http://localhost:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"flush.size": "1",
"cassandra.keyspace.create.enabled": "true",
"name": "cassandra-test4"
},
"tasks": [
{
"connector": "cassandra-test4",
"task": 0
},
{
"connector": "cassandra-test4",
"task": 1
},
{
"connector": "cassandra-test4",
"task": 2
}
],
"type": null
}
This should be done by setting the cassandra.keyspace.create.enabled & cassandra.table.manage.enabled properties to true. See the documentation.
But be really careful - it's very easy to get schema disagreement in your cluster, and then you need to do additional steps to recover from it. It's better to pre-create the tables before starting the connector (a sketch follows)...
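A minimal sketch of pre-creating the target table with the Python cassandra-driver, using the keyspace and table from the question's cassandra.sink.route (test_cas.orders); the column list here is hypothetical and must match the Avro schema of the records on orders-topic2.
from cassandra.cluster import Cluster

# Contact point taken from the question's config ("my_ip").
cluster = Cluster(["my_ip"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS test_cas
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Hypothetical columns; adjust to the actual record fields and key.
session.execute("""
    CREATE TABLE IF NOT EXISTS test_cas.orders (
        id int PRIMARY KEY,
        product text,
        quantity int,
        price double
    )
""")
cluster.shutdown()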

Azure Data factory Dataset From a StoredProcedure

My goal is to pass data from one Azure SQL database (User DB) to another Azure SQL database (Datawarehouse) through a stored procedure.
I have created two linked services, one for each DB, and two datasets, about which I have doubts.
The stored procedure in question collects data from a table, joins it with several other tables, and returns a result that should be stored in a table in the Datawarehouse.
The SP is like this:
ALTER PROCEDURE [DataWarehouse].[Item_init]
AS
BEGIN
SET NOCOUNT ON
SELECT Id, a.Name, Code, f.Name, s.Name, g.Name
FROM Item.Item a
join Item.Group g on g.idGroup = a.idGroup
join Item.Subfam s on s.idSubfam = g.idSubfam
join Item.Fam f on f.idFam = s.idFam
END
The dataset that collects data from the UserDB (which I think is not correct) looks like this:
{
"name": "ds_SProcItem_init",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "UserTable",
"typeProperties": {
"tableName": "Item.Item"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The other dataset:
{
"name": "ds_DWItemOutput",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "DataWareHouse",
"typeProperties": {
"tableName": "Item"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The pipeline that connects the two datasets is as follows:
{
"name": "SprocItem_InitPipeline",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "DataWarehouse.Item_init"
},
"inputs": [
{
"name": "ds_SProcItem_init"
}
],
"outputs": [
{
"name": "ds_DWItemOutput"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "SprocItem_Init"
}
],
"start": "2016-08-02T00:00:00Z",
"end": "2016-08-02T05:00:00Z",
"isPaused": false,
"hubName": "pruebasaas_hub",
"pipelineMode": "Scheduled"
}
}
Please, someone who knows the subject, could you help me?
Thanks!
Given the limits of Azure SQL DB, I suggest you need to use a copy activity here as well as the stored procedure. You need to handle this within the confines of how ADF wants to work. Remember this isn't SSIS :-)
If I were building the data factory, these are the steps I'd take...
1. For completeness, define datasets for each of the tables used by the stored procedure.
2. First pipeline. Have an activity that calls the stored procedure that does the joins of the input datasets and outputs to a new staging table (do a SQL INSERT INTO ... SELECT ... here) on the first Azure SQL DB instance.
3. Have the output dataset in ADF for the staging table defined (the proc result).
4. Second pipeline. Have a copy activity with the output staging table from point 3 as the input. Then output to the table on the second Azure SQL DB instance.
5. Again for completeness, an ADF dataset for the final destination table.
The copy activity bridges the gap where cross database queries aren't possible and SQL Server Linked Servers don't exist.
Picture to help...
(Please forgive the poor paint skills)
Make sense? :-)
Good, crack on.

Error while running U-SQL Activity in Pipeline in Azure Data Factory

I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code for the output dataset, the pipeline, and the U-SQL script which I am trying to execute in the pipeline.
OutputDataset:
{
"name": "OutputDataLakeTable",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "LinkedServiceDestination",
"typeProperties": {
"folderPath": "scripts/"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"script": "SearchLogProcessing.txt",
"scriptPath": "scripts\\",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/demo/SearchLog.txt",
"out": "/scripts/Result.txt"
}
},
"inputs": [
{
"name": "InputDataLakeTable"
}
],
"outputs": [
{
"name": "OutputDataLakeTable"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "CopybyU-SQL",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-01-03T12:01:05.53Z",
"end": "2017-01-03T13:01:05.53Z",
"isPaused": false,
"hubName": "denojaidbfactory_hub",
"pipelineMode": "Scheduled"
}
}
Here is my U-SQL script which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your script is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage Linked Service, for example:
{
"name": "StorageLinkedService",
"properties": {
"description": "",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
}
}
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "adlascripts\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/input/SearchLog.tsv",
"out": "/output/Result.tsv"
}
},
...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo, and there seem to be some issues, like: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? OK, I found it by googling, but there should be a link in the webpage. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.
