I get a runtime error for a Cassandra sink connector. I am trying to pick data from Kafka and store it in Cassandra. You will find the error stack below:
{
"name": "cassandraSinkConnector2",
"connector": {
"state": "RUNNING",
"worker_id": "localhost:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "localhost:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table. Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.\n\tat io.confluent.connect.cassandra.CassandraSinkTask.put(CassandraSinkTask.java:94)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)\n\t... 10 more\n"
}
],
"type": "sink"
}
I used the distributed configuration below for my connector:
{
"name": "cassandraSinkConnector2",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "1",
"topics": "appartenance_de",
"cassandra.contact.points": "localhost",
"cassandra.kcql": "INSERT INTO app_test SELECT * FROM app_de",
"cassandra.port": "9042",
"cassandra.keyspace": "dev_dkks",
"cassandra.username": "superuser",
"cassandra.password": "superuser",
"cassandra.write.mode": "upsert",
"value.converter.schemas.enable": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"transforms": "createKey,extractInt",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id",
"name": "cassandraSinkConnector2"
},
"tasks": [
{
"connector": "cassandraSinkConnector2",
"task": 0
}
],
"type": "sink"
Per my answer here, the error you're seeing is
org.apache.kafka.connect.errors.DataException:
Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table.
Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.
I'd suggest using a Single Message Transform as suggested in the error to correctly key your data. You can see an example of doing this here and the documentation for the transform here.
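For reference, here is a minimal sketch of the transform configuration, assuming the record value contains an id field that matches the Cassandra table's key column. Because the stack trace says the key must be a Struct or Map, the sketch keeps only ValueToKey and does not extract the field into a primitive key with ExtractField$Key:
"transforms": "createKey",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id"
With this in place, each record's key is a Struct containing the id field, which satisfies the "Key must be a struct or map" requirement from the stack trace.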
Related
I am trying to query Delta tables from my AWS Glue Catalog on the Databricks SQL engine. They are stored in Delta Lake format, and I have Glue crawlers automating the schemas. The catalog is set up and functioning with non-Delta tables. The Databricks setup loads the available tables per database via the catalog, but the query fails because Databricks reads with Hive instead of Delta.
Incompatible format detected.
A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
SQL Warehouse settings => Data Access Configuration
spark.databricks.hive.metastore.glueCatalog.enabled : true
The crawler, using the Delta Lake setup from AWS, produces the following table metadata:
{
"StorageDescriptor": {
"cols": {
"FieldSchema": [
{
"name": "id",
"type": "string",
"comment": ""
},
{
"name": "media",
"type": "string",
"comment": ""
},
{
"name": "media_type",
"type": "string",
"comment": ""
},
{
"name": "title",
"type": "string",
"comment": ""
},
{
"name": "type",
"type": "smallint",
"comment": ""
},
{
"name": "clubmessage_id",
"type": "string",
"comment": ""
}
]
},
"location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
"inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"compressed": "false",
"numBuckets": "-1",
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"parameters": {}
},
"bucketCols": [],
"sortCols": [],
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
},
"SkewedInfo": {},
"storedAsSubDirectories": "false"
},
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
}
}
I am facing the same problem. It seems you cannot use Spark SQL to query a Delta table in Glue, because setting
spark.databricks.hive.metastore.glueCatalog.enabled : true
implies that the table will be treated as a Hive table.
You would need to access the table in S3 directly, losing the advantages of the metadata catalog.
You can still read it, though, by blocking your cluster from accessing the _delta_log folder with the following IAM policy:
{ "Sid": "BlockDeltaLog", "Effect": "Deny", "Action": "s3:*", "Resource": [ "arn:aws:s3:::BUCKET" ], "Condition": { "StringLike": { "s3:prefix": [ "_delta_log/" ] } } }
I was able to query a Delta table created by Glue crawlers after updating the location. In your case it would need to be changed from:
s3://COMPANY/club/attachment/_symlink_format_manifest
to
s3://COMPANY/club/attachment
This is because Delta on Spark doesn't look at _symlink_format_manifest the way Hive and Presto do; it just needs to know the root directory.
The command in Databricks to update the location is something like this:
ALTER table my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"
Note: your database location has to be set as well in order for that command to work
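For illustration, a minimal Spark SQL sketch of setting the database location at creation time; the database name is a placeholder and the S3 path is taken from the example above:
-- Hypothetical: create the database with an explicit S3 location
CREATE DATABASE IF NOT EXISTS my_db LOCATION 's3://COMPANY/club/';

-- Check the location the catalog has recorded
DESCRIBE DATABASE EXTENDED my_db;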
We are streaming data from the Kafka-enabled Event Hub. The records may have a nested structure. The schema is inferred dynamically from the data, and the Delta table is created with the schema of the first incoming batch of data.
Note: the data read from the Kafka topic is a whole JSON string. Hence:
When we apply a schema and convert to a DataFrame, we lose the values of fields with mismatched datatypes, as well as newly added fields.
When we use spark.read.json, all field values are converted to String.
We encounter situations where the source data has schema changes. Some of the scenarios we have faced are:
The datatype changes at the parent level
The datatype changes at the nested level
There are duplicate keys with different casing
New fields are added
A sample of the source data with the actual schema:
{
"Id": "101",
"Name": "John",
"Department": {
"Id": "Dept101",
"Name": "Technology",
"EmpId": "10001"
},
"Experience": 2,
"Organization": [
{
"Id": "Org101",
"Name": "Google"
},
{
"Id": "Org102",
"Name": "Microsoft"
}
]
}
A sample of the source data illustrating the 4 points mentioned above:
{
"Id": "102",
"name": "Rambo", --- Point 3
"Department": {
"Id": "Dept101",
"Name": "Technology",
"EmpId": 10001 ---- Point 2
},
"Experience": "2", --- Point 1
"Organization": [
{
"Id": "Org101",
"Name": "Google",
"Experience": "2", --- Point 4
},
{
"Id": "Org102",
"Name": "Microsoft",
"Experience": "2",
}
]
}
We need a solution to overcome the above issues. Even though it is difficult to apply the new schema to the existing Delta table, we should at least be able to separate the records with schema changes without losing the original data.
We need your suggestions for developing the code in Azure Synapse.
We have a requirement where our jobs run in parallel at the same time and insert data into the same table.
During these inserts there is a chance that duplicate entries will be inserted into the same table.
For example: if Job A and Job B run at the same time, both with the same values, then "not exists" or "not in" will fail to work, and I will get duplicates from both jobs. Primary key and unique constraints are not enforced in Azure Synapse, so they allow duplicates. Is there a good way to lock tables during the insert, so that if Job A is running, Job B does not insert data into the same table? Please share your suggestions, as I am new to this. Note: we load the data through stored procedures called from ADF V2.
Thanks,
Nandini
Duplicates must be handled within the jobs before inserting data into Azure Synapse. If the duplicates exist between two jobs, then handle them after both jobs complete. It really depends on how you are loading the data. You can manage this easily by loading into a staging table instead of loading data directly into the final table. Make sure the structure of the staging table matches the final table (distribution, partitioning, constraints, nullability of the columns). You can use BCP, INSERT INTO, CTAS, or CTAS with partition switching to move data from the stage table to the final table.
If you can share your specific scenario, it will be easier to give suggestions relevant to your use case.
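As a rough illustration of the stage-then-load pattern described above (the table names, key column, and distribution choices here are hypothetical and would need to be adapted to your schema):
-- Land the incoming batch in a staging table with the same structure as the target (CTAS)
CREATE TABLE dbo.stage_sales
WITH (DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM ext.sales_incoming;  -- hypothetical source of the batch loaded by the job

-- Move only rows that are not already present in the final table
INSERT INTO dbo.sales (sale_id, sale_date, amount)
SELECT s.sale_id, s.sale_date, s.amount
FROM dbo.stage_sales AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.sales AS t WHERE t.sale_id = s.sale_id);

DROP TABLE dbo.stage_sales;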
I just had the same case and solved it with the Pipeline Runs - Query By Factory API.
Use an Until activity before the Data Flow activity that writes the values to the table, with the expression @equals(activity('pingPL').output.value[0].runId, pipeline().RunId), as follows:
Inside the Until activity, put a Web activity and a Wait activity:
a. Web activity body (following the docs):
{
"lastUpdatedAfter": "#{addminutes(utcnow(), -30)}",
"lastUpdatedBefore": "#{utcnow()}",
"filters": [
{
"operand": "PipelineName",
"operator": "Equals",
"values": [
"pipeline_name_where_writeInSynapse_is_located"
]
},
{
"operand": "Status",
"operator": "Equals",
"values": [
"InProgress"
]
}
]
}
b. Wait activity: 30 seconds, or whatever makes sense.
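Putting the pieces together, a rough sketch of how the Until activity could look in the pipeline JSON. The activity names (pingPL, Wait30), the placeholder values in angle brackets, the MSI authentication choice, and the abbreviated body are all assumptions to adapt to your factory:
{
"name": "WaitForEarlierRuns",
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals(activity('pingPL').output.value[0].runId, pipeline().RunId)",
"type": "Expression"
},
"activities": [
{
"name": "pingPL",
"type": "WebActivity",
"typeProperties": {
"url": "https://management.azure.com/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.DataFactory/factories/<factoryName>/queryPipelineRuns?api-version=2018-06-01",
"method": "POST",
"body": "<the filter body from step a>",
"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}
}
},
{
"name": "Wait30",
"type": "Wait",
"dependsOn": [
{
"activity": "pingPL",
"dependencyConditions": [ "Succeeded" ]
}
],
"typeProperties": {
"waitTimeInSeconds": 30
}
}
]
}
}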
What happens is this: if you trigger the same pipeline several times in parallel, the Web activity filters for the runs of that pipeline whose status is InProgress. The result will look like this:
{
"value": [
{
"id": "...",
"runId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"debugRunId": null,
"runGroupId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "80efce4dbda74636878bc99472978ccf",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:24:01.0210945Z",
"runEnd": "2021-10-13T17:25:06.9692394Z",
"durationInMs": 65948,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:25:06.9704432Z",
"annotations": [],
"runDimension": {},
"isLatest": true
},
{
"id": "...",
"runId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"debugRunId": null,
"runGroupId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "08205e0eda0b41f6b5a90a8dda06a7f6",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:28:58.219611Z",
"runEnd": null,
"durationInMs": null,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:29:00.9860175Z",
"annotations": [],
"runDimension": {},
"isLatest": true
}
],
"ADFWebActivityResponseHeaders": {
"Pragma": "no-cache",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
"X-Content-Type-Options": "nosniff",
"x-ms-ratelimit-remaining-subscription-reads": "11999",
"x-ms-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-correlation-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-routing-request-id": "WESTUS2:20211013T172902Z:188508ef-8897-4c21-8c37-ccdd4adc6d81",
"Cache-Control": "no-cache",
"Date": "Wed, 13 Oct 2021 17:29:02 GMT",
"Server": "Microsoft-IIS/10.0",
"X-Powered-By": "ASP.NET",
"Content-Length": "1492",
"Content-Type": "application/json; charset=utf-8",
"Expires": "-1"
},
"effectiveIntegrationRuntime": "NCAP-Simple-DataMovement (West US 2)",
"executionDuration": 0,
"durationInQueue": {
"integrationRuntimeQueue": 0
},
"billingReference": {
"activityType": "ExternalActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "Hours"
}
]
}
}
The Until expression then checks whether the first element, value[0], has runId == pipeline().RunId; when it does, the Until activity stops and the Data Flow that writes to Synapse runs. Once that pipeline run ends, its status becomes Succeeded, so the Web activity in another run will see the next value[0] with status InProgress and continue with the next write. This creates a dependency between the parallel jobs: each one waits until the Data Flow has validated and written to the table before the next proceeds.
I need to sink a JSON object from MongoDB into a column in Cassandra. I'm using ExtractNewDocumentState and AvroConverter, but it seems I'm doing something wrong. Also, is the AvroConverter used on the source or the sink side? If I use it in the source, do I have to use it in the sink too?
{
"name": "mongodb_source_connector",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max": "1",
"mongodb.hosts": "rs0/mongo:27017",
"mongodb.name": "dbserver1",
"mongodb.user": "scorpion",
"mongodb.password": "123123123",
"database.whitelist": "ladiform",
"database.history.kafka.bootstrap.servers": "kafka_3:9093",
"transforms": "route,unwrap",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.unwrap.type": "io.debezium.connector.mongodb.transforms.ExtractNewDocumentState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "drop",
"transforms.unwrap.operation.header": "true",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081"
}
}
{
"name": "cassandra_sink_connector",
"config": {
"connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
"tasks.max": "1",
"topics": "test34",
"connect.cassandra.port": "9042",
"connect.cassandra.key.space": "test",
"connect.cassandra.contact.points": "cassandra",
"connect.cassandra.username": "cassandra",
"connect.cassandra.password": "123123123",
"connect.cassandra.kcql": "INSERT INTO test33 SELECT id, data FROM test34"
}
}
The converters must be the same on both the source and the sink side, since serialization is done on the source side and deserialization on the sink side.
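For example, a minimal sketch of the converter settings one would add to the Cassandra sink connector config so that it deserializes the Avro records written by the source; the Schema Registry URL is assumed to be the same http://schema-registry:8081 used in the source config:
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081"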
I am using io.confluent.connect.cassandra.CassandraSinkConnector for the Kafka Connect Cassandra sink.
I wanted to know if it is possible to auto-generate Cassandra tables from the Kafka topic using io.confluent.connect.cassandra.CassandraSinkConnector as the connector.
If it is possible, can you please suggest what configuration to set to enable this feature? I have tried all the configurations mentioned in the documentation, but I was not successful in creating a table.
This is the configuration I am using:
{
"name": "cassandra-test4",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "3",
"topics": "orders-topic2",
"cassandra.contact.points": "my_ip",
"cassandra.keyspace": "test_cas",
"cassandra.write.mode": "Insert",
"cassandra.table.manage.enabled": "true",
"cassandra.sink.route": "test_cas.orders",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter.schema.registry.url": "http://localhost:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"flush.size": "1",
"cassandra.keyspace.create.enabled": "true",
"name": "cassandra-test4"
},
"tasks": [
{
"connector": "cassandra-test4",
"task": 0
},
{
"connector": "cassandra-test4",
"task": 1
},
{
"connector": "cassandra-test4",
"task": 2
}
],
"type": null
}
This should be done by setting the cassandra.keyspace.create.enabled and cassandra.table.manage.enabled properties to true. See the documentation.
But be really careful: it is very easy to get schema disagreement in your cluster, and then you need additional steps to recover from it. It is better to pre-create the tables before starting the connector.
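For instance, a minimal CQL sketch of pre-creating the keyspace and table before starting the connector; the replication settings and the column list are purely illustrative and must match your cluster and the schema of the records on orders-topic2:
CREATE KEYSPACE IF NOT EXISTS test_cas
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Hypothetical columns; align them with the fields of the Avro records on the topic
CREATE TABLE IF NOT EXISTS test_cas.orders (
id text PRIMARY KEY,
product text,
quantity int,
price double
);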