I have avsc schema like below:
{
"name": "address",
"type": [
"null",
{
"type":"record",
"name":"Address",
"namespace":"com.data",
"fields":[
{
"name":"address",
"type":[ "null","com.data.Address"],
"default":null
}
]
}
],
"default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro")\
.option("avroSchema", jsonFormatSchema)\
.load("xxx.avro")
I got such exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute I use with spark-submit:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is a intended feature, you can take a look at the "issue" :
https://issues.apache.org/jira/browse/SPARK-25718
Related
I have a json file that has below's structure that I need to read in as pyspark dataframe.
I tried reading in using multiLine option but it doesn't seem to return more data than the columns and datatypes. What am I possibly doing wrong and how can I read in belows'structure?
df = spark.read.format("json").option("inferSchema", "true") \
.option("multiLine", "true") \
.load("/mnt/blob/input/jsonfile.json")
Structure:
[
[
{
"Key": "Val"
},
{
"Key": "Val"
}
],
[
{
"Key": "Val"
},
{
"Key": "Val"
}
]
]
I have a dataframe that I need to write to Kafka.
I have the avro schema defined, similar to this:
{
"namespace": "my.name.space",
"type": "record",
"name": "MyClass",
"fields": [
{"name": "id", "type": "string"},
{"name": "parameter1", "type": "string"},
{"name": "parameter2", "type": "string"},
...
]
}
and it's auto-generated to java bean. It's something similar to this:
public class MyClass extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
String id;
String parameter1;
String parameter2;
...
}
I found that to write in avro format there is only to_avro method that takes a column.
So my question, is there a way to force writing to Kafka in Avro format in this defined schema?
You can only do this when using Confluent. See https://aseigneurin.github.io/2018/08/02/kafka-tutorial-4-avro-and-schema-registry.html
I get a runtime Error for a sink cassandra connector, I try to pick data form kafka and store it in cassandra, You find the error stack below :
{
"name": "cassandraSinkConnector2",
"connector": {
"state": "RUNNING",
"worker_id": "localhost:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "localhost:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table. Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.\n\tat io.confluent.connect.cassandra.CassandraSinkTask.put(CassandraSinkTask.java:94)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)\n\t... 10 more\n"
}
],
"type": "sink"
}
I used the distributed configuration below for my connector:
{
"name": "cassandraSinkConnector2",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "1",
"topics": "appartenance_de",
"cassandra.contact.points": "localhost",
"cassandra.kcql": "INSERT INTO app_test SELECT * FROM app_de",
"cassandra.port": "9042",
"cassandra.keyspace": "dev_dkks",
"cassandra.username": "superuser",
"cassandra.password": "superuser",
"cassandra.write.mode": "upsert",
"value.converter.schemas.enable": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"transforms": "createKey,extractInt",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id",
"name": "cassandraSinkConnector2"
},
"tasks": [
{
"connector": "cassandraSinkConnector2",
"task": 0
}
],
"type": "sink"
Per my answer here, the error you're seeing is
org.apache.kafka.connect.errors.DataException:
Record with a null key was encountered. This connector requires that records from Kafka contain the keys for the Cassandra table.
Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.
I'd suggest using a Single Message Transform as suggested in the error to correctly key your data. You can see an example of doing this here and the documentation for the transform here.
Spark Dataframe Schema:
StructType(
[StructField("a", StringType(), False),
StructField("b", StringType(), True),
StructField("c" , BinaryType(), False),
StructField("d", ArrayType(StringType(), False), True),
StructField("e", TimestampType(), True)
])
When I write the data frame to parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and write to parquet using spark dataframe.
BigQuery Schema:
[
{
"type": "STRING",
"name": "a",
"mode": "REQUIRED"
},
{
"type": "STRING",
"name": "b",
"mode": "NULLABLE"
},
{
"type": "BYTES",
"name": "c",
"mode": "REQUIRED"
},
{
"fields": [
{
"fields": [
{
"type": "STRING",
"name": "element",
"mode": "NULLABLE"
}
],
"type": "RECORD",
"name": "list",
"mode": "REPEATED"
}
],
"type": "RECORD",
"name": "d",
"mode": "NULLABLE"
},
{
"type": "TIMESTAMP",
"name": "e",
"mode": "NULLABLE"
}
]
Is this something to do with the way spark writes or they way BigQuery reads parquet. Any idea how I can fix this?
This is due to the intermediate file format (parquet by default) that the spark-bigquery connector uses.
The connector first writes the data to parquet files, then loads them to BigQuery using BigQuery Insert API.
If you check the intermediate parquet schema using parquet-tools, you would find something like this the field d (ArrayType(StringType) in Spark)
optional group a (LIST) {
repeated group list {
optional binary element (STRING);
}
}
Now, if you were loading this parquet yourself in BigQuery using bq load or the BigQuery Insert API directly, you could be able to tell BQ to ignore the intermediate fields by enabling parquet_enable_list_inference
Unfortunately, I don't see how to enable this option when using the spark-bigquery connector!
As a workaround, you can try to use orc as the intermediate format.
df
.write
.format("bigquery")
.option("intermediateFormat", "orc")
I am trying to call partitionBy on a nested field like below:
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)
I get the below error when I run it. I do see the 'name' listed as the field in the below schema. Is there a different format to specify the column name which is nested?
java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))
This is my json file:
{
"name": "AssetName",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "EventName"
"id": "1234"
}
}
}
This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084
I had this issue as well and to work around it I was able to un-nest the columns on my dataset. My dataset was a little different than your dataset, but here is the strategy...
Original Json:
{
"name": "AssetName",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "EventName"
"id": "1234"
}
}
}
Modified Json:
{
"name": "AssetName",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data_type": "EventData",
"data_dataDetails_name" : "EventName",
"data_dataDetails_id": "1234"
}
}
Code to get to Modified Json:
def main(args: Array[String]) {
...
val data = df.select(children("data", df) ++ $"name" ++ $"time"): _*)
data.printSchema
data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}
def children(colname: String, df: DataFrame) = {
val parent = df.schema.fields.filter(_.name == colname).head
val fields = parent.dataType match {
case x: StructType => x.fields
case _ => Array.empty[StructField]
}
fields.map(x => col(s"$colname.${x.name}").alias(s"$colname" + s"_" + s"${x.name}"))
}
Since the feature is un-available as of Spark 2.3.1, here's a workaround. Make sure to handle name conflicts between the nested fields and the fields at the root level.
{"date":"20180808","value":{"group":"xxx","team":"yyy"}}
df.select("date","value.group","value.team")
.write
.partitionBy("date","group","team")
.parquet(filenameParquet)
The partitions end up like
date=20180808/group=xxx/team=yyy/part-xxx.parquet