I am trying to write a change data capture (CDC) stream into Azure Event Hubs like this:
df = spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .table("cdc_test1")
When writing to Azure Event Hubs, the connector expects the content in a body attribute:
df.writeStream.format("eventhubs") \
    .option("checkpointLocation", checkpointLocation) \
    .outputMode("append") \
    .options(**ehConf) \
    .start()
This throws the following exception:
org.apache.spark.sql.AnalysisException: Required attribute 'body' not found.
at org.apache.spark.sql.eventhubs.EventHubsWriter$.$anonfun$validateQuery$2(EventHubsWriter.scala:53)
I am not sure how to wrap the whole stream into a body. I think I need another stream whose body column carries each row of df (the original stream) as a string, but I have not been able to achieve this. Please help!
You just need to create this column using the struct function (to encode all columns as one object) and something like to_json (to produce a single value from that object; you could also use to_csv or to_avro, but that depends on the contract with your consumers). The code could look like the following:
import pyspark.sql.functions as F

df.select(F.to_json(F.struct("*")).alias("body")) \
    .writeStream.format("eventhubs") \
    .option("checkpointLocation", checkpointLocation) \
    .outputMode("append") \
    .options(**ehConf) \
    .start()
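Note that with readChangeFeed enabled the stream also carries the Delta change-feed metadata columns (_change_type, _commit_version, _commit_timestamp), and struct("*") folds those into the JSON body as well. If the consumers should only see selected columns, list them explicitly; a minimal sketch, where id and value are hypothetical stand-ins for the real columns of cdc_test1:

import pyspark.sql.functions as F

# Keep only the columns the consumers expect plus the change type;
# 'id' and 'value' are hypothetical columns of cdc_test1.
payload = df.select(
    F.to_json(F.struct("id", "value", "_change_type")).alias("body")
)

payload.writeStream.format("eventhubs") \
    .option("checkpointLocation", checkpointLocation) \
    .outputMode("append") \
    .options(**ehConf) \
    .start()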
I am reading a sequence file of JSON records from HDFS using Spark like this:
val data = spark.read.json(
  spark.sparkContext
    .sequenceFile[String, String]("/prod/data/class1/20190114/2019011413/class2/part-*")
    .map { case (x, y) => y.toString })

data.registerTempTable("data")
val filteredData = data.filter("sourceInfo='Web'")
val explodedData = filteredData.withColumn("A", explode(filteredData("payload.adCsm.vfrd")))
val explodedDataDbg = explodedData.withColumn("B", explode(filteredData("payload.adCsm.dbg"))).drop("payload")
Running this, I get the following error:
org.apache.spark.sql.AnalysisException:
Ambiguous reference to fields StructField(adCsm,ArrayType(StructType(StructField(atfComp,StringType,true), StructField(csmTot,StringType,true), StructField(dbc,ArrayType(LongType,true),true), StructField(dbcx,LongType,true), StructField(dbg,StringType,true), StructField(dbv,LongType,true), StructField(fv,LongType,true), StructField(hdr,LongType,true), StructField(hidden,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(hvrx,DoubleType,true), StructField(hvry,DoubleType,true), StructField(inf,StringType,true), StructField(isP,LongType,true), StructField(ltav,StringType,true), StructField(ltdb,StringType,true), StructField(ltdm,StringType,true), StructField(lteu,StringType,true), StructField(ltfm,StringType,true), StructField(ltfs,StringType,true), StructField(lths,StringType,true), StructField(ltpm,StringType,true), StructField(ltpq,StringType,true), StructField(ltts,StringType,true), StructField(ltut,StringType,true), StructField(ltvd,StringType,true), StructField(ltvv,StringType,true), StructField(msg,StringType,true), StructField(nl,LongType,true), StructField(prerender,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(pt,StringType,true), StructField(src,StringType,true), StructField(states,StringType,true), StructField(tdr,StringType,true), StructField(tld,StringType,true), StructField(trusted,BooleanType,true), StructField(tsc,LongType,true), StructField(tsd,DoubleType,true), StructField(tsz,DoubleType,true), StructField(type,StringType,true), StructField(unloaded,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(vdr,StringType,true), StructField(vfrd,LongType,true), StructField(visible,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(xpath,StringType,true)),true),true), StructField(adcsm,ArrayType(StructType(StructField(tdr,DoubleType,true), StructField(vdr,DoubleType,true)),true),true);
Not sure how, but ONLY SOMETIMES there are two structs with the same name "adCsm" inside "payload". Since I am interested in fields present in one of them, I need to deal with this ambiguity.
I know one way is to check for fields A and B and drop the column when those fields are absent, thereby choosing the other adCsm. I was wondering if there is a better way to handle this. Could I perhaps just merge the duplicate columns (with different data) instead of this explicit filtering?
I am also not sure how duplicate structs even end up in a seq "json" file.
TIA!
I think the ambiguity comes from a case-sensitivity issue in the DataFrame column names. In the last part of the schema I see:
StructField(adcsm,
ArrayType(StructType(
StructField(tdr,DoubleType,true),
StructField(vdr,DoubleType,true)),true),true)
So there are two StructFields whose names differ only in case (adCsm and adcsm) inside the same StructType.
First enable case sensitivity in Spark SQL:
sqlContext.sql("set spark.sql.caseSensitive=true")
Then Spark will resolve adCsm and adcsm as two distinct fields. Hopefully this helps.
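A minimal sketch of the fix applied to the pipeline from the question (assuming the camel-cased payload.adCsm holds the fields of interest):

import org.apache.spark.sql.functions.explode

// Equivalent to: sqlContext.sql("set spark.sql.caseSensitive=true")
spark.conf.set("spark.sql.caseSensitive", "true")

val filteredData = data.filter("sourceInfo='Web'")

// With case-sensitive analysis on, payload.adCsm now unambiguously
// refers to the camel-cased struct, not its lower-cased sibling.
val explodedData = filteredData
  .withColumn("A", explode(filteredData("payload.adCsm.vfrd")))
  .withColumn("B", explode(filteredData("payload.adCsm.dbg")))
  .drop("payload")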
I updated these lines of code to support spring-data-cassandra 2.0.7.RELEASE:
CassandraOperations cOps = new CassandraTemplate(session);
From:
Insert insertStatement = (Insert)statement;
CqlTemplate.addWriteOptions(insertStatement, queryWriteOptions);
cOps.execute(insertStatement);
To:
Insert insertStatement = (Insert) statement;
insertStatement = QueryOptionsUtil.addWriteOptions(insertStatement, queryWriteOptions);
cOps.insert(insertStatement);
The above changes throw the error below:
Caused by: org.springframework.dao.InvalidDataAccessApiUsageException: Unknown type [interface com.datastax.driver.core.policies.RetryPolicy] for property [retryPolicy] in entity [com.datastax.driver.core.querybuilder.Insert]; only primitive types and Collections or Maps of primitive types are allowed
at org.springframework.data.cassandra.core.mapping.BasicCassandraPersistentProperty.getDataType(BasicCassandraPersistentProperty.java:170)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.lambda$null$10(CassandraMappingContext.java:552)
at java.util.Optional.orElseGet(Optional.java:267)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.lambda$getDataTypeWithUserTypeFactory$11(CassandraMappingContext.java:542)
at java.util.Optional.orElseGet(Optional.java:267)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.getDataTypeWithUserTypeFactory(CassandraMappingContext.java:527)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.getDataType(CassandraMappingContext.java:486)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.getPropertyTargetType(MappingCassandraConverter.java:689)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.lambda$getTargetType$0(MappingCassandraConverter.java:682)
at java.util.Optional.orElseGet(Optional.java:267)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.getTargetType(MappingCassandraConverter.java:670)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.getWriteValue(MappingCassandraConverter.java:711)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.writeInsertFromWrapper(MappingCassandraConverter.java:403)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.writeInsertFromObject(MappingCassandraConverter.java:360)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.write(MappingCassandraConverter.java:345)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.write(MappingCassandraConverter.java:320)
at org.springframework.data.cassandra.core.QueryUtils.createInsertQuery(QueryUtils.java:78)
at org.springframework.data.cassandra.core.CassandraTemplate.insert(CassandraTemplate.java:442)
at org.springframework.data.cassandra.core.CassandraTemplate.insert(CassandraTemplate.java:430)
The query passed as input is of type com.datastax.driver.core.querybuilder.Insert, containing:
INSERT INTO person (name,id,age) VALUES ('name01','123',23) USING TIMESTAMP 1528922717378000 AND TTL 60;
QueryOptions containing a RetryPolicy and a consistency level are passed as well.
I followed the documentation, but the above changes do not work. Can anyone tell me what is wrong here?
I'm using spring-data-cassandra 2.0.7.RELEASE with Cassandra driver 3.5.0.
I was able to make it work with the change below. CassandraTemplate.insert(...) treats its argument as a mapped entity, so the converter was trying to map the Insert statement's own bean properties (hence the complaint about retryPolicy); a raw driver statement has to be executed through CqlOperations instead:
cOps.getCqlOperations().execute(insertStatement);
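For reference, a minimal sketch of this route end to end under driver 3.5.0 (the values and USING clause mirror the CQL from the question; QUORUM and the retry policy are assumptions, not something the question mandates):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.querybuilder.Insert;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import org.springframework.data.cassandra.core.CassandraOperations;
import org.springframework.data.cassandra.core.CassandraTemplate;

// Build the INSERT with the driver's query builder.
Insert insertStatement = QueryBuilder.insertInto("person")
        .value("name", "name01")
        .value("id", "123")
        .value("age", 23);
insertStatement.using(QueryBuilder.timestamp(1528922717378000L)).and(QueryBuilder.ttl(60));

// Consistency level and retry policy belong on the driver Statement itself,
// not on anything the Spring Data mapping layer sees.
insertStatement.setConsistencyLevel(ConsistencyLevel.QUORUM); // assumption: QUORUM
insertStatement.setRetryPolicy(DefaultRetryPolicy.INSTANCE);  // assumption

CassandraOperations cOps = new CassandraTemplate(session);

// Raw statements bypass the entity mapper via CqlOperations.
cOps.getCqlOperations().execute(insertStatement);

// Reading the level back off the statement confirms what was set.
System.out.println(insertStatement.getConsistencyLevel()); // prints QUORUM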
How can I check whether the consistency level actually got applied?
For me, this works:
batchOps.insert(ImmutableSet.of(entity), insertOptions);
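Here insertOptions would be built along these lines (a sketch against spring-data-cassandra 2.0's InsertOptions builder; Person is a hypothetical mapped entity class, and QUORUM is again an assumed level):

import com.datastax.driver.core.ConsistencyLevel;
import com.google.common.collect.ImmutableSet;
import org.springframework.data.cassandra.core.CassandraBatchOperations;
import org.springframework.data.cassandra.core.InsertOptions;

// TTL and consistency level travel with the options object, so the
// insert itself stays a plain mapped-entity operation.
InsertOptions insertOptions = InsertOptions.builder()
        .ttl(60)                                   // assumption: TTL from the question
        .consistencyLevel(ConsistencyLevel.QUORUM) // assumption: QUORUM
        .build();

Person entity = new Person("123", "name01", 23);   // hypothetical entity

CassandraBatchOperations batchOps = cOps.batchOps();
batchOps.insert(ImmutableSet.of(entity), insertOptions);
batchOps.execute();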