Not able to use useBeamSchema() for automatically converting a PCollection to a table row schema

// The following code reads a file from a GCS bucket, transforms it, and writes the result to BigQuery.
PCollection<Quote> quotes = ... // get the transformed data
quotes.apply(BigQueryIO
    .<Quote>write()
    .to("my-project:my_dataset.my_table")
    .useBeamSchema()
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to infer a coder and no Coder was specified. Please set a coder by invoking Create.withCoder() explicitly or a schema by invoking Create.withSchema().

I think you have to set a schema on your PCollection. Please see the example below.
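A minimal sketch of what that can look like, assuming Quote is a simple POJO with public fields (the class shown here is hypothetical):

import java.io.Serializable;
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Hypothetical Quote POJO: @DefaultSchema(JavaFieldSchema.class) tells Beam to infer a
// schema (and a coder) from the public fields, so useBeamSchema() can derive the
// BigQuery table schema from it.
@DefaultSchema(JavaFieldSchema.class)
public class Quote implements Serializable {
    public String source;
    public String quote;

    public Quote() {}

    public Quote(String source, String quote) {
        this.source = source;
        this.quote = quote;
    }
}

The "Unable to infer a coder" error suggests the PCollection is created before any schema is known; once the element type carries a schema (via the annotation above, or by supplying one explicitly with Create.withSchema(...), as the error message suggests), both the Create step and useBeamSchema() should work.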

Related

Event Hub: org.apache.spark.sql.AnalysisException: Required attribute 'body' not found

I am trying to write change data capture into Event Hubs as:
df = spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .table("cdc_test1")
While writing to Azure Event Hubs, it expects the content to be in a body attribute:
df.writeStream.format("eventhubs") \
    .option("checkpointLocation", checkpointLocation) \
    .outputMode("append") \
    .options(**ehConf).start()
It gives this exception:
org.apache.spark.sql.AnalysisException: Required attribute 'body' not found.
at org.apache.spark.sql.eventhubs.EventHubsWriter$.$anonfun$validateQuery$2(EventHubsWriter.scala:53)
I am not sure how to wrap the whole stream into a body. I think I need another stream object that has a body column whose value is the original stream (df) serialized as a string, but I have not been able to achieve this. Please help!
You just need to create this column by using the struct function (to encode all columns as one object) and something like to_json (to create a single value from that object; you can also use other functions such as to_csv or to_avro, but it depends on the contract with your consumers). The code could look like the following:
from pyspark.sql import functions as F

df.select(F.to_json(F.struct("*")).alias("body")) \
    .writeStream.format("eventhubs") \
    .option("checkpointLocation", checkpointLocation) \
    .outputMode("append") \
    .options(**ehConf).start()

Merging duplicate columns in seq JSON HDFS files in Spark

I am reading a seq JSON file from HDFS using Spark like this:
val data = spark.read.json(spark.sparkContext.sequenceFile[String, String]("/prod/data/class1/20190114/2019011413/class2/part-*").map {
  case (x, y) => y.toString
})
data.registerTempTable("data")
val filteredData = data.filter("sourceInfo='Web'")
val explodedData = filteredData.withColumn("A", explode(filteredData("payload.adCsm.vfrd")))
val explodedDataDbg = explodedData.withColumn("B", explode(filteredData("payload.adCsm.dbg"))).drop("payload")
This gives me the following error:
org.apache.spark.sql.AnalysisException:
Ambiguous reference to fields StructField(adCsm,ArrayType(StructType(StructField(atfComp,StringType,true), StructField(csmTot,StringType,true), StructField(dbc,ArrayType(LongType,true),true), StructField(dbcx,LongType,true), StructField(dbg,StringType,true), StructField(dbv,LongType,true), StructField(fv,LongType,true), StructField(hdr,LongType,true), StructField(hidden,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(hvrx,DoubleType,true), StructField(hvry,DoubleType,true), StructField(inf,StringType,true), StructField(isP,LongType,true), StructField(ltav,StringType,true), StructField(ltdb,StringType,true), StructField(ltdm,StringType,true), StructField(lteu,StringType,true), StructField(ltfm,StringType,true), StructField(ltfs,StringType,true), StructField(lths,StringType,true), StructField(ltpm,StringType,true), StructField(ltpq,StringType,true), StructField(ltts,StringType,true), StructField(ltut,StringType,true), StructField(ltvd,StringType,true), StructField(ltvv,StringType,true), StructField(msg,StringType,true), StructField(nl,LongType,true), StructField(prerender,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(pt,StringType,true), StructField(src,StringType,true), StructField(states,StringType,true), StructField(tdr,StringType,true), StructField(tld,StringType,true), StructField(trusted,BooleanType,true), StructField(tsc,LongType,true), StructField(tsd,DoubleType,true), StructField(tsz,DoubleType,true), StructField(type,StringType,true), StructField(unloaded,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(vdr,StringType,true), StructField(vfrd,LongType,true), StructField(visible,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(xpath,StringType,true)),true),true), StructField(adcsm,ArrayType(StructType(StructField(tdr,DoubleType,true), StructField(vdr,DoubleType,true)),true),true);
Not sure how, but ONLY SOMETIMES there are two structs with the same name "adCsm" inside "payload". Since I am interested in fields present in one of them, I need to deal with this ambiguity.
I know one way is to check for fields A and B and drop the column if they are absent, and hence choose the other adCsm. I was wondering if there is a better way to handle this. Can I just merge the duplicate columns (with different data) instead of this explicit filtering?
I am also not sure how duplicate structs can even be present in a seq "json" file.
TIA!
I think the ambiguity happens due to a case-sensitivity issue in the Spark DataFrame column names. In the last part of the schema I see
StructField(adcsm,
ArrayType(StructType(
StructField(tdr,DoubleType,true),
StructField(vdr,DoubleType,true)),true),true)
So there are two StructFields whose names differ only in case (adCsm and adcsm) inside the payload StructType.
First enable case sensitivity in Spark SQL with
sqlContext.sql("set spark.sql.caseSensitive=true")
then it will differentiate the two fields. Hopefully this helps.

Spark: Read multiple AVRO files with different schema in parallel

I have many (relatively small) AVRO files with different schemas, each set in its own location like this:
Object Name: A
/path/to/A
A_1.avro
A_2.avro
...
A_N.avro
Object Name: B
/path/to/B
B_1.avro
B_2.avro
...
B_N.avro
Object Name: C
/path/to/C
C_1.avro
C_2.avro
...
C_N.avro
...
and my goal is to read them in parallel via Spark and store each row as a blob in one column of the output. As a result my output data will have a consistent schema, something like the following columns:
ID, objectName, RecordDate, Data
Here the 'Data' field contains the original record as a JSON string.
My initial thought was to put the spark read statements in a loop, create the fields shown above for each dataframe, and then apply a union operation to get my final dataframe, like this:
all_df = []
for obj_name in all_object_names:
    file_path = get_file_path(obj_name)
    df = spark.read.format(DATABRIKS_FORMAT).load(file_path)
    all_df.append(df)

df_o = all_df[0]
for df in all_df[1:]:
    df_o = df_o.union(df)
# write df_o to the output
However I'm not sure if the read operations are going to be parallelized.
I also came across the sc.textFile() function to read all the AVRO files in one shot as string, but couldn't make it work.
So I have two questions:
1. Would the multiple read statements in a loop be parallelized by Spark? Or is there a more efficient way to achieve this?
2. Can sc.textFile() be used to read the AVRO files as JSON strings in one column?
I'd appreciate your thoughts and suggestions.

Getting "Unknown type for RetryPolicy" error after upgrading Spring Data Cassandra to the latest 2.0.7.RELEASE

I updated these lines of code to support spring-data-cassandra 2.0.7.RELEASE:
CassandraOperations cOps = new CassandraTemplate(session);
From:
Insert insertStatement = (Insert)statement;
CqlTemplate.addWriteOptions(insertStatement, queryWriteOptions);
cOps.execute(insertStatement);
To:
Insert insertStatement = (Insert)statement;
insertStatement = QueryOptionsUtil.addWriteOptions(insertStatement, queryWriteOptions);
cOps.insert(insertStatement);
The above changes are throwing the error below:
Caused by: org.springframework.dao.InvalidDataAccessApiUsageException: Unknown type [interface com.datastax.driver.core.policies.RetryPolicy] for property [retryPolicy] in entity [com.datastax.driver.core.querybuilder.Insert]; only primitive types and Collections or Maps of primitive types are allowed
at org.springframework.data.cassandra.core.mapping.BasicCassandraPersistentProperty.getDataType(BasicCassandraPersistentProperty.java:170)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.lambda$null$10(CassandraMappingContext.java:552)
at java.util.Optional.orElseGet(Optional.java:267)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.lambda$getDataTypeWithUserTypeFactory$11(CassandraMappingContext.java:542)
at java.util.Optional.orElseGet(Optional.java:267)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.getDataTypeWithUserTypeFactory(CassandraMappingContext.java:527)
at org.springframework.data.cassandra.core.mapping.CassandraMappingContext.getDataType(CassandraMappingContext.java:486)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.getPropertyTargetType(MappingCassandraConverter.java:689)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.lambda$getTargetType$0(MappingCassandraConverter.java:682)
at java.util.Optional.orElseGet(Optional.java:267)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.getTargetType(MappingCassandraConverter.java:670)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.getWriteValue(MappingCassandraConverter.java:711)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.writeInsertFromWrapper(MappingCassandraConverter.java:403)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.writeInsertFromObject(MappingCassandraConverter.java:360)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.write(MappingCassandraConverter.java:345)
at org.springframework.data.cassandra.core.convert.MappingCassandraConverter.write(MappingCassandraConverter.java:320)
at org.springframework.data.cassandra.core.QueryUtils.createInsertQuery(QueryUtils.java:78)
at org.springframework.data.cassandra.core.CassandraTemplate.insert(CassandraTemplate.java:442)
at org.springframework.data.cassandra.core.CassandraTemplate.insert(CassandraTemplate.java:430)
The query that is passed as input is of type com.datastax.driver.core.querybuilder.Insert and contains:
INSERT INTO person (name,id,age) VALUES ('name01','123',23) USING TIMESTAMP 1528922717378000 AND TTL 60;
The QueryOptions passed contain a RetryPolicy and a consistency level.
The changes above follow the documentation but are not working. Can anyone let me know what is wrong here?
I'm using Spring Data Cassandra 2.0.7.RELEASE with Cassandra driver 3.5.0.
I was able to make it work with the change below:
cOps.getCqlOperations().execute(insertStatement);
How can I check whether the consistency level got applied?
For me, this works:
batchOps.insert(ImmutableSet.of(entity), insertOptions);
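On the follow-up question about verifying the consistency level: one option is to inspect the driver statement after the write options have been applied. A rough sketch, assuming spring-data-cassandra 2.0.x (where WriteOptions has a builder that accepts the driver's ConsistencyLevel) and the same QueryOptionsUtil call as in the question:

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.querybuilder.Insert;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import org.springframework.data.cassandra.core.cql.QueryOptionsUtil;
import org.springframework.data.cassandra.core.cql.WriteOptions;

// Build write options with an explicit consistency level and TTL (values taken from the question).
WriteOptions writeOptions = WriteOptions.builder()
        .consistencyLevel(ConsistencyLevel.QUORUM)
        .ttl(60)
        .build();

// Apply them to the driver statement, then check what was actually set on it.
Insert insertStatement = QueryBuilder.insertInto("person")
        .value("name", "name01")
        .value("id", "123")
        .value("age", 23);
insertStatement = QueryOptionsUtil.addWriteOptions(insertStatement, writeOptions);
System.out.println(insertStatement.getConsistencyLevel()); // expect QUORUM here

cOps.getCqlOperations().execute(insertStatement);

Note that getConsistencyLevel() only shows what was set on the statement itself; it does not prove what the server actually used.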

How to read the schema of a keyspace using Java?

I want to read the schema of a keyspace in Cassandra.
I know that in cassandra-cli we can execute the following command to get the schema:
show schema keyspace1;
But I want to read the schema from a remote machine using Java.
How can I solve this? Please help me.
I solved this one using the Thrift client:
KsDef keyspaceDefinition = _client.describe_keyspace(_keyspace);
List<CfDef> columnDefinition = keyspaceDefinition.getCf_defs();
Here the keyspace definition contains the whole schema details, so from that KsDef we can read whatever we want. In my case I want to read metadata, so I read the column metadata from the above column definitions as shown below.
for (int i = 0; i < columnDefinition.size(); i++) {
    List<ColumnDef> columnMetadata = columnDefinition.get(i).getColumn_metadata();
    for (int j = 0; j < columnMetadata.size(); j++) {
        columnfamilyNames.add(columnDefinition.get(i).getName());
        columnNames.add(new String(columnMetadata.get(j).getName()));
        validationClasses.add(columnMetadata.get(j).getValidation_class());
        //ar.add(coldef.get(i).getName()+"\t"+bb_to_str(colmeta.get(j).getName())+"\t"+colmeta.get(j).getValidationClass());
    }
}
Here columnfamilyNames, columnNames, and validationClasses are ArrayLists.
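If you are not tied to Thrift, the same information is also exposed through the DataStax Java driver's cluster metadata. A small sketch, assuming driver 3.x and a reachable contact point (the host name below is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ColumnMetadata;
import com.datastax.driver.core.KeyspaceMetadata;
import com.datastax.driver.core.TableMetadata;

// Connect to the remote node and read the keyspace schema from the driver's metadata.
try (Cluster cluster = Cluster.builder().addContactPoint("remote-host").build()) {
    KeyspaceMetadata keyspace = cluster.getMetadata().getKeyspace("keyspace1");
    for (TableMetadata table : keyspace.getTables()) {
        for (ColumnMetadata column : table.getColumns()) {
            System.out.println(table.getName() + "\t" + column.getName() + "\t" + column.getType());
        }
    }
    // keyspace.exportAsString() returns the full CREATE statements for the keyspace.
}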
