Databricks Autoloader throws IllegalArgumentException - databricks

I'm trying the simplest Auto Loader example from the Databricks website:
https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(input_data_path))
(df.writeStream.format("delta")
   .option("checkpointLocation", chkpt_path)
   .table("iot_stream"))
I keep getting this message:
IllegalArgumentException: cloudFiles.schemaLocation Could not find required option: schemaLocation. Please provide a schema location using cloudFiles.schemaLocation for storing inferred schema and supporting schema evolution.
If providing cloudFiles.schemaLocation is required, why is it missing from the examples everywhere? What's the underlying issue here?

I suspect what is going on is that you are not explicitly setting .option("cloudFiles.schemaEvolutionMode"), which means it falls back to the default, "addNewColumns", as per https://docs.databricks.com/ingestion/auto-loader/options.html.
That mode requires you to set .option("cloudFiles.schemaLocation", path) in the reader.
Thus you are inadvertently requiring it without setting it.
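As a rough sketch, adding the option to your reader should clear the error (the "/schema" sub-directory below is just an arbitrary choice of persistent path for the inferred schema, not something the docs mandate):
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      # any durable path works; keeping it next to the checkpoint is one common choice
      .option("cloudFiles.schemaLocation", chkpt_path + "/schema")
      .load(input_data_path))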

Related

spark GroupBy throws StateSchemaNotCompatible exception with different "Existing key schema"

I am reading events from Event Hubs in Spark and writing them back after aggregating on a few keys, like this:
val df1 = df0
  .groupBy(
    colKey,
    colTimestamp
  )
  .agg(
    collect_list(
      struct(
        colCreationTimestamp,
        colRecordId
      )
    ).as("Records")
  )
But I am getting this error at runtime:
Error
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesn't contain the exact line number to reference in my code, so I narrowed it down to this code based on the provided key schema columns; also, if I change the groupBy key columns, the error changes accordingly.
I tried different things, like an explicit df0.select() of the required columns before the groupBy to ensure that the incoming data had the given columns, but got the same error.
Can someone suggest how it is picking up the "Existing key schema", or what I should look for to resolve this?
Update [solved for me]
While writing the records to Event Hubs, the Event Hubs Spark library stores streaming state in the checkpoint directory. Mine still contained old state, which was causing the StateSchemaNotCompatible issue; pointing the query to a new checkpoint directory solved it for me.
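For anyone hitting the same thing, the fix boils down to restarting the query with a fresh, empty checkpoint directory. A minimal PySpark-style sketch (the sink, output mode, and paths below are placeholders, not taken from the question, and agg_df stands for the aggregated streaming DataFrame):
query = (agg_df.writeStream
         .format("console")                                        # placeholder sink
         .outputMode("complete")                                   # streaming aggregations need update/complete
         .option("checkpointLocation", "/tmp/checkpoints/agg_v2")  # brand-new, empty directory
         .start())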

How to store a set of tuples into Cassandra using the DataStax driver

I'm trying to run my service with Micronaut and Cassandra (currently version 3.11.10) and store a column that is a set of tuples into Cassandra.
example code:
QueryBuilder
  .insertInto(table)
  .value("column", QueryBuilder.literal(items.map { it.toTuple() }.toSet()))
The toTuple() method is just an extension method that converts the items into Term objects.
When I'm doing so I'm receiving the following error:
Internal Server Error: Could not inline literal of type java.util.Collections$SingletonSet. This happens because the driver doesn't know how to map it to a CQL type. Try passing a TypeCodec or CodecRegistry to literal().
I've checked multiple sources online but couldn't find a simple way to store a set of tuples into the database without implementing my own custom TypeCodec. Since I'm surely not the first person to have this issue, I'm probably doing something completely wrong; however, I couldn't find any documentation on the correct way of doing this.

Spark saveAsTable with location at S3 bucket's root causes NullPointerException

I am working with Spark 3.0.1 and my partitioned table is stored in S3. Please find the description of the issue below.
Create Table
CREATE TABLE root_table_test_spark_3_0_1 (
id string,
name string
)
USING PARQUET
PARTITIONED BY (id)
LOCATION 's3a://MY_BUCKET_NAME/'
Code that is causing the NullPointerException on the second run
Seq(MinimalObject("id_1", "name_1"), MinimalObject("id_2", "name_2"))
  .toDS()
  .write
  .partitionBy("id")
  .mode(SaveMode.Append)
  .saveAsTable("root_table_test_spark_3_0_1")
When the Hive metastore is empty everything is OK, but the issue happens when Spark calls getCustomPartitionLocations in the InsertIntoHadoopFsRelationCommand phase (on the second run, for example).
Indeed, it calls the method below from org.apache.hadoop.fs.Path:
/** Adds a suffix to the final name in the path. */
public Path suffix(String suffix) {
  return new Path(getParent(), getName() + suffix);
}
But getParent() returns null when we are at the root, resulting in a NullPointerException. The only option I can think of at the moment is to override this method to do something like:
/** Adds a suffix to the final name in the path. */
public Path suffix(String suffix) {
  return isRoot() ? new Path(uri.getScheme(), uri.getAuthority(), suffix)
                  : new Path(getParent(), getName() + suffix);
}
Has anyone had issues when the LOCATION of a Spark Hive table is at the root level? Any workaround? Are there any known issues open for this?
My runtime does not allow me to override the Path class and fix the suffix method, and I can't move my data out of the bucket's root, as it has been there for two years now.
The issue happens because I'm migrating from Spark 2.1.0 to Spark 3.0.1, and the behavior of checking the custom partition locations appeared in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).
This whole context helps to understand the problem, but basically you can reproduce it easily with:
val path: Path = new Path("s3a://MY_BUCKET_NAME/")
println(path.suffix("/id=id"))
FYI, the hadoop-common version is 2.7.4; please find the full stack trace below.
NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:104)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Path.suffix(Path.java:361)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:262)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations(InsertIntoHadoopFsRelationCommand.scala:260)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:107)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:575)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:218)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:166)
Thanks
Looks like a situation where the Spark code calls Path.suffix("something") and, because the root path has no parent, an NPE is triggered.
Long-term fix:
1. File a JIRA on issues.apache.org against HADOOP and provide a patch with a test that fixes suffix() to degrade gracefully when called on the root path. Best for all.
2. Don't use the root path as the destination of a table (see the sketch at the end of this answer).
Do both of these.
Option #2 should also avoid other surprises about how tables are created/committed, etc. Some of the code may fail because an attempt to delete the root of the path (here s3a://some-bucket) won't actually delete the root, will it?
Put differently: root directories have "odd" semantics everywhere. Most of the time you don't notice this on a local FS because you never try to use / as a destination of work, and never get surprised that rm -rf / behaves differently from rm -rf /subdir, etc. Spark, Hive, etc. were never written to use / as a destination of work, so you get to see the failures.
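To illustrate option #2, here is a sketch of the same table declared with its LOCATION one level below the bucket root, so getParent() is never null when Spark builds partition paths (the tables/ prefix is just an arbitrary example path):
spark.sql("""
  CREATE TABLE root_table_test_spark_3_0_1 (
    id string,
    name string
  )
  USING PARQUET
  PARTITIONED BY (id)
  LOCATION 's3a://MY_BUCKET_NAME/tables/root_table_test_spark_3_0_1/'
""")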

Spark-Solr Connector trying to add already existing field with stored=true

I am using the Spark-Solr connector 3.4.0 with Solr Cloud version 7.6.0 in a Spark 2.2.1 cluster. We have an existing Solr collection with a predefined schema. Most of the fields have the stored parameter set to true, but there are certain fields where we explicitly set stored=false. When we try to push data to Solr using the spark-solr connector, we get the following error:
org.apache.solr.api.ApiBag$ExceptionWithErrObject: error processing commands, errors: [{add-field={name=taxonomy, indexed=true, multiValued=true, docValues=true, stored=true, type=string}, errorMessages=[Field 'item_id_channel' already exists.
]}],
at org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:92)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
The error says that item_id_channel already exists, but it is only raised for fields for which we have defined stored=false in the Solr schema. I get that the connector wishes to recreate the schema for some reason, but it sets the stored parameter to true, which clashes with the predefined schema definition in Solr for this field.
My question is: is there a way to tell the connector (probably through some option?) which value of stored we want for certain fields? And, more generally, a way to define other Solr parameters for the fields?
We found the issue that was causing the error. There was a bug in older versions of the spark-solr connector because of which the connector tried to add already-existing fields to the Solr schema when stored was true. This was fixed in the 3.5.5 release. Hence, once we upgraded our connector to version 3.5.14, the ingestion started working without any errors.

Logging Spark Configuration Properties

I'm trying to log the properties of each Spark application that runs in one YARN cluster (properties like spark.shuffle.compress, spark.reducer.maxMbInFlight, spark.executor.instances and so on).
However, I don't know if this information is logged anywhere. I know that we can access the YARN logs through the "yarn" command, but the properties I'm talking about are not stored there.
Is there any way to access this kind of info? The idea is to have a trace of all the applications that run in the cluster together with their properties, to identify which ones have the biggest impact on execution time.
You could log it yourself... use sc.getConf.toDebugString, sqlContext.getConf("<key>"), or sqlContext.getAllConfs.
scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res129: String = 200
scala> sqlContext.getAllConfs
res130: scala.collection.immutable.Map[String,String] = Map(hive.server2.thrift.http.cookie.is.httponly -> true, dfs.namenode.resource.check.interval ....
scala> sc.getConf.toDebugString
res132: String =
spark.app.id=local-1449607289874
spark.app.name=Spark shell
spark.driver.host=10.5.10.153
Edit: However, I could not find the properties you specified among the 1200+ properties in sqlContext.getAllConfs :( Otherwise the documentation says:
The application web UI at http://<driver>:4040 lists Spark properties in the "Environment" tab. This is a useful place to check to make sure that your properties have been set correctly. Note that only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
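If you do go the log-it-yourself route, a minimal sketch (PySpark-style, assuming a SparkSession named spark; the "spark-conf" prefix is made up) is to dump the resolved configuration when the application starts, so it ends up in the driver log that YARN aggregates per application:
# print every resolved Spark property; driver stdout is captured in the YARN application log
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print("spark-conf {}={}".format(key, value))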
