I am working on the deployment of the Purview ADB Lineage Solution Accelerator developed by the Microsoft Azure team. The tool's GitHub site is here.
I followed their instructions and deployed the tool on Azure. But when I run their sample Scala file abfss-in-abfss-out-olsample, the following code gives the error shown below:
NoSuchElementException: spark.openlineage.samplestorageaccount
Code (Scala):
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val storageServiceName = spark.conf.get("spark.openlineage.samplestorageaccount")
val storageContainerName = spark.conf.get("spark.openlineage.samplestoragecontainer")
val adlsRootPath = "wasbs://"+storageContainerName+"@"+storageServiceName+".blob.core.windows.net"
val storageKey = dbutils.secrets.get("purview-to-adb-kv", "storageAccessKey")
spark.conf.set("fs.azure.account.key."+storageServiceName+".blob.core.windows.net", storageKey)
Question: What could be the cause of the error, and how can we fix it?
UPDATE: In the Spark Config in the Advanced Options section of the Databricks cluster, I have added the following content, as suggested by item 4 of the "Install OpenLineage on Your Databricks Cluster" section of the above-mentioned tutorial.
Spark config
spark.openlineage.host https://functionapppv2dtbr8s6k.azurewebsites.net
spark.openlineage.url.param.code bmHFCiNI86nfgqwfkX86Lj5veclRds9Zb1NIJ48uRgNXAzFuQEueiQ==
spark.openlineage.namespace https://adb-1900514794152199.12#0160-060038-516wad48
spark.databricks.delta.preview.enabled true
spark.openlineage.version v1
It means the Spark configuration (spark.conf) doesn't contain such a key.
You have to check how the configuration is set up/provided if you expect this key to be present. In your UPDATE, the cluster's Spark config defines spark.openlineage.host, spark.openlineage.url.param.code, spark.openlineage.namespace, and spark.openlineage.version, but not spark.openlineage.samplestorageaccount or spark.openlineage.samplestoragecontainer, so the spark.conf.get calls fail.
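A minimal sketch of a fix (the account and container names below are placeholders for wherever your sample data lives, not values from the question): either add the two missing keys to the same Spark config block on the cluster and restart it,

spark.openlineage.samplestorageaccount <your-sample-storage-account-name>
spark.openlineage.samplestoragecontainer <your-sample-container-name>

or set them in the notebook before the spark.conf.get calls:

spark.conf.set("spark.openlineage.samplestorageaccount", "<your-sample-storage-account-name>")
spark.conf.set("spark.openlineage.samplestoragecontainer", "<your-sample-container-name>")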
I have a simple implementation of the .write.synapsesql() method (code shown below) that works in Spark 2.4.8 but not in Spark 3.1.2 (documentation/example here). The data in use is a simple, notebook-created foobar-type table. Searching online for key phrases from and about the error did not turn up any new information for me.
What is the cause of the error in 3.1.2?
Spark 2.4.8 version (behaves as desired):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)
Spark 3.1.2 version (the extra callback argument is the same as in the documentation; it can also be left out with a similar result):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
Some(callBackFunctionToReceivePostWriteMetrics))
The resulting error (only in 3.1.2) is:
WriteFailureCause -> java.lang.IllegalArgumentException: Failed to derive `https` scheme based staging location URL for SQL COPY-INTO}
As the documentation from the question states, ensure that you are setting the options correctly with
val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
and including the options in your .write statement like so:
df.write.options(writeOptionsWithAADAuth).synapsesql(...)
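Putting the pieces together, a minimal sketch for the Spark 3.1.2 case (the server, container, account, and folder names are placeholders; the table name and callback come from the question, and the connector imports are assumed to be the same ones your existing notebook already uses):

val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")

val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write
  .options(writeOptionsWithAADAuth)
  .synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
    Some(callBackFunctionToReceivePostWriteMetrics))

The error itself complains about the staging location, which the connector appears to derive from Constants.TEMP_FOLDER, so that option in particular needs to be a valid abfss:// URL that the workspace identity can write to.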
I'm getting a ClassCastException when trying to traverse the directories in a mounted Databricks volume.
java.lang.ClassCastException: com.databricks.backend.daemon.dbutils.FileInfo cannot be cast to com.databricks.service.FileInfo
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at com.mycompany.functions.UnifiedFrameworkFunctions$.getAllFiles(UnifiedFrameworkFunctions.scala:287)
where the getAllFiles function looks like:
import com.databricks.service.{DBUtils, FileInfo}
...
def getAllFiles(path: String): Seq[String] = {
  val files = DBUtils.fs.ls(path)
  if (files.isEmpty)
    List()
  else
    files.map(file => { // line where exception is raised
      val path: String = file.path
      if (DBUtils.fs.dbfs.getFileStatus(new org.apache.hadoop.fs.Path(path)).isDirectory) getAllFiles(path)
      else List(path)
    }).reduce(_ ++ _)
}
Locally it runs OK with Databricks Connect, but when the source code is packaged as a jar and run on a Databricks cluster, the above exception is raised.
Since Databricks' documentation suggests using com.databricks.service.DBUtils, and calling DBUtils.fs.ls(path) returns FileInfo from the same service package, is this a bug or should the API be used in some other way?
I'm using Databricks Connect & Runtime version 8.1.
I have tried a workaround to get file names from a folder.
I performed the following steps to get the list of file names from a mounted directory.
I have stored 3 files at the "/mnt/Sales/" location.
Step 1: Use the display(dbutils.fs.ls("/mnt/Sales/")) command.
Step 2: Assign the file location to a variable.
Step 3: Load the variable into a dataframe and get the names of the files.
You could convert the directories (of type Seq[com.databricks.service.FileInfo]) to a string, split the string, and use pattern matching to extract the file names as you traverse the new Array[String]. Using Scala:
val files = dbutils.fs.ls(path).mkString(";").split(";")
val pattern = "dbfs.*/(Sales_Record[0-9]+.csv)/.*".r
files.map(file => { val pattern(res) = file; res })
Or you could try
val pattern = "dbfs.*/(.*.csv)/.*".r
to get all file names ending in csv. The pattern can be constructed to suit your needs.
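If you only need the file names, a simpler variant is possible. This is a sketch that relies on the name field of the FileInfo entries returned by dbutils.fs.ls (directory entries are reported with a trailing "/", so filtering on ".csv" keeps only the CSV files):

// keep only the CSV files in the folder and take their names
val csvNames = dbutils.fs.ls("/mnt/Sales/")
  .filter(f => f.name.endsWith(".csv"))
  .map(_.name)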
To use dbutils on your local machine (Databricks Connect), you need to import it. However, dbutils is already available on the Databricks cluster, so you should not import it there. Thus, when your source code (with DBUtils.fs.ls(path)) is packaged as a jar and run on a Databricks cluster, you get this error. A minimal example on my Databricks cluster confirmed this.
So Scala users should just use dbutils.* for code on the Databricks cluster.
For PySpark users, the Databricks documentation says the following here:
When using Databricks Runtime 7.3 LTS or above, to access the DBUtils module in a way that works both locally and in Databricks clusters, use the following get_dbutils():
And the document goes on to show the code for this purpose, so consult the Python code tab in the link for more details.
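Applied to the getAllFiles function from the question, here is a minimal Scala sketch that avoids the com.databricks.service import on the cluster. It assumes entries whose name ends with "/" are directories, which is how dbutils.fs.ls reports them:

def getAllFiles(path: String): Seq[String] = {
  // dbutils is already in scope in notebooks on the cluster;
  // in a packaged jar it can typically be obtained via com.databricks.dbutils_v1.DBUtilsHolder.dbutils
  dbutils.fs.ls(path).flatMap { file =>
    if (file.name.endsWith("/")) getAllFiles(file.path) // recurse into subdirectories
    else Seq(file.path)
  }
}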
I have a Cloudera VM running Spark version 1.6.0.
I created a dataframe from a CSV file and am now filtering columns based on a WHERE clause:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")
result = sqlContext.sql("SELECT id,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'")
However, it gives me a runtime error on the SQL line:
py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
: java.lang.RuntimeException: [1.96] failure: identifier expected
SELECT consigner,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'
^
Where am I going wrong here?
The above command fails on the VM command line, but works fine when run in the Databricks environment.
Also, why are column names case-sensitive in the VM? It fails to recognise 'trip frozen' because the actual column is 'Trip Frozen'.
All of this works fine in Databricks and breaks in the VM.
In your VM, are you creating sqlContext as a SQLContext or as a HiveContext?
In Databricks, the automatically-created sqlContext will always point to a HiveContext.
In Spark 2.0 this distinction between HiveContext and regular SQLContext should not matter because both have been subsumed by SparkSession, but in Spark 1.6 the two types of contexts differ slightly in how they parse SQL language input.
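For example, in Spark 1.6 you can construct a HiveContext explicitly. This is a sketch in Scala to match the rest of this page (the PySpark equivalent is pyspark.sql.HiveContext(sc)), and it assumes your Spark build includes Hive support:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc) // the HiveQL parser accepts backticked identifiers such as `safety rating`
val result = sqlContext.sql("SELECT id, `safety rating` AS safety_rating, route FROM closedtrips WHERE `trip frozen` = 'YES'")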
I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However, I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!
Although you haven't mentioned your Spark version or the details of the exception...
Please see PHOENIX-2287, which is fixed and says:
Environment: HBase 1.1.1 running in standalone mode on OS X; Spark 1.5.0; Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56
Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry-forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
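Once you are on a Phoenix build that contains that fix, the DataFrame path should work. A minimal sketch using the generic DataFrame reader that phoenix-spark registers (it assumes the phoenix-spark and Phoenix client jars are on the classpath; the table name and ZooKeeper URL are the placeholders from the question):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// read the Phoenix table through the DataFrame API instead of the RDD helpers
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "FOO")
  .option("zkUrl", "<zk-ip-address>:2181:/hbase-unsecure")
  .load()

df.select("ID").show()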
I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
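If you are building an application jar rather than using the shell, the equivalent is to declare the missing dependency yourself. A sketch for sbt follows; the versions simply mirror the --packages line above and may need adjusting to match your cluster's Spark and Scala versions:

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.avro" % "avro-mapred"     % "1.7.7",
  "com.databricks"  % "spark-avro_2.10" % "2.0.1"
)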
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.