saveAsTable with Append Mode throwing Exception - apache-spark

I have a metastore set up, and I am trying to append data from a dataset to the same table using the following code:
String warehouseLocation = "file:///" + System.getProperty("user.dir") + "/" + sessionName;
session = SparkSession.builder().appName(sessionName).config(sparkConf)
    .config("spark.sql.warehouse.dir", warehouseLocation)
    .getOrCreate();
dataset.write().mode(SaveMode.Append).option("path", tablePathName).saveAsTable(tableName);
But I am getting an exception:
com.spectramd.products.focus.common.FocusException:
Error Table Information : The location of the existing table `default`.`tablename` is
`file:/D:/Spectramd/priya/Projects/Integrated/Platform/v1165/etl/SparkETL/spark-warehouse/MU2_2022/Stroke2/tablename`.
It doesn't match the specified location `MU2_2022/Stroke2/tablename`.
This error is only thrown when the table already exists. How can I resolve it?
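The exception compares the table's recorded location (an absolute file: URI) against the relative path passed to option("path", ...). One way around it is to resolve the relative table path against the warehouse location before writing, so the two match. A minimal sketch of that resolution (the warehouse path and helper name below are assumptions, not from the original post):

```java
// Sketch (assumed names): resolve a relative table path against the
// warehouse location so it matches what the metastore recorded.
public class TablePathFix {
    static String absoluteTablePath(String warehouseLocation, String tablePathName) {
        // Already an absolute file: URI? Leave it alone.
        if (tablePathName.startsWith("file:")) {
            return tablePathName;
        }
        // Join warehouse location and relative path with exactly one slash.
        String base = warehouseLocation.endsWith("/")
                ? warehouseLocation.substring(0, warehouseLocation.length() - 1)
                : warehouseLocation;
        return base + "/" + tablePathName;
    }

    public static void main(String[] args) {
        String warehouse = "file:///D:/etl/spark-warehouse/MySession"; // assumed example path
        // Pass this absolute path to option("path", ...) instead of the relative one:
        System.out.println(absoluteTablePath(warehouse, "MU2_2022/Stroke2/tablename"));
    }
}
```

Alternatively, when appending to a table that already exists in the metastore, it may be simpler to drop the option("path", ...) call entirely, since the metastore already knows the table's location.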

Related

Inconsistent Behaviors for Multiple SparkSessions when accessing the Iceberg Table

I explored using multiple SparkSessions (to connect to different data sources/clusters) a bit, and I found some weird behavior.
First I created a SparkSession to read and write the Iceberg table, and everything worked.
Then, if I use the new SparkSession (with some incorrect parameters, such as spark.sql.catalog.mycatalog.uri) to access the table created by the previous SparkSession through (1) spark.read().*.load("*") first, and then try (2) running some SQL on that table as well, everything still works (even with the incorrect parameter).
The full test is given as below:
// The test uses the new SparkSession to access the dataset created by the previous
// SparkSession, using spark.read().*.load(*) first, then sql. The whole test still works.
@Test
public void multipleSparkSessions() throws AnalysisException {
    // Create the 1st SparkSession
    String endpoint = String.format("http://localhost:%s/metastore", port);
    ctx = SparkSession
        .builder()
        .master("local")
        .config("spark.ui.enabled", false)
        .config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.mycatalog.type", "hive")
        .config("spark.sql.catalog.mycatalog.uri", endpoint)
        .config("spark.sql.catalog.mycatalog.cache-enabled", "false")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .getOrCreate();
    // Create a table with the SparkSession
    String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
    ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
        + "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));
    // Create a new SparkSession
    SparkSession newSession = ctx.newSession();
    newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");
    // Access the created dataset above with the new SparkSession through session.read()...load(), which succeeds
    List<Row> dataset2 = newSession.read()
        .format("iceberg")
        .load(String.format("mycatalog.%s", tableName)).collectAsList();
    dataset2.forEach(r -> System.out.println(r));
    // Access the dataset through SQL, which succeeds as well.
    newSession.sql(
        String.format("select * from mycatalog.%s", tableName)).collectAsList();
}
But if I use the new SparkSession to access the table through (1) newSession.sql first, the execution fails, and then (2) read()...load() fails as well, with the error java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679.
The updated test is given below; note the assertThrows calls, which verify that the exception is thrown.
IMO this makes more sense: since I provided an incorrect catalog URI, the SparkSession shouldn't be able to locate that table.
@Test
public void multipleSparkSessions() throws AnalysisException {
    // ..same as above...
    // Access the dataset through SQL first; the exception is thrown
    assertThrows(java.lang.RuntimeException.class, () -> newSession.sql(
        String.format("select * from mycatalog.%s", tableName)).collectAsList());
    // Access the created dataset above with the new SparkSession through session.read()...load(); the exception is thrown
    assertThrows(java.lang.RuntimeException.class, () -> newSession.read()
        .format("iceberg")
        .load(String.format("mycatalog.%s", tableName)).collectAsList());
}
Any idea what could lead to these two different behaviors of spark.read().load() versus spark.sql() in different orders?
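One plausible explanation (an assumption about Spark/Iceberg internals, not something verified against their source): catalog and metastore-client instances are built lazily on first use and cached at a scope wider than a single session, so whichever access path runs first decides which URI ends up in the cache, and later sessions or conf changes never get consulted. The toy sketch below is NOT Spark or Iceberg code; it only illustrates how such first-use caching could produce order-dependent behavior:

```java
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of the suspected mechanism: a catalog "client" is built
// lazily from the conf on first use and cached JVM-wide by catalog name,
// so the first resolution wins and later (broken) conf values are ignored.
public class CatalogCacheToy {
    // JVM-wide cache keyed by catalog name.
    static final ConcurrentHashMap<String, String> CACHE = new ConcurrentHashMap<>();

    // Each "session" carries its own URI for the catalog; the cache ignores
    // it once an entry exists.
    static String resolve(String catalogName, String sessionUri) {
        return CACHE.computeIfAbsent(catalogName, name -> sessionUri);
    }

    public static void main(String[] args) {
        // Session 1 resolves the catalog with a good URI; it gets cached.
        String first = resolve("mycatalog", "http://localhost:9083");
        // Session 2 has a broken URI, but the cached entry is reused.
        String second = resolve("mycatalog", "http://non_exist_address");
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

Under this hypothesis, running newSession.sql first forces the catalog to be instantiated from the new session's (broken) conf, so both access paths fail from then on; running read()...load() first reuses state established by the original session, so the broken URI is never exercised.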

Databricks: AnalysisException: The specified properties do not match the existing properties at /mnt/product1/data/snapshot/ATM/latest

Getting exception while creating Delta Lake Table for existing delta format data:
Query:
CREATE TABLE IF NOT EXISTS AccountMember USING DELTA TBLPROPERTIES (delta.enableChangeDataFeed = true, delta.autoOptimize.optimizeWrite = true) LOCATION '/mnt/product1/data/snapshot/ATM/latest'
Exception: AnalysisException: The specified properties do not match the existing properties at /mnt/product1/data/snapshot/ATM/latest
Please help to understand the root cause!
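A common cause (an assumption here, since the table's full history isn't shown): the Delta log at that location already exists with table properties that differ from the TBLPROPERTIES in the CREATE statement, and Delta requires the specified properties to match the existing ones exactly. One way around it is to register the table without properties and then set them in a separate step:

```sql
-- Register the existing Delta data without specifying properties...
CREATE TABLE IF NOT EXISTS AccountMember
USING DELTA
LOCATION '/mnt/product1/data/snapshot/ATM/latest';

-- ...then apply the desired properties afterwards.
ALTER TABLE AccountMember SET TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
```

This sidesteps the exact-match check at creation time; the ALTER statement updates the properties in the Delta log itself.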

Spark read data from Cassandra error org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string

I have a Cassandra table that was created as follows (in cqlsh):
CREATE TABLE blog.session( id int PRIMARY KEY, visited text);
I write data to Cassandra and it looks like this
id | visited
1 | Url1-Url2-Url3
I then try to read it using the Spark Cassandra Connector (2.5.1):
val sparkSession = SparkSession.builder()
.master("local")
.appName("ReadFromCass")
.config("spark.cassandra.connection.host", "localhost")
.config("spark.cassandra.connection.port", "9042")
.getOrCreate()
import sparkSession.implicits._
val readSessions = sparkSession.sqlContext
.read
.cassandraFormat("table1", "keyspace1").load().show()
However, it seems unable to read the visited column, since it is a text value with dashes between the words. The error is:
org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Any ideas on why Spark is unable to read this and how to fix it?
The error turned out to be caused by the version of the spark-cassandra-connector: instead of "2.5.1", use "3.0.0-beta".
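In Maven terms, the dependency change suggested above would look like the following (the Scala suffix _2.12 is an assumption; adjust it to match your build):

```xml
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.12</artifactId>
  <version>3.0.0-beta</version>
</dependency>
```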

Accessing hive data using spark

I want to access Hive data using Spark:
%spark
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql(LOAD DATA LOCAL INPATH '//filepath' INTO TABLE src)
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
However, I am getting an error:
:4: error: ')' expected but '(' found.
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
How can I resolve this error?
You should use standard SQL syntax:
sqlContext.sql("SELECT key, value FROM src").show()
What's more, every sql call should take a String as its argument; your second command is missing the quotes:
sqlContext.sql("LOAD DATA LOCAL INPATH '//filepath' INTO TABLE src")
Can you try this?
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // creating a Hive context
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '//filepath' INTO TABLE src")
val srcRDD = sqlContext.sql("SELECT key, value FROM src")
srcRDD.collect().foreach(println) // printing the data

Spark query appends keyspace to the spark temp table

I have a cassandraSQLContext where I do this:
cassandraSqlContext.setKeyspace("test");
Because if I don't, it complains that the default keyspace is not set.
Now when I run this piece of code:
def insertIntoCassandra(siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  System.out.println(dataFrame.show())
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  System.out.println("Registered the spark table to spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  System.out.println("Query: " + query)
  cassandraSqlContext.sql(query)
  System.out.println("Query executed")
}
It gives me this error log:
Registered the spark table to spark_test
Query: INSERT INTO TABLE test.tablename SELECT **the columns here** FROM spark_tablename
17/02/28 04:15:53 ERROR JobScheduler: Error running job streaming job 1488255351000 ms.0
java.util.concurrent.ExecutionException: java.io.IOException: Couldn't find test.tablename or any similarly named keyspace and table pairs
What I don't understand is why cassandraSqlContext isn't executing the printed query, and why it appends the keyspace to the Spark temp table.
public String buildInsertQuery(String activeReplicaKeySpace, String tableName, String columns) {
    String sql = "INSERT INTO TABLE " + activeReplicaKeySpace + "." + tableName +
        " SELECT " + columns + " FROM spark_" + tableName;
    return sql;
}
So the problem was that I was using two different instances of cassandraSqlContext. In one of the methods, I was instantiating a new cassandraSqlContext, which conflicted with the cassandraSqlContext being passed to the insertIntoCassandra method.
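The fix above is consistent with how temp-table registration is scoped: a temp table lives in the context instance that registered it, so a query executed through a different instance cannot resolve the name and falls back to the keyspace. A toy sketch of that per-instance scoping (plain Java, not the connector or Spark itself):

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration: each context keeps its own temp-table registry, so a
// table registered on one instance is invisible to another instance.
public class TempTableScope {
    private final Map<String, String> tempTables = new HashMap<>();

    void registerTempTable(String name, String data) {
        tempTables.put(name, data);
    }

    boolean canResolve(String name) {
        return tempTables.containsKey(name);
    }

    public static void main(String[] args) {
        TempTableScope ctxA = new TempTableScope(); // the context passed into the method
        TempTableScope ctxB = new TempTableScope(); // a freshly instantiated one
        ctxA.registerTempTable("spark_tablename", "rows...");
        System.out.println(ctxA.canResolve("spark_tablename")); // true
        System.out.println(ctxB.canResolve("spark_tablename")); // false
    }
}
```

With only one shared cassandraSqlContext, the registration and the INSERT ... SELECT both see the same registry, so spark_tablename resolves as intended.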
