Inconsistent Behaviors for Multiple SparkSessions when accessing the Iceberg Table - apache-spark

I have been exploring multiple SparkSessions (to connect to different data sources/clusters) and found a weird behavior.
First I created a SparkSession to read/write the Iceberg table, and everything works.
Then, if I use a new SparkSession (with an incorrect parameter such as spark.sql.catalog.mycatalog.uri) to access the table created by the previous SparkSession via (1) spark.read()...load() first and then (2) some SQL on that table, everything still works (even with the incorrect parameter).
The full test is given below:
// Test: use the new SparkSession to access the dataset created by the previous
// SparkSession, using spark.read()...load() first, then SQL. The whole test still passes.
@Test
public void multipleSparkSessions() throws AnalysisException {
    // Create the 1st SparkSession
    String endpoint = String.format("http://localhost:%s/metastore", port);
    ctx = SparkSession
        .builder()
        .master("local")
        .config("spark.ui.enabled", false)
        .config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.mycatalog.type", "hive")
        .config("spark.sql.catalog.mycatalog.uri", endpoint)
        .config("spark.sql.catalog.mycatalog.cache-enabled", "false")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .getOrCreate();

    // Create a table with the SparkSession
    String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
    ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
        + "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));

    // Create a new SparkSession with an intentionally wrong catalog URI
    SparkSession newSession = ctx.newSession();
    newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");

    // Access the dataset created above with the new SparkSession through
    // session.read()...load(), which succeeds
    List<Row> dataset2 = newSession.read()
        .format("iceberg")
        .load(String.format("mycatalog.%s", tableName))
        .collectAsList();
    dataset2.forEach(r -> System.out.println(r));

    // Access the dataset through SQL, which succeeds as well
    newSession.sql(
        String.format("select * from mycatalog.%s", tableName)).collectAsList();
}
But if I use the new SparkSession to access the table through (1) newSession.sql first, the execution fails, and then (2) read()...load() fails as well, with the error java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679.
The updated test is given below; note the assertThrows calls, which verify that the exception is thrown.
IMO this makes more sense: since I provided an incorrect catalog URI, the SparkSession shouldn't be able to locate that table.
@Test
public void multipleSparkSessions() throws AnalysisException {
    // ...same as above...

    // Access the dataset through SQL first; the exception is thrown
    assertThrows(java.lang.RuntimeException.class, () -> newSession.sql(
        String.format("select * from mycatalog.%s", tableName)).collectAsList());

    // Access the dataset created above with the new SparkSession through
    // session.read()...load(); the exception is thrown as well
    assertThrows(java.lang.RuntimeException.class, () -> newSession.read()
        .format("iceberg")
        .load(String.format("mycatalog.%s", tableName))
        .collectAsList());
}
Any idea what could lead to these two different behaviors with spark.read().load() versus spark.sql() in different sequences?

Related

Call class from external file in parallel pyspark

I'm trying to distribute data by ID across the cluster and then call another class to run complex logic on those IDs in parallel in PySpark. I'm confused about how to sort this out, as the code below did not work.
I have a file myprocess.py containing:
class MyProcess:
    def __init__(self, sqlcontext, x):
        ...  # code here

    def complex_calculation(self):
        ...  # lots of SQL and statistical steps
Then I have my main wrapper, control.py:
from os.path import abspath
from pyspark.sql import SparkSession, SQLContext

warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("complexlogic") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
sc.addPyFile(r"myprocess.py")
from myprocess import MyProcess

sqlContext = SQLContext(sc)
settings_bc = sc.broadcast({
    'mysqlContext': sqlContext
})

# some code to create df_param

df = df_param.repartition("id")
print('number of partitions', df.rdd.getNumPartitions())
rdd__param = df.rdd.map(lambda x: MyProcess(settings_bc.value, x).complex_calculation()).collect()
The error I get
_pickle.PicklingError: Could not serialize broadcast: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I understand the error is probably about passing sqlContext, but I think my issue is bigger than that error: what is the right way to do what I'm trying to achieve? (Edit: I'm trying to use each ID to filter 17 Hive tables by that ID and use those 17 tables to do complex math. If I move this outside the map, how will I achieve parallelism?) Any help is greatly appreciated.

Spark Streaming: Using external data during stream transformation

I have a situation where I have to filter data points in a stream based on a condition involving a reference to external data. I have loaded the external data into a DataFrame (so that I can query it using the SQL interface). But when I tried to query the DataFrame, I found that it cannot be accessed inside the transform (filter) function (sample code below).
// DStream is created and a temp table called 'locations' is registered
dStream.filter(dp => {
  val responseDf = sqlContext.sql("select location from locations where id='001'")
  responseDf.show() // nothing is displayed
  // some condition evaluation using responseDf
  true
})
Am I doing something wrong? If yes, what would be a better approach to load external data in memory and query it during the stream transformation stage?
Using SparkSession instead of SQLContext solved the issue. Code below:
val sparkSession = SparkSession.builder().appName("APP").getOrCreate()
val df = sparkSession.createDataFrame(locationRepo.getLocationInfo, classOf[LocationVO])
df.createOrReplaceTempView("locations")

val dStream: DStream[StreamDataPoint] = getdStream()
dStream.filter(dp => {
  val sparkAppSession = SparkSession.builder().appName("APP").getOrCreate()
  val responseDf = sparkAppSession.sql("select location from locations where id='001'")
  responseDf.show() // this prints the results
  // some condition evaluation using responseDf
  true
})
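As a design note: if the reference data is small and changes rarely, an alternative is to resolve the lookup once on the driver and broadcast the result, so the filter running on the executors only does a cheap set lookup and needs no SparkSession at all. A minimal sketch, assuming a hypothetical location field on StreamDataPoint:
import org.apache.spark.sql.SparkSession

// Resolve the reference data once on the driver...
val spark = SparkSession.builder().appName("APP").getOrCreate()
val allowedLocations: Set[String] = spark
  .sql("select location from locations where id='001'")
  .collect()
  .map(_.getString(0))
  .toSet

// ...and broadcast it so executors only test set membership per record.
val allowedBc = spark.sparkContext.broadcast(allowedLocations)

dStream.filter(dp => allowedBc.value.contains(dp.location)) // dp.location is an assumed field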

WriteConf of Spark-Cassandra Connector being used or not

I am using Spark version 1.6.2, Spark-Cassandra Connector 1.6.0, Cassandra-Driver-Core 3.0.3
I am writing a simple Spark job in which I am trying to insert some rows to a table in Cassandra. The code snippet used was:
val sparkConf = (new SparkConf(true)
  .set("spark.cassandra.connection.host", "<Cassandra IP>")
  .set("spark.cassandra.auth.username", "test")
  .set("spark.cassandra.auth.password", "test")
  .set("spark.cassandra.output.batch.size.rows", "1"))

val sc = new SparkContext(sparkConf)
val cassandraSQLContext = new CassandraSQLContext(sc)
cassandraSQLContext.setKeyspace("test")

val query = "select * from test"
val dataRDD = cassandraSQLContext.cassandraSql(query).rdd

val addRowList = ListBuffer(
  Test(111, 10, 100000, "{'test':'0','test1':'1','others':'2'}"),
  Test(111, 20, 200000, "{'test':'0','test1':'1','others':'2'}")
)
val insertRowRDD = sc.parallelize(addRowList)
insertRowRDD.saveToCassandra("test", "test")
Test() is a case class
Now, I have passed the WriteConf parameter output.batch.size.rows when building the sparkConf object. I expect this code to write one row per batch to Cassandra, but I cannot find a way to verify that the batch-writing configuration actually in use is the one passed in the code snippet rather than the default.
I could not find anything in Cassandra's cassandra.log, system.log, or debug.log.
So can anyone tell me how to verify the WriteConf that the Spark-Cassandra Connector uses to write batches to Cassandra?
There are two things you can do to verify that your setting was applied correctly.
First, you can call the method which creates the WriteConf:
WriteConf.fromSparkConf(sparkConf)
The resulting object can be inspected to make sure all the values are what you want; it is the default argument to saveToCassandra.
Second, you can explicitly pass a WriteConf to the save method:
saveAsCassandraTable(keyspace, table, writeConf = WriteConf(...))
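A minimal sketch of both checks, reusing the sparkConf and insertRowRDD from the question (batchSize and RowsInBatchSize are based on the connector's WriteConf API as I recall it; verify the names against your connector version):
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{RowsInBatchSize, WriteConf}

// Inspect the WriteConf that the connector derives from the SparkConf.
val derivedConf = WriteConf.fromSparkConf(sparkConf)
println(derivedConf.batchSize) // should show a rows-based batch size of 1 if the property was picked up

// Or pass a WriteConf explicitly so there is no doubt which settings apply.
insertRowRDD.saveToCassandra("test", "test",
  writeConf = WriteConf(batchSize = RowsInBatchSize(1)))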

how to use a whole hive database in spark and read sql queries from external files?

I am using the Hortonworks sandbox in Azure with Spark 1.6.
I have a Hive database populated with TPC-DS sample data. I want to read some SQL queries from external files and run them on the Hive dataset in Spark.
I followed this topic, Using hive database in spark, but it only uses one table from my dataset and writes the SQL query in Spark again; I need to define the whole database as my source to query against. I think I should use DataFrames, but I am not sure how.
I also want to import the SQL query from an external .sql file rather than writing the query down again.
Would you please guide me on how I can do this?
Thank you very much!
Spark can read data directly from Hive tables. You can create and drop Hive tables using Spark, and you can even perform all Hive HQL operations through Spark. For this you need to use the Spark HiveContext.
From the Spark documentation:
HiveContext provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup.
For more information, see the Spark documentation.
To avoid writing SQL in code, you can use a property file where you put all your Hive queries and then reference them by key in your code.
Below is an implementation using Spark HiveContext and a property file, in Spark Scala.
package com.spark.hive.poc

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{ StructType, StructField, StringType }
import org.apache.hadoop.fs.{ FileSystem, Path }

object ReadPropertyFiles extends Serializable {

  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new HiveContext(sc)

  def main(args: Array[String]): Unit = {
    // Load the SQL statements from a properties file on HDFS
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fileSystem = FileSystem.get(hadoopConf)
    val path = new Path(args(0))
    val inputStream = fileSystem.open(path)
    val properties = new java.util.Properties
    properties.load(inputStream)

    // Create an RDD
    val people = sc.textFile("/user/User1/spark_hive_poc/input/")
    // The schema is encoded in a string
    val schemaString = "name address"
    // Generate the schema based on the schema string
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    // Convert records of the RDD (people) to Rows
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
    // Apply the schema to the RDD
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()
    peopleDataFrame.registerTempTable("tbl_temp")

    val data = sqlContext.sql(properties.getProperty("temp_table"))
    // Drop Hive table
    sqlContext.sql(properties.getProperty("drop_hive_table"))
    // Create Hive table
    sqlContext.sql(properties.getProperty("create_hive_table"))
    // Insert data into Hive table
    sqlContext.sql(properties.getProperty("insert_into_hive_table"))
    // Select data from Hive table
    sqlContext.sql(properties.getProperty("select_from_hive")).show()

    sc.stop()
  }
}
Entries in the properties file:
temp_table=select * from tbl_temp
drop_hive_table=DROP TABLE IF EXISTS default.test_hive_tbl
create_hive_table=CREATE TABLE IF NOT EXISTS default.test_hive_tbl(name string, city string) STORED AS ORC
insert_into_hive_table=insert overwrite table default.test_hive_tbl select * from tbl_temp
select_from_hive=select * from default.test_hive_tbl
Spark submit Command to run this job:
[User1#hadoopdev ~]$ spark-submit --num-executors 1 \
--executor-memory 100M --total-executor-cores 2 --master local \
--class com.spark.hive.poc.ReadPropertyFiles Hive-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/user/User1/spark_hive_poc/properties/sql.properties
Note: the property file location should be an HDFS location.

How to implement auto increment in spark SQL(PySpark)

I need to implement an auto-increment column in my Spark SQL table. How could I do that? Kindly guide me. I am using PySpark 2.0.
Thank you
Kalyan
I would write/reuse a stateful Hive UDF and register it with PySpark, since Spark SQL has good support for Hive.
Check the line @UDFType(deterministic = false, stateful = true) in the code below to make sure it's a stateful UDF.
package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}
// End UDFRowSequence.java
Now build the jar and add its location when pyspark gets started:
$ pyspark --jars your_jar_name.jar
Then register it with sqlContext:
sqlContext.sql("CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'")
Now use row_seq() in a select query:
sqlContext.sql("SELECT row_seq(), col1, col2 FROM table_name")
Project to use Hive UDFs in pySpark
