How to implement auto increment in spark SQL(PySpark) - apache-spark

I need to implement a auto increment column in my spark sql table, how could i do that. Kindly guide me. i am using pyspark 2.0
Thank you
Kalyan

I would write/reuse stateful Hive udf and register with pySpark as Spark SQL does have good support for Hive.
check this line #UDFType(deterministic = false, stateful = true) in below code to make sure it's stateful UDF.
package org.apache.hadoop.hive.contrib.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;
/**
* UDFRowSequence.
*/
#Description(name = "row_sequence",
value = "_FUNC_() - Returns a generated row sequence number starting from 1")
#UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
private LongWritable result = new LongWritable();
public UDFRowSequence() {
result.set(0);
}
public LongWritable evaluate() {
result.set(result.get() + 1);
return result;
}
}
// End UDFRowSequence.java
Now build the jar and add the location when pyspark get's started.
$ pyspark --jars your_jar_name.jar
Then register with sqlContext.
sqlContext.sql("CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'")
Now use row_seq() in select query
sqlContext.sql("SELECT row_seq(), col1, col2 FROM table_name")
Project to use Hive UDFs in pySpark

Related

Inconsistent Behaviors for Multiple SparkSessions when accessing the Iceberg Table

I explored the multiple SparkSessions (to connect to different data sources/data clusters) a bit. And I found a wired behavior.
Firstly I created a SparkSession to RW the iceberg table, and everything works.
Then if I use the new SparkSession (with some incorrect parameters like spark.sql.catalog.mycatalog.uri) to access the table created by the previous SparkSession through (1) spark.read().*.load("*") first, and then try (2) running some SQL on that table as well, everything still works(even with the incorrect parameter).
The full test is given as below:
// The test to use the new SparkSession access the dataset created by previous SparkSession, using spark.read().*.load(*) first, then sql. And the whole test still works.
#Test
public void multipleSparkSessions() throws AnalysisException {
// Create the 1st SparkSession
String endpoint = String.format("http://localhost:%s/metastore", port);
ctx = SparkSession
.builder()
.master("local")
.config("spark.ui.enabled", false)
.config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.mycatalog.type", "hive")
.config("spark.sql.catalog.mycatalog.uri", endpoint)
.config("spark.sql.catalog.mycatalog.cache-enabled", "false")
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.getOrCreate();
// Create a table with the SparkSession
String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
+ "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));
// Create a new SparkSession
SparkSession newSession = ctx.newSession();
newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");
// Access the created dataset above with the new SparkSession through session.read()...load(), which succeeds
List<Row> dataset2 = newSession.read()
.format("iceberg")
.load(String.format("mycatalog.%s", tableName)).collectAsList();
dataset2.forEach(r -> System.out.println(r));
// Access the dataset through SQL, which succeeds as well.
newSession.sql(
String.format("select * from mycatalog.%s", tableName)).collectAsList();
}
But if I use the new SparkSession to access the table through (1) newSession.sql first, the execution fails, and then (2) the read().**.load("**") will fail as well with error java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679.
The updated test is given below, you will notice the assertThrows which verifies the Exception is thrown.
IMO this makes more sense, given I provided the incorrect catalog uri, so the SparkSession shouldn't be able to locate that table.
#Test
public void multipleSparkSessions() throws AnalysisException {
..same as above...
// Access the dataset through SQL first, the exception is thrown
assertThrows(java.lang.RuntimeException.class,() -> newSession.sql(
String.format("select * from mycatalog.%s", tableName)).collectAsList());
// Access the created dataset above with the new SparkSession through session.read()...load(), the exception is thrown
assertThrows(java.lang.RuntimeException.class,() -> newSession.read()
.format("iceberg")
.load(String.format("mycatalog.%s", tableName)).collectAsList());
}
Any idea what could lead to these two different behaviors with spark.read().load() versus spark.sql() in different sequences?

Nullability in Spark sql schemas is advisory by default. What is best way to strictly enforce it?

I am working on a simple ETL project which reads CSV files, performs
some modifications on each column, then writes the result out as JSON.
I would like downstream processes which read my results
to be confident that my output conforms to
an agreed schema, but my problem is that even if I define
my input schema with nullable=false for all fields, nulls can sneak
in and corrupt my output files, and there seems to be no (performant) way I can
make Spark enforce 'not null' for my input fields.
This seems to be a feature, as stated below in Spark, The Definitive Guide:
when you define a schema where all columns are declared to not have
null values , Spark will not enforce that and will happily let null
values into that column. The nullable signal is simply to help Spark
SQL optimize for handling that column. If you have null values in
columns that should not have null values, you can get an incorrect
result or see strange exceptions that can be hard to debug.
I have written a little check utility to go through each row of a dataframe and
raise an error if nulls are detected in any of the columns (at any level of
nesting, in the case of fields or subfields like map, struct, or array.)
I am wondering, specifically: DID I RE-INVENT THE WHEEL WITH THIS CHECK UTILITY ? Are there any existing libraries, or
Spark techniques that would do this for me (ideally in a better way than what I implemented) ?
The check utility and a simplified version of my pipeline appears below. As presented, the call to the
check utility is commented out. If you run without the check utility enabled, you would see this result in
/tmp/output.csv.
cat /tmp/output.json/*
(one + 1),(two + 1)
3,4
"",5
The second line after the header should be a number, but it is an empty string
(which is how spark writes out the null, I guess.) This output would be problematic for
downstream components that read my ETL job's output: these components just want integers.
Now, I can enable the check by un-commenting out the line
//checkNulls(inDf)
When I do this I get an exception that informs me of the invalid null value and prints
out the entirety of the offending row, like this:
java.lang.RuntimeException: found null column value in row: [null,4]
One Possible Alternate Approach Given in Spark/Definitive Guide
Spark, The Definitive Guide mentions the possibility of doing this:
<dataframe>.na.drop()
But this would (AFAIK) silently drop the bad records rather than flagging the bad ones.
I could then do a "set subtract" on the input before and after the drop, but that seems like
a heavy performance hit to find out what is null and what is not. At first glance, I'd
prefer my method.... But I am still wondering if there might be some better way out there.
The complete code is given below. Thanks !
package org
import java.io.PrintWriter
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// before running, do; rm -rf /tmp/out* /tmp/foo*
object SchemaCheckFailsToExcludeInvalidNullValue extends App {
import NullCheckMethods._
//val input = "2,3\n\"xxx\",4" // this will be dropped as malformed
val input = "2,3\n,4" // BUT.. this will be let through
new PrintWriter("/tmp/foo.csv") { write(input); close }
lazy val sparkConf = new SparkConf()
.setAppName("Learn Spark")
.setMaster("local[*]")
lazy val sparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
val spark = sparkSession
val schema = new StructType(
Array(
StructField("one", IntegerType, nullable = false),
StructField("two", IntegerType, nullable = false)
)
)
val inDf: DataFrame =
spark.
read.
option("header", "false").
option("mode", "dropMalformed").
schema(schema).
csv("/tmp/foo.csv")
//checkNulls(inDf)
val plusOneDf = inDf.selectExpr("one+1", "two+1")
plusOneDf.show()
plusOneDf.
write.
option("header", "true").
csv("/tmp/output.csv")
}
object NullCheckMethods extends Serializable {
def checkNull(columnValue: Any): Unit = {
if (columnValue == null)
throw new RuntimeException("got null")
columnValue match {
case item: Seq[_] =>
item.foreach(checkNull)
case item: Map[_, _] =>
item.values.foreach(checkNull)
case item: Row =>
item.toSeq.foreach {
checkNull
}
case default =>
println(
s"bad object [ $default ] of type: ${default.getClass.getName}")
}
}
def checkNulls(row: Row): Unit = {
try {
row.toSeq.foreach {
checkNull
}
} catch {
case err: Throwable =>
throw new RuntimeException(
s"found null column value in row: ${row}")
}
}
def checkNulls(df: DataFrame): Unit = {
df.foreach { row => checkNulls(row) }
}
}
You can use the built-in Row method anyNull to split the dataframe and process both splits differently:
val plusOneNoNulls = plusOneDf.filter(!_.anyNull)
val plusOneWithNulls = plusOneDf.filter(_.anyNull)
If you don't plan to have a manual null-handling process, using the builtin DataFrame.na methods is simpler since it already implements all the usual ways to automatically handle nulls (i.e drop or fill them out with default values).

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query:
describe formatted <table name>
I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample spark code that can obtain the location value?
Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier
lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
You can also use .toDF method on desc formatted table then filter from dataframe.
DataframeAPI:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD Api:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
First approach
You can use input_file_name with dataframe.
it will give you absolute file-path for a part file.
spark.read.table("zen.intent_master").select(input_file_name).take(1)
And then extract table path from it.
Second approach
Its more of hack you can say.
package org.apache.spark.sql.hive
import java.net.URI
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession
class TableDetail {
def getTableLocation(table: String, spark: SparkSession): URI = {
val sessionState: SessionState = spark.sessionState
val sharedState: SharedState = spark.sharedState
val catalog: SessionCatalog = sessionState.catalog
val sqlParser: ParserInterface = sessionState.sqlParser
val client = sharedState.externalCatalog match {
case catalog: HiveExternalCatalog => catalog.client
case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
"support hive client API")
}
val idtfr = sqlParser.parseTableIdentifier(table)
require(catalog.tableExists(idtfr), new IllegalArgumentException(idtfr + " done not exists"))
val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
rawTable.location
}
}
Here is how to do it in PySpark:
(spark.sql("desc formatted mydb.myschema")
.filter("col_name=='Location'")
.collect()[0].data_type)
Use this as re-usable function in your scala project
def getHiveTablePath(tableName: String, spark: SparkSession):String =
{
import org.apache.spark.sql.functions._
val sql: String = String.format("desc formatted %s", tableName)
val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
result.show(false) // just for debug purpose
val info: String = result.collect().mkString(",")
val path: String = info.split(',')(1)
path
}
caller would be
println(getHiveTablePath("src", spark)) // you can prefix schema if you have
Result (I executed in local so file:/ below if its hdfs hdfs:// will come):
+--------+------------------------------------+-------+
|col_name|data_type |comment|
+--------+--------------------------------------------+
|Location|file:/Users/hive/spark-warehouse/src| |
+--------+------------------------------------+-------+
file:/Users/hive/spark-warehouse/src
USE ExternalCatalog
scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#4eba6e1f
scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog#24b05292
scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock

How to use registered Hive UDF in Spark DataFrame?

I have registered my hive UDF using the following code:
hiveContext.udf().register("MyUDF",new UDF1(String,String)) {
public String call(String o) throws Execption {
//bla bla
}
},DataTypes.String);
Now I want to use above MyUDF in DataFrame. How do we use it? I know how to use it in a SQL and it works fine
hiveContext.sql(select MyUDF("test") from myTable);
My hiveContext.sql() query involves group by on multiple columns so for scaling purpose I am trying to convert this query into DataFrame APIs
dataframe.select("col1","col2","coln").groupby(""col1","col2","coln").count();
Can we do the following: dataframe.select(MyUDF("col1"))?
I tested the following with pyspark 3.x running on top of yarn and it works
from pyspark.sql.functions import expr
df1 = df.withColumn("result", expr("MyUDF('test')"))
df1.show()
df2 = df.selectExpr("MyUDF('test') as result").show()
df2.show()
In case you come across Class not found error. Then you might want to add the jar using spark.sql("ADD JAR hdfs://...")

aparch spark, NotSerializableException: org.apache.hadoop.io.Text

here is my code:
val bg = imageBundleRDD.first() //bg:[Text, BundleWritable]
val res= imageBundleRDD.map(data => {
val desBundle = colorToGray(bg._2) //lineA:NotSerializableException: org.apache.hadoop.io.Text
//val desBundle = colorToGray(data._2) //lineB:everything is ok
(data._1, desBundle)
})
println(res.count)
lineB goes well but lineA shows that:org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I try to use use Kryo to solve my problem but it seems nothing has been changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[Text])
kryo.register(classOf[BundleWritable])
}
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text,VideoRecording>,String,VideoRecording>() {
#Override
public Tuple2<String, VideoRecording> call(
Tuple2<Text, VideoRecording> kv) throws Exception {
// Necessary to copy value as Hadoop chooses to reuse objects
VideoRecording vr = new VideoRecording(kv._2);
return new Tuple2(kv._1.toString(), vr);
}
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
In Apache Spark while dealing with Sequence files, we have to follow these techniques:
-- Use Java equivalent Data Types in place of Hadoop data types.
-- Spark Automatically converts the Writables into Java equivalent Types.
Ex:- We have a sequence file "xyz", here key type is say Text and value
is LongWritable. When we use this file to create an RDD, we need use their
java equivalent data types i.e., String and Long respectively.
val mydata = = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
// ... set master, appname, etc, then:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)

Resources