How to write data to Apache Iceberg tables using Spark SQL? - apache-spark

I am trying to familiarize myself with Apache Iceberg and I'm having some trouble understanding how to write some external data to a table using Spark SQL.
I have a file, one.csv, sitting in a directory, /data.
My Iceberg catalog is configured to point to this directory, /warehouse.
I want to write this one.csv to an Apache Iceberg table (preferably using Spark SQL).
Is it even possible to read external data using Spark SQL and then write it to Iceberg tables? Do I have to use Scala or Python to do this? I've been through the Iceberg and Spark 3.0.1 documentation a bunch, but maybe I'm missing something.
Code Update
Here is some code that I hope will help
spark.conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
spark.conf.set("spark.sql.catalog.spark_catalog.type", "hive")
spark.conf.set("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.local.type", "hadoop")
spark.conf.set("spark.sql.catalog.local.warehouse", "data/warehouse")
I have the data I need to use sitting in a directory /one/one.csv
How do I get it into an Iceberg table using Spark? Can all of this be done purely using SparkSQL?
spark.sql(
"""
CREATE or REPLACE TABLE local.db.one
USING iceberg
AS SELECT * FROM `/one/one.csv`
"""
)
The goal is then to work with this Iceberg table directly, for example:
select * from local.db.one
and this would return all the content of the /one/one.csv file.

To do this with Spark SQL, read the file into a DataFrame and register it as a temp view. The temp view can then be referenced in SQL:
var df = spark.read.format("csv").load("/data/one.csv")
df.createOrReplaceTempView("tempview");
spark.sql("CREATE or REPLACE TABLE local.db.one USING iceberg AS SELECT * FROM tempview");
To answer your other question, Scala or Python is not required; the above example is in Java.
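For a route that stays in Spark SQL, one option is to expose the CSV file as a temporary view with CREATE TEMPORARY VIEW ... USING csv and then run the same CTAS. The sketch below only builds the two statements and hands them to spark.sql. The view name one_csv, the helper names, and the header/inferSchema options are assumptions for illustration; the local catalog is the one configured in the question.

```python
def csv_to_iceberg_statements(csv_path, table, view="one_csv"):
    """Build the Spark SQL statements that copy a CSV file into an Iceberg table."""
    # Assumption: the file has a header row; drop the option if it does not.
    create_view = (
        f"CREATE OR REPLACE TEMPORARY VIEW {view} "
        f"USING csv OPTIONS (path '{csv_path}', header 'true', inferSchema 'true')"
    )
    # Same CTAS as in the answer above, reading from the SQL-defined view.
    ctas = f"CREATE OR REPLACE TABLE {table} USING iceberg AS SELECT * FROM {view}"
    return [create_view, ctas]


def load_csv_into_iceberg(spark, csv_path, table):
    """Run the statements on an existing SparkSession that has an Iceberg catalog."""
    for statement in csv_to_iceberg_statements(csv_path, table):
        spark.sql(statement)
```

With a session configured as in the question, load_csv_into_iceberg(spark, "/data/one.csv", "local.db.one") would run both statements; the same two strings can also be pasted into the spark-sql shell.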

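import java.util.concurrent.TimeUnit
import com.alibaba.fastjson.{JSON, JSONObject} // assumption: JSON/JSONObject come from Alibaba fastjson
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
val sparkConf = new SparkConf()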
sparkConf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
sparkConf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
sparkConf.set("spark.sql.catalog.spark_catalog.type", "hive")
sparkConf.set("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog")
sparkConf.set("spark.sql.catalog.hive_catalog.type", "hadoop")
sparkConf.set("spark.sql.catalog.hive_catalog.warehouse", "hdfs://host:port/user/hive/warehouse")
sparkConf.set("hive.metastore.uris", "thrift://host:19083")
sparkConf.set("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
sparkConf.set("spark.sql.catalog.hive_prod.type", "hive")
sparkConf.set("spark.sql.catalog.hive_prod.uri", "thrift://host:19083")
sparkConf.set("hive.metastore.warehouse.dir", "hdfs://host:port/user/hive/warehouse")
val spark: SparkSession = SparkSession.builder()
.enableHiveSupport()
.config(sparkConf)
.master("yarn")
.appName("kafkaTableTest")
.getOrCreate()
spark.sql(
"""
|
|create table if not exists hive_catalog.icebergdb.kafkatest1(
| company_id int,
| event string,
| event_time timestamp,
| position_id int,
| user_id int
|)using iceberg
|PARTITIONED BY (days(event_time))
|""".stripMargin)
import spark.implicits._
val df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka_server")
.option("subscribe", "topic")
.option("startingOffsets", "latest")
.load()
//.selectExpr("cast (value as string)")
val value: DataFrame = df.selectExpr("CAST(value AS STRING)")
.as[String]
.map(data => {
val json_str: JSONObject = JSON.parseObject(data)
val company_id: Integer = json_str.getInteger("company_id")
val event: String = json_str.getString("event")
val event_time: String = json_str.getString("event_time")
val position_id: Integer = json_str.getInteger("position_id")
val user_id: Integer = json_str.getInteger("user_id")
(company_id, event, event_time, position_id, user_id)
})
.toDF("company_id", "event", "event_time", "position_id", "user_id")
value.createOrReplaceTempView("table")
spark.sql(
"""
|select
| company_id,
| event,
| to_timestamp(event_time,'yyyy-MM-dd HH:mm:ss') as event_time,
| position_id,
| user_id
|from table
|""".stripMargin)
.writeStream
.format("iceberg")
.outputMode("append")
.trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
.option("path","hive_catalog.icebergdb.kafkatest1") // tablePath: catalog.db.tableName
.option("checkpointLocation","hdfspath")
.start()
.awaitTermination()
This example reads data from Kafka and writes it to an Iceberg table.

Related

Mapping Kafka to Spark dataFrame with the Schema

I have an application that runs queries on Kafka topics with a specified schema.
Below is my code:
SparkSession spark = SparkSession.builder()
.appName("Spark-Kafka-Integration")
.config("spark.master", "local")
.getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
// Mapping it to the schema
Dataset<Row> ds2 = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp"));
ds2.createOrReplaceTempView("ds2");
// Making a Row having timestamp and the values
Dataset<Row> ds3 = spark.sql("select rows.* , timestamp from ds2 ");
ds3.createOrReplaceTempView("table");
Dataset<Row> result2 = spark.sql(query.getQuery());
This runs fine. I now have a view named table that holds all the columns plus the timestamp, and I can run SQL such as Select column1 , column2 from table group by window(timestamp,'1 minutes'),column1 , column2
My question:
Is this an efficient way to do it? If I have multiple topics, i.e. .option("subscribe","topic1,topics2,..."), then I have to create multiple DataFrames in order to run a join query on them, and how should I handle the timestamp column?
With multiple topics I will have the following code:
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic1, topic2,....topicn")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
Dataset<Row> ds = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp")).where("topic=topic1");
Dataset<Row> ds1 = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp")).where("topic=topic2");
... and so on; I have to do the same for every other DataFrame.
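Note that the filters in the multi-topic snippet compare the topic column to an unquoted name, and they run after select has already dropped topic. A minimal sketch of one way to split a single multi-topic stream, written in PySpark and assuming a hypothetical schemas mapping from topic name to payload schema; topic_filter and split_by_topic are illustrative names, not part of the original code.

```python
import re

# Characters Kafka allows in topic names; anything else is rejected up front.
_TOPIC_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def topic_filter(topic):
    """Return a SQL predicate that selects the rows of one Kafka topic."""
    if not _TOPIC_NAME.match(topic):
        raise ValueError(f"unexpected topic name: {topic!r}")
    return f"topic = '{topic}'"


def split_by_topic(kafka_df, schemas):
    """Return {topic: parsed DataFrame} from one multi-topic Kafka stream.

    kafka_df is the DataFrame returned by readStream.format("kafka").load();
    schemas (hypothetical) maps each topic name to the schema of its JSON payload.
    """
    # Imported lazily so topic_filter can be used without Spark installed.
    from pyspark.sql.functions import col, from_json

    parsed = {}
    for topic, schema in schemas.items():
        parsed[topic] = (
            kafka_df.where(topic_filter(topic))  # filter while the topic column still exists
            .select(from_json(col("value").cast("string"), schema).alias("rows"), col("timestamp"))
            .select("rows.*", "timestamp")  # flatten the parsed struct, keep the Kafka timestamp
        )
    return parsed
```

Each DataFrame in the result keeps the Kafka timestamp column, so it can be registered as its own temp view and used in windowed or join queries.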

Is it possible to parse JSON string from Kafka topic in real time using Spark Streaming SQL?

I have a PySpark notebook that connects to a Kafka broker and creates a Spark writeStream called temp. The data values in the Kafka topic are in JSON format, but I'm not sure how to create a Spark SQL table that can parse this data in real time. The only way I know is to create a copy of the table, convert it into an RDD or DataFrame, and parse the value into another RDD or DataFrame. Is it possible to do this in real time as the stream is being written?
Code:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","localhost:9092") \
.option("subscribe","hoteth") \
.option("startingOffsets", "earliest") \
.load()
ds = df.selectExpr("CAST (key AS STRING)", "CAST(value AS STRING)", "timestamp")
ds.writeStream.queryName("temp").format("memory").start()
spark.sql("select * from temp limit 5").show()
Output:
+----+--------------------+--------------------+
| key| value| timestamp|
+----+--------------------+--------------------+
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
|null|{"e":"trade","E":...|2018-09-18 15:41:...|
+----+--------------------+--------------------+
One way I could solve this is to use lateral view json_tuple, just as in Hive HQL. I'm still looking for a solution that parses the data directly from the stream, so that no extra processing time is spent parsing in the query.
spark.sql("""
select value, v1.transaction,ticker,price
from temp
lateral view json_tuple(value,"e","s","p") v1 as transaction, ticker,price
limit 5
""").show()
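Another option is to parse the JSON on the stream itself with from_json, so the in-memory table already holds separate columns. A minimal sketch, assuming the payload fields e, s and p used in the json_tuple query above are all strings; TRADE_FIELDS, schema_ddl and parse_trades are hypothetical names.

```python
# Assumption: only the three fields used in the json_tuple query, all read as strings.
TRADE_FIELDS = [("e", "STRING"), ("s", "STRING"), ("p", "STRING")]


def schema_ddl(fields):
    """Render (name, type) pairs as a DDL schema string accepted by from_json."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


def parse_trades(kafka_df):
    """Turn the raw Kafka value column into transaction, ticker and price columns."""
    # Imported lazily so schema_ddl can be used without Spark installed.
    from pyspark.sql.functions import col, from_json

    payload = from_json(col("value").cast("string"), schema_ddl(TRADE_FIELDS))
    return kafka_df.select(
        payload.getField("e").alias("transaction"),
        payload.getField("s").alias("ticker"),
        payload.getField("p").alias("price"),
        col("timestamp"),
    )
```

Calling parse_trades(df).writeStream.queryName("temp").format("memory").start() in place of the ds.writeStream line would then let select * from temp return the parsed columns directly.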

Spark reading table after connection to HiveServer2 only gives schema not data

I try to connect to a remote hive cluster using the following code and I get the table data as expected
val spark = SparkSession
.builder()
.appName("adhocattempts")
.config("hive.metastore.uris", "thrift://<remote-host>:9083")
.enableHiveSupport()
.getOrCreate()
val seqdf=sql("select * from anon_seq")
seqdf.show
However, when I try to do this via HiveServer2, I get no data in my dataframe. This table is based on a sequencefile. Is that the issue, since I am actually trying to read this via jdbc?
val sparkJdbc = SparkSession.builder.appName("SparkHiveJob").getOrCreate
val sc = sparkJdbc.sparkContext
val sqlContext = sparkJdbc.sqlContext
val driverName = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driverName)
val df = sparkJdbc.read
.format("jdbc")
.option("url", "jdbc:hive2://<remote-host>:10000/default")
.option("dbtable", "anon_seq")
.load()
df.show()
Can someone help me understand the purpose of using HiveServer2 with jdbc and relevant drivers in Spark2?

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example using pyspark on how to run a custom Apache Phoenix SQL query and store the result of that query in a RDD or DF. Note: I am looking for a custom query and not an entire table to be read into a RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "<TABLENAME>") \
.option("zkUrl", "<hostname>:<port>") \
.load()
I want to know what the equivalent is for a custom SQL query, something like:
sqlResult = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("sql", "select * from <TABLENAME> where <CONDITION>") \
.option("zkUrl", "<HOSTNAME>:<PORT>") \
.load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
.options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.show()
Note, however, that if the SQL statement contains column aliases, .show() throws an exception (it works if you use .select() to pick the columns that are not aliased). This is possibly a bug in Phoenix.
You need to use .sql to work with custom queries. The syntax is:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
With Spark 2, I had no problem with .show(), and I did not need .select() to print all the values of a DataFrame coming from Phoenix.
Make sure that your SQL query is wrapped in parentheses, as in my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
.option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
.option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
.option("useUnicode", "true")
.option("continueBatchOnError", "true")
.option("dbtable", sql)
.load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+
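Both JDBC answers rely on passing a parenthesised subquery as the dbtable option. A small sketch of that wrapping in PySpark; as_dbtable and read_phoenix_query are hypothetical helper names, and the alias TEMP_TABLE simply follows the first answer.

```python
def as_dbtable(query, alias="TEMP_TABLE"):
    """Wrap a SELECT statement so it can be passed as the JDBC dbtable option."""
    # Drop surrounding whitespace and a trailing semicolon, which is not valid inside a subquery.
    stripped = query.strip().rstrip(";").rstrip()
    return f"({stripped}) AS {alias}"


def read_phoenix_query(spark, query, jdbc_url):
    """Load the result of a custom query through the Phoenix JDBC driver."""
    return (
        spark.read.format("jdbc")
        .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
        .option("url", jdbc_url)  # e.g. jdbc:phoenix:<HOSTNAME>:<PORT>
        .option("dbtable", as_dbtable(query))
        .load()
    )
```

For example, as_dbtable("select COL1, COL2 from TABLE where COL3 = 5") yields the same kind of string the first answer builds by hand.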

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application working to read from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method
df.saveAsTable(tablename,mode).
The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on creationdate (a column in the table).
Is there any way to dynamically partition the DataFrame and store it in the Hive warehouse? I want to avoid hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).
This question can be considered an extension of: How to save DataFrame directly to Hive?
Any help is much appreciated.
I believe it works something like this:
df is a dataframe with year, month and other columns
df.write.partitionBy('year', 'month').saveAsTable(...)
or
df.write.partitionBy('year', 'month').insertInto(...)
I was able to write to partitioned hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")
I had to enable the following properties to make it work.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
I faced the same issue and resolved it with the following tricks.
When a table is partitioned, the partition column becomes case sensitive.
The partition column must be present in the DataFrame with the same name (case sensitive). Code:
var dbName="your database name"
var finaltable="your table name"
sparkSession.sql("use " + dbName)
// Enable dynamic partitioning before creating or writing to the table.
sparkSession.sql("SET hive.exec.dynamic.partition = true")
sparkSession.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
sparkSession.sql("SET hive.exec.max.dynamic.partitions.pernode = 400")
// First check whether the table already exists.
if (sparkSession.sql("show tables in " + dbName).filter("tableName='" + finaltable + "'").collect().length == 0) {
// If the table does not exist, create it.
println("Table Not Present \n Creating table " + finaltable)
sparkSession.sql("create table " + dbName + "." + finaltable + "(EMP_ID string,EMP_Name string,EMP_Address string,EMP_Salary bigint) PARTITIONED BY (EMP_DEP STRING)")
}
// The table now exists, so insert the DataFrame in append mode.
df.write.mode(SaveMode.Append).insertInto(dbName + "." + finaltable)
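insertInto matches columns by position rather than by name, so the partition column has to come last in the DataFrame, as it does in the table definition. A small sketch of a helper that moves partition columns to the end; columns_for_insert is a hypothetical name, and the example column names are taken from the create table statement above. It does not reorder the remaining data columns, which still have to match the table's order.

```python
def columns_for_insert(df_columns, partition_columns):
    """Order columns so that the partition columns come last, as insertInto expects."""
    missing = [c for c in partition_columns if c not in df_columns]
    if missing:
        # Names are compared exactly, so a case mismatch shows up here.
        raise ValueError(f"partition columns missing from DataFrame: {missing}")
    data_columns = [c for c in df_columns if c not in partition_columns]
    return data_columns + list(partition_columns)
```

In PySpark this could be used as df.select(*columns_for_insert(df.columns, ["EMP_DEP"])) before calling write.mode("append").insertInto(...).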
It can be configured on the SparkSession like this:
spark = SparkSession \
.builder \
...
.config("spark.hadoop.hive.exec.dynamic.partition", "true") \
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
.enableHiveSupport() \
.getOrCreate()
Or you can add them to a .properties file.
The spark.hadoop prefix is required by the Spark config (at least in 2.4); here is how Spark applies these settings:
/**
* Appends spark.hadoop.* configurations from a [[SparkConf]] to a Hadoop
* configuration without the spark.hadoop. prefix.
*/
def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
SparkHadoopUtil.appendSparkHadoopConfigs(conf, hadoopConf)
}
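To avoid typing the prefix by hand, the Hive settings can be kept in a plain dictionary and prefixed when the session is built. A minimal sketch; with_spark_hadoop_prefix and apply_settings are hypothetical helpers, and builder is assumed to be a SparkSession.builder.

```python
HIVE_DYNAMIC_PARTITION_SETTINGS = {
    "hive.exec.dynamic.partition": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
}


def with_spark_hadoop_prefix(settings):
    """Prefix each key with spark.hadoop. so Spark copies it into the Hadoop configuration."""
    return {f"spark.hadoop.{key}": value for key, value in settings.items()}


def apply_settings(builder, settings):
    """Apply the prefixed settings to a SparkSession builder and return it."""
    for key, value in with_spark_hadoop_prefix(settings).items():
        builder = builder.config(key, value)
    return builder
```

Usage would look like apply_settings(SparkSession.builder, HIVE_DYNAMIC_PARTITION_SETTINGS).enableHiveSupport().getOrCreate().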
This is what works for me. I set these settings and then put the data in partitioned tables.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode",
"nonstrict")
This worked for me using python and spark 2.1.0.
Not sure if it's the best way to do this but it works...
# WRITE DATA INTO A HIVE TABLE
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[*]") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.enableHiveSupport() \
.getOrCreate()
### CREATE HIVE TABLE (with one row)
spark.sql("""
CREATE TABLE IF NOT EXISTS hive_df (col1 INT, col2 STRING, partition_bin INT)
USING HIVE OPTIONS(fileFormat 'PARQUET')
PARTITIONED BY (partition_bin)
LOCATION 'hive_df'
""")
spark.sql("""
INSERT INTO hive_df PARTITION (partition_bin = 0)
VALUES (0, 'init_record')
""")
###
### CREATE NON HIVE TABLE (with one row)
spark.sql("""
CREATE TABLE IF NOT EXISTS non_hive_df (col1 INT, col2 STRING, partition_bin INT)
USING PARQUET
PARTITIONED BY (partition_bin)
LOCATION 'non_hive_df'
""")
spark.sql("""
INSERT INTO non_hive_df PARTITION (partition_bin = 0)
VALUES (0, 'init_record')
""")
###
### ATTEMPT DYNAMIC OVERWRITE WITH EACH TABLE
spark.sql("""
INSERT OVERWRITE TABLE hive_df PARTITION (partition_bin)
VALUES (0, 'new_record', 1)
""")
spark.sql("""
INSERT OVERWRITE TABLE non_hive_df PARTITION (partition_bin)
VALUES (0, 'new_record', 1)
""")
spark.sql("SELECT * FROM hive_df").show() # 2 row dynamic overwrite
spark.sql("SELECT * FROM non_hive_df").show() # 1 row full table overwrite
(df1.write
.mode("append")
.format('ORC')
.partitionBy("date")
.option('path', '/hdfs_path')
.saveAsTable("DB.Partition_tablename"))
This creates partitions from the values of the "date" column and also writes the DataFrame from Spark as a Hive external table.
