create hive external table with schema in spark - apache-spark

I am using spark 1.6 and I aim to create external hive table like what I do in hive script. To do this, I first read in the partitioned avro file and get the schema of this file. Now I stopped here, I get no idea how to apply this schema to my creating table. I use scala. Need help guys.

finally, I make it myself with old-fashioned way. With the help of code below:
val rawSchema = sqlContext.read.avro("Path").schema
val schemaString = rawSchema.fields.map(field => field.name.replaceAll("""^_""", "").concat(" ").concat(field.dataType.typeName match {
case "integer" => "int"
case smt => smt
})).mkString(",\n")
val ddl =
s"""
|Create external table $tablename ($schemaString) \n
|partitioned by (y int, m int, d int, hh int, mm int) \n
|Stored As Avro \n
|-- inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \n
| -- outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' \n
| Location 'hdfs://$path'
""".stripMargin
take care no column name can start with _ and hive can't parse integer. I would like to say that this way is not flexible but work. if anyone get better idea, plz comment.

I didn't see a way to automatically infer schema for external tables. So I created case for the string type. You could add case for your data type. But I'm not sure how many columns you have. I apologize as this might not be a clean approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SaveMode};
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.read.format("com.databricks.spark.avro").load("people.avro")
val schema = results.schema.map( x => x.name.concat(" ").concat( x.dataType.toString() match { case "StringType" => "STRING"} ) ).mkString(",")
val hive_sql = "CREATE EXTERNAL TABLE people_and_age (" + schema + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/ravi/people_age'"
hiveContext.sql(hive_sql)
results.saveAsTable("people_age",SaveMode.Overwrite)
hiveContext.sql("select * from people_age").show()

Try the below code.
val htctx= new HiveContext(sc)
htctx.sql(create extetnal table tablename schema partitioned by attribute row format serde serde.jar field terminated by value location path)

Related

how to insert dataframe having map column in hive table

I have a dataframe with multiple columns out of which one column is map(string,string) type. I'm able to print this dataframe having column as map which gives data as Map("PUN" -> "Pune"). I want to write this dataframe to hive table (stored as avro) which has same column with type map.
Df.withcolumn("cname", lit("Pune"))
withcolumn("city_code_name", map(lit("PUN"), col("cname"))
Df.show(false)
//table - created external hive table..stored as avro..with avro schema
After removing this map type column I'm able to save the dataframe to hive avro table.
Save way to hive table:
spark.save - saving avro file
spark.sql - creating partition on hive table with avro file location
see this test case as an example from spark tests
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
also if you want save option then you can try with saveAsTable as showed here
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable
Example:
Df\
.write\
.saveAsTable(name='tableName',
format='com.databricks.spark.avro',
mode='append',
path='avroFileLocation')
Change the mode option to whatever suits you

How to convert a sql to spark dataset?

I have a Val test=sql ("Select * from table1) which returns a dataframe. I want to convert it to dataset which is not working.
test.toDS is throwing error.
Please provide more detail about the error.
If you want to convert a dataframe in dataset use the code below :
case class MyClass(field1: Int, field2: Long) // for example
val df = sql ("Select * from table1)
val ds : Dataset[MyClass] = df.as[MyClass]

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query:
describe formatted <table name>
I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample spark code that can obtain the location value?
Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier
lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
You can also use .toDF method on desc formatted table then filter from dataframe.
DataframeAPI:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD Api:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
First approach
You can use input_file_name with dataframe.
it will give you absolute file-path for a part file.
spark.read.table("zen.intent_master").select(input_file_name).take(1)
And then extract table path from it.
Second approach
Its more of hack you can say.
package org.apache.spark.sql.hive
import java.net.URI
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession
class TableDetail {
def getTableLocation(table: String, spark: SparkSession): URI = {
val sessionState: SessionState = spark.sessionState
val sharedState: SharedState = spark.sharedState
val catalog: SessionCatalog = sessionState.catalog
val sqlParser: ParserInterface = sessionState.sqlParser
val client = sharedState.externalCatalog match {
case catalog: HiveExternalCatalog => catalog.client
case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
"support hive client API")
}
val idtfr = sqlParser.parseTableIdentifier(table)
require(catalog.tableExists(idtfr), new IllegalArgumentException(idtfr + " done not exists"))
val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
rawTable.location
}
}
Here is how to do it in PySpark:
(spark.sql("desc formatted mydb.myschema")
.filter("col_name=='Location'")
.collect()[0].data_type)
Use this as re-usable function in your scala project
def getHiveTablePath(tableName: String, spark: SparkSession):String =
{
import org.apache.spark.sql.functions._
val sql: String = String.format("desc formatted %s", tableName)
val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
result.show(false) // just for debug purpose
val info: String = result.collect().mkString(",")
val path: String = info.split(',')(1)
path
}
caller would be
println(getHiveTablePath("src", spark)) // you can prefix schema if you have
Result (I executed in local so file:/ below if its hdfs hdfs:// will come):
+--------+------------------------------------+-------+
|col_name|data_type |comment|
+--------+--------------------------------------------+
|Location|file:/Users/hive/spark-warehouse/src| |
+--------+------------------------------------+-------+
file:/Users/hive/spark-warehouse/src
USE ExternalCatalog
scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#4eba6e1f
scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog#24b05292
scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock

SPARK read.jdbc & custom schema

With spark.read.format ... once can add the custom schema non-programmatically, like so:
val df = sqlContext
.read()
.format("jdbc")
.option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
.option("user", "root")
.option("password", "password")
.option("dbtable", sql)
.schema(customSchema)
.load();
However, using spark.read.jdbc, I cannot seem to do the same or find the syntax to do the same as for the above. What am i missing or has this changed in SPARK 2.x? I read this in the manual: ... Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. ... Presumably what I am trying to do is no longer possible as in the above example.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
I ended up trying this:
val dataframe_mysql = spark.read.schema(openPositionsSchema).jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
and got this:
org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`;
Seems a retrograde step in a certain way.
I do not agree with the answer.
You can supply custom schema using your method or by setting properties:
connectionProperties.put("customSchema", schemachanges);
Where schema changes in format "field Name" "New data type", ... :
"key String, value DECIMAL(20, 0)"
If key was an number in original table, it will generate an SQL query like "key::character varying, value::numeric(20, 0)"
It is better than a cast, because cast is a mapping operation executed after it selected in original type, custom schema is not.
I had a case, when spark can not select NaN from postgres Numeric, because it maps numerics into java BigDecimal which does not allow NaN, so spark job failed every time when reading those values. Cast produced the same result. However after changing a scheme to either String or Double, it was able to read it properly.
Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You can use a Custom schema and put in the properties parameters. You can read more at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Create a variable:
c_schema = 'id_type INT'
Properties conf:
config = {"user":"xxx",
"password": "yyy",
"driver":"com.mysql.jdbc.Driver",
"customSchema":c_schema}
Read the table and create the DF:
df = spark.read.jdbc(url=jdbc_url,table='table_name',properties=config)
You must use the same column name and it's going to change only the column
you put inside the customized schema.
. What am i missing or has this changed in SPARK 2.x?
You don't miss anything. Modifying schema on read with JDBC sources was never supported. The input is already typed so there there is no place for schema.
If the types are not satisfying, just cast the results to the desired types.

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all in spark. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is dataframe having the incremental data to be overwritten.
hdfs-base-path contains the master data.
When I try the above command, it deletes all the partitions, and inserts those present in df at the hdfs path.
What my requirement is to overwrite only those partitions present in df at the specified hdfs path. Can someone please help me in this?
Finally! This is now a feature in Spark 2.3.0:
SPARK-20236
To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite. Example:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,
df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")
If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
This works for me on AWS Glue ETL jobs (Glue 1.0 - Spark 2.4 - Python 2)
Adding 'overwrite=True' parameter in the insertInto statement solves this:
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)
By default overwrite=False. Changing it to True allows us to overwrite specific partitions contained in df and in the partioned_table. This helps us avoid overwriting the entire contents of the partioned_table with df.
Using Spark 1.6...
The HiveContext can simplify this process greatly. The key is that you must create the table in Hive first using a CREATE EXTERNAL TABLE statement with partitioning defined. For example:
# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'
From here, let's say you have a Dataframe with new records in it for a specific partition (or multiple partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE using this Dataframe, which will overwrite the table for only the partitions contained in the Dataframe:
# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')
hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
SELECT name, age
FROM update_dataframe""")
Note: update_dataframe in this example has a schema that matches that of the target test table.
One easy mistake to make with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and just make the table using the Dataframe API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE... PARTITION function.
Hope this helps.
Tested this on Spark 2.3.1 with Scala.
Most of the answers above are writing to a Hive table. However, I wanted to write directly to disk, which has an external hive table on top of this folder.
First the required configuration
val sparkSession: SparkSession = SparkSession
.builder
.enableHiveSupport()
.config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
.appName("spark_write_to_dynamic_partition_folders")
Usage here:
DataFrame
.write
.format("<required file format>")
.partitionBy("<partitioned column name>")
.mode(SaveMode.Overwrite) // This is required.
.save(s"<path_to_root_folder>")
I tried below approach to overwrite particular partition in HIVE table.
### load Data and check records
raw_df = spark.table("test.original")
raw_df.count()
lets say this table is partitioned based on column : **c_birth_year** and we would like to update the partition for year less than 1925
### Check data in few partitions.
sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
print "Number of records: ", sample.count()
sample.show()
### Back-up the partitions before deletion
raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")
### UDF : To delete particular partition.
def delete_part(table, part):
qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
spark.sql(qry)
### Delete partitions
part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
part_list = part_df.rdd.map(lambda x : x[0]).collect()
table = "test.original"
for p in part_list:
delete_part(table, p)
### Do the required Changes to the columns in partitions
df = spark.table("test.original_bkp")
newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
newdf.select("c_customer_sk", "c_preferred_cust_flag").show()
### Write the Partitions back to Original table
newdf.write.insertInto("test.original")
### Verify data in Original table
orginial.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()
Hope it helps.
Regards,
Neeraj
As jatin Wrote you can delete paritions from hive and from path and then append data
Since I was wasting too much time with it I added the following example for other spark users.
I used Scala with spark 2.2.1
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}
case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)
object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"
val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"
val someText = List("text1", "text2")
val ids = (0 until 5).toList
val listData = partitions1.flatMap(p1 => {
partitions2.flatMap(p2 => {
someText.flatMap(
text => {
ids.map(
id => DataExample(p1, p2, text, id)
)
}
)
}
)
})
val asDataFrame = spark.createDataFrame(listData)
//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
val p = new Path(path)
val fs = p.getFileSystem(new Configuration())
fs.delete(p, recursive)
}
def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
if (spark.catalog.tableExists(tableName)) {
//clean partitions
val asColumns = partitions.map(c => new Column(c))
val relevantPartitions = df.select(asColumns: _*).distinct().collect()
val partitionToRemove = relevantPartitions.map(row => {
val fields = row.schema.fields
s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
})
val cleanFolders = relevantPartitions.map(partition => {
val fields = partition.schema.fields
path + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
})
println(s"Going to clean ${partitionToRemove.size} partitions")
partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
cleanFolders.foreach(partition => deletePath(partition, true))
}
asDataFrame.write
.options(Map("path" -> myTablePath))
.mode(SaveMode.Append)
.partitionBy(partitionColumns: _*)
.saveAsTable(tableName)
}
//Now test
tableOverwrite(asDataFrame, partitionColumns, tableName)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, tableName)
import spark.implicits._
val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
println("Overwrite is working !!!")
}
}
If you use DataFrame, possibly you want to use Hive table over data.
In this case you need just call method
df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)
It'll overwrite partitions that DataFrame contains.
There's not necessity to specify format (orc), because Spark will use Hive table format.
It works fine in Spark version 1.6
Instead of writing to the target table directly, i would suggest you create a temporary table like the target table and insert your data there.
CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation';
Once the table is created, you would write your data to the tmpLocation
df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)
Then you would recover the table partition paths by executing:
MSCK REPAIR TABLE tmpTbl;
Get the partition paths by querying the Hive metadata like:
SHOW PARTITONS tmpTbl;
Delete these partitions from the trgtTbl and move the directories from tmpTbl to trgtTbl
I would suggest you doing clean-up and then writing new partitions with Append mode:
import scala.sys.process._
def deletePath(path: String): Unit = {
s"hdfs dfs -rm -r -skipTrash $path".!
}
df.select(partitionColumn).distinct.collect().foreach(p => {
val partition = p.getAs[String](partitionColumn)
deletePath(s"$path/$partitionColumn=$partition")
})
df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)
This will delete only new partitions. After writing data run this command if you need to update metastore:
sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")
Note: deletePath assumes that hfds command is available on your system.
My solution implies overwriting each specific partition starting from a spark dataframe. It skips the dropping partition part. I'm using pyspark>=3 and I'm writing on AWS s3:
def write_df_on_s3(df, s3_path, field, mode):
# get the list of unique field values
list_partitions = [x.asDict()[field] for x in df.select(field).distinct().collect()]
df_repartitioned = df.repartition(1,field)
for p in list_partitions:
# create dataframes by partition and send it to s3
df_to_send = df_repartitioned.where("{}='{}'".format(field,p))
df_to_send.write.mode(mode).parquet(s3_path+"/"+field+"={}/".format(p))
The arguments of this simple function are the df, the s3_path, the partition field, and the mode (overwrite or append). The first part gets the unique field values: it means that if I'm partitioning the df by daily, I get a list of all the dailies in the df. Then I'm repartition the df. Finally, I'm selecting the repartitioned df by each daily and I'm writing it on its specific partition path.
You can change the repartition integer by your needs.
You could do something like this to make the job reentrant (idempotent):
(tried this on spark 2.2)
# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)
# delete directory
dbutils.fs.rm(<partition_directoy>,recurse=True)
# Load the partition
df.write\
.partitionBy("partition_col")\
.saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
For >= Spark 2.3.0 :
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)

Resources