Load an RDD into hive - apache-spark

I want to load an RDD (key = table_name, value = content) into a Hive table partitioned by (year, month, day) using PySpark on Spark 1.6.x, following the logic of this SQL:
ALTER TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% DROP IF EXISTS PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
LOAD DATA INTO TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
Could someone please give some suggestions?

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Build a DataFrame from an RDD of tuples
rdd = spark.sparkContext.parallelize([(1, 'cat', '2016-12-20'), (2, 'dog', '2016-12-21')])
df = spark.createDataFrame(rdd, schema=['id', 'val', 'dt'])

# Write it as a Hive-managed ORC table partitioned by 'dt'
df.write.saveAsTable(name='default.test', format='orc', mode='overwrite', partitionBy='dt')
The key pieces are enableHiveSupport() on the session builder and df.write.saveAsTable().
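If you also need the drop-and-reload behaviour from the SQL in the question, here is a rough sketch via Spark SQL. The table name and partition values are placeholders, and it assumes df holds the rows to load, with its columns matching the table schema and the partition columns (year, month, day) last:
# Placeholder values; in practice they come from the file name and date
table, year, month, day = 'db_schema.my_table', 2016, 12, 20

# Mirror ALTER TABLE ... DROP IF EXISTS PARTITION
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
    "ALTER TABLE {t} DROP IF EXISTS PARTITION (year={y}, month={m}, day={d})"
    .format(t=table, y=year, m=month, d=day))

# Then append the new data into the existing partitioned table
df.write.insertInto(table, overwrite=False)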

Related

How to update Cassandra table with latest row, where Spark Dataframe is having multiple rows with same primary key?

We have a Cassandra table person:
CREATE TABLE test.person (
    name text PRIMARY KEY,
    score bigint
)
and the DataFrame is:
val caseClassDF = Seq(
  Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),
  Person("Andy1", 20), Person("Ron", 270), Person("Ron", 2700),
  Person("Mark1", 37), Person("Andy1", 200), Person("Andy1", 2000)
).toDF()
In Spark we want to save the DataFrame to this table, where the DataFrame has multiple records for the same primary key.
Q1: How does the Cassandra Connector internally handle the ordering of the rows?
Q2: We are reading data from Kafka and saving it to Cassandra, and our batches will always contain multiple events like the above. We want to save the latest score to Cassandra. Any suggestions on how we can achieve this?
The connector version we use is spark-cassandra-connector_2.12:3.2.1.
Here are some observations from our side:
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("CassandraConnector")
  .config("spark.cassandra.connection.host", "")
  .config("spark.cassandra.connection.port", "")
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .getOrCreate()

val caseClassDF = Seq(
  Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),
  Person("Andy1", 20), Person("Ron", 270), Person("Ron", 2700),
  Person("Mark1", 37), Person("Andy1", 200), Person("Andy1", 2000)
).toDF()

caseClassDF.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "test")
  .option("table", "person")
  .mode("append")
  .save()
When we have
.master("local[1]")
then in the Cassandra table we always see score 2000 for "Andy1" and 2700 for "Ron", which is the latest in the Seq.
Now when we change to
.master("local[*]") or .master("local[2]")
we see a seemingly random score in the Cassandra table, either 200 or 32 for "Andy1".
Note: we did each run on a fresh table, so it is always inserts and updates within one batch.
We want to save the latest score to Cassandra. Any suggestions on how we can achieve this?
Data in a DataFrame is by definition not ordered, and the write into Cassandra will reflect this (inserts and updates are the same thing in Cassandra): rows are written in arbitrary order and the last write wins.
If you want to write only the latest value (the one with the max score?), you will need to perform an aggregation over your data and, for a streaming job, use the update output mode to write the intermediate results of your streaming aggregation to Cassandra. Something like this:
caseClassDF.groupBy("name").agg(max("score")).write....
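As a rough PySpark illustration of the same idea (assuming df is a DataFrame with the same name/score rows; a streaming job would instead use the update output mode as mentioned above):
from pyspark.sql import functions as F

# Collapse to one row per primary key, keeping the highest score,
# so only that value reaches Cassandra
latest = df.groupBy("name").agg(F.max("score").alias("score"))

latest.write \
    .format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "test") \
    .option("table", "person") \
    .mode("append") \
    .save()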

PySpark Partition Dataframe created from HiveContext

I'm fetching data through HiveContext and creating DataFrames. To get a performance benefit I want to partition the DataFrames before applying the join operation. How do I partition the data on the 'ID' column and then join on 'ID'?
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveCtx = HiveContext(spark)
df1 = hiveCtx.sql("select id,name,address from db.table1")
df2 = hiveCtx.sql("select id,name,marks from db.table2")
I need to perform the following operations on the data:
partition the DataFrames by 'ID'
join by 'ID'
You can use repartition.
See the Spark docs: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=repartition#pyspark.sql.DataFrame.repartition
Choose the number of partitions based on your data size.
df1 = df1.repartition(7, "id")
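For example, a minimal sketch reusing df1 and df2 from above (the partition count 7 is just an illustration; pick it based on data size):
# Repartition both sides on the join key so that matching ids
# land in the same partition before the join
df1 = df1.repartition(7, "id")
df2 = df2.repartition(7, "id")

# Join on the same key
joined = df1.join(df2, on="id", how="inner")
joined.show()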

how to insert dataframe having map column in hive table

I have a dataframe with multiple columns, one of which is of type map(string, string). I'm able to print this dataframe, and the map column shows data like Map("PUN" -> "Pune"). I want to write this dataframe to a Hive table (stored as Avro) that has the same column with type map.
Df.withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
  .show(false)
// table: created an external Hive table, stored as Avro, with an Avro schema
After removing this map-type column I'm able to save the dataframe to the Hive Avro table.
The way I save to the Hive table:
spark.save - saving the Avro file
spark.sql - creating the partition on the Hive table with the Avro file location
See this test case from the Spark test suite as an example:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
  .write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable
Example:
Df \
    .write \
    .saveAsTable(name='tableName',
                 format='com.databricks.spark.avro',
                 mode='append',
                 path='avroFileLocation')
Change the mode option to whatever suits you
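Putting the pieces together, a rough PySpark sketch (the table name, partition column dt, output path, and the availability of the spark-avro package are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, create_map

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical DataFrame with a map<string,string> column
df = spark.createDataFrame([(1, '2016-12-20')], ['id', 'dt']) \
    .withColumn('cname', lit('Pune')) \
    .withColumn('city_code_name', create_map(lit('PUN'), col('cname')))

# Write it as a partitioned Avro table
df.write \
    .partitionBy('dt') \
    .saveAsTable(name='default.city_avro',
                 format='com.databricks.spark.avro',
                 mode='append',
                 path='/tmp/avroFileLocation')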

Spark SQL 'SHOW TABLE EXTENDED FROM db LIKE table' gives different result from Hive - apache-spark

spark.sql("SHOW TABLE EXTENDED IN DB LIKE 'TABLE'")
Beeline >> SHOW TABLE EXTENDED IN DB LIKE 'TABLE';
The two queries return different results: Format and lastUpdatedTime are missing from the Spark SQL output.
If anyone has an idea, please let me know how to see the lastUpdatedTime of a Hive table from Spark SQL.
Try this -
scala> val df = spark.sql(s"describe extended ${db}.${table_name}").select("data_type").where("col_name == 'Table Properties'")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [data_type: string]
scala> df.map(r => r.getString(0).split(",")(1).trim).collect
res39: Array[String] = Array(last_modified_time=1539848078)
scala> df.map(r => r.getString(0).split(",")(1).trim.split("=")(1)).collect.mkString
res41: String = 1539848078
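For comparison, a rough PySpark version of the same lookup (db.table_name is a placeholder, and the parsing assumes the Table Properties value is a comma-separated key=value list like the output above):
# Grab the 'Table Properties' row from DESCRIBE EXTENDED and pull out last_modified_time
props = spark.sql("describe extended db.table_name") \
    .where("col_name = 'Table Properties'") \
    .select("data_type") \
    .first()[0]

last_modified = dict(
    kv.split("=", 1) for kv in props.strip("[]").split(", ") if "=" in kv
).get("last_modified_time")
print(last_modified)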

How to convert a table into a Spark Dataframe

In Spark SQL, a dataframe can be queried as a table using this:
sqlContext.registerDataFrameAsTable(df, "mytable")
Assuming what I have is mytable, how can I get or access this as a DataFrame?
The cleanest way:
df = sqlContext.table("mytable")
(See the SQLContext.table documentation.)
You can also query it and save the result into a variable; note that SQLContext's sql method returns a DataFrame.
df = sqlContext.sql("SELECT * FROM mytable")
