PySpark Partition Dataframe created from HiveContext - apache-spark

I'm fetching data through HiveContext and creating DataFrames. To gain performance benefits I want to partition the DataFrames before applying the join operation. How do I partition the data on the 'ID' column and then join on 'ID'?
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveCtx = HiveContext(spark)
df1 = hiveCtx.sql("select id,name,address from db.table1")
df2 = hiveCtx.sql("select id,name,marks from db.table2")
I need to perform the following operations on the data:
Dataframe partitionBy 'ID'
Join by 'ID'

You can use repartition.
Refer to the Spark docs: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=repartition#pyspark.sql.DataFrame.repartition
Choose the number of partitions based on your data size.
df1 = df1.repartition(7, "id")
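For the join itself, a minimal sketch building on the snippet above (the partition count 7 is arbitrary and should be tuned to your data volume):
# Repartition both DataFrames on the join key so matching ids are co-located,
# then join on that key.
df1 = df1.repartition(7, "id")
df2 = df2.repartition(7, "id")
joined = df1.join(df2, on="id", how="inner")
joined.explain()  # inspect the physical plan to see how the partitioning is used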

Related

spark join raises “Detected implicit cartesian product for INNER join between logical plans”

I have a situation where I have a dataframe df
and let's say I do the following steps:
df1 = df
df2 = df
and then write a query which uses df1 and df2 in the join, e.g.
df3 = df1.join(df2, df1["column"] == df2["column"])
This is nothing but a self join, which is widely needed in ETL.
Why does Spark not handle it correctly?
I have seen many posts, but none of them provides a workaround.
Update:
If I load the dataframes df1 and df2 from the same S3 location and then perform the join, the issue goes away. But when doing ETL it may not always be possible to persist the data first and then re-read it just to avoid this scenario.
Any thoughts?
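One commonly suggested workaround, shown here only as a hedged sketch (the column name is illustrative): break the shared lineage by renaming the join column on one side so the two sides no longer resolve to the same attribute, or, as a last resort, explicitly allow the cross join.
# Hedged sketch, assuming df has a column named "column":
df1 = df
df2 = df.withColumnRenamed("column", "column_r")  # the rename gives the right side a distinct attribute

df3 = df1.join(df2, df1["column"] == df2["column_r"])

# Alternative (use with care): accept the plan Spark flags as a cross join.
# spark.conf.set("spark.sql.crossJoin.enabled", "true")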

How to insert a dataframe having a map column into a hive table

I have a dataframe with multiple columns, one of which is of type map(string,string). I'm able to print this dataframe with the map column, which shows data such as Map("PUN" -> "Pune"). I want to write this dataframe to a hive table (stored as avro) which has the same column with type map.
Df.withcolumn("cname", lit("Pune"))
withcolumn("city_code_name", map(lit("PUN"), col("cname"))
Df.show(false)
//table - created external hive table..stored as avro..with avro schema
After removing this map type column I'm able to save the dataframe to hive avro table.
The way I save to the hive table:
spark save - save the avro file
spark.sql - create the partition on the hive table pointing at the avro file location
See this test case from the Spark tests as an example:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable
Example:
Df\
  .write\
  .saveAsTable(name='tableName',
               format='com.databricks.spark.avro',
               mode='append',
               path='avroFileLocation')
Change the mode option to whatever suits you.
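If you also want Hive partitions as mentioned above, a hedged sketch combining partitionBy with saveAsTable ('load_date' is a hypothetical partition column, not from the question):
# Hedged sketch: 'load_date' is an assumed partition column on the dataframe.
Df\
  .write\
  .partitionBy('load_date')\
  .saveAsTable(name='tableName',
               format='com.databricks.spark.avro',
               mode='append',
               path='avroFileLocation')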

INSERT IF NOT EXISTS ELSE UPDATE in Spark SQL

Is there any provision for doing "INSERT IF NOT EXISTS ELSE UPDATE" in Spark SQL?
I have Spark SQL table "ABC" that has some records.
And then I have another batch of records that I want to insert/update in this table based on whether they already exist there or not.
Is there a SQL command that I can use in a SQL query to make this happen?
In regular Spark this could be achieved with a join followed by a map like this:
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("id1", "original"), ("id2", "original"))).toDF("df1_id", "df1_status")
val df2 = spark.sparkContext.parallelize(List(("id1", "new"), ("id3", "new"))).toDF("df2_id", "df2_status")
val df3 = df1
  .join(df2, 'df1_id === 'df2_id, "outer")
  .map(row => {
    if (row.isNullAt(2))
      (row.getString(0), row.getString(1))
    else
      (row.getString(2), row.getString(3))
  })
This yields:
scala> df3.show
+---+--------+
| _1| _2|
+---+--------+
|id3| new|
|id1| new|
|id2|original|
+---+--------+
You could also use select with udfs instead of map, but in this particular case with null-values, I personally prefer the map variant.
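A rough PySpark equivalent of the same join-then-pick logic, offered as a sketch (it uses coalesce instead of a typed map and gives precedence to the new records in df2):
from pyspark.sql import functions as F

df1 = spark.createDataFrame([("id1", "original"), ("id2", "original")], ["df1_id", "df1_status"])
df2 = spark.createDataFrame([("id1", "new"), ("id3", "new")], ["df2_id", "df2_status"])

# Full outer join, then prefer the df2 value where it exists, otherwise keep df1's.
df3 = (df1.join(df2, df1["df1_id"] == df2["df2_id"], "outer")
       .select(F.coalesce(df2["df2_id"], df1["df1_id"]).alias("id"),
               F.coalesce(df2["df2_status"], df1["df1_status"]).alias("status")))
df3.show()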
You can use Spark SQL like this:
select * from (
  select c.*, row_number() over (partition by tac order by tag desc) as TAG_NUM
  from (
    select
      a.tac
      ,a.name
      ,0 as tag
    from tableA a
    union all
    select
      b.tac
      ,b.name
      ,1 as tag
    from tableB b
  ) c
) d
where TAG_NUM = 1
tac is the column you want to insert/update by.
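A hedged usage sketch for the query above (existing_df, new_batch_df, merge_query, and the target table name are illustrative, not from the question):
# Register both sets of records as temp views so the SQL above can reference them.
existing_df.createOrReplaceTempView("tableA")   # records already in the table
new_batch_df.createOrReplaceTempView("tableB")  # incoming batch
merged = spark.sql(merge_query)                 # merge_query holds the SQL text shown above
merged.write.mode("overwrite").saveAsTable("ABC_merged")  # hypothetical target table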
I know it's a bit late to share my code, but to add or update my database, I wrote a function that looks like this:
import pandas as pd

# Returns a spark dataframe with added and updated data
# The key parameter is the primary key of the dataframes
# The two parameters dfToUpdate and dfToAddAndUpdate are spark dataframes
def AddOrUpdateDf(dfToUpdate, dfToAddAndUpdate, key):
    # Cast the spark dataframe dfToUpdate to a pandas dataframe
    dfToUpdatePandas = dfToUpdate.toPandas()
    # Cast the spark dataframe dfToAddAndUpdate to a pandas dataframe
    dfToAddAndUpdatePandas = dfToAddAndUpdate.toPandas()
    # Update the table records with the latest records, adding new records where there are any
    AddOrUpdatePandasDf = pd.concat([dfToUpdatePandas, dfToAddAndUpdatePandas]).drop_duplicates([key], keep='last').sort_values(key)
    # Cast back to get a spark dataframe
    AddOrUpdateDf = spark.createDataFrame(AddOrUpdatePandasDf)
    return AddOrUpdateDf
As you can see, we need to convert the Spark dataframes to pandas dataframes to be able to do the pd.concat, and especially the drop_duplicates with keep='last'; then we convert back to a Spark dataframe and return it.
I don't think this is the best way to handle the add-or-update, but at least it works.
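As a hedged alternative that stays in Spark (no pandas round trip), the same "keep the latest row per key" idea can be sketched with a window function, assuming both dataframes share the same schema; function and column names here are illustrative:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def add_or_update_df(df_to_update, df_to_add_and_update, key):
    # Tag each source, union them, and keep the newest row per key.
    unioned = (df_to_update.withColumn("_priority", F.lit(0))
               .unionByName(df_to_add_and_update.withColumn("_priority", F.lit(1))))
    w = Window.partitionBy(key).orderBy(F.col("_priority").desc())
    return (unioned.withColumn("_rn", F.row_number().over(w))
            .filter(F.col("_rn") == 1)
            .drop("_rn", "_priority"))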

Dataframe partitioning pruning

I have two Hive tables (table_a, table_b) which are partitioned by build_date (an integer in the format YYYYMMDD) and stored as Parquet with Snappy compression.
I use these tables in my PySpark program as follows:
hiveCtx = HiveContext(sc)
df1 = hiveCtx.table('table_a').filter(func.col('build_date') == 20170101)
df2 = hiveCtx.table('table_b')
df3 = df1.join(df2, df1.build_date == df2.build_date)
When I do a df3.explain() it reads all the partitions in table_b. How do I make it read only that particular partition?
I have also set the Hive property set('spark.sql.hive.convertMetastoreParquet', 'false').
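One hedged workaround, given that the join condition alone is not pushing the literal build_date down to table_b here: apply the same literal filter to table_b before joining.
from pyspark.sql import functions as func

build_date = 20170101
df1 = hiveCtx.table('table_a').filter(func.col('build_date') == build_date)
df2 = hiveCtx.table('table_b').filter(func.col('build_date') == build_date)  # prune table_b explicitly
df3 = df1.join(df2, df1.build_date == df2.build_date)
df3.explain()  # the plan should now show partition filters on both tables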

Load an RDD into hive

I want to load an RDD (k=table_name, v=content) into a Hive table partitioned by (year, month, day) using PySpark on Spark 1.6.x,
all while trying to use the logic of this SQL query:
ALTER TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% DROP IF EXISTS PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
LOAD DATA INTO TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
Could someone please give some suggestions?
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
rdd = spark.sparkContext.parallelize([(1, 'cat', '2016-12-20'), (2, 'dog', '2016-12-21')])
df = spark.createDataFrame(rdd, schema=['id', 'val', 'dt'])
df.write.saveAsTable(name='default.test', format='orc', mode='overwrite', partitionBy='dt')
Using enableHiveSupport() and df.write.saveAsTable()
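To get the (year, month, day) layout from the question, a hedged extension of the answer above: derive the partition columns from the 'dt' column before writing (column and table names follow the example data).
from pyspark.sql import functions as F

# Derive year/month/day partition columns from the 'dt' date string.
df = (df.withColumn('year', F.year('dt'))
        .withColumn('month', F.month('dt'))
        .withColumn('day', F.dayofmonth('dt')))

df.write.saveAsTable(name='default.test',
                     format='orc',
                     mode='overwrite',
                     partitionBy=['year', 'month', 'day'])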
