Loading a spark dataframe into Hive partition - apache-spark

I'm trying to load a dataframe into a Hive table which is partitioned as below.
> create table emptab(id int, name String, salary int, dept String)
> partitioned by (location String)
> row format delimited
> fields terminated by ','
> stored as parquet;
I have a dataframe created in the below format:
val empfile = sc.textFile("emp")
val empdata = empfile.map(e => e.split(","))
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empdata.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=England")
But I'm getting an error as below:
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=England")
java.lang.RuntimeException: [1.1] failure: identifier expected
/user/hive/warehouse/emptab/location=England
Data in "emp" file:
+---+-------+------+-----+
| id| name|salary| dept|
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
Also, this is the first time I am loading data into this empty, partitioned Hive table; I am trying to create the partition while loading the data into the Hive table.
Could anyone tell me what mistake I am making here and how I can correct it?

This is the wrong approach.
The string you are passing to insertInto is not valid: insertInto expects a table name, not a partition path in the warehouse.
What you have to do is:
val empDF = empRDD.toDF()
val empDFFiltered = empDF.filter(empDF("location") === "India")
empDFFiltered.write.partitionBy("location").insertInto("emptab")
The partition path will be handled by partitionBy; if you only want to add data to the India partition, you should filter the India rows from your dataframe first.
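For completeness, here is a minimal end-to-end sketch of how the whole flow could look (not the author's exact code). It assumes Spark 2.x with Hive support and that every row of the "emp" file belongs to a single partition, so the location value is added as a literal; with insertInto the partition is resolved from the data itself, so the partition column only needs to be the last column of the dataframe.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("emp-load")  // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Dynamic partition inserts usually need these settings (they may already be set on your cluster).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

val empDF = spark.sparkContext.textFile("emp")
  .map(_.split(","))
  .map(e => (e(0).toInt, e(1), e(2).toInt, e(3)))
  .toDF("id", "name", "salary", "dept")

// Add the partition column as the last column and insert by table name, not by warehouse path.
empDF.withColumn("location", lit("India"))
  .write
  .insertInto("emptab")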

Related

Does inserting into a Cassandra unset cell create a tombstone?

I am trying to enable invalidation of old measurements while keeping them in my Cassandra setup. Given the following table structure:
ID|Test|result|valid|valid2
1 | 1 | 10 | False| unset
2 | 1 | 11 | True| False
3 | 1 | 12 | True| True
with primary key (ID,test)
Now, if I insert the following Spark dataframe using the connector as normal with mode("append"):
ID|Test|valid2
1 | 1 | False
Will this create a tombstone? The purpose is to be able to "invalidate" certain rows in my tables when necessary. I understand tombstones are created when cells are outdated. But since there is no value in the cell, will a tombstone be created?
Tombstones are created when you perform an explicit DELETE, insert a null value, or the data is TTLed.
If you don't specify a value for a specific column, the data for that cell is simply not set, and any previous data won't be overwritten until you explicitly set it to null. In Spark the situation is usually different: by default it inserts nulls, unless you set spark.cassandra.output.ignoreNulls to true, in which case nulls are treated as unset and the previous data won't be overwritten.
But when you specify an incomplete row, only the provided pieces are updated, keeping the previous data intact.
If we have following table and data:
create table test.v2(id int primary key, valid boolean, v int);
insert into test.v2(id, valid, v) values(2,True, 2);
insert into test.v2(id, valid, v) values(1,True, 1);
we can check that the data is visible in Spark:
scala> val data = spark.read.cassandraFormat("v2", "test").load()
data: org.apache.spark.sql.DataFrame = [id: int, v: int ... 1 more field]
scala> data.show
+---+---+-----+
| id| v|valid|
+---+---+-----+
| 1| 1| true|
| 2| 2| true|
+---+---+-----+
Now update the data:
scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
scala> val newData = Seq((2, false)).toDF("id", "valid")
newData: org.apache.spark.sql.DataFrame = [id: int, valid: boolean]
scala> newData.write.cassandraFormat("v2", "test").mode(SaveMode.Append).save()
scala> data.show
+---+---+-----+
| id| v|valid|
+---+---+-----+
| 1| 1| true|
| 2| 2|false|
+---+---+-----+
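To get the "unset" behaviour described above from a Spark write, the ignoreNulls option mentioned earlier can be enabled on the writer. A small sketch, assuming the same test.v2 table and a 2.x Spark Cassandra Connector (the exact way the option is passed can vary by connector version):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.cassandra._
import spark.implicits._

// With ignoreNulls=true, nulls in the dataframe are treated as "unset":
// existing cell values are kept and no tombstones are written for them.
val partialUpdate = Seq((2, false)).toDF("id", "valid")
partialUpdate.write
  .cassandraFormat("v2", "test")
  .option("spark.cassandra.output.ignoreNulls", "true")
  .mode(SaveMode.Append)
  .save()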

How to change multiple column values to a constant with out specifying all column names?

I have a very wide dataframe in Spark. It has 80 columns, so I want to set one column to 0 and the rest to 1.
For the one I want to set to 0, I tried
df = df.withColumn("set_zero_column", lit(0))
and it worked.
Now I want to set the rest of the columns to 1. How do I do that without specifying all 79 names?
Any help is appreciated
Use select with a list comprehension:
from pyspark.sql.functions import lit
set_one_columns = [lit(1).alias(c) for c in df.columns if c != "set_zero_column"]
df = df.select(lit(0).alias("set_zero_column"), *set_one_columns)
If you needed to maintain the original column order, you could do:
cols = [lit(0).alias(c) if c == "set_zero_column" else lit(1).alias(c) for c in df.columns]
df = df.select(*cols)
I will try to answer in Scala:
Example:
Method1:
//sample dataframe
val df=Seq(("a",1)).toDF("id","id1")
//map over all columns, pairing each with its literal value (0 for id1, 1 for the rest)
val cls=df.columns.map(x => if (x != "id1") (x,lit("1")) else (x,lit("0")))
//use foldLeft and add columns dynamically
val df2=cls.foldLeft(df){(df,cls) => df.withColumn(cls._1,cls._2)}
Result:
df2.show()
+---+---+
| id|id1|
+---+---+
| 1| 0|
+---+---+
Method 2: Pault's approach :)
val cls=df.columns.map( x => if (x !="id1") lit(1).alias(s"${x}") else lit(0).alias(s"${x}"))
Result:
df.select(cls:_*).show()
+---+---+
| id|id1|
+---+---+
| 1| 0|
+---+---+
I am still new to Spark SQL, and this may not be the most efficient way to handle this scenario, but I will be glad if it helps or can be further improved.
This is how I was able to do it in Java.
Step 1:
Create a SparkSession and load your file into a dataframe.
code:
public void process() throws AnalysisException {
SparkSession session = new SparkSession.Builder()
.appName("Untyped Agregation on data frame")
.master("local")
.getOrCreate();
//Load the file that you need to compute.
Dataset<Row> peopledf = session.read()
.option("header","true")
.option("delimiter"," ")
.csv("src/main/resources/Person.txt");
Output:
+--------+---+--------+
| name|age|property|
+--------+---+--------+
| Gaurav| 27| 1|
| Dheeraj| 30| 1|
| Saloni| 26| 1|
| Deepak| 30| 1|
| Db| 25| 1|
|Praneeth| 24| 1|
| jyoti| 26| 1|
+--------+---+--------+
Step 2 (optional):
In case you need to provide a constant value to any one column.
code:
//in case you need to change the value for a single column.
Dataset<Row> peopledf1 = peopledf.withColumn("property",lit("0"));
peopledf1.show();
output:
+--------+---+--------+
| name|age|property|
+--------+---+--------+
| Gaurav| 27| 0|
| Dheeraj| 30| 0|
| Saloni| 26| 0|
| Deepak| 30| 0|
| Db| 25| 0|
|Praneeth| 24| 0|
| jyoti| 26| 0|
+--------+---+--------+
Step 3:
Get the String array of all the column names in your data frame.
code:
//Get the list of all the coloumns
String[] myStringArray = peopledf1.columns();
Step 4:
Filter out of the array the column you don't want to apply the constant to, then build a List of the required column names and a matching List of lit("0") constants for withColumns.
code:
//create two lists: one being the names of the columns you need to compute,
//the other being the same size (same number of elements as the column list)
//and filled with lit("0"), i.e. the constant.
//filter out the column that you don't want to apply the constant to.
List<String> myList = new ArrayList<String>();
List<Column> myList1 = new ArrayList<Column>();
for(String element : myStringArray){
if(!(element.contains("name"))){
myList.add(element);
myList1.add(lit("0"));
}
}
Step 5:
Convert the Lists to Scala Seqs, as the withColumns method requires its arguments in that form.
code:
//convert both lists into a Scala Seq<Column> and Seq<String> respectively.
//Needed because the withColumns method requires arguments in Seq form.
//check the Scala doc for withColumns.
Seq<Column> mySeq1 = convertListToSeq(myList1);
Seq<String> mySeq= convertListToSeq1(myList);
code for convertListToSeq using JavaConverters:
//Use JavaConverters to convert a List to a Scala Seq using the methods below.
public Seq<String> convertListToSeq1(List<String> inputList) {
return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
public Seq<Column> convertListToSeq(List<Column> inputList) {
return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
Step 6:
Print the output to the console.
code:
//Display the required output on console.
peopledf1.withColumns(mySeq,mySeq1).show();
output:
+--------+---+--------+
| name|age|property|
+--------+---+--------+
| Gaurav| 0| 0|
| Dheeraj| 0| 0|
| Saloni| 0| 0|
| Deepak| 0| 0|
| Db| 0| 0|
|Praneeth| 0| 0|
| jyoti| 0| 0|
+--------+---+--------+
Please do comment if code can be improved further.
Happy Learning,
Gaurav

Spark Dataframe issue in overwriting the partition data of Hive table

Below is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test2(
id integer,
count integer
)
PARTITIONED BY (
fac STRING,
fiscaldate_str DATE )
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';
I have the data in the Hive table as below (I just inserted sample data):
select * from default.test2
+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
| 2| 3| NRM| 2019-01-01|
| 1| 2| NRM| 2019-01-01|
| 2| 3| NRM| 2019-01-02|
| 1| 2| NRM| 2019-01-02|
| 2| 3| NRM| 2019-01-03|
| 1| 2| NRM| 2019-01-03|
| 2| 3|STST| 2019-01-01|
| 1| 2|STST| 2019-01-01|
| 2| 3|STST| 2019-01-02|
| 1| 2|STST| 2019-01-02|
| 2| 3|STST| 2019-01-03|
| 1| 2|STST| 2019-01-03|
+---+-----+----+--------------+
This table is partitioned on two columns (fac, fiscaldate_str), and we are trying to dynamically execute an insert overwrite at the partition level using Spark dataframes (the dataframe writer).
However, when trying this, we either end up with duplicate data, or all the other partitions get deleted.
Below are the code snippets for this using Spark dataframes.
First I create the dataframe:
df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99 |NRM|2019-01-01 |
|999|999 |NRM|2019-01-01 |
+---+-----+---+--------------+
I get duplicates with the snippet below:
df.coalesce(1).write.mode("overwrite").insertInto("default.test2")
With the following, all the other data gets deleted and only the new data is available:
df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")
OR
df.createOrReplaceTempView("tempview")
tbl_ald_kpiv_hist_insert = spark.sql("""
INSERT OVERWRITE TABLE default.test2
partition(fac,fiscaldate_str)
select * from tempview
""")
I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.
Does anyone have any idea why I am not able to dynamically overwrite the data in the partitions?
Your question is not easy to follow, but I think you mean you want a single partition overwritten. If so, then this is what you need, and all you need - the second line:
df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2",overwrite=True)
Note the overwrite=True. The comment made is neither here nor there, as the DF.writer is being used. I am not addressing the coalesce(1).
Comment to Asker
I ran this as I usually do when prototyping and answering here - on a Databricks notebook - and expressly set the following, and it worked fine:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")
You asked me to update the answer with:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
I can do so, as I have just done; maybe in your environment this is needed, but I certainly did not need to do so.
UPDATE 19/3/20
This worked on prior Spark releases; now the following applies, afaics:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks the settings below did not matter
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
Seq(("CompanyA1", "A"), ("CompanyA2", "A"),
("CompanyB1", "B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
val df = Seq(("CompanyA3", "A"))
.toDF("company", "id")
// disregard the coalesce
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)
All OK this way from 2.4.x onwards.

Compare two dataset and get what fields are changed

I am working on Spark using Java, where I download data from an API and compare it with MongoDB data; the downloaded JSON has 15-20 fields, but the database has 300 fields.
My task is to compare the downloaded JSONs with the MongoDB data and find whichever fields have changed relative to the past data.
Sample data set
Downloaded data from API
StudentId,Name,Phone,Email
1,tony,123,a#g.com
2,stark,456,b#g.com
3,spidy,789,c#g.com
Mongodb data
StudentId,Name,Phone,Email,State,City
1,tony,1234,a#g.com,NY,Nowhere
2,stark,456,bg#g.com,NY,Nowhere
3,spidy,789,c#g.com,OH,Nowhere
I can't use except, because the column counts differ.
Expected output
StudentId,Name,Phone,Email,Past_Phone,Past_Email
1,tony,123,a#g.com,1234, //phone number only changed
2,stark,456,b#g.com,,bg#g.com //Email only changed
3,spidy,789,c#g.com,,
Assuming your data is in two dataframes, we can create temporary views for them as shown below:
api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")
Next we can use Spark SQL. Here, we join both these views using the StudentId column and then use a case statement on top of them to compute the past phone number and email.
spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()
Output:
+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone| Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
| 1| tony| 123|a#g.com| 1234| |
| 2|stark| 456|b#g.com| | bg#g.com|
| 3|spidy| 789|c#g.com| | |
+---------+-----+-----+-------+----------+----------+
Please find the sample source code below. Here I am taking only the phone number condition as an example.
val list = List((1,"tony",123,"a#g.com"), (2,"stark",456,"b#g.com"),
(3,"spidy",789,"c#g.com"))
val df1 = list.toDF("StudentId","Name","Phone","Email")
.select('StudentId as "StudentId_1", 'Name as "Name_1",'Phone as "Phone_1",
'Email as "Email_1")
df1.show()
val list1 = List((1,"tony",1234,"a#g.com","NY","Nowhere"),
(2,"stark",456,"bg#g.com", "NY", "Nowhere"),
(3,"spidy",789,"c#g.com","OH","Nowhere"))
val df2 = list1.toDF("StudentId","Name","Phone","Email","State","City")
.select('StudentId as "StudentId_2", 'Name as "Name_2", 'Phone as "Phone_2",
'Email as "Email_2", 'State as "State_2", 'City as "City_2")
df2.show()
val df3 = df1.join(df2, df1("StudentId_1") ===
df2("StudentId_2")).where(df1("Phone_1") =!= df2("Phone_2"))
df3.withColumnRenamed("Phone_1", "Past_Phone").show()
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|StudentId_1|Name_1|Past_Phone|Email_1|StudentId_2|Name_2|Phone_2|Email_2|State_2| City_2|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
| 1| tony| 123|a#g.com| 1| tony| 1234|a#g.com| NY|Nowhere|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
We have :
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
After Join :
var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))
Then
var ans = jn.withColumn("Past_Phone", when(jn("Phone_2").notEqual(jn("Phone_1")),jn("Phone_1")).otherwise("")).withColumn("Past_Email", when(jn("Email_2").notEqual(jn("Email_1")),jn("Email_1")).otherwise(""))
Reference : Spark: Add column to dataframe conditionally
Next :
ans.select(ans("StudentId_2") as "StudentId",ans("Name_2") as "Name",ans("Phone_2") as "Phone",ans("Email_2") as "Email",ans("Past_Email"),ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a#g.com| | 123|
| 2|stark| 456|bg#g.com| b#g.com| |
| 3|spidy| 789| c#g.com| | |
+---------+-----+-----+--------+----------+----------+
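Since the real data has 15-20 API fields against roughly 300 in MongoDB, the same case/when idea can be generated for all shared columns instead of being hand-written per column. A rough Scala sketch, reusing the api_df and mongo_df names from the first answer and assuming StudentId is the join key (the column names are assumptions; adjust to your schema):
import org.apache.spark.sql.functions.{col, lit, when}

// Compare every column the API dataframe has, except the join key.
val compareCols = api_df.columns.filterNot(_ == "StudentId")

// For each column, emit Past_<col>: empty when unchanged, otherwise the old Mongo value.
val pastCols = compareCols.map { c =>
  when(api_df(c) === mongo_df(c), lit("")).otherwise(mongo_df(c)).alias(s"Past_$c")
}

val joined = api_df.join(mongo_df, Seq("StudentId"))
val selectCols = col("StudentId") +: (compareCols.map(api_df(_)) ++ pastCols)
joined.select(selectCols: _*).orderBy("StudentId").show()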

Reading Hive Tables in Spark Dataframe without header

I have the following Hive table:
select * from employee;
OK
abc 19 da
xyz 25 sa
pqr 30 er
suv 45 dr
When I read this in Spark (PySpark):
df = hiveCtx.sql('select * from spark_hive.employee')
df.show()
+----+----+-----+
|name| age| role|
+----+----+-----+
|name|null| role|
| abc| 19| da|
| xyz| 25| sa|
| pqr| 30| er|
| suv| 45| dr|
+----+----+-----+
I end up getting the header in my Spark DataFrame. Is there a simple way to remove it?
Also, am I missing something while reading the table into the DataFrame? (Ideally I shouldn't be getting the header, right?)
You have to remove the header from the result. You can do it like this:
scala> val df = sql("select * from employee")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df.show
+----+----+----+
| id|name| age|
+----+----+----+
|null|name|null|
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+----+----+----+
scala> val header = df.first()
header: org.apache.spark.sql.Row = [null,name,null]
scala> val data = df.filter(row => row != header)
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, name: string ... 1 more field]
scala> data.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+---+----+---+
Thanks.
You can use skip.header.line.count to skip this header. You could also specify it while creating the table. For example:
create external table testtable ( id int,name string, age int)
row format delimited .............
tblproperties ("skip.header.line.count"="1");
After that, load the data and then check your query; I hope you will get the expected output.
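If the table already exists, the same property can be applied afterwards. A small sketch, run through Spark SQL and using the table name from the question (note that whether Spark's own reader honours this property can depend on the Spark version):
// Add the header-skip property to the existing table, then re-read it.
spark.sql("ALTER TABLE spark_hive.employee SET TBLPROPERTIES ('skip.header.line.count'='1')")
spark.sql("REFRESH TABLE spark_hive.employee")
spark.sql("SELECT * FROM spark_hive.employee").show()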
Not the most elegant way, but this worked for me with PySpark:
header = dfemp.first()
rddWithoutHeader = dfemp.rdd.filter(lambda line: line != header)
dfnew = sqlContext.createDataFrame(rddWithoutHeader)
