Create folder wise structure in Delta Format on HDFS - delta-lake

I am consuming Kafka data that has an "eventtime" (datetime) field in each packet. I want to create HDFS directories in a "year/month/day" structure while streaming, based on the date part of the eventtime field.
I am using delta-core_2.11:0.6.1 with Spark 2.4.
Example :
/temp/deltalake/data/project_1/2022/12/1
/temp/deltalake/data/project_1/2022/12/2
.
.
and so on.
The closest thing I found to my requirement was partitionBy(Keys) in the Delta Lake documentation.
That creates the data in this format: /temp/deltalake/data/project_1/year=2022/month=12/day=1
data.show():
+----+-------+-----+-------+---+-------------------+----------+
|S_No|section| Name|   City|Age|          eventtime|      date|
+----+-------+-----+-------+---+-------------------+----------+
|   1|      a|Name1| Indore| 25|2022-02-10 23:30:14|2022-02-10|
|   2|      b|Name2|  Delhi| 25|2021-08-12 10:50:10|2021-08-12|
|   3|      c|Name3| Ranchi| 30|2022-12-10 15:00:00|2022-12-10|
|   4|      d|Name4|Kolkata| 30|2022-05-10 00:30:00|2022-05-10|
|   5|      e|Name5| Mumbai| 30|2022-07-01 10:32:12|2022-07-01|
+----+-------+-----+-------+---+-------------------+----------+
data
  .write
  .format("delta")
  .mode("overwrite")
  .option("mergeSchema", "true")
  .partitionBy(Keys)
  .save("/temp/deltalake/data/project_1/")
But this did not work either. I referred to the Medium article below:
https://medium.com/#aravinthR/partitioned-delta-lake-part-3-5cc52b64ebda
It would be great if anyone could help me figure out a possible solution.
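One way to get plain year/month/day directories, without the year=/month=/day= prefixes that partitionBy produces, would be to compute the target directory from eventtime yourself and write each day's rows to that path (for example from a foreachBatch handler). The following is only a minimal pure-Python sketch of the path computation, not a Delta Lake API: the helper names day_path and partition_style_path are hypothetical, and it assumes eventtime strings in the "yyyy-MM-dd HH:mm:ss" form shown above. Keep in mind that writing Delta data to a separate directory per day would most likely give one independent Delta table per directory rather than a single partitioned table.

```python
from datetime import datetime

BASE_PATH = "/temp/deltalake/data/project_1"  # base directory taken from the question


def _parse(eventtime: str) -> datetime:
    # Assumption: eventtime looks like "2022-12-01 15:00:00"
    return datetime.strptime(eventtime, "%Y-%m-%d %H:%M:%S")


def day_path(eventtime: str, base: str = BASE_PATH) -> str:
    """Return base/year/month/day with unpadded month and day."""
    ts = _parse(eventtime)
    return f"{base}/{ts.year}/{ts.month}/{ts.day}"


def partition_style_path(eventtime: str, base: str = BASE_PATH) -> str:
    """Return the key=value layout that a partitioned write would use."""
    ts = _parse(eventtime)
    return f"{base}/year={ts.year}/month={ts.month}/day={ts.day}"


print(day_path("2022-12-01 15:00:00"))
print(partition_style_path("2022-12-01 15:00:00"))
```

The first function mirrors the layout asked for in the question; the second mirrors the layout described for partitionBy, so the two can be compared side by side.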

Related

How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns including a type, date and value column.
Example DataFrame:
+----+----------+-----+
|type|      date|value|
+----+----------+-----+
|   1|2020-01-21|    6|
|   1|2020-01-16|    5|
|   2|2020-01-20|    8|
|   2|2020-01-15|    4|
+----+----------+-----+
I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest approaches when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like
stream.groupby('type').agg(last('date'), last('value')).writeStream
but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregations is also not supported in Structured Streaming.
Do you have any suggestions on how to approach this challenge?
Simply apply the to_timestamp() function (available via from pyspark.sql.functions import *) to the date column so that you can use the window function.
e.g.
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])
df.printSchema()
# Cast the string column to a timestamp and display the result
df.withColumn("timestamp", to_timestamp("input_timestamp")).show(truncate=False)
+---+---------------+-------------------+
|id |input_timestamp|timestamp          |
+---+---------------+-------------------+
|1  |2020-01-21     |2020-01-21 00:00:00|
+---+---------------+-------------------+
"but using windows on non-timestamp columns is not supported"
Are you saying this from a streaming point of view? On a batch DataFrame I am able to do exactly that.
Here is the solution to your problem.
from pyspark.sql import Window
from pyspark.sql import functions as F
# Rank the rows of each type by date (ascending), so the newest row gets the highest rank
windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-16|    5|   1|
|   1|2020-01-21|    6|   2|
|   2|2020-01-15|    4|   1|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+
# Keep only the row with the highest rank per type
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-21|    6|   2|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+
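The rank-over-window approach above applies to a batch DataFrame; streaming DataFrames reject non-time-based windows. One option that might fit the streaming case is to run the same "newest row per type" reduction on each micro-batch (for instance inside a foreachBatch handler) and merge the result into a running state. The following is a minimal pure-Python sketch of that reduction only, with a hypothetical helper name merge_latest; it does not use Spark and is not a drop-in streaming implementation.

```python
from datetime import date


def merge_latest(state, batch):
    """Update state (type -> (date, value)) with the newest row per type in batch."""
    for row_type, row_date, value in batch:
        current = state.get(row_type)
        # Keep the incoming row only if it is newer than what is already stored
        if current is None or row_date > current[0]:
            state[row_type] = (row_date, value)
    return state


# Two micro-batches built from the example rows in the question
state = {}
merge_latest(state, [(1, date(2020, 1, 21), 6), (1, date(2020, 1, 16), 5)])
merge_latest(state, [(2, date(2020, 1, 20), 8), (2, date(2020, 1, 15), 4)])
print(state)
```

Because the comparison is done on the date itself, the result does not depend on the order in which rows arrive within or across batches.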

Spark Dataframe issue in overwriting the partition data of Hive table

Below is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test2(
id integer,
count integer
)
PARTITIONED BY (
fac STRING,
fiscaldate_str DATE )
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';
The Hive table contains the following data (I just inserted sample data):
select * from default.test2
+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
|  2|    3| NRM|    2019-01-01|
|  1|    2| NRM|    2019-01-01|
|  2|    3| NRM|    2019-01-02|
|  1|    2| NRM|    2019-01-02|
|  2|    3| NRM|    2019-01-03|
|  1|    2| NRM|    2019-01-03|
|  2|    3|STST|    2019-01-01|
|  1|    2|STST|    2019-01-01|
|  2|    3|STST|    2019-01-02|
|  1|    2|STST|    2019-01-02|
|  2|    3|STST|    2019-01-03|
|  1|    2|STST|    2019-01-03|
+---+-----+----+--------------+
This table is partitioned on two columns (fac, fiscaldate_str), and we are trying to dynamically execute an insert overwrite at partition level using the Spark DataFrame writer.
However, when trying this, we end up either with duplicate data or with all other partitions deleted.
Below are the code snippets for this using Spark DataFrames.
First I create the DataFrame:
df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99   |NRM|2019-01-01    |
|999|999  |NRM|2019-01-01    |
+---+-----+---+--------------+
The snippet below produces duplicates:
df.coalesce(1).write.mode("overwrite").insertInto("default.test2")
With either of the following, all other data gets deleted and only the new data remains:
df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")
OR
df.createOrReplaceTempView("tempview")
tbl_ald_kpiv_hist_insert = spark.sql("""
INSERT OVERWRITE TABLE default.test2
partition(fac,fiscaldate_str)
select * from tempview
""")
I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.
Does anyone have any idea why I am not able to dynamically overwrite the data in the partitions?
Your question is not that easy to follow, but I think you mean that you want a partition overwritten. If so, then all you need is the second line below:
df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2",overwrite=True)
Note the overwrite=True. The comment made is neither here nor there, as the DF.writer is being used. I am not addressing the coalesce(1).
Comment to Asker
I ran this as I usually do when prototyping and answering here, on a Databricks notebook, with the following set explicitly, and it worked fine:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")
You asked me to update the answer with:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
I can do so, as I just have; maybe it is needed in your environment, but I certainly did not need it.
UPDATE 19/3/20
This worked on prior Spark releases; now the following applies, as far as I can see:
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for toDF outside a notebook
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks the settings below did not matter
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
Seq(("CompanyA1", "A"), ("CompanyA2", "A"),
    ("CompanyB1", "B"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("id")
  .saveAsTable("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
val df = Seq(("CompanyA3", "A"))
  .toDF("company", "id")
// disregard coalesce
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)
This works from 2.4.x onwards.
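The difference between the two values of spark.sql.sources.partitionOverwriteMode can be pictured with a toy model. The sketch below is plain Python, not Spark code, and the function name overwrite is hypothetical; it only assumes the documented behaviour that a static overwrite without an explicit partition spec clears the existing partitions before writing, while a dynamic overwrite replaces only the partitions that receive new rows.

```python
def overwrite(table, new_rows, mode):
    """Toy model. table maps a partition key to its rows; returns a new table."""
    incoming = {}
    for partition, row in new_rows:
        incoming.setdefault(partition, []).append(row)
    if mode == "static":
        # Every existing partition is dropped; only the incoming data survives
        return incoming
    if mode == "dynamic":
        # Only partitions that appear in the incoming data are replaced
        result = dict(table)
        result.update(incoming)
        return result
    raise ValueError(f"unknown mode: {mode}")


# Partition key is (fac, fiscaldate_str); rows are (id, count) as in the question
table = {
    ("NRM", "2019-01-01"): [(2, 3), (1, 2)],
    ("NRM", "2019-01-02"): [(2, 3), (1, 2)],
    ("STST", "2019-01-01"): [(2, 3), (1, 2)],
}
new_rows = [
    (("NRM", "2019-01-01"), (99, 99)),
    (("NRM", "2019-01-01"), (999, 999)),
]

static_result = overwrite(table, new_rows, "static")
dynamic_result = overwrite(table, new_rows, "dynamic")
print(len(static_result), len(dynamic_result))
```

In this model the static result matches the "all other data gets deleted" symptom from the question, while the dynamic result keeps the untouched partitions.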

Spark SQL doesn't respect the DataFrame format

I'm analyzing Twitter files in JSON format with Spark SQL, with the goal of extracting the trending topics.
After taking all the text from each tweet and splitting it into words, my DataFrame looks like this:
+--------------------+--------------------+
|                line|               words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...|                  RT|
|[RT, #ONLYRPE:, #...|           #ONLYRPE:|
|[RT, #ONLYRPE:, #...|               #tlrp|
|[RT, #ONLYRPE:, #...|           followan?|
I only need the words column, so I convert my table to a temp view.
df.createOrReplaceTempView("Twitter_test_2")
With Spark SQL it should be easy to get the trending topics: I just need a query that uses the LIKE operator in the WHERE clause (words like '#%').
spark.sql("""
    select words,
           count(words) as count
    from words_Twitter
    where words like '#%'
    group by words
    order by count desc
    limit 10
""").show(20, False)
But I'm getting some strange results that I can't explain.
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason, in the two rows that contain Arabic characters (the ones with counts 89 and 32), the values are not where they should be: the text appears to have been swapped with the counter.
Other times I run into this kind of format:
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After that query, the output looks strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is this happening, and how can I avoid it?
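One possible explanation, offered as an assumption rather than a diagnosis: the second output suggests that some tokens contain embedded newline characters (for example "#rt" followed by a line break and "2"), which makes show() print a single row across two lines. The rows with Arabic text may simply be rendered right-to-left by the terminal, so the count looks swapped even though the data is intact. The sketch below is plain Python with a hypothetical helper name hashtags and a made-up sample tweet; it shows how splitting on any whitespace, rather than on a single space, keeps line breaks out of the tokens.

```python
def hashtags(text):
    """Return the hashtag tokens of a tweet, split on any whitespace."""
    # str.split() with no argument splits on runs of spaces, tabs and newlines,
    # so no token can contain a line break
    return [word for word in text.split() if word.startswith("#")]


# Made-up sample text with line breaks inside it
tweet = "RT #ONLYRPE: #tlrp\nfollow #rt\n2"

naive_tokens = tweet.split(" ")  # splitting on a single space keeps "\n" inside tokens
clean_tokens = hashtags(tweet)

print(naive_tokens)
print(clean_tokens)
```

If the words column is produced in Spark by splitting on a literal space, splitting on a whitespace pattern instead would be the analogous change.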

Spark (or pyspark) columns content shuffle with GroupBy

I'm working with Spark 2.2.0.
I have a DataFrame holding more than 20 columns. In the example below, PERIOD is a week number and TYPE is a type of store (Hypermarket or Supermarket).
table.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE| etc......
+--------------------+-------------------+-----------------+
| W1| HM|
| W2| SM|
| W3| HM|
etc...
I want to do a simple groupby (here with pyspark, but Scala or pyspark-sql give the same results)
total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...| BORGO| 1|
| C ATHIS MONS| ATHIS MONS CEDEX| 1|
| CMA BOSC LE HARD| BOSC LE HARD| 1|
The problem is not in the calculation; the columns got mixed up: PERIOD holds store names, TYPE holds cities, and so on.
I have no clue why. Everything else works fine.

Loading a spark dataframe into Hive partition

I'm trying to load a DataFrame into a Hive table that is partitioned as shown below.
> create table emptab(id int, name String, salary int, dept String)
> partitioned by (location String)
> row format delimited
> fields terminated by ','
> stored as parquet;
I have a dataframe created in the below format:
val empfile = sc.textFile("emp")
val empdata = empfile.map(e => e.split(","))
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empdata.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=England")
But I'm getting the error below:
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=India")
java.lang.RuntimeException: [1.1] failure: identifier expected
/user/hive/warehouse/emptab/location=England
Data in "emp" file:
+---+-------+------+-----+
| id|   name|salary| dept|
+---+-------+------+-----+
|  1|   Mark|  1000|   HR|
|  2|  Peter|  1200|SALES|
|  3|  Henry|  1500|   HR|
|  4|   Adam|  2000|   IT|
|  5|  Steve|  2500|   IT|
|  6|  Brian|  2700|   IT|
|  7|Michael|  3000|   HR|
|  8|  Steve| 10000|SALES|
|  9|  Peter|  7000|   HR|
| 10|    Dan|  6000|   BS|
+---+-------+------+-----+
Also, this is the first time I am loading data into this empty, partitioned Hive table. I am trying to create a partition while loading the data into the Hive table.
Could anyone tell me what mistake I am making here and how I can correct it?
This is the wrong approach.
The partition path you pass is not a valid argument here: insertInto expects a table name, not a Hadoop path, which is why the parser reports "identifier expected".
What you have to do is:
val empDF = empRDD.toDF()
// Assumes empDF carries a location column; keep only the rows for the target partition
val empDFFiltered = empDF.filter(empDF("location") === "India")
empDFFiltered.write.partitionBy("location").insertInto("emptab")
The partition directory is handled by partitionBy; if you only want to add data to the India partition, filter the India rows from your DataFrame first.
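A related point to keep in mind: insertInto resolves columns by position rather than by name, and in a partitioned Hive table the partition columns come after the data columns. The sketch below is plain Python, and insert_order is a hypothetical helper, not a Spark API; it computes the column order a DataFrame would need before a position-based insert and reports a missing partition column, which is the situation of the employee case class above since it has no location field.

```python
def insert_order(data_columns, partition_columns):
    """Return the column order expected by a position-based insert."""
    missing = [p for p in partition_columns if p not in data_columns]
    if missing:
        raise ValueError(f"DataFrame lacks partition column(s): {missing}")
    # Data columns keep their relative order; partition columns go last
    non_partition = [c for c in data_columns if c not in partition_columns]
    return non_partition + list(partition_columns)


order = insert_order(["location", "id", "name", "salary", "dept"], ["location"])
print(order)
```

The resulting list could then be used to select the columns in that order before calling insertInto with the table name.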

Resources