I am pushing some initial bulk data into a Hudi table, and then every day I write incremental data into it. But if back-dated data arrives, the latest precombine value already in the table is ignored and the arriving record, whose precombine value is older, overwrites it.
I write a data frame containing the following data with the following configs:
+---+-----+-------------+
| id| req|dms_timestamp|
+---+-----+-------------+
| 1| one| 2022-12-17|
| 2| two| 2022-12-17|
| 3|three| 2022-12-17|
+---+-----+-------------+
"className"-> "org.apache.hudi",
"hoodie.datasource.write.precombine.field"-> "dms_timestamp",
"hoodie.datasource.write.recordkey.field"-> "id",
"hoodie.table.name"-> "hudi_test",
"hoodie.consistency.check.enabled"-> "false",
"hoodie.datasource.write.reconcile.schema"-> "true",
"path"-> basePath,
"hoodie.datasource.write.keygenerator.class"-> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field"-> "",
"hoodie.datasource.write.hive_style_partitioning"-> "true",
"hoodie.upsert.shuffle.parallelism"-> "1",
"hoodie.datasource.write.operation"-> "upsert",
"hoodie.cleaner.policy"-> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained"-> "5",
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+-----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name |id |req |dms_timestamp|
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+-----+-------------+
|20221214130513893 |20221214130513893_0_0|id:3 | |005674e6-a581-419a-b8c7-b2282986bc52-0_0-36-34_20221214130513893.parquet|3 |three|2022-12-17 |
|20221214130513893 |20221214130513893_0_1|id:1 | |005674e6-a581-419a-b8c7-b2282986bc52-0_0-36-34_20221214130513893.parquet|1 |one |2022-12-17 |
|20221214130513893 |20221214130513893_0_2|id:2 | |005674e6-a581-419a-b8c7-b2282986bc52-0_0-36-34_20221214130513893.parquet|2 |two |2022-12-17 |
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+-----+-------------+
Then, in the next run, I upsert the following data:
+---+----+-------------+
| id| req|dms_timestamp|
+---+----+-------------+
| 1|null| 2019-01-01|
+---+----+-------------+
"hoodie.table.name"-> "hudi_test",
"hoodie.datasource.write.recordkey.field" -> "id",
"hoodie.datasource.write.precombine.field" -> "dms_timestamp",
// get_common_config
"className"-> "org.apache.hudi",
"hoodie.datasource.hive_sync.use_jdbc"-> "false",
"hoodie.consistency.check.enabled"-> "false",
"hoodie.datasource.write.reconcile.schema"-> "true",
"path"-> basePath,
// get_partitionDataConfig -- no partitionfield
"hoodie.datasource.write.keygenerator.class"-> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field"-> "",
"hoodie.datasource.write.hive_style_partitioning"-> "true",
// get_incrementalWriteConfig
"hoodie.upsert.shuffle.parallelism"-> "1",
"hoodie.datasource.write.operation"-> "upsert",
"hoodie.cleaner.policy"-> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained"-> "5",
and I get this table:
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+-----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name |id |req |dms_timestamp|
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+-----+-------------+
|20221214131440563 |20221214131440563_0_0|id:3 | |37dee403-6077-4a01-bf28-7afd65ef390a-0_0-18-21_20221214131555500.parquet|3 |three|2022-12-17 |
|20221214131555500 |20221214131555500_0_1|id:1 | |37dee403-6077-4a01-bf28-7afd65ef390a-0_0-18-21_20221214131555500.parquet|1 |null |2019-01-01 |
|20221214131440563 |20221214131440563_0_2|id:2 | |37dee403-6077-4a01-bf28-7afd65ef390a-0_0-18-21_20221214131555500.parquet|2 |two |2022-12-17 |
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+-----+-------------+
This should not happen, as this is back-dated data arriving late in the stream. How do I handle this?
By default, Hudi uses org.apache.hudi.common.model.OverwriteWithLatestAvroPayload as the payload class. With this class, Hudi uses the precombine field only to deduplicate the incoming batch (the precombine step); it then overwrites the existing record with the new one without comparing the precombine field values.
If you want to always keep the record with the latest precombine value (the last updated record), you need to add this configuration:
"hoodie.datasource.write.payload.class" -> "org.apache.hudi.common.model.DefaultHoodieRecordPayload"
Related
I have a dataframe that has distinct 'send' and 'receive' rows. I need to combine these rows into a single one with RECEIVE and SEND columns, using PySpark. Notice that the ID is the same for both lines, and the action identifier is ACTION_CD:
Original dataframe:
+------------------------------------+------------------------+---------+--------------------+
|ID |MSG_DT |ACTION_CD|MESSAGE |
+------------------------------------+------------------------+---------+--------------------+
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.552Z|receive |Oi |
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.852Z|send |Olá! |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.565Z|receive |4 |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.688Z|send |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.408Z|receive |1 |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.479Z|send |⭐️*Antes de você ir |
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:52.798Z|receive |788884 |
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:57.435Z|send |Agora |
+------------------------------------+------------------------+---------+--------------------+
How I need:
+------------------------------------+------------------------+-------+-------------------+
|ID |MSG_DT |RECEIVE|SEND |
+------------------------------------+------------------------+-------+-------------------+
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.552Z|Oi |Olá! |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.565Z|4 |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.408Z|1 |⭐️*Antes de você ir|
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:52.798Z|788884 |Agora |
+------------------------------------+------------------------+-------+-------------------+
P.S.: The MSG_DT should be taken from the earliest of the two records.
You can construct RECEIVE and SEND by applying the first expression over computed columns that are created depending on ACTION_CD.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("d2636151-b95e-4845-8014-0a113c381ff9", "2022-08-07T21:24:54.552Z", "receive", "Oi",),
("d2636151-b95e-4845-8014-0a113c381ff9", "2022-08-07T21:24:54.852Z", "send", "Olá!",),
("4241224b-9ba5-4eda-8e16-7e3aeaacf164", "2022-08-07T21:25:06.565Z", "receive", "4",),
("4241224b-9ba5-4eda-8e16-7e3aeaacf164", "2022-08-07T21:25:06.688Z", "send", "Certo",),
("bd46c6fb-1315-4418-9943-2e7d3151f788", "2022-08-07T21:25:30.408Z", "receive", "1",),
("bd46c6fb-1315-4418-9943-2e7d3151f788", "2022-08-07T21:25:30.479Z", "send", "️*Antes de você ir",),
("14da8519-6e4c-4edc-88ea-e33c14533dd9", "2022-08-07T21:25:52.798Z", "receive", "788884",),
("14da8519-6e4c-4edc-88ea-e33c14533dd9", "2022-08-07T21:25:57.435Z", "send", "Agora",), ]
df = spark.createDataFrame(data, ("ID", "MSG_DT", "ACTION_CD", "MESSAGE")).withColumn("MSG_DT", F.to_timestamp("MSG_DT"))
ws = W.partitionBy("ID").orderBy("MSG_DT")
first_rows = ws.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
action_column_selection = lambda action: F.first(F.when(F.col("ACTION_CD") == action, F.col("MESSAGE")), ignorenulls=True).over(first_rows)
(df.select("*",
action_column_selection("receive").alias("RECEIVE"),
action_column_selection("send").alias("SEND"),
F.row_number().over(ws).alias("rn"))
.where("rn = 1")
.drop("ACTION_CD", "MESSAGE", "rn")).show(truncate=False)
"""
+------------------------------------+-----------------------+-------+------------------+
|ID |MSG_DT |RECEIVE|SEND |
+------------------------------------+-----------------------+-------+------------------+
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07 23:25:52.798|788884 |Agora |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07 23:25:06.565|4 |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07 23:25:30.408|1 |⭐️*Antes de você ir|
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07 23:24:54.552|Oi |Olá! |
+------------------------------------+-----------------------+-------+------------------+
"""
How does Spark SQL implement a group-by aggregate? I want to group by the name field and, based on the latest date, get the latest salary. How do I write the SQL?
The data is:
+----+------+-------+
|name|salary|date   |
+----+------+-------+
|AA  |3000  |2022-01|
|AA  |4500  |2022-02|
|BB  |3500  |2022-01|
|BB  |4000  |2022-02|
+----+------+-------+
The expected result is:
+----+------+
|name|salary|
+----+------+
|AA  |4500  |
|BB  |4000  |
+----+------+
Assuming that the dataframe is registered as a temporary view named tmp: first use the row_number windowing function to assign a row number (rn) within each group (name), ordered by date in descending order, and then take all the rows with rn = 1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
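Equivalently, a sketch of the same row_number logic with the DataFrame API (Scala here; df is the dataframe registered as tmp, and a PySpark version is analogous):

// Sketch: same approach as the SQL above, expressed with the DataFrame API.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy("name").orderBy(col("date").desc)
df.withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .select("name", "salary")
  .show(false)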
First, convert your string to a date.
Then convert the date to a Unix timestamp (a numeric representation of a date, so you can use max).
Finally, use first as an aggregate function to retrieve a value from your aggregated results. (It takes the first result, so if there is a date tie, it could pull either one.)
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
from pyspark.sql import functions as F

(df.withColumn(
        "dateUpdated",
        F.unix_timestamp(F.to_date(F.col("updated"), "yyyy-MM-dd")))
   .groupBy("name")
   .agg(
        F.max("dateUpdated"),
        # note: first() is not tied to max(dateUpdated); row order within a group is not guaranteed
        F.first("salary").alias("Salary"))
   .show())
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
My usual trick is to "zip" date and salary together (it depends on what you want to sort by first)
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+
I'm using Databricks 9.1. I've got a json file with values:
{"property_before":0 ,"nested_object":{ "property_int":0, "property_double":1.0}, "property_after":0}
and schema defined as:
{
"fields":[
{"name":"property_before", "type":"integer"}
,{"name":"nested_object", "type":{ "fields":[
{"name":"property_int", "type":"integer"}
,{"name":"property_double", "type":"integer"}
], "type":"struct"}}
,{"name":"property_after", "type":"integer"}
],"type":"struct"
}
You can see that in nested_object there is property_double: 1.0, but the schema defines that property as an integer. As a result, when reading the json file I get:
+---------------+-------------+--------------+
|property_before|nested_object|property_after|
+---------------+-------------+--------------+
|0 |null |null |
+---------------+-------------+--------------+
Is there a way to either:
end up with a null value only for property_double, as it's the only one whose type doesn't match, or
end up with the value 1 for property_double, in the sense of an implicit conversion?
Thanks!
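For reference, a minimal Scala sketch of the read that produces the table above (the file path is hypothetical; the schema mirrors the JSON schema definition shown earlier):

// Sketch: build the schema from the question and use it to read the file.
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("property_before", IntegerType),
  StructField("nested_object", StructType(Seq(
    StructField("property_int", IntegerType),
    StructField("property_double", IntegerType) // mismatches the 1.0 in the data
  ))),
  StructField("property_after", IntegerType)
))

spark.read
  .schema(schema)
  .json("/path/to/file.json") // hypothetical path; the read mode defaults to PERMISSIVE
  .show(false)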
I have some text. From each line I want to filter out everything after some stop word. For example:
stop_words=['with','is', '/']
One of the rows is:
senior manager with experience
I want to remove everything after with (including with) so the output is:
senior manager
I have big data and am working with Spark in Python.
You can find the location of the stop words using instr, and get a substring up to that location.
import pyspark.sql.functions as F
stop_words = ['with', 'is', '/']
df = spark.createDataFrame([
['senior manager with experience'],
['is good'],
['xxx//'],
['other text']
]).toDF('col')
df.show(truncate=False)
+------------------------------+
|col |
+------------------------------+
|senior manager with experience|
|is good |
|xxx// |
|other text |
+------------------------------+
df2 = df.withColumn('idx',
F.coalesce(
# Get the smallest index of a stop word in the string
F.least(*[F.when(F.instr('col', s) != 0, F.instr('col', s)) for s in stop_words]),
# If no stop words found, get the whole string
F.length('col') + 1)
).selectExpr('trim(substring(col, 1, idx-1)) col')
df2.show()
+--------------+
| col|
+--------------+
|senior manager|
| |
| xxx|
| other text|
+--------------+
You can use a udf to get the index of the first occurrence of a stop word in col, and then, using one more udf, substring the col message.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val stop_words = List("with", "is", "/")
val df = List("senior manager with experience", "is good", "xxx//", "other text").toDF("col")
val index_udf = udf((col_value: String) => { val result = for (elem <- stop_words; if col_value.contains(elem)) yield col_value.indexOf(elem)
  if (result.isEmpty) col_value.length else result.min })
val substr_udf = udf((elem: String, index: Int) => elem.substring(0, index))
val df3 = df.withColumn("index", index_udf($"col")).withColumn("substr_message", substr_udf($"col", $"index")).select($"substr_message").withColumnRenamed("substr_message", "col")
df3.show()
+---------------+
| col|
+---------------+
|senior manager |
| |
| xxx|
| other text|
+---------------+
How can I transform data like below in order to store data in ElasticSearch?
Here is a dataset of a bean that I would like to aggregate by product into a JSON array.
List<Bean> data = new ArrayList<Bean>();
data.add(new Bean("book","John",59));
data.add(new Bean("book","Björn",61));
data.add(new Bean("tv","Roger",36));
Dataset ds = spark.createDataFrame(data, Bean.class);
ds.show(false);
+------+-------+---------+
|amount|product|purchaser|
+------+-------+---------+
|59 |book |John |
|61 |book |Björn |
|36 |tv |Roger |
+------+-------+---------+
ds = ds.groupBy(col("product")).agg(collect_list(map(ds.col("purchaser"),ds.col("amount")).as("map")));
ds.show(false);
+-------+---------------------------------------------+
|product|collect_list(map(purchaser, amount) AS `map`)|
+-------+---------------------------------------------+
|tv |[[Roger -> 36]] |
|book |[[John -> 59], [Björn -> 61]] |
+-------+---------------------------------------------+
This is what I want to transform it into:
+-------+------------------------------------------------------------------+
|product|json |
+-------+------------------------------------------------------------------+
|tv |[{purchaser: "Roger", amount:36}] |
|book |[{purchaser: "John", amount:59}, {purchaser: "Björn", amount:61}] |
+-------+------------------------------------------------------------------+
The solution:
ds = ds.groupBy(col("product"))
        .agg(collect_list(to_json(struct(col("purchaser"), col("amount")))).alias("json"));
ds.show(false);
Note that the alias goes on collect_list, so the resulting column is named json.