Delta (OSS) + MERGE recreating the underlying parquet files even though there is no change in the incoming data - delta-lake

I am using Delta (OSS, version 0.7.0 with PySpark 3.0.1) and the table is modified (merged into) every 5 minutes by a micro-batch PySpark script.
When I ran it for the first time it created 18 small files (numTargetRowsInserted -> 32560). I then reran it with the same data; although there is no change in the data, the table is touched, the version is updated, the number of small files grows to 400, and the previously added 18 files are marked as removed. Moreover, every MERGE except the first one reports numTargetRowsCopied -> 32560 in the operationMetrics. Why are the target rows copied again and the older files marked as removed? Am I missing anything?
The operationMetrics data is as below:
operationMetrics
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 68457, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 400, rewriteTimeMs -> 66410]
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 16838, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 48810]
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 12399, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 15039]
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 12244, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 14828]
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 67154, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 400, rewriteTimeMs -> 70194]
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 20367, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 80719]
[numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 32560, scanTimeMs -> 7035, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 0, rewriteTimeMs -> 11606]
Merge SQL :
MERGE INTO Target_table tgt
USING Source_table src
ON src.pk_col = tgt.pk_col
WHEN MATCHED AND src.operation=="DELETE" THEN DELETE
WHEN MATCHED AND src.operation=="UPDATE" THEN UPDATE SET *
WHEN NOT MATCHED AND src.operation!="DELETE" THEN INSERT *

That's a known behavior of Delta - it rewrites every file that has a record matching the ON clause, regardless of the conditions in the WHEN MATCHED / WHEN NOT MATCHED branches.
In your case, because you're re-running with the same data, every source row still matches data already in the table, so the ON condition is satisfied; then, when the MATCHED / NOT MATCHED conditions are evaluated, nothing actually needs to change, yet the matched files are still rewritten. To avoid this you need to think about how to make the condition more specific (see the sketch after the talk link below).
Look into this talk (and slides) - it explains how MERGE works
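One way to make the condition more specific, sketched here purely as an illustration (the partition_date column and the assumption that each micro-batch only carries keys belonging to that partition are mine, not from the question), is to add a partition predicate to the ON clause so Delta can prune files before it rewrites anything:
# minimal sketch, not the poster's exact job; partition_date is a hypothetical partition column
spark.sql("""
    MERGE INTO Target_table tgt
    USING Source_table src
    ON src.pk_col = tgt.pk_col
       AND tgt.partition_date = current_date()
    WHEN MATCHED AND src.operation = 'DELETE' THEN DELETE
    WHEN MATCHED AND src.operation = 'UPDATE' THEN UPDATE SET *
    WHEN NOT MATCHED AND src.operation != 'DELETE' THEN INSERT *
""")
Be careful with this pattern: a source key that lives in a partition excluded by the predicate falls into WHEN NOT MATCHED and gets inserted as a duplicate, so it is only safe when the assumption above actually holds for your batches.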

Hi, try to see this for basic guidelines. I recommend using partitioning to minimize the cost of intensive merge operations. Also, write with coalesce(1), which guarantees one file per partition, but watch out for the cardinality of the columns you select for partitioning. A rough sketch follows.
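A rough sketch of that suggestion (the DataFrame name, the path and the partition_date column are made up for illustration, not from the question):
(micro_batch_df
    .coalesce(1)                     # single write task, so one parquet file per partition directory
    .write
    .format("delta")
    .mode("append")
    .partitionBy("partition_date")   # keep the cardinality of this column low
    .save("/path/to/target_table"))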

Related

Unable to remove blank dict from an array in pyspark column

I have a column in my DF whose data type is:
--testcolumn:array
--element: map
-----key:string
-----value: string
testcolumn
Row1:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"},{"":""},{"":""}]
Row2:
[{"class":"6","Roll:"1","Name":"Ram1"}]
[{"":""},{"":""}{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row3:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row4:
[{"":""},{"":""}{"class":"6","Roll:"1","Name":"Ram1"},{"":""},{"":""}]
expecting output:
outputcolumn
Row1:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row2:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row3:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row4:
[{"class":"6","Roll:"1","Name":"Ram1"}]
I have tried this code, but it is not giving me the expected output:
test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,'{"":""}','')))
It would be really great if someone could help me.
Assuming all your values remain strings, and that your data frame does look something like this:
df = spark.createDataFrame([
{"testcolumn": [{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"},{"":""},{"":""}]},
{"testcolumn": [{"class":"6","Roll":"1","Name":"Ram1"},{"":""},{"":""},{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]},
{"testcolumn": [{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]},
{"testcolumn": [{"":""},{"":""},{"class":"6","Roll":"1","Name":"Ram1"},{"":""},{"":""}]}
])
df.show(truncate=False)
# +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |testcolumn |
# +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}, { -> }, { -> }] |
# |[{Roll -> 1, class -> 6, Name -> Ram1}, { -> }, { -> }, {Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}]|
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}] |
# |[{ -> }, { -> }, {Roll -> 1, class -> 6, Name -> Ram1}, { -> }, { -> }] |
# +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You could accomplish this by a) casting the whole column to a string with to_json, b) cleaning up the empty maps with regexp_replace, and c) casting the column back with from_json.
import pyspark.sql.functions as F
df_out = df \
    .withColumn("testcolumn", F.to_json(F.col("testcolumn"))) \
    .withColumn("testcolumn", F.regexp_replace(F.col("testcolumn"), r',?\{"":""\}', '')) \
    .withColumn("testcolumn", F.regexp_replace(F.col("testcolumn"), r"(?<=\[),", "")) \
    .withColumn("testcolumn", F.from_json(F.col("testcolumn"), "ARRAY<MAP<STRING, STRING>>"))
df_out.show(truncate=False)
# +------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |testcolumn |
# +------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}] |
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}]|
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}] |
# |[{Roll -> 1, class -> 6, Name -> Ram1}] |
# +------------------------------------------------------------------------------------------------------------------------------------------------------------+
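If you are on Spark 3.1 or later (an assumption; the question does not say), the round trip through JSON can also be avoided by filtering the array directly with the filter and map_filter higher-order functions. This is just an alternative sketch of the same idea:
import pyspark.sql.functions as F

# keep only maps that still have at least one non-empty key
df_alt = df.withColumn(
    "testcolumn",
    F.filter("testcolumn", lambda m: F.size(F.map_filter(m, lambda k, v: k != "")) > 0)
)
df_alt.show(truncate=False)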

Split a string to a list of integers [duplicate]

This question already has answers here:
How to extract numbers from a string in Python?
(19 answers)
Closed 2 years ago.
I am working with Google's OR-Tools to solve a simple VRP problem. I need to plot the solution, so I parsed the output of the print_solution() function into a dictionary of tours. Now I have got a string list like this:
tour = [' 0 -> 14 -> 15 -> 19 -> 1 -> 13 -> 5 -> 10 -> 20 -> 3 -> 7 -> 6 -> 16 -> 4 -> 9 -> 2 -> 17 -> 11 -> 12 -> 8 -> 18 -> 0']
Could someone please help me to obtain only the integers from this list?
The first provided answer has some issues, but it almost works, and the idea behind it is sound. See the comments on that answer for an explanation of the issues I see.
I would do this differently, so that the split yields the digits alone rather than digits with spaces on either side that then have to be dealt with separately. That is, I'd treat the spaces as part of the delimiter. So I'd do this instead:
import re
tour = [' 0 -> 14 -> 15 -> 19 -> 1 -> 13 -> 5 -> 10 -> 20 -> 3 -> 7 -> 6 -> 16 -> 4 -> 9 -> 2 -> 17 -> 11 -> 12 -> 8 -> 18 -> 0']
x = [int(digits) for digits in re.split(r'\s*->\s*', tour[0])]
print(x)
Result:
[0, 14, 15, 19, 1, 13, 5, 10, 20, 3, 7, 6, 16, 4, 9, 2, 17, 11, 12, 8, 18, 0]
This way the spaces around the arrows are consumed as part of the delimiter, so the conversion does not rely on int() accepting space-padded input (int() does strip surrounding whitespace, which also takes care of the leading space before the very first number).
x = tour[0].split('->')
int_list = [int(num.strip()) for num in x]
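Since the linked duplicate is about extracting numbers from a string, one more minimal sketch may be worth noting (it assumes every token is a non-negative integer, as in the example): pull the digit runs out directly with re.findall, without worrying about the delimiter at all.
import re

tour = [' 0 -> 14 -> 15 -> 19 -> 1 -> 13 -> 5 -> 10 -> 20 -> 3 -> 7 -> 6 -> 16 -> 4 -> 9 -> 2 -> 17 -> 11 -> 12 -> 8 -> 18 -> 0']
x = [int(digits) for digits in re.findall(r'\d+', tour[0])]
print(x)
# [0, 14, 15, 19, 1, 13, 5, 10, 20, 3, 7, 6, 16, 4, 9, 2, 17, 11, 12, 8, 18, 0]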

Search for Pattern in list : python Regex

After the data analysis and getting the required result, I append that result to a list.
Now I need to retrieve or separate the result (search for a pattern and obtain it).
Code:
data = []
data.append('\n'.join([' -> '.join(e) for e in paths]))
The list contains this data:
CH_Trans -> St_1 -> WDL
TRANSFER_Trn -> St_1
Access_Ltd -> MPL_Limited
IPIPI -> TLC_Pvt_Ltd
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
Sonata_Ltd -> Three_Star_Services
Arc_Estates -> Russian_Hosp
A -> B -> C -> D -> E -> F
G -> H
ZN_INTBNKOUT_SET -> -2008_1 -> X
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Note: display or retrieve the paths which have at least two "->" symbols
(txt -> txt -> txt).
I'm trying to get this done with a regex:
import re

for i in data:
    regex = r"\w+\s->\s\w+\s->\s\w+"
    match = re.findall(regex, i, re.MULTILINE)
    print(match)
Regex expressions I tried, but I was unable to get the required result:
#\w+\s->\s\w+\s->\s\w+
#\w+\s[-][>]\s\w+\s[-][>]\s\w+
#\w+\s[-][>]\s\w+\s[-][>]\s\w+\s[-][>]\s\w+
The result I got:
['CH_Trans-> St_1-> WDL', '234er -> Three_Star_Services -> Asian_Pharmas',
'A -> B -> C', 'D -> E -> F', 'ZZ_1_ -> AA_2 -> AA_3',
'SSS -> BBB -> SSS', 'Rock_8CC -> Russ -> By_sus']
The required result I want to obtain is:
----Pattern I------
CH_Trans -> St_1 -> WDL
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
A -> B -> C -> D -> E -> F
ZN_INTBNKOUT_SET -> -2008_1 -> X
# Pattern II consists of paths where the first and last elements are the same
----Pattern II------
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Would you please try the following as a starting point:
import re

regex = r'^\S+(?:\s->\s\S+){2,}$'
for i in data:
    for line in i.splitlines():   # each entry in data is a newline-joined string
        m = re.match(regex, line)
        if m:
            print(m.group())
Results (Pattern I + Pattern II):
CH_Trans -> St_1 -> WDL
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
A -> B -> C -> D -> E -> F
ZN_INTBNKOUT_SET -> -2008_1 -> X
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Explanation of the regex ^\S+(?:\s->\s\S+){2,}$:
^\S+ start with non-blank string
(?: ... ) grouping
\s->\s\S+ a blank followed by "->" followed by a blank and non-blank string
{2,} repeats the previous pattern (or group) two or more times
$ end of the string
As for pattern II, please try:
regex = r'^(\S+)(?:\s->\s\S+){1,}\s->\s\1$'
for i in data:
    for line in i.splitlines():
        m = re.match(regex, line)
        if m:
            print(m.group())
Results:
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Explanation of regex r'^(\S+)(?:\s->\s\S+){1,}\s->\s\1$':
- ^(\S+) captures the 1st element and assigns \1 to it
- (?: ... ) grouping
- \s->\s\S+ a blank followed by "->" followed by a blank and non-blank string
- {1,} repeats the previous pattern (or group) one or more times
- \s->\s\1 a blank followed by "->" followed by a blank and the 1st element \1
- $ end of the string
In order to obtain the result of pattern I, we may need to subtract the list of pattern II from the 1st results.
If we could say:
regex = r'^(\S+)(?:\s->\s\S+){2,}(?<!\1)$'
it would exclude the strings whose last element is the same as the 1st element, and we could obtain the result of pattern I directly, but so far this regex raises an error about "group references in lookbehind assertions". A plain-Python sketch of the subtraction approach is below.
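Here is a sketch of that subtraction idea in plain Python. It assumes data holds the newline-joined strings built in the question, hence the splitlines() call:
import re

lines = [line for chunk in data for line in chunk.splitlines()]

chains = [l for l in lines if re.fullmatch(r'\S+(?:\s->\s\S+){2,}', l)]            # pattern I + II
cycles = [l for l in lines if re.fullmatch(r'(\S+)(?:\s->\s\S+){1,}\s->\s\1', l)]  # pattern II
pattern_one = [l for l in chains if l not in cycles]                               # pattern I

print(pattern_one)
print(cycles)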

Combine 2 map columns in spark sql

Hello, I am using the combine UDF from Brickhouse to combine two maps, as in the example below:
select combine(map('a',1,'b',0),map('a',0,'c',1))
It does combine the maps, but I want to keep the highest value per key while combining them. Is that possible?
You can use a UDF to concatenate the maps while keeping the maximum value per key, as below:
import scala.collection.mutable
import org.apache.spark.sql.functions.udf

val mapConcat = udf((map1: Map[String, Int], map2: Map[String, Int]) => {
  // collect every value seen for each key, then keep the maximum per key
  val finalMap = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
  map1.foreach { case (key: String, value: Int) =>
    if (finalMap.contains(key)) finalMap(key) += value
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  map2.foreach { case (key: String, value: Int) =>
    if (finalMap.contains(key)) finalMap(key) += value
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  finalMap.mapValues(_.max).toMap
})
spark.udf.register("my_map_concat", mapConcat)
spark.range(2).selectExpr("map('a',1,'b',0)","map('a',0,'c',1)",
"my_map_concat(map('a',1,'b',0),map('a',0,'c',1))")
.show(false)
Output-
+----------------+----------------+-------------------------------------+
|map(a, 1, b, 0) |map(a, 0, c, 1) |UDF(map(a, 1, b, 0), map(a, 0, c, 1))|
+----------------+----------------+-------------------------------------+
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
+----------------+----------------+-------------------------------------+
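If you are on Spark 3.0 or later (an assumption, since using Brickhouse suggests an older setup), the same result can also be sketched without a UDF by zipping the maps and taking the larger value per key; greatest() skips the NULL produced for a key that exists in only one of the maps. Shown here in PySpark:
df2 = spark.range(1).selectExpr("map('a',1,'b',0) AS m1", "map('a',0,'c',1) AS m2")
df2.selectExpr("map_zip_with(m1, m2, (k, v1, v2) -> greatest(v1, v2)) AS merged") \
   .show(truncate=False)
# merged contains a -> 1, b -> 0, c -> 1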

How to return a list of names in Haskell?

I am writing a simple program that deals with rainfall for places in the UK. Each place has a name, a location in degrees north and east, and a list of rainfall figures.
How do I return a list of the names of all places? For example:
["London", "Cardiff", "St Helier"].
Error:
1. Couldn't match type
2. list comprehension
import Data.Char
import Data.List
type Place = (String, Int, Int, [Int])
testData :: [Place]
testData = [("London", 51.5, -0.1, [0, 0, 5, 8, 8, 0, 0]),
("Cardiff", 51.5, -3.2, [12, 8, 15, 0, 0, 0, 2]),
("St Helier", 49.2, -2.1, [0, 0, 0, 0, 6, 10, 0])]
listNames :: Place -> [String]
listNames details = [name | (name,north,east,[figure]) <- details]
There are several problems with your current solution:
- You declared type Place = (String, Int, Int, [Int]), but your data contains entries like ("London", 51.5, -0.1, [0, 0, 5, 8, 8, 0, 0]). The problem here is that you have specified the two middle fields of the tuple to be Ints, but you pass in 51.5 and -0.1, which are fractional values. I would recommend changing Place to type Place = (String, Float, Float, [Int]) (you could also look into using a record).
- Your listNames function's signature expects only a single place: listNames :: Place -> [String], but you actually mean it to take a list of places. You should change it to listNames :: [Place] -> [String].
- Your list comprehension uses a restrictive pattern match, while you want one that accepts pretty much anything: the [figure] part of the pattern only matches a list with exactly one element, which it binds to figure. Make sure you understand the difference between the list type notation [a] and a list written out as [1, 2, 3].
- Not only that, but you can disregard all but the place name anyway: [name | (name, _, _, _) <- details].
All together, your code would become:
type Place = (String, Float, Float, [Int])
testData :: [Place]
testData = [("London", 51.5, -0.1, [0, 0, 5, 8, 8, 0, 0]),
("Cardiff", 51.5, -3.2, [12, 8, 15, 0, 0, 0, 2]),
("St Helier", 49.2, -2.1, [0, 0, 0, 0, 6, 10, 0])]
listNames :: [Place] -> [String]
listNames details = [name | (name, _, _, _) <- details]
