Unable to remove blank dict from an array in pyspark column - python-3.x

I have a column in my DF whose data type is:
--testcolumn:array
--element: map
-----key:string
-----value: string
testcolumn
Row1:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"},{"":""},{"":""}]
Row2:
[{"class":"6","Roll:"1","Name":"Ram1"}]
[{"":""},{"":""}{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row3:
[{"class":"6","Roll:"1","Name":"Ram1"},{"class":"6","Roll:"2","Name":"Ram2"},{"class":"6","Roll:"3","Name":"Ram3"}]
Row4:
[{"":""},{"":""}{"class":"6","Roll:"1","Name":"Ram1"},{"":""},{"":""}]
Expected output:
outputcolumn
Row1:
[{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]
Row2:
[{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]
Row3:
[{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]
Row4:
[{"class":"6","Roll":"1","Name":"Ram1"}]
I have tried this code, but it is not giving me the expected output:
test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,'{"":""}','')))
It would be really great if someone could help me.

Assuming all your values remain strings, and that your data frame does look something like this:
df = spark.createDataFrame([
    {"testcolumn": [{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"},{"":""},{"":""}]},
    {"testcolumn": [{"class":"6","Roll":"1","Name":"Ram1"},{"":""},{"":""},{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]},
    {"testcolumn": [{"class":"6","Roll":"1","Name":"Ram1"},{"class":"6","Roll":"2","Name":"Ram2"},{"class":"6","Roll":"3","Name":"Ram3"}]},
    {"testcolumn": [{"":""},{"":""},{"class":"6","Roll":"1","Name":"Ram1"},{"":""},{"":""}]}
])
df.show(truncate=False)
# +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |testcolumn |
# +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}, { -> }, { -> }] |
# |[{Roll -> 1, class -> 6, Name -> Ram1}, { -> }, { -> }, {Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}]|
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}] |
# |[{ -> }, { -> }, {Roll -> 1, class -> 6, Name -> Ram1}, { -> }, { -> }] |
# +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You could accomplish this by a) casting the whole column to a string with to_json, b) cleaning up the empty maps with regexp_replace, and c) casting the column back with from_json.
import pyspark.sql.functions as F
df_out = df \
    .withColumn("testcolumn", F.to_json(F.col("testcolumn"))) \
    .withColumn("testcolumn", F.regexp_replace(F.col("testcolumn"), r',?\{"":""\}', '')) \
    .withColumn("testcolumn", F.regexp_replace(F.col("testcolumn"), r"(?<=\[),", "")) \
    .withColumn("testcolumn", F.from_json(F.col("testcolumn"), "ARRAY<MAP<STRING, STRING>>"))
df_out.show(truncate=False)
# +------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |testcolumn |
# +------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}] |
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}]|
# |[{Roll -> 1, class -> 6, Name -> Ram1}, {Roll -> 2, class -> 6, Name -> Ram2}, {Roll -> 3, class -> 6, Name -> Ram3}] |
# |[{Roll -> 1, class -> 6, Name -> Ram1}] |
# +------------------------------------------------------------------------------------------------------------------------------------------------------------+
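If you are on Spark 2.4 or later, another option is to skip the JSON round trip and drop the blank entries with the higher-order filter function. A minimal sketch, assuming the blank elements are always exactly the single-key {"":""} map:
import pyspark.sql.functions as F

# Keep only map elements that are not a single entry keyed by the empty string.
df_alt = df.withColumn(
    "testcolumn",
    F.expr("filter(testcolumn, x -> NOT (size(x) = 1 AND array_contains(map_keys(x), '')))")
)
df_alt.show(truncate=False)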

Related

delta (OSS) + MERGE recreating the underlying parquet files though there is no change in the incoming data

I am using Delta (OSS version 0.7.0 with pyspark 3.0.1), and the table is modified (merged into) every 5 minutes by a micro-batch pyspark script.
When I ran it for the first time it created 18 small files (numTargetRowsInserted -> 32560). I then reran with the same data: although nothing in the data changed, the table was touched, the version was updated, the number of small files increased to 400, and the previously added 18 files were marked as removed. Moreover, every MERGE except the first shows numTargetRowsCopied -> 32560 in the operation metrics. Why are the target rows copied again and the older files marked as removed? Am I missing anything?
The operationMetrics data is below:
operationMetrics |
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 68457, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 400, rewriteTimeMs -> 66410]|
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 16838, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 48810]|
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 12399, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 15039] |
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 12244, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 14828] |
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 67154, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 400, rewriteTimeMs -> 70194]|
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 20367, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 80719]|
[numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 32560, scanTimeMs -> 7035, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 0, rewriteTimeMs -> 11606] |
Merge SQL :
MERGE INTO Target_table tgt
USING Source_table src
ON src.pk_col = tgt.pk_col
WHEN MATCHED AND src.operation=="DELETE" THEN DELETE
WHEN MATCHED AND src.operation=="UPDATE" THEN UPDATE SET *
WHEN NOT MATCHED AND src.operation!="DELETE" THEN INSERT *
That's a known behavior of Delta - it rewrites every file that has a record matching the ON clause, regardless of the WHEN MATCHED / WHEN NOT MATCHED conditions.
In your case, since you're reusing the same data, you still have matches with the data already in the table, so the ON condition fires, and then the MATCHED/NOT MATCHED clauses find nothing to change. To avoid this you need to think about how to make the condition more specific, as sketched below.
Look into this talk (and slides) - it explains how MERGE works.
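For example, a hedged sketch of a narrower ON clause (the date_col predicate is hypothetical; use whatever column lets Delta prune files it doesn't need to rewrite):
MERGE INTO Target_table tgt
USING Source_table src
ON src.pk_col = tgt.pk_col
   AND tgt.date_col = src.date_col   -- hypothetical partition column present in both tables
WHEN MATCHED AND src.operation=="DELETE" THEN DELETE
WHEN MATCHED AND src.operation=="UPDATE" THEN UPDATE SET *
WHEN NOT MATCHED AND src.operation!="DELETE" THEN INSERT *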
Hi, try to see this for basic guidelines.
I recommend using partitioning to minimize the cost of intensive merge operations.
Write with coalesce(1): this guarantees one file per partition, but watch out for the cardinality of the columns selected for partitioning.
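A sketch of the partitioned, coalesced write that suggestion describes (the path and partition column are hypothetical):
# Hypothetical: create the Delta table partitioned so subsequent MERGEs can prune files.
df.coalesce(1) \
  .write \
  .format("delta") \
  .partitionBy("date_col") \
  .save("/path/to/delta/table")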

Combine 2 map columns in spark sql

Hello, I am using the combine UDF from Brickhouse to combine two maps, for example:
select combine(map('a',1,'b',0),map('a',0,'c',1))
It does combine the maps, but I want to keep the highest value while combining them. Is this possible?
You can use a UDF to concatenate the maps, keeping the maximum value per key, as below:
import scala.collection.mutable
import org.apache.spark.sql.functions.udf

val mapConcat = udf((map1: Map[String, Int], map2: Map[String, Int]) => {
  val finalMap = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
  map1.foreach { case (key: String, value: Int) =>
    if (finalMap.contains(key))
      finalMap(key) += value  // collect this value so the max can be taken later
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  map2.foreach { case (key: String, value: Int) =>
    if (finalMap.contains(key))
      finalMap(key) += value
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  finalMap.mapValues(_.max).toMap
})
spark.udf.register("my_map_concat", mapConcat)
spark.range(2)
  .selectExpr("map('a',1,'b',0)", "map('a',0,'c',1)",
    "my_map_concat(map('a',1,'b',0), map('a',0,'c',1))")
  .show(false)
Output-
+----------------+----------------+-------------------------------------+
|map(a, 1, b, 0) |map(a, 0, c, 1) |UDF(map(a, 1, b, 0), map(a, 0, c, 1))|
+----------------+----------------+-------------------------------------+
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
+----------------+----------------+-------------------------------------+
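On Spark 2.4+ you may also be able to skip the UDF entirely with the built-in map_zip_with higher-order function. A sketch, not tested against your data (greatest() ignores nulls, so keys present in only one of the maps keep their single value):
spark.range(1)
  .selectExpr("map_zip_with(map('a',1,'b',0), map('a',0,'c',1), (k, v1, v2) -> greatest(v1, v2)) AS merged")
  .show(false)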

Haskell replace an item in a existing list [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 2 years ago.
I'm trying to create a function that replaces a given existing place with a new place.
data Place = Place String Coord [Int]
deriving (Ord,Eq,Show,Read)
data Coord = Cord Double Double
deriving (Ord ,Eq ,Show ,Read)
testData :: [Place]
testData = [ Place "London"     (Cord 51.5 (-0.1)) [0, 0, 5, 8, 8, 0, 0],
             Place "Cardiff"    (Cord 51.5 (-3.2)) [12, 8, 15, 0, 0, 0, 2],
             Place "Norwich"    (Cord 52.6 (1.3))  [0, 6, 5, 0, 0, 0, 3],
             Place "Birmingham" (Cord 52.5 (-1.9)) [0, 2, 10, 7, 8, 2, 2],
             Place "Liverpool"  (Cord 53.4 (-3.0)) [8, 16, 20, 3, 4, 9, 2],
             Place "Hull"       (Cord 53.8 (-0.3)) [0, 6, 5, 0, 0, 0, 4],
             Place "Newcastle"  (Cord 55.0 (-1.6)) [0, 0, 8, 3, 6, 7, 5],
             Place "Belfast"    (Cord 54.6 (-5.9)) [10, 18, 14, 0, 6, 5, 2],
             Place "Glasgow"    (Cord 55.9 (-4.3)) [7, 5, 3, 0, 6, 5, 0],
             Place "Plymouth"   (Cord 50.4 (-4.1)) [4, 9, 0, 0, 0, 6, 5],
             Place "Aberdeen"   (Cord 57.1 (-2.1)) [0, 0, 6, 5, 8, 2, 0],
             Place "Stornoway"  (Cord 58.2 (-6.4)) [15, 6, 15, 0, 0, 4, 2],
             Place "Lerwick"    (Cord 60.2 (-1.1)) [8, 10, 5, 5, 0, 0, 3],
             Place "St Helier"  (Cord 49.2 (-2.1)) [0, 0, 0, 0, 6, 10, 0] ]
replaceLocate :: String -> Place -> [Place] -> [Place]
replaceLocate _ _ [] = []
replaceLocate str (Place l d rains) ((Place p c rain):xs)
| str == p = Place l d rains : replaceLocate (Place l d rains) str xs
| otherwise = Place p c rain : replaceLocate (Place l d rains) str xs
I'm using the String to search for the Place that I want to change.
But it gives me these errors:
Smth.hs:96:22: error:
• Couldn't match type ‘Place’ with ‘[Char]’
Expected type: String
Actual type: Place
• In the pattern: Place l d rains
In an equation for ‘replaceLocate’:
replaceLocate str (Place l d rains) ((Place p c rain) : xs)
| str == p
= Place l d rains : replaceLocate (Place l d rains) str xs
| otherwise
= Place p c rain : replaceLocate (Place l d rains) str xs
|
96 | replaceLocate str (Place l d rains) ((Place p c rain):xs) | ^^^^^^^^^^^^^^^
Smth.hs:97:16: error:
• Couldn't match type ‘[Char]’ with ‘Place’
Expected type: Place
Actual type: String
• In the second argument of ‘(==)’, namely ‘p’
In the expression: str == p
In a stmt of a pattern guard for
an equation for ‘replaceLocate’:
str == p
|
97 | | str == p = Place l d rains : replaceLocate (Place l d rains) str xs | ^
Smth.hs:97:82: error:
• Couldn't match type ‘Place’ with ‘[Char]’
Expected type: String
Actual type: Place
• In the second argument of ‘replaceLocate’, namely ‘str’
In the second argument of ‘(:)’, namely
‘replaceLocate (Place l d rains) str xs’
In the expression:
Place l d rains : replaceLocate (Place l d rains) str xs
|
97 | | str == p = Place l d rains : replaceLocate (Place l d rains) str xs | ^^^
Smth.hs:98:86: error:
• Couldn't match type ‘Place’ with ‘[Char]’
Expected type: String
Actual type: Place
• In the second argument of ‘replaceLocate’, namely ‘str’
In the second argument of ‘(:)’, namely
‘replaceLocate (Place l d rains) str xs’
In the expression:
Place p c rain : replaceLocate (Place l d rains) str xs
|
98 | | otherwise = Place p c rain : replaceLocate (Place l d rains) str xs | ^^^
In your recursive call you are swapping the first two parameters. You need to replace this:
replaceLocate (Place l d rains) str xs
With this:
replaceLocate str (Place l d rains) xs
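With that change in place, the whole function might look like this (just a sketch; everything else is your original code):
replaceLocate :: String -> Place -> [Place] -> [Place]
replaceLocate _ _ [] = []
replaceLocate str new ((Place p c rain):xs)
    | str == p  = new : replaceLocate str new xs
    | otherwise = Place p c rain : replaceLocate str new xs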

How to return a list of names in Haskell?

I am writing a simple program that deals with rainfall for places in the UK. Each place has a name, location in degrees north and east, and list of rainfall figures.
How do I return a list of the names of all places, for example:
[London, Cardiff, St Helier].
The errors I run into:
1. Couldn't match type
2. list comprehension
import Data.Char
import Data.List
type Place = (String, Int, Int, [Int])
testData :: [Place]
testData = [("London", 51.5, -0.1, [0, 0, 5, 8, 8, 0, 0]),
("Cardiff", 51.5, -3.2, [12, 8, 15, 0, 0, 0, 2]),
("St Helier", 49.2, -2.1, [0, 0, 0, 0, 6, 10, 0])]
listNames :: Place -> [String]
listNames details = [name | (name,north,east,[figure]) <- details]
There are several problems with your current solution:
type Place = (String, Int, Int, [Int]) but ("London", 51.5, -0.1, [0, 0, 5, 8, 8, 0, 0]) The problem here is that you have specified the two middle fields of the tuple to be Ints, but you pass in 51.5 and -0.1, which are fractional values. I would recommend changing Place to: type Place = (String, Float, Float, [Int]) (you could also look into using a record).
Your listNames function's signature expects only a single place: listNames :: Place -> [String], but you actually mean for it to take a list of places. You should change it to listNames :: [Place] -> [String].
Your list comprehension uses a restrictive pattern match, while you want one that accepts pretty much anything: the [figure] part of the pattern match only matches a list with a single element, which you are binding to figure. Make sure that you understand the difference between the list type notation [a] and the list constructor [1, 2, 3].
Not only that, but you can disregard all but the place name anyway: [name | (name, _, _, _) <- details].
All together, your code would become:
type Place = (String, Float, Float, [Int])
testData :: [Place]
testData = [("London", 51.5, -0.1, [0, 0, 5, 8, 8, 0, 0]),
("Cardiff", 51.5, -3.2, [12, 8, 15, 0, 0, 0, 2]),
("St Helier", 49.2, -2.1, [0, 0, 0, 0, 6, 10, 0])]
listNames :: [Place] -> [String]
listNames details = [name | (name, _, _, _) <- details]
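A quick check in GHCi with the testData above should then give:
ghci> listNames testData
["London","Cardiff","St Helier"]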

Convert Spark DataFrame Map into Array of Maps of `{"Key": key, "Value": value}`

How can I take a Spark DataFrame structured like this:
val sourcedf = spark.createDataFrame(
List(
Row(Map("AL" -> "Alabama", "AK" -> "Alaska").asJava),
Row(Map("TX" -> "Texas", "FL" -> "Florida", "NJ" -> "New Jersey").asJava)
).asJava, StructType(
StructField("my_map", MapType(StringType, StringType, false)) ::
Nil))
or in a text form, sourcedf.show(false) shows:
+----------------------------------------------+
|my_map |
+----------------------------------------------+
|[AL -> Alabama, AK -> Alaska] |
|[TX -> Texas, FL -> Florida, NJ -> New Jersey]|
+----------------------------------------------+
and programmatically transform to this structure:
val targetdf = spark.createDataFrame(
List(
Row(List(Map("Key" -> "AL", "Value" -> "Alabama"), Map("Key" -> "AK", "Value" -> "Alaska")).asJava),
Row(List(Map("Key" -> "TX", "Value" -> "Texas"), Map("Key" -> "FL", "Value" -> "Florida"), Map("Key" -> "NJ", "Value" -> "New Jersey")).asJava)
).asJava, StructType(
StructField("my_list", ArrayType(MapType(StringType, StringType, false), false)) ::
Nil))
or in a text form, targetdf.show(false) shows:
+----------------------------------------------------------------------------------------------+
|my_list |
+----------------------------------------------------------------------------------------------+
|[[Key -> AL, Value -> Alabama], [Key -> AK, Value -> Alaska]] |
|[[Key -> TX, Value -> Texas], [Key -> FL, Value -> Florida], [Key -> NJ, Value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+
Whilst using Scala, I couldn't figure out how to handle a java.util.Map with the provided Encoders; I probably would have had to write one myself, and I figured that was too much work.
However, I can see two ways to do this without java.util.Map, using scala.collection.immutable.Map instead.
You could convert into a typed Dataset and flatMap:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import spark.implicits._  // encoders for the case classes and the $-column syntax

case class Foo(my_map: Map[String, String])
case class Bar(my_list: List[Map[String, String]])
implicit val encoder = ExpressionEncoder[List[Map[String, String]]]
val ds: Dataset[Foo] = sourcedf.as[Foo]
val output: Dataset[Bar] = ds.map(x => Bar(x.my_map.flatMap { case (k, v) => List(Map("key" -> k, "value" -> v)) }.toList))
output.show(false)
Or you can use a UDF:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val mapToList: Map[String, String] => List[Map[String, String]] = {
  x => x.flatMap { case (k, v) => List(Map("key" -> k, "value" -> v)) }.toList
}
val mapToListUdf: UserDefinedFunction = udf(mapToList)
val output: Dataset[Row] = sourcedf.select(mapToListUdf($"my_map").as("my_list"))
output.show(false)
Both output
+----------------------------------------------------------------------------------------------+
|my_list |
+----------------------------------------------------------------------------------------------+
|[[key -> AL, value -> Alabama], [key -> AK, value -> Alaska]] |
|[[key -> TX, value -> Texas], [key -> FL, value -> Florida], [key -> NJ, value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+
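If you'd rather avoid case classes and UDFs altogether, the built-in higher-order functions may also work (a sketch, assuming Spark 2.4+, not benchmarked):
import org.apache.spark.sql.functions.expr

// map_entries turns the map into an array of {key, value} structs;
// transform then rewraps each struct as a two-entry string map.
val output = sourcedf.select(
  expr("transform(map_entries(my_map), e -> map('Key', e.key, 'Value', e.value))").as("my_list"))
output.show(false)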
