I have a spark pair RDD (key, count) as below
Array[(String, Int)] = Array((a,1), (b,2), (c,3), (d,4))
I want to add a new max element to each tuple in the RDD:
Array[(String, Int)] = Array((a,1,4), (b,2,4), (c,3,4), (d,4,4))
In the definition, you are saying:
(a, 1) -> (a, 1, 4)
(b, 2) -> (b, 2, 4)
(c, 1) -> (c, 3, 4) where is the 3 coming from now?
(d, 3) -> (d, 4, 4) where is the 4 coming from now?
If your new max is the maximum value in the RDD plus one, you can sort descending by value and take the first element's value:
val max = df1.sortBy(_._2, ascending = false).first()._2 + 1
val df2 = df1.map(r => (r._1, r._2, max))
This gives:
(a,1,4)
(b,2,4)
(c,1,4)
(d,3,4)
which should be what you want.
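For reference, here is the same logic sketched on a plain local collection (the data mirrors the question; on an actual RDD you could also skip the sort and take the maximum of the values directly):

```scala
// Plain-Scala sketch of the same steps, on local data mirroring the question.
val data = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))
val max = data.map(_._2).max + 1                      // largest value plus one = 4
val result = data.map { case (k, v) => (k, v, max) }
// result: Array((a,1,4), (b,2,4), (c,1,4), (d,3,4))
```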
Hello, I am using the combine UDF from brickhouse to combine two maps, as in the example below:
select combine(map('a',1,'b',0),map('a',0,'c',1))
It does combine the maps, but I want to keep the highest value for each key while combining. Is this possible?
You can use a UDF that concatenates the maps and keeps the maximum value for each key, as below:
import scala.collection.mutable

val mapConcat = udf((map1: Map[String, Int], map2: Map[String, Int]) => {
  val finalMap = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
  map1.foreach { case (key, value) =>
    if (finalMap.contains(key)) finalMap(key) += value // append the value, not the key
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  map2.foreach { case (key, value) =>
    if (finalMap.contains(key)) finalMap(key) += value
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  finalMap.mapValues(_.max).toMap // keep only the highest value per key
})
spark.udf.register("my_map_concat", mapConcat)
spark.range(2).selectExpr("map('a',1,'b',0)","map('a',0,'c',1)",
"my_map_concat(map('a',1,'b',0),map('a',0,'c',1))")
.show(false)
Output:
+----------------+----------------+-------------------------------------+
|map(a, 1, b, 0) |map(a, 0, c, 1) |UDF(map(a, 1, b, 0), map(a, 0, c, 1))|
+----------------+----------------+-------------------------------------+
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
+----------------+----------------+-------------------------------------+
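If you only need the max-per-key merge and not the intermediate buffers, the UDF body can be sketched more directly (my own variant, not the brickhouse combine):

```scala
// Minimal sketch: merge two maps, keeping the highest value for each key.
def mergeMax(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
  (m1.keySet ++ m2.keySet).map { k =>
    k -> List(m1.get(k), m2.get(k)).flatten.max // max of whichever values exist
  }.toMap

mergeMax(Map("a" -> 1, "b" -> 0), Map("a" -> 0, "c" -> 1))
// Map(a -> 1, b -> 0, c -> 1)
```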
My current code is below. I think all of the functions, except for the last one, are correct. What I'm trying to achieve with changeValueMatrix is to take a matrix, a matrix position and a value, and have that value replace the one at the given position. I've managed to reach the position and change the value, but I can only return the row on which I changed it and not the whole matrix. I am a Haskell beginner and I've only just learned recursion, but it would be ideal to use it here if possible.
type Matrix a = [[a]]
type MatrixDimension = (Int,Int)
type MatrixPosition = (Int,Int)
matrixDimension :: Matrix a -> MatrixDimension
matrixDimension m = (length m, length (head m))
returnValueList :: Int -> [a] -> a
returnValueList 0 (x:xs) = x
returnValueList i (x:xs) = returnValueList (i-1) xs
changeValueList :: Int -> a -> [a] -> [a]
changeValueList 0 value (x:xs) = (value:xs)
changeValueList i value (x:xs) = x:(changeValueList (i-1) (value) (xs))
returnValueMatrix :: MatrixPosition-> Matrix a -> a
returnValueMatrix (m,n) matrix = returnValueList n (returnValueList m matrix)
changeValueMatrix :: MatrixPosition -> a -> Matrix a -> Matrix a
changeValueMatrix(0,c) value (x:xs) = a:xs
where a = changeValueList c value x
changeValueMatrix(r,c) valor (x:xs) =
where
row = returnValueList r (x:xs)
b = changeValueList c value row
You can build changeValueMatrix from the functions you’ve already defined:
changeValueMatrix :: MatrixPosition -> a -> Matrix a -> Matrix a
changeValueMatrix (r, c) value matrix
= changeValueList r -- (3)
(changeValueList c value -- (2)
(returnValueList r matrix)) -- (1)
matrix
At (1) you look up the row at index r in matrix, at (2) you replace the element at column c in that row with value, and at (3) you replace the row at index r in matrix with the modified row. For example:
-- Given: mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
changeValueMatrix (1, 1) 0 mat
==
changeValueList 1
(changeValueList 1 0
(returnValueList 1 mat))
mat
==
changeValueList 1
(changeValueList 1 0 [4, 5, 6])
mat
==
changeValueList 1 [4, 0, 6] mat
==
[ [1, 2, 3]
, [4, 0, 6]
, [7, 8, 9]
]
If you want a version of this using explicit recursion, which only traverses the rows once, you can inline the definition of changeValueList into changeValueMatrix:
changeValueMatrix (0, c) value (x : xs)
= changeValueList c value x : xs
changeValueMatrix (r, c) value (x : xs)
= x : changeValueMatrix (r - 1, c) value xs
Be aware that your code has a few failure cases, though:
Negative indices will produce infinite loops because you only test for 0 and recur with i - 1 on any other number
Overly large indices will run into the end of the list and crash because you don’t handle the [] case—the pattern matches are non-exhaustive, which the compiler will point out when enabling all warnings with -Wall
Similarly, matrices of zero width or height are representable, but these functions don’t handle the possibility (e.g. matrixDimension calls head on a possibly-empty list); you can avoid this using Data.List.NonEmpty or Data.Array as your backing type, the latter of which is also more efficient
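To make the failure cases above concrete, here is a bounds-checked sketch of both updates that returns Maybe instead of looping or crashing (the safe* names are mine, not from the original code):

```haskell
-- Hypothetical safe variants: Nothing on negative or out-of-range indices.
safeChangeValueList :: Int -> a -> [a] -> Maybe [a]
safeChangeValueList _ _ []     = Nothing            -- index past the end
safeChangeValueList 0 v (_:xs) = Just (v : xs)
safeChangeValueList i v (x:xs)
  | i < 0     = Nothing                             -- negative index
  | otherwise = (x :) <$> safeChangeValueList (i - 1) v xs

safeChangeValueMatrix :: (Int, Int) -> a -> [[a]] -> Maybe [[a]]
safeChangeValueMatrix _ _ []          = Nothing
safeChangeValueMatrix (0, c) v (x:xs) = (: xs) <$> safeChangeValueList c v x
safeChangeValueMatrix (r, c) v (x:xs)
  | r < 0     = Nothing
  | otherwise = (x :) <$> safeChangeValueMatrix (r - 1, c) v xs
```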
What is the best way to convert a PairRDD into an RDD with both K and V are merged (in java)?
For example, the PairRDD contains K as some string and V as a JSON. I want to add this K to the value JSON and produce an RDD.
Input PairRDD
("abc", {"x":"100", "y":"200"})
("def", {"x":"400", "y":"500"})
Output should be an RDD as follows
({"x":"100", "y":"200", "z":"abc"})
({"x":"400", "y":"500", "z":"def"})
You can use map to translate between the two. Consider:
scala> pairrdd.foreach(println)
(def,Map(x -> 400, y -> 500))
(abc,Map(x -> 100, y -> 200))
(I think that's what your sample is meant to represent)
scala> val newrdd = pairrdd.map(x => x._2 ++ Map("z" -> x._1))
scala> newrdd.foreach(println)
Map(x -> 100, y -> 200, z -> abc)
Map(x -> 400, y -> 500, z -> def)
You'll have to translate the val newrdd line to Java syntax, but the right-hand side of the assignment (I believe) will stay much the same.
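A hedged Java sketch of the per-record transformation (plain collections here; with Spark you would apply the same body inside JavaPairRDD's map over Tuple2 elements — the class and method names below are mine):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeKeyIntoValue {
    // Copy the value map and add the pair's key under "z" (the key name used in the question).
    static Map<String, String> withKey(String key, Map<String, String> value) {
        Map<String, String> merged = new HashMap<>(value); // copy so the input map stays untouched
        merged.put("z", key);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> v = new HashMap<>();
        v.put("x", "100");
        v.put("y", "200");
        System.out.println(withKey("abc", v)); // contains x=100, y=200, z=abc
    }
}
```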
I'm learning Scala by working the exercises from the book "Scala for the Impatient". Please see the following question and my answer and code. I'd like to know if my answer is correct. Also the code doesn't work (all frequencies are 1). Where's the bug?
Q10: Harry Hacker reads a file into a string and wants to use a
parallel collection to update the letter frequencies concurrently on
portions of the string. He uses the following code:
val frequencies = new scala.collection.mutable.HashMap[Char, Int]
for (c <- str.par) frequencies(c) = frequencies.getOrElse(c, 0) + 1
Why is this a terrible idea? How can he really parallelize the
computation?
My answer:
It is not a good idea because mutable.HashMap is not thread-safe: if two threads concurrently update the count for the same character, one update can overwrite the other, so the result is undefined.
My code:
def parFrequency(str: String) = {
str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) }, _ ++ _)
}
Unit test:
"Method parFrequency" should "return the frequency of each character in a string" in {
val freq = parFrequency("harry hacker")
freq should have size 8
freq('h') should be(2) // fails
freq('a') should be(2)
freq('r') should be(3)
freq('y') should be(1)
freq(' ') should be(1)
freq('c') should be(1)
freq('k') should be(1)
freq('e') should be(1)
}
Edit:
After reading this thread, I updated the code. Now the test passes if run alone, but fails if run as part of a suite.
def parFrequency(str: String) = {
val freq = ImmutableHashMap[Char, Int]()
str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), (m1, m2) => m1.merged(m2)({
case ((k, v1), (_, v2)) => (k, v1 + v2)
}))
}
Edit 2:
See my solution below.
++ does not combine the values of identical keys. So when you merge the maps, you get (for shared keys) one of the values (which in this case is always 1), not the sum of the values.
This works:
def parFrequency(str: String) = {
str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) },
(a,b) => b.foldLeft(a){case (acc, (k,v))=> acc updated (k, acc.getOrElse(k,0) + v) })
}
val freq = parFrequency("harry hacker")
//> Map(e -> 1, y -> 1, a -> 2, -> 1, c -> 1, h -> 2, r -> 3, k -> 1)
The foldLeft iterates over one of the maps, updating the other map with the key/values found.
Your trouble in the first case, as you detected yourself, was the ++ operator, which just concatenates and drops the second occurrence of any shared key.
In the second case, (_, c) => ImmutableHashMap(c -> 1) discards the accumulated map, so all characters already counted during the seqop stage are lost.
My suggestion is to extend the Map type with a special combination operation that works like merged on HashMap, while preserving the accumulation from your first example in the seqop stage:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
def unionWith[V1 >: V](m2: Map[K, V1])(f: (V1, V1) => V1): Map[K, V1] = {
val kv1 = m1.filterKeys(!m2.contains(_))
val kv2 = m2.filterKeys(!m1.contains(_))
val common = (m1.keySet & m2.keySet).toSeq map (k => (k, f(m1(k), m2(k))))
(common ++ kv1 ++ kv2).toMap
}
}
def parFrequency(str: String) = {
str.par.aggregate(Map[Char, Int]())((m, c) => {m + (c -> (m.getOrElse(c, 0) + 1))}, (m1, m2) => (m1 unionWith m2)(_ + _))
}
Or you can use the fold solution from Paul's answer, but for better performance traverse the smaller of the two maps on each merge:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
def unionWith(m2: Map[K, V])(f: (V, V) => V): Map[K, V] =
if (m2.size > m1.size) m2.unionWith(m1)(f)
else m2.foldLeft(m1) {
case (acc, (k, v)) => acc + (k -> acc.get(k).fold(v)(f(v, _)))
}
}
This seems to work. I like it better than the other solutions proposed here because:
It's a lot less code than an implicit class and slightly less code than using getOrElse with foldLeft.
It uses the merged function from the API, which is intended to do what I want.
It's my own solution :)
def parFrequency(str: String) = {
val freq = ImmutableHashMap[Char, Int]()
str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), _.merged(_) {
case ((k, v1), (_, v2)) => (k, v1 + v2)
})
}
Thanks for taking the time to help me out.
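As a sanity check for any of the parallel versions, the result can be compared against a plain sequential count (my own sketch, using groupBy as the reference implementation):

```scala
// Sequential reference: count characters with groupBy, no parallelism involved.
def seqFrequency(str: String): Map[Char, Int] =
  str.groupBy(identity).map { case (c, cs) => c -> cs.length }

seqFrequency("harry hacker")
// h -> 2, a -> 2, r -> 3, and y, space, c, k, e -> 1 each
```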
I am having issues converting a list of nested tuples into a datatype
data SensorValue = SensorValue {a:: Integer, b:: Integer, c:: [Integer]} deriving (Show)
my list with tuples looks like this:
[(1, [(2, [3,4,5]), (2, [2,3,1]), (3, [2,3,7])]), (2, [(1, [4,4,1]), (2, [2,3,1]), (3, [9,0,3])]),...]
so basically my list looks like [(Integer, [(Integer, [Integer])])]
Example
If I take the first tuple from my list, (1, [(2, [3,4,5]), ...]), then my expected output is:
a SensorValue object with :
a = 1 -- first element of the first tuple
b = 2 -- first element of the second tuple
c = [3,4,5] -- second element of the second tuple
I know how to get to the first tuple with fst, but how do I get to the second tuple?
You can use pattern matching here. Your function would look something like this:
f :: (Integer, [(Integer, [Integer])]) -> [SensorValue]
f (x, (y, z) : zs) = SensorValue x y z : f (x, zs) -- first element is the same for all
f (x, [])          = []
You would still need to specify the conditions to handle other cases e.g. What happens if the list that forms the second element of the outer tuple is empty?
List comprehensions or do syntax make this quite nice -- assuming you understand them!
doSyntax, listComprehensions :: [(Integer, [(Integer, [Integer])])] -> [SensorValue]
doSyntax sensorPoints = do
(a, pointsAtA ) <- sensorPoints
(b, valuesAtAB) <- pointsAtA
return (SensorValue a b valuesAtAB)
listComprehensions sensorPoints =
[ SensorValue a b valuesAtAB
| (a, pointsAtA ) <- sensorPoints
, (b, valuesAtAB) <- pointsAtA
]
Depending on just what you want to do, you might even consider storing just one sensor value in each element of the result list. Like this (with a variant on the naming scheme above, just for fun):
data SensorValue = SensorValue { a, b, val :: Integer }
fromRawData abvalM =
[ SensorValue a b val
| (a, bvalM) <- abvalM
, (b, valM) <- bvalM
, val <- valM
]
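A self-contained sketch of that one-value-per-record variant, applied to the first inner tuple from the question's sample (deriving Eq is added here only so results can be compared):

```haskell
-- Restates the variant above; one SensorValue per individual reading.
data SensorValue = SensorValue { a, b, val :: Integer } deriving (Show, Eq)

fromRawData :: [(Integer, [(Integer, [Integer])])] -> [SensorValue]
fromRawData abvalM =
  [ SensorValue a b val
  | (a, bvalM) <- abvalM   -- outer tuple: sensor id and its points
  , (b, valM)  <- bvalM    -- inner tuple: point id and its readings
  , val        <- valM     -- one record per reading
  ]

-- fromRawData [(1, [(2, [3,4,5])])]
--   yields SensorValue 1 2 3, SensorValue 1 2 4, SensorValue 1 2 5
```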