I'm learning Scala by working the exercises from the book "Scala for the Impatient". Please see the following question and my answer and code. I'd like to know if my answer is correct. Also the code doesn't work (all frequencies are 1). Where's the bug?
Q10: Harry Hacker reads a file into a string and wants to use a
parallel collection to update the letter frequencies concurrently on
portions of the string. He uses the following code:
val frequencies = new scala.collection.mutable.HashMap[Char, Int]
for (c <- str.par) frequencies(c) = frequencies.getOrElse(c, 0) + 1
Why is this a terrible idea? How can he really parallelize the
computation?
My answer:
It is not a good idea because scala.collection.mutable.HashMap is not thread-safe. When two threads update the count for the same character concurrently, their read-modify-write steps can interleave, so updates are lost (and the map's internal state can even be corrupted). The result is therefore undefined.
My code:
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) }, _ ++ _)
}
Unit test:
"Method parFrequency" should "return the frequency of each character in a string" in {
val freq = parFrequency("harry hacker")
freq should have size 8
freq('h') should be(2) // fails
freq('a') should be(2)
freq('r') should be(3)
freq('y') should be(1)
freq(' ') should be(1)
freq('c') should be(1)
freq('k') should be(1)
freq('e') should be(1)
}
Edit:
After reading this thread, I updated the code. Now the test passes if run alone, but fails if run as part of a suite.
// Assuming ImmutableHashMap is an alias for scala.collection.immutable.HashMap,
// e.g. import scala.collection.immutable.{HashMap => ImmutableHashMap}
def parFrequency(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), (m1, m2) => m1.merged(m2)({
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  }))
}
Edit 2:
See my solution below.
++ does not combine the values of identical keys. So when you merge the maps, you get (for shared keys) one of the values (which in this case is always 1), not the sum of the values.
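For example, a quick check (not from the book or the thread) of how ++ behaves on a shared key:
Map('h' -> 1) ++ Map('h' -> 1) // Map(h -> 1): the right-hand value wins, the counts are not summed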
This works:
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) },
    (a, b) => b.foldLeft(a) { case (acc, (k, v)) => acc updated (k, acc.getOrElse(k, 0) + v) })
}
val freq = parFrequency("harry hacker")
//> Map(e -> 1, y -> 1, a -> 2, -> 1, c -> 1, h -> 2, r -> 3, k -> 1)
The foldLeft iterates over one of the maps, updating the other map with the key/values found.
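For instance, running just the combine step by hand on two made-up partial maps (values invented for illustration):
val a = Map('h' -> 1, 'a' -> 1)
val b = Map('h' -> 1, 'r' -> 2)
b.foldLeft(a) { case (acc, (k, v)) => acc updated (k, acc.getOrElse(k, 0) + v) }
// Map(h -> 2, a -> 1, r -> 2)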
Your trouble in the first case, as you detected yourself, was the ++ operator, which just concatenates the maps and drops one of the values for a shared key.
In the second case, the seqop (_, c) => ImmutableHashMap(c -> 1) discards whatever the accumulator has already collected and keeps only the current character.
My suggestion is to extend the Map type with a special combining operation that works like merged on HashMap, while preserving the accumulation from the first example at the seqop stage:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
  def unionWith[V1 >: V](m2: Map[K, V1])(f: (V1, V1) => V1): Map[K, V1] = {
    val kv1 = m1.filterKeys(!m2.contains(_))
    val kv2 = m2.filterKeys(!m1.contains(_))
    val common = (m1.keySet & m2.keySet).toSeq map (k => (k, f(m1(k), m2(k))))
    (common ++ kv1 ++ kv2).toMap
  }
}
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) }, (m1, m2) => (m1 unionWith m2)(_ + _))
}
Or you can use the fold solution from Paul's answer, but for better performance traverse the smaller of the two maps on each merge:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
  def unionWith(m2: Map[K, V])(f: (V, V) => V): Map[K, V] =
    if (m2.size > m1.size) m2.unionWith(m1)(f)
    else m2.foldLeft(m1) {
      case (acc, (k, v)) => acc + (k -> acc.get(k).fold(v)(f(v, _)))
    }
}
This seems to work. I like it better than the other solutions proposed here because:
It's a lot less code than an implicit class and slightly less code than using getOrElse with foldLeft.
It uses the merged function from the API, which is intended to do exactly this.
It's my own solution :)
def parFrequency(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), _.merged(_) {
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  })
}
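For reference, calling it on the test string gives the same result as the foldLeft version earlier (output copied from that run):
val freq = parFrequency("harry hacker")
//> Map(e -> 1, y -> 1, a -> 2,   -> 1, c -> 1, h -> 2, r -> 3, k -> 1)
One caveat, echoing the answer above: the seqop (_, c) => ImmutableHashMap(c -> 1) still discards whatever the accumulator already holds, so if a sequential chunk ever contains more than one character this version can undercount; it presumably works here because such a short string is split into single-character chunks.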
Thanks for taking the time to help me out.
Related
I am trying to group an RDD by using groupBy. Most of the docs suggest not to use groupBy because of how it works internally to group keys. Is there another way to achieve this? I cannot use reduceByKey because I am not doing a reduction operation here.
Example:
Entry - long id, string name;
JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
    .flatMap(x -> someOp(x))
    .values()
    .filter()
aggregateByKey [Pair]
Works like the aggregate function except the aggregation is applied to the values with the same key. Also unlike the aggregate function the initial value is not applied to the second reduce.
Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example :
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// lets have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
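Tying this back to the grouping question above, here is a sketch (not from the thread; Entry, someOp and an rdd: RDD[Entry] are assumed from the question, rewritten in the Scala API for brevity) of collecting the values per key with aggregateByKey instead of groupBy, so partial buffers are combined map-side before the shuffle:
case class Entry(id: Long, name: String)
def someOp(entries: Seq[Entry]): Seq[Entry] = entries // placeholder for the question's someOp

val grouped = rdd
  .map(e => (e.id, e))
  .aggregateByKey(Vector.empty[Entry])(
    (buf, e) => buf :+ e, // seqOp: append to the partition-local buffer
    (a, b) => a ++ b)     // combOp: merge buffers from different partitions
  .flatMap { case (_, entries) => someOp(entries) }
Note that each per-key group is still materialized in memory, so this mainly reduces shuffle traffic rather than peak memory per key.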
The following sequential Merge Sort returns the result very quickly :-
def mergeSort(xs: List[Int]): List[Int] = {
  def merge(xs: List[Int], ys: List[Int]): List[Int] = (xs, ys) match {
    case (Nil, _) => ys
    case (_, Nil) => xs
    case (x :: xs1, y :: ys1) => if (x <= y) x :: merge(xs1, ys) else y :: merge(xs, ys1)
  }
  val mid = xs.length / 2
  if (mid <= 0) xs
  else {
    val (xs1, ys1) = xs.splitAt(mid)
    merge(mergeSort(xs1), mergeSort(ys1))
  }
}
val newList = (1 to 10000).toList.reverse
mergeSort(newList)
However, when I try to parallelize it using Futures, it times out :-
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

def mergeSort(xs: List[Int]): List[Int] = {
  def merge(xs: List[Int], ys: List[Int]): List[Int] = (xs, ys) match {
    case (Nil, _) => ys
    case (_, Nil) => xs
    case (x :: xs1, y :: ys1) => if (x <= y) x :: merge(xs1, ys) else y :: merge(xs, ys1)
  }
  val mid = xs.length / 2
  if (mid <= 0) xs
  else {
    val (xs1, ys1) = xs.splitAt(mid)
    val sortedList1 = Future { mergeSort(xs1) }
    val sortedList2 = Future { mergeSort(ys1) }
    merge(Await.result(sortedList1, 5.seconds), Await.result(sortedList2, 5.seconds))
  }
}
val newList = (1 to 10000).toList.reverse
mergeSort(newList)
I get a TimeoutException. I understand that this is probably because this code spawns roughly log2(10000) levels of futures that all block in Await.result, which adds a lot of delay since the execution context's thread pool may not have that many threads.
1.) How do I exploit the inherent parallelism in merge sort and parallelize this code ?
2.) For what use cases are Futures useful and when should they be avoided ?
Edit 1 : Refactored code based on the feedback I've gotten so far :-
import scala.annotation.tailrec

def mergeSort(xs: List[Int]): Future[List[Int]] = {
  @tailrec
  def merge(xs: List[Int], ys: List[Int], acc: List[Int]): List[Int] = (xs, ys) match {
    case (Nil, _) => acc.reverse ::: ys
    case (_, Nil) => acc.reverse ::: xs
    case (x :: xs1, y :: ys1) => if (x <= y) merge(xs1, ys, x :: acc) else merge(xs, ys1, y :: acc)
  }
  val mid = xs.length / 2
  if (mid <= 0) Future { xs }
  else {
    val (xs1, ys1) = xs.splitAt(mid)
    val sortedList1 = mergeSort(xs1)
    val sortedList2 = mergeSort(ys1)
    for (s1 <- sortedList1; s2 <- sortedList2) yield merge(s1, s2, List())
  }
}
Usually when using Futures, you should a) await as little as possible and prefer to work within Futures, and b) pay attention to which execution context you are using.
As an example of a), here's how you could change this:
def mergeSort(xs: List[Int]): Future[List[Int]] = {
  def merge(xs: List[Int], ys: List[Int]): List[Int] = (xs, ys) match {
    case (Nil, _) => ys
    case (_, Nil) => xs
    case (x :: xs1, y :: ys1) => if (x <= y) x :: merge(xs1, ys) else y :: merge(xs, ys1)
  }
  val mid = xs.length / 2
  if (mid <= 0) Future(xs)
  else {
    val (xs1, ys1) = xs.splitAt(mid)
    val sortedList1 = mergeSort(xs1)
    val sortedList2 = mergeSort(ys1)
    for (s1 <- sortedList1; s2 <- sortedList2) yield merge(s1, s2)
  }
}
val newList = (1 to 10000).toList.reverse
Await.result(mergeSort(newList), 5 seconds)
However there's still a ton of overhead here. Typically you would only parallelize significantly-sized chunks of work to avoid being dominated by overhead, which in this case would probably mean falling back to a single-threaded version when recursion reaches a list below some constant size.
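As a rough sketch of that last point (not part of the original answer; the cutoff value is arbitrary and would need tuning), the recursion can spawn Futures only while the list is large, and fall back to the plain sequential sort below the cutoff:
def parMergeSort(xs: List[Int], cutoff: Int = 1000): Future[List[Int]] = {
  def merge(xs: List[Int], ys: List[Int]): List[Int] = (xs, ys) match {
    case (Nil, _) => ys
    case (_, Nil) => xs
    case (x :: xs1, y :: ys1) => if (x <= y) x :: merge(xs1, ys) else y :: merge(xs, ys1)
  }
  if (xs.length <= cutoff) Future(mergeSort(xs)) // mergeSort: the sequential version from the first snippet
  else {
    val (l, r) = xs.splitAt(xs.length / 2)
    val left = parMergeSort(l, cutoff)
    val right = parMergeSort(r, cutoff)
    for (s1 <- left; s2 <- right) yield merge(s1, s2)
  }
}

Await.result(parMergeSort((1 to 10000).toList.reverse), 5.seconds)
This uses the same imports as the snippets above; with a cutoff around a thousand elements only a handful of futures are created, so the thread pool is no longer the bottleneck.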
I have a big Excel file, which I read with the Excel type provider in F#.
The rows should be grouped by some column. Processing crashes with an OutOfMemoryException. I am not sure whether the Seq.groupBy call or the Excel type provider is to blame.
To simplify it I use 3D Point here as a row.
type Point = { x : float; y: float; z: float; }
let points = seq {
for x in 1 .. 1000 do
for y in 1 .. 1000 do
for z in 1 .. 1000 ->
{x = float x; y = float y; z = float z}
}
let groups = points |> Seq.groupBy (fun point -> point.x)
The rows are already ordered by the grouped column, e.g. 10 points with x = 10, then 20 points with x = 20, and so on. Instead of grouping them I just need to split the rows into chunks until the value changes. Is there some way to enumerate the sequence just once and get a sequence of rows split, not grouped, by some column value or some f(row) value?
If the rows are already ordered then this chunkify function will return a seq<'a list>. Each list will contain all the points with the same x value.
let chunkify pred s = seq {
    let values = ref []
    for x in s do
        match !values with
        | h :: t ->
            if pred h x then
                values := x :: !values
            else
                yield !values
                values := [x]
        | [] -> values := [x]
    yield !values
}
let chunked = points |> chunkify (fun x y -> x.x = y.x)
Here chunked has a type of
seq<Point list>
Another solution, along the same lines as Kevin's
module Seq =
    let chunkBy f src =
        seq {
            let chunk = ResizeArray()
            let mutable key = Unchecked.defaultof<_>
            for x in src do
                let newKey = f x
                if (chunk.Count <> 0) && (newKey <> key) then
                    yield chunk.ToArray()
                    chunk.Clear()
                key <- newKey
                chunk.Add(x)
            // yield the final chunk, which the loop above never emits
            if chunk.Count <> 0 then
                yield chunk.ToArray()
        }
// returns 2 arrays, each with 1000 elements
points |> Seq.chunkBy (fun pt -> pt.y) |> Seq.take 2
Here's a purely functional approach, which is surely slower, and much harder to understand.
module Seq =
    let chunkByFold f src =
        src
        |> Seq.scan (fun (chunk, (key, carry)) x ->
            let chunk = defaultArg carry chunk
            let newKey = f x
            if List.isEmpty chunk then [x], (newKey, None)
            elif newKey = key then x :: chunk, (key, None)
            else chunk, (newKey, Some([x]))) ([], (Unchecked.defaultof<_>, None))
        |> Seq.filter (snd >> snd >> Option.isSome)
        |> Seq.map fst
Let's start with the input
let count = 1000
type Point = { x : float; y: float; z: float; }
let points = seq {
for x in 1 .. count do
for y in 1 .. count do
for z in 1 .. count ->
{x = float x; y = float y; z = float z}
}
val count : int = 1000
type Point =
{x: float;
y: float;
z: float;}
val points : seq<Point>
If we try to evaluate points then we get an OutOfMemoryException:
points |> Seq.toList
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at Microsoft.FSharp.Collections.FSharpList`1.Cons(T head, FSharpList`1 tail)
at Microsoft.FSharp.Collections.SeqModule.ToList[T](IEnumerable`1 source)
at <StartupCode$FSI_0011>.$FSI_0011.main#()
Stopped due to error
It might be the same reason that groupBy fails, but I'm not sure. It does tell us, though, that we have to use seq and yield to return the groups. So we get this implementation:
let group groupBy points =
let mutable lst = [ ]
seq { for p in points do match lst with | [] -> lst <- [p] | p'::lst' when groupBy p' p -> lst <- p::lst | lst' -> lst <- [p]; yield lst' }
val group : groupBy:('a -> 'a -> bool) -> points:seq<'a> -> seq<'a list>
It is not the most easily read code. It takes each point from the points sequence and prepends it to an accumulator list while the groupBy function is satisfied. If the groupBy function is not satisfied then a new accumulator list is generated and the old one is yielded. Note that the order of the accumulator list is reversed.
Testing the function:
for g in group (fun p' p -> p'.x = p.x ) points do
printfn "%f %i" g.[0].x g.Length
Terminates nicely (after some time).
Other implementation with bug fix and better formatting.
let group (groupBy : 'a -> 'b when 'b : equality) points =
let mutable lst = []
seq {
yield! seq {
for p in points do
match lst with
| [] -> lst <- [ p ]
| p' :: lst' when (groupBy p') = (groupBy p) -> lst <- p :: lst
| lst' ->
lst <- [ p ]
yield (groupBy lst'.Head, lst')
}
yield (groupBy lst.Head, lst)
}
It seems there is no one-line purely functional solution or predefined Seq method that I have overlooked.
So, as an alternative, here is my own imperative solution. It is comparable to @Kevin's answer but fits my needs better. The ref cell contains:
The group key, which is calculated just once for each row
The current chunk list (it could be a seq to conform to Seq.groupBy), which contains the elements, in input order, for which f(x) equals the stored group key (this requires equality).
let splitByChanged f xs =
    let acc = ref (None, [])
    seq {
        for x in xs do
            match !acc with
            | None, _ ->
                acc := Some (f x), [x]
            | Some key, chunk when key = f x ->
                acc := Some key, x :: chunk
            | Some key, chunk ->
                let group = chunk |> Seq.toList |> List.rev
                yield key, group
                acc := Some (f x), [x]
        match !acc with
        | None, _ -> ()
        | Some key, chunk ->
            let group = chunk |> Seq.toList |> List.rev
            yield key, group
    }
points |> splitByChanged (fun point -> point.x)
The function has the following signature:
val splitByChanged :
f:('a -> 'b) -> xs:seq<'a> -> seq<'b * 'a list> when 'b : equality
Corrections and even better solutions are welcome.
I'm trying to write a function that uses references and destructively updates a sorted linked list while inserting a value.
My code is as follows:
Control.Print.printDepth := 100;
datatype 'a rlist = Empty | RCons of 'a * (('a rlist) ref);

fun insert(comp: (('a * 'a) -> bool), toInsert: 'a, lst: (('a rlist) ref)): unit =
  let
    val r = ref Empty;
    fun insert' comp toInsert lst =
      case !lst of
          Empty => r := (RCons(toInsert, ref Empty))
        | RCons(x, L) => if comp(toInsert, x) then r := (RCons(toInsert, lst))
                         else ((insert(comp, toInsert, L)); (r := (RCons(x, L))));
  in
    insert' comp toInsert lst; lst := !r
  end;

val nodes = ref (RCons(1, ref (RCons(2, ref (RCons(3, ref (RCons(5, ref Empty))))))));
val _ = insert((fn (x, y) => if x <= y then true else false), 4, nodes);
!nodes;
!nodes returns
val it = RCons (1,ref (RCons (2,ref (RCons (3,ref (RCons (4,%1)) as %1)))))
: int rlist
when it should return
val it = RCons (1,ref (RCons (2,ref (RCons (3,ref (RCons (4,ref (RCons (5,ref Empty)))))))))
: int rlist
It means that your code is buggy, and has returned a cyclic list, where the tail of ref(RCons(4, ...)) is actually the same ref(RCons(4, ...)) again.
Remarks:
You don't need to pass comp and toInsert to the inner function, they are already in scope.
if C then true else false is the same as writing just C.
In SML, you typically use comparison functions of type t * t -> order, and they are predefined in the library, see e.g. Int.compare.
About 70% of the parentheses in your code are redundant.
You don't normally want to use such a data structure in ML.
If you absolutely have to, here is how I would write the code:
datatype 'a rlist' = Empty | Cons of 'a * 'a rlist
withtype 'a rlist = 'a rlist' ref

fun insert compare x l =
  case !l of
      Empty => l := Cons(x, ref Empty)
    | Cons(y, l') =>
        case compare(x, y) of
            LESS => l := Cons(x, ref(!l))
          | EQUAL => () (* or whatever else you want to do in this case *)
          | GREATER => insert compare x l'
Changed
| RCons(x, L) => if comp(toInsert, x) then r := (RCons(toInsert, lst))
to
| RCons(x, L) => if comp(toInsert, x) then r := (RCons(toInsert, ref(!lst)))
I'm working on an experimental programming language that has global polymorphic type inference.
I recently got the algorithm working sufficiently well to correctly type the bits of sample code I'm throwing at it. I'm now looking for something more complex that will exercise the edge cases.
Can anyone point me at a source of really gnarly and horrible code fragments that I can use for this? I'm sure the functional programming world has plenty. I'm particularly looking for examples that do evil things with function recursion, as I need to check to make sure that function expansion terminates correctly, but anything's good --- I need to build a test suite. Any suggestions?
My language is largely imperative, but any ML-style code ought to be easy to convert.
My general strategy is actually to approach it from the opposite direction -- ensure that it rejects incorrect things!
That said, here are some standard "confirmation" tests I usually use:
The eager fix point combinator (unashamedly stolen from here):
datatype 'a t = T of 'a t -> 'a
val y = fn f => (fn (T x) => (f (fn a => x (T x) a)))
(T (fn (T x) => (f (fn a => x (T x) a))))
Obvious mutual recursion:
fun f x = g (f x)
and g x = f (g x)
Check out those deeply nested let expressions too:
val a = let
  val b = let
    val c = let
      val d = let
        val e = let
          val f = let
            val g = let
              val h = fn x => x + 1
            in h end
          in g end
        in f end
      in e end
    in d end
  in c end
in b end
Deeply nested higher order functions!
fun f g h i j k l m n =
fn x => fn y => fn z => x o g o h o i o j o k o l o m o n o x o y o z
I don't know if you have to have the value restriction in order to incorporate mutable references. If so, see what happens:
fun map' f [] = []
  | map' f (h::t) = f h :: map' f t

fun rev' [] = []
  | rev' (h::t) = rev' t @ [h]

val x = map' rev'
You might need to implement map and rev in the standard way :)
Then with actual references lying around (stolen from here):
val stack =
  let val stk = ref [] in
    {push = fn x => stk := x :: !stk,
     pop = fn () => stk := tl (!stk),
     top = fn () => hd (!stk)}
  end
Hope these help in some way. Make sure to try to build a set of regression tests you can re-run in some automatic fashion to ensure that all of your type inference behaves correctly through all changes you make :)