Errors in Overloaded Spark RDD Function zipPartitions

I'm trying to use the zipPartitions function defined on Spark's RDD class (Spark Scala docs: http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD).
The function is overloaded, with several alternatives:
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
I defined a function, merge, with the type signature:
merge(iter1: Iterator[(Int,Int)], iter2: Iterator[(Int,Int)]): Iterator[(Int,Int)]
and I have two RDDs of type RDD[(Int, Int)].
However, when I call RDD1.zipPartitions(RDD2, merge), the Spark shell throws an error:
error: missing arguments for method merge;
follow this method with `_' if you want to treat it as a partially applied function
This is strange, because elsewhere I am able to pass a function as an argument to another method just fine. However, if I add two underscores to merge and try
RDD1.zipPartitions(RDD2, merge(_: Iterator[(Int,Int)], _: Iterator[(Int,Int)])), then I get a different error:
error: overloaded method value zipPartitions with alternatives:
[B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], rdd4: org.apache.spark.rdd.RDD[D])(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$34: scala.reflect.ClassTag[B], implicit evidence$35: scala.reflect.ClassTag[C], implicit evidence$36: scala.reflect.ClassTag[D], implicit evidence$37: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], rdd4: org.apache.spark.rdd.RDD[D], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$30: scala.reflect.ClassTag[B], implicit evidence$31: scala.reflect.ClassTag[C], implicit evidence$32: scala.reflect.ClassTag[D], implicit evidence$33: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, C, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C])(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$27: scala.reflect.ClassTag[B], implicit evidence$28: scala.reflect.ClassTag[C], implicit evidence$29: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, C, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$24: scala.reflect.ClassTag[B], implicit evidence$25: scala.reflect.ClassTag[C], implicit evidence$26: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$22: scala.reflect.ClassTag[B], implicit evidence$23: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, V](rdd2: org.apache.spark.rdd.RDD[B], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$20: scala.reflect.ClassTag[B], implicit evidence$21: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V]
cannot be applied to (org.apache.spark.rdd.RDD[(Int, Int)], (Iterator[(Int, Int)], Iterator[(Int, Int)]) => Iterator[(Int, Int)])
val RDD_combined = RDD1.zipPartitions(RDD1:org.apache.spark.rdd.RDD[(Int, Int)],merge(_:Iterator[(Int,Int)],_:Iterator[(Int,Int)]))
I suspect the problem lies in this last line. The overload I'm trying to match with this call is:
[B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$22: scala.reflect.ClassTag[B], implicit evidence$23: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V]
However, what Scala sees is
val RDD_combined = RDD1.zipPartitions(RDD1:org.apache.spark.rdd.RDD[(Int, Int)],merge(_:Iterator[(Int,Int)],_:Iterator[(Int,Int)]))
where the [B] type parameter has already been inferred as [(Int, Int)].
Any insights into how to get this to work would be much appreciated!

If you look at the signatures, you'll see that zipPartitions is actually a method with multiple parameter lists, not a single list with multiple parameters. The invocation you need looks more like:
RDD1.zipPartitions(RDD1)(merge)
(I'm not sure the type ascriptions you added in your original are needed.)
There may still be other adjustments required to make this work, but that is the essence of fixing the error you currently see.
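For concreteness, here is a minimal sketch of the corrected call as it might look in the Spark shell; the merge body and sample data are assumptions for illustration, and sc is the shell's SparkContext:
import org.apache.spark.rdd.RDD

// A hypothetical merge that simply concatenates the two partition iterators.
def merge(iter1: Iterator[(Int, Int)], iter2: Iterator[(Int, Int)]): Iterator[(Int, Int)] =
  iter1 ++ iter2

// Both RDDs get the same number of partitions, which zipPartitions requires.
val RDD1: RDD[(Int, Int)] = sc.parallelize(Seq((1, 2), (3, 4)), 2)
val RDD2: RDD[(Int, Int)] = sc.parallelize(Seq((5, 6), (7, 8)), 2)

// The RDD and the function go in separate parameter lists, so the function
// is passed in its own set of parentheses; no underscores are needed.
val RDD_combined = RDD1.zipPartitions(RDD2)(merge)
Writing it as RDD1.zipPartitions(RDD2)(merge _) should also compile, since the expected type of the second parameter list lets the compiler eta-expand merge into a function value.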

Related

Remove Self from Callable Type signature to match Instance method

The motivation for this is to type check event handlers, ensuring that the argument types the registered events expect match those the handler is primed to give.
I am trying to track function signatures in type annotations for a class based function decorator. This is just a mypy stubs project: the actual implementation will get the same result in a different way.
So, we have a basic decorator skeleton like so
from typing import Any, Callable, Generic, TypeVar

FuncT = TypeVar("FuncT", bound=Callable)

class decorator(Generic[FuncT]):
    def __init__(self, method: FuncT) -> None:
        ...  # Allows mypy to infer the parameter type
    __call__: FuncT
    execute: FuncT
With the following stub example
class Widget:
    def bar(self: Any, a: int) -> int:
        ...

    @decorator
    def foo(self: Any, a: int) -> int:
        ...
w = Widget()
reveal_type(Widget.bar)
reveal_type(w.bar)
reveal_type(Widget.foo.__call__)
reveal_type(w.foo.__call__)
The revealed types are as follows:
Widget.bar (undecorated class method): 'def (self: demo.Widget, a: builtins.int) -> builtins.int'
w.bar (undecorated instance method): 'def (a: builtins.int) -> builtins.int'
Widget.foo.__call__ (decorated class method): 'def (self: demo.Widget, a: builtins.int) -> builtins.int'
w.foo.__call__ (decorated instance method): 'def (self: demo.Widget, a: builtins.int) -> builtins.int'
The implication is that if I call w.bar(2) it passes the type checker, but if I call w.foo(2) or w.foo.execute(2), mypy complains that there aren't enough parameters. Meanwhile, Widget.bar(w, 2), Widget.foo(w, 2), and Widget.foo.execute(w, 2) all pass fine.
What I'm after is a way to annotate this to persuade w.foo.__call__ and w.foo.execute to give the same signature as w.bar.
This is now possible using ParamSpec from PEP 612. It also requires an intermediate class overloading __get__ to distinguish class from instance access.
from typing import Any, Callable, Generic, Optional, TypeVar, overload
from typing_extensions import Concatenate, ParamSpec  # in typing on Python 3.10+

FuncT = TypeVar("FuncT", bound=Callable)
FuncT2 = TypeVar("FuncT2", bound=Callable)
SelfT = TypeVar("SelfT")
ParamTs = ParamSpec("ParamTs")
R = TypeVar("R")

class DecoratorCallable(Generic[FuncT]):
    __call__: FuncT

# FuncT and FuncT2 refer to the method signature with and without self
class DecoratorBase(Generic[FuncT, FuncT2]):
    @overload
    def __get__(self, instance: None, owner: object) -> DecoratorCallable[FuncT]:
        # when a method is accessed directly, instance will be None
        ...
    @overload
    def __get__(self, instance: object, owner: object) -> DecoratorCallable[FuncT2]:
        # when a method is accessed through an instance, instance will be that object
        ...
    def __get__(self, instance: Optional[object], owner: object) -> DecoratorCallable:
        ...

def decorator(
    f: Callable[Concatenate[SelfT, ParamTs], R]
) -> DecoratorBase[Callable[Concatenate[SelfT, ParamTs], R], Callable[ParamTs, R]]:
    ...

class Widget:
    def bar(self: Any, a: int) -> int:
        ...

    @decorator
    def foo(self: Any, a: int) -> int:
        ...
With the same class as before, the revealed types are now
Widget.bar (undecorated class method): 'def (self: Any, a: builtins.int) -> builtins.int'
w.bar (undecorated instance method): 'def (a: builtins.int) -> builtins.int'
Widget.foo.__call__ (decorated class method): 'def (Any, a: builtins.int) -> builtins.int'
w.foo.__call__ (decorated instance method): 'def (a: builtins.int) -> builtins.int'
This means that mypy will correctly allow Widget.foo(w, 2) and w.foo(2), and will correctly disallow Widget.foo(w, "A"), w.foo("A"), w.foo(2, 5), and x: dict = w.foo(2). It also allows keyword arguments: w.foo(a=2) passes.
One corner case it fails on is that it forgets the name of self, so Widget.foo(self=w, a=2) fails with Unexpected keyword argument "self".

How to convert RDD Structure

How to convert
RDD[(String, (((A, B), C), D))]
to
RDD[(String, (A, B, C, D))]
Do I need to use flatMapValues? I have no idea how to use it.
Can anybody help with this?
You can just use mapValues and select the values from the nested tuple:
rdd.mapValues(x => (x._1._1._1, x._1._1._2, x._1._2, x._2))
This is more of a Scala question than a Spark one. Alternatively, try a pattern match:
rdd.mapValues { case (((a, b), c), d) => (a, b, c, d) }
mapValues is important as it maintains the partitioner of the RDD, if any.
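For concreteness, here is a minimal sketch of both approaches; the element types and sample data are assumptions for illustration, and sc is a SparkContext:
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, (((Int, String), Double), Long))] =
  sc.parallelize(Seq(("key", (((1, "a"), 2.0), 3L))))

// Positional tuple accessors...
val flat1 = rdd.mapValues(x => (x._1._1._1, x._1._1._2, x._1._2, x._2))

// ...or the more readable pattern match; both produce
// RDD[(String, (Int, String, Double, Long))].
val flat2 = rdd.mapValues { case (((a, b), c), d) => (a, b, c, d) }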

Are there any ways to recursively flatten tuples?

In Rust, is there any way to use traits and impls to (recursively) flatten tuples?
If it helps, something that works with N nested pairs is a good start
trait FlattenTuple {
    fn into_flattened(self) -> /* ??? */
}

// such that
assert_eq!((1, (2, 3)).into_flattened(), (1, 2, 3))
It would be even better if it could be extended to work with any kind of nested tuple, such that:
assert_eq!(((1, 2), 2, (3, (4, 5))).into_flattened(), (1, 2, 2, 3, 4, 5))
Maybe for certain small definitions of "flatten", but realistically not really.
Start with the most specific implementation:
trait FlattenTuple {
    fn into_flattened(self) -> (u8, u8, u8);
}

impl FlattenTuple for (u8, (u8, u8)) {
    fn into_flattened(self) -> (u8, u8, u8) {
        (self.0, (self.1).0, (self.1).1)
    }
}
Then make it a bit more generic:
trait FlattenTuple {
    type Output;
    fn into_flattened(self) -> Self::Output;
}

impl<A, B, C> FlattenTuple for (A, (B, C)) {
    type Output = (A, B, C);
    fn into_flattened(self) -> Self::Output {
        (self.0, (self.1).0, (self.1).1)
    }
}
And then repeat for every possible permutation:
impl<A, B, C, D, E, F> FlattenTuple for ((A, B), C, (D, (E, F))) {
    type Output = (A, B, C, D, E, F);
    fn into_flattened(self) -> Self::Output {
        ((self.0).0, (self.0).1, self.1, (self.2).0, ((self.2).1).0, ((self.2).1).1)
    }
}
These two implementations cover your two cases.
However, you'd then have to enumerate every input type you'd like, probably via code generation. There's no way I'm aware of to "inspect" the input type and then "splice" it into the output type.
You can even try to write something somewhat recursive:
impl<A, B, C, D, E, F> FlattenTuple for (A, B)
where
    A: FlattenTuple<Output = (C, D)>,
    B: FlattenTuple<Output = (E, F)>,
{
    type Output = (C, D, E, F);
    fn into_flattened(self) -> Self::Output {
        let (a, b) = self;
        let (c, d) = a.into_flattened();
        let (e, f) = b.into_flattened();
        (c, d, e, f)
    }
}
But this will quickly run into base-case issues: a terminal value like 42 doesn't implement FlattenTuple, and if you try impl<T> FlattenTuple for T you will hit conflicting trait implementations.

Existential type in higher order function

I've got a function whose job is to compute some optimal value of type a with respect to some value function of type a -> v:
type OptiF a v = (a -> v) -> a
Then I have a container that stores such a function together with another function which uses the resulting value:
data Container a = forall v. (Ord v) => Cons (OptiF a v) (a -> Int)
The idea is that whoever implements a function of type OptiF a v should not be bothered with the details of v except that it's an instance of Ord.
So I've written a function which takes such a value function and a container. Using the OptiF a v, it should compute the optimal value with respect to val and plug it into the container's result function:
optimize :: (forall v. (Ord v) => a -> v) -> Container a -> Int
optimize val (Cons opti result) = result (opti val)
So far so good, but I can't call optimize, because
callOptimize :: Int
callOptimize = optimize val cont
  where val = (*3)
        opti val' = if val' 1 > val' 0 then 100 else -100
        cont = Cons opti (*2)
does not compile:
Could not deduce (v ~ Int)
from the context (Ord v)
bound by a type expected by the context: Ord v => Int -> v
at bla.hs:12:16-32
`v' is a rigid type variable bound by
a type expected by the context: Ord v => Int -> v at bla.hs:12:16
Expected type: Int
Actual type: Int
Expected type: Int -> v
Actual type: Int -> Int
In the first argument of `optimize', namely `val'
In the expression: optimize val cont
where line 12:16-32 is optimize val cont.
Am I misunderstanding existential types in this case? Does the forall v in the declaration of optimize mean that optimize may expect from a -> v whatever v it wants? Or does it mean that optimize may expect nothing from a -> v except that Ord v?
What I want is that the OptiF a v is not fixed for any v, because I want to plug in some a -> v later on. The only constraint I'd like to impose is Ord v. Is it even possible to express something like that using existential types (or whatever)?
I managed to achieve that with an additional typeclass which provides an optimize function with a similar signature to OptiF a v, but that looks much uglier to me than using higher order functions.
This is something that's easy to get wrong.
What you have in the signature of optimize is not an existential, but a universal.
Since existentials are somewhat outdated anyway, let's rewrite your data type in GADT form, which makes the point clearer, as the syntax is essentially the same as for polymorphic functions:
data Container a where
  (:/->) :: Ord v =>  -- come on, you can't call this `Cons`!
          OptiF a v -> (a -> Int) -> Container a
Observe that the Ord constraint (which implicitly carries the forall v) stands outside of the type-variable–parameterised function signature, i.e. v is a parameter we can dictate from the outside when we want to construct a Container value. In other words,
For all v in Ord there exists the constructor (:/->) :: OptiF a v -> (a->Int) -> Container a
which is what gives rise to the name "existential type". Again, this is analogous to an ordinary polymorphic function.
On the other hand, in the signature
optimize :: (forall v. (Ord v) => a -> v) -> Container a -> Int
you have a forall inside the signature term itself, which means that what concrete type v may take on will be determined by the callee, optimize, internally – all we have control over from the outside is that it be in Ord. Nothing "existential" about that, which is why this signature won't actually compile with XExistentialQuantification or XGADTs alone:
<interactive>:37:26:
Illegal symbol '.' in type
Perhaps you intended -XRankNTypes or similar flag
to enable explicit-forall syntax: forall <tvs>. <type>
val = (*3) obviously doesn't fulfill (forall v. (Ord v) => a -> v); it actually requires a Num instance, which not all Ords have. Indeed, optimize shouldn't need the rank-2 type at all: it should work for any Ord type v the caller might give to it.
optimize :: Ord v => (a -> v) -> Container a -> Int
in which case your implementation doesn't work anymore, though: since (:/->) is really an existential constructor, it only contains some OptiF function for a single unknown type v1. So the caller of optimize would be free to choose the value function at one type while the container's opti-function is fixed at some possibly different type – that can't work!
The solution that you want is this: Container shouldn't be existential, either! The opti-function should work for any type which is in Ord, not just for one particular type. Well, as a GADT this looks about the same as the universally-quantified signature you originally had for optimize:
data Container a where
  (:/->) :: (forall v. Ord v => OptiF a v) -> (a -> Int) -> Container a
With that, optimize now works
optimize :: Ord v => (a -> v) -> Container a -> Int
optimize val (opti :/-> result) = result (opti val)
and can be used as you wanted
callOptimize :: Int
callOptimize = optimize val cont
  where val = (*3)
        opti val' = if val' 1 > val' 0 then 100 else -100
        cont = opti :/-> (*2)

Is it necessary to specify every superclass in a class context of a class declaration?

The ArrowList class from the hxt package has the following declaration:
class (Arrow a, ArrowPlus a, ArrowZero a, ArrowApply a) => ArrowList a where ...
The ArrowPlus class is declared as:
class ArrowZero a => ArrowPlus a where ...
The ArrowZero class is declared as:
class Arrow a => ArrowZero a where ...
And the ArrowApply class is declared as:
class Arrow a => ArrowApply a where ...
Why can't it just be written as:
class (ArrowPlus a, ArrowApply a) => ArrowList a where ...?
No, it's not necessary to include all the superclasses. If you write
class (ArrowPlus a, ArrowApply a) => ArrowList a where
it will work. However, here are two possible reasons for mentioning all the superclasses explicitly.
It might be more readable, as you can tell at a glance what all the superclasses are.
It might be slightly more efficient: listing the superclasses explicitly results in a direct dictionary lookup at runtime, while for a transitive superclass the generated code first looks up the superclass dictionary and then looks up the class member in that.
For example, take this inheritance chain:
module Example where
class Foo a where
foo :: a -> String
class Foo a => Bar a
class Bar a => Baz a
class Baz a => Xyzzy a
quux :: Xyzzy a => a -> String
quux = foo
Looking at the generated core for this (with ghc -c -ddump-simpl), we see that this generates a chain of lookup calls. It first looks up the dictionary for Baz in Xyzzy, then Bar in that, then Foo, and finally it can look up foo.
Example.quux
  :: forall a_abI. Example.Xyzzy a_abI => a_abI -> GHC.Base.String
[GblId, Arity=1, Caf=NoCafRefs]
Example.quux =
  \ (@ a_acE) ($dXyzzy_acF :: Example.Xyzzy a_acE) ->
    Example.foo
      @ a_acE
      (Example.$p1Bar
         @ a_acE
         (Example.$p1Baz @ a_acE (Example.$p1Xyzzy @ a_acE $dXyzzy_acF)))
Modifying the definition of Xyzzy to explicitly mention Foo:
class (Foo a, Baz a) => Xyzzy a
We see that it can now get the Foo dictionary straight from the Xyzzy one and look up foo in that.
Example.quux
  :: forall a_abD. Example.Xyzzy a_abD => a_abD -> GHC.Base.String
[GblId, Arity=1, Caf=NoCafRefs]
Example.quux =
  \ (@ a_acz) ($dXyzzy_acA :: Example.Xyzzy a_acz) ->
    Example.foo @ a_acz (Example.$p1Xyzzy @ a_acz $dXyzzy_acA)
Note that this may be GHC-specific. Tested with version 7.0.2.
