How to get the list below using a Spark RDD? - apache-spark

List(1,2,3,4,...,100) ==> List((1,2),(2,3),(3,4),...,(100,101)) ==> List(3,5,7,...,201)
scala> x.map(x=>x,x+1).map(x=>x._1+x._2)
:26: error: too many arguments (2) for method map: (f: Int => B)(implicit bf: scala.collection.generic.CanBuildFrom[List[Int],B,That])That
x.map(x=>x,x+1).map(x=>x._1+x._2)
I am trying to transform the values 1 to 100, but I am getting the above error. Is there any issue with the code?

Your first map call is incorrect: without parentheses around the tuple, x => x and x + 1 are parsed as two separate arguments to map, which is what the "too many arguments (2)" error is telling you.
Try this:
input.map(x => (x,x+1)).map(x => x._1 + x._2)
That said, there is no need for two map calls when you can do it in one like this:
input.map(x => x + x + 1)
The above expression gives you the same result.
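If the input needs to be a Spark RDD rather than a local List, the same chain works there too; a minimal sketch, assuming a SparkContext named sc:

val input = sc.parallelize(1 to 100)
// pair each element with its successor, then sum each pair
val sums = input.map(x => (x, x + 1)).map(x => x._1 + x._2)
sums.collect()  // Array(3, 5, 7, ..., 201)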

Related

How can you store the results from a forEach in Spark

DataSet#foreach(f) applies the function f to each row in the dataset. In a clustered environment, the data is split across the cluster. How can the results from each of these functions be collected?
For example, say the function would count the number of characters stored in each row. How can you create a DataSet or RDD that contains the results of each of these functions applied to each row?
The definition for foreach looks something like:
final def foreach(f: (A) ⇒ Unit): Unit
f : The function that is applied for its side-effect to every element.
The result of function f is discarded
foreach in Scala is generally used to denote the usage of a function that involves a side-effect, e.g. printing to STDOUT.
If you want to return something by applying a particular function, you'll have to use map
final def map[B](f: (A) ⇒ B): List[B]
I copied the syntax from the documentation for List but it'll be something similar for RDDs as well.
As you can see, it applies the function f to elements of type A and returns a collection of type B, where A and B can be the same type as well.
val rdd = sc.parallelize(Array(
  "String1",
  "String2",
  "String3"))
scala> rdd.foreach(x => (x, x.length) )
// Nothing happens
rdd.map(x => (x, x.length) ).collect
// Array[(String, Int)] = Array((String1,7), (String2,7), (String3,7))
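Since the question is framed in terms of Dataset, here is a minimal sketch of the same idea with the Dataset API, assuming a SparkSession named spark with spark.implicits._ imported:

import spark.implicits._

val ds = Seq("String1", "String2", "String3").toDS()
// unlike foreach, map returns a new Dataset holding the per-row results
val lengths = ds.map(row => (row, row.length))
lengths.collect()  // Array((String1,7), (String2,7), (String3,7))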

How to filter a typed Spark Dataset using pattern matching

I'm trying to move to the typed Dataset API, but I'm stuck on filtering:
val ds: Dataset[(Int, Int)] = Seq((1,1)).toDS
ds.filter(ij => ij._1 > ij._2) // does work, but is not readable
ds.filter{case (i,j) => i<j} // does not work
Error:(36, 14) missing parameter type for expanded function The
argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: ?
I don't understand why pattern matching does not work with filter, while it works fine with map:
ds.map{case (i,j) => i+j}
Make it explicit:
ds.filter{x => x match { case (i,j) => i < j}}
Apparently it's a bug: https://issues.apache.org/jira/browse/SPARK-19492. Thanks to Bogdan for the information.
This is a little bit more readable:
val ds: Dataset[(Int, Int)] = Seq((1,1)).toDS
ds.filter('_1 > '_2)
Note: you need to import spark.implicits._
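If you prefer to avoid the nested match, another possible workaround (a sketch, not from the original answers) is to tuple an ordinary two-argument function so it fits the typed filter overload:

// .tupled turns (Int, Int) => Boolean into ((Int, Int)) => Boolean,
// which matches filter's func: T => Boolean overload for T = (Int, Int)
ds.filter(((i: Int, j: Int) => i < j).tupled)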

Pairing two RDD-s

Let's say I have two RDD-s, where one is a map of the other. For example:
RDD[Double] N;
RDD[Double] logN = N.map(x => Math.Log(x));
And I want to operate on matching pairs from both of them.
Something like this:
RDD[Double] NlogN = (N,logN).map((x,y) => x*y);
Is this kind of an operation available in spark?
You are looking for zip
N.zip(logN).map { case (x, y) => ... }
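A minimal end-to-end sketch, with assumed example values and a SparkContext named sc:

val N = sc.parallelize(Seq(1.0, 2.0, 3.0))
val logN = N.map(x => math.log(x))
// zip pairs elements by position; it requires both RDDs to have the same
// number of partitions and the same number of elements per partition,
// which holds here because logN was produced from N by a map
val NlogN = N.zip(logN).map { case (x, y) => x * y }
NlogN.collect()  // Array(0.0, 1.386..., 3.295...)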

Reading and learning Spark API?

I am learning Spark by example, but I don't know a good way to understand the API. For instance, the very classic word count example:
val input = sc.textFile("README.md")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
When I read the reduceByKey API, I see:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
OK, from the example I know (x, y) is (V, V), and that should be the value part of the pair. I give a function to compute the V and I get RDD[(K, V)]. My question is: in such an example, in reduceByKey(func: (V, V) ⇒ V), why two Vs? Are the first and second V in (V, V) the same or not?
I guess I'm asking this, and phrasing the question this way, because I still don't know how to correctly read the API, or maybe I'm missing some basic Spark concept.
In the code below:
reduceByKey((x, y) => x + y)
you could, for more clarity, read it as something like this:
reduceByKey((sum, addend) => sum + addend)
So, for every key, that function is applied across every element with that key.
Basically, (func: (V, V) ⇒ V) means that you have a function with two inputs of a certain type (let's say Int) which returns a single output of the same type.
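To see the pairwise application concretely, here is a tiny plain-Scala sketch; the values mirror key1 from the example in the next answer:

// reduce applies the function pairwise: (2 + 20) = 22, then (22 + 2) = 24;
// reduceByKey does the same per key, possibly across partitions,
// which is why the function should be associative
val valuesForKey1 = List(2, 20, 2)
valuesForKey1.reduce((x, y) => x + y)  // 24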
Usually the data sets will be of the form ("key1",val11),("key2",val21),("key1",val12),("key2",val22), and so on.
The same key can appear with multiple values in the RDD[(K,V)].
When you use reduceByKey, the function is applied to the values for each key.
For example, consider the following program:
val data = Array(("key1",2),("key1",20),("key2",21),("key1",2),("key2",10),("key2",33))
val rdd = sc.parallelize(data)
val res = rdd.reduceByKey((x,y) => x+y)
res.foreach(println)
You will get the output:
(key2,64)
(key1,24)
Here the sequence of values is passed to the function. For key1 -> (2, 20, 2).
In the end, you will have a single value for each key.
You can use the Spark shell to try out the APIs.

String method to change particular element in Scala

I need to write a method in Scala that overrides the toString method. I wrote it, but I also have to check whether there is an element that is 1; if so, I should change it to 'a', otherwise write the list as it is with the string method. Any suggestions how this can be done?
What error are you getting? It seems to work for me:
class Wrapper {  // wrapped in a class so that toString can be overridden
  val l = List(1, 2, 3)

  override def toString(): String = {
    val t = l.map {
      case 1 => "a"
      case x => x
    }
    t.toString
  }
}

println(new Wrapper)
This prints List(a, 2, 3).
I see from the comments on your question that list is a List[List[Int]].
Look at the beginning of your code:
list.map { case 1 => 'a'; case x => x}
map expects a function that takes an element of list as a parameter - a List[Int], in your case. But your code works directly on Int.
With this information, it appears that the error you get is entirely correct: you declared a method that expects an Int, but you pass a List[Int] to it, which is indeed a type mismatch.
Try this:
list.map {_.map { case 1 => 'a'; case x => x}}
This way, the function you defined to transform 1 to a and leave everything else alone is applied to list's sublists, and this type-checks: you're applying a function that expects an Int to an Int.
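For illustration, with an assumed nested list:

val list = List(List(1, 2), List(3, 1))
list.map { _.map { case 1 => 'a'; case x => x } }
// List(List(a, 2), List(3, a))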
