RDD.distinct() ++ RDD makes the process never stop - apache-spark

In our data processing there is a requirement: AlertRDD is used to produce both the unique alert count and the non-unique alert count for one day.
So I wrote the method below to cover both of them:
import org.apache.spark.rdd.RDD

implicit class RDDOps(val rdd: RDD[(String, String)]) {
  def addDistinctRDD() = {
    // distinct alerts tagged "AlertUniqueUsers", unioned with all alerts tagged "Alerts"
    rdd.distinct().map {
      case (elem1, elem2) => (elem1, elem2, "AlertUniqueUsers")
    } ++ rdd.map {
      case (elem1, elem2) => (elem1, elem2, "Alerts")
    }
  }
}
When I run AlertRDD.addDistinctRDD.save(//path), the program never stops, but if I run it as AlertRDD.cache.addDistinctRDD.save(//path), it works well.
Does anybody know the reason?
Update
It looks like nobody can answer my question; probably it is not well understood.
Why do I union the unique AlertRDD and the non-unique AlertRDD together?
Because the follow-up processing of the two RDDs is the same, something like the excerpt below:
.map((_, 1))
.reduceByKey(_ + _)
.map()
....
.save($path)
If my solution is not a good one, can anyone suggest a better one?
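One workaround, consistent with the observation above that cache makes the job finish, is to cache the source RDD once and build the union from the cached reference, since both the distinct and the non-distinct branch traverse the same data. A rough, untested sketch using the names from the question:
// Sketch only: cache AlertRDD once so both branches reuse the same
// materialized data instead of recomputing the source twice.
val cached = AlertRDD.cache()
val combined =
  cached.distinct().map { case (e1, e2) => (e1, e2, "AlertUniqueUsers") } ++
  cached.map { case (e1, e2) => (e1, e2, "Alerts") }
// ...then continue with the map/reduceByKey/save steps from the update above.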

Related

How to loop through a map and display only one row if there is matching values

I have a map with different keys and multiple values. If there is any matching job among the different keys, I have to display only one row, grouping the code values.
def data = ['Test1':[[name:'John', dob:'02/20/1970', job:'Testing', code:51], [name:'X', dob:'03/21/1974', job:'QA', code:52]],
            'Test2':[name:'Michael', dob:'04/01/1973', job:'Testing', code:52]]

for (Map.Entry<String, List<String>> entry : data.entrySet()) {
    String key = entry.getKey();
    List<String> values = entry.getValue();
    values.eachWithIndex { itr, index ->
        println("key is:" + key);
        println("itr values are:" + itr);
    }
}
Expected Result : [job:Testing,code:[51,52]]
First flatten all the relevant maps, so we just have a flat list of all of them; then do basically the same as the others suggested: group by the job and just keep the codes (via the spread operator).
def data = ['Test1':[[name:'John',dob:'02/20/1970',job:'Testing',code:51],[name:'X',dob:'03/21/1974',job:'QA',code:52]], 'Test2':[name:'Michael',dob:'04/01/1973',job:'Testing',code:52]]
assert data.values().flatten().groupBy{it.job}.collectEntries{ [it.key, it.value*.code] } == [Testing: [51, 52], QA: [52]]
Note: the question may be changed according to the comments on the other answers.
The above code will give you the jobs and their codes.
As of now, it's not clear what the new expected output should be.
You can use Groovy's collection methods.
First you need to extract the lists, since you don't need the keys of the top-level element:
def jobs = data.values()
Then you can use the groupBy method to group by the key "job"
def groupedJobs = jobs.groupBy { it.job }
The above code will produce the following result with your example
[Testing:[[name:John, dob:02/20/1970, job:Testing, code:51], [name:Michael, dob:04/01/1973, job:Testing, code:52]]]
Now you can keep only the codes as values, and turn the job into the key, with the following collect call:
def result = groupedJobs.collect { key, value ->
    [job: key, code: value.code]
}
The following code (which uses your sample data set):
def data = ['Test1':[name:'John', dob:'02/20/1970', job:'Testing', code:51],
            'Test2':[name:'Michael', dob:'04/01/1973', job:'Testing', code:52]]
def matchFor = 'Testing'
def result = [job: matchFor, code: data.findResults { _, m ->
    m.job == matchFor ? m.code : null
}]
println result
results in:
~> groovy solution.groovy
[job:Testing, code:[51, 52]]
~>
when run. It uses Groovy's Map.findResults method to collect the codes from the matching jobs.

Map error: wrong number of parameters

I am new to Spark programming and I got stuck while using map.
My data RDD contains:
Array[(String, Int)] = Array((steve,5), (bill,4), (" amzon",6), (flikapr,7))
When I use map on it again, I get the error below:
data.map((k,v) => (k,v+1))
<console>:32: error: wrong number of parameters; expected = 1
data.map((k,v) => (k,v+1))
I am trying to pass a key-value tuple and want to get back a tuple with 1 added to the value.
Please help; why am I getting this error?
Thanks.
You almost got it. rdd.map() operates on each record of the RDD and in your case, that record is a tuple. You can simply access the tuple members using Scala's underscore accessors like this:
val data = sc.parallelize(Array(("steve",5), ("bill",4), ("amzon",6), ("flikapr",7)))
data.map(t => (t._1, t._2 + 1)).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
Or better yet, use Scala's powerful pattern matching like this:
data.map({ case (k, v) => (k, v+1) }).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
Here's the best so far -- key-value tuples are so common in Spark that we usually refer to them as PairRDDs and they come with plenty of convenience functions. For your use case, you only need to operate on the value without changing the key. You can simply use mapValues():
data.mapValues(_ + 1).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)

Why do multiple print() methods in Spark Streaming affect the values in my list?

I'm trying to receive one JSON line every two seconds, store the lines in a List whose elements are of a custom class I created, and print the resulting List at the end of each batch. So I'm doing something like this:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaReceiverInputDStream<String> streamData = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
        StorageLevels.MEMORY_AND_DISK_SER);

JavaDStream<LinkedList<StreamValue>> getdatatoplace = streamData.map(new Function<String, LinkedList<StreamValue>>() {
    @Override
    public LinkedList<StreamValue> call(String s) throws Exception {
        // Access specific attributes in the JSON
        Gson gson = new Gson();
        Type type = new TypeToken<Map<String, String>>() {
        }.getType();
        Map<String, String> retMap = gson.fromJson(s, type);

        String a = retMap.get("exp");
        String idValue = retMap.get("id");

        // insert values into the stream_Value LinkedList
        stream_Value.push(new StreamValue(idValue, a, UUID.randomUUID()));
        return stream_Value;
    }
});
getdatatoplace.print();
This works very well, and I get the following result:
//at the end of the first batch duration/cycle
getdatatoplace[]={json1}
//at the end of the second batch duration/cycle
getdatatoplace[]={json1,json2}
...
However, if I do multiple prints of getdatatoplace, let's say 3:
getdatatoplace.print();
getdatatoplace.print();
getdatatoplace.print();
then I get this result:
//at the end of the first print
getdatatoplace[]={json1}
//at the end of the second print
getdatatoplace[]={json1,json1}
//at the end of the third print
getdatatoplace[]={json1,json1,json1}
//Context ends with getdatatoplace.size()=3
//New cycle begins, and I get a new value json2
//at the end of the first print
getdatatoplace[]={json1,json1,json1,json2}
...
So what happens is that, for each print that I do, stream_Value pushes a value onto my List, even though I only call stream_Value.push once and the batch duration hasn't ended yet.
My question is: why does this happen, and how do I make it so that, independently of the number of print() calls I use, I get just one JSON line stored in my list per batch duration/execution?
I hope I was not confusing, as I am new to Spark and may have mixed up some of the vocabulary. Thank you so much.
PS: Even if I print another DStream, the same thing happens. Say I do this, each with the same 'architecture' as the stream above:
JavaDStream1.print();
JavaDStream2.print();
At the end of JavaDStream2.print(), the list within JavaDStream1 has one extra value.
Spark Streaming uses the same computation model as Spark. The operations we declare on the data form a Directed Acyclic Graph (DAG) that's evaluated when actions are used to materialize those computations on the data.
In Spark Streaming, output operations such as print() schedule the materialization of these operations at every batch interval.
The DAG for this Stream would look something like this:
[TextStream]->[map]->[print]
print will schedule the map operation on the data received by the socketTextStream. When we add more print actions, our DAG looks like:
             /->[map]->[print]
[TextStream] ->[map]->[print]
             \->[map]->[print]
And here the issue should become visible. The map operation is executed three times. That's expected behavior and normally not an issue, because map is supposed to be a stateless transformation.
The root cause of the problem here is that map contains a mutation operation, as it adds elements to a global collection stream_Value defined outside the scope of the function passed to map.
This not only causes the duplication issue, but will also not work in general when Spark Streaming runs in its usual cluster mode.
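To make the idea concrete, here is a minimal Scala sketch of the stateless pattern described above (the question is in Java, but the same applies): each line is mapped to its own value instead of being pushed into a shared list, so adding extra output operations cannot change the contents of any collection. The StreamValue case class and the simplified field extraction are stand-ins, not the asker's actual class.
import java.util.UUID
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-in for the asker's StreamValue class.
case class StreamValue(id: String, exp: String, uuid: UUID)

object StatelessStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("sketch"), Seconds(2))
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)

    // Each input line becomes its own StreamValue; JSON parsing is omitted here.
    val values = lines.map(line => StreamValue(line.hashCode.toString, line, UUID.randomUUID()))

    values.print()
    values.print() // a second output op just re-runs the stateless map; nothing accumulates

    ssc.start()
    ssc.awaitTermination()
  }
}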

Compare each value of one RDD to each key/value pair of another RDD

This has been bothering me for a while and I am sure I am being very brainless.
I have two RDDs of key/value pairs, corresponding to a name and associated sparse vector:
RDDA = [ (nameA1, sparsevectorA1), (nameA2, sparsevectorA2), (nameA3, sparsevectorA3) ]
RDDB = [ (nameB1, sparsevectorB1), (nameB2, sparsevectorB2) ]
I want the end result to compare each element of the first RDD against each element in the second, producing an RDD of 3 * 2 = 6 elements. In particular, I want the name of the element in the second RDD and the dot product of the two sparsevectors:
RDDC = [ (nameB1, sparsevectorA1.dot(sparsevectorB1)), (nameB2, sparsevectorA1.dot(sparsevectorB2)),
(nameB1, sparsevectorA2.dot(sparsevectorB1)), (nameB2, sparsevectorA2.dot(sparsevectorB2)),
(nameB1, sparsevectorA3.dot(sparsevectorB1)), (nameB2, sparsevectorA3.dot(sparsevectorB2)) ]
Is there an appropriate map or inbuilt function to do this?
I assume such an operation must exist hence my feeling of brainlessness. I can easily and inelegantly do this if I collect the two RDDs and then implement a for loop, but of course that is not satisfactory as I want to keep them in RDD form.
Thanks for your help!
Is there an appropriate map or inbuilt function to do this?
Yes, there is and it is called cartesian.
def transform(ab):
    (_, vec_a), (name_b, vec_b) = ab
    return name_b, vec_a.dot(vec_b)

rddA.cartesian(rddB).map(transform)
The problem is that a Cartesian product on a large dataset is usually a really bad idea, and there is usually a much better approach out there.
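For instance, when one of the two RDDs is small enough to fit in driver memory, a common alternative is to collect and broadcast it and compute the dot products in a single flatMap over the larger RDD. A rough Scala sketch under that assumption; the dot helper and dense arrays stand in for whatever sparse-vector type is actually in use:
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDotSketch {
  // Stand-in dot product over dense arrays; replace with the real sparse-vector dot.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

    val rddA   = sc.parallelize(Seq(("nameA1", Array(1.0, 0.0)), ("nameA2", Array(0.0, 2.0))))
    val smallB = Seq(("nameB1", Array(1.0, 1.0)), ("nameB2", Array(2.0, 0.0)))

    // Broadcast the small side once instead of pairing every partition with it via cartesian().
    val bB = sc.broadcast(smallB)

    val rddC = rddA.flatMap { case (_, vecA) =>
      bB.value.map { case (nameB, vecB) => (nameB, dot(vecA, vecB)) }
    }

    rddC.collect().foreach(println)
    sc.stop()
  }
}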

Building a stack in a map

I have a string that looks like this:
"7-6-4-1"
or
"7"
or
""
That is, a set of numbers separated by -. There may be zero or more numbers.
I want to return a stack with the numbers pushed on in that order (i.e. push 7 first and 1 last, for the first example).
If I just wanted to return a list I could just go str.split("-").map{_.toInt} (although this doesn't work on the empty string).
There's no toStack to convert to a Stack though. So currently, I have
{
  val s = new Stack[Int];
  if (x.nonEmpty)
    x.split('-').foreach {
      y => s.push(y.toInt)
    }
  s
}
Which works, but is pretty ugly. What am I missing?
EDIT: Thanks to all the responders, I learnt quite a bit from this discussion
Stack(x.split("-").map(_.toInt).reverse: _*)
The trick here is to pass the array you get from split into the Stack companion object builder. By default the items are added in the same order as the array, so you have to reverse the array first.
Note the "treat this as a list, not as a single item" annotation, : _*.
Edit: if you don't want to catch the empty string case separately, like so (use the bottom one for mutable stacks, the top for immutable):
if (x.isEmpty) Stack() else Stack(x.split("-").map(_.toInt).reverse: _*)
if (x.isEmpty) Stack[Int]() else Stack(x.split("-").map(_.toInt).reverse: _*)
then you can filter out empty strings:
Stack(x.split("-").filterNot(_.isEmpty).map(_.toInt).reverse: _*)
which will also "helpfully" handle things like 7-9----2-2-4 for you (it will give Stack(4,2,2,9,7)).
If you want to handle even more dastardly formatting errors, you can
val guard = scala.util.control.Exception.catching[Int](classOf[NumberFormatException])
Stack(x.split("-").flatMap(x => guard.opt(x.toInt)).reverse: _*)
to return only those items that actually can be parsed.
(Stack[Int]() /: (if (x.isEmpty) Array.empty[String] else x.split("-")))(
  (stack, value) => stack.push(value.toInt))
Don't forget the ever-handy breakOut, which affords slightly better performance than col: _* (see Daniel's excellent explanation).
Used here with Rex Kerr's .filterNot(_.isEmpty) solution:
import scala.collection.immutable.Stack
import scala.collection.breakOut

object StackFromString {

  def stackFromString(str: String): Stack[Int] =
    str.split("-").filterNot(_.isEmpty)
      .reverse.map(_.toInt)(breakOut)

  def main(args: Array[String]): Unit = {
    println(stackFromString("7-6-4-1"))
    println(stackFromString("7"))
    println(stackFromString(""))
  }
}
Will output:
Stack(1, 4, 6, 7)
Stack(7)
Stack()
