I am new to Spark programming and I got stuck while using map.
My data RDD contains:
Array[(String, Int)] = Array((steve,5), (bill,4), (" amzon",6), (flikapr,7))
and while using map again I am getting the error mentioned below:
data.map((k,v) => (k,v+1))
<console>:32: error: wrong number of parameters; expected = 1
data.map((k,v) => (k,v+1))
I am trying to pass a tuple with a key and value and want to get back a tuple with 1 added to the value.
Please help me understand why I am getting this error.
Thanks
You almost got it. rdd.map() operates on each record of the RDD and in your case, that record is a tuple. You can simply access the tuple members using Scala's underscore accessors like this:
val data = sc.parallelize(Array(("steve",5), ("bill",4), ("amzon",6), ("flikapr",7)))
data.map(t => (t._1, t._2 + 1)).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
Or better yet, use Scala's powerful pattern matching like this:
data.map({ case (k, v) => (k, v+1) }).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
Here's the best option so far: key-value tuples are so common in Spark that we usually refer to them as PairRDDs, and they come with plenty of convenience functions. For your use case, you only need to operate on the value without changing the key, so you can simply use mapValues():
data.mapValues(_ + 1).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
Is there an easy way to use glom to get an unknown key from a dictionary?
The ??? represents random data that I am trying to capture from an API:
book = {"???":[{ "globalIdenity":208565940},{"globalIdenity":228049454}]}
spec =
output_data = glom(book, spec)
print(output_data)
Use an iterator over the values, then use next to get the first value:
book = {"???":[{ "globalIdenity":208565940},{"globalIdenity":228049454}]}
result = next(iter(book.values()))
print(result)
#output
[{'globalIdenity': 208565940}, {'globalIdenity': 228049454}]
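If you also need the unknown key itself and not just its value, the same next(iter(...)) trick works on items(); here is a small sketch assuming the same book dictionary:

book = {"???": [{"globalIdenity": 208565940}, {"globalIdenity": 228049454}]}

# Grab both the unknown key and its value in one step
key, value = next(iter(book.items()))
print(key)    # ???
print(value)  # [{'globalIdenity': 208565940}, {'globalIdenity': 228049454}]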
I am working on a problem where I have to convert around 7 million list-value pairs to key-value pairs using the map() function in PySpark, where the list in a given list-value pair has a length of at most 20.
For example:
listVal= [(["ank","nki","kit"],21),(["arp","rpi","pit"],22)]
Now, I want key-value pairs as
keyval= [("ank",21),("nki",21),("kit",21),("arp",22),("rpi",22),("pit",22)]
When I write
keyval= listVal.map(lambda x: some_function(x))
where some_function() is defined as:
def some_function(x):
    shingles = []
    for i in range(len(x[0])):
        temp = []
        temp.append(x[0][i])
        temp.append(x[1])
        shingles.append(tuple(temp))
    return shingles
I don't get the desired output, because I think map() returns one key-value pair per item of the list, not multiple key-value pairs. I have tried other things as well and searched the web but did not find anything related to it.
Any help would be appreciated.
Given your constraints, this can be done with PySpark's .flatMap():
def conversion(n):
    return [(x, n[1]) for x in n[0]]
listVal.flatMap(conversion)
Or, in one line:
listVal.flatMap(lambda n: [(x, n[1]) for x in n[0]])
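As a quick sanity check on the sample data (this assumes a SparkContext named sc and that listVal has been turned into an RDD rather than a plain Python list):

listVal = sc.parallelize([(["ank", "nki", "kit"], 21), (["arp", "rpi", "pit"], 22)])

# flatMap flattens the per-record list of pairs into individual key-value pairs
print(listVal.flatMap(lambda n: [(x, n[1]) for x in n[0]]).collect())
# [('ank', 21), ('nki', 21), ('kit', 21), ('arp', 22), ('rpi', 22), ('pit', 22)]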
I am new to Spark and reasonably new to Clojure (although I really like what Clojure can do so far). I am currently trying to parse JSON in Clojure using sparkling, and I am having trouble with the basics of transforming data and getting it back in a form I can understand and debug. I use dummy data in the example below, but my actual data is over 400GB.
As an example, I first tried splitting each line of my JSON input (each line is a full record) by commas so that I would have a list of keys and values (for eventual conversion to keyword and value maps). In Scala (for which it is easier to find Spark examples) with dummy data, this works fine:
val data = sc.parallelize(Array ("a:1,b:2","a:3,b:4"))
val keyVals = data.map(line => line.split(","))
keyVals.collect()
This returns Array[Array[String]] = Array(Array(a:1, b:2), Array(a:3, b:4)), which is at least a reasonable starting point for key-value mapping.
However, when I run the following in clojure with sparkling:
(def jsony-strings (spark/parallelize sc ["a:1,b:2","a:3,b:4"]))
(def jsony-map (->> jsony-strings
                    (spark/map (fn [l] (string/split l #",")))))
(spark/collect jsony-map)
I get the usual concurrency spaghetti from the JVM, the crux of which seems to be:
2018-08-24 18:49:55,796 WARN serialization.Utils:55 - Error deserializing object (clazz: gdelt.core$fn__7437, namespace: gdelt.core)
java.lang.ClassNotFoundException: gdelt.core$fn__7437
This is an error I seem to get for pretty much anything I try that is more complex than counts.
Can someone please point me in the right direction?
I guess I should note that my Big Problem is processing lots and lots of lines of JSON in a bigger-than-memory (400GB) dataset. I will be using the JSON keys to filter, sort, calculate, etc., and Spark pipelines looked both good for rapid parallel processing and convenient for these functions. But I am certainly open to considering other alternatives for processing this dataset.
You should use Cheshire for this:
;; parse some json
(parse-string "{\"foo\":\"bar\"}")
;; => {"foo" "bar"}
;; parse some json and get keywords back
(parse-string "{\"foo\":\"bar\"}" true)
;; => {:foo "bar"}
I like to use a shortcut for the 2nd case, since I always want to convert string keys into Clojure keywords:
(is= {:a 1 :b 2} (json->edn "{\"a\":1, \"b\":2}"))
It is just a simple wrapper with (I think) an easier-to-remember name:
(defn json->edn
  "Shortcut to cheshire.core/parse-string"
  [arg]
  (cc/parse-string arg true)) ; true => keywordize-keys

(defn edn->json
  "Shortcut to cheshire.core/generate-string"
  [arg]
  (cc/generate-string arg))
Update: Note that Cheshire can work with lazy streams:
;; parse a stream lazily (keywords option also supported)
(parsed-seq (clojure.java.io/reader "/tmp/foo"))
Here is a requirement in my data process: AlertRDD is used to produce both a unique-alerts count and a non-unique-alerts count for one day.
So I wrote the method below to include both of them:
implicit class RDDOps(val rdd: RDD[(String, String)]) {
  def addDistinctRDD() = {
    rdd.distinct().map {
      case (elem1, elem2) =>
        (elem1, elem2, "AlertUniqueUsers")
    } ++ rdd.map {
      case (elem1, elem2) =>
        (elem1, elem2, "Alerts")
    }
  }
}
When I run AlertRDD.addDistinctRDD.save(//path), the program never stops, but if I run it as AlertRDD.cache.addDistinctRDD.save(//path), it works well.
Does anybody know the reason?
Update
It looks like nobody can answer my question; probably my question is not well understood.
Why do I add the unique AlertRDD and the non-unique AlertRDD together?
Because the follow-up processing of the two RDDs is the same, something like the excerpt below:
.map((_, 1))
.reduceByKey(_ + _)
.map()
....
.save($path)
If my solution is not a good one, can anyone suggest a better one?
This has been bothering me for a while and I am sure I am being very brainless.
I have two RDDs of key/value pairs, corresponding to a name and associated sparse vector:
RDDA = [ (nameA1, sparsevectorA1), (nameA2, sparsevectorA2), (nameA3, sparsevectorA3) ]
RDDB = [ (nameB1, sparsevectorB1), (nameB2, sparsevectorB2) ]
I want the end result to compare each element of the first RDD against each element in the second, producing an RDD of 3 * 2 = 6 elements. In particular, I want the name of the element in the second RDD and the dot product of the two sparsevectors:
RDDC = [ (nameB1, sparsevectorA1.dot(sparsevectorB1)), (nameB2, sparsevectorA1.dot(sparsevectorB2)),
(nameB1, sparsevectorA2.dot(sparsevectorB1)), (nameB2, sparsevectorA2.dot(sparsevectorB2)),
(nameB1, sparsevectorA3.dot(sparsevectorB1)), (nameB2, sparsevectorA3.dot(sparsevectorB2)) ]
Is there an appropriate map or inbuilt function to do this?
I assume such an operation must exist hence my feeling of brainlessness. I can easily and inelegantly do this if I collect the two RDDs and then implement a for loop, but of course that is not satisfactory as I want to keep them in RDD form.
Thanks for your help!
Is there an appropriate map or inbuilt function to do this?
Yes, there is and it is called cartesian.
def transform(ab):
    (_, vec_a), (name_b, vec_b) = ab
    return name_b, vec_a.dot(vec_b)

rddA.cartesian(rddB).map(transform)
The problem is that a Cartesian product on a large dataset is usually a really bad idea, and there is usually a much better approach out there.
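For example, if one of the two RDDs is small enough to collect to the driver (an assumption here, since rddB only has a couple of entries in the example), broadcasting it and expanding the other side with flatMap avoids the full Cartesian shuffle. A rough sketch:

# Sketch only: assumes rddB is small enough to collect and broadcast
b_vectors = sc.broadcast(rddB.collect())

def dot_against_b(record_a):
    _, vec_a = record_a
    # one output pair per element of the broadcast list
    return [(name_b, vec_a.dot(vec_b)) for name_b, vec_b in b_vectors.value]

rddC = rddA.flatMap(dot_against_b)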