Why is the reduceByKey transformation shown first in the DAG? - apache-spark

I have executed the wordcount program below in the pyspark shell:
got_rdd = sc.textFile('C:\\Spark\\pyspark\\book1.txt')
got_rdd = got_rdd.flatMap(lambda x : x.split(" "))
got_rdd = got_rdd.filter(lambda x : x!="")
got_rdd = got_rdd.map(lambda x : (x,1))
got_rdd = got_rdd.reduceByKey(lambda a,b : a + b)
got_rdd.saveAsTextFile('C:\\Spark\\pyspark\\book1_op')
This is the DAG shown for me in the Spark UI.
Why is the reduceByKey transformation shown first in the DAG, followed by map?
In the code I applied map first, followed by reduceByKey.
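As a side note, the same lineage can also be printed from the shell; this is only a minimal sketch, assuming the got_rdd built above, and uses the RDD's toDebugString method:
# Sketch only: print the textual lineage of the final RDD; the shuffle
# introduced by reduceByKey appears as a separate indented block.
print(got_rdd.toDebugString().decode())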

Related

How to mix two RDDs with Python

I have the following two RDDs; the first one is:
training2 = training.map(lambda x:(x[0],(x[1],x[2])))
training2.collect()
#[(u'1', (u'4298118681424644510', u'7686695')),
# (u'1', (u'4860571499428580850', u'21560664')),
# (u'1', (u'9704320783495875564', u'21748480')),
# (u'1', (u'13677630321509009335', u'3517124')),
and the second one is:
user_id2 = user_id.map(lambda x:(x[0],(x[1],x[2])))
user_id2.collect()
#[(u'1', (u'1', u'5')),
# (u'2', (u'2', u'3')),
# (u'3', (u'1', u'5')),
# (u'4', (u'1', u'3')),
# (u'5', (u'2', u'1')),
In both RDDs the first element u'1', u'2', ... indicates the user id, so I need to combine both RDDs by key; for every key the output must be something like this:
u'1', (u'1', u'5', u'4298118681424644510', u'7686695')
How about unioning the two RDDs and using aggregateByKey(self, zeroValue, seqFunc, combFunc, numPartitions=None)?
You can also use reduceByKey or groupByKey.
For example:
zero_value = set()

def seq_op(x, y):
    x.add(y)
    return x

def comb_op(x, y):
    return x.union(y)

numbers = sc.parallelize([0, 0, 1, 2, 5, 4, 5, 5, 5]).map(lambda x: ["Even" if (x % 2 == 0) else "Odd", x])
numbers.collect()
numbers.aggregateByKey(zero_value, seq_op, comb_op).collect()
# results look like [("Even", {0, 2, 4}), ...]
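For the original goal of combining the two RDDs by user id, a plain join followed by flattening the nested tuples may be closer to the requested output. This is only a sketch, assuming the training2 and user_id2 RDDs defined in the question:
# Sketch only: join the two pair RDDs from the question by user id, then
# flatten the nested value tuples into one flat tuple per key.
joined = user_id2.join(training2)
# e.g. (u'1', ((u'1', u'5'), (u'4298118681424644510', u'7686695')))
flat = joined.map(lambda kv: (kv[0], kv[1][0] + kv[1][1]))
# e.g. (u'1', (u'1', u'5', u'4298118681424644510', u'7686695'))
flat.collect()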

Spark - Sort DStream by Key and limit to 5 values

I've started to learn Spark and I wrote a PySpark streaming program to read stock data (symbol, volume) from port 3333.
Sample data streamed on port 3333:
"AAC",111113
"ABT",7451020
"ABBV",7325429
"ADPT",318617
"AET",1839122
"ALR",372777
"AGN",4170581
"ABC",3001798
"ANTM",1968246
I want to display the top 5 symbols based on volume, so I used a mapper to read each line, split it by comma, and reverse it.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)
lines = ssc.socketTextStream("localhost", 3333)
stocks = lines.map(lambda line: sorted(line.split(','), reverse=True))
stocks.pprint()
The following is the output of stocks.pprint():
[u'111113', u'"AAC"']
[u'7451020', u'"ABT"']
[u'7325429', u'"ABBV"']
[u'318617', u'"ADPT"']
[u'1839122', u'"AET"']
[u'372777', u'"ALR"']
[u'4170581', u'"AGN"']
[u'3001798', u'"ABC"']
[u'1968246', u'"ANTM"']
I've got the following function in mind to display the stock symbols, but I'm not sure how to sort the stocks by key (volume) and then limit the output to only the first 5 values.
stocks.foreachRDD(processStocks)

def processStocks(stock):
    for st in stock.collect():
        print st[1]
Since a stream represents an infinite sequence, all you can do is sort each batch. First, you'll have to correctly parse the data:
lines = ssc.queueStream([sc.parallelize([
    "AAC,111113", "ABT,7451020", "ABBV,7325429", "ADPT,318617",
    "AET,1839122", "ALR,372777", "AGN,4170581", "ABC,3001798",
    "ANTM,1968246"
])])

def parse(line):
    try:
        k, v = line.split(",")
        yield (k, int(v))
    except ValueError:
        pass

parsed = lines.flatMap(parse)
Next, sort each batch:
sorted_ = parsed.transform(
    lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False))
Finally, you can pprint the top elements:
sorted_.pprint(5)
If all went well, you should get output like the one below:
-------------------------------------------
Time: 2016-10-02 14:52:30
-------------------------------------------
('ABT', 7451020)
('ABBV', 7325429)
('AGN', 4170581)
('ABC', 3001798)
('ANTM', 1968246)
...
Depending on the size of a batch, a full sort can be prohibitively expensive. In that case you can take the top elements and parallelize:
sorted_ = parsed.transform(lambda rdd: rdd.ctx.parallelize(rdd.top(5, key=lambda x: x[1])))
or even use combineByKey to keep only the top elements during aggregation:
from operator import itemgetter
import heapq

key = itemgetter(1)

def create_combiner(key=lambda x: x):
    # Start each accumulator as a single-element list of (sort-key, record) pairs
    def _(x):
        return [(key(x), x)]
    return _

def merge_value(n=5, key=lambda x: x):
    # Push the new record onto the accumulator and keep only the n largest entries
    def _(acc, x):
        heapq.heappush(acc, (key(x), x))
        return heapq.nlargest(n, acc) if len(acc) > n else acc
    return _

def merge_combiners(n=5):
    # Merge two partial accumulators and again keep only the n largest entries
    def _(acc1, acc2):
        merged = list(heapq.merge(acc1, acc2))
        return heapq.nlargest(n, merged) if len(merged) > n else merged
    return _

(parsed
    .map(lambda x: (None, x))  # single dummy key so all records meet in one combiner
    .combineByKey(
        create_combiner(key=key), merge_value(key=key), merge_combiners())
    .flatMap(lambda x: x[1]))
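If the goal is simply to display the top 5 per batch, as in the original processStocks idea, foreachRDD combined with top avoids collecting whole batches. This is a rough sketch building on the parsed DStream above (the print_top_stocks helper name is just illustrative):
# Sketch: print the 5 largest (symbol, volume) pairs of each micro-batch,
# pulling only those 5 records back to the driver.
def print_top_stocks(time, rdd):
    for symbol, volume in rdd.top(5, key=lambda x: x[1]):
        print("%s %s %d" % (time, symbol, volume))

parsed.foreachRDD(print_top_stocks)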

PySpark map not working

I am new to Apache Spark and have a simple map function implemented as
from pyspark import SparkContext
sc = SparkContext( 'local', 'pyspark')
f = open("Tweets_tokenised.txt")
tokenised_tweets = f.readlines()
f = open("positive.txt")
pos_words=f.readlines()
f = open("negative.txt")
neg_words=f.readlines()
def sentiment(line):
    global pos_words
    global neg_words
    pos = 0
    neg = 0
    for word in line.split():
        if word in pos_words:
            pos = pos + 1
        if word in neg_words:
            neg = neg + 1
    if (pos > neg):
        return 1
    else:
        return 0

dist_tweets = sc.textFile("Tweets_tokenised.txt").map(sentiment)
# (lambda line: sentiment(line))
dist_tweets.saveAsTextFile("RDD.txt")
Basically I am reading a file (containing tokenised and stemmed tweets) and then doing a simple positive-negative word count on it within the map function (3rd line from the end). But RDD.txt has nothing in it; the function sentiment is not being called at all.
Can someone point out the error?
You can't change the value of a global variable inside a map transformation in Apache Spark; to achieve this you need an Accumulator. However, even using them, I think that is not the correct approach.
In your case, if your pos_words and neg_words are not very big, you could define them as broadcast lists and then count by sentiment.
Something like:
pos = sc.broadcast(["good", "gold", "silver"])
neg = sc.broadcast(["evil", "currency", "fiat"])

# I will suppose that every record is a different tweet and is stored as a tuple.
tweets = sc.parallelize([("banking", "is", "evil"), ("gold", "is", "good")])

(tweets
    .flatMap(lambda x: x)
    .map(lambda x: (1 if x in pos.value else -1 if x in neg.value else 0, 1))
    .reduceByKey(lambda a, b: a + b).take(3))
# notice that I count neutral words.
# output -> [(0, 3), (1, 2), (-1, 1)]
Note, you can check the example right here.
PS: If your idea was to count the positive and negative words per message, the approach varies only slightly; see the sketch below.
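A rough sketch of that per-message variant, reusing the pos/neg broadcasts and the tweets RDD from the example above (the score helper is an illustrative name):
# Sketch: score each tweet individually instead of counting words globally.
def score(words):
    pos_count = sum(1 for w in words if w in pos.value)
    neg_count = sum(1 for w in words if w in neg.value)
    return 1 if pos_count > neg_count else 0

tweets.map(lambda words: (words, score(words))).collect()
# e.g. [(('banking', 'is', 'evil'), 0), (('gold', 'is', 'good'), 1)]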

How can I select balanced sampling for binary classification?

Here is my code; it loads data from Hive and does sample balancing:
// Load SubSet Data
val dataList = DataLoader.loadSubTrainTestData(hiveContext.sql(sampleDataHql))
// Split Data to Train and Test
val data = dataList.randomSplit(Array(0.7, 0.3), seed = 11L)
// Random balance train data
val sampleCount = data(0).map(rec => (rec.label, 1)).reduceByKey(_ + _)
val positiveSample = data(0).filter(_.label == 1).cache()
val positiveSize = positiveSample.count()
val negativeSample = data(0).filter(_.label == 0).cache()
val negativeSize = negativeSample.count()
// Build train data
val trainData = positiveSample ++
  negativeSample.sample(withReplacement = false, 1.0 * positiveSize.toFloat / negativeSize, System.nanoTime())
// Data size
val trainDataSize = positiveSize + negativeSize
val testDataSize = trainDataSize * 3.0 / 7.0
and I calculate trainDataSize and testDataSize to evaluate the model confidence.
OK, I haven't tested this code, but it should go like this:
val data: RDD[LabeledPoint] = ???
val fractions: Map[Double, Double] = Map(0.0 -> 0.5, 1.0 -> 0.5)
val sampledData: RDD[LabeledPoint] = data
  .keyBy(_.label)
  .sampleByKeyExact(false, fractions) // Optionally with seed
  .values
You can convert your RDD of LabeledPoint into a pair RDD, then apply sampleByKeyExact using the fractions you wish to use.

Printing ClusterID and its elements using Spark KMeans algo.

I have this program which prints the WSSSE of the KMeans algorithm on Apache Spark. There are 20 clusters generated. I am trying to print the cluster ID and the elements that got assigned to the respective cluster ID. How do I loop over the cluster IDs to print the elements?
Thank you guys!!
val sc = new SparkContext("local", "KMeansExample","/usr/local/spark/", List("target/scala-2.10/kmeans_2.10-1.0.jar"))
// Load and parse the data
val data = sc.textFile("kmeans.csv")
val parsedData = data.map( s => Vectors.dense(s.split(',').map(_.toDouble)))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
val clusterCenters = clusters.clusterCenters map (_.toArray)
println("The Cluster Centers are = " + clusterCenters)
// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
As far as I know, you should run predict for each element:
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
List<Vector> vectors = parsedData.collect();
for (Vector vector : vectors) {
    System.out.println("cluster " + clusters.predict(vector) + " " + vector.toString());
}
