Spark RDD Sampling, faster with or without replacement? - apache-spark

In general, all other things being equal, which would be expected to run more quickly?
val a = myRDD.sample(true, 0.01)
val b = myRDD.sample(false, 0.01)

Related

Element-wise addition of RDDs in PySpark

Suppose you have two vectors of the same size that are stored as rdd1 and rdd2. Please write a function where the inputs are rdd1 and rdd2, and the output is an RDD which is the element-wise addition of rdd1 and rdd2. You should not load all data to the driver program.
Hint: You may use zip() in Spark, not the zip() in Python.
I do not understand what is wrong with the code below, or whether it is correct or not. When I run it, it takes forever. Would you be able to help me with this? Thanks.
spark = SparkSession(sc)
numPartitions = 10
rdd1 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[0]))
rdd2 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[1]))
def ele_wise_add(rdd1, rdd2):
    rdd3 = rdd1.zip(rdd2).map(lambda x, y: x + y)
    return rdd3
rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())
rdd1 and rdd2 have 10000 numbers each, and below are the first 10 numbers in each.
rdd1 = [47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057]
rdd2 = [30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315]
expected output = [78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
rdd1.zip(rdd2) creates a single tuple for each pair, so when writing the lambda function you only have x and not y. So you'd want sum(x) or x[0] + x[1], not x + y. (A corrected version of the original ele_wise_add function is sketched after the example below.)
rdd1 = spark.sparkContext.parallelize((47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057))
rdd2 = spark.sparkContext.parallelize((30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315))
rdd1.zip(rdd2).map(lambda x: sum(x)).collect()
[78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
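For reference, here is a minimal sketch of the original ele_wise_add function with the lambda corrected; it assumes rdd1 and rdd2 are loaded exactly as in the question:
def ele_wise_add(rdd1, rdd2):
    # zip pairs elements up positionally, so each record in the zipped RDD is a
    # single tuple (x1, x2); the map receives that tuple as one argument.
    return rdd1.zip(rdd2).map(lambda pair: pair[0] + pair[1])

rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())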

How to understand this piece of code in Spark

I need help understanding this piece of code. I know the output is 10. However, I would like to know why. I am very new to Spark and I need to learn it for an academic exam. So I would like to know how it got the output.
data_reduce = sc.parallelize([1.0, 2, .5, .1, 5, .2], 1)
data_reduce.reduce(lambda x, y: x / y)
In the first line of your code we are creating an RDD.
data_reduce = sc.parallelize([1.0, 2, .5, .1, 5, .2], 1) # 1 partition
In the above piece of code:
sc : sc is the SparkContext variable we are using here. When you run the Spark shell, it automatically provides the sc variable for you, but in non-shell applications you have to create the SparkContext yourself.
sc is the entry point of your program. Once a SparkContext is created, you can use it to create RDDs, accumulators and broadcast variables, access Spark services and run jobs.
parallelize : There are multiple ways to create an RDD in Spark, for example by loading a file or loading data from a table. Similarly, with the parallelize function you can create an RDD by passing in a collection such as an array or a list; see the example below.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
data_reduce : This is your RDD. Once created, the distributed dataset (data_reduce) can be operated on in parallel.
Second line of code:
data_reduce.reduce(lambda x, y: x / y)
Here we are calling the reduce function on your RDD. In your example it repeatedly divides the elements of the RDD, carrying a running result along. I hope you are aware of the concept of partitions in an RDD: the data is distributed across different nodes in the form of partitions. In your case the data is
[1.0, 2, .5, .1, 5, .2]
Let's say, hypothetically, that it were distributed across two partitions. It would look like:
partition 1 : [1.0, 2, .5]
partition 2 : [.1, 5, .2]
The reduce function would then be called on each partition and the partial results combined. In your example, however, the RDD was created with a single partition, so the whole list is reduced in order from left to right.
Here the reduce method accepts a function (x, y) => result. It starts with the first element of the RDD as the accumulator, then applies the function to the accumulator and each subsequent element in turn, and returns the final value rather than another RDD.
Okay, so let's understand how reduce works here:
step 1 : [1.0, 2, .5, .1, 5, .2].reduce(lambda x,y : x/y )
here x = 1.0 , y=2 thus x/y = 0.5
step 2 : now 0.5 will be stored in x and y will be the next element from the RDD,
so x = 0.5 and y = 0.5, thus x/y = 1
step 3 : Similarly now x = 1 and y = 0.1 so x/y = 10
step 4 : x=10,y=5 so x/y = 2
step 5 : x=2, y=0.2 so x/y = 10
So 10 is your final answer. I hope that clears it up for you :)
You can read more detailed info about the reduce function here.
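To double-check the walkthrough: with a single partition, the RDD reduce behaves like an ordinary left-to-right reduce over the list, so you can reproduce the result in plain Python (functools.reduce is used here only for illustration; it is not part of Spark):
from functools import reduce

# ((((1.0 / 2) / .5) / .1) / 5) / .2  ->  10.0
print(reduce(lambda x, y: x / y, [1.0, 2, .5, .1, 5, .2]))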

Apache Spark in Yarn not using all cores [duplicate]

I am running Spark on my local machine (16 GB, 8 CPU cores). I was trying to train a linear regression model on a dataset of size 300 MB. I checked the CPU statistics and the running processes; it executes on just one thread.
The documentation says they have implemented a distributed version of SGD.
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData)
valuesAndPreds = parsedData.map(lambda p: (p.label,model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")
I think what you want to do is explicitly state the number of cores to use with the local context. As you can see from the comments here, "local" (which is what you're doing) instantiates a context on one thread whereas "local[4]" will run with 4 cores. I believe you can also use "local[*]" to run on all cores on your system.
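For example, a minimal sketch keeping the rest of the question's code unchanged:
sc = SparkContext("local[*]", "Linear Reg Simple")  # use all available local cores
# or pin the number of threads explicitly:
# sc = SparkContext("local[4]", "Linear Reg Simple")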

How to get the probabilities of classes in Spark Naive Bayes classifier?

I'm training a NaiveBayesModel in Spark; however, when I use it to predict a new instance I need to get the probabilities for each class. I looked at the code of the predict function in NaiveBayesModel and came up with the following code:
val thetaMatrix = new DenseMatrix (model.labels.length,model.theta(0).length,model.theta.flatten,true)
val piVector = new DenseVector(model.pi)
//val prob = thetaMatrix.multiply(test.features)
val x = test.map { p =>
  val prob = thetaMatrix.multiply(p.features)
  BLAS.axpy(1.0, piVector, prob)
  prob
}
Does this work properly? The line BLAS.axpy(1.0, piVector, prob) keeps giving me an error that the value 'axpy' is not found.
In a recent pull request this was added to the Spark trunk and will be released in Spark 1.5 (closing SPARK-4362). You can therefore call
def predictProbabilities(testData: RDD[Vector]): RDD[Vector]
or
def predictProbabilities(testData: Vector): Vector

Mllib missing values handling

I'm using corr from MLlib with the basic interface, like:
val a:RDD[Double] = sc.makeRDD(Seq(1., 1., 0.))
val b:RDD[Double] = sc.makeRDD(Seq(1., -1., 0.))
val r = Statistics.corr(a, b)
println(r)
Is there a possibility to have casewise or pairwise removal of NaN and infinity values?
By default, MLlib returns NaN from corr when infinity or NaN values are present.
To my knowledge, there is no built-in function and you need to filter those values out on your own. One approach is to use java.lang.Double (http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html) functionality:
import java.lang.Double.isNaN
import java.lang.Double.isInfinite
val filtered1 = data1.filter(x => !isNaN(x) && !isInfinite(x))
val filtered2 = data2.filter(x => !isNaN(x) && !isInfinite(x))
val r = Statistics.corr(filtered1, filtered2)
println(r)
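The same idea translated to PySpark, as a rough sketch (data1 and data2 are placeholder names for the two RDDs of floats, as in the Scala version above):
import math
from pyspark.mllib.stat import Statistics

def finite(v):
    # keep only values that are neither NaN nor infinite
    return not (math.isnan(v) or math.isinf(v))

filtered1 = data1.filter(finite)
filtered2 = data2.filter(finite)
print(Statistics.corr(filtered1, filtered2))
Note that filtering each RDD independently can drop different positions from each series; for strict casewise removal you would zip the two RDDs and filter the pairs instead.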
