How to understand this piece of code in Spark - apache-spark

I need help understanding this piece of code. I know the output is 10. However, I would like to know why. I am very new to Spark and I need to learn it for an academic exam. So I would like to know how it got the output.
data_reduce = sc.parallelize([1.0, 2, .5, .1, 5, .2], 1)
data_reduce.reduce(lambda x, y: x / y)

The first line of your code creates an RDD (not a DataFrame).
data_reduce = sc.parallelize([1.0, 2, .5, .1, 5, .2], 1) # 1 partition
In the line above:
sc : sc is the SparkContext variable. Because you are running in the Spark shell, the shell automatically provides sc for you. In applications that run outside the shell you have to create the SparkContext yourself.
sc is the entry point of your program. Once a SparkContext is created, you can use it to create RDDs, accumulators and broadcast variables, access Spark services and run jobs.
parallelize : There are multiple ways to create an RDD in Spark, for example by loading a file or loading data from a table. Similarly, with parallelize you can create an RDD by passing in a collection such as an array or a list. See the Scala example below:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
data_reduce : This is your RDD. Once created, the distributed dataset (data_reduce) can be operated on in parallel.
The second line of code:
data_reduce.reduce(lambda x, y: x / y)
Here we are calling the reduce function on your RDD. In your example it divides the running result by each successive element of the RDD (it is not a sum). I hope you are aware of the concept of partitions in RDDs: the data is distributed across different nodes in the form of partitions. In your case the data is
[1.0, 2, .5, .1, 5, .2]
Let's say it were distributed across two partitions (your code asks for only 1),
so it would look like
partition 1 : [1.0, 2, .5]
partition 2 : [.1, 5, .2]
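The reduce function would then be called on each partition, and the per-partition results would be combined with the same function. Because division is neither associative nor commutative, the result can depend on how the data is partitioned. In your code there is only 1 partition, so the elements are simply processed in order.
The reduce method accepts a function of two arguments, here (accum, n) => (accum / n). There is no default initial value: the accumulator (accum) starts out as the first element, and every call divides it by the next element n until all elements of the RDD have been processed. reduce returns the final value rather than another RDD.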
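To see why the grouping matters for a non-associative function like division, here is a minimal pure-Python sketch (no Spark involved). The helper reduce_by_partition and the uneven split are assumptions made up for illustration. They only imitate the "reduce each partition, then combine the partial results" idea, and they say nothing about how Spark would actually split this list.

```python
from functools import reduce

def divide(x, y):
    return x / y

def reduce_by_partition(partitions, func):
    """Hypothetical helper: reduce each partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)

data = [1.0, 2, .5, .1, 5, .2]

# One partition: plain left-to-right division, roughly 10.
one_partition = reduce_by_partition([data], divide)

# A made-up uneven split: [1.0, 2] and [.5, .1, 5, .2] gives roughly 0.1 instead.
uneven_split = reduce_by_partition([data[:2], data[2:]], divide)

print(one_partition, uneven_split)
```

This is why reduce is normally used with commutative and associative functions such as addition.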
OK, so let's see how reduce works here:
step 1 : [1.0, 2, .5, .1, 5, .2].reduce(lambda x, y: x / y)
here x = 1.0 and y = 2, thus x/y = 0.5
step 2 : now 0.5 is stored in x and y is the next element of the list,
so x = 0.5 and y = 0.5, thus x/y = 1
step 3 : similarly, now x = 1 and y = 0.1, so x/y = 10
step 4 : x = 10 and y = 5, so x/y = 2
step 5 : x = 2 and y = 0.2, so x/y = 10
So 10 is your final answer. I hope that clears it up for you :)
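The same walk-through can be reproduced without Spark. The sketch below uses functools.reduce and itertools.accumulate from the Python standard library as stand-ins for the single-partition RDD.reduce. It only illustrates the order of the divisions and is not Spark's implementation.

```python
from functools import reduce
from itertools import accumulate

data = [1.0, 2, .5, .1, 5, .2]
divide = lambda x, y: x / y

# Every intermediate value of x, starting with the first element itself.
steps = list(accumulate(data, divide))

# The final value only, as reduce returns it.
result = reduce(divide, data)

for step, value in enumerate(steps):
    print(step, value)
print("final:", result)
```

The intermediate values follow the steps listed above: 0.5, 1, 10, 2 and finally 10 (up to floating-point rounding).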
You can read more detailed info about reduce function from here


Operate along a dimension without writing back data in dask array on Xarray

I have a dataset with 3 dimensions ('time', 'x' and 'y'). I want to apply this function foo along the time dimension:
def foo(arr):
    lo, hi = np.percentile(arr, (1, 99))
    arr = np.clip(arr, lo, hi)
    arr = (arr - lo) / (hi - lo)
    return arr
Basically, I want to scale each "image" in the time dimension with a function like foo.
In numpy, I could just do something like:
for i in range(data.shape[0]):
    data[i] = foo(data[i])
but since the data is stored in dask arrays, I am unable to write back the modified data. I hit this error:
TypeError: this variable's data is stored in a dask array, which does not support item assignment. To assign to this variable, you must first load it into memory explicitly using the .load() method or accessing its .values attribute.
How would one go about doing this in xarray/dask?
There is no need to loop over the time dimension; you can do this in a vectorized way:
import numpy as np
import xarray as xr

da = xr.tutorial.open_dataset(
    "air_temperature", chunks={"lat": -1, "lon": -1, "time": 10}
).air

def scale_image(da, quantiles):
    # quantiles over the spatial dimensions, one pair per time step
    quantiles = da.quantile(quantiles, dim=("lat", "lon"))
    lower = quantiles.isel(quantile=0, drop=True)
    upper = quantiles.isel(quantile=1, drop=True)
    clipped = xr.apply_ufunc(np.clip, da, lower, upper, dask="allowed")
    return (clipped - lower) / (upper - lower)

scaled = scale_image(da, quantiles=[0.01, 0.99])
This way it is not necessary to load the whole array into memory.
I realized that you can use xarray's apply_ufunc directly with your foo function as well, if you provide the axis argument to np.percentile and take care of making the array shapes consistent.
It seems that the dask version of the percentile function is not implemented for multi-dimensional arrays, but you can use the parallelized option of apply_ufunc to make it work with the numpy function:
def foo(arr):
    # arr has shape (time, lat, lon); percentiles are taken per time step
    lo, hi = np.percentile(arr, (1, 99), axis=(1, 2))
    arr = np.clip(arr, lo[:, None, None], hi[:, None, None])
    return (arr - lo[:, None, None]) / (hi[:, None, None] - lo[:, None, None])

scaled2 = xr.apply_ufunc(foo, da, dask="parallelized")
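As a quick sanity check of the axis-based version, the following NumPy-only sketch compares the per-image loop from the question with a vectorized variant on a small random stack. The name foo_vectorized and the random test data are assumptions for illustration, and keepdims=True is used here instead of the explicit [:, None, None] indexing.

```python
import numpy as np

def foo(arr):
    """Scale one 2-D image to [0, 1] after clipping at the 1st/99th percentiles."""
    lo, hi = np.percentile(arr, (1, 99))
    arr = np.clip(arr, lo, hi)
    return (arr - lo) / (hi - lo)

def foo_vectorized(stack):
    """Hypothetical variant: same scaling for every image of a (time, y, x) stack."""
    # keepdims=True leaves lo and hi with shape (time, 1, 1), so they broadcast
    lo, hi = np.percentile(stack, (1, 99), axis=(1, 2), keepdims=True)
    return (np.clip(stack, lo, hi) - lo) / (hi - lo)

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 16, 16))  # made-up data: 4 time steps of 16x16 images

looped = np.stack([foo(image) for image in data])
vectorized = foo_vectorized(data)

print(np.allclose(looped, vectorized))
```

Both approaches should agree, since the percentiles are computed independently for each time step in either case.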

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means thus (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from
Aggregate over column arrays in DataFrame in PySpark?
but the problem is that n is variable, how do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies), with introduction of the position the values had within the arrays. That will give you an additional column you can use in grouping to compute the average you had in mind. At this point you might actually have enough for further computations (see below).
If you really must have it back in an array, that's harder and I don't have a direct way to do it. One must keep track of the order, and I believe that's easiest with a map (a dictionary, if you like). To do so, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), the aggregation over both columns keeps the two lists aligned with each other, as the rows get shuffled in their entirety (see below). From there, you can create a mapping of the position to the average with map_from_arrays.
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select("group", posexplode("values")) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
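... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages                     |
+-----+----------------------------------------+
|B    |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A    |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0]          |
+-----+----------------------------------------+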
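To make the logic of the explode-then-group approach easier to follow, here is a plain-Python sketch of the same computation on the same toy rows. It is not Spark code. The dictionaries stand in for the groupBy steps, and the final sort by position is one possible way to get back to an ordered list, shown only as an illustration.

```python
from collections import defaultdict

rows = [
    ("A", [0, 1, 2]),
    ("A", [0, 3, 6]),
    ("B", [1, 2, 4, 5]),
    ("B", [1, 2, 6, 1]),
]

# Equivalent of posexplode + groupBy("group", "pos"): bucket values by position.
buckets = defaultdict(list)
for group, values in rows:
    for pos, value in enumerate(values):
        buckets[(group, pos)].append(value)

# Equivalent of the positional map: group -> {pos: average}.
positional_averages = defaultdict(dict)
for (group, pos), values in buckets.items():
    positional_averages[group][pos] = sum(values) / len(values)

# Sorting by position turns each map back into an ordered list.
as_arrays = {
    group: [by_pos[pos] for pos in sorted(by_pos)]
    for group, by_pos in positional_averages.items()
}

print(as_arrays)
```

Because each group keeps its own set of positions, arrays of different lengths in different groups are handled without knowing n in advance.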

Tensorflow map_fn Out of Memory Issues

I am having issues with my code running out of memory on large data sets. I attempted to chunk the data to feed it into the calculation graph but I eventually get an out of memory error. Would setting it up to use the feed_dict functionality get around this problem?
My code is set up like the following. It uses a nested map_fn call because of the nested shape of the result of the tf_itertools_product_2D_nest function.
tf_itertools_product_2D_nest function is from Cartesian Product in Tensorflow
I also tried a variation where I made a list of tensor-lists which was significantly slower than doing it purely in tensorflow so I'd prefer to avoid that method.
import tensorflow as tf
import numpy as np
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess = tf.Session()
tensorboard_log_dir = "../log/"
def tf_itertools_product_2D_nest(a, b):  # does not work on nested tensors
    a, b = a[None, :, None], b[:, None, None]
    n_feat_dimension_in_common = tf.shape(a)[-1]
    c = tf.concat([a + tf.zeros_like(b), tf.zeros_like(a) + b], axis=2)
    return c

def do_calc(arr_pair):
    arr_1 = arr_pair[0]
    arr_binary = arr_pair[1]
    return tf.reduce_max(tf.cumsum(arr_1 * arr_binary))

def calc_row_wrapper(row):
    return tf.map_fn(do_calc, row)

for i in range(0, 10):
    a = tf.constant(np.random.random((7, 10)) * 10, tf.float64)
    b = tf.constant(np.random.randint(2, size=(3, 10)), tf.float64)
    a_b_itertools_product = tf_itertools_product_2D_nest(a, b)
    '''Creates array like this:
    [ [[arr_a0,arr_b0], [arr_a1,arr_b0],...],
      [[arr_a0,arr_b1], [arr_a1,arr_b1],...],
      [[arr_a0,arr_b2], [arr_a1,arr_b2],...],
    ]'''
    with tf.summary.FileWriter(tensorboard_log_dir, sess.graph) as writer:
        result_array = sess.run(tf.map_fn(calc_row_wrapper, a_b_itertools_product),
                                options=run_options, run_metadata=run_metadata)
        writer.add_run_metadata(run_metadata, "iteration {}".format(i))
        # result_array should be an array with 3 rows (1 for each binary vector in b) and 7 columns (1 for each row in a)
I can imagine that is unnecessarily consuming memory due to the extra dimension added. Is there a way to mimic the outcome of the standard itertools.product() function to output 1 long list of every possible combination of items in the 2 input iterables? Like the result of:
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
That would eliminate the need to call map_fn twice.
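For reference, this is the plain-Python behaviour being asked for, shown with the standard library's itertools.product. The two small input lists are assumptions chosen to match the commented output above. This is not a TensorFlow solution, only a statement of the target result.

```python
import itertools

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# One flat list holding every (row of a, row of b) pair.
pairs = list(itertools.product(a, b))

print(pairs)
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
```

The result has len(a) * len(b) entries in a single dimension, rather than the nested rows-by-columns layout produced by tf_itertools_product_2D_nest.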
When map_fn is called within a loop as my code shows, will it keep adding new nodes to the graph on every iteration? There appears to be a big "map_" node for every iteration cycle in this code's TensorBoard graph.
Tensorboard Default View (not enough reputation yet)
When I select a particular iteration based on the tag in TensorBoard, only the map node corresponding to that iteration is highlighted, with all the others grayed out. Does that mean that for that cycle only its own map node is present (and that the nodes from previous cycles no longer exist in memory)?
Tensorboard 1 iteration view

Spark RDD: lookup from other RDD

I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain
def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset
# Set-up transaction set
rdd = sc.parallelize(
    [
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
    ]
)
# Create an RDD with the counts of each
# possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))
confidence_result = (
# Must be applied to length-two and greater itemsets
.filter(lambda x: len(x) > 1)
For those familiar with this type of lookup problem, you'll know that this type of Exception is raised:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
# Must be applied to length-two and greater itemsets
.filter(lambda x: len(x) > 1)
Which gives me my result. But the process of running counts.collect is very expensive, since in reality I have a dataset with 50m+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
partial_results = (
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect results from the partitions into a single list of results
    .collect()
)
# reduce the list to combine the metrics calculated on each partition
metric_result = confidence_combine(partial_results)
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
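What such a pair of functions could look like is sketched below in plain Python, without Spark. The bodies of confidence_partial and confidence_combine, the two hand-made "partitions", and the confidence helper are all assumptions for illustration. The sketch only shows the partial-then-combine shape of the computation, under the assumption that the per-partition results are small enough to combine on the driver.

```python
from collections import Counter
from itertools import combinations

def confidence_partial(transactions):
    """Hypothetical: count every non-empty itemset seen in one partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    yield counts  # mapPartitions-style functions return an iterator

def confidence_combine(partials):
    """Hypothetical: merge the per-partition counters into one."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total

# Two hand-made lists standing in for the partitions of the RDD.
partitions = [
    [('a', 'b'), ('a', 'b'), ('b', 'c')],
    [('a', 'c'), ('a', 'b'), ('b', 'c')],
]

partials = [c for part in partitions for c in confidence_partial(iter(part))]
counts = confidence_combine(partials)

def confidence(antecedent, consequent):
    """Support of both items divided by support of the antecedent."""
    both = counts[frozenset((antecedent, consequent))]
    return both / counts[frozenset((antecedent,))]

conf_a_b = confidence('a', 'b')
print(conf_a_b)
```

With the six transactions from the question, 'a' occurs 4 times and ('a', 'b') occurs 3 times, so the confidence of a -> b comes out as 0.75.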
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.

Spark Matrix multiplication with python

I am trying to do matrix multiplication using Apache Spark and Python.
Here is my data
from pyspark.mllib.linalg.distributed import RowMatrix
My RDD of vectors
rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])
My matrices
mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)
I would like to do something like this:
mat = mat1 * mat2
I wrote a function to do the matrix multiplication, but I'm afraid it will take a long time to run. Here is my function:
def matrix_multiply(df1, df2):
    nb_row = df1.count()
    for i in range(0, nb_row):
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            col = df2.select([list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)):
                r_value += row[c] * col[c]
    return mat
My function performs a lot of Spark actions (take, collect, etc.). Will it take a lot of processing time?
If someone has another idea, it would be helpful to me.
You cannot. Since RowMatrix has no meaningful row indices, it cannot be used for multiplication. Even ignoring that, the only distributed matrix which supports multiplication with another distributed structure is BlockMatrix.
from pyspark.mllib.linalg.distributed import *

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    # attach a row index to each row, then convert to a BlockMatrix
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)
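For small inputs it can help to have a local reference result to compare a distributed product against. The sketch below is plain Python, not Spark. The function name matmul is made up for illustration, and the two lists are the rows from the question.

```python
def matmul(left, right):
    """Naive row-by-column product of two matrices given as lists of rows."""
    if any(len(row) != len(right) for row in left):
        raise ValueError("inner dimensions do not match")
    columns = list(zip(*right))  # transpose so each column is a tuple
    return [
        [sum(l * r for l, r in zip(row, col)) for col in columns]
        for row in left
    ]

rows_1 = [[1, 2], [4, 5], [7, 8]]
rows_2 = [[1, 2], [4, 5]]

product = matmul(rows_1, rows_2)
print(product)
# [[9, 12], [24, 33], [39, 54]]
```

A 3x2 matrix times a 2x2 matrix gives a 3x2 result, which is the shape to expect from the distributed multiplication as well.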
