In the tutorial 'Text classification from scratch', there is this snippet:
# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
How should I understand the map function here?
Here, map applies a function to every element of the dataset. The tutorial's example needs text-only data, so it discards the labels with the lambda function lambda x, y: x, which takes a (text, label) pair and returns only the text. This transformation runs on the host CPU as part of the tf.data input pipeline, so it can be overlapped with the GPU's work on the previous batch (for example by adding prefetch to the pipeline). When the GPU doesn't have to wait for the next batch of data, you get better utilization.
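As a rough plain-Python analogy (this is not tf.data itself), the effect of mapping lambda x, y: x over (text, label) pairs can be sketched as follows. The sample texts and labels are made up for illustration.

```python
# Plain-Python analogy of dataset.map(lambda x, y: x).
# The (text, label) pairs below are made-up sample data.
labeled_samples = [
    ("great movie", 1),
    ("terrible plot", 0),
    ("would watch again", 1),
]

# Each element is an (x, y) pair; keep x (the text) and drop y (the label).
drop_label = lambda x, y: x
text_only = [drop_label(x, y) for x, y in labeled_samples]

print(text_only)
```

The real Dataset.map does the same thing per element, but lazily and inside the input pipeline rather than on a Python list.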
I have my model written and trained in Keras. I'm trying to use it for inference in production. I receive SQS "task" messages containing a tuple of (path_in, path_out).
I can obviously use:
batch_messages = []
while True:
    # Block until a full batch of task messages has been read from SQS
    while len(batch_messages) < BATCH_SIZE:
        msg = sqs.read_message()
        batch_messages.append(msg)
    assert len(batch_messages) == BATCH_SIZE
    batch = np.array([read_image(msg.path_in) for msg in batch_messages])
    output_batch = model.predict(batch)
    for i in range(BATCH_SIZE):
        write_output(output_batch[i], path=batch_messages[i].path_out)
    batch_messages = []
The problem with that is that the code wastes most of the time reading from SQS, reading the image from disk and writing it back at the end. This means the GPU is idle during all this time.
I'm aware of Keras' Sequence, but I'm not sure whether it is intended for this case as well, and for inference rather than training.
I would suggest using TensorFlow Serving, as it implements a server-side batching strategy that optimizes inference speed and GPU utilization. If you'd also like to speed up your pipeline, you could convert the model to a TensorRT model, which optimizes the model's operations for a specific GPU (and does a lot more).
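Separately from the serving suggestion, the idle-GPU problem from the question can also be reduced by moving the reads and writes into background threads, so that the main thread only calls predict. The following is a minimal sketch of that idea, not a production implementation: read_input, predict and write_output are stand-ins for read_image, model.predict and write_output, the task list replaces the SQS stream, and BATCH_SIZE is an illustrative value.

```python
import queue
import threading

BATCH_SIZE = 4  # illustrative value
_STOP = object()  # sentinel marking the end of the stream


def loader(tasks, batches, read_input):
    """Read inputs in the background and hand over full batches."""
    batch = []
    for path_in, path_out in tasks:
        batch.append((read_input(path_in), path_out))
        if len(batch) == BATCH_SIZE:
            batches.put(batch)  # blocks while the queue is full
            batch = []
    if batch:  # flush a final partial batch
        batches.put(batch)
    batches.put(_STOP)


def writer(results, write_output):
    """Write predictions in the background."""
    while True:
        item = results.get()
        if item is _STOP:
            break
        output, path_out = item
        write_output(output, path_out)


def run_pipeline(tasks, read_input, predict, write_output):
    batches = queue.Queue(maxsize=2)  # the loader stays at most 2 batches ahead
    results = queue.Queue()
    threads = [
        threading.Thread(target=loader, args=(tasks, batches, read_input)),
        threading.Thread(target=writer, args=(results, write_output)),
    ]
    for t in threads:
        t.start()
    # The main thread only predicts; reading and writing happen elsewhere.
    while True:
        batch = batches.get()
        if batch is _STOP:
            break
        outputs = predict([x for x, _ in batch])
        for output, (_, path_out) in zip(outputs, batch):
            results.put((output, path_out))
    results.put(_STOP)
    for t in threads:
        t.join()


# Stand-ins for reading an image, model.predict, and writing a result.
written = {}
tasks = [(f"in_{i}", f"out_{i}") for i in range(10)]
run_pipeline(
    tasks,
    read_input=lambda path: len(path),
    predict=lambda xs: [x * 2 for x in xs],
    write_output=lambda out, path: written.__setitem__(path, out),
)
print(len(written))
```

Threads are enough here under the assumption that the reads and writes are I/O-bound. Error handling (for example a failing read) is left out of the sketch.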
When setting up a data input pipeline to Tensorflow (web cam images), a large amount of time is spent loading the data from the system RAM to the GPU memory.
I am trying to feed a constant stream of images (1024x1024) through my object detection network. I'm currently using a V100 on AWS to perform inference.
The first attempt was with a simple feed dict operation.
# Get layers
img_input_tensor = sess.graph.get_tensor_by_name('import/input_image:0')
img_anchors_input_tensor = sess.graph.get_tensor_by_name('import/input_anchors:0')
img_meta_input_tensor = sess.graph.get_tensor_by_name('import/input_image_meta:0')
detections_input_tensor = sess.graph.get_tensor_by_name('import/output_detections:0')
detections = sess.run(detections_input_tensor,
    feed_dict={img_input_tensor: molded_image, img_meta_input_tensor: image_meta, img_anchors_input_tensor: image_anchor})
This produced inference times around 0.06 ms per image.
However, after reading the TensorFlow manual I noticed that the tf.data API was recommended for loading data for inference.
# setup data input
data = tf.data.Dataset.from_tensor_slices((img_input_tensor, img_meta_input_tensor, img_anchors_input_tensor, detections_input_tensor))
iterator = data.make_initializable_iterator() # create the iterator
next_batch = iterator.get_next()
# load data
sess.run(iterator.initializer,
    feed_dict={img_input_tensor: molded_image, img_meta_input_tensor: image_meta, img_anchors_input_tensor: image_anchor})
# inference
detections = sess.run([next_batch])[0][3]
This sped up inference time to 0.01 ms, but the time taken to load the data was 0.1 ms. Overall, the iterator method is therefore significantly slower than the supposedly 'slower' feed_dict method. Is there something I can do to speed up the loading process?
Here is a great guide on data pipeline optimization. I personally find the .prefetch method the easiest way to boost your input pipeline. However, the article provides much more advanced techniques.
However, if your input data is not in TFRecords and you feed it yourself, you have to implement the described techniques (buffering, interleaved operations) yourself.
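For data that you feed yourself, a hand-rolled buffer could look roughly like the sketch below: a generator wrapper that produces items in a background thread while the consumer works on the previous one. The names prefetch and load_frames are hypothetical, and load_frames merely stands in for grabbing webcam frames.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, producing them ahead of time in a thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel: the producer is finished

    def produce():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def load_frames(n):
    """Hypothetical slow source, standing in for grabbing webcam frames."""
    for i in range(n):
        yield [i] * 3


frames = list(prefetch(load_frames(5), buffer_size=2))
print(frames)
```

With a tf.data pipeline the same effect is obtained by calling .prefetch(buffer_size) on the dataset, so the wrapper is only needed when the data does not go through tf.data.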
What is the purpose of the following PyTorch function (doc):
torch.addmm(beta=1, mat, alpha=1, mat1, mat2, out=None)
More specifically, is there any reason to prefer this function instead of just using
beta * mat + alpha * (mat1 @ mat2)
The addmm function is an optimized version of the expression beta * mat + alpha * (mat1 @ mat2). I ran some tests and timed their execution.
If beta=1 and alpha=1, then both statements (addmm and the manual version) take approximately the same time (addmm is just a little faster), regardless of the matrix size.
If beta and alpha are not 1, then addmm is about twice as fast as the manual version for smaller matrices (with a total number of elements on the order of 10^5). But if the matrices are large (on the order of 10^6 elements), the speedup seems negligible (39 ms vs. 41 ms).
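To make the formula concrete, here is a small pure-Python reference of what addmm computes. It is only meant to show the arithmetic, not how PyTorch implements it; matmul and addmm_reference are illustrative helper names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def addmm_reference(mat, mat1, mat2, beta=1, alpha=1):
    """Compute beta * mat + alpha * (mat1 @ mat2) element by element."""
    product = matmul(mat1, mat2)
    return [
        [beta * m + alpha * p for m, p in zip(mat_row, prod_row)]
        for mat_row, prod_row in zip(mat, product)
    ]


mat = [[1, 1], [1, 1]]
mat1 = [[1, 2], [3, 4]]
mat2 = [[5, 6], [7, 8]]

result = addmm_reference(mat, mat1, mat2, beta=0.5, alpha=2)
print(result)  # [[38.5, 44.5], [86.5, 100.5]]
```

In current PyTorch versions the corresponding call is torch.addmm(mat, mat1, mat2, beta=0.5, alpha=2), with beta and alpha as keyword arguments.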
I am taking this course.
It says that the reduce operation on an RDD is done one machine at a time. That means that if your data is split across 2 computers, the function below will work on the data on the first computer and find the result for that data; then it will take a single value from the second machine, run the function, and continue that way until it has finished with all values from machine 2. Is this correct?
I thought that the function would start operating on both machines at the same time and then, once it has the results from the 2 machines, run the function one last time.
rdd1=rdd.reduce(lambda x,y: x+y)
Update 1--------------------------------------------
Will the steps below give a faster answer than the reduce function?
seqOp = (lambda x, y: x+y)
combOp = (lambda x, y: x+y)
collData.aggregate(0, seqOp, combOp)
Update 2-----------------------------------
Should both sets of code below execute in the same amount of time? I checked, and both seem to take the same time.
import datetime
distData = sc.parallelize(data,4)
a=distData.reduce(lambda x,y:x+y)
seqOp = (lambda x, y: x+y)
combOp = (lambda x, y: x+y)
b=distData.aggregate(0, seqOp, combOp)
reduce behavior differs a little bit between native (Scala) and guest languages (Python) but simplifying things a little:
1. each partition is processed sequentially, element by element
2. multiple partitions can be processed at the same time, either by a single worker (multiple executor threads) or by different workers
3. partial results are fetched to the driver, where the final reduction is applied (this is the part that is implemented differently in PySpark and Scala)
Since it looks like you're using Python, let's take a look at the code:
reduce creates a simple wrapper for a user provided function:
def func(iterator):
This wrapper is passed to mapPartitions:
vals = self.mapPartitions(func).collect()
It should be obvious that this code is embarrassingly parallel and doesn't care how the results are utilized.
Collected vals are reduced sequentially on the driver using standard Python reduce:
reduce(f, vals)
where f is the function passed to RDD.reduce.
In comparison, Scala merges partial results asynchronously as they arrive from the workers.
In the case of treeReduce, step 3 can be performed in a distributed manner as well. See 'Understanding treeReduce() in Spark'.
To summarize: reduce, excluding driver-side processing, uses exactly the same mechanism (mapPartitions) as basic transformations like map or filter, and provides the same level of parallelism (once again excluding driver code). If you have a large number of partitions or f is expensive, you can parallelize / distribute the final merging using the tree* family of methods.
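The phases described above can be imitated in plain Python. The sketch below is a simulation only, not PySpark code: threads stand in for executors, the main thread stands in for the driver, and simulated_reduce, simulated_aggregate and split_into_partitions are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def simulated_reduce(partitions, f):
    def reduce_partition(partition):
        # Each partition is reduced sequentially, element by element.
        return reduce(f, partition) if partition else None

    # Partitions are independent, so they can be processed at the same time.
    with ThreadPoolExecutor() as workers:
        partials = list(workers.map(reduce_partition, partitions))

    # The "driver" merges the partial results sequentially.
    vals = [p for p in partials if p is not None]
    if not vals:
        raise ValueError("cannot reduce an empty collection")
    return reduce(f, vals)


def simulated_aggregate(partitions, zero, seq_op, comb_op):
    # seq_op folds the elements within a partition; comb_op merges partials.
    partials = [reduce(seq_op, partition, zero) for partition in partitions]
    return reduce(comb_op, partials, zero)


partitions = split_into_partitions(list(range(1, 101)), 4)
add = lambda x, y: x + y
total = simulated_reduce(partitions, add)
agg = simulated_aggregate(partitions, 0, add, add)
print(total, agg)  # 5050 5050
```

With the same operator for seqOp and combOp and a neutral zero value, aggregate does the same per-partition work as reduce in this model, which is consistent with the two taking about the same time in the question.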
I am quite new to Spark but already have programming experience with the BSP model. In the BSP model (e.g. Apache Hama), we have to handle all the communication and synchronization of nodes on our own. This is good on the one hand, because we have finer control over what we want to achieve, but on the other hand it adds more complexity.
Spark, on the other hand, takes all the control and handles everything on its own (which is great), but I don't understand how it works internally, especially in cases where we have a lot of data and message passing between nodes. Let me give an example:
zb = sc.broadcast(z)
r_i = x_i.map(x => Math.pow(norm(x - zb.value), 2))
u_i = u_i.zip(x_i).map(ux => ux._1 + ux._2 - zb.value)
x_i = f.prox(u_i.map(ui => {zb.value - ui}), rho)
x = x_i.reduce(_+_) / f.numSplits.toDouble
u = u_i.reduce(_+_) / f.numSplits.toDouble
z = g.prox(x+u, f.numSplits*rho)
r = Math.sqrt(r_i.reduce(_+_))
This is a method taken from here, which runs in a loop (let's say 200 times). x_i contains our data (let's say 100,000 entries).
In a BSP-style program, to process this map operation we partition the data and distribute it across multiple nodes. Each node processes its part of the data (the map operation) and returns the result to the master (after barrier synchronization). Since the master node wants to process each individual result returned (centralized master, see figure below), we send the result of each entry to the master (the reduce operator in Spark). So (only) the master receives 100,000 messages after each iteration. It processes this data and sends the new values to the slaves, which then start processing the next iteration.
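The BSP-style iteration described above can be sketched in plain Python as follows. This is a toy model under stated assumptions: threads stand in for nodes, the data and the local update rule are made up, and superstep is a hypothetical name rather than part of Hama or Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def superstep(partitions, z, local_update):
    """One BSP-style iteration: local work, barrier, then a master-side merge."""
    def process(partition):
        # Each worker processes its own partition using the broadcast value z.
        return [local_update(x, z) for x in partition]

    with ThreadPoolExecutor() as workers:
        # list(...) acts as the barrier: it returns only when all workers are done.
        per_partition = list(workers.map(process, partitions))

    # The master receives every individual result (one "message" per entry).
    all_results = [r for part in per_partition for r in part]
    return sum(all_results) / len(all_results)


def halfway(x, z):
    """Hypothetical local update: move each entry halfway towards z."""
    return (x + z) / 2


partitions = [[1.0, 2.0], [3.0, 4.0]]  # made-up data split across two workers
z = 0.0
for _ in range(3):
    z = superstep(partitions, z, halfway)
print(z)  # 2.1875
```

In this toy model the master handles one value per entry. Having each worker send a single partial sum per partition instead would cut the number of messages to the number of partitions.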
Now, since Spark takes control away from the user and does everything internally, I am unable to understand how Spark collects all the data after map operations (asynchronous message passing? I heard it has P2P message passing. What about synchronization between map tasks? If it does synchronize, is it right to say that Spark is actually a BSP model?). Then, in order to apply the reduce function, does it collect all the data on a central machine (if yes, does it receive 100,000 messages on a single machine?) or does it reduce in a distributed fashion (if yes, how can this be performed)?
The following figure shows my reduce function on the master. x_i^(k-1) represents the i-th value calculated (in the previous iteration) against the x_i data entry of my input. x_i^k represents the value of x_i calculated in the current iteration. Clearly, this equation needs the results to be collected.
I actually want to compare both styles of distributed programming to understand when to use Spark and when to move to BSP. I have searched a lot on the internet, but all I find is how map/reduce works; nothing useful was available on the actual communication/synchronization. Any helpful material would be useful as well.