How to know how long it will take to execute PySpark code - apache-spark

Is there a way to know how much time a piece of code will take to finish, or at least an approximation?
I am thinking of something like when you are copying a file in Windows and it shows how much time is left, or when you download something and it tells you approximately how long it will take.
Is there a way to do this for Spark code, from something very simple like queries to more complex jobs?
Thanks

The Spark developers themselves have considered implementing this but decided against it due to the uncertainty of predicting the completion time of stragglers. See the discussion in this Spark issue: https://issues.apache.org/jira/browse/SPARK-5216
So you will not get that information from Spark itself. Instead you must implement your own estimation model.
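If all you need is a live progress readout to feed such a model, PySpark does expose per-stage task counts through SparkContext.statusTracker(). Below is a minimal sketch (the app name and the watched job are made up) that polls the tracker from a background thread and prints a naive linear ETA; linear extrapolation is exactly what SPARK-5216 warns breaks down with stragglers, so treat it as a rough indicator only.

    import threading
    import time

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("progress-demo").getOrCreate()
    sc = spark.sparkContext
    done = threading.Event()

    def report_progress(interval=2.0):
        # Poll the status tracker and print a naive per-stage estimate.
        tracker = sc.statusTracker()
        start = time.time()
        while not done.is_set():
            for stage_id in tracker.getActiveStageIds():
                info = tracker.getStageInfo(stage_id)
                if info and info.numTasks > 0 and info.numCompletedTasks > 0:
                    frac = info.numCompletedTasks / info.numTasks
                    elapsed = time.time() - start
                    eta = elapsed * (1 - frac) / frac
                    print(f"stage {stage_id}: {frac:.0%} of tasks done, ~{eta:.0f}s left")
            time.sleep(interval)

    reporter = threading.Thread(target=report_progress, daemon=True)
    reporter.start()
    spark.range(100_000_000).selectExpr("sum(id)").collect()  # the job being watched
    done.set()
    reporter.join()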

Related

Multiprocessing Pipe() Wrapper Broken: Something is Hanging. V5

Enhanced Python Multiprocessing Data Pipeline Wrapper
Objective
This is a piece of a big project I'm working on, and an important part that will massively simplify report transmission in my program. The program tests a function against millions of inputs and uses multiprocessing to speed things up. Source code on Pastebin.
Goals and Benefit
Put simply, multiprocessing.Pipe() is inadequate on its own. It should be able to handle massive strings and switch process execution between a sender and a receiver. I wrote this wrapper to implement the following (a sketch of the chunking idea follows the list):
Automatic error handling
Transmission error categorization
Data transmission chunking and reassembly
Unlimited data transmission size
Process synchronization
Simple abstraction to enhance usability
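As a rough illustration of the chunking-and-reassembly goal, here is a minimal sketch; the function names, sentinel convention, and chunk size are hypothetical, not the wrapper's actual code. The real wrapper layers error handling, error categorization, and synchronization on top of this basic pattern.

    import multiprocessing as mp

    CHUNK_SIZE = 64 * 1024  # hypothetical chunk size in bytes

    def send_chunked(conn, data):
        # Split an arbitrarily large bytes payload into pipe-friendly chunks,
        # then send a None sentinel so the receiver knows the message is done.
        for i in range(0, len(data), CHUNK_SIZE):
            conn.send(data[i:i + CHUNK_SIZE])
        conn.send(None)

    def recv_chunked(conn):
        # Reassemble chunks until the sentinel arrives.
        parts = []
        while True:
            chunk = conn.recv()
            if chunk is None:
                return b"".join(parts)
            parts.append(chunk)

    def worker(conn):
        send_chunked(conn, b"hi" * 5_000_000)  # ~10 MB payload
        conn.close()

    if __name__ == "__main__":
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=worker, args=(child_conn,))
        p.start()
        message = recv_chunked(parent_conn)
        p.join()
        print(len(message))  # 10000000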
Former Problem
It has a weird bug I can't find. Days and plenty of documentation later, it's still not fixed. I've left in a good many debug lines. Try entering "hi": you should see "Receiver.Test: Output: hi" but don't. Try a second time and it just hangs (see the sample output).
Fixed by a dear friend.
Tests
The GPE works, and the first two tests pass. For test 1, the source code produces the expected results correctly and consistently. For tests 2 and 3, it produces approximately the expected results.
Plea!
It's time to ask for help. This is part of a larger project, and to be fair there are a good many lines of code. This should be part of the multiprocessing module. I'm humbled. Can someone tell me what's up? PLEASE? ANYONE??
Nobody answered...
Problem
In receive_oscillate, the yield from delegation looks like it never clears.
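For readers who haven't hit that failure mode, here is a minimal, hypothetical illustration (made-up names, not the post's actual code): if the sub-generator never returns, control never advances past the yield from line.

    def endless():
        while True:
            yield "tick"          # this sub-generator never returns ...

    def receive_oscillate():
        yield from endless()      # ... so the delegation here never clears
        yield "done"              # unreachable

    gen = receive_oscillate()
    for _, msg in zip(range(3), gen):
        print(msg)                # prints "tick" three times; "done" never appears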
Notes
Also, deeply nested functions and one-liner compound conditionals are not idiomatic Python. Furthermore, breaking up your deeply nested functions and adding automated unit tests will help reduce bugs and ease maintenance.

Best way to search through a very big dataset?

I have text files containing about 12 GB worth of tweets and need to search this dataset for keywords. What is the best way to go about doing this?
I'm familiar with Java, Python, and R. I don't think my computer can handle the files if, for example, I write some sort of script that goes through each text file in Python.
"Oh, Python, or any other language, can most-certainly do it." Might take a few seconds, but the job will get done. I suggest that the best approach to your problem is: "straight ahead." Write scripts that process the files one line at a time.
Although "12 gigabytes" sounds enormous to us, to any modern-day machine it's really not that big at all.
Build hashes (associative arrays) in memory as needed. Generally avoid database operations (other than "SQLite" database files, maybe ...), but if you happen to find yourself needing "indexed file storage," SQLite is a terrific tool.
. . . with one very important caveat: when using SQLite, use transactions, even when reading. By default, SQLite will physically commit every write and physically verify every read unless you are in a transaction. Then, and only then, will it "lazy read/write," as you might have expected it to do all the time. (And then, "that sucker's f-a-s-t...!")
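Putting both pieces of advice together, here is a minimal sketch; the data directory, table name, and keyword set are hypothetical. Each file is streamed one line at a time so memory stays flat, and all SQLite inserts happen inside a single transaction that commits once at the end.

    import sqlite3
    from pathlib import Path

    KEYWORDS = {"spark", "hadoop"}  # hypothetical keyword set

    def matching_lines(path):
        # Stream one file line by line; never load 12 GB into memory at once.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                lowered = line.lower()
                if any(kw in lowered for kw in KEYWORDS):
                    yield line.rstrip("\n")

    conn = sqlite3.connect("matches.db")
    conn.execute("CREATE TABLE IF NOT EXISTS hits (tweet TEXT)")

    with conn:  # one transaction: everything commits once, at the end
        for path in Path("tweets").glob("*.txt"):  # hypothetical data directory
            conn.executemany(
                "INSERT INTO hits (tweet) VALUES (?)",
                ((line,) for line in matching_lines(path)),
            )
    conn.close()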
If you want to be exact, then you need to look at every file at least once, so if your computer can't take that load, then say goodbye to exactness.
Another approach would be to use approximation algorithms, which are faster than the exact ones but come at the expense of losing accuracy.
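One classic example of that trade-off is a Bloom filter: constant-memory membership tests (e.g., "have I already seen this tweet?") that may return false positives but never false negatives. The sketch below uses illustrative sizes; tune size_bits and num_hashes to your data volume and acceptable false-positive rate.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=10_000_000, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item):
            # Derive num_hashes bit positions from salted MD5 digests.
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # May report a false positive, but never a false negative.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    seen.add("tweet id 12345")
    print("tweet id 12345" in seen)   # True
    print("tweet id 99999" in seen)   # almost certainly False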
That should get you started; I will stop my answer here, since the topic is just too broad to continue from this point.

Matlab parallel programming

First of all, sorry for the general title and, probably, for the general question.
I'm facing a dilemma. I have always worked in C++, and right now I'm trying to do something very similar to one of my previous projects: parallelize a single-target object tracker written in MATLAB, assigning an object to each concurrent thread and then gathering the results at each frame. In C++ I used the Boost thread API to do this, with good results. Is it possible in MATLAB? Reading around, I'm finding it rather unclear; I'm reading a lot about the parfor loop, but is that pretty much it? Can I impose synchronization barriers similar to boost::barrier in order to stop each thread after each frame and make it wait for the others before going on to the next frame?
Basically, I wish to initialize some common data structures and then launch a few parallel instances of the tracker, which share some data and take different objects to track as input. Any suggestion will be greatly appreciated!
parfor is only one piece of the functionality provided by Parallel Computing Toolbox. It's the simplest, and most people find it the most immediately useful, which is probably why most of the resources your research has found discuss only that.
parfor gives you a way to very simply parallelize "embarrassingly parallel" tasks, in other words tasks that are independent and do not require any communication between them (for example, parameter sweeps or Monte Carlo analyses).
It sounds like that's not what you need. From your question I'm not entirely sure exactly what you do need, but since you mention synchronization, barriers, and waiting for one task to finish before another moves forward, I would suggest you take a look at features of Parallel Computing Toolbox such as labSend, labReceive, labBarrier, and spmd, which allow you to implement a more message-passing style of parallelization. There is plenty more functionality in the toolbox than just parfor.
Also, don't be afraid to ask MathWorks for advice on this; there are several (free) recorded webinars and tutorials on this sort of parallelization that they can point you towards.
Hope that helps!

Examples of simple stats calculation with Hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable over partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig, and I figured that a good practical place to start was to compute basic statistics on my data, i.e. the arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords, and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good examples of how to calculate mean and variance using Hadoop, and/or provide some sample code?
Thanks
Pig Latin has an associated library of reusable code called PiggyBank that has numerous handy functions. Unfortunately it didn't have variance last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.
I should note that variance is difficult to compute in a numerically stable way over huge data sets, so take care!
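To make that concrete, the standard trick (Welford's online update within a partition, plus the Chan et al. pairwise merge across partitions) maps directly onto Hadoop's combine/reduce steps. A language-agnostic sketch in Python, not tied to any particular Hadoop API:

    import random
    import statistics
    from dataclasses import dataclass

    @dataclass
    class Stats:
        n: int = 0
        mean: float = 0.0
        m2: float = 0.0  # sum of squared deviations from the running mean

        def push(self, x):
            # Welford's online update: numerically stable within one partition.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def merge(self, other):
            # Chan et al. pairwise merge: combine two partitions' summaries.
            if self.n == 0:
                return other
            if other.n == 0:
                return self
            n = self.n + other.n
            delta = other.mean - self.mean
            mean = self.mean + delta * other.n / n
            m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
            return Stats(n, mean, m2)

        @property
        def variance(self):
            # Population variance; use m2 / (n - 1) for the sample variance.
            return self.m2 / self.n if self.n else float("nan")

    data = [random.gauss(0, 1) for _ in range(10_000)]
    partitions = [data[i::4] for i in range(4)]  # pretend these live on 4 nodes

    merged = Stats()
    for part in partitions:
        s = Stats()
        for x in part:
            s.push(x)
        merged = merged.merge(s)

    print(merged.mean, merged.variance)
    print(statistics.fmean(data), statistics.pvariance(data))  # should agree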
You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.
http://www.cascading.org/
And if you are into Clojure, you might watch these GitHub projects:
http://github.com/clj-sys
They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).

Server-side language for CPU/memory-intensive processing

What's a good server-side language for doing some pretty CPU- and memory-intensive things that plays well with PHP and MySQL? Currently I have a PHP script which runs some calculations based on a large subset of a fairly large database and then updates that database based on those calculations (1.5 million rows). The current implementation is very slow, taking 1-2 hours depending on other activity on the server. I was hoping to improve this and was wondering what people's opinions are on a good language for this type of task.
Where is the bottleneck? Run some real profiling and see what exactly is causing the problem. Is it the DB I/O? Is it the CPU? Is the algorithm inefficient? Are you calling slow library methods in a tight inner loop? Could precalculation be used?
You're pretty much asking what vehicle you need to get from point A to point B, and you've offered a truck, a car, a bicycle, an airplane, a jet, and a helicopter. The answer won't make sense without more context.
The language isn't the issue; your issue is probably where you are doing these calculations. It sounds like you may be better off writing this in SQL, if possible. Is it possible? What are you doing?
I suspect your bottleneck is not the computation: it can easily take several hours just to update a few million records row by row.
If that's the case, you can write a custom function in C/C++ for MySQL and execute it from a stored procedure.
We do this in our database to re-encrypt some sensitive fields during key rotation. It shrunk key-rotation time from days to hours. However, it's a pain to maintain your own copy of MySQL. We have been looking for alternatives, but nothing comes close to the performance of this approach.
