I have just written a nontrivial reduce function in which the rereduce execution path differs from the standard reduce.
It is my understanding that rereduce is only performed in certain situations; specifically, to merge reduce operations performed across multiple nodes or large ranges of keys.
As my local development environment runs on a single node with a relatively small dataset, how can I effectively test the behaviour of my reduce function in a rereduce scenario?
You could write a script to generate larger amounts of realistic dummy data. The only way I've been able to test my map-reduce functions is with real or fake data, but lots of it.
As a side benefit (and it's perversely fun), you also get a good idea of how long indexing and view requests will take, and a peek at how your app will do at scale. Load testing never hurts.
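For instance, here is a minimal sketch of such a generator in Python, bulk-loading invented documents into a local CouchDB via the _bulk_docs endpoint. The database URL, document shape, and counts are all hypothetical, and depending on your CouchDB version you may need to add admin credentials:

import json
import random
import urllib.request
import uuid

DB = "http://127.0.0.1:5984/testdb"  # hypothetical database URL

def bulk_insert(docs):
    # CouchDB accepts batches of documents via its _bulk_docs endpoint.
    req = urllib.request.Request(
        DB + "/_bulk_docs",
        data=json.dumps({"docs": docs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Generate enough documents that the view indexer has real work to do.
batch = []
for _ in range(100000):
    batch.append({"_id": uuid.uuid4().hex,
                  "type": "reading",
                  "value": random.randint(0, 1000)})
    if len(batch) == 1000:
        bulk_insert(batch)
        batch = []
if batch:
    bulk_insert(batch)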
I don't test my scripts in CouchDB. Instead I use:
a JS-IDE (Webstorm)
behaviour driven testing (jasmine)
a full suite of JSON test documents
a self-written runner script that mocks the calls to the map and reduce functions.
an external version control system to manage changes to the queries.
This way I can regression-test and unit-test my map and reduce functions, while supporting a level of complexity that simply cannot be handled inside CouchDB.
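To give a flavour of such a runner, here is a minimal sketch in Python of how the rereduce path can be simulated outside CouchDB: reduce the rows in chunks first, then feed the partial results back through the function with rereduce set to true, as CouchDB does when merging across B-tree nodes. The reduce function shown (a plain sum) is just a stand-in for your own logic:

def reduce_fn(keys, values, rereduce):
    # Stand-in reduce mirroring CouchDB's (keys, values, rereduce)
    # signature; on the rereduce path, `values` holds earlier reduce
    # outputs rather than raw mapped values.
    return sum(values)

def simulate_rereduce(rows, chunk_size=2):
    # First pass: normal reduce over small chunks of (key, value) rows.
    partials = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        partials.append(reduce_fn([k for k, _ in chunk],
                                  [v for _, v in chunk],
                                  rereduce=False))
    # Second pass: rereduce over the partial results.
    return reduce_fn(None, partials, rereduce=True)

rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)]
assert simulate_rereduce(rows) == 15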
Related
I have a Spark application that applies many transformations to many files.
At first I created one long chain of transformations (many DataFrames that carry out those transformations) and a single action (persisting the result, about 1M rows). However, this version doesn't work: it always throws GC or heap exceptions. So I decomposed it into intermediate actions and persisted every intermediate result. At first I thought that having many read/write operations would be a performance issue, but it works. So my question is:
What is the best way to decompose Spark transformations? (I suspect the intermediate reading/writing operations are not optimal.)
IO is slower than simple computation, but extremely complex computation may be slower than IO. Cache space is limited and needs to be used to reduce compute time.
I would cache the extremely complex computations so that they aren't re-evaluated multiple times. If the data is reused more than twice, caching breaks even with the IO time.
If the computation is not that complex, you needn't cache; just recompute. But check how many times it is being reused: if reuse is high, caching yields better performance.
There are various storage levels (memory, disk, or both) for caching intermediate data; you can leverage those instead of writing intermediates explicitly to disk.
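For example, a hedged sketch of that approach in PySpark, with hypothetical paths and column names standing in for your real pipeline:

from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-intermediate").getOrCreate()

df = spark.read.parquet("/data/input")  # hypothetical input path

# Stand-in for your chain of expensive transformations.
intermediate = df.withColumn("score", F.col("value") * 2)

# MEMORY_AND_DISK keeps partitions in memory and spills the rest to
# local disk, avoiding both recomputation and an explicit write/read
# round trip.
intermediate.persist(StorageLevel.MEMORY_AND_DISK)

# Each action below reuses the cached data instead of recomputing the chain.
intermediate.filter(F.col("score") > 0).count()
intermediate.groupBy("key").agg(F.sum("score")).write.parquet("/data/out")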
Being new to Snowflake I am trying to understand how to write JavaScript based Stored Procedures (SP) to take advantage of multi-thread/parallel processing.
My background is SQL Server, writing SPs and taking advantage of performance features such as degrees of parallelism, worker threads, indexing, and columnstore segment elimination.
I am starting to get accustomed to setting up the storage and using clustering keys, micro-partitioning, and the other performance features available, but I don't get how Snowflake SPs break a given SQL statement down into parallel streams. I am struggling to find any documentation explaining the internal workings.
My concern is producing SPs that serialise everything on one thread and become bottlenecks.
I am wondering whether I am applying the correct technique, or whether I need a different mindset for developing SPs.
I hope I have explained my concern sufficiently. In essence, I am building a PoC to migrate an on-premises SQL Server DWH ETL solution to a Snowflake/Matillion ELT solution, one aspect of which is evaluating the virtual warehouse (compute) size I need.
Stateless UDFs will run in parallel by default; this is what I observed when importing large amounts of binary data via base64 encoding.
Stateful UDFs run in parallel on the data as controlled by the PARTITION BY and ORDER BY clauses applied to it. The only trick to remember is to always explicitly initialize your state, because the JavaScript instance can be reused across subsequent PARTITION BY batches; don't rely on a check for undefined to know whether you're on the first row.
I'm learning Spark and trying to process some huge dataset. I don't understand why I don't see decrease in stage completion times with following strategy (pseudo):
data = sc.textFile(dataset).cache()
while True:
    data.count()
    y = data.map(...).reduce(...)
    data = data.filter(lambda x: x < y).persist()
So the idea is to pick y so that it roughly halves the data each iteration. But for some reason it looks like all the data is processed again on every count().
Is this some kind of anti-pattern? How am I supposed to do this with Spark?
Yes, that is an anti-pattern.
map, like most (but not all) of the distributed primitives in Spark, is pretty much by definition a divide-and-conquer approach. You take the data, compute splits, and transparently distribute the computation of individual splits over the cluster.
Trying to divide this process further using the high-level API makes no sense. At best it will provide no benefit at all; at worst it will incur the cost of multiple data scans, caching, and spills.
Spark is lazily evaluated, so in the while loop above each call to data.filter does not actually return filtered data; it returns a description of a Spark computation to be executed later. All of these pending computations get aggregated and are executed together when an action is eventually run.
In particular, results remain unevaluated and merely represented until a Spark action gets called, and past a certain point the application can't handle that many parallel tasks.
In a way we’re running into a conflict between two different representations: conventional structured coding with its implicit (or at least implied) execution patterns and independent, distributed, lazily-evaluated Spark representations.
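A small sketch (with made-up data) of the lazy evaluation described above: transformations only build up a lineage, and nothing runs until an action is called:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(10))
filtered = data.filter(lambda x: x < 5)  # builds a plan; no job runs yet
doubled = filtered.map(lambda x: x * 2)  # still no job

print(doubled.toDebugString().decode())  # shows the accumulated lineage
print(doubled.count())                   # the action finally triggers a job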
Spark Newbie alert. I've been exploring the ideas to design a requirement which involves the following:
Building a base predictive model for linear regression (a one-off activity)
Pass the data points to get the value for the response variable.
Do something with the result.
At regular intervals update the models.
This has to be done in a sync (req/resp) mode so that the caller code invokes the prediction code, gets the result and carries on with the downstream. The caller code is outside spark (it's a webapp).
I'm struggling to understand whether Spark/Spark Streaming is a good fit for doing the linear regression, purely because of its async nature.
From what I understand, it simply works off a job paradigm where you tell it a source (DB, queue, etc.), it does the computation, and it pushes the result to a destination (DB, queue, file, etc.). I can't see an HTTP/REST interface that could be used to fetch the results.
Is Spark the right choice for me? Or are there any better ideas to approach this problem?
Thanks.
If I got it right, then in general you have to solve three basic problems:
Build the model
Use that model to perform predictions per synchronous HTTP request (for scoring or something like that) from the outside (in your case, the webapp)
Update the model within some interval to make it more precise
In general, Spark is a platform for performing distributed batch computations on datasets in a pipeline manner. So you were right about the job paradigm. A job is essentially a pipeline that will be executed by Spark, with start and end operations. You get benefits such as workload distribution across your cluster, effective resource utilization, and good performance (compared with other such platforms and frameworks) thanks to data partitioning, which allows narrow operations to execute in parallel.
So for me the right solution is to use Spark to build and update your model, then export it to some other system that serves your requests and uses the model for predictions.
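As an illustration, a minimal sketch (with hypothetical paths and column names) of building and saving such a model with Spark ML, so that a separate serving process can load it later:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-lr").getOrCreate()

df = spark.read.parquet("/data/training")  # hypothetical training set

# Assemble raw columns into the single feature vector Spark ML expects.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(df)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

# Persist the fitted model; a serving component can reload it with
# LinearRegressionModel.load(...) or export it (e.g. to PMML, see the
# links below) for use entirely outside Spark.
model.write().overwrite().save("/models/lr")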
Do something with the result
In step 3 you can use Kafka and Spark Streaming to feed back the corrected results and update your model's precision.
Some useful links which can probably help:
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
https://datascience.stackexchange.com/questions/13028/loading-and-querying-a-spark-machine-learning-model-outside-of-spark
http://openscoring.io/blog/2016/07/04/sparkml_realtime_prediction_rest_approach/
I want to run a Spark job, where each RDD is responsible for sending certain traffic over a network connection. The return value from each RDD is not very important, but I could perhaps ask them to return the number of messages sent. The important part is the network traffic, which is basically a side effect for running a function over each RDD.
Is it a good idea to perform the above task in Spark?
I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually set up multiple machines to run the sender, but I thought it would be nice if I could take advantage of Spark's existing distributed framework.
However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.
To be clear, I'm thinking of something like the following
from operator import add  # reduceByKey needs a binary function

IDs = sc.parallelize(range(0, n))

def f(x):
    for i in range(0, 100):
        message = make_message(x, i)
        SEND_OVER_NETWORK(message)
    return (x, 100)

IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)

for (ID, count) in counts.collect():
    print("%i ran %i times" % (ID, count))
Generally speaking it doesn't make sense:
Spark is a heavyweight framework. At its core there is a huge machinery which ensures that data is properly distributed and collected, that recovery is possible, and so on. This has a significant impact on overall performance and latency, but it provides no benefit for side-effects-only tasks.
Spark's concurrency is relatively coarse-grained, with the partition as the main unit of concurrency. At this level processing becomes synchronous: you cannot move on to the next partition before you finish the current one.
Let's say in your case there is a single slow SEND_OVER_NETWORK. If you use map, you pretty much block processing of a whole partition. You can drop to a lower level with mapPartitions, make SEND_OVER_NETWORK asynchronous, and return only when the whole partition has been processed (see the sketch after this point). That is better, but still suboptimal.
You can increase the number of partitions, but that means higher bookkeeping overhead, so at the end of the day you can make the situation worse, not better.
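For point 2, here is a hedged sketch of the mapPartitions variant, reusing the question's SEND_OVER_NETWORK and make_message placeholders, with a thread pool as one possible way of making the sends asynchronous within a partition:

from concurrent.futures import ThreadPoolExecutor

def send_partition(rows):
    # Fire off the sends concurrently within the partition so one slow
    # SEND_OVER_NETWORK call doesn't serialize everything behind it.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(SEND_OVER_NETWORK, make_message(x, i))
                   for x in rows for i in range(100)]
        for future in futures:
            future.result()  # propagate errors; block until sends finish
    yield len(futures)  # per-partition count of messages sent

sent_counts = IDs.mapPartitions(send_partition).collect()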
The Spark API is designed mostly for side-effect-free operations, which makes it hard to express operations that don't fit that model.
What is arguably more important is that Spark guarantees only that each operation is executed at least once (let's ignore the zero-times case, when an RDD is never materialized). If an application requires, for example, exactly-once semantics, things become tricky, especially when you consider point 2.
It is possible to keep track of local state for each partition outside the main Spark logic but if you get there it is a really good sign that Spark is not the right tool.