Why is Spark Mllib KMeans algorithm extremely slow? - apache-spark

I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 Million rows, 100 cols. I'm using Mllib KMeans also and it is extremely slow. The job never finishes in fact and I have to kill it. I am running this on Google cloud (dataproc). It runs if I ask for a smaller number of clusters (k=1000), but still take more than 35 minutes. I need it to run for k~5000. I have no idea why is it so slow. The data is properly partitioned given the number of workers/nodes and SVD on a 1 million x ~300,000 col matrix takes ~3 minutes, but when it comes to KMeans it just goes into a black hole. I am now trying a lower number of iterations (2 instead of 100), but I feel something is wrong somewhere.
KMeansModel Cs = KMeans.train(datamatrix, k, 100);//100 iteration, changed to 2 now. # of clusters k=1000 or 5000

It looks like the reason is relatively simple. You use quite large k and combine it with an expensive initialization algorithm.
By default Spark is using as distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). Distributed version is roughly O(k) so with larger k you can expect slower start. This should explain why you see no improvement when you reduce number of iterations.
Using large K is also expensive when model is trained. Spark is using a variant of Lloyds which is roughly O(nkdi).
If you expect complex structure of the data there most likely a better algorithms out there to handle this than K-Means but if you really want to stick with it you start with using random initialization.

Please try other implementations of k-means. Some like the variants in ELKI are way better than Spark, even on only a single CPU. You will be surprised how much performance you can get out of a single node, without going to a cluster! From my experiments, you would need at least a 100 node cluster to beat good local implementations, unfortunately.
I read that these C++ versions are multi-core (but single-node) and probably the fastest K-means you can find right now, but I have not yet tried that myself yet (for all my needs, the ELKI versions were bazingly fast, finishing in a few seconds on my largest data sets).

Related

How big can batches in Flink respectively Spark get?

I am currently working on a framework for analysis application of an large scale experiment. The experiment contains about 40 instruments each generating about a GB/s with ns timestamps. The data is intended to be analysed in time chunks.
For the implemetation I would like to know how big such a "chunk" aka batch can get before Flink or Spark stop processing the data. I think it goes with out saying that I intend to recollect the processed data.
For live data analysis
In general, there is no hard limit on how much data you can process with the systems. It all depends on how many nodes you have and what kind of a query you have.
As it sounds as you would mainly want to aggregate per instrument on a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then, the question arises on how big your time chunks are/how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, then the system needs to hold 1 GB per second. So if you window is one hour, the system needs to hold at least 3.6 TB of data.
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you rather need to calculate small values (like sums, averages), main memory shouldn't become an issue.
For old data analysis
When analysis old data, the system can do batch processing and have much more options to handle the volume including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and the suit for you. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.

Spark with divide and conquer

I'm learning Spark and trying to process some huge dataset. I don't understand why I don't see decrease in stage completion times with following strategy (pseudo):
data = sc.textFile(dataset).cache()
while True:
data.count()
y = data.map(...).reduce(...)
data = data.filter(lambda x: x < y).persist()
So idea is to pick y so that it most of the time ~halves the data. But for some reason it looks like all the data is always processed again on each count().
Is this some kind of an anti-pattern? How I'm supposed to do this with Spark?
Yes, that is an anti-pattern.
map, same as most, but not all, of the distributed primitives in Spark, is pretty much by definition a divide and conquer approach. You take the data, you compute splits, and transparently distribute computing of individual splits over the cluster.
Trying to further divide this process, using high level API, makes no sense at all. At best it will provide no benefits at all, at worst it will incur the cost of multiple data scans, caching and spills.
Spark is lazily evaluated so in the for or while loop above each call to data.filter does not sequentially return the data but instead sequentially returns Spark calls to be executed later. All these calls get aggregated and then executed simultaneously when you do something later.
In particular, results remain unevaluated and merely represented until a Spark Action gets called. Past a certain point the application can’t handle that many parallel tasks.
In a way we’re running into a conflict between two different representations: conventional structured coding with its implicit (or at least implied) execution patterns and independent, distributed, lazily-evaluated Spark representations.

How to change number of parallel tasks in pyspark

How to change number of parallel tasks in pyspark ?
I mean how to change number of virtual maps that is run on my PC. actually I want to sketch Speed up chart by number of map functions.
sample code:
words = sc.parallelize(["scala","java","hadoop"])\
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
If you understand my purpose but I asked it in a wrong way I would appreciate if you correct it
Thanks
For this toy example number of parallel tasks will depend on:
Number of partition for the input rdd - set by spark.default.parallelism if not configured otherwise.
Number of threads assigned to local (might be superseded by the above).
Physical and permission-based capabilities of the system.
Statistical properties of the dataset.
However Spark is not a lightweight parallelization - for this we have low overhead alternatives like threading and multiprocessing, higher level components built on top of these (like joblib or RxPy) and native extensions (to escape GIL with threading).
Spark itself is heavyweight, with huge coordination and communication overhead, and as stated by by desernaut it is hardly justified for anything than testing, when limited to a single node. In fact, it can make things much worse with higher parallelism

Spark performs poorly when generating non-associate features

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key,(time,quantity,user-id,...)j] => ... => bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key,(t1,t2)j)]
In RDD1, time denotes the time-stamp where the event has happened and, in RDD2, denotes the acceptable time-interval for each feature. The feature-key is "key". I have two types of features as follows:
associative features: number of items
non-associative features: Example: unique number of users
For each feature-key, I need to see which events fall in the interval (t1,t2) and then aggregate those things. So, I have a join followed by a reduce operation as follows:
`RDD1.join(RDD2).map((key,(v1,v2))=>(key,featureObj)).reduceByKey(...)`
The initial value for my feature would be featureObj=(0,set([])) where the first argument keeps number of items and the second stores number of unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
Now, when I run the job to just calculate the associative feature, it runs very fast on a cluster of 16 m2.xlarge, in only 3 minutes. The minute I add the second one, the computation time jumps to 5min. I tried to add a couple of other non-associate features and, every time, the run-time increases fast. Right now, my job runs in 15minutes for 15 features 10 of them are non-associative. I also tried to use KyroSerializer and persist RDDs in a serialized form but nothing special happened. Since I will be moving to implement more features, this issue seems to become a bottleneck.
PS. I tried to do the same task on a single big host (128GB of Ram and 16 cores). With 145 features, the whole job was done in 10minutes. I am under the impression that the main Spark bottleneck is JOIN. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, spark web-console still shows "2.6GB" shuffle-read and "15.6GB" shuffle-write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for a wrong application? Thanks for the comments in advance.
With best regards,
Ali
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations, even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak in order to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires java 8u45 or higher.
In my case the biggest problem was memory allocation in netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stable.

When does Cassandra hit Amdahl's law?

I am trying to understand the claims that Cassandra scales linearly with the number of nodes. In a quick look around the 'net I have not seen much of a treatment of this topic. Surely there are serial processing elements in Cassandra that must limit the speed gained as N increases. Any thoughts, pointers or links on this subject would be appreciated.
Edit to provide perspective:
I am working on a project that has a current request for a 1,000+ node Cassandra infrastructure. I did not come-up with this spec. I find myself proposing that N be reduced to a range between 200 and 500, with each node being at least twice as fast for serial computation. This is easy to achieve without a cost penalty per node by making simple changes to the server configuration.
Cassandra's scaling is better described in terms of Gustafson's law, rather than Amdahl's law. Gustafson scaling looks at how much more data you can process as the number of nodes increases. That is, if you have N times as many nodes, you can process a dataset N times larger in the same amount of time.
This is possible because Cassandra uses very little cluster-wide coordination, except for schema and ring changes. Most operations only involve a number of nodes equal to the replication factor, which stays constant as the dataset grows -- hence nearly linear scale out.
By contrast, Amdahl scaling looks at how much faster you can process a fixed dataset as the number of nodes increases. That is, if you have N times as many nodes, can you process the same dataset N times faster?
Clearly, at some point you reach a limit where adding more nodes doesn't make your requests any faster, because there is a minimum amount of time needed to service a request. Cassandra is not linear here.
In your case, it sounds like you're asking whether it's better to have 1,000 slow nodes or 200 fast ones. How big is your dataset? It depends on your workload, but the usual recommendation is that the optimal size of nodes is around 1TB of data each, making sure you have enough RAM and CPU to match (see cassandra node limitations). 1,000 sounds like far too many, unless you have petabytes of data.

Resources