If you have several nodes in Hazelcast, will IMap.size() return the size of the map on that node, or the total size of all objects in that map distributed across all nodes?
Looking through the javadocs, it doesn't show that method being overridden, so I imagine that method call won't return what would normally be expected in a non-distributed map.
It returns the estimated size of the complete distributed map. It is an estimate because the count is taken per partition at a given point in time, and the partitions are counted one after another, so under concurrent mutation the total does not have to be 100% exact. If the map is not being mutated while you call size(), the value is exact.
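As a quick illustration, here is a minimal sketch using the Hazelcast Python client (it assumes a cluster with one or more members is already running; the map name is arbitrary) showing that size() reports the cluster-wide count rather than the count held by any single member:

import hazelcast

client = hazelcast.HazelcastClient()      # connects to the running cluster
demo_map = client.get_map("demo").blocking()
for i in range(1000):
    demo_map.put(i, i)                    # entries spread across partitions on all members
# size() sums the per-partition counts across every member of the cluster
print(demo_map.size())                    # 1000, however the entries are distributed
client.shutdown()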
I have a pretty large text dataset that needs to be preprocessed, with additional live augmentation on batches during training. I implemented a custom torch Dataset where I read my raw text file, chunk it into shards that contain ~5000 samples processed per our requirements. Each shard is a TensorDataset containing, for each sample, the tokens, token types, position ids, etc from HuggingFace tokenizers.
Since each shard is pretty large, we can only load a single shard in memory at a time. We don't really use DDP, because our work requires only a single GPU. Since we can't randomly access every index in the dataset due to sharding, we have to forgo the idx argument in the Dataset's __getitem__ and instead implement our own approach, where our Dataset object tracks the index inside the shard that we have retrieved so far. We can shuffle the shard contents and the loading order of the shards themselves, so that is not a concern.
Unfortunately, when we use multiple worker processes, they create replicas of the shard in the Dataset object, ergo they also create replicas of our internal index tracker. This means for 2 worker processes, we are returning 2 copies of the same entry in the shard.
I realize there is a way to directly control what indices a worker process can access inside our shard with the worker_init_fn, where we could use the size of the current shard to basically tell worker 1 to work on the first half, and worker 2 to work on the second half.
However, I am not too sure how to refresh the shard when we exhaust its samples. That is, since each worker holds a replica of the original shard and, presumably, its own counter, is there a way to share information between workers so that we know when both workers together have retrieved, say, 5000 samples and exhausted the shard? Then we could load a new shard into both workers. Ideally, the workers should share the shard as well, instead of each holding another copy of the same data.
Addendum: I looked at a few examples of IterableDatasets from PyTorch at this link. It is unclear to me, for example in the following snippet, when the stream is being refreshed. If I could figure that out, I could probably synchronize a refresh of the shard across all workers.
An alternative solution for us is to use HDF5 files, but the read overhead is too large compared to our current sharding method.
def __getitem__(self, idx):
    """
    Given an index, retrieve a single sample from the post-processed
    dataset.
    We retrieve from shards, so we need to track the current shard we are
    working with, as well as which sample in the shard we are retrieving.
    `self.getcount` tracks the current index, and is initialized to 0 in __init__.
    Shuffling is controlled with the `data_shuffle` argument during
    initialization of the Dataset.
    """
    self.last_idx = idx
    response = self.sharded_dataset[self.getcount]
    self.getcount += 1
    if self.getcount == self.current_shardsize:  # we have exhausted examples in this shard
        self.getcount = 0
        self.shard_load_index += 1  # increment the shard index that we will load
        if self.shard_load_index == self.max_shard_number:  # we have processed all shards
            self.shard_load_index = 0
        self.sharded_dataset = self.load_shard(self.shard_load_index)
        self.current_shardsize = len(self.sharded_dataset)
        if self.masking:
            self.sharded_dataset = self.sharded_refresh_mask_ids(self.sharded_dataset)
    return response
Here, getcount is our internal tracker of the index. Each worker seems to get its own copy of getcount, so we get each sample from self.sharded_dataset twice.
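A common way around the duplicated-index problem is an IterableDataset whose __iter__ asks torch.utils.data.get_worker_info() which worker it is running in and yields only that worker's stride of each shard. Below is a minimal sketch under that assumption; ShardedIterable, num_shards and the toy load_shard are illustrative stand-ins for your own shard loader, not your actual code.

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    def __init__(self, num_shards):
        self.num_shards = num_shards

    def load_shard(self, shard_idx):
        # stand-in for the real shard loader; returns a list of samples
        return [(shard_idx, i) for i in range(5000)]

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        for shard_idx in range(self.num_shards):
            # each worker still loads its own copy of the shard (true sharing would
            # need shared memory), but it yields only its own stride of indices,
            # so no sample is returned twice across workers
            shard = self.load_shard(shard_idx)
            for i in range(worker_id, len(shard), num_workers):
                yield shard[i]

loader = DataLoader(ShardedIterable(num_shards=3), batch_size=32, num_workers=2)
for batch in loader:
    pass  # train on batch

Because every worker walks the shards in the same order on a disjoint stride, no explicit cross-worker synchronization is needed to decide when to move to the next shard; the trade-off is that each worker keeps its own copy of whichever shard it is currently reading.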
We are applying a few predicates to an IMap containing just 100,000 objects to filter data. These predicates change per user. While doing a POC on my local machine (16 GB) with two nodes (each node shows 50,000 entries) and 100,000 records, I get results in about 30 seconds, which is far slower than querying the database directly.
Will increasing the number of nodes reduce the time? I even tried PagingPredicate, but it takes around 20 seconds per page.
IMap objectMap = hazelcastInstance.getMap("myMap");
MultiMap resultMap = hazelcastInstance.getMultiMap("myResultMap");
/* Option 1: passing a Hazelcast predicate to imap.values */
objectMap.values(predicate).parallelStream().forEach(value -> resultMap.put(userId, value));
/* Option 2: applying a java.util.function.Predicate to entrySet OR localKeySet */
objectMap.entrySet().parallelStream().filter(predicate).forEach(entry -> resultMap.put(userId, entry.getValue()));
More nodes will help, but the improvement is difficult to quantify. It could be large, it could be small.
Part of the work in the code sample involves applying a predicate across 100,000 entries. If there is no index, the scan stage checks 50,000 entries per node when there are 2 nodes. Double up to 4 nodes and each has 25,000 entries to scan, so the scan time will halve.
The scan time is only part of the query time; the overall result set also has to be assembled from the partial results returned by each node. So doubling the number of nodes might nearly halve the run time as a best case, or it might not be a significant improvement.
Perhaps the bigger question here is: what are you trying to achieve?
objectMap.values(predicate) in the code sample retrieves the result set to a central point, which then has parallelStream() applied to try to merge the results in parallel into a MultiMap. So this looks like more of an ETL than a query.
Use of executors as per the title, and something like objectMap.localKeySet(predicate) might allow this to be parallelised out better, as there would be no central point holding intermediate results.
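On the index point: indexing the attribute the predicate filters on usually reduces the scan cost far more than adding members does. A minimal sketch, shown here with the Hazelcast Python client purely for brevity (a recent client version, a running cluster, and the field name and value are all assumptions):

import hazelcast
from hazelcast.predicate import sql

client = hazelcast.HazelcastClient()              # assumes a running cluster
object_map = client.get_map("myMap").blocking()
# index the attribute the predicate filters on, so members don't scan every entry
object_map.add_index(attributes=["status"])
# the predicate is evaluated on the members; only matching values come back
matching = object_map.values(sql("status = 'OPEN'"))
print(len(matching))
client.shutdown()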
I am a little lost in this task. There is a requirement for our caching solution to split a large data dictionary into partitions and perform operations on them in separate threads.
The scenario is: we have a large pool of data that is to be kept in memory (40 million rows). The chosen strategy is first to have a Dictionary with an int key. This dictionary contains 16 inner dictionaries that are keyed by GUID and contain a data class.
The number 16 is calculated on startup and indicates CPU core count * 4.
The data class contains a byte[] which is basically a translated set of properties and their values, int pointer to metadata dictionary and checksum.
Then there is a set of control functions that takes care of locking and assigns/retrieves the GUID-keyed data based on dividing the first segment of the GUID (8 hex digits) by a divider. This divider is simply FFFFFFFF / 16. This way each key has a corresponding partition assigned.
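For illustration only (a Python sketch just to make the arithmetic concrete; the actual solution is .NET), the partition index described above can be derived like this:

import uuid

PARTITIONS = 16                      # CPU core count * 4 in this setup
DIVIDER = 0xFFFFFFFF // PARTITIONS   # FFFFFFFF / 16

def partition_for(guid):
    first_segment = int(str(guid)[:8], 16)  # first 8 hex digits of the GUID
    # clamp the top edge so 0xFFFFFFFF itself does not land in a 17th bucket
    return min(first_segment // DIVIDER, PARTITIONS - 1)

print(partition_for(uuid.uuid4()))   # a value in 0..15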
Now I need to figure out how to perform operations (key lookups, iterations and writes) on these dictionaries in separate threads, in parallel. Should I just wrap these operations in Tasks? Or would it be better to load these behemoth dictionaries whole into separate threads?
I have a rough idea how to implement data collectors, that will be the easy part I guess.
Also, is using Dictionaries a good approach? Their size is limited to 3 million rows per partition, and if one is full, the control mechanism tries to insert on another server that uses the exact same mechanism.
Is .NET actually a good language to implement this solution?
Any help will be extremely appreciated.
Okay, so I implemented ReaderWriterLockSlim and concurrent access through System.Threading.Tasks. I also managed to exclude the dataClass object from the storage; now it is only a dictionary of byte[]s.
It's able to store all 40 million rows in just under 4 GB of RAM, and through some careful SIMD-optimized manipulations it performs EQUALS, <, > and SUM iterations in under 20 ms, so I guess this issue is solved.
Also the concurrency throughput is quite good.
I just wanted to post this in case anybody faces similar issue in the future.
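For anyone curious what those scans look like, here is a rough, hypothetical illustration in Python/NumPy (a stand-in for the SIMD-style vectorized passes mentioned above; the column layout and sizes are made up): each comparison or SUM is a single vectorized pass over the column rather than a per-row loop.

import numpy as np

# one hypothetical 40M-row integer column decoded from the byte[] storage
values = np.random.randint(0, 1_000_000, size=40_000_000, dtype=np.int32)

equals_hits = np.count_nonzero(values == 42)        # EQUALS scan
less_hits = np.count_nonzero(values < 100)          # < scan
greater_hits = np.count_nonzero(values > 999_000)   # > scan
total = values.sum(dtype=np.int64)                  # SUM
print(equals_hits, less_hits, greater_hits, total)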
I would like to clarify my understanding of shuffle in depth and how Spark uses shuffle managers. Here are some very helpful resources:
https://trongkhoanguyenblog.wordpress.com/
https://0x0fff.com/spark-architecture-shuffle/
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
Reading them, I understood there are different shuffle managers. I want to focus on two of them: the hash manager and the sort manager (which is the default).
To frame my question, I want to start from a very common transformation:
val rdd = pairRdd.reduceByKey(_ + _)
This transformation performs map-side aggregation and then a shuffle to bring all identical keys into the same partition.
My questions are:
Is map-side aggregation implemented internally using a mapPartitions transformation, aggregating all identical keys with the combiner function, or is it implemented with an AppendOnlyMap or ExternalAppendOnlyMap?
If AppendOnlyMap or ExternalAppendOnlyMap maps are used for aggregating, are they also used for the reduce-side aggregation that happens in the ResultTask?
What exactly is the purpose of these two kinds of maps (AppendOnlyMap and ExternalAppendOnlyMap)?
Are AppendOnlyMap and ExternalAppendOnlyMap used by all shuffle managers or just by the sort manager?
I read that once an AppendOnlyMap or ExternalAppendOnlyMap is full it is spilled to a file; how exactly does that step happen?
Using the sort shuffle manager, we use an AppendOnlyMap to aggregate and combine partition records, right? Then when execution memory fills up, we start sorting the map, spilling it to disk and then cleaning it up. My question is: what is the difference between spill to disk and shuffle write? Both essentially create files on the local file system, but they are treated differently: shuffle-write records are not put into the AppendOnlyMap.
Can you explain in depth what happens when reduceByKey is executed, covering all the steps involved, e.g. map-side aggregation, shuffling and so on?
Here is a step-by-step description of reduceByKey:
reduceByKey calls combineByKeyWithClassTag with an identity createCombiner and the same function used for mergeValue and mergeCombiners (a small PySpark sketch of this equivalence follows these steps).
combineByKeyWithClassTag creates an Aggregator and returns a ShuffledRDD. Both the "map" and "reduce" side aggregations use this internal mechanism and don't utilize mapPartitions.
The Aggregator uses an ExternalAppendOnlyMap for both combineValuesByKey ("map-side reduction") and combineCombinersByKey ("reduce-side reduction").
Both methods use the ExternalAppendOnlyMap.insertAll method.
ExternalAppendOnlyMap keeps track of spilled parts and the current in-memory map (SizeTrackingAppendOnlyMap).
The insertAll method updates the in-memory map and checks on each insert whether the estimated size of the current map exceeds the threshold. It uses the inherited Spillable.maybeSpill method. If the threshold is exceeded, maybeSpill calls spill as a side effect, and insertAll initializes a clean SizeTrackingAppendOnlyMap.
spill calls spillMemoryIteratorToDisk, which gets a DiskBlockObjectWriter from the block manager.
The insertAll steps are applied for both map- and reduce-side aggregations with the corresponding Aggregator functions, with the shuffle stage in between.
As of Spark 2.0 there is only the sort-based manager: SPARK-14667
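To make the first step concrete: at the API level, reduceByKey(func) is just combineByKey with an identity createCombiner and func used for both mergeValue and mergeCombiners. A small PySpark sketch of that equivalence (PySpark is used here only for illustration; the internals above are on the Scala/JVM side):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

via_reduce = pairs.reduceByKey(lambda x, y: x + y)
via_combine = pairs.combineByKey(
    lambda v: v,             # createCombiner: identity
    lambda c, v: c + v,      # mergeValue: the same function as reduceByKey's func
    lambda c1, c2: c1 + c2)  # mergeCombiners: the same function again

print(sorted(via_reduce.collect()))   # [('a', 4), ('b', 6)]
print(sorted(via_combine.collect()))  # [('a', 4), ('b', 6)]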
I have a fairly simple use case, but potentially very large result set. My code does the following (on pyspark shell):
from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    # do something with item
    pass
I find that whenever I kick off the actual processing by calling either count() or toLocalIterator(), my operation ultimately ends with an out-of-memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes my memory? If so, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?
Thanks for any insights.
Edit: A fundamental limitation of FPGrowth is that the entire FP Tree has to fit in memory. So, the suggestions about raising the minimum support threshold are valid.
-Raj
Well, the problem is most likely the support threshold. When you set a very low value like this (I wouldn't call one-in-a-million frequent), you basically throw away all the benefits of the downward-closure property.
It means the number of itemsets considered grows exponentially, and in the worst-case scenario it will be equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
Edit:
Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset in your data has to be frequent (1e-6 × 200,000 = 0.2, so any itemset that appears even once clears the threshold). So basically what you're trying to do here is enumerate all observed itemsets.
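For example, a hedged sketch of the same pipeline with a support threshold that actually prunes (0.01 is just an illustrative value, tune it to your data) keeps the FP-tree and the result set bounded:

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext.getOrCreate()
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' ')).cache()

# an itemset must now appear in at least 1% of transactions to be kept,
# so the downward-closure property can prune aggressively
model = FPGrowth.train(transactions, minSupport=0.01,
                       numPartitions=transactions.getNumPartitions())

for itemset in model.freqItemsets().toLocalIterator():
    print(itemset.items, itemset.freq)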