I have a pretty large text dataset that needs to be preprocessed, with additional live augmentation on batches during training. I implemented a custom torch Dataset where I read my raw text file and chunk it into shards that each contain ~5000 samples processed per our requirements. Each shard is a TensorDataset containing, for each sample, the tokens, token types, position ids, etc. from HuggingFace tokenizers.
Since each shard is pretty large, we can only load a single shard in memory at a time. We don't really use DDP mode, because our work requires only a single GPU. Since we can't randomly access every index in the dataset due to sharding, we have to forgo the idx argument in the Dataset's __getitem__ and instead implement our own approach, where our Dataset object tracks the index inside the shard that we have retrieved so far. We can shuffle the shard contents and the loading order of the shards themselves, so that is not a concern.
Unfortunately, when we use multiple worker processes, they create replicas of the shard in the Dataset object, and therefore also replicas of our internal index tracker. This means that with 2 worker processes, we return 2 copies of the same entry in the shard.
I realize there is a way to directly control what indices a worker process can access inside our shard with the worker_init_fn, where we could use the size of the current shard to basically tell worker 1 to work on the first half, and worker 2 to work on the second half.
However, I am not too sure how to refresh the shard when we exhaust its samples. Since each worker holds a replica of the original shard, and presumably its own counter, is there a way to share information between workers so that we know both workers together have retrieved, say, 5000 samples and exhausted the shard? Then we could load a new shard into both workers. Ideally, the workers would also share the shard itself, instead of keeping another copy of the same data.
Addendum: I looked at a few examples of IterableDatasets from PyTorch at this link. It is unclear to me, for example in the following snippet, when the stream is being refreshed. If I could figure that out, I could probably synchronize a refresh of the shard across all workers.
An alternate solution for us is to use HDF5 files, but the read overhead is too large compared to our current sharding method.
def __getitem__(self, idx):
    """
    Given an index, retrieve a single sample from the post-processed
    dataset.
    We retrieve from shards, so we need to track the current shard we are
    working with, as well as which sample in the shard we are retrieving.
    `self.getcount` tracks the current index, and is initialized to 0 in __init__.
    Shuffling is controlled with the `data_shuffle` argument during
    initialization of the Dataset.
    """
    self.last_idx = idx
    response = self.sharded_dataset[self.getcount]
    self.getcount += 1
    if self.getcount == self.current_shardsize:  # we have exhausted examples in this shard
        self.getcount = 0
        self.shard_load_index += 1  # increment the shard index that we will load
        if self.shard_load_index == self.max_shard_number:  # we have processed all shards
            self.shard_load_index = 0
        self.sharded_dataset = self.load_shard(self.shard_load_index)
        self.current_shardsize = len(self.sharded_dataset)
        if self.masking:
            self.sharded_dataset = self.sharded_refresh_mask_ids(self.sharded_dataset)
    return response
Here, getcount is our internal tracker of the index. Each worker seems to get its own copy of getcount, so we get each sample from self.sharded_dataset twice.
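For reference, this is roughly the direction I am considering (a sketch, not our actual code): an IterableDataset where each worker yields only its own stride of every shard via get_worker_info(). The shard "refresh" then happens implicitly and in lockstep, because every worker advances to the next shard once its own slice is exhausted. Note that each worker still loads its own copy of the shard; truly sharing the tensors would need something like memory-mapped storage, which this sketch does not attempt. load_shard here is a toy placeholder for our real shard-loading helper.

import torch
from torch.utils.data import DataLoader, IterableDataset, TensorDataset, get_worker_info

class ShardedIterableDataset(IterableDataset):
    def __init__(self, shard_paths, load_shard):
        self.shard_paths = shard_paths  # pre-shuffled list of shard files
        self.load_shard = load_shard    # placeholder: path -> TensorDataset

    def __iter__(self):
        info = get_worker_info()
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        for path in self.shard_paths:
            shard = self.load_shard(path)  # each worker loads its own copy of the shard
            # yield only this worker's stride of indices, so no sample is emitted twice
            for idx in range(worker_id, len(shard), num_workers):
                yield shard[idx]
            # falling out of this inner loop is the implicit "refresh": every worker
            # moves on to the next shard after finishing its own slice

def load_shard(path):
    # toy placeholder standing in for our real shard loader
    return TensorDataset(torch.arange(10).unsqueeze(1))

loader = DataLoader(ShardedIterableDataset(["shard0.pt", "shard1.pt"], load_shard),
                    batch_size=4, num_workers=2)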
Related
We are applying a few predicates on an IMap containing just 100,000 objects to filter data. These predicates will change per user. While doing a POC on my local machine (16 GB) with two nodes (each node shows 50,000 entries) and 100,000 records, I am getting output in 30 seconds, which is way more than querying the database directly.
Will increasing the number of nodes reduce the time? I even tried with PagingPredicate, but it takes around 20 seconds for each page.
IMap objectMap = hazelcastInstance.getMap("myMap");
MultiMap resultMap = hazelcastInstance.getMultiMap("myResultMap");
/* Option 1: passing a Hazelcast predicate to imap.values */
objectMap.values(predicate).parallelStream().forEach(value -> resultMap.put(userId, value));
/* Option 2: applying a Java predicate to entrySet OR localKeySet */
objectMap.entrySet().parallelStream().filter(predicate).forEach(entry -> resultMap.put(userId, entry.getValue()));
More nodes will help, but the improvement is difficult to quantify. It could be large, it could be small.
Part of the work in the code sample involves applying a predicate across 100,000 entries. If there is no index, the scan stage checks 50,000 entries per node when there are 2 nodes. Double up to 4 nodes and each has 25,000 entries to scan, so the scan time will halve.
The scan time is only part of the query time; the overall result set also has to be assembled from the partial results returned by each node. So doubling the number of nodes might nearly halve the run time as a best case, or it might not be a significant improvement.
Perhaps the bigger question here is what are you trying to achieve?
objectMap.values(predicate) in the code sample retrieves the result set to a central point, which then has parallelStream() applied to try to merge the results in parallel into a MultiMap. So this looks like more of an ETL than a query.
Use of executors as per the title, and something like objectMap.localKeySet(predicate) might allow this to be parallelised out better, as there would be no central point holding intermediate results.
Hi, I've got a simple collection with 40k records in it. It's just an import of a CSV (c. 4 MB), so it has a consistent object per document, and it is for an Open Data portal.
I need to be able to offer a full download of the data as well as the capabilities of AQL for querying, grouping, aggregating etc.
If I set batchSize to the full dataset then it takes around 50 seconds to return, and the response is unsurprisingly about 12 MB because the column names are repeated in every document.
e.g.
{"query":"for x in dataset return x","batchSize":50000}
I've tried things like caching and balancing between a larger batchSize and using the cursor to build the whole dataset, but I can't get the response time down very much.
Today I came across the attributes and values functions and created this AQL statement.
{"query":"return union(
for x in dataset limit 1 return attributes(x,true),
for x in dataset return values(x,true))","batchSize":50000}
It will mean I have to reassemble the objects from the arrays, but I use PapaParse so that should be no issue (not proven yet).
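For completeness, here is a rough sketch of that reassembly step in Python (I would really do it with PapaParse on the client; this is just to show the shape of the result). It assumes every document has the same attributes in the same order, which should hold for a straight CSV import:

import csv

def write_csv(union_result, path="dataset.csv"):
    # union_result is the single array returned by the query above:
    # element 0 is the list of attribute names, the rest are per-document value lists
    header, rows = union_result[0], union_result[1:]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)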
Is this the best / only way to have an option to output the full csv and still have a response that performs well?
I am trying to avoid having to store the data multiple times, e.g. once as the raw CSV and again as documents in a collection. I guess there may be a dataset that is too big to cope with this approach, but this is one of our bigger datasets.
Thanks
I have a very basic understanding of Spark and I am trying to find something that can help me achieve the following:
Have a pool of objects shared over all the nodes, asynchronously.
What I am thinking of currently is: let's say there are ten nodes, numbered from 1 to 10.
If I have a single object, I will have to synchronize access to it in order for it to be usable by any node. I do not want that.
Second option is, I can have a pool of say 10 objects.
I want to write my code in such a way that the node number 1 always uses the object number 1, the node number 2 always uses the object number 2 and so on..
A sample approach would be, before performing a task, get the thread ID and use the object number (threadID % 10). This would result in a lot of collisions and would not work.
Is there a way that I can somehow get a node ID or process ID, and make my code fetch an object according to that ID? Or some other way to have an asynchronous pool of objects on my cluster?
I apologize if it sounds trivial, I am just getting started and cannot find a lot of resources pertaining to my doubt online.
PS : I am using a SparkStreaming + Kafka + YARN setup if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.
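A minimal PySpark sketch of that broadcast pattern (the names and values here are purely illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")

# An immutable lookup object we want every executor to be able to read.
lookup = {"a": 1, "b": 2}
bc = sc.broadcast(lookup)          # shipped once to each executor, read-only there

rdd = sc.parallelize(["a", "b", "a", "c"])
print(rdd.map(lambda k: bc.value.get(k, 0)).collect())   # [1, 2, 1, 0]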
I get a bulk write request for, let's say, some 20 keys from the client.
I can either write them to C* in one batch or write them individually in an async way and wait on the futures for them to complete.
Writing in a batch does not seem to be a good option as per the documentation, since my insertion rate will be high and, if keys belong to different partitions, the coordinators will have to do extra work.
Is there a way in the DataStax Java driver with which I can group keys which could belong to the same partition, club them into small batches, and then do individual unlogged batch writes asynchronously? In that way I make fewer RPC calls to the server, and at the same time the coordinator will have to write locally. I will be using the token aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so some coordination happens on the driver side.
Then you can group your requests by equality of partition key; that will probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is, e.g., you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then, when inserting several such objects, you group them by MyData.partitioningKey. That is, for all existing partitioningKey values, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
If you wish to go further and mimic Cassandra's hashing, then you should look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class; it has a getTokenRanges method whose result you can compare against Murmur3Partitioner.getToken (or whichever partitioner you configured in cassandra.yaml). I've never tried that myself, though.
So I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
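If it helps, here is a minimal sketch of that grouping, written with the DataStax Python driver just to keep it short (the question is about the Java driver, where the structure is the same); the keyspace, table, and column names are made up:

from collections import defaultdict
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["127.0.0.1"])           # the Python driver is token aware by default
session = cluster.connect("my_keyspace")   # made-up keyspace
insert = session.prepare(
    "INSERT INTO my_table (partitioning_key, clustering_key, other_value) VALUES (?, ?, ?)")

def write_grouped(rows):
    # rows: iterable of (partitioning_key, clustering_key, other_value) tuples
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)          # group by partition key only

    futures = []
    for same_partition in groups.values():
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for row in same_partition:
            batch.add(insert, row)
        futures.append(session.execute_async(batch))   # one unlogged batch per partition

    for f in futures:
        f.result()                          # wait for completion, surfacing any errors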
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the partition key distribution. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
You can find sample code on the above site showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
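As a rough illustration of the same idea (the linked samples are Java; this uses the DataStax Python driver for brevity, and the keyspace/table/column names are made up), the driver's cassandra.concurrent helper issues individual writes asynchronously while capping how many are in flight:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # made-up keyspace
insert = session.prepare(
    "INSERT INTO my_table (partitioning_key, clustering_key, other_value) VALUES (?, ?, ?)")

rows = [("pk1", "ck1", "v1"), ("pk1", "ck2", "v2"), ("pk2", "ck1", "v3")]
results = execute_concurrent_with_args(
    session, insert, rows, concurrency=50, raise_on_first_error=False)
for success, result_or_exc in results:
    if not success:
        print("write failed:", result_or_exc)   # retry / log as appropriate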
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The coordinator doesn’t have any extra work (as for multi partition writes) because everything goes into a single partition. Single partition batches are optimized: they are applied with a single RowMutation [10].
In a few words: single partition batches don’t put much more load on the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over. [..] The coordinator has to do a lot more work than any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate of this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6
Say that I have an RDD with 3 partitions and I want to run each executor/worker in sequence, such that after partition 1 has been computed, partition 2 can be computed, and after 2 is computed, finally, partition 3 can be computed. The reason I need this synchronization is that each partition has a dependency on some computation of the previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited to the Spark framework.
I have pondered opening a JDBC connection in each worker task node as illustrated below:
rdd.foreachPartition( partition => {
// 1. open jdbc connection
// 2. poll database for the completion of dependent partition
// 3. read dependent edge case value from computed dependent partition
// 4. compute this partition
// 5. write this edge case result to database
// 6. close connection
})
I have even pondered using accumulators, picking the acc value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really the issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get no benefit from in-memory processing. In its current form it doesn't make sense to use Spark at all.
Generally speaking, if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy, but you can try something like this:
1. Create a first RDD with data from the first partition. Process it in parallel and optionally push the results outside.
2. Compute a differential buffer.
3. Create a second RDD with data from the second partition. Merge it with the differential buffer from 2, process it, and optionally push the results to the database.
4. Go back to 2. and repeat.
What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
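A minimal PySpark sketch of that loop; load_chunk, process, and compute_buffer are hypothetical stand-ins for the real logic. The point is that each step is an ordinary parallel job, and only the small differential buffer is carried (broadcast) from one step to the next:

from pyspark import SparkContext

sc = SparkContext(appName="sequential-chunks-sketch")

def load_chunk(i):                 # hypothetical: returns the i-th chunk's records
    return [(i, x) for x in range(5)]

def process(record, buffer):       # hypothetical per-record computation
    return record[1] + buffer

def compute_buffer(results):       # hypothetical: reduce results to a small carry-over value
    return max(results)

buffer = 0
for i in range(3):                 # three "partitions"/chunks processed in order
    bc = sc.broadcast(buffer)      # ship the previous step's edge-case value
    rdd = sc.parallelize(load_chunk(i))
    results = rdd.map(lambda r: process(r, bc.value)).collect()
    # optionally push `results` to the database here
    buffer = compute_buffer(results)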