Does findAll() on a repository fetch all results before returning? - spring-data-cassandra

Does findAll() on a repository fetch all results before returning? The reason I am asking is that I got an OutOfMemoryException when calling it on a rather large dataset (~15 GB on disk). Since it returns an Iterable<T>, I was expecting it to batch in the background so that I could easily iterate over the entire dataset.
I guess I'll file an issue on the JIRA if the method does not allow larger query results.

Currently, we do not support paging, but would like to explore supporting it.
Please enter an issue at https://jira.spring.io/browse/DATACASS
-matthew
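
For reference, later Spring Data Cassandra releases (2.0 and up) added slice-based paging via CassandraPageRequest, which avoids materialising the whole table at once. A minimal sketch assuming such a version, with a hypothetical Person entity and repository:

import org.springframework.data.cassandra.core.query.CassandraPageRequest;
import org.springframework.data.cassandra.repository.CassandraRepository;
import org.springframework.data.domain.Slice;

// Hypothetical repository for a Person entity mapped with @Table.
interface PersonRepository extends CassandraRepository<Person, String> {}

class FindAllInSlices {
    // Walk the table slice by slice instead of loading everything at once.
    void processAll(PersonRepository repository) {
        Slice<Person> slice = repository.findAll(CassandraPageRequest.first(1000));
        while (true) {
            slice.forEach(person -> System.out.println(person));
            if (!slice.hasNext()) {
                break;
            }
            // nextPageable() carries Cassandra's paging state into the next fetch.
            slice = repository.findAll(slice.nextPageable());
        }
    }
}

Each findAll(Pageable) call fetches only one page, so memory use stays bounded by the slice size rather than the table size.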

Related

Firestore GET query Error: 9 FAILED_PRECONDITION: The requested snapshot version is too old

I am performing a query to get data from Firestore. I am doing it for 20K+ files.
First, I used the get() method, but it didn't work.
Then, I tried to do the same using a stream.
It works sometimes and sometimes it doesn't.
Now, the only solution I have left is probably to use limit() with get().
What I wanted to ask is: what are the best practices for doing large reads from Firestore? If anyone has done the same in the past and can share their approach, that would be quite helpful.
How soon is this error observed once you start your process? It seems you are hitting a product timeout due to the large dataset; please see the performance best practices on snapshot usage here [4].
After investigating further in terms of workarounds, please consider the following additional approaches:
By paginating the data, as mentioned here [1] (see the sketch after the links below).
By batching or splitting up the request [2].
Instead of using a document snapshot, you can use real-time updates [3].
[1]: https://stackoverflow.com/questions/60175102/the-requested-snapshot-version-is-too-old-error-in-firestore#:~:text=by%20paginating%20the%20data
[3]: https://firebase.google.com/docs/firestore/query-data/listen
[4]: https://cloud.google.com/firestore/docs/best-practices#realtime_updates
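
To illustrate approach [1], here is a minimal cursor-pagination sketch using the Firestore Java server SDK; the collection name "files" and the page size of 500 are assumptions, and the same pattern applies in the Node.js Admin SDK:

import com.google.cloud.firestore.FieldPath;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.Query;
import com.google.cloud.firestore.QueryDocumentSnapshot;
import java.util.List;

class PagedRead {
    // Page through a large collection with cursors instead of one huge get().
    void readInPages(Firestore db) throws Exception {
        Query page = db.collection("files")
                .orderBy(FieldPath.documentId())  // a stable ordering is required for cursors
                .limit(500);

        while (true) {
            List<QueryDocumentSnapshot> docs = page.get().get().getDocuments();
            if (docs.isEmpty()) {
                break;
            }
            docs.forEach(doc -> System.out.println(doc.getId()));

            // Start the next page right after the last document of this one.
            QueryDocumentSnapshot last = docs.get(docs.size() - 1);
            page = db.collection("files")
                    .orderBy(FieldPath.documentId())
                    .startAfter(last)
                    .limit(500);
        }
    }
}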

Firestore Query performance issue on Firebase Cloud Functions

I am facing timeout issues on a Firebase HTTPS function, so I decided to optimize each line of code and realized that a single query was taking about 10 seconds to complete.
let querySnapshot = await admin.firestore()
.collection("enrollment")
.get()
The enrollment collection has about 23k documents, totaling approximately 6MB.
To my understanding, since the HTTPS function runs on a stateless Cloud Functions server, it should not suffer from the query result size. Both Firestore and Cloud Functions are running in the same region (us-central). Yet 10 seconds is a long time to execute such a simple query that returns such a small snapshot.
An interesting fact is that later in the code I update those 23k documents with a new field using BulkWriter, and it takes less than 3 seconds to run bulkWriter.commit().
Another fact is that the https function is not returning any of the query results to the client, so there shouldn't be any "downloading" time affecting the function performance.
Why on earth does it take 3x longer to read values from a collection than to write to it? I always thought Firestore's architecture was meant for apps with high read rates rather than high write rates.
Is there anything you would propose to optimize this?
When you perform the get(), a query is created over all document snapshots and the results are returned. These results are fetched sequentially within a single execution, i.e. the list is returned and parsed sequentially until all documents have been read.
While the data may be small, are there any subcollections? This may add some additional latency as the API fetches and parses subcollections.
Updating the fields with a BulkWriter update is over 3x faster because BulkWriter operations are performed in parallel and queued using Promises. This allows many more operations per second.
The best way to optimize listing all documents is summarised in this link, and Google's recommendation follows the same guideline: use an index for faster queries and use multiple readers that fetch the documents in parallel (a sketch of the parallel-reader approach follows below).
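
A rough sketch of the parallel-reader idea, using the Firestore Java server SDK (the original function is Node.js, but the Admin SDKs expose the same listDocuments()/getAll() calls); the chunk size is an assumption:

import com.google.api.core.ApiFuture;
import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;
import java.util.ArrayList;
import java.util.List;

class ParallelRead {
    // Fetch a whole collection with several parallel readers instead of one get().
    List<DocumentSnapshot> readAllInParallel(Firestore db) throws Exception {
        List<DocumentReference> refs = new ArrayList<>();
        db.collection("enrollment").listDocuments().forEach(refs::add); // keys only

        int chunkSize = 1000;
        List<ApiFuture<List<DocumentSnapshot>>> pending = new ArrayList<>();
        for (int i = 0; i < refs.size(); i += chunkSize) {
            List<DocumentReference> chunk =
                    refs.subList(i, Math.min(i + chunkSize, refs.size()));
            // Each getAll() call is issued immediately; they run concurrently.
            pending.add(db.getAll(chunk.toArray(new DocumentReference[0])));
        }

        List<DocumentSnapshot> all = new ArrayList<>();
        for (ApiFuture<List<DocumentSnapshot>> future : pending) {
            all.addAll(future.get()); // wait for every chunk to finish
        }
        return all;
    }
}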

Azure search index storage size stops at 8MB

I am trying to ingest a load of 13k JSON documents into Azure Search, but the index stops at around 6k documents without any error from the indexer; the index storage size is 7.96 MB and it does not surpass this limit no matter what I try.
I have tried using smaller batches of 3k/indexer and after that 1k/indexer, but I got the same result.
My JSON has around 10 simple fields and 20 complex fields (which have other nested complex fields, up to 5 levels deep).
Do you have any idea whether there is a size limit per index, and where I can configure it?
As for the tier/SLA, I think we are using the S1 plan (based on the limits we have: 50 indexers, and so on).
Thanks
Really hard to help without seeing it, but I remember facing a problem like this in the past. In my case, it was a problem with duplicate values in the key field (documents that share a key overwrite each other instead of being added, so the document count stalls).
I also recommend using smaller batches (~500 documents); a batching sketch follows below.
PS: Check whether your nested JSON objects are too big (in case they are marked as retrievable).
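
A minimal batching sketch along those lines, using the azure-search-documents Java SDK and checking per-document results so silent failures are visible; the endpoint, admin key and index name are placeholders:

import com.azure.core.credential.AzureKeyCredential;
import com.azure.search.documents.SearchClient;
import com.azure.search.documents.SearchClientBuilder;
import java.util.List;
import java.util.Map;

class BatchedUpload {
    // Push documents in batches of ~500 and print any per-document failures.
    void uploadInBatches(List<Map<String, Object>> docs) {
        SearchClient client = new SearchClientBuilder()
                .endpoint("https://<service>.search.windows.net")
                .credential(new AzureKeyCredential("<admin-key>"))
                .indexName("<index-name>")
                .buildClient();

        int batchSize = 500;
        for (int i = 0; i < docs.size(); i += batchSize) {
            List<Map<String, Object>> batch =
                    docs.subList(i, Math.min(i + batchSize, docs.size()));
            client.uploadDocuments(batch).getResults().forEach(result -> {
                if (!result.isSucceeded()) {
                    System.err.println(result.getKey() + ": " + result.getErrorMessage());
                }
            });
        }
    }
}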

Elasticsearch how to check for a status of a bulk indexing request?

I am bulk indexing into Elasticsearch docs containing country shapes (files here), based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES Node.js client), I would like a way to check the status of the bulk request without having to use enormous timeout values.
What I would like is a status such as active/running, completed, or aborted. I guess that just querying a single doc in the batch would not tell me whether the request has been aborted.
Is this possible?
I'm not sure if this is exactly what you're looking for, but may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET http://localhost:9200/_tasks?group_by=parents | python -m json.tool
Elasticsearch doesn't provide a way to check the status of an ongoing bulk request - documentation reference here.
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.
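
To illustrate the chunking suggestion above, here is a sketch that splits the documents into smaller bulk requests using the Java RestHighLevelClient (the asker uses the Node.js client, but the pattern is the same); the index name and chunk size are assumptions:

import java.util.List;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

class ChunkedBulkIndex {
    // Split the geoshape documents into small bulk requests so each request
    // completes well within the client timeout.
    void indexInChunks(RestHighLevelClient client, List<String> jsonDocs) throws Exception {
        int chunkSize = 200;
        for (int i = 0; i < jsonDocs.size(); i += chunkSize) {
            BulkRequest bulk = new BulkRequest();
            for (String json : jsonDocs.subList(i, Math.min(i + chunkSize, jsonDocs.size()))) {
                bulk.add(new IndexRequest("countries").source(json, XContentType.JSON));
            }
            BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        }
    }
}
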
Just a side note on why your requests might take a lot of time (unless you are simply indexing too many documents in a single bulk run): if you have configured your own precision for geo shapes, make sure you also configure distance_error_pct; otherwise no error is assumed, resulting in documents with a lot of terms that take a long time to index.

Does Hazelcast support bulk set or asynchronous bulk put operation?

I have a use case of inserting a lot of data during a big calculation, and the data really doesn't have to be available in the cluster immediately (so the cluster can synchronize as we go).
Currently I'm inserting batches using the putAll() operation, and it's blocking and taking time.
I've read a blog post about the efficiency of the set() operation, but there is no analogous setAll(). I also saw putAsync() but didn't see a matching putAllAsync() (I'm not interested in the future object).
Am I overlooking something? How can I improve insertion performance?
EDIT: Feature request: https://github.com/hazelcast/hazelcast/issues/5337
I think you're right, they're missing. Could you create a feature request? Maybe you're also interested in helping to implement them via the Hazelcast Incubator?
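
Until a setAll()/putAllAsync() exists, one workaround is to send entries with setAsync() and only wait at the end of each batch. A sketch assuming Hazelcast 3.x, where IMap.setAsync returns an ICompletableFuture (a java.util.concurrent.Future); the map name and value type are illustrative:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

class AsyncBatchInsert {
    // Send every entry with setAsync (no old value is returned, so it is cheaper
    // than putAsync) and block only once at the end of the batch.
    void setAllAsync(HazelcastInstance hz, Map<String, byte[]> batch) throws Exception {
        IMap<String, byte[]> map = hz.getMap("results");
        List<Future<Void>> pending = new ArrayList<>(batch.size());
        for (Map.Entry<String, byte[]> entry : batch.entrySet()) {
            pending.add(map.setAsync(entry.getKey(), entry.getValue()));
        }
        for (Future<Void> future : pending) {
            future.get(); // wait for the whole batch before starting the next one
        }
    }
}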
