GridGain: Write timed out (socket was concurrently closed)

While trying to upload data to Gridgain using GridDataLoader, I'm getting
'Write timed out (socket was concurrently closed).'
I'm trying to load 10 million lines of data from a .csv file on a cluster of 13 nodes (16-core CPUs).
My GridDataLoader is parameterized with a composite key type. While using a primitive data type as the key there was no issue, but after changing it to a composite key I get this error.

My guess is that parsing your CSV and creating the cache entries takes up too much heap space. If the heap is not sized large enough, you are likely suffering from GC pauses: when a full GC kicks in, everything has to pause, and that is why you get the timeout. It may help to break that large CSV into smaller files and load them one by one.
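For illustration, here is a minimal Java sketch of the chunking idea, assuming you read the CSV yourself and hand each bounded chunk to your data loader. The class and method names are hypothetical, and the actual GridDataLoader call is left as a placeholder comment since the exact loader API depends on your GridGain version.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ChunkedCsvLoad {

    // Rows per chunk; tune this against your heap size and GC behaviour.
    private static final int CHUNK_SIZE = 100_000;

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("data.csv"), StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    loadChunk(chunk); // hand one bounded chunk to the data loader
                    chunk.clear();    // let the parsed rows become garbage early
                }
            }
            if (!chunk.isEmpty())
                loadChunk(chunk);
        }
    }

    private static void loadChunk(List<String> rows) {
        // Placeholder: parse each row into your composite key and value here and
        // feed them to the GridDataLoader (e.g. addData), then wait for the loader
        // to flush before starting the next chunk so only one chunk's entries are
        // live on the heap at a time.
    }
}

Keeping only one chunk's worth of entries live at a time keeps heap pressure roughly constant instead of growing with the file size.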

Related

FoundationDB: What is the meaning of FDBException: Transaction is too old to perform reads or be committed

I am trying to execute a getRange command in fdbCli but it fails with
FDBException: Transaction is too old to perform reads or be committed
What is the meaning of this particular exception?
Does it mean my query took more than 5 seconds to complete?
FDB keeps track of transactions started within the last 5 seconds, and the data nodes only keep versions from the last 5 seconds. So if a transaction's read version is older than the oldest version kept by the data nodes, the data nodes have no way to answer the request; that's why FDB throws this exception. The trick to avoid such exceptions is to split one huge, long-running transaction into many small transactions (see the sketch below). I also noticed that FDB performs really well when a transaction takes less than 300 ms.
Firstly - yes, you are correct (your query took more than 5 seconds to complete).
If the read request’s timestamp is older than 5 seconds, the storage server may have already flushed the data from its in-memory multi-version data structure to its on-disk single-version data structure. This means the storage server no longer has the versions older than 5 seconds, so the client receives the error you've mentioned.
NB: You can avoid this problem by using a RecordCursor and passing a continuation to your query.
More on continuations here.
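As a rough illustration of the "many small transactions" advice, here is a Java sketch that reads a large key range in bounded batches, one short transaction per batch, so no single transaction comes near the 5-second limit. The key range, batch size, and API version are assumptions; adjust them to your cluster and bindings (the Record Layer's RecordCursor/continuation mechanism is the higher-level equivalent of the manual cursor below).

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.KeyValue;
import java.util.List;

public class BatchedRangeRead {

    // Rows per transaction; small enough that each transaction stays well under 5 s.
    private static final int BATCH = 1_000;

    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(630); // assumption: 6.3 bindings
        try (Database db = fdb.open()) {
            byte[] rangeStart = new byte[] { 0x00 };        // assumed range start
            byte[] rangeEnd   = new byte[] { (byte) 0xFF }; // assumed range end

            KeySelector end = KeySelector.firstGreaterOrEqual(rangeEnd);
            KeySelector next = KeySelector.firstGreaterOrEqual(rangeStart);
            while (true) {
                final KeySelector begin = next;
                // Each iteration runs in its own transaction with a fresh read version.
                List<KeyValue> batch = db.run(tr ->
                        tr.getRange(begin, end, BATCH).asList().join());
                if (batch.isEmpty())
                    break;
                for (KeyValue kv : batch) {
                    // process kv.getKey() / kv.getValue() here
                }
                // Resume just past the last key we saw, like a continuation.
                next = KeySelector.firstGreaterThan(batch.get(batch.size() - 1).getKey());
            }
        }
    }
}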

ArangoDB - arangoimp on csv files is very slow on large datasets

I am new to arango. I'm trying to import some of my data from Neo4j into arango.
I am trying to add millions of nodes and edges to store playlist data for various people. I have the CSV files from Neo4j. I ran a script to change the format of the node CSV files to have a _key attribute, and the edge files to have _to and _from attributes.
When I tried this on a very small dataset, things worked perfectly and I could see the graph on the UI and perform queries. Bingo!
Now, I am trying to add millions of rows of data (each arangoimp batch imports a CSV with about 100,000 rows). Each batch covers 5 collections (a different CSV file for each).
After about 7-8 batches of such data, the system all of a sudden gets very slow, unresponsive and throws the following errors:
ERROR error message: failed with error: corrupted collection
This just randomly comes up for any batch, even though the format of the data is exactly the same as in the previous batches.
ERROR Could not connect to endpoint 'tcp://127.0.0.1:8529', database: '_system', username: 'root'
FATAL got error from server: HTTP 401 (Unauthorized)
Otherwise it just keeps processing for hours with barely any progress
I'm guessing all of this has to do with the large number of imports. Some posts said that maybe I have too many open file descriptors, but I'm not sure how to handle that.
Another thing I notice is that the biggest of the 5 collections is the one that mostly gets the errors (although the others do too). Do file descriptors remain specific to a certain collection, even across different import statements?
Could someone please help point me in the right direction? I'm not sure how to begin debugging the problem.
Thank you in advance
The problem here is that the server gets overrun in terms of available disk I/O. The situation may benefit from more available RAM.
The system also has to maintain indices while importing, which becomes more expensive as the number of documents in the collections grows.
With ArangoDB 3.4, arangoimp has been improved to maximize throughput without overrunning the server, which should resolve this situation and remove the need to split the import data into chunks.
The CSV format should be prepared as you already do; JSONL is also supported.

Cassandra: Issue with blob creation for large file

We are trying to load a file into a blob column in Cassandra. When we load files of 1-2 MB, it goes through fine. While loading a larger file, say around 50 MB, we get the following error:
Cassandra failure during write query at consistency LOCAL_QUORUM (1 responses were required but only 0 replica responded, 1 failed)
It is a single node development DB. Any hints or support will be appreciated.
50 MB is pretty big for a cell. Although a little out of date, this is still accurate: http://cassandra.apache.org/doc/4.0/faq/#can-large-blob
There is no mechanism for streaming out of cells in Cassandra, so the cell's content needs to be serialized as a single response, in memory. You're probably hitting a limit or a bug somewhere that throws an exception and causes the failed query (check Cassandra's system.log; there may be an exception in there that describes what's occurring in more detail).
If you have a CQL collection or logged batch there are additional lower limits.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
You can try chunking your blobs into parts. I'd actually recommend something like 64 KB chunks, and on the client side, iterate through them and generate a stream (which also prevents loading the whole file into memory on your side).
CREATE TABLE exampleblob (
  blobid text,
  chunkid int,
  data blob,
  PRIMARY KEY (blobid, chunkid));
Then just SELECT * FROM exampleblob WHERE blobid = 'myblob'; and iterate through the results. Inserting gets more complex since you need logic to split up your file, but that can also be done in a streaming fashion and be memory-efficient on your app side (see the sketch below).
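As an illustration only, here is a minimal Java sketch of the chunked insert against the exampleblob table, assuming the DataStax Java driver 4.x (CqlSession) with default connection settings; the keyspace, file name, and blob id are placeholders.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ChunkedBlobUpload {

    private static final int CHUNK_SIZE = 64 * 1024; // 64 KB per cell

    public static void main(String[] args) throws IOException {
        try (CqlSession session = CqlSession.builder()
                     .withKeyspace("mykeyspace") // placeholder keyspace
                     .build();
             InputStream in = Files.newInputStream(Paths.get("bigfile.bin"))) {

            PreparedStatement ps = session.prepare(
                "INSERT INTO exampleblob (blobid, chunkid, data) VALUES (?, ?, ?)");

            byte[] buf = new byte[CHUNK_SIZE];
            int chunkId = 0;
            int read;
            // Read and insert one chunk at a time so the whole file is never in memory.
            while ((read = in.read(buf)) != -1) {
                ByteBuffer data = ByteBuffer.wrap(Arrays.copyOf(buf, read));
                session.execute(ps.bind("myblob", chunkId++, data));
            }
        }
    }
}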
Another alternative is to just upload the blob to S3 or some distributed file store, use a hash of the file as the bucket/filename. In Cassandra just store the filename as a reference to it.

Grails Excel import fails for huge data

I am using grails 2.3.7 and the latest excel-import plugin (1.0.0). My requirement is that I need to copy the contents of an excel sheet completely as it is into the database. My database is mssql server 2012.
I have got the code working for the development version. The code works fine when the number of records is small, maybe up to a few hundred.
But in production the Excel sheet will have as many as 50,000 rows and over 75 columns.
Initially I faced an out-of-memory exception. I increased the heap size to as much as 8 GB, but now the thread keeps running on and on without terminating. No errors are generated.
Please note that this is a once-in-a-while operation and it will be carried out by a person who will ensure that it does not hamper other operations running in parallel, so there is no need to worry about the heavy load of this operation. I can afford to run it.
When there are up to 10,000 records with the same number of columns, the data gets copied in around 5 minutes. With 50,000 rows the time taken should ideally be around 5 times that, roughly 25 minutes, but the code kept running for more than an hour without terminating.
Any idea how to go about this issue? Any help is highly appreciated.
Loading 5 times more data into memory doesn't always take just 5 times longer. My guess is that most of the 8 GB ends up in virtual memory, and swapping to disk is very slow. Try decreasing the heap size, run some memory tests, and try to keep the working set in physical RAM as much as possible.
In my experience, this is a common problem with large batch operations in Grails. I think you have memory leaks that radically slow down the operation as it proceeds.
My solution has been to use an ETL tool such as Pentaho Kettle for the import, or chunk the import into manageable pieces. See this related question:
Insert 10,000,000+ rows in grails
Not technically an answer to your problem, but have you considered just using CSV instead of Excel?
From a user's point of view, saving as a CSV before importing is not a lot of work.
I am loading, validating and saving CSVs with 200,000-300,000 rows without a hitch.
Just make sure you have the logic in a service so it puts a transaction around it.
It's a bit more code to decode the CSV, especially to translate values to the various primitives, but it should be orders of magnitude faster (see the sketch below).
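To make the chunk-and-commit idea concrete, here is a minimal sketch in plain Java/JDBC (rather than Grails/GORM) that reads a CSV and inserts it into SQL Server in bounded batches. The connection URL, table, column names, and the naive comma split are all placeholders; the batching-and-committing pattern is the point.

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvBatchImport {

    private static final int BATCH_SIZE = 1_000; // commit every 1,000 rows

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=mydb;user=sa;password=secret"; // placeholder
        try (Connection con = DriverManager.getConnection(url);
             BufferedReader in = Files.newBufferedReader(
                     Paths.get("import.csv"), StandardCharsets.UTF_8);
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO my_table (col_a, col_b, col_c) VALUES (?, ?, ?)")) {

            con.setAutoCommit(false);
            String line;
            int pending = 0;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",", -1); // naive split; use a real CSV parser for quoted fields
                ps.setString(1, cells[0]);
                ps.setString(2, cells[1]);
                ps.setString(3, cells[2]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch(); // send the batch to the server
                    con.commit();      // keep each transaction (and its memory) bounded
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
                con.commit();
            }
        }
    }
}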

Custom in-memory cache

Imagine there's a web service:
Runs on a cluster of servers (nginx/node.js)
All data is stored remotely
Must respond within 20ms
Data that must be read for a response is split like this:
BatchA
Millions of small objects stored in AWS DynamoDB
Updated randomly at random times
Only consistent reads, can't be cached
BatchB
~2,000 records in SQL
Updated rarely, records up to 1KB
Can be cached for up to 60-90s
We can't read them all at once as we don't know which records to fetch from BatchB until we read from BatchA.
A read from DynamoDB takes up to 10 ms. If we read BatchB from the remote location, it would leave us with no time for calculations, or we would already have timed out.
My current idea is to load all BatchB records into memory of each node (that's only ~2MB). On startup, the system would connect to SQL server and fetch all records and then it would update them every 60 or 90 seconds. The question is what's the best way to do this?
I could simply read them all into a variable (array) in node.js and then use setTimeout to update the array after 60-90 s. But is that the best solution?
Your solution doesn't sound bad. It fits your needs. Go for it.
I suggest keeping two copies of the cache while you are updating it from the remote location. While the 2 MB are being received you only have a partial copy of the data, so hold on to the old cache until the new data has been fully received.
Another approach would be to maintain only one cache set and update it as each record arrives. However, this is more difficult to implement and more error-prone (for example, you must not forget to delete records from the cache if they are no longer found in the remote location). This approach conserves memory, but I don't suppose 2 MB is a big deal.
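The question is about node.js, but the two-copies idea is language-agnostic; here is a rough Java sketch (kept in the same language as the other examples on this page) of building the new snapshot off to the side and publishing it with a single atomic swap on a 60-second schedule. The class name and loadAllFromSql() are hypothetical placeholders.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class BatchBCache {

    // Readers always see a complete snapshot of the ~2,000 records.
    private final AtomicReference<Map<String, byte[]>> snapshot =
            new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        refresh(); // load once on startup so the first requests have data
        scheduler.scheduleAtFixedRate(this::refresh, 60, 60, TimeUnit.SECONDS);
    }

    public byte[] get(String key) {
        return snapshot.get().get(key); // lock-free read of the current snapshot
    }

    private void refresh() {
        try {
            Map<String, byte[]> fresh = loadAllFromSql(); // build the new copy off to the side
            snapshot.set(fresh);                          // publish atomically; the old copy is GC'd
        } catch (Exception e) {
            // on failure, keep serving the old snapshot
        }
    }

    private Map<String, byte[]> loadAllFromSql() {
        // Placeholder for the real SQL fetch of all BatchB records.
        return new HashMap<>();
    }
}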
