ArangoDB - arangoimp on csv files is very slow on large datasets

I am new to ArangoDB. I'm trying to import some of my data from Neo4j into ArangoDB.
I am trying to add millions of nodes and edges to store playlist data for various people. I have the csv files exported from Neo4j. I ran a script to change the format of the node csv files so that each row has a _key attribute, and of the edge csv files so that each row has _to and _from attributes.
When I tried this on a very small dataset, things worked perfectly and I could see the graph on the UI and perform queries. Bingo!
Now, I am trying to add millions of rows of data (each arangoimp batch imports a csv with about 100,000 rows). Each batch covers 5 collections (a different csv file for each).
After about 7-8 batches of such data, the system all of a sudden gets very slow, unresponsive and throws the following errors:
ERROR error message: failed with error: corrupted collection
This comes up seemingly at random for any batch, even though the format of the data is exactly the same as in the previous batches:
ERROR Could not connect to endpoint 'tcp://127.0.0.1:8529', database: '_system', username: 'root'
FATAL got error from server: HTTP 401 (Unauthorized)'
Otherwise it just keeps processing for hours with barely any progress
I'm guessing all of this has to do with the large number of imports. Some posts suggested that I might have too many open file descriptors, but I'm not sure how to handle that.
Another thing I notice is that the biggest of the 5 collections is the one that mostly gets the errors (although the others do too). Do the file descriptors remain specific to a certain collection, even across different import statements?
Could someone please help point me in the right direction? I'm not sure how to begin debugging the problem.
Thank you in advance

The problem here is that the server must not be overrun in terms of available disk I/O. The situation may also benefit from more available RAM.
The system also has to maintain indices while importing, which gets more expensive as the number of documents in the collections grows.
With ArangoDB 3.4 we have improved arangoimp to maximize throughput without maxing out the server, which should resolve this situation and remove the need to split the import data into chunks.
However, as it is now, the CSV input still has to be prepared (with the _key, _from and _to attributes); JSONL is also supported as an input format.
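Until then, one workaround is to automate the chunked import the question already describes. A minimal sketch, assuming an input file nodes.csv that already carries a _key column, a target collection named nodes, and the documented arangoimp CLI flags; adjust endpoint, credentials and file names to your setup:

import csv
import subprocess

SOURCE = "nodes.csv"        # assumed: CSV already prepared with a _key column
COLLECTION = "nodes"        # assumed target collection; one run per collection
CHUNK_ROWS = 100_000

def import_chunk(rows, header, part):
    """Write one chunk file and import it with arangoimp."""
    chunk_file = f"nodes_part{part}.csv"
    with open(chunk_file, "w", newline="") as out:
        csv.writer(out).writerows([header] + rows)
    # Flags per the arangoimp documentation; add --server.password etc. as needed.
    subprocess.run(
        ["arangoimp", "--file", chunk_file, "--type", "csv",
         "--collection", COLLECTION,
         "--server.endpoint", "tcp://127.0.0.1:8529",
         "--server.username", "root"],
        check=True,
    )

with open(SOURCE, newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    rows, part = [], 0
    for row in reader:
        rows.append(row)
        if len(rows) == CHUNK_ROWS:
            part += 1
            import_chunk(rows, header, part)
            rows = []
    if rows:                          # flush the final partial chunk
        import_chunk(rows, header, part + 1)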

Related

Possible ways of comparing large records from one table on a database and another table on another database

I am looking into ways of comparing records from the same table but on different databases. I just need to compare and find the missing records.
I tried out a few methods.
Loading the records into a pandas data frame with read_sql. But it takes more time and memory to complete the load, and if the number of records is large I get a memory error.
I also tried setting up a standalone Spark cluster and running the comparison there, but it throws a Java heap space error, and tuning the conf does not help either.
Please let me know if there are other ways to handle this huge record comparison.
-- update
Is there a tool readily available for cross-data-source comparison?
If your data size is huge you can use cloud services to run your Spark job and get the results. Here you can use AWS Glue, which is serverless and charged as you go.
Or, if your data is not considerably large and this is a one-time job, you can use Google Colab, which is free, and run your comparison there.
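If the goal is only to find missing records by key, a memory-friendlier variant of the pandas approach from the question is to stream just the key column from both databases in chunks and diff the key sets. A rough sketch; the connection strings, table name and key column below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

src = create_engine("postgresql://user:pass@host_a/db_a")   # placeholder DSNs
dst = create_engine("postgresql://user:pass@host_b/db_b")

def load_keys(engine, table, key="id", chunksize=50_000):
    """Stream only the key column in chunks and collect it into a set."""
    keys = set()
    for chunk in pd.read_sql(f"SELECT {key} FROM {table}", engine, chunksize=chunksize):
        keys.update(chunk[key].tolist())
    return keys

keys_a = load_keys(src, "records")
keys_b = load_keys(dst, "records")
missing_in_b = keys_a - keys_b        # present in A but absent in B
print(len(missing_in_b), "records missing")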

Arangodb - slow cursors

Hi, I've got a simple collection with 40k records in it. It's just an import of a csv (c. 4Mb), so it has a consistent object per document, and it is for an Open Data portal.
I need to be able to offer a full download of the data as well as the capabilities of AQL for querying, grouping, aggregating etc.
If I set batchSize to the full dataset then it takes around 50 seconds to return and is unsurprisingly about 12Mb, because the attribute names are repeated in every document.
eg
{"query":"for x in dataset return x","batchSize":50000}
I've tried things like caching and balancing between a larger batchSize and using the cursor to build the whole dataset, but I can't get the response time down very much.
Today I came across the attributes and values functions and created this AQL statement:
{"query":"return union(
for x in dataset limit 1 return attributes(x,true),
for x in dataset return values(x,true))","batchSize":50000}
It will mean I have to unparse the object, but I use PapaParse so that should be no issue (not proved yet).
Is this the best / only way to have an option to output the full csv and still have a response that performs well?
I am trying to avoid having to store the data multiple times, e.g. once as the raw csv and then again in a collection. I guess there may be a dataset that is too big to cope with this approach, but this is one of our bigger datasets.
Thanks
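One alternative worth sketching (not something from the question): stream the cursor in moderate batches and build the CSV on the client as results arrive, instead of requesting one 40k-document batch. This assumes the python-arango driver, a collection named dataset and placeholder credentials:

import csv
import sys
from arango import ArangoClient

client = ArangoClient(hosts="http://127.0.0.1:8529")
db = client.db("_system", username="root", password="passwd")   # placeholder creds

cursor = db.aql.execute(
    "FOR x IN dataset RETURN x",
    batch_size=5000,   # follow-up batches are fetched as the cursor is consumed
    stream=True,
)

writer = None
for doc in cursor:
    doc.pop("_rev", None)            # drop internal attributes if not wanted
    if writer is None:
        writer = csv.DictWriter(sys.stdout, fieldnames=sorted(doc), extrasaction="ignore")
        writer.writeheader()
    writer.writerow(doc)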

copy command row size limit in cassandra

Could anyone tell me the maximum size (number of rows or file size) of a csv file we can load efficiently into Cassandra using the COPY command? Is there a limit for it? If so, is it a good idea to break the file down into multiple smaller files and load them, or do we have a better option? Many thanks.
I've run into this issue before... At least for me there was no clear statement in any DataStax or Apache documentation of the max size. Basically, it may just be limited by your pc/server/cluster resources (e.g. cpu and memory).
However, in an article by jgong found here it is stated that you can import up to 10MB. For me it was something around 8.5MB. In the docs for Cassandra 1.2 here it is stated that you can import a few million rows and that you should use the bulk loader for anything heavier.
All in all, I do suggest importing via multiple csv files (just don't make them too small, or you'll be opening and closing files constantly) so that you can keep a handle on the data being imported and find errors more easily. It can happen that after waiting an hour for a file to load it fails and you have to start over, whereas with multiple files you don't need to start over on the ones that have already been imported successfully. Not to mention duplicate key errors.
Check out CASSANDRA-9303 and CASSANDRA-9302,
and check out Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
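A rough sketch of the multiple-file workflow suggested above, with a simple "already imported" log so a failed run can resume instead of starting from scratch. The keyspace, table, chunk directory and COPY options are placeholders:

import pathlib
import subprocess

DONE = pathlib.Path("imported.log")
done = set(DONE.read_text().splitlines()) if DONE.exists() else set()

for chunk in sorted(pathlib.Path("chunks").glob("*.csv")):
    if chunk.name in done:
        continue                      # already imported in a previous run
    subprocess.run(
        ["cqlsh", "-e", f"COPY myks.mytable FROM '{chunk}' WITH HEADER = TRUE"],
        check=True,
    )
    with DONE.open("a") as log:       # record success only after COPY finishes
        log.write(chunk.name + "\n")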

Gridgain: Write timed out (socket was concurrently closed)

While trying to upload data to Gridgain using GridDataLoader, I'm getting
'Write timed out (socket was concurrently closed).'
I'm trying to load 10 million lines of data using a .csv file on a cluster of 13 nodes (16-core CPUs).
My GridDataLoader is parameterized with a composite key type. While using a primitive data type as the key there was no issue, but when I changed it to a composite key this error started appearing.
I guess this is because it takes up too much space on the heap when it tries to parse your csv and create the entries. As a result, if you don't configure your heap size large enough, you are likely suffering from GC pauses: when GC kicks in, everything has to pause, and that is why you get this timeout error. I think it may help if you break that large csv into smaller files and load them one by one.

Grails Excel import fails for huge data

I am using Grails 2.3.7 and the latest excel-import plugin (1.0.0). My requirement is that I need to copy the contents of an excel sheet completely, as it is, into the database. My database is MS SQL Server 2012.
I have got the code working for the development version. The code works fine when the number of records is small, or maybe up to a few hundred.
But in production the excel sheet will have as many as 50,000 rows and over 75 columns.
Initially I faced an out of memory exception. I increased the heap size to as much as 8GB, but now the thread keeps running on and on without terminating. No errors are generated.
Please note that this is a once-in-a-while operation and it will be carried out by a person who will ensure that it does not hamper other operations running in parallel. So no need to worry about the huge load of this operation; I can afford to run it.
When the records number up to 10,000 with the same number of columns, the data gets copied in around 5 minutes. If I now have 50,000 rows, the time taken should ideally be around 5 times more, which is around 25 minutes. But the code kept running for more than an hour without terminating.
Any idea how to go about this issue? Any help is highly appreciated.
If you load 5 times more data into memory, it doesn't always take just 5 times longer. I guess that most of the 8GB ends up in virtual memory, and virtual memory is very slow compared to physical RAM. Try to decrease the memory, run some memory tests and try to use the RAM as much as possible.
In my experience, this is a common problem with large batch operations in Grails. I think you have memory leaks that radically slow down the operation as it proceeds.
My solution has been to use an ETL tool such as Pentaho Kettle for the import, or to chunk the import into manageable pieces. See this related question:
Insert 10,000,000+ rows in grails
Not technically an answer to your problem, but have you considered just using CSV instead of Excel?
From a user's point of view, saving as a CSV before importing is not a lot of work.
I am loading, validating and saving CSVs with 200-300,000 rows without a hitch.
Just make sure you have the logic in a service so it puts a transaction around it.
A bit more code to decode the csv maybe, especially to translate to various primitives, but it should be orders of magnitude faster.
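As a rough illustration of the "chunk the import into manageable pieces" and CSV suggestions above, here is a Python stand-in (not Grails/GORM; in a Grails service the equivalent trick is to flush and clear the Hibernate session after each batch inside a transaction). The table name, columns, DSN and driver are placeholders:

import csv
import pyodbc   # assumed driver for MS SQL Server

BATCH = 1_000
conn = pyodbc.connect("DSN=mssql2012;UID=user;PWD=secret")   # placeholder DSN
cur = conn.cursor()

with open("sheet_export.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    placeholders = ",".join("?" * len(header))
    sql = f"INSERT INTO imported_sheet ({','.join(header)}) VALUES ({placeholders})"
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == BATCH:
            cur.executemany(sql, batch)   # one round trip per batch
            conn.commit()                 # keeps memory and transaction log flat
            batch = []
    if batch:
        cur.executemany(sql, batch)
        conn.commit()
conn.close()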
