Cannot find efficient method for dumping leveldb to flat file(s) - node.js

I'm using LevelDB as part of a local process that, when all is done, has ~10-100 million JSON entries.
I need to get these into a portable format, ideally as one or more csv (or even line delimited json) files to import into a separate mongodb system.
I did a quick test in node.js to stream the db contents to a file (using node-levelup), and on my machine it took about 18.5 minutes for 10 million pairs. That seems pretty slow.
Looking for suggestions on quicker dump/export from leveldb.
I've considered using mongodb as the store for the local processing, because mongoexport is much quicker, but there is a lot more overhead in setup as I'd need several shards to get more speed on my writes.

The fastest way to retrieve all entries in LevelDB is to use its iterator, which node-levelup most likely already does under the hood.
Since you will still need a tool to parse the exported file anyway, I suggest you simply copy LevelDB's data directory and treat that as the export. You can open it and iterate over it in Python, Ruby, or any other language that has a LevelDB wrapper.
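For reference, a minimal sketch of the streaming dump the question describes, assuming the `level` (levelup) package and an existing database directory at `./db`; the output path and the NDJSON line shape are placeholders:

```js
// Stream every key/value pair from LevelDB into a line-delimited JSON file.
const level = require('level');
const fs = require('fs');

const db = level('./db');                          // hypothetical path to the LevelDB data dir
const out = fs.createWriteStream('./dump.ndjson'); // placeholder output file

db.createReadStream({ keys: true, values: true })
  .on('data', ({ key, value }) => {
    out.write(JSON.stringify({ key, value }) + '\n');
  })
  .on('error', err => console.error(err))
  .on('end', () => out.end());
```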

Related

Best way: how to export dynamodb table to a csv and store it in s3

We have one Lambda that updates a DynamoDB table after some operation.
Now we want to export the whole DynamoDB table into an S3 bucket in CSV format.
Is there an efficient way to do this?
I have also found the following way of exporting directly from DynamoDB to S3:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
But that approach stores the data in JSON format, and I can't find a way to do this efficiently for 10 GB of data.
As far as I can tell you have three "simple" options.
Option #1: Program that does a Scan
It is fairly simple to write a program that does a (parallel) scan of your table and then outputs the result in a CSV. A no bells and whistles version of this is about 100-150 lines of code in Python or Go.
Advantages:
Easy to develop
Can be run easily multiple times from local machines or CI/CD pipelines or whatever.
Disadvantages:
It will cost you a bit of money. Scanning the whole table will use up read units, and depending on the amount of data you are reading, this might get costly fast.
Depending on the amount of data this can take a while.
Note: If you want to run this in a Lambda, remember that Lambdas can run for a maximum of 15 minutes. So once you have more data than can be processed within those 15 minutes, you will probably need to switch to Step Functions.
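A hedged sketch of what such a program could look like in Node.js with the AWS SDK for JavaScript v3 (a sequential scan for brevity; a parallel scan would add Segment/TotalSegments to the ScanCommand input). The table name, output path, and the naive unquoted CSV formatting are placeholders:

```js
const { DynamoDBClient, ScanCommand } = require('@aws-sdk/client-dynamodb');
const { unmarshall } = require('@aws-sdk/util-dynamodb');
const fs = require('fs');

async function exportTableToCsv(tableName, outPath) {
  const client = new DynamoDBClient({});
  const out = fs.createWriteStream(outPath);
  let lastKey;
  do {
    // Each Scan page returns up to 1 MB of items plus a LastEvaluatedKey for the next page.
    const page = await client.send(new ScanCommand({
      TableName: tableName,
      ExclusiveStartKey: lastKey,
    }));
    for (const item of page.Items ?? []) {
      const row = unmarshall(item);                   // DynamoDB attribute map -> plain object
      out.write(Object.values(row).join(',') + '\n'); // naive CSV: no quoting or escaping
    }
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  out.end();
}

exportTableToCsv('MY_TABLE', 'export.csv').catch(console.error);
```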
Option #2: Process an S3 backup
DynamoDB allows you to create backups of your table to S3 (as the article you linked describes). Those backups will be either plain JSON or a JSON-like AWS format. You can then write a program that converts those JSON files to CSV.
Advantages:
(A lot) cheaper than a scan
Disadvantages:
Requires more "plumbing", because you need to first create the backup, then download it from S3 to wherever you want to process it, etc.
Probably will take longer than option #1
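As a rough illustration of option #2's conversion step, a sketch that assumes the export is line-delimited DynamoDB JSON with each record wrapped in an "Item" key (the usual shape of the S3 export); decompression of gzipped parts is omitted, and file paths and the naive CSV formatting are placeholders:

```js
const fs = require('fs');
const readline = require('readline');
const { unmarshall } = require('@aws-sdk/util-dynamodb');

async function exportJsonToCsv(inPath, outPath) {
  const out = fs.createWriteStream(outPath);
  const rl = readline.createInterface({ input: fs.createReadStream(inPath) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const { Item } = JSON.parse(line);   // each line looks like {"Item": {...attribute map...}}
    const row = unmarshall(Item);
    out.write(Object.values(row).join(',') + '\n');
  }
  out.end();
}

exportJsonToCsv('export-part-0.json', 'export.csv').catch(console.error);
```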

What is the best way to process a JSON file larger than 7 MB in Node

I have a large JSON file (over 7 MB) and I want to iterate over it to find matching data.
Is it a good idea to read the entire file into memory and keep it around for the next call, or are there other approaches with better performance and lower memory usage?
Data stored in the JSON format is meant to be read in all at once. That's how the format works. It's not a format that you would generally incrementally search without first reading all the data in. While there are some modules that support streaming it in and somewhat examining it incrementally, that is not what the format was intended for, nor what it is best suited for.
So, you really have several questions to ask yourself:
Can you read the whole block of data into memory at once and parse it into Javascript?
Is the amount of memory it takes to do that OK in your environment?
Is the time to do that OK for your application?
Can you cache it in memory for a while so you can more efficiently access it the next time you need something from it?
Or, should this data really be in a database that supports efficient searches and efficient modifications with far lower memory usage and much better performance?
If you're OK with the first four questions, then just read it in, parse it and keep the resulting Javascript object in memory. If you're not OK with any of the first four questions, then you probably should put the data into a format that can more efficiently be queried without loading it all into memory (e.g. a simple database).
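If the answer to the first four questions is yes, the "read once and cache" approach can look something like this minimal sketch (the file name, the array shape, and the `id` field are assumptions for illustration):

```js
const fs = require('fs');

let cache = null;

function getData() {
  if (!cache) {
    // Paid once per process: ~7 MB reads and parses quickly on modern hardware.
    cache = JSON.parse(fs.readFileSync('./data.json', 'utf8'));
  }
  return cache;
}

// Later calls reuse the already-parsed object instead of re-reading the file.
const match = getData().find(item => item.id === 'some-id');
console.log(match);
```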

Node Streams to Mysql

I have to parse large CSVs (approx. 1 GB), map the headers to the database columns, and format every row, e.g. the CSV has "Gender": "Male" but my database only accepts enum('M', 'F', 'U').
Since the files are so large, I have to use Node streams to transform the file and then use LOAD DATA INFILE to upload it all at once.
I would like granular control over the inserts, which LOAD DATA INFILE doesn't provide. If a single line has incorrect data, the whole upload fails. I am currently using mysqljs, which doesn't provide an API to check if the pool has reached queueLimit, and therefore I can't pause the stream reliably.
I am wondering if I can use Apache Kafka or Spark to stream the inserts so that they are added to the database sequentially. I have skimmed through the docs and read some tutorials, but none of them show how to connect them to a database; they are mostly consumer/producer examples.
I know there are multiple ways of solving this problem, but I am very much interested in a way to seamlessly integrate streams with databases. If streams can work with I/O, why not with databases? I am pretty sure big companies don't use LOAD DATA INFILE or repeatedly append chunks of data to an array and insert it into the database.
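For context on the backpressure problem described above, one hedged sketch of streaming a CSV into MySQL row by row: it swaps in the `csv-parse` and `mysql2` packages (not mysqljs) and relies on async iteration, which pauses the stream automatically while each insert is awaited; the table, columns, and connection details are placeholders:

```js
const fs = require('fs');
const { parse } = require('csv-parse');
const mysql = require('mysql2/promise');

const GENDER_MAP = { Male: 'M', Female: 'F' };

async function loadCsv(path) {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'test' });
  const parser = fs.createReadStream(path).pipe(parse({ columns: true }));

  for await (const record of parser) {         // backpressure: the stream waits for each await
    const gender = GENDER_MAP[record.Gender] ?? 'U';
    await pool.execute(
      'INSERT INTO people (name, gender) VALUES (?, ?)',
      [record.Name, gender],
    );
  }
  await pool.end();
}

loadCsv('./big.csv').catch(console.error);
```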

How does sqlite3 edit a big file?

Imagine a huge file that should be edited by my program. To speed up reads I use mmap() and then only read the parts I'm viewing. However, if I want to add a line in the middle of the file, what's the best approach for that?
Is the only way to insert the line and then shift the rest of the file? That sounds expensive.
So my question is basically: What's the most efficient way of adding data in the middle of a huge file?
This question was previously asked here:
How to edit a big file
where the answer suggests using sqlite3 instead of a plain file. That makes me curious: how does sqlite3 solve this problem?
SQLite is a relational database. Its primary storage structures are B-tree tables and B-tree indices. B-trees are designed to be edited in place even as records grow. In addition, SQLite uses the .journal file to recover from crashes while saving.
B-trees pay only O(log N) lookup time for any record by its primary key or any indexed column (this works out much faster than keeping records sorted in a flat file, because the log base is huge). Because B-trees use block pointers almost everywhere, the middle of the ordered list can be updated relatively painlessly.
As RichN points out, SQLite builds up wasted space in the file. Run VACUUM periodically to free it.
Incidentally, I have written B-trees by hand. They are a pain to write, but worth it if you must for some reason.
The contents of an SQLite database file are records plus the data structures used to access those records. SQLite keeps track of the used portions of the file along with the unused portions (made available when records are deleted). When you add a new record and it fits in an unused segment, that becomes its location. Otherwise it is appended to the file. Any indices are updated to point to the new data. Updating the indices may append further index records. SQLite (and database managers in general) don't move any existing content when inserting new records.
Note that, over time, the contents become scattered across the disk. Sequential records won't be located near each other, which could affect the performance of some queries.
The SQLite VACUUM command can remove unused space in the file, as well as fix locality problems in the data. See VACUUM Command
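To make the in-place editing concrete, a small illustrative sketch using the `better-sqlite3` package for Node (an assumption; the table and file names are placeholders): inserting a row whose key falls "in the middle" does not rewrite the file, and VACUUM compacts the free space left by deletes.

```js
const Database = require('better-sqlite3');
const db = new Database('example.db');

db.exec('CREATE TABLE IF NOT EXISTS lines (id INTEGER PRIMARY KEY, body TEXT)');
const insert = db.prepare('INSERT INTO lines (id, body) VALUES (?, ?)');
insert.run(10, 'first');
insert.run(30, 'third');

// "Insert in the middle": the B-tree slots id 20 between 10 and 30 in place,
// without shifting the rest of the file.
insert.run(20, 'second');

db.prepare('DELETE FROM lines WHERE id = ?').run(10);
db.exec('VACUUM');   // reclaim the space freed by the delete
```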

Best way to copy 20Gb csv file to cassandra

I have a huge 20 GB CSV file to copy into Cassandra, and of course I need to handle errors (if the server or the transfer/load application crashes).
I need to be able to restart the processing (on the same node or another one) and continue the transfer without starting the CSV file from the beginning.
What is the best and easiest way to do that?
Using the cqlsh COPY command? Using Flume or Sqoop? Or a native Java application, or Spark...?
Thanks a lot.
If it was me, I would split the file.
I would pick a preferred way to load any csv data in, ignoring the issues of huge file size and error handling. For example, I would use a python script and the native driver and test it with a few lines of csv to see that it can insert from a tiny csv file with real data.
Then I would write a script to split the file into manageable sized chunks, however you define it. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully as found in the log file.
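A hedged sketch of that chunk-and-resume loop in Node.js (the chunk directory, progress file, keyspace and table names are placeholders, and the chunks are assumed to have been produced beforehand, e.g. with `split`):

```js
const fs = require('fs');
const { execFileSync } = require('child_process');

const CHUNK_DIR = './chunks';          // e.g. produced by: split -l 1000000 big.csv chunks/part-
const PROGRESS_FILE = './progress.log';

// Chunks already loaded before a crash, one name per line in the progress log.
const done = new Set(
  fs.existsSync(PROGRESS_FILE)
    ? fs.readFileSync(PROGRESS_FILE, 'utf8').split('\n').filter(Boolean)
    : []
);

for (const chunk of fs.readdirSync(CHUNK_DIR).sort()) {
  if (done.has(chunk)) continue;       // skip chunks that loaded successfully last time
  execFileSync('cqlsh', ['-e',
    `COPY my_ks.my_table FROM '${CHUNK_DIR}/${chunk}' WITH HEADER = FALSE`,
  ], { stdio: 'inherit' });
  fs.appendFileSync(PROGRESS_FILE, chunk + '\n');
}
```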
Here are two considerations that I would try first, since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess' cassandra-loader, which is an efficient way of bulk loading to and from CSV files.
I think cqlsh COPY doesn't handle the case of an application crash, so why not combine both of the solutions above: split the file into several manageable chunks and use the cqlsh COPY command to import each one?
