attach existing in-memory sqlite3 databases using python3 - python-3.x

I have many sqlite3 databases (actual files) I wish to merge using a binary merge algorithm.
To do this, I use the Python 3 sqlite3 API to load them into memory (to avoid I/O overhead) and then merge them in pairs (binary merge).
However, after creating, for instance, two in-memory databases:
conn_mem1 = connect('file:memdb1?mode=memory&cache=shared', uri=True)
conn_mem2 = connect('file:memdb2?mode=memory&cache=shared', uri=True)
How can I attach one to the other? I tried
cur_mem1 = conn_mem1.cursor()
cur_mem1.execute("ATTACH DATABASE 'file:memdb2?mode=memory&cache=shared' AS memdb2")
But that just seems to attach an empty database, and also creates an empty file with name
memdb2?mode=memory&cache=shared
on disk in the current path.
How can I attach an existing in-memory database to another in-memory database?
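
For reference, here is a minimal sketch of the approach that is usually suggested: keep both shared-cache connections open and issue the ATTACH from a connection that was itself opened with uri=True, so that SQLite interprets the ATTACH filename as a URI rather than as a literal file name. The table name and sample rows below are made up for illustration.

import sqlite3

# Both connections must stay open; a shared in-memory database is freed
# as soon as its last connection closes.
conn_mem1 = sqlite3.connect('file:memdb1?mode=memory&cache=shared', uri=True)
conn_mem2 = sqlite3.connect('file:memdb2?mode=memory&cache=shared', uri=True)

# Hypothetical content for memdb2, just to show the ATTACH sees real data.
conn_mem2.execute('CREATE TABLE t (x INTEGER)')
conn_mem2.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])
conn_mem2.commit()

# Because conn_mem1 was opened with uri=True, the ATTACH filename is also
# interpreted as a URI, so this attaches the shared in-memory database
# instead of creating a file on disk with that literal name.
conn_mem1.execute(
    "ATTACH DATABASE 'file:memdb2?mode=memory&cache=shared' AS memdb2")
print(conn_mem1.execute('SELECT COUNT(*) FROM memdb2.t').fetchone())  # (3,)

If the ATTACH still ends up creating a file on disk, that would suggest the underlying SQLite build is treating the string as a plain filename; in that case Connection.iterdump() (or Connection.backup() on Python 3.7+) is a workable fallback for copying a database's contents across connections.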

Related

How to create a PDF from a huge MongoDB dataset (about 10 million rows)

I want to create a PDF from a huge MongoDB dataset (about 10 million rows).
There is no specific data format; you can assume an employee database.
I am using the MEAN stack (open to new tech if applicable).
Approaches tried:
Using a Node.js library like pdfkit to convert the MongoDB result (an array of objects) to PDF by looping over the result (this causes a heap out of memory error and is very slow).
Creating a temporary collection -> mongoexport to CSV -> CSV to HTML using awk -> HTML to PDF using the wkhtmltopdf tool (this is still very slow).
After I run the mongo query, I cannot store the result in a variable because it causes a heap out of memory error, so I cannot do any further processing on the data.
I can query with limit and skip to get the data in chunks and create HTML and then a PDF from each chunk, but that seems like a very slow process.
A possible approach, I think, could be to create small PDFs and then merge them together, or to use streams.
What is the most efficient way to create a PDF from such a huge dataset?
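
Not an answer to the MEAN-specific part, but a minimal sketch of the chunked/streamed idea in Python with pymongo: stream the cursor in batches and write each chunk to its own CSV, which can then feed the existing awk/wkhtmltopdf step. The connection string, database, collection, and field names are placeholders.

import csv
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')   # placeholder connection string
coll = client['hr']['employees']                    # hypothetical db/collection
fields = ['_id', 'name', 'department', 'salary']    # assumed fields
CHUNK = 100_000                                     # rows per output file

# The cursor streams documents in batches; nothing close to the full
# result set is ever held in memory, and no skip/limit is needed.
cursor = coll.find({}, projection=fields).batch_size(10_000)
writer, out, part, rows = None, None, 0, 0
for doc in cursor:
    if rows % CHUNK == 0:
        if out:
            out.close()
        out = open(f'employees_{part:04d}.csv', 'w', newline='')
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        part += 1
    writer.writerow({k: doc.get(k, '') for k in fields})
    rows += 1
if out:
    out.close()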

How to create an hbase sequencefile key in Spark for load to Bigtable?

I want to be able to easily create test data files that I can save and re-load into a dev Bigtable instance at will, and pass to other members of my team so they can do the same. The suggested way of loading data into Bigtable with Dataflow seems ridiculously heavy-weight (anyone loading a new type of data, even just playing around with Bigtable for the first time rather than for production purposes, would need to know Apache Beam, Dataflow, Java, and Maven, which is potentially going to limit Bigtable adoption for my team), and my data isn't already in HBase, so I can't just export a sequencefile.
However, per this document, it seems like the sequencefile key for HBase should be constructible in regular Java/Scala/Python code:
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
It just doesn't go into enough detail for me to actually do it. What delimiters exist between the different parts of the key? (This is my main question).
From there, Spark at least has a method to write a sequencefile so I think I should be able to create the files I want as long as I can construct the keys.
I'm aware that there's an alternative (described in this answer, whose example link is broken) that would involve writing a script to spin up a Dataproc cluster, push a TSV file there, and use HBase ImportTsv to push the data to Bigtable. This also seems overly heavy-weight to me but maybe I'm just not used to the cloud world yet.
The sequence file solution is meant for situations where large sets of data need to be imported into and/or exported from Cloud Bigtable. If your file is small enough, then create a script that creates a table, reads from a file, and uses a BufferedMutator (or batch writes in your favorite language) to write to Cloud Bigtable.
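
As a rough illustration of that batch-write route (not the Dataflow/sequencefile path), here is a sketch using the google-cloud-bigtable Python client; the project, instance, table, column family, and input file are all placeholders, and the CSV layout (row key, qualifier, value) is assumed.

import csv
from google.cloud import bigtable

# Placeholders: the project, instance, and table are assumed to already
# exist, with a column family named 'cf1'.
client = bigtable.Client(project='my-project')
table = client.instance('dev-instance').table('test-data')

# Batch the writes, roughly what HBase's BufferedMutator does.
batcher = table.mutations_batcher(flush_count=1000)
with open('test-data.csv', newline='') as f:   # assumed input: row_key,qualifier,value
    for row_key, qualifier, value in csv.reader(f):
        row = table.direct_row(row_key.encode())
        row.set_cell('cf1', qualifier.encode(), value.encode())
        batcher.mutate(row)
batcher.flush()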

MongoDB backup -> tar -> gz -> gpg

I have a MongoDB server and I use the mongodump command to create backups. I run mongodump --out ./mongo-backup, then tar -czf ./mongo-backup.tar.gz ./mongo-backup, then gpg --encrypt ./mongo-backup.tar.gz > ./mongo-backup.tar.gz.gpg, and send the resulting file to the backup server.
According to the show dbs command my MongoDB database is 20 GB, the mongodump backup directory is only 3.8 GB, the gzipped tarball is only 118 MB, and my gpg file is 119 MB.
How is it possible to reduce a 20 GB database to a 119 MB file? Is it fault tolerant?
I tried this: I created a new server (a clone of production), enabled the firewall to ensure that no one could connect, and ran this backup procedure. I then created a fresh new server, imported the data, and there are some differences:
From the mongo shell I ran the same commands on both servers (use db1; db.db1_collection1.count(); and use db2; db.db2_collection1.count();) and the results are:
807843 vs. 807831 ( db1.collection1 source server vs. db1.collection1 restored server )
3044401 vs. 3044284 ( db2.collection1 source server vs. db2.collection1 restored server )
If you have validated the counts and size of documents/collections in your restored data, this scenario is possible although atypical in the ratios described.
My MongoDB database is 20 GB according to the show dbs command
This shows you the size of files on disk, including preallocated space that exists from deletion of previous data. Preallocated space is available for reuse, but some MongoDB storage engines are more efficient than others.
The mongodump backup directory is only 3.8 GB
The mongodump tool (as at v3.2.11, which you mention using) exports an uncompressed copy of your data unless you specify the --gzip option. This total should represent your actual data size but does not include storage used for indexes. The index definitions are exported by mongodump and the indexes will be rebuilt when the dump is reloaded via mongorestore.
With WiredTiger the uncompressed mongodump output is typically larger than the size of files on disk, which are compressed by default. For future backups I would consider using mongodump's built-in archiving and compression options (--archive and --gzip) to save yourself an extra step.
Since your mongodump output is significantly smaller than the storage size, your data files are either highly fragmented or there is some other data that you have not accounted for such as indexes or data in the local database. For example, if you have previously initialised this server as a replica set member the local database would contain a large preallocated replication oplog which will not be exported by mongodump.
You can potentially reclaim excessive unused space by running the compact command for a WiredTiger collection. However, there is an important caveat: running compact on a collection will block operations for the database being operated on so this should only be used during scheduled maintenance periods.
The gzipped tarball is only 118 MB and my gpg file is only 119 MB
Since mongodump output is uncompressed by default, compressing can make a significant difference depending on your data. However, 3.8GB to 119MB seems unreasonably good unless there is something special about your data (large number of small collections? repetitive data?). I would double check that your restored data matches the original in terms of collection counts, document counts, data size, and indexes.
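
One way to do that double check, sketched with pymongo under the assumption that both servers are reachable at the placeholder hosts below, is to compare document counts collection by collection:

from pymongo import MongoClient

source = MongoClient('mongodb://source-host:27017')      # placeholder hosts
restored = MongoClient('mongodb://restored-host:27017')

for db_name in source.list_database_names():
    if db_name in ('admin', 'local', 'config'):           # skip internal databases
        continue
    for coll_name in source[db_name].list_collection_names():
        src = source[db_name][coll_name].count_documents({})
        dst = restored[db_name][coll_name].count_documents({})
        status = 'OK' if src == dst else 'MISMATCH'
        print(f'{db_name}.{coll_name}: {src} vs {dst} {status}')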

Reading and writing to multiple files concurrently?

I've got a software application that can have over a million objects in memory, using all the cores of the machine. Each object has a unique ID and its own internal StateObject that needs to be persisted temporarily somewhere; any change to the StateObject overwrites the previous StateObject with the updated data.
I was wondering whether I should be reading and writing the state to a database, or whether I should just create text files locally on the machine, each named with the object's uniqueId, with each object reading and writing a JSON string of its StateObject to its file.
Which option will yield better performance: a database, or just writing to the local file system? Should I write to multiple files keyed by uniqueId, or to one file with multiple rows where the first column is the unique ID? After doing some research I found that parallel reads and writes are slow on an HDD but fast on an SSD, so I guess I have to use an SSD.
Update
The reason for writing to disk is that there are too many objects (> 1 million), and keeping every object's StateObject in memory would be expensive, so I would rather persist an object's internal state (StateObject) to disk when it is not being used. A guarantee of the writes is very important for processing the next request for that object: if a write fails for some reason, the StateObject has to be rebuilt from remote APIs before the next request can be processed, which is more time consuming.
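
If the file-per-object route is chosen, a common way to get the "overwrite the previous StateObject" guarantee is an atomic replace: write to a temporary file, fsync it, then rename it over the old file. A minimal Python sketch, with the state directory name assumed:

import json, os, tempfile

STATE_DIR = 'state'            # hypothetical directory for per-object state files
os.makedirs(STATE_DIR, exist_ok=True)

def save_state(unique_id, state):
    """Overwrite the object's state file atomically: write a temp file,
    flush it to disk, then rename it over the old file."""
    path = os.path.join(STATE_DIR, f'{unique_id}.json')
    fd, tmp = tempfile.mkstemp(dir=STATE_DIR)
    with os.fdopen(fd, 'w') as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes hit the disk before the rename
    os.replace(tmp, path)      # atomic rename; readers never see a partial file

def load_state(unique_id):
    path = os.path.join(STATE_DIR, f'{unique_id}.json')
    with open(path) as f:
        return json.load(f)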

Cannot find efficient method for dumping leveldb to flat file(s)

I'm using LevelDB as part of a local process that, when all is done, has ~10-100 million JSON entries.
I need to get these into a portable format, ideally as one or more CSV (or even line-delimited JSON) files, to import into a separate MongoDB system.
I did a quick test in Node.js streaming the db contents to a file (using node-levelup), and on my machine it took about 18.5 minutes for 10 million pairs, which seems pretty slow.
Looking for suggestions on quicker dump/export from leveldb.
I've considered using mongodb as the store for the local processing, because mongoexport is much quicker, but there is a lot more overhead in setup as I'd need several shards to get more speed on my writes.
The fastest way to retrieve all entries from LevelDB is to use its iterator, which node-levelup probably already does for you.
Since you still need a tool to parse the exported file anyway, I suggest you simply copy LevelDB's data directory and treat that as the export. You can open it and iterate over it in Python, Ruby, or any script language that has a LevelDB wrapper.
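
As a sketch of that suggestion, assuming the copied data directory and using plyvel as the Python LevelDB wrapper, the whole store can be iterated and written out as line-delimited JSON; the paths and the assumption that the stored values are JSON strings are mine.

import json
import plyvel          # a Python LevelDB wrapper (pip install plyvel)

db = plyvel.DB('/path/to/leveldb-data')        # assumed path to the copied data dir

with open('dump.ndjson', 'w') as out:
    # db.iterator() walks keys in order without loading everything into memory
    for key, value in db.iterator():
        record = json.loads(value.decode())    # assumes each value is a JSON string
        record['_key'] = key.decode()
        out.write(json.dumps(record) + '\n')

db.close()

The resulting line-delimited JSON file can then be loaded on the MongoDB side with mongoimport.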