attach the database in memory to database in disk - python-3.x

Good evening, I would like to attach a database created in memory to a database created and saved on disk. I managed the first part (creating the DB in memory), but I am having difficulty attaching it to the DB on disk.
import sqlite3
# set up a database in memory
c = sqlite3.connect(':memory:')
c.execute('CREATE TABLE my_table (id int, name text);')
c.execute("INSERT INTO my_table VALUES (1, 'bruce'), (2, 'wayne'), (3, 'bat');")
c.commit()
I tried this code below but it doesn't work :/ :
ATTACH DATABASE 'file::memory:?cache=shared' AS db_disk

Before Python 3.7, the sqlite3 module had no direct way to copy the contents of an in-memory database to disk; Python 3.7 added Connection.backup() for that purpose.
In any case, the database forces writes to disk only when a transaction commits, so you can get much the same speed by using a disk database and writing everything in a single transaction, i.e., by not calling commit() before you are finished.
(But you might want to increase the cache size.)
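Two routes that may be worth trying are sketched below. It is a minimal sketch, assuming Python 3.7 or newer for Connection.backup(); the file names attached.db and backup.db and the temporary directory are hypothetical, chosen only so the example runs anywhere. Note that the direction is reversed compared with the attempt in the question: the disk file is attached to the in-memory connection, not the other way round.

```python
import os
import sqlite3
import tempfile

# Build the in-memory database from the question.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE my_table (id int, name text)")
mem.execute("INSERT INTO my_table VALUES (1, 'bruce'), (2, 'wayne'), (3, 'bat')")
mem.commit()

# Hypothetical target files in a throwaway directory.
workdir = tempfile.mkdtemp()
attach_path = os.path.join(workdir, "attached.db")
backup_path = os.path.join(workdir, "backup.db")

# Option 1: attach the disk file to the in-memory connection and copy a table.
# ATTACH and DETACH must run outside an open transaction, hence the commits.
mem.execute("ATTACH DATABASE ? AS db_disk", (attach_path,))
mem.execute("CREATE TABLE db_disk.my_table AS SELECT * FROM my_table")
mem.commit()
mem.execute("DETACH DATABASE db_disk")

# Option 2 (Python 3.7+): copy the whole in-memory database with the backup API.
disk = sqlite3.connect(backup_path)
mem.backup(disk)
disk.close()
```

Option 1 copies one table at a time and lets you keep querying both databases through a single connection; option 2 copies everything, including indexes and views.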

Related

how to avoid errors when querying hive table being loaded by Spark at the same time

We have a use case where we run an ETL written in Spark on top of some streaming data. The ETL writes results to the target Hive table every hour, but users frequently run queries against the target table, and we have seen queries fail because Spark was loading the table at the same time. What alternatives do we have to avoid or minimize these errors? Is there a property for the Spark job (or the Hive table)? Or something like creating a temporary table?
The error is:
java.io.FileNotFoundException: File does not exist [HDFS PATH]
I think this happens because the metadata says there is a file A, which gets deleted during the job execution.
The table is partitioned by year, month, day (using HDFS as storage), and every time the ETL runs it updates (via a partition overwrite) only the current date partition. Currently no "transactional" tables are enabled in the cluster (and even with them enabled, I tested the use case on a test cluster without luck).
The easy option is to use a table format that's designed to handle concurrent reads and writes, like Hudi or Delta Lake. The more complicated option involves a partitioned, append-only table that the writer writes to; on completion, the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
Have a set of two tables and a view over them:
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;
First iteration of ETL process needs to:
1. Synchronize foo_a -> foo_b
2. Do the work on foo_b
3. Drop view foo and recreate it pointing to foo_b
Until step 3 user queries run against table foo_a. From the moment of switch they run against foo_b. Next iteration of ETL will work in the opposite way.
This is not perfect. You need double storage and some extra complexity in the ETL. And anyway this approach might fail if:
- the user is unlucky enough to hit the short window between dropping and recreating the view
- the user submits a query that's heavy enough to run across two iterations of the ETL
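The switch-over described above can be sketched in Python. This is only an illustration of the pattern: it uses the standard library's sqlite3 as a stand-in for Hive so that it runs anywhere, and the function name run_etl_iteration, the columns x and y, and the sample rows are all hypothetical. In a real pipeline the statements would be Hive DDL and Spark writes, and the drop/recreate gap mentioned above would still apply, because the single transaction used here is an SQLite convenience.

```python
import sqlite3

# Autocommit mode, so BEGIN/COMMIT can be issued explicitly around the view swap.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE foo_a (x INTEGER, y TEXT);
    CREATE TABLE foo_b (x INTEGER, y TEXT);
    CREATE VIEW foo AS SELECT x, y FROM foo_a;
    INSERT INTO foo_a VALUES (1, 'old');
""")

def run_etl_iteration(conn, active, standby, new_rows):
    """Load new_rows into the standby table, then repoint the view at it."""
    # 1. Synchronize active -> standby.
    conn.execute(f"DELETE FROM {standby}")
    conn.execute(f"INSERT INTO {standby} SELECT * FROM {active}")
    # 2. Do the work on the standby table; readers still see the active one.
    conn.executemany(f"INSERT INTO {standby} VALUES (?, ?)", new_rows)
    # 3. Drop the view and recreate it pointing at the standby table.
    conn.execute("BEGIN")
    conn.execute("DROP VIEW foo")
    conn.execute(f"CREATE VIEW foo AS SELECT x, y FROM {standby}")
    conn.execute("COMMIT")
    # The roles swap for the next iteration.
    return standby, active

active, standby = run_etl_iteration(conn, "foo_a", "foo_b", [(2, "new")])
rows_seen_by_users = conn.execute("SELECT x, y FROM foo ORDER BY x").fetchall()
```

Returning the swapped table names keeps the "next iteration works in the opposite way" rule in one place instead of spreading it across the ETL.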

attach existing in-memory sqlite3 databases using python3

I have many sqlite3 databases (actual files) I wish to merge using a binary merge algorithm.
To do this, I use the python3 sqlite3 API to load them into memory (to avoid I/O overhead) and start merging them in pairs (binary merge).
However, after creating, for instance, two in-memory databases:
from sqlite3 import connect
conn_mem1 = connect('file:memdb1?mode=memory&cache=shared', uri=True)
conn_mem2 = connect('file:memdb2?mode=memory&cache=shared', uri=True)
How can I attach one to the other? I tried
cur_mem1 = conn_mem1.cursor()
cur_mem1.execute("ATTACH DATABASE 'file:memdb2?mode=memory&cache=shared' AS memdb2")
But that just seems to attach an empty database, and also creates an empty file with name
memdb2?mode=memory&cache=shared
on disk in the current path.
How can I attach an existing in-memory database to another in-memory database?

Cassandra: Issue with blob creation for large file

We are trying to load a file into a blob column in Cassandra. Files of 1-2 MB load fine, but when loading a large file, say around 50 MB, we get the following error:
Cassandra failure during write query at consistency LOCAL_QUORUM (1 responses were required but only 0 replica responded, 1 failed)
It is a single node development DB. Any hints or support will be appreciated.
50 MB is pretty big for a cell. Although a little out of date, this is still accurate: http://cassandra.apache.org/doc/4.0/faq/#can-large-blob
There is no mechanism for streaming out of cells in Cassandra, so a cell's content needs to be serialized as a single response, in memory. You're probably hitting a limit or bug somewhere that throws an exception and causes the failed query (check Cassandra's system.log; there may be an exception in there that describes what is occurring better).
If you have a CQL collection or logged batch there are additional lower limits.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
You can try chunking your blobs into parts. I'd actually recommend something like 64 KB per chunk; on the client side, iterate through the chunks and generate a stream (which also avoids loading the blob completely into memory on your side).
CREATE TABLE exampleblob (
blobid text,
chunkid int,
data blob,
PRIMARY KEY (blobid, chunkid));
Then just SELECT * FROM exampleblob WHERE blobid = 'myblob'; and iterate through the results. Inserting is more complex, since you need logic to split up your file, but this can also be done in a streaming fashion and stay memory efficient on the application side.
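The client-side splitting could look roughly like the sketch below. It deliberately stops short of talking to Cassandra: iter_chunks and reassemble are hypothetical helper names, and each yielded tuple is meant to be bound to an INSERT into the exampleblob table above using whichever driver you have.

```python
import io
from typing import BinaryIO, Iterable, Iterator, Tuple

CHUNK_SIZE = 64 * 1024  # 64 KB, the chunk size suggested above

def iter_chunks(stream: BinaryIO, blobid: str,
                chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[str, int, bytes]]:
    """Yield (blobid, chunkid, data) rows without reading the whole file into memory."""
    chunkid = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield blobid, chunkid, data
        chunkid += 1

def reassemble(rows: Iterable[Tuple[str, int, bytes]]) -> bytes:
    """Join chunk rows back together in chunkid order (what a reader would do)."""
    return b"".join(data for _, _, data in sorted(rows, key=lambda row: row[1]))

# Stand-in for a large file; a real caller would pass open(path, "rb").
payload = b"x" * 150_000
rows = list(iter_chunks(io.BytesIO(payload), "myblob"))
```

Because chunkid is the clustering column, rows for one blobid come back from Cassandra in chunk order, so a reader can write each chunk to its output stream as it arrives instead of joining them in memory as reassemble does here.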
Another alternative is to upload the blob to S3 or some other distributed file store, using a hash of the file as the bucket/file name. In Cassandra, just store the file name as a reference to it.

Global Temporary Table

Help me understand how a global temporary table works.
I have a process which is going to be threaded and requires data that is visible only to that thread's session, so we opted for a global temporary table.
Is it better to leave the global temporary table in place after all threads have completed, or is it wiser to drop it? The process may be called once or twice a day.
Around 4 tables are required.
Oracle temporary tables are NOT like SQL Server #temp tables. I can't see any reason to continuously drop and recreate the tables. The data is gone on a per-session basis anyway once the transaction or session completes (depending on the table creation options). If multiple threads use the same DB session, they will see each other's data. If you have one session per thread, then the data is limited in scope, as you mentioned.
Dropping and recreating a global temporary table does not affect other database activity or server disk I/O, because global temporary tables are created in the temp tablespace, where no archive logs are generated and no checkpoint updates the tempfile header. The purpose of a temporary table is still properly maintained in this case.

sqlite database design with millions of 'url' strings - slow bulk import from csv

I'm trying to create an sqlite database by importing a csv file with urls. The file has about 6 million strings. Here are the commands I've used
create table urltable (url text primary key);
.import csvfile urltable
After about 3 million urls the speed slows down a lot and my hard disk keeps spinning continuously. I've tried splitting the csv file into 1/4th chunks but I run into the same problem.
I read similar posts on stackoverflow and tried using BEGIN...COMMIT blocks and PRAGMA synchronous=OFF but none of them helped. The only way I was able to create the database was by removing the primary key constraint from url. But then, when I run a select command to find a particular url, it takes 2-3 seconds which won't work for my application.
With the primary key set on url, the select is instantaneous. Please advise me on what I am doing wrong.
[Edit]
Summary of suggestions that helped :
Reduce the number of transactions
Increase page size & cache size
Add the index later
Remove redundancy from url
Still, with a primary index, the database size is more than double the original csv file that I was trying to import. Any way to reduce that?
Increase your cache size to something large enough to contain all of the data in memory. The default values for page size and cache size are relatively small and if this is a desktop application then you can easily increase the cache size many times.
PRAGMA page_size = 4096;
PRAGMA cache_size = 72500;
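This will give you a cache size of just under 300 MB. Remember that the page size must be set before the database is created. The default page size is 1024 and the default cache size is 2000 pages (SQLite 3.12.0 and later default to a page size of 4096 and a cache of about 2 MB).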
Alternatively (or almost equivalently really) you can create the database entirely in an in-memory database and then use the backup API to move it to an on-disk database.
A PRIMARY KEY or UNIQUE constraint will automatically generate an index. An index will dramatically speed up SELECTs, at the expense of slowing down INSERTs.
Try importing your data into a non-indexed table, and then explicitly CREATE UNIQUE INDEX _index_name ON urltable(url). It may be faster to build the index all at once than one row at a time.
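Put together, the load-first, index-later approach might look like the following sketch using Python's sqlite3 module. The function name bulk_load_urls, the index name url_index, and the generated example.com URLs are hypothetical; the sketch also assumes the input contains no duplicate URLs, since CREATE UNIQUE INDEX fails if it finds any.

```python
import sqlite3

def bulk_load_urls(conn, urls):
    """Insert all URLs in one transaction, then build the unique index in one pass."""
    # No PRIMARY KEY here, so inserts do not have to maintain an index.
    conn.execute("CREATE TABLE urltable (url TEXT)")
    with conn:  # a single transaction, committed when the block exits
        conn.executemany("INSERT INTO urltable (url) VALUES (?)",
                         ((u,) for u in urls))
    # Build the index once, after the data is in place.
    conn.execute("CREATE UNIQUE INDEX url_index ON urltable(url)")

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA cache_size = 72500")  # cache size from the answer above
bulk_load_urls(conn, (f"http://example.com/page/{i}" for i in range(1000)))

found = conn.execute("SELECT url FROM urltable WHERE url = ?",
                     ("http://example.com/page/42",)).fetchone()
```

Passing a generator to executemany keeps memory use flat even for millions of rows, which matters more than the choice of in-memory or on-disk target.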
