How does sqlite3 edit a big file? - linux

Imagine a huge file that my program needs to edit. To speed up reads I use mmap() and only read the parts I'm currently viewing. However, if I want to add a line in the middle of the file, what's the best approach for that?
Is the only way to insert the line and then shift the rest of the file? That sounds expensive.
So my question is basically: What's the most efficient way of adding data in the middle of a huge file?
This question was previously asked here:
How to edit a big file
where the answer suggests using sqlite3 instead of a plain file. That makes me curious: how does sqlite3 solve this problem?

SQLite is a relational database. Its primary storage structures are B-tree tables and B-tree indices, and B-trees are designed to be edited in place even as records grow. In addition, SQLite uses a journal file to recover from crashes that happen in the middle of a write.
B-trees pay only O(log N) lookup time for any record by its primary key or any indexed column (this works out much faster than binary-searching sorted records, because the log base, the B-tree's branching factor, is huge). Because B-trees link their pages with block pointers rather than storing records contiguously, the middle of the ordered data can be updated relatively painlessly.
As RichN points out, SQLite builds up wasted space in the file. Run VACUUM periodically to free it.
Incidentally, I have written B-trees by hand. They are a pain to write, but worth it if you have to for some reason.
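To make the in-place editing concrete, here is a minimal sketch in Node.js/TypeScript using the better-sqlite3 package (the package and file names are my assumptions, not part of the question). After bulk-loading a table, an insert whose key lands squarely in the middle of the key range only touches a few B-tree pages rather than rewriting the file:

```typescript
import Database from 'better-sqlite3';

const db = new Database('btree-demo.db');
db.exec('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)');

// Populate with even keys so there is an obvious "middle" to insert into later.
const insert = db.prepare('INSERT OR IGNORE INTO t (id, payload) VALUES (?, ?)');
db.transaction(() => {
  for (let i = 2; i <= 200_000; i += 2) insert.run(i, 'x'.repeat(64));
})();

const pagesBefore = db.pragma('page_count', { simple: true });

// Insert a row whose key falls in the middle of the existing key range.
insert.run(100_001, 'inserted later');

const pagesAfter = db.pragma('page_count', { simple: true });
console.log({ pagesBefore, pagesAfter }); // grows by at most a few pages; nothing is rewritten
db.close();
```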

An SQLite database file is made up of records plus the data structures used to access those records. SQLite keeps track of the used portions of the file along with the unused portions (made available when records are deleted). When you add a new record and it fits in an unused segment, that becomes its location; otherwise it is appended to the file. Any indices are updated to point to the new data, and updating the indices may append further index records. SQLite (and database managers in general) doesn't move existing content when inserting new records.
Note that, over time, the contents become scattered across the disk. Sequential records won't be located near each other, which could affect the performance of some queries.
The SQLite VACUUM command can remove unused space in the file, as well as fix locality problems in the data. See VACUUM Command
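A rough sketch (again Node.js with better-sqlite3, an assumed dependency) of the free-list behaviour described above: deleting rows leaves unused pages inside the file, a later insert reuses them, and VACUUM rebuilds the file to give the space back to the filesystem:

```typescript
import Database from 'better-sqlite3';

const db = new Database('freelist-demo.db');
db.exec('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)');

const insert = db.prepare('INSERT INTO t (payload) VALUES (?)');
db.transaction(() => {
  for (let i = 0; i < 50_000; i++) insert.run('x'.repeat(200));
})();

db.exec('DELETE FROM t WHERE id <= 25000'); // frees whole pages inside the file
console.log('free pages after delete:', db.pragma('freelist_count', { simple: true }));

insert.run('this reuses a free page instead of growing the file');
console.log('free pages after insert:', db.pragma('freelist_count', { simple: true }));

db.exec('VACUUM'); // rebuilds the file without the holes
console.log('free pages after VACUUM:', db.pragma('freelist_count', { simple: true }));
db.close();
```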

Related

NodeJS: How to do positional insert (without overwriting) data into a file?

I want to insert a string in any position (beginning, end, middle) without overlapping/overwriting the existing data.
I've tried using
fs.createWriteStream(file, {start: 0, flags: 'r+'}) but this overwrites the existing data rather than inserting the new data before it.
I've seen solutions that read the data into buffers and then rewrite it back into the file, which won't work for me because I need to handle very large files and buffers have their limits.
Any thoughts on this?
The usual operating systems (Windows, Unix, Mac) do not have file systems that support inserting data into a file without rewriting everything that comes after. So, you cannot directly do what you're asking in a regular file.
Some of the technical choices you have are:
You rewrite all the data that comes after the insertion point to new, higher locations in the file, and then write your new content at the desired position. Note that it's a bit tricky to do this without overwriting data you still need: you essentially have to start at the end of the file, read a block, write it to its new higher location, move back a block, and repeat until you reach your insert location (see the sketch after this list).
You create your own little file system where a logical file consists of multiple physical files linked together by some sort of index. Then, to insert some data, you split one file into two (which involves rewriting some data) and insert at the end of the first of the split files. This can get very complicated very quickly.
You keep your data in a database, where it's easier to insert new data elements and to maintain some sort of index that establishes their order and that you can query by. Databases are in the business of managing how data is stored on disk while offering you views of it that are not directly tied to the on-disk layout.
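A rough sketch of the first option (the block-by-block shift), written against Node's standard fs API; the helper name, block size, and example call are my own:

```typescript
import * as fs from 'node:fs';

function insertIntoFile(path: string, position: number, data: Buffer, blockSize = 64 * 1024): void {
  const fd = fs.openSync(path, 'r+');
  try {
    const oldSize = fs.fstatSync(fd).size;
    let remaining = oldSize - position; // bytes that must move up by data.length

    // Walk backwards from the end so we never overwrite bytes we still have to copy.
    while (remaining > 0) {
      const chunk = Math.min(blockSize, remaining);
      const readPos = position + remaining - chunk;
      const buf = Buffer.alloc(chunk);
      fs.readSync(fd, buf, 0, chunk, readPos);
      fs.writeSync(fd, buf, 0, chunk, readPos + data.length);
      remaining -= chunk;
    }

    fs.writeSync(fd, data, 0, data.length, position); // fill the gap with the new data
  } finally {
    fs.closeSync(fd);
  }
}

// Example: insert a line somewhere in the middle of a text file.
// insertIntoFile('big.txt', 1_000_000, Buffer.from('a new line\n'));
```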
Related answer: What is the optimal way of merge few lines or few words in the large file using NodeJS?

Non-blocking insert into database with node js

Part of my Node.js app reads a file and, after some lightweight row-by-row processing, inserts the records into the database.
The original code did just that. The problem is that the file may contain a huge number of records, which are inserted row by row. According to some tests I did, a file of 10,000 rows blocks the app completely for about 10 seconds.
My considerations were:
Bulk create everything at once. This means reading the file, doing the per-row calculation, pushing each result into one big array, and finally calling Sequelize's bulkCreate. There were two downsides:
A huge insert can be as blocking as thousands of single-row inserts.
This may make it hard to generate reports for rows that were not inserted.
Bulk create in smaller, reasonable batches. This means reading the file, processing every n rows (e.g. 2000) and adding them to a batch, then calling Sequelize's bulkCreate for that batch. Batch preparation and the bulkCreate calls would run asynchronously (see the sketch at the end of this question). The downsides:
Choosing the batch size seems arbitrary.
It also feels like an ad-hoc workaround on my side, while there might be existing, proven solutions for this particular situation.
Moving this part of the code to another process, ideally limiting its CPU usage to a reasonable level (I don't know if that can be done or whether it's a smart idea).
Simply creating a new process for this (and other blocking parts of the code).
This is not a 'help me write some code' type of question. I have already looked around and it seems there is enough documentation. But I would like to invest in an efficient solution, using the proper tools. Other ideas are welcome.
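Not an answer from the original thread, just a minimal sketch of the batching consideration above. It streams the file with Node's readline, collects 2000 rows at a time, and awaits each Sequelize bulkCreate so the event loop gets breathing room between batches; the FileRow model and the per-row transform are hypothetical:

```typescript
import * as fs from 'node:fs';
import * as readline from 'node:readline';
import { FileRow } from './models'; // hypothetical Sequelize model

async function importFile(path: string, chunkSize = 2000): Promise<void> {
  const rl = readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity });
  let batch: Array<{ value: string }> = [];

  for await (const line of rl) {
    batch.push({ value: line.trim() }); // the lightweight per-row processing goes here
    if (batch.length >= chunkSize) {
      await FileRow.bulkCreate(batch); // awaiting here yields to the event loop between batches
      batch = [];
    }
  }
  if (batch.length > 0) await FileRow.bulkCreate(batch);
}
```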

CouchDB 2 global_changes system table is getting insanely big

We have a system that basically writes 250MB of data into our CouchDB 2 instance, which generates ~50GB/day in the global_changes database.
This makes CouchDB 2 consume all the disk space.
Once it gets to this state, CouchDB 2 goes down and never comes back.
We would like to know if there is any way of limiting the size of the global_changes table, or if there is a way of managing this table, like a set of best practices.
tldr: just delete it
http://docs.couchdb.org/en/latest/install/setup.html#single-node-setup
States:
Note that the last of these (referring to _global_changes) is not necessary if you do not expect to be using the global changes feed. Feel free to delete this database if you have created it, it has grown in size, and you do not need the function (and do not wish to waste system resources on compacting it regularly).
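For reference, a minimal sketch of that deletion over CouchDB's HTTP API (DELETE /{db}); the host, port, and credentials below are placeholders for a local single-node setup:

```typescript
const auth = Buffer.from('admin:password').toString('base64'); // placeholder credentials

const res = await fetch('http://127.0.0.1:5984/_global_changes', {
  method: 'DELETE',
  headers: { Authorization: `Basic ${auth}` },
});
console.log(res.status, await res.json()); // 200 {"ok":true} on success
```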

Data retrieval - Database VS Programming language

I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work, a few questions have sprung up.
Why is a database query faster than retrieving the same data from a file with a program?
To elaborate my question further -
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++, C, C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check for each row of data whether DOB="12/12/1985"; if it matches, I present the row to the user.
This way of doing it in a program is far slower than the speed at which an SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate on this forum, please delete but do provide me some pointers where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than a programming language data retrieval from a file
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process your data file and provide index files for the different fields, you can speed up data lookups from the filesystem as well.
Note: databases are normally used not for this feature, but because they are ACID compliant and therefore suitable for environments where you have multiple processes (normally many clients on many computers) querying the database at the same time.
There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
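To illustrate, a toy version of that home-grown date index in Node.js/TypeScript, assuming fixed-length records laid out as Name(30) ID(10) DOB(10) Email(40) Sex(1); all of the field widths are made up. One sequential pass builds a DOB-to-offset index, after which each date query is a handful of seeks instead of a full scan:

```typescript
import * as fs from 'node:fs';

const RECORD_SIZE = 91; // 30 + 10 + 10 + 40 + 1, an assumed layout
const DOB_OFFSET = 40;  // Name + ID come first
const DOB_LENGTH = 10;

// One full pass over the file: map each DOB value to the offsets of its records.
function buildDobIndex(path: string): Map<string, number[]> {
  const index = new Map<string, number[]>();
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(RECORD_SIZE);
  let offset = 0;
  while (fs.readSync(fd, buf, 0, RECORD_SIZE, offset) === RECORD_SIZE) {
    const dob = buf.toString('utf8', DOB_OFFSET, DOB_OFFSET + DOB_LENGTH).trim();
    const hits = index.get(dob) ?? [];
    hits.push(offset);
    index.set(dob, hits);
    offset += RECORD_SIZE;
  }
  fs.closeSync(fd);
  return index;
}

// Answer "Select Name, ID ... where DOB = ..." by seeking only to the indexed offsets.
function findByDob(path: string, index: Map<string, number[]>, dob: string): string[] {
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(RECORD_SIZE);
  const rows: string[] = [];
  for (const offset of index.get(dob) ?? []) {
    fs.readSync(fd, buf, 0, RECORD_SIZE, offset);
    rows.push(buf.toString('utf8', 0, 40)); // Name + ID, as in the example query
  }
  fs.closeSync(fd);
  return rows;
}
```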
An old one, I know... just in case somebody finds this: the question said "assume ... do not have any indexes", so it was about the sequential-read fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk you do a lot of small reads and disk seeks, which is expensive performance-wise. A database, by design, always loads whole pages, so it gets a bunch of records at once. Less disk seeking is definitely faster. If you did a memory-buffered read from the flat file you could achieve the same or better read performance.
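A small sketch of that buffered sequential scan, using the same made-up record layout as above: read the flat file in large blocks instead of one 91-byte read per record, so the pass costs far fewer system calls and seeks:

```typescript
import * as fs from 'node:fs';

const RECORD_SIZE = 91;
const DOB_OFFSET = 40;
const DOB_LENGTH = 10;
const RECORDS_PER_BLOCK = 8192; // read roughly 0.7 MB per system call

function scanByDob(path: string, dob: string): string[] {
  const fd = fs.openSync(path, 'r');
  const block = Buffer.alloc(RECORD_SIZE * RECORDS_PER_BLOCK);
  const rows: string[] = [];
  let offset = 0;
  let n: number;
  while ((n = fs.readSync(fd, block, 0, block.length, offset)) > 0) {
    // Scan the records inside the block in memory; no per-record I/O.
    for (let p = 0; p + RECORD_SIZE <= n; p += RECORD_SIZE) {
      if (block.toString('utf8', p + DOB_OFFSET, p + DOB_OFFSET + DOB_LENGTH).trim() === dob) {
        rows.push(block.toString('utf8', p, p + 40)); // Name + ID
      }
    }
    offset += n;
  }
  fs.closeSync(fd);
  return rows;
}
```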

Replicating CouchDB to local couch reduces size - why?

I recently started using Couch for a large app I'm working on.
I have a database with 7907 documents and wanted to rename it. I poked around for a bit but couldn't figure out how to rename it, so I figured I would just replicate it to a local database with the name I wanted.
The first time I tried, the replication failed; I believe the error was a timeout. I tried again, and it worked very quickly, which was a little disconcerting.
After the replication, I'm showing that the new database has the correct number of records, but the database size is about 1/3 of the original.
Also a little odd is that if I refresh Futon, the size of the original fluctuates between 94.6 and 95.5 MB.
This leaves me with a few questions:
Is the 2nd database storing references to the first? If so, can I delete the first without causing harm?
Why would the size be so different? Had the original built indexes that the new one eventually will?
Why is the size fluctuating?
edit:
A few things that might be helpful:
This is on a Cloudant CouchDB install.
I checked the first and last record of the new db, and they match, so I don't believe futon is underreporting.
Replicating to a new database is similar to compaction. Both involve certain side-effects (incidentally, and intentionally, respectively) which reduce the size of the new .couch file.
The B-tree indexes get balanced.
Data from old document revisions is discarded.
Metadata from previous updates to the DB is discarded.
Replications store checkpoints on both the source and the target, so if you re-replicate from the same source to the same target (i.e. re-run a replication that timed out), it will pick up where it left off.
Answers:
Replication does not create a reference to another database. You can delete the first without causing harm.
Replicating (and compacting) generally reduces disk usage. If you have any views in any design documents, those will be rebuilt the first time you query them. View indexes use their own .view files, which also consume space.
I am not sure why the size is fluctuating. Browser and proxy caches are the bane of CouchDB (and web) development. But perhaps it is also a result of internal Cloudant behavior (for example, different nodes in the cluster reporting slightly different sizes).
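For completeness, a sketch of triggering the two operations discussed above through CouchDB's plain HTTP API (host, database names, and credentials are placeholders; Cloudant's managed service may not expose all of these endpoints):

```typescript
const base = 'http://127.0.0.1:5984'; // placeholder server
const headers = {
  Authorization: `Basic ${Buffer.from('admin:password').toString('base64')}`,
  'Content-Type': 'application/json',
};

// In-place compaction of the original database.
await fetch(`${base}/olddb/_compact`, { method: 'POST', headers });

// Or: one-shot replication into a freshly created database under the new name.
await fetch(`${base}/newdb`, { method: 'PUT', headers });
await fetch(`${base}/_replicate`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ source: `${base}/olddb`, target: `${base}/newdb` }),
});
```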
