I want to insert a string at any position in a file (beginning, end, middle) without overwriting the existing data.
I've tried using
fs.createWriteStream(file, {start: 0, flags: 'r+'}), but this overwrites the existing data instead of inserting it.
I've seen solutions that read the data into a buffer and then rewrite it back to the file, which won't work for me because I need to handle very large data and buffers have their limits.
Any thoughts on this?
The usual operating systems (Windows, Unix, Mac) do not have file systems that support inserting data into a file without rewriting everything that comes after. So, you cannot directly do what you're asking in a regular file.
Some of the technical choices you have are:
You rewrite all the data that comes after the insert point to new locations in the file, essentially shifting it toward the end, and then write your new content at the desired position. Note that it's a bit tricky to do this without overwriting data you need to keep: you essentially have to start at the end of the file, read a block, write it to a new higher location, step back a block, and repeat until you reach the insert location (see the sketch after this list).
You create your own little file system where a logical file consists of multiple files linked together by some sort of index. Then, to insert some data, you split a file into two (which involves rewriting some data); you can then insert at the end of one of the split files. This can get very complicated very quickly.
You keep your data in a database, where it's easier to insert new data elements and to keep some sort of index that establishes their order, which you can query by. Databases are in the business of managing how data is stored on disk while offering you views of that data that are not directly tied to how it's stored on disk.
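As an illustration of the first option, here is a minimal sketch in Python (the question itself is about Node.js, where the same shift-from-the-end idea works with fs.read/fs.write; the block size below is an arbitrary choice):

import os

BLOCK = 1024 * 1024  # move 1 MiB at a time; arbitrary

def insert_bytes(path, position, data):
    # Shift everything after 'position' toward the end of the file,
    # working backwards in blocks so nothing is overwritten before it is read.
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        remaining = f.tell() - position
        while remaining > 0:
            chunk = min(BLOCK, remaining)
            src = position + remaining - chunk
            f.seek(src)
            block = f.read(chunk)
            f.seek(src + len(data))
            f.write(block)
            remaining -= chunk
        # The gap at 'position' is now free to hold the new data.
        f.seek(position)
        f.write(data)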
Related answer: What is the optimal way of merge few lines or few words in the large file using NodeJS?
I am using Flatbuffers as a way to store data and the meta-tags with the data. I am using Python and in order to simulate dictionaries, I have two table structures: One for dictionary entries and one to hold a vector of entries. Here is an example of the schema:
// Define dictionary structure
table tokenEntry {
    key:string;
    value:int;
}

table TokenDict {
    Entries:[tokenEntry];
}

root_type TokenDict;
I wish to write two dictionaries to a single file using Flatbuffers. I want to also read the dictionaries one at a time from the file, and not load both into memory at the same time. I am able to write both to file, one at a time. However, when I read from the file, I get both of the structures at once. The buffer holds all the data from the file. This is not what I want, because later I will have a much larger amount of data in the files. Is there a way to read in just one at a time?
As an example, if I were to use pickle, I could write multiple pickles to a file and read them back one at a time. I wish to do the same with FlatBuffers.
Thank you.
It's best to write the file as a sequence of individual FlatBuffers, each prefixed with its size. You can do that in Python using FinishSizePrefixed (see builder.py).
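A minimal sketch of the framing this gives you, in Python (building the actual TokenDict table is omitted; the generated accessor names depend on your flatc version and are an assumption here):

import struct
import flatbuffers

def write_size_prefixed(f, builder, root):
    # FinishSizePrefixed puts a 4-byte little-endian length in front of the
    # buffer, so several buffers can live in one file and still be split apart.
    builder.FinishSizePrefixed(root)
    f.write(builder.Output())

def read_size_prefixed(f):
    # Yield one buffer at a time without loading the whole file into memory.
    while True:
        prefix = f.read(4)
        if len(prefix) < 4:
            break
        size = struct.unpack("<I", prefix)[0]
        yield f.read(size)

Each buffer yielded by read_size_prefixed can then be handed to the generated root accessor for TokenDict (the exact accessor name depends on the generated code version).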
I have a .csv file that has around 2 million rows, and I want to add a new column. The problem is that I could only manage to do that by losing a lot of data (basically everything above ~1.1M rows). When I used a connection to the external file (so that I could read all the rows) and made changes to it in Power Query, the changes were not saved back to the .csv file.
You can apply one of several solutions:
Using a text editor that can handle huge files, split the csv into smaller chunks. Apply the modification to each chunk. Join the chunks again to get the desired file.
Create a "small" program yourself, which reads the csv line by line, applies the modification and writes the resulting data to a second file (see the sketch after this list).
Maybe some other software can handle a csv of that size. Or patch LibreOffice yourself to handle 2,000,000+ lines - the source code is available :)
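A minimal sketch of the second option in Python (file names and the value of the new column are placeholders):

import csv

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + ["new_column"])
    for row in reader:
        # Derive the new value from the existing row however you need.
        writer.writerow(row + ["some_value"])

This streams the file row by row, so the 2 million rows never have to fit in memory at once.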
Imagine a huge file that should be edited by my program. In order to speed up reads I use mmap() and then only read the parts I'm viewing. However, if I want to add a line in the middle of the file, what's the best approach for that?
Is the only way to add the line and then move the rest of the file? That sounds expensive.
So my question is basically: What's the most efficient way of adding data in the middle of a huge file?
This question was previously asked here:
How to edit a big file
where the answer suggests using sqlite3 instead of a direct file. That makes me curious: how does sqlite3 solve this problem?
SQLite is a relational database. Its primary storage structures are B-tree tables and B-tree indices. B-trees are designed to be edited in place even as records grow. In addition, SQLite uses a journal file to recover from crashes that happen while changes are being written.
B-trees pay only O(log N) lookup time for any record by its primary key or any indexed column (this works out faster even than binary-searching sorted records, because the log base - the B-tree fan-out - is huge). Because B-trees use block pointers almost everywhere, the middle of the ordered list can be updated relatively painlessly.
As RichN points out, SQLite builds up wasted space in the file. Run VACUUM periodically to free it.
Incidentally I have written BTrees by hand. They are a pain to write but worth it if you must for some reason.
The contents of an SQLite database file are made up of records and data structures for accessing those records. SQLite keeps track of the used portions of the file along with the unused portions (made available when records are deleted). When you add a new record and it fits in an unused segment, that becomes its location. Otherwise it is appended to the file. Any indices are updated to point to the new data. Updating the indices may append further index records. SQLite (and database managers in general) don't move any existing content when inserting new records.
Note that, over time, the contents become scattered across the disk. Sequential records won't be located near each other, which could affect the performance of some queries.
The SQLite VACUUM command can remove unused space in the file, as well as fix locality problems in the data. See the VACUUM Command documentation.
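To make the sqlite3 suggestion concrete, here is a minimal sketch in Python (the table layout and file name are made up): lines are keyed by a sortable position, so inserting "in the middle" never rewrites the lines that follow.

import sqlite3

con = sqlite3.connect("bigfile.db")
con.execute("CREATE TABLE IF NOT EXISTS lines (pos REAL PRIMARY KEY, text TEXT)")
con.execute("INSERT INTO lines VALUES (1.0, 'first line')")
con.execute("INSERT INTO lines VALUES (2.0, 'third line')")
# Insert between two neighbours by picking a key between their positions.
con.execute("INSERT INTO lines VALUES (1.5, 'second line')")
con.commit()

for pos, text in con.execute("SELECT pos, text FROM lines ORDER BY pos"):
    print(pos, text)

con.execute("VACUUM")  # reclaim unused pages and defragment the file
con.close()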
I am working on parsing different types of files (text, xml, csv etc.) into a specific text file format using the Spark Java API. This output file maintains the order of file header, start tag, data header, data and end tag. All of these elements are extracted from the input file at some point.
I tried to achieve this in below 2 ways:
Read the file into an RDD using Spark's textFile and perform the parsing using map or mapPartitions, which returns a new RDD.
Read the file using Spark's textFile, reduce it to 1 partition using coalesce, and perform the parsing using mapPartitions, which returns a new RDD.
While I am not concerned about the ordering of the actual data, with the first approach I am not able to keep the required order of file header, start tag, data header and end tag.
The latter works for me, but I know it is not an efficient way and may cause problems with BIG files.
Is there any efficient way to achieve this?
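For reference, a minimal PySpark sketch of the two approaches described above (the question uses the Java API; the paths, app name and parse function are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="file-parser")

def parse_partition(lines):
    # Placeholder parsing: emit file header, start tag, data header,
    # data and end tag for the lines in this partition.
    for line in lines:
        yield line

rdd = sc.textFile("input.txt")

# Approach 1: stays distributed, but the header/tag ordering across
# partitions is not guaranteed in the final output.
parsed = rdd.mapPartitions(parse_partition)

# Approach 2: a single partition keeps the order, but gives up parallelism.
ordered = rdd.coalesce(1).mapPartitions(parse_partition)
ordered.saveAsTextFile("output")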
You are correct in your assumptions. The second choice simply cancels out the distributed aspect of your application, so it's not scalable. As for the order issue: because processing is asynchronous, you cannot keep track of order once the data resides on different nodes. What you could do is some preprocessing that removes the need for order, meaning merge lines up to the point where line order no longer matters and only then distribute your file. Unless you can make assumptions about the file structure, such as the number of lines that belong together, I would go with the above.
I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work I have a few questions that sprang up.
Why is a database query faster than retrieving the data from a file with a programming language?
To elaborate my question further -
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++, C, C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check each row to see whether DOB="12/12/1985"; if it matches, I present the row to the user.
This way of doing it by a program is too slow when compared to the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate for this forum, please delete it, but do give me some pointers to where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than retrieving the data from a file with a programming language?
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process your data file and provide index files for the different fields, you can speed up data lookups from the filesystem as well.
Note: databases are normally chosen not for this feature alone, but because they are ACID compliant and therefore suitable for environments where multiple processes (normally many clients on many computers) query the database at the same time.
There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
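To make the index idea concrete, here is a minimal sketch of a hand-rolled date index over a fixed-length-record file (the record layout, field offsets and file name are made up for illustration):

RECORD_LEN = 120               # hypothetical fixed record length in bytes
DOB_OFFSET, DOB_LEN = 80, 10   # hypothetical position of the DOB field

def build_dob_index(path):
    # Scan the file once, mapping each DOB to the offsets of matching records.
    index = {}
    with open(path, "rb") as f:
        offset = 0
        while True:
            record = f.read(RECORD_LEN)
            if len(record) < RECORD_LEN:
                break
            dob = record[DOB_OFFSET:DOB_OFFSET + DOB_LEN].decode("ascii").strip()
            index.setdefault(dob, []).append(offset)
            offset += RECORD_LEN
    return index

def lookup(path, index, dob):
    # Seek straight to the matching records instead of scanning the whole file.
    with open(path, "rb") as f:
        for offset in index.get(dob, []):
            f.seek(offset)
            yield f.read(RECORD_LEN)

# Build once, query many times:
# idx = build_dob_index("employees.dat")
# for rec in lookup("employees.dat", idx, "12/12/1985"):
#     print(rec)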
...an old one, I know... just in case somebody finds this: the question said "assume ... do not have any indexes"
...so the question was about a sequential-read contest between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk you do a lot of disk seeking, which is expensive performance-wise. A database, by design, always loads whole pages - so a bunch of records at once. Fewer disk seeks is definitely faster. If you did a memory-buffered read from a flat file you could achieve the same or better read performance.
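A minimal sketch of such a buffered scan in Python, assuming the fixed-length records from the question (the record and block sizes are made up):

RECORD_LEN = 120
BLOCK_RECORDS = 4096   # read a few thousand records per disk access

def scan(path, predicate):
    with open(path, "rb") as f:
        while True:
            block = f.read(RECORD_LEN * BLOCK_RECORDS)   # one big sequential read
            if not block:
                break
            for i in range(0, len(block), RECORD_LEN):
                record = block[i:i + RECORD_LEN]
                if len(record) == RECORD_LEN and predicate(record):
                    yield record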