Most efficient data structure to add styles to text

I'm looking for the best data structure to add styles to a text (say in a text editor). The structure should allow the following operations:
1. Quick lookup of all styles at absolute position X.
2. Quick insert of text at any position (styles after that position must be moved).
3. Every position of the text must support an arbitrary number of styles (overlapping).
I've considered lists/arrays which contain text ranges but they don't allow quick insert without recalculating the positions of all styles after the insert point.
A tree structure with relative offsets supports #2 but the tree will degenerate fast when I add lots of styles to the text.
Any other options?

I have never developed an editor, but how about this:
I believe it would be possible to expand the scheme that is used to store the text characters themselves, depending of course on the details of your implementation (language, toolkits etc) and your performance and resource usage requirements.
Rather than use a separate data structure for the styles, I'd prefer having a reference that would accompany each character and point to an array or list of the applicable styles. Characters with the same set of styles could point to the same array or list, so that one instance could be shared.
Character insertions and deletions would not affect the styles themselves, apart from changing the number of references to them, which could be handled with a bit of reference counting.
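As a rough illustration, such a shared style list with reference counting might look like this in C (all names here are my own, not a prescribed design):
#include <stdlib.h>

struct style;   /* whatever describes one style (bold, color, ...) */

/* A shared, reference-counted set of styles; every character with the
   same styling points at the same instance. */
struct style_set {
    int refcount;
    int nstyles;
    const struct style *styles[];   /* the applicable styles */
};

static struct style_set *style_set_ref(struct style_set *s)
{
    s->refcount++;
    return s;
}

static void style_set_unref(struct style_set *s)
{
    if (--s->refcount == 0)
        free(s);
}
An insertion then only calls style_set_ref on the previous character's set, and a deletion calls style_set_unref.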
Depending on your programming language you could even compress things a bit more by pointing halfway into a list, although the additional bookkeeping for this might in fact make it more inefficient.
The main issue with this suggestion is the memory usage. In an ASCII editor written in C, bundling a pointer with each char would raise the effective memory usage per character from 1 byte to 16 bytes on a 64-bit system, due to struct alignment padding (1 byte of text, 7 bytes of padding, 8 bytes of pointer).
I would look at breaking the text into small variable-size blocks that would allow you to compress the pointers efficiently. E.g. a 32-character block might look like this in C:
struct block {
    unsigned char size;     /* number of characters stored in content[] */
    unsigned int styles;    /* 32 one-bit flags, one per character */
    char content[];         /* the characters, followed by any style pointers */
};
The interesting part is the metadata processing on the variable part of the struct, which contains both the stored text and any style pointers. The size element would indicate the number of characters. The styles integer (hence the 32-character limit) would be seen as a set of 32 1-bit fields, with each one indicating whether a character has its own style pointer, or whether it should use the same style as the previous character. This way a 32-char block with a single style would only have the additional overhead of the size char, the styles mask and a single pointer, along with any padding bytes. Inserting and deleting characters into a small array like this should be quite fast.
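To make the lookup concrete, here is a minimal sketch of resolving the style set in effect at character i of such a block, assuming the style pointers are packed immediately after the characters in content[] (the helper name and the GCC/Clang popcount builtin are my choices, not part of the scheme):
#include <string.h>

/* Style pointer in effect at character index i (0-based) of block b,
   or NULL if the style is inherited from the previous block. */
static void *style_at(const struct block *b, unsigned i)
{
    /* The highest flagged character at or below i owns the governing
       pointer; count the flags to find its slot in the pointer area. */
    unsigned mask = b->styles & ((i >= 31) ? ~0u : ((1u << (i + 1)) - 1));
    int idx = __builtin_popcount(mask) - 1;
    if (idx < 0)
        return NULL;

    /* The pointers sit unaligned after the text, so copy one out. */
    void *style;
    memcpy(&style, b->content + b->size + idx * sizeof(void *), sizeof(void *));
    return style;
}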
As for the text storage itself, a tree sounds like a good idea. Perhaps a binary tree where each node value would be the sum of the children values, with the leaf nodes eventually pointing to text blocks with their size as their node value? The root node value would be the total size of the text, with each subtree ideally holding half of your text. You'd still have to auto-balance it, though, with sometimes having to merge half-empty text blocks.
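For instance (a sketch with made-up field names), each node could carry its subtree's character count, so that locating the block containing absolute position X is a single descent:
#include <stddef.h>   /* size_t */

/* Each internal node stores the total character count of its subtree;
   leaves point at text blocks. */
struct node {
    size_t weight;               /* characters in this subtree */
    struct node *left, *right;   /* NULL on leaves */
    struct block *leaf;          /* non-NULL only on leaves */
};

/* Find the block holding absolute position *pos; on return, *pos is the
   offset within that block. Assumes *pos < root->weight. */
struct block *find_block(struct node *n, size_t *pos)
{
    while (n->leaf == NULL) {
        if (*pos < n->left->weight) {
            n = n->left;
        } else {
            *pos -= n->left->weight;
            n = n->right;
        }
    }
    return n->leaf;
}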
And in case you missed it, I am no expert in trees :-)
EDIT:
Apparently what I suggested is a modified version of this data structure:
http://en.wikipedia.org/wiki/Rope_%28computer_science%29
as referenced in this post:
Data structure for text editor
EDIT 2:
Deletion in the proposed data structure should be relatively fast, as it would come down to byte shifting in an array and a few bitwise operations on the styles mask. Insertion is pretty much the same, unless a block fills up. It might make sense to reserve some space (i.e. some bits in the styles mask) within each block to allow for future insertions directly in the blocks, without having to alter the tree itself for relatively small amounts of new text.
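To sketch that insertion path (assuming the block struct from above, a block that still has room, and ignoring the style-pointer area for brevity):
#include <string.h>

/* Insert character c at offset pos of block b. The new character's style
   bit is left 0, i.e. "same style as the previous character". */
void block_insert(struct block *b, unsigned pos, char c)
{
    memmove(b->content + pos + 1, b->content + pos, b->size - pos);
    b->content[pos] = c;

    /* Shift the style bits at and above pos up by one position. */
    unsigned low = b->styles & ((1u << pos) - 1);
    b->styles = low | ((b->styles & ~((1u << pos) - 1)) << 1);
    b->size++;
}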
Another advantage of bundling characters and styles in blocks like this is that its inherent data locality should allow for more efficient use of the CPU cache than other alternatives, thus improving the processing speed to some extent.
Much like any complex data structure, though, you'd probably need either profiling with representative test cases or an adaptive algorithm to determine the optimal parameters for its operation (block size, any reserved space etc).

Related

Why are BufferViews and Accessors separate in glTF?

The glTF format specifies that meshes reference their vertex and index data via accessors, which in turn reference BufferViews. Both of them have an offset and a length.
The main difference seems to be that BufferViews are format-agnostic (they just reference a bunch of bytes), while the accessors add type information.
What I do not understand is:
Why do both of them need an offset and a length? Which use case is there where the offset in an accessor is not zero and the count of an accessor does not correspond to the length of the view?
Why isn't the type data directly contained in the buffer view? In what use case does it make sense to interpret the same data with different formats?
The format is designed to support interleaved vertex attributes, initially from WebGL (in glTF 1.0) but now more generally across graphics APIs (in glTF 2.0).
For example, POSITION data may be a vec3 of FLOAT, but TEXCOORD_0 data may be a vec2 of FLOAT, and there could even be custom attributes of different types, all interleaved within a single GPU buffer.
So the BufferView defines a given byte stride, and the individual accessors into that view may have different types and element counts, but will all share the same byte stride.
You're not required to interleave of course, but the format is designed to allow it, and to enforce the byte stride sharing when it happens.
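For instance, with the POSITION/TEXCOORD_0 example above, the interleaved data in one BufferView corresponds to a C-style layout like this (the offsets are just for illustration): the bufferView's byteStride equals sizeof(struct vertex), each accessor's byteOffset selects one field, and that is exactly the case where an accessor's offset is non-zero and its data covers only part of the view.
/* One vertex as it sits in the shared BufferView (byteStride = 20 bytes).
   POSITION   accessor: type VEC3, componentType FLOAT, byteOffset 0
   TEXCOORD_0 accessor: type VEC2, componentType FLOAT, byteOffset 12 */
struct vertex {
    float position[3];   /* bytes  0..11 of each stride */
    float texcoord[2];   /* bytes 12..19 of each stride */
};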
Here's a diagram from the Data Interleaving section of the glTF tutorial: it shows two accessors, one for POSITION and one for NORMAL, sharing a single BufferView.

Brotli compression multithreading

It's my understanding that Brotli stores block-size information in a meta-block header as only the final uncompressed size of the block, with no information about the compressed length (section 9.2). I'm guessing that a wrapper would need to be created in order to use it with multiple threads, or possibly something similar to Mark Adler's pigz.
Would the same threading principles apply to Brotli as they do with gzip in this case, or are there any foreseeable issues to be aware of when it comes to multithreading implementations?
You can use the brotli format as is for this purpose. I got them to add the option of putting metadata in empty meta-blocks (where "empty" means that the meta-block produces zero uncompressed data). You can put markers in metadata to aid in finding meta-blocks. An inserted empty meta-block also starts the next meta-block at a byte boundary.
Each meta-block can be independent of the other meta-blocks. If the stream is constructed that way, then there is no issue with combining them when compressing, or decompressing them separately. The areas of possible dependency are the ring buffer of the four last distances used, and backward references past the beginning of the current meta-block. For parallel use, a meta-block can and must be constructed so as not to depend on the last four distances, not referring to that ring buffer until it has been filled with distances from the current meta-block. In addition, distances that reach back before the current meta-block would not be allowed (which also rules out static dictionary references, since those are encoded as distances). Lastly, you would append an empty or metadata meta-block to bring the sequence to a byte boundary for easy concatenation.
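For the simpler wrapper approach from the question (as opposed to hand-constructing independent meta-blocks, which the reference encoder's public API does not expose), each thread could compress its own chunk as a complete, independent brotli stream with the one-shot C API, and the wrapper's framing would record where each compressed chunk starts. This is only a sketch, and the chunking/framing details are assumptions:
#include <brotli/encode.h>   /* the reference encoder's C API */
#include <stdint.h>
#include <stdlib.h>

/* Compress one chunk as an independent brotli stream. A pigz-style
   wrapper would run this on one chunk per thread and store the
   resulting lengths in its own framing. */
static uint8_t *compress_chunk(const uint8_t *in, size_t in_len, size_t *out_len)
{
    size_t cap = BrotliEncoderMaxCompressedSize(in_len);
    uint8_t *out = malloc(cap);
    if (out == NULL)
        return NULL;
    *out_len = cap;
    if (!BrotliEncoderCompress(BROTLI_DEFAULT_QUALITY, BROTLI_DEFAULT_WINDOW,
                               BROTLI_MODE_GENERIC, in_len, in,
                               out_len, out)) {
        free(out);
        return NULL;
    }
    return out;
}
The cost of this approach is that back-references never cross a chunk boundary, so the compression ratio drops somewhat as chunks get smaller.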
By the way, it looks like you're linking to an older version of the draft format. Here is a link to the current version.

How to delete elements by value in a map structure restricted to having one key

The main problem is that I'm working in a functional language with immutable types, so things like pointers and deletion are a bit harder. I would prefer if this were implementable primarily in Haskell.
Let's imagine we have a single dimensional field
[x,x,x,x,x,x,x,x,x]
So I have a map with keys being SIZES and values being ADDRESSES because each entry starts from a certain ADDRESS and has a certain SIZE.
[(x,x,x),x,x,(x,x,x,x)]
I want to be able to add an element by SIZE to a map and then check if the entries are touching so that I can merge them.
Since my map is by SIZEs I have to iterate through the whole map to find the ones with the bordering ADDRESSes.
Do I really have to choose between implementing a 2-key map and O(n) for merging?
Welp, in essence, this looks like computer memory. Do you want it to be efficient? Because you know, "things like pointers" exist and work in Haskell perfectly well.
Since my map is by SIZEs I have to iterate through the whole map to find the ones with the bordering ADDRESSes.
No, not if you store the ranges in a separate data structure. I think for such non-overlapping subsets there is something called a segment tree (or, as suggested by @Daniel, IntervalMap), but I'm not exactly an expert on those. Otherwise, why don't you simply hold memory blocks like that?
data Block = Block { start :: Int, bytes :: [Word8] }  -- "data" is a reserved word in Haskell, so the field needs another name; Word8 comes from Data.Word
type Memory = [Block]
You could cache the block length or use a data structure where length is O(1), to make merges O(nBlocks).
Sure, that doesn't make it obvious at the type level that they won't ever overlap, but that's an invariant you can keep for yourself.

Meta-information in DAWG/DAFSA

I would like to implement a string look-up data structure, for dynamic strings, that will support efficient search and insertion. Currently, I am using a trie but I would like to reduce the memory footprint if possible. This Wikipedia article describes a DAWG/DAFSA, which will obviously save a lot of space over a trie by compressing suffixes. However, while it will clearly test whether a string is legal, it is not obvious to me if there is any way to exclude illegal strings. For example, using the words "cite" and "cat" where the "t" and "e" are terminal states, a DAWG/DAFSA would look like this:
  c
 / \
a   i
 \ /
  t
  |
  e
and "cit" and "cate" will be incorrectly recognized as legal strings without some meta-information.
Questions:
1) Is there a preferred way to store meta-information about strings/paths (such as legality) in a DAWG/DAFSA?
2) If a DAWG/DAFSA is incompatible with the requirements (efficient search/insertion and storing meta-information) what's the best data structure to use? A minimal memory footprint would be nice, but perhaps not absolutely necessary.
In a DAWG, you only compress states together if they're completely indistinguishable from one another. This means that you actually wouldn't combine the T nodes for CAT and CITE together for precisely the reason you've noted - that gives you either a false positive on CIT or a false negative on CAT.
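Concretely (a sketch with made-up field names): the terminal flag is part of a state's identity, so the two T states are distinguishable and cannot be merged:
#include <stdbool.h>

/* Two DAWG states may be merged only if they agree on is_word and on
   every outgoing edge, recursively.
     CAT's  't': is_word = true,  no children
     CITE's 't': is_word = false, one child 'e'
   They differ, so they must remain separate nodes. */
struct node {
    bool is_word;             /* the per-path meta-information */
    struct node *child[26];   /* one edge per letter 'a'..'z' */
};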
DAWGs are typically most effective for static dictionaries when you have a huge number of words with common suffixes. A DAWG for all of English, for example, could save a lot of space by combining all the suffix "s"'s at the end of plural words and most of the "ING" suffixes from gerunds. If you're going to be doing a lot of insertions or deletions, DAWGs are almost certainly the wrong data structure for the job because adding or removing a single word from a DAWG can cause ripple effects that require lots of branches that were previously combined to be split or vice-versa.
Quite honestly, for reasonably-sized data sets, a trie isn't a bad call. A trie for all of English would only use up something like 26MB, which isn't very much. I would only go with the DAWG if space usage really is at a premium and you aren't doing many insertions or deletions.
Hope this helps!

Is it possible to efficiently search a bit-trie for values less than the key?

I am currently storing a large number of unsigned 32-bit integers in a bit trie (effectively forming a binary tree with a node for each bit in the 32-bit value). This is very efficient for fast lookup of exact values.
I now want to be able to search for keys that may or may not be in the trie and find the value for the first key less than or equal to the search key. Is this efficiently possible with a bit trie, or should I use a different data structure?
I am using a trie due to its speed and cache locality, and ideally want to sacrifice neither.
For example, suppose the trie has two keys added:
0x00AABBCC
0x00AABB00
and I am now searching for a key that is not present, 0x00AABB11. I would like to find the first key present in the tree with a value <= the search key, which in this case would be the node for 0x00AABB00.
While I've thought of a possible algorithm for this, I am seeking concrete information on whether it is efficiently possible and/or whether there are known algorithms for this, which will no doubt be better than my own.
We can think of a bit trie as a binary search tree. In fact, it is a binary search tree. Take the 32-bit trie for example, and treat a left child as 0 and a right child as 1. At the root, the left subtree holds the numbers less than 0x80000000 and the right subtree holds the numbers no less than 0x80000000, and so on down each level. So you can use the same method you would use in a binary search tree to find the largest item not larger than the search key. Don't worry about the backtracking; it won't backtrack much, and it won't change the search complexity.
When a match fails in the bit trie, backtrack to the nearest ancestor where you skipped a 0-branch, then take the right-most descendant of that branch.
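Here is a sketch of that search in C (the node layout and names are mine, not from the question): record, on the way down, the deepest point where a smaller branch exists; on a mismatch, jump to that branch and take its right-most leaf.
#include <stdint.h>

struct tnode {
    struct tnode *child[2];   /* child[0] = bit 0, child[1] = bit 1 */
};

/* Largest key in the subtree rooted at n, whose keys share the bits of
   prefix above position bit: always descend right when possible. */
static uint32_t max_key(const struct tnode *n, uint32_t prefix, int bit)
{
    for (; bit >= 0; bit--) {
        int b = n->child[1] ? 1 : 0;
        prefix |= (uint32_t)b << bit;
        n = n->child[b];
    }
    return prefix;
}

/* Find the largest stored key <= key; returns 0 if there is none. */
static int predecessor(const struct tnode *root, uint32_t key, uint32_t *out)
{
    const struct tnode *n = root;
    const struct tnode *fallback = NULL;   /* deepest skipped 0-branch */
    uint32_t prefix = 0, fallback_prefix = 0;
    int fallback_bit = 0;

    for (int bit = 31; bit >= 0; bit--) {
        int b = (key >> bit) & 1;
        if (b == 1 && n->child[0] != NULL) {   /* we could go smaller here */
            fallback = n->child[0];
            fallback_bit = bit - 1;
            fallback_prefix = prefix;          /* bit at this level is 0 */
        }
        if (n->child[b] == NULL) {             /* mismatch: backtrack */
            if (fallback == NULL)
                return 0;
            *out = max_key(fallback, fallback_prefix, fallback_bit);
            return 1;
        }
        prefix |= (uint32_t)b << bit;
        n = n->child[b];
    }
    *out = prefix;                             /* exact match */
    return 1;
}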
If the data is static (you're not adding or removing items), then I'd take a good look at using a simple array with binary search. You sacrifice cache locality, but that might not be catastrophic. I don't see cache locality as an end in itself, but rather as a means of making the data structure fast.
You might get better cache locality by creating a balanced binary tree in an array. Position 0 is the root node, position 1 is left node, position 2 is right node, etc. It's the same structure you'd use for a binary heap. If you're willing to allocate another 4 bytes per node, you could make it a left-threaded binary tree so that if you search for X and end up at the next larger value, following that left thread would give you the next smaller value. All told, though, I don't see where this can outperform the plain array in the general case.
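For example, a lookup in that array layout needs no child pointers at all (a sketch; building the layout from sorted input is omitted):
#include <stddef.h>
#include <stdint.h>

/* Search a balanced BST stored implicitly in an array: root at index 0,
   children of node i at 2*i+1 and 2*i+2 (the binary-heap layout).
   Returns the index of key, or n if it is absent. */
size_t tree_search(const uint32_t *a, size_t n, uint32_t key)
{
    size_t i = 0;
    while (i < n) {
        if (a[i] == key)
            return i;
        i = 2 * i + 1 + (a[i] < key);   /* left child or right child */
    }
    return n;
}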
A lot depends on how sparse your data is and what the range is. If you're looking at a few thousand possible values in the range 0 to 4 billion, then the binary search looks pretty attractive. If you're talking about 500 million distinct values, then I'd look at allocating a bit array (500 megabytes) and doing a direct lookup with linear backward scan. That would give you very good cache locality.
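A sketch of that direct lookup with a linear backward scan (using the GCC/Clang count-leading-zeros builtin; a tuned version would skip runs of zero words faster):
#include <stdint.h>

/* bits[] holds one bit per possible key. Returns the largest set key
   <= key, or -1 if none. For a full 32-bit key space, bits[] occupies
   2^32 bits = 512 MB. */
int64_t bitarray_pred(const uint64_t *bits, uint32_t key)
{
    /* Mask off the bits above key in its word; unsigned wraparound makes
       (2 << 63) - 1 come out as all ones when key % 64 == 63. */
    uint64_t mask = (2ull << (key & 63)) - 1;
    for (int64_t w = key >> 6; w >= 0; w--, mask = ~0ull) {
        uint64_t word = bits[w] & mask;
        if (word)
            return (w << 6) + (63 - __builtin_clzll(word));
    }
    return -1;
}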
A bit trie walks 32 nodes in the best case when the item is found.
A million entries in a red-black tree like std::map or java.util.TreeMap would only require log2(1,000,000), or roughly 20 nodes per query, worst case. And you do not always need to go to the bottom of the tree, making the average case appealing.
When backtracking to find <=, the difference is even more pronounced.
The fewer entries you have, the better the case for a red-black tree.
At a minimum, I would compare any solution to a red-black tree.
