Global cache of strings in Rust

I'm trying to learn the "right" (or idiomatic) way to do some things in Rust, after a long long time using only languages with garbage collection. I've been trying to translate some Python code as an exercise, and I've come across something that has left me wondering how to do it.
I have a program that fetches and processes a large amount of data - usually tens of millions of entries. Each entry has a tag (a small string), and there are hundreds of thousands of possible tags (it's a fixed set), but in any one run we never process data with more than about 50 different tags. So what I started to think was: instead of duplicating these strings millions of times, could I have some kind of "global" cache? Then I could just keep references to them, and avoid moving these strings all around as I move the entries.
I say global, but I don't necessarily mean exactly global: maybe something inside a struct that lives for the duration of the whole program, which could hold a HashSet of all the strings observed so far, so that we could just reference those strings (via a reference or an Rc pointer).
Since I know that my tags are quite small (less than 10 ASCII characters most of the time), I'll probably just copy them; that shouldn't be the bottleneck in my program. But what if they were a bit larger? Would it make sense to do something like I said?
Also, I know that since I'm parsing external data, I'll actually be parsing and creating these duplicated strings all the time anyway, so just keeping them instead of discarding them wouldn't make much of a difference in terms of performance. But maybe I could reduce memory usage a lot by not duplicating them for each entry (if, for example, I'm keeping a lot of entries in memory).
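To make the idea concrete, here's a rough, untested sketch of the kind of cache I'm imagining (the names are made up, and Rc<str> is just one possible representation):

    use std::collections::HashSet;
    use std::rc::Rc;

    // Each distinct tag string is stored once; entries hold cheap Rc clones
    // that all point at the same shared allocation.
    #[derive(Default)]
    struct TagCache {
        tags: HashSet<Rc<str>>,
    }

    impl TagCache {
        fn intern(&mut self, tag: &str) -> Rc<str> {
            if let Some(existing) = self.tags.get(tag) {
                return Rc::clone(existing); // already seen: just bump the refcount
            }
            let shared: Rc<str> = Rc::from(tag); // first sighting: one allocation
            self.tags.insert(Rc::clone(&shared));
            shared
        }
    }

    fn main() {
        let mut cache = TagCache::default();
        let a = cache.intern("sometag");
        let b = cache.intern("sometag"); // no new String is created
        assert!(Rc::ptr_eq(&a, &b));     // both point at the same memory
    }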
Sorry for the long text with lots of hypothetical ideas, I'm just more and more curious about memory management the more I learn about Rust!

Related

When is reference-counting needed in a single threaded application that doesn't model circular data structures?

Rust can handle reference counting very elegantly with Rc. It seems many members of the community prefer not to use it and instead rely on the ownership/borrowing semantics of the language, even though Rc can result in programs that are simpler to write. But aside from circular references, is it ever necessary?
Clearly things get more complex across threads, so to simplify what I'm trying to learn: "in a single-threaded application, is reference-counting ever required, except as a write-time optimization?" Is this an anti-pattern at the higher levels of Rust skill?
I think that @kmdreko is basically correct, but an example that's harder to model without Rc would be something like filtering queries over "heavy" data, down into multiple owners: basically, any time you split a "set" into multiple other filtered sets, where the filters may overlap.
For example, if you gathered 1000 images, and you invoked OpenCV (or whatever) to find which images in the original set had birds, and which had a tree in them, it's plausible that the result sets overlap. So if you don't use Rc (or similar), you're either copying these potentially huge images (probably not wanted) or keeping references to them. OK, but then you can't ever free the original set of images, since it's what owns the images that the result sets are referencing. And while there are workarounds for that (only preserve an image when one or more filters says to), that leads to deeply coupling otherwise separate parts of your program because of the memory model you chose. Or just use Rc: the data is easily accessible to others, whatever brought it in can be "done" with the set and free everything not referenced, while the parts of the set still in use are preserved.
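A minimal sketch of that shape (Image and the detector functions here are made-up stand-ins):

    use std::rc::Rc;

    // Stand-in for a large decoded image.
    struct Image {
        pixels: Vec<u8>,
    }

    fn has_birds(_img: &Image) -> bool { true }  // pretend classifier
    fn has_trees(_img: &Image) -> bool { false } // pretend classifier

    fn main() {
        // The original set owns the images behind Rc.
        let all: Vec<Rc<Image>> = (0..1000)
            .map(|_| Rc::new(Image { pixels: vec![0; 1_000_000] }))
            .collect();

        // Overlapping filtered sets: Rc::clone copies a pointer, not the pixels.
        let birds: Vec<Rc<Image>> = all.iter().filter(|i| has_birds(i)).cloned().collect();
        let trees: Vec<Rc<Image>> = all.iter().filter(|i| has_trees(i)).cloned().collect();

        // Dropping the original set frees only the images no filter kept.
        drop(all);
        println!("{} bird images, {} tree images still alive", birds.len(), trees.len());
    }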
I agree that Rc and related can be over-used. I encounter that all the freaking time in C++, with people throwing shared_ptr all over the place and thinking they're "good" at that point (until the application or DLL shuts down, and the calls of "why is my application crashing on exit all the time??" start coming). That said, even single-threaded, sometimes ownership needs to be shared so it can be passed on to an unknown number of consumers for later.
Like with most tools, there are ways to work around certain parts, but that may cause more pain instead.
You should use Rc when you need shared ownership. Using Arc is no different, but it's perhaps more common since standard threading mechanisms like std::thread::spawn require ownership.
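A minimal sketch of that situation:

    use std::sync::Arc;
    use std::thread;

    fn main() {
        let data = Arc::new(vec![1, 2, 3]);

        let handles: Vec<_> = (0..2)
            .map(|_| {
                let data = Arc::clone(&data); // cheap pointer copy per thread
                // spawn requires the closure to own what it captures ('static),
                // which the moved Arc satisfies without copying the Vec.
                thread::spawn(move || data.iter().sum::<i32>())
            })
            .collect();

        for h in handles {
            println!("sum = {}", h.join().unwrap());
        }
    }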
You would still use Rc in single-threaded environments if your data is more graph-like (which could have circular dependencies, but doesn't have to). In such cases, you cannot use basic references. You could represent graph-like relationships in a normalized fashion and use indexes or ids in lieu of references (see the sketch below), but that may not always be favorable or possible. Of course, if a structure can be represented hierarchically, Rc is unnecessary.
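A minimal sketch of that normalized, index-based representation (payload types made up for illustration):

    // Nodes hold no references to each other, so the borrow checker is never
    // involved in the relationships; edges are plain indexes into `nodes`.
    struct Graph {
        nodes: Vec<String>,
        edges: Vec<(usize, usize)>, // (from, to)
    }

    fn main() {
        let g = Graph {
            nodes: vec!["a".into(), "b".into(), "c".into()],
            // A cycle, which plain `&` references could not express safely:
            edges: vec![(0, 1), (1, 2), (2, 0)],
        };
        for &(from, to) in &g.edges {
            println!("{} -> {}", g.nodes[from], g.nodes[to]);
        }
    }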
In addition, there are many reasons why an external function or structure would require a generic type or function to be 'static. In those situations, either moving, clone()-ing, or sharing (with Rc) is the solution.
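A contrived sketch of that situation, where a 'static bound on stored callbacks is satisfied by moving Rc clones into the closures (Registry here is hypothetical):

    use std::rc::Rc;

    // The 'static bound means a stored closure may not borrow from locals.
    struct Registry {
        callbacks: Vec<Box<dyn Fn() + 'static>>,
    }

    fn main() {
        let data = Rc::new(vec![1, 2, 3]);
        let mut registry = Registry { callbacks: Vec::new() };

        for _ in 0..2 {
            let data = Rc::clone(&data); // share, don't deep-copy
            registry.callbacks.push(Box::new(move || {
                println!("sum = {}", data.iter().sum::<i32>());
            }));
        }

        for cb in &registry.callbacks {
            cb();
        }
    }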
I wouldn't say Rc is itself an anti-pattern; it's simply another tool. But there are sometimes ways to avoid it that result in clearer relationships and/or better performance. I often see Rc used liberally by Rust newcomers who haven't fully grasped the borrow checker.

Sharing NetworkX graph between processes with no additional memory cost (read-only)

I am using Python's multiprocessing module. I have a networkx graph which I wish to share between many subprocesses. These subprocesses do not modify the graph in any way, and only read its attributes (nodes, edges, etc.). Right now every subprocess has its own copy of the graph, but I am looking for a way to share the graph between all of them, reducing the memory footprint of the entire program. Since the computations are very CPU-intensive, I would want this to be done in a way that does not cause big performance issues (avoiding locks if possible, etc.).
Note: I want this to work on various operating systems, including Windows, which means copy-on-write (COW) via fork does not help (if I understand this correctly, it probably wouldn't have helped regardless, since reference counting writes to the shared pages).
I found https://docs.python.org/3/library/multiprocessing.html#proxy-objects and
https://docs.python.org/3/library/multiprocessing.shared_memory.html, but I'm not sure which (or if either) is suitable. What is the right way to go about this? I'm using python 3.8, but can use later versions if helpful.
There are a few options for sharing data in Python during multiprocessing, but you may not be able to do exactly what you want to.
In C++ you could use simple shared memory for ints, floats, structs, etc. Python's shared memory manager does allow this type of sharing for simple objects, but it doesn't work for classes or anything more complex than a list of base types. For shared complex Python objects, you really only have a few choices:
1. Create a copy of the object in your forked process (which it sounds like you don't want to do).
2. Put the object in a centralized process (i.e., Python's Manager / proxy objects) and interact with it via pipes and pickled data.
3. Convert your networkX graph to a list of simple ints and put it in shared memory.
What works for you is going to depend on some specifics. Option #2 has a bit of overhead, because every time you need to access the object, data has to be pickled and piped to the centralized process, and the result pickled/piped back. This works well if you only need a small portion of the centralized data at a time and your processing steps are relatively long (compared to the pickle/pipe time).
Option #3 could be a lot of work. You would fundamentally be changing the data format from networkX objects to a list of ints so it's going to change the way you do processing a lot.
A while back I put together PythonDataServe, which allows you to serve your data to multiple processes from another process. It's a very similar solution to #2 above. This type of approach works if you only need a small portion of the data at a time, but if you need it all, it's much easier to just create a local copy.

Strings and Strands in MoarVM

When running Raku code on Rakudo with the MoarVM backend, is there any way to print information about how a given Str is stored in memory from inside the running program? In particular, I am curious whether there's a way to see how many Strands currently make up the Str (whether via Raku introspection, NQP, or something that accesses the MoarVM level; does such a thing even exist at runtime?).
If there isn't any way to access this info at runtime, is there a way to get at it through output from one of Rakudo's command-line flags, such as --target, or --tracing? Or through a debugger?
Finally, does MoarVM manage the number of Strands in a given Str? I often hear (or say) that one of Raku's super powers is that it can index into Unicode strings in O(1) time, but I've been thinking about the pathological case, and it feels like it would be O(n). For example,
(^$n).map({~rand}).join
seems like it would create a Str with a length proportional to $n that consists of $n Strands; and, if I'm understanding the data structure correctly, that means that indexing into this Str would require checking the length of each Strand, for a time complexity of O(n). But I know that it's possible to flatten a Strand-ed Str; would MoarVM do something like that in this case? Or have I misunderstood something more basic?
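To make my mental model concrete, here's the toy structure I'm picturing (sketched in Rust purely as illustration; MoarVM's actual representation is in C and surely differs):

    // A "strand-ed" string as a list of pieces borrowed from other strings.
    // Naive indexing must walk the strands to find the one holding the index,
    // so with $n strands the worst case is O(n).
    struct Strands<'a> {
        strands: Vec<&'a str>,
    }

    impl<'a> Strands<'a> {
        fn char_at(&self, mut index: usize) -> Option<char> {
            for strand in &self.strands {
                let len = strand.chars().count();
                if index < len {
                    return strand.chars().nth(index);
                }
                index -= len; // skip past this strand
            }
            None
        }
    }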
When running Raku code on Rakudo with the MoarVM backend, is there any way to print information about how a given Str is stored in memory from inside the running program?
My educated guess is yes, as described below for App::MoarVM modules. That said, my education came from a degree I started at the Unseen University, and a wizard had me expelled for guessing too much, so...
In particular, I am curious whether there's a way to see how many Strands currently make up the Str (whether via Raku introspection, NQP, or something that accesses the MoarVM level; does such a thing even exist at runtime?).
I'm 99.99% sure strands are purely an implementation detail of the backend, and there'll be no Raku or NQP access to that information without MoarVM specific tricks. That said, read on.
If there isn't any way to access this info at runtime
I can see there is access at runtime via MoarVM.
is there a way to get at it through output from one of Rakudo's command-line flags, such as --target, or --tracing? Or through a debugger?
I'm 99.99% sure there are multiple ways.
For example, there's a bunch of strand debugging code in MoarVM's ops.c file starting with #define MVM_DEBUG_STRANDS ....
Perhaps more interesting is what appears to be a veritable goldmine of sophisticated debugging and profiling features built into MoarVM, plus what appear to be Rakudo-specific modules that drive those features, presumably via Raku code. For a dozen or so articles discussing some aspects of those features, I suggest reading timotimo's blog. Browsing GitHub, I see ongoing commits related to MoarVM's debugging features going back years and on into 2021.
Finally, does MoarVM manage the number of Strands in a given Str?
Yes. I can see that the string handling code (some links are below), which was written by samcv (extremely smart and careful) and, I believe, reviewed by jnthn, has logic limiting the number of strands.
I often hear (or say) that one of Raku's super powers is that it can index into Unicode strings in O(1) time, but I've been thinking about the pathological case, and it feels like it would be O(n).
Yes, if a backend that supported strands did not manage the number of strands.
But for MoarVM I think the intent is to set an absolute upper bound with #define MVM_STRING_MAX_STRANDS 64 in MoarVM's MVMString.h file, plus logic that checks against that (and other characteristics of strings; see this else if statement as an exemplar). But the logic is sufficiently complex, and my C chops sufficiently meagre, that I am nowhere near being able to express confidence in that, even if I can say that that appears to be the intent.
For example, (^$n).map({~rand}).join seems like it would create a Str with a length proportional to $n that consists of $n Strands
I'm 95% confident that the strings constructed by simple joins like that will be O(1).
This is based on me thinking that a Raku/NQP level string join operation is handled by MVM_string_join, and my attempts to understand what that code does.
But I know that it's possible to flatten a Strand-ed Str; would MoarVM do something like that in this case?
If you read the code you will find it's doing very sophisticated handling.
Or have I misunderstood something more basic?
I'm pretty sure I will have misunderstood something basic so I sure ain't gonna comment on whether you have. :)
As far as I understand it, the fact that MoarVM implements strands (i.e., that concatenating two strings will only result in the creation of a strand that consists of "references" to the original strings) is really just that: an implementation detail.
You can implement the Raku Programming Language without needing to implement strands. Therefore there is no way to introspect this, at least to my knowledge.
There has been a PR to expose the nqp:: op that would actually concatenate strands into a single string, but that has been refused / closed: https://github.com/rakudo/rakudo/pull/3975

Massive number of XML edits

I need to load a mid-sized XML file into memory, make many random-access modifications to it (perhaps hundreds of thousands), then write the result to standard output. Most of these modifications will be node insertions/deletions, as well as character insertions/deletions within the text nodes. These XML files will be small enough to fit into memory, but large enough that I won't want to keep multiple copies around.
I am trying to settle on the architecture/libraries and am looking for suggestions.
Here is what I have come up with so far-
I am looking for the ideal XML library for this, and so far I haven't found anything that seems to fit the bill. The libraries generally store nodes in Haskell lists and text in Haskell Data.Text objects. This only allows linear-time Node and Text inserts, and I believe the Text inserts will have to do a full rewrite on every insert/delete.
I think storing both nodes and text in sequences seems to be the way to go: it supports O(log N) inserts and deletes, and only needs to rewrite a small fraction of the tree on each alteration. None of the XML libs are based on this though, so I will have to either write my own lib, or use one of the other libs to parse and then convert to my own form (given how easy it is to parse XML, I would almost just as soon do the former, rather than have a shadow parse of everything).
I had briefly considered the possibility that this might be a rare case where Haskell might not be the best tool, but then I realized that mutability doesn't offer much of an advantage here, because my modifications aren't char replacements but rather insertions/deletions. If I wrote this in C, I would still need to store the strings/nodes in some sort of tree structure to avoid large byte moves on each insert/delete. (Actually, Haskell probably has some of the best tools to deal with this, but I would be open to suggestions of a better choice of language for this task if you feel there is one.)
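To illustrate that point concretely (a crude chunked-buffer sketch, in Rust only because it makes the byte moves explicit; Data.Sequence's finger tree is far more sophisticated):

    // Storing text as small chunks means an insert shifts one small chunk
    // instead of the entire buffer.
    struct ChunkedText {
        chunks: Vec<Vec<u8>>,
    }

    impl ChunkedText {
        fn insert(&mut self, mut pos: usize, byte: u8) {
            for chunk in &mut self.chunks {
                if pos <= chunk.len() {
                    chunk.insert(pos, byte); // O(chunk size), not O(total size)
                    return;
                }
                pos -= chunk.len();
            }
            self.chunks.push(vec![byte]); // append past the end
        }
    }

(Finding the right chunk here is still linear in the number of chunks; a balanced structure like Data.Sequence makes that part logarithmic too.)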
To summarize-
Is Haskell the right choice for this?
Does any Haskell lib support fast (O(log N)) node/text inserts and deletes?
Is Sequence the best data structure to store a list of items (in my case, Nodes and Chars) for fast inserts and deletes?
I will answer my own question-
I chose to wrap a Text.XML tree with a custom object that stores nodes and text in Data.Sequence objects. Because Haskell is lazy, I believe it only holds the Text.XML data in memory temporarily, node by node as the data streams in, and it is garbage collected before I actually start any real work modifying the Sequence trees.
(It would be nice if someone here could verify that this is how Haskell works internally, but I've implemented things, and the performance seems reasonable: not great, at about 30k inserts/deletes per second, but it should do.)

nodejs buffers vs typed arrays

What is more efficient: Node.js buffers or typed arrays? What should I use for better performance?
I think that only those who know the internals of V8 and Node.js could answer this question.
A Node.js Buffer should be more efficient than a typed array. The reason is simply that when a new Node.js Buffer is created, it does not need to be initialized to all 0's, whereas the HTML5 spec states that typed arrays must have their values initialized to 0. Allocating the memory and then setting all of it to 0 takes more time.
In most applications picking either one won't matter. As always, the devil lies in the benchmarks :) However, I recommend that you pick one and stick with it. If you're often converting back and forth between the two, you'll take a performance hit.
Nice discussion here: https://github.com/joyent/node/issues/4884
There are a few things that I think are worth mentioning:
Buffer instances are Uint8Array instances but there are subtle incompatibilities with the TypedArray specification in ECMAScript 2015. For example, while ArrayBuffer#slice() creates a copy of the slice, the implementation of Buffer#slice() creates a view over the existing Buffer without copying, making Buffer#slice() far more efficient.
When using Buffer.allocUnsafe() and Buffer.allocUnsafeSlow(), the memory isn't zeroed-out (as many have pointed out already). So make sure you completely overwrite the allocated memory, or old data can be leaked when the Buffer is read.
TypedArrays are not readable right away; you'll need a DataView for that. This means you might need to rewrite your code if you were to migrate back to Buffer. The adapter pattern could help here.
You can use for-of on a Buffer, but not on TypedArrays. You also won't have the classic entries(), values(), keys(), and length support.
Buffer is not supported in the frontend, while TypedArray may well be. So if your code is shared between frontend and backend, you might consider sticking to one.
More info in the Node.js Buffer documentation.
This is a tough one, but I think it will depend on what you are planning to do with them and how much data you plan to work with.
Typed arrays themselves need Node buffers, but they are easier to play with, and you can overcome the 1 GB limit (kMaxLength = 0x3fffffff).
If you are doing common stuff such as iterating, setting, getting, slicing, etc., then typed arrays should be your best shot for performance, though not for memory (especially if you are dealing with float and 64-bit integer types).
In the end, probably only a good benchmark of what you want to do can shed real light on this doubt.
