Massive number of XML edits - haskell

I need to load a mid-sized XML file into memory, make many random access modifications to the file (perhaps hundreds of thousands), then write the result out to STDIO. Most of these modifications will be node insertion/deletions, as well as character insertion/deletions within the text nodes. These XML files will be small enough to fit into memory, but large enough that I won't want to keep multiple copies around.
I am trying to settle on the architecture/libraries and am looking for suggestions.
Here is what I have come up with so far-
I am looking for the ideal XML library for this, and so far, I haven't found anything that seems to fit the bill. The libraries generally store nodes in Haskell lists, and text in Haskell Data.Text objects. This only allows linear-time node and text inserts, and I believe that the Text inserts will have to do a full rewrite on every insert/delete.
I think storing both nodes and text in sequences (Data.Sequence) is the way to go. It supports O(log N) inserts and deletes, and only needs to rewrite a small fraction of the tree on each alteration. None of the XML libs are based on this though, so I will have to either write my own lib, or just use one of the other libs to parse then convert it to my own form (given how easy it is to parse XML, I would almost just as soon do the former, rather than have a shadow parse of everything).
I had briefly considered the possibility that this might be a rare case where Haskell might not be the best tool.... But then I realized that mutability doesn't offer much of an advantage here, because my modifications aren't char replacements, but rather add/deletes. If I wrote this in C, I would still need to store the strings/nodes in some sort of tree structure to avoid large byte moves for each insert/delete. (Actually, Haskell probably has some of the best tools to deal with this, but I would be open to suggestions of a better choice of language for this task if you feel there is one).
To summarize-
Is Haskell the right choice for this?
Does any Haskell lib support fast node/text insert/deletes (log(N))?
Is sequence the best data structure to store a list of items (in my case, Nodes and Chars) for fast insert and deletes?

I will answer my own question-
I chose to wrap a Text.XML tree with a custom object that stores nodes and text in Data.Sequence objects. Because Haskell is lazy, I believe it only temporarily holds the Text.XML data in memory, node by node as the data streams in, and that data is then garbage collected before I actually start any real work modifying the Sequence trees.
(It would be nice if someone here could verify that this is how Haskell would work internally, but I've implemented things, and the performance seems to be reasonable, not great- about 30k insert/deletes per second, but this should do).


Global cache of strings in Rust

I'm trying to learn the "right" (or idiomatic) way to do some things in Rust, after a long long time using only languages with garbage collection. I've been trying to translate some Python code as an exercise, and I've come across something that has left me wondering how to do it.
I have a program that fetches and processes a large amount of data - usually tens of millions of entries. Each entry has a tag (a small string), and there are hundreds of thousands of possible tags (it's a fixed set). But in this program, we never process data having more than like 50 different tags each time. So what I started thinking was: instead of duplicating these strings millions of times, could I have some kind of "global" cache? So that I could just have references to them, and avoid moving these strings all around as I move the entries?
I say global, but I don't necessarily mean exactly global, but maybe something inside some struct that lives for the duration of the whole program, which could have a HashSet with all the strings observed so far, and we could just reference these strings (a reference or an Rc pointer).
Since I know that my tags are quite small (less than 10 ASCII characters most of the time), I'll probably just copy them; that shouldn't be the bottleneck in my program. But what if they were a bit larger? Would it make sense to do something like I said?
Also, I know that since I'm parsing external data, I'll actually be parsing and creating these duplicated strings all the time, so just keeping it instead of discarding it wouldn't make much of a difference in terms of performance, but maybe I could reduce memory usage a lot by not duplicating these for each entry (if for example I'm keeping a lot of these in memory).
Sorry for the long text with lots of hypothetical ideas, I'm just more and more curious about memory management the more I learn about Rust!

OpenMDAO version 2.x File Variable Workaround

I'm new to OpenMDAO and started off with the newest version (version 2.3.1 at the time of this post).
I'm working on the setup to a fairly complicated aero-structural optimization using several external codes, specifically NASTRAN and several executables (compiled C++) that post process NASTRAN results.
Ideally I would like to break these down into multiple components to generate my model, run NASTRAN, post process the results, and then extract my objective and constraints from text files. All of my existing interfaces are through text file inputs and outputs. According to the GitHub page, the file variable feature that existed in an old version (v1.7.4) has not yet been implemented in version 2.
Is there a good workaround for this until the feature is added?
So far the best solution I've come up with is to group everything into one large component that maps input variables to final output by running everything instead of multiple smaller components that break up the process.
File variables themselves are no longer implemented in OpenMDAO. They caused a lot of headaches and didn't fundamentally offer useful functionality because they required serializing the whole file into memory and passing it around as string buffers. The whole process was just duplicative and inefficient, since the files were ultimately getting written and read from disk far more times than were necessary.
In your case since you're setting up an aerostructural problem, you really wouldn't want to use them anyway. You will want to have access to either analytic or at least semi-analytic total derivatives for efficient execution. So what that means is that the boundary of each component must be composed of only floating point variables or arrays of floating point variables.
What you want to do is wrap your analysis tools using ExternalCodeImplicitComp, which tells OpenMDAO that the underlying analysis is actually implicit. Then, even if you use finite-differences to compute the partial derivatives you only need to FD across the residual evaluation. For NASTRAN, this might be a bit tricky to set up, since I don't know if it directly exposes the residual evaluation, but if you can get to the stiffness matrix then you should be able to compute it. You'll be rewarded for your efforts with greatly improved efficiency and accuracy.
Inside each wrapper, you can use the built-in file wrapping tools to read through the files that were written and pull out the numerical values, which you then push into the outputs vector. For NASTRAN you might consider using pyNastran, instead of the file wrapping tools, to save yourself some work.
If you can't expose the residual evaluation, then you can use ExternalCodeComp instead and treat the analysis as if it was explicit. This will make your FD more costly and less accurate, but for linear analyses you should be ok (still not ideal, but better than nothing).
The key idea here is that you're not asking OpenMDAO to pass around file objects. You are wrapping each component with only numerical data at its boundaries. This has the advantage of allowing OpenMDAO's automatic derivative features to work (even if you use FD to compute the partial derivatives). It also has a secondary advantage: if you (hopefully) graduate to in-memory wrappers for your codes, you won't have to update your models. Only the component's internal code will change.
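As a rough illustration of the "pull out the numerical values" step, here is a minimal standard-library sketch. It does not use OpenMDAO's own file wrapping tools or pyNastran. The `name = value` layout, the function name `read_named_values`, and the variable names are all assumptions made up for the example; a real NASTRAN or post-processor output will need its own parsing rules.

```python
import re

# Matches plain and scientific-notation numbers such as 42, -3.5, 1.25e+08.
_NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"


def read_named_values(text, names):
    """Extract one float per name from text laid out as 'name = value' or 'name: value'.

    The layout is a hypothetical one chosen for this sketch.
    """
    values = {}
    for name in names:
        pattern = rf"^\s*{re.escape(name)}\s*[=:]\s*({_NUMBER})"
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            raise KeyError(f"no value found for {name!r}")
        values[name] = float(match.group(1))
    return values


if __name__ == "__main__":
    sample = "max_stress = 1.25e+08\nmass: 42.5\n"
    print(read_named_values(sample, ["max_stress", "mass"]))
```

Inside a component wrapper, the returned floats are what would be assigned to the outputs, so that only numbers cross the component boundary.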

Best way to search through a very big dataset?

I have text files that contain about 12 GB worth of tweets and need to search through this dataset by keyword. What is the best way to go about doing this?
I'm familiar with Java, Python, and R. I don't think my computer can handle the files if, for example, I write some sort of script that goes through each text file in Python.
Oh, Python, or any other language, can most certainly do it. It might take a few minutes, but the job will get done. I suggest that the best approach to your problem is "straight ahead": write scripts that process the files one line at a time.
Although "12 gigabytes" sounds enormous to us, to any modern-day machine it's really not that big at all.
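A minimal sketch of the line-at-a-time approach, assuming plain UTF-8 text files and a case-insensitive substring match. The function name `search_files` is made up for illustration. Because the file object is iterated lazily, only one line is held in memory at a time, regardless of file size.

```python
from typing import Iterable, Iterator, Tuple


def search_files(paths: Iterable[str],
                 keywords: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, line) for every line containing any keyword."""
    needles = [keyword.lower() for keyword in keywords]
    for path in paths:
        # errors="replace" keeps the scan going past malformed bytes.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(needle in lowered for needle in needles):
                    yield path, line_number, line.rstrip("\n")
```

Usage would look like `for path, number, line in search_files(["tweets1.txt"], ["python"]): ...`, where the file name is a placeholder.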
Build hashes (associative arrays) in memory as needed. Generally avoid database-operations (other than "SQLite" database files, maybe ...), but, if you happen to find yourself needing "indexed file storage," SQLite is a terrific tool.
. . . with one very important caveat: when using SQLite, use transactions, even when reading. Outside an explicit transaction, SQLite treats every statement as its own transaction, so each write is physically committed to disk and each read re-acquires its lock and revalidates its cache. Inside a transaction it can batch that work, as you might have expected it to do all the time. (And then, "that sucker's f-a-s-t...!")
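Here is a small sketch of that advice using Python's standard `sqlite3` module. The table name `hits`, its columns, and the function names are assumptions for illustration. The connection is opened with `isolation_level=None` so that the module does not start transactions implicitly and the `BEGIN`/`COMMIT` boundaries are visible in the code.

```python
import sqlite3


def store_hits(db_path, rows):
    """Insert (keyword, tweet) rows inside one explicit transaction."""
    # isolation_level=None: autocommit mode, so we control transactions ourselves.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS hits (keyword TEXT, tweet TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS hits_by_keyword ON hits (keyword)")
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO hits (keyword, tweet) VALUES (?, ?)", rows)
        conn.execute("COMMIT")  # a single commit for the whole batch
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return conn


def tweets_for(conn, keyword):
    """Read all tweets for a keyword, also inside an explicit transaction."""
    conn.execute("BEGIN")
    try:
        cursor = conn.execute(
            "SELECT tweet FROM hits WHERE keyword = ? ORDER BY rowid", (keyword,)
        )
        return [row[0] for row in cursor]
    finally:
        conn.execute("COMMIT")
```

For bulk loads the gain comes from committing once per batch rather than once per row; how much it helps reads depends on how many queries are grouped together.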
If you want to be exact, then you need to look at every file at least once, so if your computer can't take that load, then say goodbye to exactness.
Another approach would be to use approximation algorithms, which are faster than the exact ones but come at the expense of losing accuracy.
That should get you started and I will stop my answer here, since the topic is just too broad to continue with from here.

How should I make my parser concurrent?

I'm working on implementing a music programming language parser in Clojure. The idea is that you run the parser program with a text file as a command-line argument; the text file contains code in this music language I'm developing; the parser interprets the code and figures out what "instrument instances" have been declared, and for each instrument instance, it parses the code and returns a sequence of musical "events" (notes, chords, rests, etc.) that the instrument does. So before that last step, we have multiple strings of "music code," one string per instrument instance.
I'm somewhat new to Clojure and still learning the nuances of how to use reference types and threads/concurrency. My parser is going to be doing some complex parsing, so I figured it would benefit from using concurrency to boost performance. Here are my questions:
The simplest way to do this, it seems, would be to save the concurrency for after the instruments are "split up" by the initial parse (a single-thread operation), then parse each instrument's code on a different thread at the same time (rather than wait for each instrument to finish parsing before moving onto the next). Am I on the right track, or is there a more efficient and/or logical way to structure my "concurrency plan"?
What options do I have for how to implement this concurrent parsing, and which one might work the best, either from a performance or a code maintenance standpoint? It seems like it could be as simple as: (map #(future (process-music-code %)) instrument-instances), but I'm not sure if there is a better way to do it like with an agent, or manual threads via Java interop, or what. I'm new to concurrent programming, so any input on different ways to do this would be great.
From what I've read, it seems that Clojure's reference types play an important role in concurrent programming, and I can see why, but is it always necessary to use them when working with multiple threads? Should I worry about making some of my data mutable? If so, what in particular should be mutable in the code for the parser I'm writing, and what reference type(s) would be best suited for what I'm doing? The nature of the way my program will work (user runs the program with a text file as an argument -- program processes it and turns it into audio) makes it seem like I don't need anything to be mutable, since the input data never changes, so my gut tells me I won't need to use any reference types, but then again, I might not fully understand the relationship between reference types and concurrency in Clojure.
I would suggest that you might be distracting yourself from more important things (like working out the details of your music language) by premature optimization. It would be better to write the simplest, easiest-to-code parser which you can first, to get up and running. If you find it too slow, then you can look at how to optimize for better performance.
The parser should be fairly self-contained, and will probably not take a whole lot of code anyways, so even if you later throw it out and rewrite it, it will not be a big loss. And the experience of writing the first parser will help if and when you write the second one.
Other points:
You are absolutely right about reference types -- you probably won't need any. Your program is a compiler -- it takes input, transforms it, writes output, then exits. That is the ideal situation for pure functional programming, with nothing mutable and all flow of data going purely through function arguments and return values.
Using a parser generator is usually the quickest way to get a working parser, but I haven't found a really good parser generator for Clojure. Parsley has a really nice API, but it generates LR(0) parsers, which are almost useless for anything which does not have clear, unambiguous markers for the beginning/end of each "section". (Like the way S-expressions open and close with parens.) There are a couple parser combinator libraries out there, like squarepeg, but I don't like their APIs and prefer to write my own hand-coded, recursive-descent parsers using my own implementation of something like parser combinators. (They're not fast, but the code reads really well.)
I can only support Alex D's point that writing parsers is an excellent exercise. You should definitely do it in C one time. From my own experience, it's a lot of debugging training at least.
Aside from that, given that you are in the beautiful world of Clojure notice the following:
Your parser will transform ordinary strings to data structures, like
{:command :declare,
:args {:name "bazooka-violin",
In Clojure you can read such data structures easily from EDN files. Possibly it would be a more valuable approach to play around with finding suitable structures directly before you constrain the syntax of your language too much for it to be flexible for later changes in the way your language works.
Don't ever think about writing for performance. Unless your user describes the collected works of Bach in a file, it's unlikely that it will take more than a second to parse.
If you write your interpreter in a functional, modular and concise way, it should be easy to decompose it into steps that can be parallelized using various techniques from pmap to core.reducers. The same of course goes for all other code and your parser as well (if multi-threading is a necessity there).
Even Clojure is not compiled in parallel. However, it supports recompilation (on the JVM), which is a far more valuable feature to think about.
As an aside, I've been reading The Joy of Clojure, and I just learned that there is a nifty clojure.core function called pmap (parallel map) that provides a nice, easy way to perform an operation in parallel on a sequence of data. Its syntax is just like map, but the difference is that it performs the function on each item of the sequence in parallel and returns a lazy sequence of the results! This can generally give a performance boost, but it depends on the inherent performance cost of coordinating the sequence result, so whether or not pmap gives a performance boost will depend on the situation.
At this stage in my MPL parser, my plan is to map a function over a sequence of instruments/music data, transforming each instrument's music data from a parse tree into audio. I have no idea how costly this transformation will be, but if it turns out that it takes a while to generate the audio for each instrument individually, I suppose I could try changing my map to pmap and see if that improves performance.

what is the fastest word search on index?

I'm coding a query engine to search through a very large sorted index file. My plan is to use binary search together with Levenshtein-distance word comparison to find a match. Is there a better or faster way than this? Thanks.
You may want to look into tries; in many cases they are faster than binary search.
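For reference, a minimal in-memory trie sketch. The class and method names are made up for illustration, and it handles exact and prefix lookups only, not fuzzy matching. Lookup cost depends on the length of the query word rather than on the number of stored words.

```python
class Trie:
    """A character trie supporting exact and prefix lookups."""

    def __init__(self):
        self.children = {}    # maps a character to the child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = Trie()
                node.children[ch] = child
            node = child
        node.is_word = True

    def _walk(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Return all stored words starting with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```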
If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.
Since you're looking at similar words, maybe you can group the words into many files by something like their soundex, giving you much shorter lists of words to compute the distances to.
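The grouping idea can be sketched as follows, with in-memory buckets standing in for the "many files". The function names are illustrative. This uses the common American Soundex rules and a standard dynamic-programming Levenshtein distance. One limitation to keep in mind: Soundex keeps the first letter, so a misspelling in the first character lands in a different bucket and will be missed.

```python
from collections import defaultdict

_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code of word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset this, so repeats after a vowel count
    return (code + "000")[:4]


def levenshtein(a, b):
    """Edit distance between a and b, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def build_buckets(words):
    buckets = defaultdict(list)
    for word in words:
        buckets[soundex(word)].append(word)
    return buckets


def similar(query, buckets, max_distance=1):
    """Compare the query only against words sharing its Soundex code."""
    candidates = buckets.get(soundex(query), [])
    return [w for w in candidates if levenshtein(query, w) <= max_distance]
```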
In your shoes, I would not reinvent the wheel - rather I'd reach for the appropriate version of the Berkeley DB (now owned by Oracle, but still open-source just as it was back when it was owned and developed by the UC at Berkeley, and later when it was owned and developed by Sleepycat;-).
The native interfaces are C and Java (haven't tried the latter actually), but the Python interface is also pretty good (actually better now that it's not in Python's standard library any more, as it can better keep pace with upstream development;-), C++ is of course not a problem, etc etc -- I'm pretty sure you can use it from most any language.
And, you get your choice of "BTree" (actually more like a B*Tree) and hash (as well as other approaches that don't help in your case) -- benchmark both with realistic data, btw, you might be surprised (one way or another) at performance and storage costs.
If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea -- the original one was Chord but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones so I can't really advise you here).
After your comment on David's answer, I'd say that you need two different indexes:
the 'inverted index', where you keep all the words, each with a list of places found
an index into that file, to quickly find any word. Should easily fit in RAM, so it can be a very efficient structure, like a Hash table or a Red/Black tree. I guess the first index isn't updated frequently, so maybe it's possible to get a perfect hash.
Or just use Xapian, Lucene, or any other such library. Several are widely used and well optimized.
Edit: I don't know much about word-comparison algorithms but I guess most aren't compatible with hashing. In that case, R/B Trees or Tries might be the best way.
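The two-index layout described above can be sketched roughly like this: a postings file on disk (the inverted index) plus a small in-memory dictionary that maps each word to the byte range of its postings. The JSON encoding, whitespace tokenisation, and function names are assumptions chosen to keep the sketch short. They are not how Xapian or Lucene store their data.

```python
import json


def build_index(documents, postings_path):
    """Write the inverted index to disk; return the in-memory word lookup.

    documents maps a document id to its text. The returned dict maps each
    word to (byte offset, byte length) of its postings in the file.
    """
    postings = {}
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            postings.setdefault(word, []).append((doc_id, position))

    lookup = {}
    with open(postings_path, "wb") as out:
        for word in sorted(postings):
            record = json.dumps(postings[word]).encode("utf-8")
            lookup[word] = (out.tell(), len(record))
            out.write(record)
    return lookup


def find(word, lookup, postings_path):
    """Return the list of (doc_id, position) pairs where word occurs."""
    entry = lookup.get(word.lower())
    if entry is None:
        return []
    offset, length = entry
    with open(postings_path, "rb") as handle:
        handle.seek(offset)  # jump straight to this word's postings
        return [tuple(pair) for pair in json.loads(handle.read(length))]
```

A hash-based dictionary like this only answers exact-word queries; for similar-word matching the in-memory structure would need to be an ordered tree or a trie instead, as the edit above suggests.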
