String sorting - std::set or std::vector? - string

I'm going to have around 1000 strings that need to be sorted alphabetically.
std::set, from what I've read, is sorted. std::vector is not. std::set seems to be a simpler solution, but if I were to use a std::vector, all I would need to do is use is std::sort to alphabetize the strings.
My application may or may not be performance critical, so performance isn't necessarily the issue here (yet), but since I'll need to iterate through the container to write the strings to the file, I've read that iterating through a std::set is a tad bit slower than iterating through a std::vector.
I know it probably won't matter, but I'd like to hear which one you all would go with in this situation.
Which stl container would best suit my needs? Thanks.

std::vector with a one-time call to std::sort seems like the simplest way to do what you are after, computationally speaking. std::set provides dynamic lookups by key, which you don't really need here, and things will get more complicated if you have to deal with duplicates.
Make sure you use reserve to pre-allocate the memory for the vector, since you know the size ahead of time in this case. This prevents memory reallocation as you add to the vector (very expensive).
Also, if you use [] notation instead of push_back() you may save a little time avoiding bounds checking of push_back(), but that is really academic and insubstantial and perhaps not even true with compiler optimizations. :-)

Related

How to implement efficient string interning in f#?

What is to implement a custom string type in f# for interning strings. i have to read large csv files into memory. Given most of the columns are categorical, values are repeating and it makes sense to create new string first time it is encountered and only refer to it on subsequent occurrences to save memory.
In c# I do this by creating a global intern pool (concurrent dict) and before setting a value, lookup the dictionary if it already exists. if it exists, just point to the string already in the dictionary. if not, add it to the dictionary and set the value to the string just added to dictionary.
New to f# and wondering what is the best way to do this in f#. will be using the new string type in records named tuples etc and it will have to work with concurrent processes.
Edit:
String.Intern uses the Intern Pool. My understanding is, it is not very efficient for large pools and is not garbage collected i.e. any/all interned strings will remain in intern pool for lifetime of the app. Imagine a an application where you read a file, perform some operations and write data. Using Intern Pool solution will probably work. Now imagine you have to do the same 100 times and the strings in each file have little in common. If the memory is allocated on heap, after processing each file, we can force garbage collector to clear unnecessary strings.
I should have mentioned I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it in F#)
Memorisation pattern is slightly different from what I am looking for? We are not caching calculated results - we are ensuring each string object is created no more than once and all subsequent creations of same string are just references to the original. Using a dictionary to do this is a one way and using String.Intern is other.
sorry if is am missing something obvious here.
I have a few things to say, so I'll post them as an answer.
First, I guess String.Intern works just as well in F# as in C#.
let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y) // false
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true
Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); after the string is ready following input from file?
To create a type for use in your business domain to handle string deduplication in general is a very bad idea. You don't want your business domain polluted by that kind of low level stuff.
As for rolling your own. I did that some years ago, probably to avoid that problem you mentioned with the strings not being garbage collected, but I never tested if that actually was a problem.
It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)
It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.
The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
Releasing the dictionaries after the file is read and structures filled, will have the advantage of not holding on to strings when they are no longer really needed. And of course you save memory by not holding onto the dictionaries.
I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues. If you roll your own with the design suggested, you would not use it concurrently. Each file being read would have its own set of dictionaries for its columns.

Efficient algorithm for grouping array of strings by prefixes

I wonder what is the best way to group an array of strings according to a list of prefixes (of arbitrary length).
For example, if we have this:
prefixes = ['GENERAL', 'COMMON', 'HY-PHE-NATED', 'UNDERSCORED_']
Then
tasks = ['COMMONA', 'COMMONB', 'GENERALA', 'HY-PHE-NATEDA', 'UNDERESCORED_A', 'HY-PHE-NATEDB']
Should be grouped this way:
[['GENERALA'], ['COMMONA', 'COMMONB'], ['HY-PHE-NATEDA', 'HY-PHE-NATEDB'], ['UNDERESCORED_A'] ]
The naïve approach is to loop through all the tasks and inner loop through prefixes (or vice versa, whatever) and test each task for each prefix.
Can one give me a hint how to make this in a more efficient way?
It depends a bit on the size of your problem, of course, but your naive approach should be okay if you sort both your prefixes and your tasks and then build your sub-arrays by traversing both sorted lists only forwards.
There are a few options, but you might be interested in looking into the trie data structure.
http://en.wikipedia.org/wiki/Trie
The trie data structure is easy to understand and implement and works well for this type of problem. If you find that this works for your situation you can also look at Patricia Tries which achieve the similar performance characteristics but typically have better memory utilization. They are a little more involved to implement but not overly complex.

Suitable Haskell type for large, frequently changing sequence of floats

I have to pick a type for a sequence of floats with 16K elements. The values will be updated frequently, potentially many times a second.
I've read the wiki page on arrays. Here are the conclusions I've drawn so far. (Please correct me if any of them are mistaken.)
IArrays would be unacceptably slow in this case, because they'd be copied on every change. With 16K floats in the array, that's 64KB of memory copied each time.
IOArrays could do the trick, as they can be modified without copying all the data. In my particular use case, doing all updates in the IO monad isn't a problem at all. But they're boxed, which means extra overhead, and that could add up with 16K elements.
IOUArrays seem like the perfect fit. Like IOArrays, they don't require a full copy on each change. But unlike IOArrays, they're unboxed, meaning they're basically the Haskell equivalent of a C array of floats. I realize they're strict. But I don't see that being an issue, because my application would never need to access anything less than the entire array.
Am I right to look to IOUArrays for this?
Also, suppose I later want to read or write the array from multiple threads. Will I have backed myself into a corner with IOUArrays? Or is the choice of IOUArrays totally orthogonal to the problem of concurrency? (I'm not yet familiar with the concurrency primitives in Haskell and how they interact with the IO monad.)
A good rule of thumb is that you should almost always use the vector library instead of arrays. In this case, you can use mutable vectors from the Data.Vector.Mutable module.
The key operations you'll want are read and write which let you mutably read from and write to the mutable vector.
You'll want to benchmark of course (with criterion) or you might be interested in browsing some benchmarks I did e.g. here (if that link works for you; broken for me).
The vector library is a nice interface (crazy understatement) over GHC's more primitive array types which you can get to more directly in the primitive package. As are the things in the standard array package; for instance an IOUArray is essentially a MutableByteArray#.
Unboxed mutable arrays are usually going to be the fastest, but you should compare them in your application to IOArray or the vector equivalent.
My advice would be:
if you probably don't need concurrency first try a mutable unboxed Vector as Gabriel suggests
if you know you will want concurrent updates (and feel a little brave) then first try a MutableArray and then do atomic updates with these functions from the atomic-primops library. If you want fine-grained locking, this is your best choice. Of course concurrent reads will work fine on whatever array you choose.
It should also be theoretically possible to do concurrent updates on a MutableByteArray (equivalent to IOUArray) with those atomic-primops functions too, since a Float should always fit into a word (I think), but you'd have to do some research (or bug Ryan).
Also be aware of potential memory reordering issues when doing concurrency with the atomic-primops stuff, and help convince yourself with lots of tests; this is somewhat uncharted territory.

Haskell arrays vs lists

I'm playing with Haskell and Project Euler's 23rd problem. After solving it with lists I went here where I saw some array work. This solution was much faster than mine.
So here's the question. When should I use arrays in Haskell? Is their performance better than lists' and in which cases?
The most obvious difference is the same as in other languages: arrays have O(1) lookup and lists have O(n). Attaching something to the head of a list (:) takes O(1); appending (++) takes O(n).
Arrays have some other potential advantages as well. You can have unboxed arrays, which means the entries are just stored contiguously in memory without having a pointer to each one (if I understand the concept correctly). This has several benefits--it takes less memory and could improve cache performance.
Modifying immutable arrays is difficult--you'd have to copy the entire array which takes O(n). If you're using mutable arrays, you can modify them in O(1) time, but you'd have to give up the advantages of having a purely functional solution.
Finally, lists are just much easier to work with if performance doesn't matter. For small amounts of data, I would always use a list.
And if you're doing much indexing as well as much updating,you can
use Maps (or IntMaps), O(log size) indexing and update, good enough for most uses, code easy to get right
or, if Maps are too slow, use mutable (unboxed) arrays (STUArray from Data.Array.ST or STVectors from the vector package; O(1) indexing and update, but the code is easier to get wrong and generally not as nice.
For specific uses, functions like accumArray give very good performance too (uses mutable arrays under the hood).
Arrays have O(1) indexing (this used to be part of the Haskell definition), whereas lists have O(n) indexing. On the other hand, any kind of modification of arrays is O(n) since it has to copy the array.
So if you're going to do a lot of indexing but little updating, use arrays.

Creating strings in D without allocating memory?

Is there any typesafe way to create a string in D, using information only available at runtime, without allocating memory?
A simple example of what I might want to do:
void renderText(string text) { ... }
void renderScore(int score)
{
char[16] text;
int n = sprintf(text.ptr, "Score: %d", score);
renderText(text[0..n]); // ERROR
}
Using this, you'd get an error because the slice of text is not immutable, and is therefore not a string (i.e. immutable(char)[])
I can only think of three ways around this:
Cast the slice to a string. It works, but is ugly.
Allocate a new string using the slice. This works, but I'd rather not have to allocate memory.
Change renderText to take a const(char)[]. This works here, but (a) it's ugly, and (b) many functions in Phobos require string, so if I want to use those in the same manner then this doesn't work.
None of these are particularly nice. Am I missing something? How does everyone else get around this problem?
You have static array of char. You want to pass it to a function that takes immutable(char)[]. The only way to do that without any allocation is to cast. Think about it. What you want is one type to act like it's another. That's what casting does. You could choose to use assumeUnique to do it, since that does exactly the cast that you're looking for, but whether that really gains you anything is debatable. Its main purpose is to document that what you're doing by the cast is to make the value being cast be treated as immutable and that there are no other references to it. Looking at your example, that's essentially true, since it's the last thing in the function, but whether you want to do that in general is up to you. Given that it's a static array which risks memory problems if you screw up and you pass it to a function that allows a reference to it to leak, I'm not sure that assumeUnique is the best choice. But again, it's up to you.
Regardless, if you're doing a cast (be it explicitly or with assumeUnique), you need to be certain that the function that you're passing it to is not going to leak references to the data that you're passing to it. If it does, then you're asking for trouble.
The other solution, of course, is to change the function so that it takes const(char)[], but that still runs the risk of leaking references to the data that you're passing in. So, you still need to be certain of what the function is actually going to do. If it's pure, doesn't return const(char)[] (or anything that could contain a const(char)[]), and there's no way that it could leak through any of the function's other arguments, then you're safe, but if any of those aren't true, then you're going to have to be careful. So, ultimately, I believe that all that using const(char)[] instead of casting to string really buys you is that you don't have to cast. That's still better, since it avoids the risk of screwing up the cast (and it's just better in general to avoid casting when you can), but you still have all of the same things to worry about with regards to escaping references.
Of course, that also requires that you be able to change the function to have the signature that you want. If you can't do that, then you're going to have to cast. I believe that at this point, most of Phobos' string-based functions have been changed so that they're templated on the string type. So, this should be less of a problem now with Phobos than it used to be. Some functions (in particular, those in std.file), still need to be templatized, but ultimately, functions in Phobos that require string specifically should be fairly rare and will have a good reason for requiring it.
Ultimately however, the problem is that you're trying to treat a static array as if it were a dynamic array, and while D definitely lets you do that, you're taking a definite risk in doing so, and you need to be certain that the functions that you're using don't leak any references to the local data that you're passing to them.
Check out assumeUnique from std.exception Jonathan's answer.
No, you cannot create a string without allocation. Did you mean access? To avoid allocation, you have to either use slice or pointer to access a previously created string. Not sure about cast though, it may or may not allocate new memory space for the new string.
One way to get around this would be to copy the mutable chars into a new immutable version then slice that:
void renderScore(int score)
{
char[16] text;
int n = sprintf(text.ptr, "Score: %d", score);
immutable(char)[16] itext = text;
renderText(itext[0..n]);
}
However:
DMD currently doesn't allow this due to a bug.
You're creating an unnecessary copy (better than a GC allocation, but still not great).

Resources