Calculate the length of an XElement in a way that is not wasteful?

I have a large XElement property, and want to know its byte size for logging purposes. I don't want to just ToString() it, because I have concerns about potentially big strings not getting GC'd.
What is a smart/compact way to calculate the size of the XML content of an XElement (.NET 4.0)?
Thanks.

In C#, there is no easy way to know an object's size. You can compute it recursively, knowing the sizes of the primitive types and using reflection, but it's not an easy job.
I assume you don't really care about the XElement's in-memory size, but rather about the size of the serialized XML content (the two are naturally quite different). To get that, you need to serialize it one way or another (e.g. by calling ToString()) - there is no way around serialization.
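That said, if the worry is specifically the large intermediate string, you could serialize to a stream that merely counts bytes instead of materializing the text. A minimal sketch, assuming UTF-8 output and a hypothetical CountingStream helper (not a framework type):

using System;
using System.IO;
using System.Xml.Linq;

// A write-only stream that discards all data and just counts the bytes.
class CountingStream : Stream
{
    private long written;
    public long BytesWritten { get { return written; } }
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return written; } }
    public override long Position
    {
        get { return written; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { written += count; }
}

class Program
{
    static void Main()
    {
        var element = new XElement("root", new XElement("child", "some value"));
        var counter = new CountingStream();
        element.Save(counter); // serializes (UTF-8 by default) without keeping the whole string
        Console.WriteLine("{0} bytes", counter.BytesWritten);
    }
}

You still pay the cost of serializing, but nothing larger than the writer's internal buffer is ever allocated at once.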

Related

How to implement efficient string interning in F#?

What is the best way to implement a custom string type in F# for interning strings? I have to read large CSV files into memory. Since most of the columns are categorical, values repeat, and it makes sense to create a new string the first time a value is encountered and only refer to that instance on subsequent occurrences, to save memory.
In C# I do this by creating a global intern pool (a concurrent dictionary). Before setting a value, I look it up in the dictionary: if it already exists, I point to the string already in the dictionary; if not, I add it and set the value to the string just added - roughly like the sketch below.
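A minimal sketch of that C# approach, for reference (the StringPool name is just illustrative):

using System.Collections.Concurrent;

// A global intern pool: each distinct string value is stored once and reused.
static class StringPool
{
    private static readonly ConcurrentDictionary<string, string> pool =
        new ConcurrentDictionary<string, string>();

    // Returns the pooled instance for the given value, adding it on first sight.
    public static string Add(string value)
    {
        return pool.GetOrAdd(value, value);
    }
}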
I'm new to F# and wondering what the best way to do this in F# is. I will be using the new string type in records, named tuples, etc., and it will have to work with concurrent processes.
Edit:
String.Intern uses the intern pool. My understanding is that it is not very efficient for large pools and is not garbage collected, i.e. all interned strings remain in the intern pool for the lifetime of the app. Imagine an application where you read a file, perform some operations, and write data. The intern-pool solution will probably work there. Now imagine you have to do the same 100 times, and the strings in each file have little in common. If the memory is allocated on the heap, then after processing each file we can force the garbage collector to clear the strings that are no longer needed.
I should have mentioned that I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it from F#).
The memoization pattern is slightly different from what I am looking for: we are not caching calculated results - we are ensuring each string object is created no more than once, and all subsequent creations of the same string are just references to the original. Using a dictionary is one way to do this; using String.Intern is another.
Sorry if I am missing something obvious here.
I have a few things to say, so I'll post them as an answer.
First, I guess String.Intern works just as well in F# as in C#.
open System
open System.Text

let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y)  // false: two distinct instances
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true: y2 is the shared, interned instance
Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); once the string has been read in from the file?
Creating a type for use in your business domain just to handle string deduplication is a very bad idea. You don't want your business domain polluted by that kind of low-level concern.
As for rolling your own: I did that some years ago, probably to avoid the problem you mentioned with strings not being garbage collected, but I never tested whether that actually was a problem.
It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)
It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.
The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
Releasing the dictionaries after the file is read and the structures are filled has the advantage of not holding on to strings when they are no longer needed. And of course you save memory by not holding onto the dictionaries themselves.
I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues, and if you roll your own with the design suggested, you would not use it concurrently: each file being read would have its own set of dictionaries for its columns. A rough sketch of this design follows.
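For concreteness, here is that design sketched in the C# idiom of the question (names are illustrative; the same shape ports directly to F#):

using System.Collections.Generic;

// Per-column dedup pools that live only while a single file is being read.
// Unlike String.Intern, dropping this object lets the GC reclaim any strings
// that nothing else references.
class ColumnDeduper
{
    private readonly Dictionary<int, Dictionary<string, string>> pools =
        new Dictionary<int, Dictionary<string, string>>();

    public string Dedup(int column, string value)
    {
        Dictionary<string, string> pool;
        if (!pools.TryGetValue(column, out pool))
        {
            pool = new Dictionary<string, string>();
            pools[column] = pool;
        }
        string existing;
        if (pool.TryGetValue(value, out existing))
            return existing;  // reuse the first instance seen
        pool[value] = value;  // first sighting: remember this instance
        return value;
    }
}

Create one instance per file, run every categorical field through Dedup while reading, and drop the instance once the file's data has been copied into your real structures.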

V8 JavaScript Object vs Binary Tree

Is there a faster way to search data in JavaScript (specifically on V8 via node.js, but without c/c++ modules) than using the JavaScript Object?
This may be outdated, but it suggests that a new class is dynamically generated for every single property, which made me wonder whether a binary tree implementation might be faster; however, this does not appear to be the case.
The binary tree implementation isn't well balanced, so it might get better with balancing (only the first 26 values are roughly balanced by hand).
Does anyone have an idea why, or how it might be improved? On another note: does the dynamic class notion mean there are actually ~260,000 properties (in the jsperf benchmark test of the second link), and consequently chains of dynamic class definitions held in memory?
V8 uses the concept of 'maps', which describe the layout of the data in an object.
These maps can be "fast maps", which specify a fixed offset from the start of the object at which a particular property can be found, or they can be "dictionary maps", which use a hashtable to provide a lookup mechanism.
Each object has a pointer to the map that describes it.
Generally, objects start off with a fast map. When a property is added to an object with a fast map, the map is transitioned to a new one which describes the location of the new property within the object. The object is re-allocated with enough space for the new data item if necessary, and the object's map pointer is set to the new map.
The old map keeps a record of the transitions from it, including a pointer to the new map and a description of the property whose addition caused the map transition.
If another object which has the old map gets the same property added (which is very common, since objects of the same type tend to get used in the same way), that object will just use the new map - V8 doesn't create a new map in this case.
However, once the number of properties goes over a certain threshold (in fact, the current metric has to do with the storage space used, not the actual number of properties), the object is changed to use a dictionary map. At this point the object is re-written using a hashtable. In general, it won't undergo any more map transitions - any further properties that are added will just go in the hashtable.
Fast maps allow V8 to generate optimized code (using Crankshaft) where the offset of a property within an object is hard-coded into the machine code. This makes it very fast for cases where it can do this - it avoids the need for doing any lookup.
Obviously, the generated machine code is then dependent on the map - if the object's data layout changes, the code has to be discarded and re-optimized when necessary. V8 has a type profiling mechanism which collects information about what the types of various objects are during execution of unoptimized code. It doesn't trigger optimization of the code until certain stability constraints are met - one of these is that the maps of objects used in the function aren't changing frequently.
Here's a more detailed description of how this stuff works.
Here's a video where one of the lead developers of V8 describes stuff like map transitions and lots more.
For your particular test case, I would think that it goes through a few hundred map transitions while properties are being added in the preparation loop, then eventually transitions to a dictionary-based object. It certainly won't go through 260,000 of them.
Regarding your question about binary trees: a properly sized hashtable (with a sensible hash function and a significant number of objects in it) will always outperform a binary tree for a use-case where you're just searching, as your test code seems to do (all of the insertion is done in the setup phase).
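The gap is easy to demonstrate outside JavaScript as well. Here is a rough C# sketch (purely illustrative, not the jsperf test) pitting a hashtable against a balanced binary search tree for pure lookups:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        const int n = 260000;
        var hash = new Dictionary<string, int>();       // hashtable: O(1) expected lookup
        var tree = new SortedDictionary<string, int>(); // red-black tree: O(log n) lookup
        for (int i = 0; i < n; i++)
        {
            string key = "key" + i;
            hash[key] = i;
            tree[key] = i;
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++) sum += hash["key" + i];
        Console.WriteLine("hashtable: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < n; i++) sum += tree["key" + i];
        Console.WriteLine("tree:      {0} ms (checksum {1})", sw.ElapsedMilliseconds, sum);
    }
}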

Are there any advantages of using Chronicle Map for java.lang.String over the intern method for the purpose of lowering heap usage?

The intention is to reduce the old-gen size in order to lower GC pauses.
In my understanding, Chronicle Map stores objects in native space, and (starting from Java 8) String#intern does the same because interned strings are in metaspace.
I was curious whether I need to use Chronicle Map, or whether it's OK to stick with the intern method.
ChronicleMap couldn't serve as a direct replacement for String.intern(), because java.lang.String instances are always on-heap. So you won't actually win anything by storing strings in ChronicleMap, because before using them you will deserialize them to on-heap objects anyway.
ChronicleMap as a data structure (not necessarily the Java implementation; maybe a C++ one) could indeed be used for some sort of caching of textual data, especially inter-process. But I suspect it is very far from what you are seeking. For example, on the Java side it would require a separate value class (not String and not StringBuilder), implementing CharSequence at best.
Also, very likely you don't need interning but deduplication, which could be much more effective; see the java.lang.String Catechism talk, "Intern" section.

Is it a reasonable practice to serialize Haskell data structures to disk just using Show/Read?

I've played around with the Text.Show.Pretty module, and it makes it possible to serialize out Haskell data structures like records into a nice human-readable format & still be able to deserialize them easily using read. The output format is even more readable than YAML and JSON.
Example serialized output for a Haskell record using Text.Show.Pretty:
Book
{ author = "Plato"
, title = "Republic"
, numbers = [ 123
, 1234
]
}
Coming from the Ruby world, I know that YAML and JSON are most Rubyists' preferred format for serializing data structures. Are Haskell Show and Read instances used often to achieve the same end in Haskell?
For big structures, I wouldn't recommend it. read is slower than molasses. Anecdote time: I have a program named yeganesh. Conceptually, it's pretty simple: read in a [(String,Double)] with about 2000 elements and dump out the keys sorted by their elements. I used to store this using Show/Read, but found that switching to a custom printer and parser sped up the program by a factor of 8. (Note: it's not that the parsing sped up by a factor of eight. The whole program sped up by a factor of eight. That means the parsing sped up by a bigger factor than that.) That made the difference between uncomfortably long pauses and instant gratification.
I agree with Daniel Wagner, but if you want a file that a user can manipulate with a simple text editor, you could use Read/Show for a small set of data, e.g. config files.
I don't think that is a common approach amongst Haskellers, though. I usually use Parsec instead of read for config data, and a custom class/instance instead of Show.
If you have a lot of data, one usually uses Data.Binary or Data.Serialize.

Delphi String Sharing Question

I have a large number of objects that all have a filename stored inside. All file names are within a given base directory (let's call it C:\BaseDir\). I am now considering two alternatives:
Store absolute paths in the objects
Store relative paths in the objects and store the base path additionally
If I understand Delphi strings correctly, the second approach will need much less memory, because the base path string is shared - given that I pass the same string field to all the objects like this:
TDataObject.Create (FBasePath, RelFileName);
Is that assumption true? Will there be only one string instance of the base path in memory?
If anybody knows a better way to handle situations like this, feel free to comment on that as well.
Thanks!
You are correct. When you write s1 := s2 with two string variables, there is one string in memory with (at least two) references to it.
You also ask whether trying to reduce the number of strings in memory is a good idea. That depends on how many strings you have in comparison to other memory consuming objects. Only you can really answer that.
As David said, the common string would be shared (unless you use e.g. UniqueString()).
Having said that, this looks like premature optimisation. If you actually need to work with full paths, and never need the directory and filename parts separately, then you should think about splitting them up only when you really run into memory problems.
Constantly concatenating the base and filename parts could significantly slow down your program and cause memory fragmentation.
