in-memory string deduplication - string

Context: I'm writing something to process log data which involves loading several GB of data into memory and cross checking various things, finding correlations in data and writing the results out to another file. (This is essentially a cooking/denormalization step before loading into a Druid.io cluster.) I want to avoid having to write the information to a database for both performance and code simplicity - it is assumed that in the foreseeable future the volume of data processed at one time can be handled by adding memory to the machine.
My question is if it is a good idea to attempt to explicitly deduplicate strings in my code; and if so, what is a good approach. Many of the values in these log files are the same exact pieces of text (probably about 25% of the total text values in the file are unique, rough guess).
Since we're talking about gigs of data, and while ram is cheap and swap is possible, there is still a limit and if I'm careless I will very likely hit it. If I do something like this:
strstore := make(map[string]string)
// do some work that involves slicing and dicing some text, resulting in:
var a = "some string that we figured out that has about a 75% chance of being duplicate"
// note that there are about 10 other such variables that are calculated here, only "a" shown for simplicity
if _, ok := strstore[a]; !ok {
strstore[a] = a
} else {
a = strstore[a]
}
// now do some more stuff with "a" and keep it in a struct which is in
// some map some place
It would seem to me that this would have the effect of "reusing" existing strings, at the cost of a hash lookup and compare. Seemingly a good trade off.
However, this might not be that helpful if the strings that are in fact new cause memory to be fragmented and have various holes that are left unclaimed.
I could also try to keep one big byte array/slice which has the character data and index into that, but it would make the code hard to write (esp having to mess around with conversion between []byte and strings, which involves it's own allocation) and I would probably just be doing a poor job of something that is really the Go runtime's domain anyway.
Looking for any advice on approaches to this problem, or if anyone's experience with this sort of thing has yielded particularly useful mechanisms to address this.

There are a lot of interesting data structures and algorithms that you could use here. It depends on what you are trying do in the stats and processing stages.
I am not sure how compressible your logs are but you could pre-process the data, again depending on your uses cases : https://github.com/alecthomas/mph/blob/master/README.md
Take a look at some of these data structures as well for background :
https://github.com/Workiva/go-datastructures

Related

Why GBuffers need to be created for each frame in D3D12?

I have experience with D3D11 and want to learn D3D12. I am reading the official D3D12 multithread example and don't understand why the shadow map (generated in the first pass as a DSV, consumed in the second pass as SRV) is created for each frame (actually only 2 copies, as the FrameResource is reused every 2 frames).
The code that creates the shadow map resource is here, in the FrameResource class, instances of which is created here.
There is actually another resource that is created for each frame, the constant buffer. I kind of understand the constant buffer. Because it is written by CPU (D3D11 dynamic usage) and need to remain unchanged until the GPU finish using it, so there need to be 2 copies. However, I don't understand why the shadow map needs to do the same, because it is only modified by GPU (D3D11 default usage), and there are fence commands to separate reading and writing to that texture anyway. As long as the GPU follows the fence, a single texture should be enough for the GPU to work correctly. Where am I wrong?
Thanks in advance.
EDIT
According to the comment below, the "fence" I mentioned above should more accurately be called "resource barrier".
The key issue is that you don't want to stall the GPU for best performance. Double-buffering is a minimal requirement, but typically triple-buffering is better for smoothing out frame-to-frame rendering spikes, etc.
FWIW, the default behavior of DXGI Present is to stall only after you have submitted THREE frames of work, not two.
Of course, there's a trade-off between triple-buffering and input responsiveness, but if you are maintaining 60 Hz or better than it's likely not noticeable.
With all that said, typically you don't need to double-buffered depth/stencil buffers for rendering, although if you wanted to make the initial write of the depth-buffer overlap with the read of the previous depth-buffer passes then you would want distinct buffers per frame for performance and correctness.
The 'writes' are all complete before the 'reads' in DX12 because of the injection of the 'Resource Barrier' into the command-list:
void FrameResource::SwapBarriers()
{
// Transition the shadow map from writeable to readable.
m_commandLists[CommandListMid]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_DEPTH_WRITE, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE));
}
void FrameResource::Finish()
{
m_commandLists[CommandListPost]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE, D3D12_RESOURCE_STATE_DEPTH_WRITE));
}
Note that this sample is a port/rewrite of the older legacy DirectX SDK sample MultithreadedRendering11, so it may be just an artifact of convenience to have two shadow buffers instead of just one.

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
if set(f.read(64*1048576))!=set(['\x00']):
print i
break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in python here I suspected that the comparator itself was the issue. In most non-typed languages memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from read and comparing the result with a previously prepared nul filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account is:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster until some point (I would expect memory fragmentation to be the main limit)
The last buffer may not be the same size, reading the file into the prepared buffer will keep the tailing zeroes where we want them.
Here the comparison of the two buffers will be quick and there will be no attempts of casting the bytes to string (which we don't need) and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

Objective C For loops with #autorelease and ARC

As part of an app that allows auditors to create findings and associate photos to them (Saved as Base64 strings due to a limitation on the web service) I have to loop through all findings and their photos within an audit and set their sync value to true.
Whilst I perform this loop I see a memory spike jumping from around 40MB up to 500MB (for roughly 350 photos and 255 findings) and this number never goes down. On average our users are creating around 1000 findings and 500-700 photos before attempting to use this feature. I have attempted to use #autorelease pools to keep the memory down but it never seems to get released.
for (Finding * __autoreleasing f in self.audit.findings){
#autoreleasepool {
[f setToSync:#YES];
NSLog(#"%#", f.idFinding);
for (FindingPhoto * __autoreleasing p in f.photos){
#autoreleasepool {
[p setToSync:#YES];
p = nil;
}
}
f = nil;
}
}
The relationships and retain cycles look like this
Audit has a strong reference to Finding
Finding has a weak reference to Audit and a strong reference to FindingPhoto
FindingPhoto has a weak reference to Finding
What am I missing in terms of being able to effectively loop through these objects and set their properties without causing such a huge spike in memory. I'm assuming it's got something to do with so many Base64 strings being loaded into memory when looping through but never being released.
So, first, make sure you have a batch size set on the fetch request. Choose a relatively small number, but not too small because this isn't for UI processing. You want to batch a reasonable number of objects into memory to reduce loading overhead while keeping memory usage down. Try 50 or 100 and see how it goes, then consider upping the batch size a little.
If all of the objects you're loading are managed objects then the correct way to evict them during processing is to turn them into faults. That's done by calling refreshObject:mergeChanges: on the context. BUT - that discards any changes, and your loop is specifically there to make changes.
So, what you should really be doing is batch saving the objects you've modified and then turning those objects back into faults to remove the data from memory.
So, in your loop, keep a counter of how many you've modified and save the context each time you hit that count and refresh all the objects that were processed so far. The batch on the fetch and the batch size to save should be the same number.
There's probably a big difference in size between your "Finding" objects and the associated images. So your primary aim should be to redesign your database in a way so that unfaulting (loading) a Finding object does not automatically load the base64 encoded image.
That's actually one of the major strengths of Code Data: Loading part of an object hierarchy. Just try to move the base64 encoded data to an own (managed) object so that Core Data does not load it. It will still be loaded as needed when the reference is touched.

How to partially read from a TStringStream, free the read data from the stream and keep the rest (the unread data)?

What I want to do: lets suppose I have a TStringStream that just read a string with 100 characters. If I call .ReadString(50), I will get the first 50 characters of this stream and its cursor is going to be placed on the position 51.
My question is: how do I toss the characters 1 to 50 in this stream in a fast and clean way? I want to read the rest (51 to 100) later.
Thanks in advance.
You cannot do what you are hoping to do. The string stream's data is a Delphi string which is stored as a single memory block. Memory blocks are atomic, they cannot be split. You cannot free some part of a memory block.
If you really need to return memory to the memory manager then you should create a new string with the already processed data removed. You can then re-create your string stream with this new input and destroy the previous string stream.
Having said that, it's hard to see that doing much other than increasing your memory fragmentation. If the sizes of memory involved are large enough, and if the string stream persists for long enough, then this just might be a sensible approach. Otherwise it sounds like an attempt to optimise that actually would hinder performance.
Perhaps some class other than string stream could be more appropriate but it's very hard to advise without knowing more details.
You can't do this. If you really need to do this, you should write your own class that implements the stream-interface and which would let you process some data a little bit at a time and free whatever you want to free. Note that you would only be able to go through the data once, since you've now deleted your data. That is, seeking to the beginning again would become impossible, and your current stream "position" would be a lie.
In short, sounds like you're confused.
If I understand correctly you which to skip forward in the stream?
You can do:
Str.Position := Str.Position + 50;
Or like this:
Str.Seek(50,TSeekOrigin.soCurrent);

How can I increase memory security in Delphi?

Is it possible to "wipe" strings in Delphi? Let me explain:
I am writing an application that will include a DLL to authorise users. It will read an encrypted file into an XML DOM, use the information there, and then release the DOM.
It is obvious that the unencrypted XML is still sitting in the memory of the DLL, and therefore vulnerable to examination. Now, I'm not going to go overboard in protecting this - the user could create another DLL - but I'd like to take a basic step to preventing user names from sitting in memory for ages. However, I don't think I can easily wipe the memory anyway because of references. If I traverse my DOM (which is a TNativeXML class) and find every string instance and then make it into something like "aaaaa", then will it not actually assign the new string pointer to the DOM reference, and then leave the old string sitting there in memory awaiting re-allocation? Is there a way to be sure I am killing the only and original copy?
Or is there in D2007 a means to tell it to wipe all unused memory from the heap? So I could release the DOM, and then tell it to wipe.
Or should I just get on with my next task and forget this because it is really not worth bothering.
I don't think it is worth bothering with, because if a user can read the memory of the process using the DLL, the same user can also halt the execution at any given point in time. Halting the execution before the memory is wiped will still give the user full access to the unencrypted data.
IMO any user sufficiently interested and able to do what you describe will not be seriously inconvenienced by your DLL wiping the memory.
Two general points about this:
First, this is one of those areas where "if you have to ask, you probably shouldn't be doing this." And please don't take that the wrong way; I mean no disrespect to your programming skills. It's just that writing secure, cryptographically strong software is something that either you're an expert at or you aren't. Very much in the same way that knowing "a little bit of karate" is much more dangerous than knowing no karate at all. There are a number of third-party tools for writing secure software in Delphi which have expert support available; I would strongly encourage anyone without a deep knowledge of cryptographic services in Windows, the mathematical foundations of cryptography, and experience in defeating side channel attacks to use them instead of attempting to "roll their own."
To answer your specific question: The Windows API has a number of functions which are helpful, such as CryptProtectMemory. However, this will bring a false sense of security if you encrypt your memory, but have a hole elsewhere in the system, or expose a side channel. It can be like putting a lock on your door but leaving the window open.
How about something like this?
procedure WipeString(const str: String);
var
i:Integer;
iSize:Integer;
pData:PChar;
begin
iSize := Length(str);
pData := PChar(str);
for i := 0 to 7 do
begin
ZeroMemory(pData, iSize);
FillMemory(pData, iSize, $FF); // 1111 1111
FillMemory(pData, iSize, $AA); // 1010 1010
FillMemory(pData, iSize, $55); // 0101 0101
ZeroMemory(pData, iSize);
end;
end;
DLLs don't own allocated memory, processes do. The memory allocated by your specific process will be discarded once the process terminates, whether the DLL hangs around (because it is in use by another process) or not.
How about decrypting the file to a stream, using a SAX processor instead of an XML DOM to do your verification and then overwriting the decrypted stream before freeing it?
If you use the FastMM memory manager in Full Debug mode, then you can force it to overwrite memory when it is being freed.
Normally that behaviour is used to detect wild pointers, but it can also be used for what your want.
On the other hand, make sure you understand what Craig Stuntz writes: do not write this authentication and authorization stuff yourself, use the underlying operating system whenever possible.
BTW: Hallvard Vassbotn wrote a nice blog about FastMM:
http://hallvards.blogspot.com/2007/05/use-full-fastmm-consider-donating.html
Regards,
Jeroen Pluimers
Messy but you could make a note of the heap size that you've used while you've got the heap filled with sensitive data then when that is released do a GetMem to allocate you a large chunk spanning (say) 200% of that. do a Fill on that chunk and make the assumption that any fragmentation is unlinkely to be of much use to an examiner.
Bri
How about keeping the password as a hash value in the XML and verify by comparing the hash of the input password with the hashed password in your XML.
Edit: You can keep all the sensitive data encrypted and decrypt only at the last possible moment.
Would it be possible to load the decrypted XML into an array of char or byte rather than a string? Then there would be no copy-on-write handling, so you would be able to backfill the memory with #0's before freeing?
Be careful if assigning array of char to string, as Delphi has some smart handling here for compatibility with traditional packed array[1..x] of char.
Also, could you use ShortString?
If your using XML, even encrypted, to store passwords your putting your users at risk. A better approach would be to store the hash values of the passwords instead, and then compare the hash against the entered password. The advantage of this approach is that even in knowing the hash value, you won't know the password which makes the hash. Adding a brute force identifier (count invalid password attempts, and lock the account after a certain number) will increase security even further.
There are several methods you can use to create a hash of a string. A good starting point would be to look at the turbo power open source project "LockBox", I believe it has several examples of creating one way hash keys.
EDIT
But how does knowing the hash value if its one way help? If your really paranoid, you can modify the hash value by something prediticable that only you would know... say, a random number using a specific seed value plus the date. You could then store only enough of the hash in your xml so you can use it as a starting point for comparison. The nice thing about psuedo random number generators is that they always generate the same series of "random" numbers given the same seed.
Be careful of functions that try to treat a string as a pointer, and try to use FillChar or ZeroMemory to wipe the string contents.
this is both wrong (strings are shared; you're screwing someone else who's currently using the string)
and can cause an access violation (if the string happens to have been a constant, it is sitting on a read-only data page in the process address space; and trying to write to it is an access violation)
procedure BurnString(var s: UnicodeString);
begin
{
If the string is actually constant (reference count of -1), then any attempt to burn it will be
an access violation; as the memory is sitting in a read-only data page.
But Delphi provides no supported way to get the reference count of a string.
It's also an issue if someone else is currently using the string (i.e. Reference Count > 1).
If the string were only referenced by the caller (with a reference count of 1), then
our function here, which received the string through a var reference would also have the string with
a reference count of one.
Either way, we can only burn the string if there's no other reference.
The use of UniqueString, while counter-intuitiave, is the best approach.
If you pass an unencrypted password to BurnString as a var parameter, and there were another reference,
the string would still contain the password on exit. You can argue that what's the point of making a *copy*
of a string only to burn the copy. Two things:
- if you're debugging it, the string you passed will now be burned (i.e. your local variable will be empty)
- most of the time the RefCount will be 1. When RefCount is one, UniqueString does nothing, so we *are* burning
the only string
}
if Length(s) > 0 then
begin
System.UniqueString(s); //ensure the passed in string has a reference count of one
ZeroMemory(Pointer(s), System.Length(s)*SizeOf(WideChar));
{
By not calling UniqueString, we only save on a memory allocation and wipe if RefCnt <> 1
It's an unsafe micro-optimization because we're using undocumented offsets to reference counts.
And i'm really uncomfortable using it because it really is undocumented.
It is absolutely a given that it won't change. And we'd have stopping using Delphi long before
it changes. But i just can't do it.
}
//if PLongInt(PByte(S) - 8)^ = 1 then //RefCnt=1
// ZeroMemory(Pointer(s), System.Length(s)*SizeOf(WideChar));
s := ''; //We want the callee to see their passed string come back as empty (even if it was shared with other variables)
end;
end;
Once you have the UnicodeString version, you can create the AnsiString and WideString versions:
procedure BurnString(var s: AnsiString); overload;
begin
if Length(s) > 0 then
begin
System.UniqueString(s);
ZeroMemory(Pointer(s), System.Length(s)*SizeOf(AnsiChar));
//if PLongInt(PByte(S) - 8)^ = 1 then //RefCount=1
// ZeroMemory(Pointer(s), System.Length(s)*SizeOf(AnsiChar));
s := '';
end;
end;
procedure BurnString(var s: WideString);
begin
//WideStrings (i.e. COM BSTRs) are not reference counted, but they are modifiable
if Length(s) > 0 then
begin
ZeroMemory(Pointer(s), System.Length(s)*SizeOf(WideChar));
//if PLongInt(PByte(S) - 8)^ = 1 then //RefCount=1
// ZeroMemory(Pointer(s), System.Length(s)*SizeOf(AnsiChar));
s := '';
end;
end;

Resources