Does J2ME have something similar to the RandomAccessFile class, or is there any way to emulate this particular (random access) functionality?
The problem is this: I have a rather large binary data file (~600 KB) and would like to create a mobile application that uses that data. The format of the data is home-made and contains many index blocks and data blocks. Reading the data on other platforms (like PHP or C) usually goes like this:
Read 2 bytes for index key (K), another 2 for index value (V) for the data type needed
Skip V bytes from the start of the file to seek to the file position where the data for index key K starts
Read the data
Profit :)
This happens many times during the program flow.
I'm investigating the possibility of doing the very same thing in J2ME, and while I admit I'm quite new to the whole Java thing, I can't seem to find anything beyond the InputStream (DataInputStream) classes, which don't have the basic seek-to-byte/skip/return-position functions I need.
So, what are my chances?
You could use something like this:
try {
    // "is" is an InputStream over the data file
    DataInputStream di = new DataInputStream(is);
    di.mark(9999);               // remember the starting position
    short key = di.readShort();  // 2-byte index key
    short val = di.readShort();  // 2-byte index value (offset)
    di.reset();                  // jump back to the mark
    di.skip(val);                // seek forward to the data
    byte[] b = new byte[255];
    di.read(b);                  // read the data block
} catch (Exception ex) {
    ex.printStackTrace();
}
That said, I prefer not to use the mark/reset methods; I think it is better to store offsets relative to the position right after the index entry rather than from the start of the file, so you can skip those methods entirely. I believe they have issues on some devices.
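For instance, a minimal sketch of that variant, assuming each index entry stores the offset of its data relative to the position right after the entry (the helper class and the 255-byte record size are illustrative, not part of the question's format):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class IndexedReader {
    // Read one index entry, then skip straight ahead to its data block.
    static byte[] readRecord(InputStream is) throws IOException {
        DataInputStream di = new DataInputStream(is);
        short key = di.readShort();   // 2-byte index key (would select an entry)
        short val = di.readShort();   // 2-byte offset, relative to here
        long remaining = val;
        while (remaining > 0) {       // skip() may skip fewer bytes than asked
            long skipped = di.skip(remaining);
            if (skipped <= 0) {
                break;
            }
            remaining -= skipped;
        }
        byte[] b = new byte[255];     // illustrative record size
        di.readFully(b);
        return b;
    }
}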
One more note: I don't recommend opening a 600 KB file in one go; it will crash the application on many low-end devices. You should split this file into multiple files.
I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using Python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in Python, I suspected that the comparison itself was the issue. In most non-typed languages, memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, reading into a preallocated buffer and comparing the result with a previously prepared NUL-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things need to be taken into account:
The sizes of the compared buffers should be equal, and comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit)
The last read may not fill the buffer; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers is quick, there is no attempt to cast the bytes to strings (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)
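Putting the pieces together, a minimal Python 2 sketch of the whole scan (FILENAME and the chunk size are placeholders):

size = 64 * 1048576              # 64 MiB chunks; tune for your system
data = bytearray(size)           # reused read buffer
zeroes = bytearray(size)         # stays all-NUL, used only for comparison

f = open(FILENAME, 'rb')
offset = 0
while True:
    n = f.readinto(data)
    if not n:
        break                    # end of file, everything was zero
    if n < size:
        data[n:] = zeroes[n:]    # zero the stale tail of a short final read
    if data != zeroes:           # bytearray comparison, no string casting
        print 'non-zero data in the chunk starting at offset', offset
        break
    offset += n
f.close()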
Context: I'm writing something to process log data which involves loading several GB of data into memory and cross checking various things, finding correlations in data and writing the results out to another file. (This is essentially a cooking/denormalization step before loading into a Druid.io cluster.) I want to avoid having to write the information to a database for both performance and code simplicity - it is assumed that in the foreseeable future the volume of data processed at one time can be handled by adding memory to the machine.
My question is whether it is a good idea to explicitly deduplicate strings in my code, and if so, what a good approach would be. Many of the values in these log files are the same exact pieces of text (probably about 25% of the total text values in the file are unique, rough guess).
Since we're talking about gigs of data, and while ram is cheap and swap is possible, there is still a limit and if I'm careless I will very likely hit it. If I do something like this:
strstore := make(map[string]string)
// do some work that involves slicing and dicing some text, resulting in:
var a = "some string that we figured out that has about a 75% chance of being duplicate"
// note that there are about 10 other such variables that are calculated here, only "a" shown for simplicity
if _, ok := strstore[a]; !ok {
    strstore[a] = a
} else {
    a = strstore[a]
}
// now do some more stuff with "a" and keep it in a struct which is in
// some map some place
It would seem to me that this would have the effect of "reusing" existing strings, at the cost of a hash lookup and compare. Seemingly a good trade-off.
However, this might not be that helpful if the strings that are in fact new cause memory to be fragmented and have various holes that are left unclaimed.
I could also try to keep one big byte array/slice that holds the character data and index into that, but it would make the code hard to write (especially having to mess around with conversion between []byte and string, which involves its own allocation), and I would probably just be doing a poor job of something that is really the Go runtime's domain anyway.
Looking for any advice on approaches to this problem, or if anyone's experience with this sort of thing has yielded particularly useful mechanisms to address this.
There are a lot of interesting data structures and algorithms that you could use here. It depends on what you are trying to do in the stats and processing stages.
I am not sure how compressible your logs are, but you could pre-process the data, again depending on your use cases: https://github.com/alecthomas/mph/blob/master/README.md
Take a look at some of these data structures as well for background:
https://github.com/Workiva/go-datastructures
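As for the explicit deduplication itself, the map-based interning in the question looks sound; here is a minimal runnable sketch of the same idea, wrapped in a helper (names are illustrative):

package main

import "fmt"

// interner maps each distinct string to its canonical copy.
type interner map[string]string

// intern returns the canonical copy of s, storing s on first sight.
func (t interner) intern(s string) string {
	if canon, ok := t[s]; ok {
		return canon
	}
	t[s] = s
	return s
}

func main() {
	t := interner{}
	a := t.intern("some duplicate value")
	b := t.intern("some duplicate value")
	fmt.Println(a == b) // true; later copies reuse the first string's storage
}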
My program reads a file, interleaving it as below:
The file to be read is large. It is split into four parts, each of which is split into many blocks. My program first reads block 1 of part 1, then jumps to block 1 of part 2, and so on, then comes back to block 2 of part 1, and so forth.
Performance drops in tests. I believe the reason is that the kernel's page cache doesn't work efficiently with this access pattern. But the file is too large to mmap(), and it is located on NFS.
How do I speed up reading in such a situation? Any comments and suggestions are welcome.
You may want to use posix_fadvise() to give the system hints about your usage, e.g. use POSIX_FADV_RANDOM to disable readahead, and possibly use POSIX_FADV_WILLNEED to have the system try to read the next block into the page cache before you need it (if you can predict this).
You could also try to use POSIX_FADV_DONTNEED once you are done reading a block to have the system free the underlying cache pages, although this might not be necessary.
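A minimal sketch of those hints in C (error checking omitted; the offsets and lengths are placeholders for your block geometry):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

/* Call once after open(): we seek around, so sequential readahead
 * would mostly fetch data we never use. */
static void set_random_access(int fd)
{
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);  /* len 0 = whole file */
}

/* Call before processing the current block to prefetch the next one. */
static void prefetch_block(int fd, off_t offset, off_t len)
{
    posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}

/* Optionally, after a block has been fully consumed: */
static void discard_block(int fd, off_t offset, off_t len)
{
    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}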
For each pair of blocks, read both in, process the first, and push the second onto a stack. When you come to the end of the file, start shifting values off the bottom of the stack, processing them one by one.
You can break up the reading into linear chunks. For example, if your code looks like this:
for (int block=0; block<n_blocks; ++block) {
    for (int part=0; part<n_parts; ++part) {
        seek(file, part*n_blocks + block);
        data[part] = readChar(file);
    }
    send(data);
}
change it to this:
for (int chunk=0; chunk<n_chunks; ++chunk) {
    for (int part=0; part<n_parts; ++part) {
        seek(file, part*n_blocks + chunk*n_blocks_per_chunk);
        for (int block=0; block<n_blocks_per_chunk; ++block) {
            data[block*n_parts+part] = readChar(file);
        }
    }
    send(data);
}
Then optimize n_blocks_per_chunk for your cache.
What I want to do: let's suppose I have a TStringStream that has just read a string of 100 characters. If I call .ReadString(50), I will get the first 50 characters of this stream, and its cursor will be placed at position 51.
My question is: how do I toss the characters 1 to 50 in this stream in a fast and clean way? I want to read the rest (51 to 100) later.
Thanks in advance.
You cannot do what you are hoping to do. The string stream's data is a Delphi string which is stored as a single memory block. Memory blocks are atomic, they cannot be split. You cannot free some part of a memory block.
If you really need to return memory to the memory manager then you should create a new string with the already processed data removed. You can then re-create your string stream with this new input and destroy the previous string stream.
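A minimal sketch of that re-creation (Process is a placeholder for whatever consumes the first half):

var
  Rest: string;
begin
  // consume and process the first 50 characters
  Process(Str.ReadString(50));   // Process is hypothetical
  // keep only the unread tail, then rebuild the stream around it
  Rest := Copy(Str.DataString, 51, MaxInt);
  Str.Free;
  Str := TStringStream.Create(Rest);
end;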
Having said that, it's hard to see that doing much other than increasing your memory fragmentation. If the sizes of memory involved are large enough, and if the string stream persists for long enough, then this just might be a sensible approach. Otherwise it sounds like an attempt to optimise that would actually hinder performance.
Perhaps some class other than string stream could be more appropriate but it's very hard to advise without knowing more details.
You can't do this. If you really need to do this, you should write your own class that implements the stream-interface and which would let you process some data a little bit at a time and free whatever you want to free. Note that you would only be able to go through the data once, since you've now deleted your data. That is, seeking to the beginning again would become impossible, and your current stream "position" would be a lie.
In short, sounds like you're confused.
If I understand correctly, you wish to skip forward in the stream?
You can do:
Str.Position := Str.Position + 50;
Or like this:
Str.Seek(50, TSeekOrigin.soCurrent);
I have to pass an InputStream as a parameter to a 3rd party library, which will read the complete contents from the InputStream and do its job.
My problem is, some of my files are Zip files with more than one ZipEntry. From what I understand, one can read one ZipEntry at a time, call zipInputStream.getNextEntry(), read again, and so on. But the 3rd party library doesn't understand this and expects a single InputStream. All the ZipEntries of a zip file should be available as one InputStream.
Please enlighten me as to how to do it. I cannot use ZipFile as the file is not stored locally (in a different server). I also cannot read all zipEntries and construct a ByteArrayOutputStream or a string as the files can be very big and that will spike memory usage.
I want a way to let one InputStream read from multiple zip entries of a single zip file transparently.
thanks in advance,
Prasanna
I'm no expert but I see two ways of doing what you are asking for:
While I wouldn't call it the most efficient way, the fairly foolproof method is to write them all to a buffer and then use that buffer to feed an InputStream. You could look into using a ByteArrayOutputStream, feed it your zip entries, and then create a ByteArrayInputStream from the same buffer.
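A rough sketch of that first option (though the memory concern from the question applies, since the whole payload ends up in one byte array):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

class ZipBuffering {
    // Drain every entry of the zip into one in-memory buffer.
    static InputStream bufferAllEntries(ZipInputStream zin) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        while (zin.getNextEntry() != null) {
            int n;
            while ((n = zin.read(chunk)) != -1) {
                bos.write(chunk, 0, n);
            }
        }
        return new ByteArrayInputStream(bos.toByteArray());
    }
}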
Alternatively, SequenceInputStream sounds like it does what you are after, but I've not had a chance to look into it in much detail.
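A sketch of how that might look (illustrative and untested; the wrapper exists because SequenceInputStream closes each piece once it is exhausted, which must not close the shared ZipInputStream):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Enumeration;
import java.util.NoSuchElementException;
import java.util.zip.ZipInputStream;

class ZipConcatenation {
    // One InputStream spanning every entry of the zip, streamed lazily.
    static InputStream concat(final ZipInputStream zin) {
        Enumeration<InputStream> entries = new Enumeration<InputStream>() {
            private InputStream pending;

            public boolean hasMoreElements() {
                if (pending == null) {
                    try {
                        if (zin.getNextEntry() == null) {
                            return false;        // no more entries
                        }
                    } catch (IOException e) {
                        throw new IllegalStateException(e);
                    }
                    // Shield the shared stream from close():
                    pending = new FilterInputStream(zin) {
                        public void close() { /* leave zin open */ }
                    };
                }
                return true;
            }

            public InputStream nextElement() {
                if (!hasMoreElements()) {
                    throw new NoSuchElementException();
                }
                InputStream current = pending;
                pending = null;
                return current;
            }
        };
        return new SequenceInputStream(entries);
    }
}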
K.Barad