Groovy: Appending to large files

How do I append to large files efficiently? I have a process that has to continually append to a file, and as the file size grows, the performance seems to slow down as well. Is there any way to specify a large buffer size with the append?

While Don's approach is valid in general (just mind that a BufferedOutputStream must be flushed, or closed, before its contents are guaranteed to reach the file), I'd like to elaborate a bit further.
Groovy does not provide special objects for I/O operations. Thus, you would use Java's FileOutputStream (for writing bytes) or FileWriter (for writing strings). Both provide a constructor that takes a boolean append parameter.
For both, there exist buffered decorators (BufferedOutputStream and BufferedWriter). "Buffering", in this scope, means that contents are not necessarily written instantly to the underlying stream, so there is potential for I/O optimization.
Don already provided a sample for the BufferedOutputStream, and here's one for the BufferedWriter:
File file = new File("foo")
if (file.exists()) {
    assert file.delete()
    assert file.createNewFile()
}
boolean append = true
FileWriter fileWriter = new FileWriter(file, append)
BufferedWriter buffWriter = new BufferedWriter(fileWriter)
100.times { buffWriter.write "foo" }
buffWriter.flush()
buffWriter.close()
While Groovy does not provide its own I/O objects, the Groovy JDK (GDK) enhances several Java types by adding convenience methods. In the scope of I/O outputting, the OutputStream and the File types are relevant.
So, finally, you can work with those the "Groovy way":
new File("foo").newOutputStream().withWriter("UTF-8") { writer ->
100.times { writer.write "foo" + it }
}
EDIT: As per your further inquiry:
None of the GDK methods allows for setting a buffer size.
The above "Groovy" code will overwrite the file if called repeatedly. - In contrast, the following piece of code will append the string to an existing file and, thus, can be called repeatedly:
new File("foo").withWriterAppend("UTF-8") { it.write("bar") }

def file = new File('/path/to/file')
// Create an output stream that writes to the file in 'append' mode
def fileOutput = new FileOutputStream(file, true)
// Buffer the output - set bufferSize to whatever size buffer you want to use
def bufferSize = 512
def bufferedOutput = new BufferedOutputStream(fileOutput, bufferSize)
try {
    byte[] contentToAppend = 'some content'.bytes // placeholder - obtain the real content to write here
    bufferedOutput.write(contentToAppend)
} finally {
    bufferedOutput.close() // close() flushes the buffer before closing the stream
}

On Windows, the JVM implements the append flag inefficiently, with a seek operation. This is neither atomic nor very performant when opening the file multiple times. It is supposed to be fixed somewhere in Java 7: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6631352

Related

Python: Writing a bytestream to overwrite an existing Microsoft Structured Storage OLE Stream

Some background on what I am doing:
I am writing a program in Python 3 in the hope of developing a process to read and write Microsoft OLE Structured Storage files.
Using tkinter and PySimpleGUI, I was able to create a simple GUI that lets the user choose which storages and streams they would like to read and write to.
I am using the olefile, pandas, and numpy packages to perform most of my program's legwork, but I have encountered a known issue with olefile:
the size of the bytestream being written must be the same as that of the existing bytestream in the OLE file. This became an issue for me relatively quickly after I began debugging my program.
What do I need to do?
After some extensive research on the main programming sites and buying the book Python Programming on Win32 (specifically reading Ch. 12 on COM storage), I have run myself into a dead end.
https://github.com/joxeankoret/nightmare/blob/master/mutators/OleFileIO_PL.py
https://github.com/decalage2/olefile/issues/6
https://github.com/decalage2/olefile/issues/95
https://github.com/decalage2/olefile/issues/99
The following is the watered down code I am using:
file_path = values[0]
xl_path = values[1]
data = olefile.OleFileIO(file_path)
storages = olefile.OleFileIO.listdir(data, streams=False, storages=True)
streams = olefile.OleFileIO.listdir(data, streams=True, storages=False)
stmdata = data.openstream(streams[index])
readData = data.openstream(streams[index]).read()
#Send the data into Excel to be manipulated by User
with pd.ExcelWriter(xl_path, engine='openpyxl') as ew:
ew.book = xl.load_workbook(xl_path)
df.to_excel(ew, sheet_name=tabNames)
Data is manipulated, now read it back.
Use Pandas to read the data into a DataFrame
df1 = pd.read_excel(xls, x, encoding='utf-8', header=None)
newDF = newDF[0].str.encode(encoding="utf-8")
byteString = newDF[0]
The following statement only allows equal size ByteStrings
data.write_stream(streams[setIndex], byteString)
ValueError: write_stream: data must be the same size as the existing stream
EDIT:
This question was answered by Decalage in the comments below.
Here is the code I used to solve my problem:
# mode and file_path are as in the answer below; stgRelays, storage_choice,
# set_compArr and byteString are defined elsewhere in my program
istorage = pythoncom.StgOpenStorageEx(file_path, mode, STGFMT_STORAGE, 0, pythoncom.IID_IStorage)
istorage1 = istorage.OpenStorage(stgRelays, None, mode, None, 0)
istorage2 = istorage1.OpenStorage(storage_choice, None, mode, None, 0)
for x in set_compArr:
    set_STM = x + '.TXT'
    istream = istorage2.OpenStream(set_STM, None, mode, 0)
    istream.Write(byteString)
A way to modify OLE/CFB files is to use pythoncom from the pywin32 extensions on Windows (and maybe Linux with WINE): https://github.com/mhammond/pywin32
First, open the OLE file using pythoncom.StgOpenStorageEx: http://timgolden.me.uk/pywin32-docs/pythoncom__StgOpenStorageEx_meth.html
Example:
import pythoncom
from win32com.storagecon import *
mode = STGM_READWRITE|STGM_SHARE_EXCLUSIVE
istorage = pythoncom.StgOpenStorageEx(filename, mode, STGFMT_STORAGE, 0, pythoncom.IID_IStorage)
Then use the methods of the PyIStorage object: http://timgolden.me.uk/pywin32-docs/PyIStorage.html
OpenStream returns a PyIStream object: http://timgolden.me.uk/pywin32-docs/PyIStorage__OpenStream_meth.html
You can use its methods to read, write and change the size of a stream: http://timgolden.me.uk/pywin32-docs/PyIStream.html
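Putting those pieces together, here is a minimal sketch of replacing a stream's contents with data of a different size (the file name and stream name are made-up placeholders; error handling omitted):
import pythoncom
from win32com.storagecon import *

mode = STGM_READWRITE | STGM_SHARE_EXCLUSIVE
istorage = pythoncom.StgOpenStorageEx('example.ole', mode, STGFMT_STORAGE, 0, pythoncom.IID_IStorage)
istream = istorage.OpenStream('SomeStream', None, mode, 0)  # placeholder stream name

new_data = b'replacement bytes of any length'
istream.SetSize(len(new_data))  # grow or shrink the stream to the new length first
istream.Write(new_data)

# dropping the COM references releases the objects and flushes the file
istream = None
istorage = None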

How to read a file in Groovy into a string, without knowing the path to the file?

As an extension to this question: is it possible to read a file into a string without knowing the path to the file? I only have the file as a 'def'/type-less parameter, which is why I can't just do a .getAbsolutePath().
To elaborate on this, this is how I import the file (which is from a temporary .jar file):
def getExportInfo(path) {
    def zipFile = new java.util.zip.ZipFile(new File(path))
    zipFile.entries().each { entry ->
        def name = entry.name
        if (!entry.directory && name == "ExportInfo") {
            return entry
        }
    }
}
A ZipEntry is not a file, but a ZipEntry.
Those have almost nothing in common.
With def is = zipFile.getInputStream(entry) you get the input stream to the zip entry contents.
Then you can use is.text to get the contents as String in the default platform encoding or is.getText('<theFilesEncoding>') to get the contents as String in the specified encoding, exactly the same as you can do on a File object.
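Putting that together with your getExportInfo method, a sketch along these lines should work (note it uses find rather than each, because a return inside an each closure only returns from the closure, not from the method):
def getExportInfoText(path) {
    def zipFile = new java.util.zip.ZipFile(new File(path))
    try {
        def entry = zipFile.entries().find { !it.directory && it.name == "ExportInfo" }
        // read the entry's contents as a string in the given encoding
        entry ? zipFile.getInputStream(entry).getText('UTF-8') : null
    } finally {
        zipFile.close()
    }
}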

Reopening a closed stringIO object in Python 3

So, I create a StringIO object to treat my string as a file:
>>> a = 'Me, you and them\n'
>>> import io
>>> f = io.StringIO(a)
>>> f.read(1)
'M'
And then I proceed to close the 'file':
>>> f.close()
>>> f.closed
True
Now, when I try to open the 'file' again, Python does not permit me to do so:
>>> p = open(f)
Traceback (most recent call last):
File "<pyshell#166>", line 1, in <module>
p = open(f)
TypeError: invalid file: <_io.StringIO object at 0x0325D4E0>
Is there a way to 'reopen' a closed StringIO object? Or should it be declared again using the io.StringIO() method?
Thanks!
I have a nice hack which I am currently using for testing (since my code can make I/O operations, giving it a StringIO is a nice workaround).
If this problem is a one-time thing:
from io import StringIO

st = StringIO()
close = st.close         # keep a reference to the real close
st.close = lambda: None  # make close a no-op for now
f(st)           # some function which can make I/O changes and finally closes st
st.getvalue()   # this is available now
close()
# if you don't want to store the close function, you can instead call:
# StringIO.close(st)
If this is a recurring thing, you can also define a context manager:
import contextlib

@contextlib.contextmanager
def uncloseable(fd):
    """
    Context manager which turns the fd's close operation to no-op for the duration of the context.
    """
    close = fd.close
    fd.close = lambda: None
    yield fd
    fd.close = close
which can be used in the following way:
st = StringIO()
with uncloseable(st):
    f(st)
# Now st is still open!!!
I hope this helps you with your problem, and if not, I hope you will find the solution you are looking for.
Note: This should work exactly the same for other file-like objects.
No, there is no way to re-open an io.StringIO object. Instead, just create a new object with io.StringIO().
Calling close() on an io.StringIO object throws away the "file contents" data, so re-opening couldn't give access to that anyway.
If you need the data, call getvalue() before closing.
See also the StringIO documentation here:
The text buffer is discarded when the close() method is called.
and here:
getvalue()
Return a str containing the entire contents of the buffer.
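For example, a minimal sketch of that pattern:
import io

f = io.StringIO('Me, you and them\n')
f.read(1)
contents = f.getvalue()  # save the buffer before closing
f.close()

p = io.StringIO(contents)  # a fresh object with the same contents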
The builtin open() creates a file object (i.e. a stream), but in your example, f is already a stream.
That's the reason why you get TypeError: invalid file.
After the method close() has executed, any stream operation will raise ValueError.
The documentation does not mention how to reopen a closed stream.
Maybe you should not close() the stream yet if you want to use (reopen) it again later.
When you call f.close(), you remove it from memory. You're basically doing a deref x followed by a call x: you're looking for a memory location that doesn't exist.
Here is what you could do instead:
import io
a = 'Me, you and them\n'
f = io.StringIO(a)
f.read(1)
f.close()
# Put the text from a, without the first char, into a new StringIO.
p = io.StringIO(a[1:])
# do some work with p
I think your confusion comes from thinking of io.StringIO as a file on the block device. If you had used open() and not StringIO, then you would be correct in your example and you could reopen the file. StringIO is not a file: it is the idea of a file object in memory. A file object has a buffer much like a StringIO does, but it also exists physically on the block device. A StringIO is just that buffer, a staging area in memory for the data within it. When you call open(), a buffer is created too, but the data also remains on the block device.
Perhaps this is more what you want
fo = open('f.txt', 'w+')
fo.write('Me, you and them\n')
fo.seek(0)  # rewind: after write() the position is at the end of the file
fo.read(1)
fo.close()
# reopen the now closed file
p = open('f.txt', 'r')
# do stuff with p
p.close()
Here we are writing the string to the block device, so that when we close the file, the information written to it remains after it is closed. Because this creates a file in the directory the program is run in, it may be a good idea to give the file an extension. For example, you could name the file f.txt instead of f.

Why does scala hang evaluating a by-name parameter in a Future?

The below (contrived) code attempts to print a by-name String parameter within a future, and return when the printing is complete.
import scala.concurrent._
import concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

class PrintValueAndWait {
  def printIt(param: => String): Unit = {
    val printingComplete = future {
      println(param); // why does this hang?
    }
    Await.result(printingComplete, Duration.Inf)
  }
}

object Go {
  val str = "Rabbits"
  new PrintValueAndWait().printIt(str)
}

object RunMe extends App {
  Go
}
However, when running RunMe, it simply hangs while trying to evaluate param. Changing printIt to take in its parameter by-value makes the application return as expected. Alternatively, changing printIt to simply print the value and return synchronously (in the same thread) seems to work fine also.
What's happening exactly here? Is this somehow related to the Go object not having been fully constructed yet, and so the str field not being visible yet to the thread attempting to print it? Is hanging the expected behaviour here?
I've tested with Scala 2.10.3 on both Mac OS Mavericks and Windows 7, on Java 1.7.
Your code is deadlocking on the initialization of the Go object. This is a known issue; see e.g. SI-7646 and this SO question.
Objects in Scala are lazily initialized, and a lock is taken during this time to prevent two threads from racing to initialize the object. However, if two threads simultaneously try to initialize an object and one depends on the other to complete, there will be a circular dependency and a deadlock.
In this particular case, the initialization of the Go object can only complete once new PrintValueAndWait().printIt(str) has completed. However, when param is a by name argument, essentially a code block gets passed in which is evaluated when it is used. In this case the str argument in new PrintValueAndWait().printIt(str) is shorthand for Go.str, so when the thread the future runs on tries to evaluate param it is essentially calling Go.str. But since Go hasn't completed initialization yet, it will try to initialize the Go object too. The other thread initializing Go has a lock on its initialization, so the future thread blocks. So the first thread is waiting on the future to complete before it finishes initializing, and the future thread is waiting for the first thread to finish initializing: deadlock.
In the by value case, the string value of str is passed in directly, so the future thread doesn't try to initialize Go and there is no deadlock.
Similarly, if you leave param as by name, but change Go as follows:
object Go {
  val str = "Rabbits"

  {
    val s = str
    new PrintValueAndWait().printIt(s)
  }
}
it won't deadlock, since the already evaluated local string value s is passed in, instead of Go.str, so the future thread won't try and initialize Go.
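If you cannot change Go, another way around it (a sketch, keeping param by name) is to force the parameter on the calling thread, which already holds the initialization lock, before handing the value to the future:
class PrintValueAndWait {
  def printIt(param: => String): Unit = {
    val value = param // evaluated on the caller's thread, so the future never touches Go's initializer
    val printingComplete = future {
      println(value)
    }
    Await.result(printingComplete, Duration.Inf)
  }
}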

Linux - Named pipes - losing data

I am using named pipes for IPC. At times the data sent between the processes can be large and frequent. During these times I see lots of data loss. Are there any obvious problems in the code below that could cause this?
Thanks
#!/usr/bin/env groovy
import java.io.FileOutputStream;
def bytes = new File('/etc/passwd').bytes
def pipe = new File('/home/mohadib/pipe')
1000.times {
    def fos = new FileOutputStream(pipe)
    fos.write(bytes)
    fos.flush()
    fos.close()
}
#!/usr/bin/env groovy
import java.io.FileInputStream;
import java.io.ByteArrayOutputStream;
def pipe = new File('/home/mohadib/pipe')
def bos = new ByteArrayOutputStream()
def len = -1
byte[] buff = new byte[8192]
def i = 0
while (true) {
    def fis = new FileInputStream(pipe)
    while ((len = fis.read(buff)) != -1) bos.write(buff, 0, len)
    fis.close()
    bos.reset()
    i++
    println i
}
Named pipes lose their contents when the last process closes them. In your example, this can happen if the writer process does another iteration while the reader process is about to do fis.close(). No error is reported in this case.
A possible fix is to arrange that the reader process never closes the fifo. To get rid of the EOF condition when the last writer disconnects, open the fifo for writing, close the read end, reopen the read end and close the temporary write end.
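One way to sketch that in your reader script (untested outline; error handling omitted): open the fifo for reading first, then also open it for writing and hold on to that descriptor, so the fifo always has at least one writer and read() blocks instead of returning EOF:
def pipe = new File('/home/mohadib/pipe')
def bos = new ByteArrayOutputStream()
byte[] buff = new byte[8192]

def fis = new FileInputStream(pipe)        // blocks until the first writer connects
def keepAlive = new FileOutputStream(pipe) // dummy write end: the fifo never loses its last writer

def len
while ((len = fis.read(buff)) != -1) {     // with the dummy writer open, read() blocks when idle
    bos.write(buff, 0, len)
    // process bos here as needed
}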
This section gives me worries:
1000.times {
    def fos = new FileOutputStream(pipe)
    fos.write(bytes)
    fos.flush()
    fos.close()
}
I know that the underlying Unix write() system call does not always write the requested number of bytes. You have to check the return value to see what number was actually written.
I checked the docs for Java, and it appears fos.write() has no return value; it just throws an IOException if anything goes wrong. What does Groovy do with exceptions? Are there any exceptions happening?
If you can, run this under strace and view the results of the read and write system calls. It's possible that the Java VM isn't doing the right thing with the write() system call. I know this can happen because I caught glibc's fwrite implementation doing that (ignoring the return value) two years ago.
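For example (the script names are placeholders for your writer and reader):
strace -f -e trace=open,read,write,close -o writer.trace groovy writer.groovy
strace -f -e trace=open,read,write,close -o reader.trace groovy reader.groovy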
