Performance Impact with FileOutputStream using OpenCSV

We are using OpenCSV
(http://opencsv.sourceforge.net/apidocs/au/com/bytecode/opencsv/CSVWriter.html)
to write a report from a file with XML content.
There are two ways to go about this:
i) Write using FileOutputStream:
FileOutputStream fos = new FileOutputStream(file);
OutputStreamWriter osr = new OutputStreamWriter(fos);
writer = new CSVWriter(osr);
ii) Write using BufferedWriter:
BufferedWriter out = new BufferedWriter(new FileWriter(file));
writer = new CSVWriter(out);
Does anybody know how the performance of writing this report is affected by choosing one option over the other?
To my understanding, OpenCSV does not care as long as it gets a Writer it can use.
The difference in performance would come from the step before that, where the writer is created from the file.
What is the performance impact of using OutputStreamWriter versus BufferedWriter?

After running some benchmarks with Google Caliper, it appears that the BufferedWriter option is the fastest (but there's really not much of a difference, so I'd just use the option that you're comfortable with).
How to interpret results:
The FileOutputStreamWriter scenario corresponds with option i
The BufferedWriter scenario corresponds with option ii
The FileWriter scenario is one I added which just uses a plain old FileWriter.
Each benchmark was run 3 times: writing 1000, 10,000, and 100,000 rows.
The tests were run on Linux Mint, i5-2500k (1.6GHz) CPU, 8GB RAM, with Oracle JDK7 (writing to a SATA green HDD). Results would vary with a different setup, but this should be good for comparison purposes.
  rows               benchmark      ms  linear runtime
  1000  FileOutputStreamWriter    6.10  =
  1000          BufferedWriter    5.89  =
  1000              FileWriter    5.96  =
 10000  FileOutputStreamWriter   50.55  ==
 10000          BufferedWriter   50.71  ==
 10000              FileWriter   51.64  ==
100000  FileOutputStreamWriter  525.13  =============================
100000          BufferedWriter  505.05  ============================
100000              FileWriter  535.20  ==============================
FYI opencsv wraps the Writer you give it in a PrintWriter.
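Since OpenCSV just wraps whatever Writer it is given, you can also combine explicit charset control with buffering. A minimal sketch, assuming opencsv 2.x; the class, the writeReport method, the UTF-8 choice, and the row-writing loop are illustrative, not from the original post:
import au.com.bytecode.opencsv.CSVWriter;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ReportWriter {
    public static void writeReport(File file, List<String[]> rows) throws IOException {
        // OutputStreamWriter gives explicit control over the encoding;
        // BufferedWriter adds buffering on top of it.
        Writer w = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8));
        CSVWriter csv = new CSVWriter(w);
        try {
            for (String[] row : rows) {
                csv.writeNext(row);
            }
        } finally {
            csv.close(); // also closes the wrapped Writer and the underlying stream
        }
    }
}
Either way, the benchmark above suggests the choice of wrapper barely matters for this workload.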

Related

Loading data from a file in GCS to GCP firestore

I have written a script which loops through each record in the file and writes it to the Firestore collection.
Firestore Schema {COLLECTION.DOCUMENT.SUBCOLLECTION.DOCUMENT.SUBCOLLECTION}
'{"KEY":"1234","DATE":"2022-10-10","SUB_COLLECTION":{"KEY":1234,"SUB_DOC":{"KEY1" : :"VAL1"}}'
'{"KEY":"1235","DATE":"2022-10-10","SUB_COLLECTION":{"KEY":1235,"SUB_DOC":{"KEY1" : :"VAL1"}}'
'{"KEY":"1236","DATE":"2022-10-10","SUB_COLLECTION":{"KEY":1236,"SUB_DOC":{"KEY1" : :"VAL1"}}'
...
The file is read with the line below:
read_file = filename.download_as_string()
and then converted to a list of strings:
fire_client = firestore.Client(project=PROJECT)
dict_str = read_file.decode("UTF-8")
dict_str = dict_str.split('\n')
for i in range(0, len(dict_str) - 1):
    rec = json.loads(dict_str[i])
    doc_ref = fire_client.collection('STATIC_COLLECTION_NAME').document(rec['KEY'])
    doc_ref.set({"KEY": int(rec['KEY']), "DATE": rec['DATE']})
    sub_ref = doc_ref.collection('STATIC_SUB_COLLECTION_NAME').document('STATIC_SUB_DOC_NAME')
    sub_ref.set(rec['SUB_COLLECTION'])
However, this job takes hours to complete for a file of around 100 MB. Is there a way to do multiple writes at a time, for example batch-processing X records from the file and writing them to X documents and sub-collections in Firestore?
I am looking for a way to make this more efficient instead of looping over millions of records one by one; my current script ended up with:
503 The datastore operation timed out, or the data was temporarily unavailable.
You'll want to use the bulk_writer to accumulate and send writes to Firestore.
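For illustration, a minimal sketch of that approach with the google-cloud-firestore Python client (assuming its 2.x BulkWriter API; PROJECT, read_file, and the collection/field names are taken from the question):
import json
from google.cloud import firestore

fire_client = firestore.Client(project=PROJECT)
bulk = fire_client.bulk_writer()  # queues writes and sends them in batches with retries

for line in read_file.decode("UTF-8").splitlines():
    if not line.strip():
        continue
    rec = json.loads(line)
    doc_ref = fire_client.collection('STATIC_COLLECTION_NAME').document(rec['KEY'])
    bulk.set(doc_ref, {"KEY": int(rec['KEY']), "DATE": rec['DATE']})
    sub_ref = doc_ref.collection('STATIC_SUB_COLLECTION_NAME').document('STATIC_SUB_DOC_NAME')
    bulk.set(sub_ref, rec['SUB_COLLECTION'])

bulk.close()  # flushes any remaining queued writes and waits for them to complete
This keeps the per-document structure from the question but lets the client coalesce the writes instead of issuing one blocking set() per document.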

Contents of large file getting corrupted while reading records sequentially

I have a file with around 85 million JSON records, and the file size is around 110 GB. I want to read from this file in batches of 1 million records (in sequence). I am reading the file line by line with a scanner and appending records until I have a batch of 1 million. Here is the gist of what I am doing:
var rawBatch []string
batchSize := 1000000
file, err := os.Open(filePath)
if err != nil {
    // error handling
}
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    rec := string(scanner.Bytes())
    rawBatch = append(rawBatch, rec)
    if len(rawBatch) == batchSize {
        for i := 0; i < batchSize; i++ {
            var tRec parsers.TRecord
            err := json.Unmarshal([]byte(rawBatch[i]), &tRec)
            if err != nil {
                // Error thrown here
            }
        }
        // process
        rawBatch = nil
    }
}
file.Close()
Sample of a correct record:
type TRecord struct {
    Key1 string `json:"key1"`
    Key2 string `json:"key2"`
}
{"key1":"15","key2":"21"}
The issue I am facing is that while reading these records, some of them come out corrupted, for example a colon changed to a semicolon, or a double quote changed to '#'. I am getting this error:
Unable to load Record: Unable to load record in:
{"key1":#15","key2":"21"}
invalid character '#' looking for beginning of value
Some observations:
Once we start reading, the contents of the file itself get corrupted.
For every batch of 1 million, I saw 1 (or at most 2) records getting corrupted. Out of 84 million records, a total of 95 were corrupted.
My code works fine for a file of around 42 GB (23 million records). With a larger data file, it behaves erroneously.
':' is changing to ';', double quotes to '#', and spaces to '!'. Each of these pairs differs by a single bit in its binary representation. Could we be hitting some accidental bit manipulation?
Any ideas on why this is happening? And how can I fix it?
Details:
Go version used: go1.15.6 darwin/amd64
Hardware details: Debian GNU/Linux 9.12 (stretch), 224 GB RAM, 896 GB hard disk
As suggested by @icza in the comments:
That occasional, very rare 1 bit change suggests hardware failure (memory, processor cache, hard disk). I do recommend to test it on another computer.
I tested my code on some other machines, and it runs perfectly fine there. It looks like this occasional, rare bit change, caused by a hardware fault, was the source of the issue.
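As a side note, the single-bit observation above is easy to verify: ':' (0x3A) and ';' (0x3B), '"' (0x22) and '#' (0x23), and ' ' (0x20) and '!' (0x21) each differ only in the least-significant bit. A standalone sketch, not part of the original program:
package main

import "fmt"

func main() {
    // Each pair of characters seen in the corruption differs by exactly one bit.
    pairs := [][2]byte{{':', ';'}, {'"', '#'}, {' ', '!'}}
    for _, p := range pairs {
        fmt.Printf("%q vs %q: xor = %08b\n", p[0], p[1], p[0]^p[1])
    }
}
That every corrupted character is exactly one bit off is what makes a hardware fault far more plausible than a bug in the reading code, consistent with the resolution above.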

How to play a single audio file from a collection of 100k audio files split into two folders?

I have placed two media folders into a single zip file, which contains 100k media files in total. I need to play a single file from a particular folder inside the zip. The problem is that the entire contents of the zip are read before the required file is accessed, so it takes more than 55 seconds to play a single file. I need a solution that reduces the time taken to play the audio files.
Below is my code:
long lStartTime = System.currentTimeMillis();
System.out.println("Time started");
String filePath = File.separator + "sdcard" + File.separator + "Android" + File.separator + "obb"
        + File.separator + "com.mobifusion.android.ldoce5" + File.separator
        + "main.9.com.mobifusion.android.ldoce5.zip";
FileInputStream fis = new FileInputStream(filePath);
ZipInputStream zis = new ZipInputStream(fis);
String zipFileName = "media" + File.separator + "aus" + File.separator + fileName;
String usMediaPath = "media" + File.separator + "auk" + File.separator;
ZipEntry entry;
int UnzipCounter = 0;
while ((entry = zis.getNextEntry()) != null) {
    UnzipCounter++;
    System.out.println(UnzipCounter);
    if (entry.getName().endsWith(zipFileName)) {
        File Mytemp = File.createTempFile("TCL", "mp3", getActivity().getCacheDir());
        Mytemp.deleteOnExit();
        FileOutputStream fos = new FileOutputStream(Mytemp);
        for (int c = zis.read(); c != -1; c = zis.read()) {
            fos.write(c);
        }
        if (fos != null) {
            mediaPlayer = new MediaPlayer();
        }
        fos.close();
        FileInputStream MyFile = new FileInputStream(Mytemp);
        mediaPlayer.setDataSource(MyFile.getFD());
        mediaPlayer.prepare();
        mediaPlayer.start();
        mediaPlayer.setOnCompletionListener(this);
        long lEndTime = System.currentTimeMillis();
        long difference = lEndTime - lStartTime;
        System.out.println("Elapsed milliseconds: " + difference);
        mediaPlayer.setOnErrorListener(this);
    }
    zis.closeEntry();
}
zis.close();
Try not to re-unzip the file, because that consumes too much time.
Instead of re-unzipping it every time, you can do the following (a sketch of the extract-once idea is shown after the links below):
Unzip the file the first time the app is launched, then set a flag (using preferences) recording that the app has been launched before.
On the next launch, check the flag. If the app has never been launched before, go to the first step. If it has, find the already-extracted file and play it.
If you really can't use those steps because of your requirements, you can try using the TrueZIP VFS. But be aware that I've never used it before.
Here are the libraries:
https://truezip.java.net/, https://github.com/jruesga/android_external_libtruezip
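A minimal sketch of that extract-once idea, assuming standard Android APIs (SharedPreferences, ZipInputStream); the class name, preference keys, and target directory are illustrative, not from the original answer:
import android.content.Context;
import android.content.SharedPreferences;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class MediaExtractor {
    private static final String PREFS_NAME = "media_prefs";        // illustrative key names
    private static final String KEY_EXTRACTED = "media_extracted";

    // Unzips the archive into app storage on the first call only; later calls just return the directory.
    public static File ensureMediaExtracted(Context context, File zipFile) throws IOException {
        File mediaDir = new File(context.getFilesDir(), "media");
        SharedPreferences prefs = context.getSharedPreferences(PREFS_NAME, Context.MODE_PRIVATE);
        if (prefs.getBoolean(KEY_EXTRACTED, false)) {
            return mediaDir; // already extracted on a previous launch
        }
        byte[] buffer = new byte[8192];
        ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile));
        try {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                File out = new File(mediaDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(out));
                int n;
                while ((n = zis.read(buffer)) != -1) {
                    bos.write(buffer, 0, n); // copy in chunks rather than byte by byte
                }
                bos.close();
                zis.closeEntry();
            }
        } finally {
            zis.close();
        }
        prefs.edit().putBoolean(KEY_EXTRACTED, true).apply();
        return mediaDir;
    }
}
On later launches the desired file can be opened directly from the returned directory (for example via MediaPlayer.setDataSource(new File(mediaDir, relativePath).getPath())), so the zip is never scanned again.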

Apache-POI / XSSF - Read big file (5 MB)

I have a big file with 10,000 rows... Opening it takes an eternity. After 10 minutes I stopped the program.
OPCPackage opcPackage = OPCPackage.open(item.getFilePath());
workbook = new XSSFWorkbook(opcPackage);
sheet = workbook.getSheet(item.sheet);
evaluator = new XSSFFormulaEvaluator(workbook);
itRows = sheet.rowIterator();
The line itRows = sheet.rowIterator(); never finishes; it takes far too much time.
How can I read such a big file?

MessagePack slower than native node.js JSON

I just installed node-msgpack and tested it against native JSON. MessagePack is much slower. Anyone know why?
Using the authors' own benchmark...
node ~/node_modules/msgpack/bench.js
msgpack pack: 4165 ms
msgpack unpack: 1589 ms
json pack: 1352 ms
json unpack: 761 ms
I'll assume you're talking about https://github.com/pgriess/node-msgpack.
Just looking at the source, I'm not sure how it could be. For example in src/msgpack.cc they have the following:
Buffer *bp = Buffer::New(sb._sbuf.size);
memcpy(Buffer::Data(bp), sb._sbuf.data, sb._sbuf.size);
In Node terms, they are allocating and filling a new SlowBuffer for every request. You can benchmark the allocation part by doing the following:
var msgpack = require('msgpack');
var SB = require('buffer').SlowBuffer;
var tmpl = {'abcdef' : 1, 'qqq' : 13, '19' : [1, 2, 3, 4]};
console.time('SlowBuffer');
for (var i = 0; i < 1e6; i++)
    new SB(20); // 20 is the resulting size of their "DATA_TEMPLATE"
console.timeEnd('SlowBuffer');
console.time('msgpack.pack');
for (var i = 0; i < 1e6; i++)
    msgpack.pack(tmpl);
console.timeEnd('msgpack.pack');
console.time('stringify');
for (var i = 0; i < 1e6; i++)
    JSON.stringify(tmpl);
console.timeEnd('stringify');
// result - SlowBuffer: 915ms
// result - msgpack.pack: 5144ms
// result - stringify: 1524ms
So just allocating memory for the message already costs about 60% of the stringify time. That's just one reason why it's so much slower.
Also take into account that JSON.stringify has gotten a lot of love from Google. It's highly optimized and would be difficult to beat.
I decided to benchmark all the popular Node.js modules for MessagePack binary encoding, along with the PSON (protocol JSON) encoding library, against JSON; the results are as follows:
JSON fastest for encoding unless it includes a binary array
msgpack second fastest normally and fastest when including a binary array
msgpack-js - consistently second to msgpack
pson - consistently slower than msgpack-js
msgpack5 - dog slow always
I have published the benchmarking repository and detailed results at https://github.com/mattheworiordan/nodejs-encoding-benchmarks
