NiFi ConvertExcelToCSV - Excel file with several sheets

I have an Excel file that has several sheets. The ConvertExcelToCSV processor cannot convert it. How do I solve this error?
org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor.error
IOException thrown from ConvertExcelToCSVProcessor[id=a0aa3e82-1d3c-18c0-b1a7-b9d4429d5958]:
java.io.IOException: Zip bomb detected! The file would exceed the max.
ratio of compressed file size to the size of the expanded data. This
may indicate that the file is used to inflate memory usage and thus
could pose a security risk. You can adjust this limit via
ZipSecureFile.setMinInflateRatio() if you need to work with files
which exceed this limit. Uncompressed size: 358510, Raw/compressed
size: 3584, ratio: 0.009997 Limits: MIN_INFLATE_RATIO: 0.010000,
Entry: xl/styles.xml
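For context, this limit comes from Apache POI's zip-bomb protection, and the message itself points at ZipSecureFile.setMinInflateRatio(). Stock NiFi may not expose that setting on the processor, so the following is only a minimal standalone Java sketch of the POI call the message refers to (the file name is hypothetical); in NiFi you would need to run equivalent code before the workbook is parsed, e.g. in a custom or scripted processor.
import org.apache.poi.openxml4j.util.ZipSecureFile;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileInputStream;
import java.io.InputStream;

public class RelaxedExcelRead {
    public static void main(String[] args) throws Exception {
        // Lower POI's minimum compression ratio (the MIN_INFLATE_RATIO of 0.01 in the error)
        // so that highly compressed but legitimate parts such as xl/styles.xml are accepted.
        ZipSecureFile.setMinInflateRatio(0.001);

        try (InputStream in = new FileInputStream("multi-sheet.xlsx");   // hypothetical file name
             XSSFWorkbook workbook = new XSSFWorkbook(in)) {
            System.out.println("Sheets: " + workbook.getNumberOfSheets());
        }
    }
}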

Related

Increase the maximum file size to be ingested in the Azure Data Explorer (ADX, Kusto)

I am currently ingesting multiple TB of data into the DB of an Azure Data Explorer cluster (ADX aka Kusto DB). In total, I iterate over about 30k files. Some of them are a few kB, but some are many GB.
With some big files I am running into errors due to their file sizes:
FailureMessage(
{
...
"Details":"Blob size in bytes: '4460639075' has exceeded the size limit allowed for ingestion ('4294967296' B)",
"ErrorCode":"BadRequest_FileTooLarge",
"FailureStatus":"Permanent",
"OriginatesFromUpdatePolicy":false,
"ShouldRetry":false
})
Is there anything I can do to increase the allowed ingestion size?
There's a non-configurable 4 GB limit.
You should split your source file(s), ideally so that each file holds between 100 MB and 1 GB of uncompressed data; a rough splitting sketch follows below.
see: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/api/netfx/kusto-ingest-best-practices#optimizing-for-throughput
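As an illustration only, here is a minimal Java sketch that splits a large line-based (e.g. CSV) file into roughly 1 GB parts before ingestion. The file names and the 1 GB target are assumptions, not ADX requirements beyond the guidance above.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SplitForIngestion {
    // Aim for roughly 1 GB of uncompressed data per output file.
    private static final long MAX_PART_BYTES = 1L << 30;

    public static void main(String[] args) throws Exception {
        Path source = Paths.get("big-input.csv");   // hypothetical input file
        int part = 0;
        long written = MAX_PART_BYTES;              // forces a new part before the first line
        BufferedWriter out = null;

        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (written >= MAX_PART_BYTES) {
                    if (out != null) out.close();
                    out = Files.newBufferedWriter(
                            Paths.get(String.format("big-input.part%03d.csv", part++)),
                            StandardCharsets.UTF_8);
                    written = 0;
                }
                out.write(line);
                out.newLine();
                written += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
        } finally {
            if (out != null) out.close();
        }
    }
}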

Cassandra batch prepared statement size warning

I see this warning continuously in debug.log in Cassandra:
WARN [SharedPool-Worker-2] 2018-05-16 08:33:48,585 BatchStatement.java:287 - Batch of prepared statements for [test, test1] is of size 6419, exceeding specified threshold of 5120 by 1299.
where:
6419 - input payload size (batch)
5120 - threshold size
1299 - byte size above the threshold value
As per this ticket in Cassandra, https://github.com/krasserm/akka-persistence-cassandra/issues/33, I see that it is due to the increase in the input payload size, so I increased commitlog_segment_size_in_mb in cassandra.yaml to 60 MB and we are no longer seeing this warning.
Is this warning harmful? Will increasing commitlog_segment_size_in_mb affect performance in any way?
This is not directly related to the commit log size, and I wonder why changing it made the warning disappear...
The batch size threshold is controlled by the batch_size_warn_threshold_in_kb parameter, which defaults to 5 KB (5120 bytes).
You can increase this parameter to a higher value, but you really need a good reason for using batches - it would be nice to understand the context of their usage...
commitlog_segment_size_in_mb represents your block size for commit log archiving or point-in-time backup. These are only active if you have configured archive_command or restore_command in your commitlog_archiving.properties file.
The default size is 32 MB.
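For reference, a minimal cassandra.yaml sketch of the two settings mentioned here; the values shown are the defaults discussed above, not recommendations:
# Batch warning threshold (5 KB = the 5120-byte threshold in the warning)
batch_size_warn_threshold_in_kb: 5
# Commit log segment size (default 32 MB)
commitlog_segment_size_in_mb: 32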
As per the Expert Apache Cassandra Administration book:
you must ensure that the value of commitlog_segment_size_in_mb is twice the value of max_mutation_size_in_kb.
You can also take a look at this related question:
Mutation of 17076203 bytes is too large for the maxiumum size of 16777216

Size on disk of a partly filled HDF5 dataset

I'm reading the book Python and HDF5 (O'Reilly) which has a section on empty datasets and the size they take on disk:
import numpy as np
import h5py
f = h5py.File("testfile.hdf5")
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB
dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB
After filling part (first 1024 entries) of the dataset with values, I expected the file to grow, but not to 4GB. It's essentially the same size as when I do:
dset[...] = np.arange(1024**3)
The book states that the file size on disk should be around 66KB. Could anyone explain what the reason is for the sudden size increase?
Version info:
Python 3.6.1 (OSX)
h5py 2.7.0
If you open your file in HdfView you can see that chunking is off. This means that the array is stored in one contiguous block of memory in the file and cannot be resized. Thus all 4 GB must be allocated in the file.
If you create your dataset with chunking enabled, the dataset is divided up into regularly-sized pieces which are stored at arbitrary locations on disk and indexed using a B-tree. In that case only the chunks that contain (at least one element of) data are allocated on disk. If you create your dataset as follows, the file will be much smaller:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
Passing chunks=True lets h5py determine the size of the chunks automatically. You can also set the chunk size explicitly. For example, to set it to 16384 floats (= 64 KiB), use:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,) )
The best chunk size depends on the reading and writing patterns of your applications. Note that:
Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage

Mutation of 17076203 bytes is too large for the maxiumum size of 16777216

I have "commitlog_segment_size_in_mb: 32" in the cassandra settings but the error below indicates maximum size is 16777216, which is about 16mb. Am I looking at the correct setting for fixing the error below?
I am referring to this setting based on the suggestion provided at http://mail-archives.apache.org/mod_mbox/cassandra-user/201406.mbox/%3C53A40144.2020808#gmail.com%3E
I am using Cassandra 2.1.0-2.
I am using KairosDB, and the write buffer max size is 0.5 MB.
WARN [SharedPool-Worker-1] 2014-10-22 17:31:03,163 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-1,5,main]: {}
java.lang.IllegalArgumentException: Mutation of 17076203 bytes is too large for the maxiumum size of 16777216
at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:216) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:203) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:371) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:351) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.db.Mutation.apply(Mutation.java:214) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:54) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.0.jar:2.1.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_67]
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:163) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:103) [apache-cassandra-2.1.0.jar:2.1.0]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
You are looking at the correct parameter in your .yaml. The maximum write size C* will allow is half of commitlog_segment_size_in_mb; the default is 32 MB, so the default maximum write size is 16 MB.
Background
commitlog_segment_size_in_mb represents your block size for commit log archiving or point-in-time backup. These are only active if you have configured archive_command or restore_command in your commitlog_archiving.properties file.
It's the correct setting. Cassandra discards the write because it exceeds 50% of the configured commit log segment size.
Set commitlog_segment_size_in_mb: 64 in cassandra.yaml on each node in the cluster and restart each node for the change to take effect.
Cause:
By design, the maximum allowed mutation size is 50% of the configured commitlog_segment_size_in_mb, so that Cassandra avoids writing segments with large amounts of empty space.
To elaborate, up to two 32 MB mutations can fit into a 64 MB segment, whereas a 40 MB mutation would only fit once, leaving a larger amount of unused space.
Reference link from DataStax:
https://support.datastax.com/hc/en-us/articles/207267063-Mutation-of-x-bytes-is-too-large-for-the-maxiumum-size-of-y-
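As a concrete illustration of the change suggested above (the value follows the answer; tune it for your own workload), the relevant cassandra.yaml line would look like:
# Doubling the segment size raises the maximum allowed mutation size to ~32 MB,
# since Cassandra caps mutations at 50% of the commit log segment size.
commitlog_segment_size_in_mb: 64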

Large Zip file is corrupted on 32 bit JDK

We are creating a single Zip file from multiple files using ZipOutputStream (on a 32-bit JDK).
If we create the Zip file from 5 PDF files (each 1 GB), it produces a corrupt Zip file. If we create it from 4 PDF files (each 1 GB), it produces a correct Zip file.
Is there any limitation on Zip file size on a 32-bit JDK?
The original ZIP format had a number of limits (uncompressed size of a file, compressed size of a file, and total size of the archive), all at 4 GB.
More info: http://en.wikipedia.org/wiki/ZIP_(file_format)
The original zip format had a 4 GiB limit on various things (uncompressed size of a file, compressed size of a file and total size of the archive), as well as a limit of 65535 entries in a zip archive. In version 4.5 of the specification (which is not the same as v4.5 of any particular tool), PKWARE introduced the "ZIP64" format extensions to get around these limitations, increasing the limitation to 16 EiB (2^64 bytes).
The File Explorer in Windows XP does not support ZIP64, but the Explorer in Windows Vista does. Likewise, some libraries, such as DotNetZip and IO::Compress::Zip in Perl, support ZIP64. Java's built-in java.util.zip supports ZIP64 from version Java 7.
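As a minimal sketch, assuming Java 7 or later (so that java.util.zip writes ZIP64 records automatically when the classic 4 GiB / 65535-entry limits are exceeded) and hypothetical file names:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class BigZip {
    public static void main(String[] args) throws Exception {
        String[] inputs = {"a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"};  // hypothetical 1 GB PDFs

        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("bundle.zip"))) {
            byte[] buffer = new byte[64 * 1024];
            for (String name : inputs) {
                zos.putNextEntry(new ZipEntry(name));
                try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(name))) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        zos.write(buffer, 0, read);
                    }
                }
                zos.closeEntry();
            }
        }
    }
}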
