In Apache NiFi I can take an input with compressed data, unpack it using the UnpackContent processor, and then connect the output to further record processing or whatever comes next.
Is it possible to operate directly on the compressed input? In a normal programming environment, one might easily wrap the record processor in a container that more or less transparently unpacks the data in a stream-processing fashion.
If this is not supported out of the box, would it be reasonable to implement a processor that extends, for example, ConvertRecord to accept compressed input?
The motivation for this is to work efficiently with large CSV data files, converting them into a binary record format without having to spill the uncompressed CSV data to disk.
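To illustrate what I mean by wrapping things in a normal programming environment, here is a minimal Scala sketch (using Scala 2.13's scala.util.Using; the file name and the naive CSV parsing are just placeholders) that streams a gzip-compressed CSV through a decompressing reader without ever spilling the uncompressed data to disk:

import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream
import scala.util.Using

// Stream-decompress a gzipped CSV and process records on the fly,
// without writing the uncompressed bytes to disk.
Using.resource(new BufferedReader(new InputStreamReader(
    new GZIPInputStream(new FileInputStream("trades.csv.gz")), "UTF-8"))) { reader =>
  Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach { line =>
    val fields = line.split(',')   // naive CSV split, for illustration only
    // ... convert `fields` to the target binary record format here ...
  }
}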
Compressed input for record processing is not currently supported, but it's a great idea for an improvement.
Instead of implementing it in a particular processor (e.g. ConvertRecord), I'd suggest the following two approaches:
Create CompressedRecordReaderFactory implementing RecordReaderFactory
Like Java's compressed stream classes such as GZIPInputStream, a CompressedRecordReaderFactory would wrap another RecordReaderFactory; the user specifies the compression type (or the reader factory might auto-detect it, for example by looking at FlowFile attributes).
The benefit of this approach is that once it is added, compressed input can be read by any existing RecordReader and any processor using the Record API, not only CSV but also XML, JSON, etc.; see the sketch after these two approaches.
Wrap the InputStream in each RecordReaderFactory (e.g. CSVReader)
We could implement the same thing in each RecordReaderFactory and add support for compressed input gradually.
This may provide a better UX because no additional ControllerService has to be configured.
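To make the first approach concrete, here is a rough Scala sketch of the decorator idea. The traits below are deliberately simplified stand-ins, not the actual NiFi interfaces; the real RecordReaderFactory is a ControllerService with a richer API:

import java.io.InputStream
import java.util.zip.GZIPInputStream

// Simplified stand-ins for NiFi's record API, for illustration only.
trait SimpleRecordReader { def nextRecord(): Option[Map[String, AnyRef]] }
trait SimpleRecordReaderFactory { def createRecordReader(in: InputStream): SimpleRecordReader }

// The CompressedRecordReaderFactory idea: decorate any other reader factory and hand it a
// decompressing stream, so every record format (CSV, JSON, XML, ...) gains gzip support for free.
class CompressedRecordReaderFactory(delegate: SimpleRecordReaderFactory)
    extends SimpleRecordReaderFactory {
  // The compression type could be a configured property, or auto-detected
  // from FlowFile attributes such as the filename extension.
  override def createRecordReader(in: InputStream): SimpleRecordReader =
    delegate.createRecordReader(new GZIPInputStream(in))
}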
What do you think? For further discussion, I suggest creating a NiFi JIRA ticket. If you're willing to contribute, that would be even better.
Related
If my raw data is in CSV format and I would like to store it in the Bronze layer as Delta tables then I would end up with four layers like Raw+Bronze+Silver+Gold. Which approach should I consider?
A bit of an open question; however, with respect to retaining the "raw" data in CSV, I would normally recommend it, as storing this data is usually cheap relative to the utility of being able to re-process it if there are problems, or for the purposes of data audit/traceability.
I would normally take the approach of compressing the raw files after processing, perhaps tar-balling them, and additionally moving them to colder/cheaper storage.
I’m new on Spark and I’m trying to understand if it can fit my use case.
I have the following scenario.
I have a file (it can be a log file, .txt, .csv, .xml or .json; I can produce the data in whatever format I prefer) with some data, e.g.:
Event “X”, City “Y”, Zone “Z”
with different events, cities and zones. This data can be represented by a string (like the one I wrote) in a .txt, or by XML, CSV, or JSON, as I wish. I can also send this data through a TCP socket if I need to.
What I really want to do is to correlate each single entry with other similar entries by declaring rules.
For example, I want to declare some rules on the data flow: if I receive event X1 and event X2 in the same city and the same zone, I want to do something (execute a .bat script, write a log file, etc.). The same applies if I receive the same string multiple times, or whatever rule I want to produce with these data strings.
I’m trying to understand if Apache Spark can fit my use case. The only input data will be these strings from this file.
Can I correlate these events and how? Is there a GUI to do it?
Any hints and advice will be appreciated.
Yes, it can:
// assuming an existing SparkSession named `spark` and a CSV with a header row containing
// columns X (event), Y (city) and Z (zone), matching the placeholders from the question
import org.apache.spark.sql.functions.collect_list
import spark.implicits._ // for the $-notation and Dataset encoders

spark.read.option("header", "true").csv("your_file")
  .groupBy($"Y", $"Z")                  // group by city and zone
  .agg(collect_list($"X").as("events")) // all events seen for that city/zone
  .as[(String, String, Seq[String])]
  .filter(r => r._3.contains("X1") && r._3.contains("X2"))
  .foreach(r => {
    // do something with the relevant records
  })
There isn't really a GUI to speak of, unless you consider notebook-type software a GUI; you'd be writing code either way.
Apache Spark is very powerful but has a bit of a learning curve. It's easy to start running in local mode for learning, but you won't have a performance benefit unless your data size requires you to scale to multiple nodes, and that comes with lots of admin overhead.
It's clear and well documented that the ability to split zip files has a big impact on the performance and parallelisation of jobs within Hadoop.
However Azure is built upon Hadoop and there is no mention of this impact anywhere that I can find in the Microsoft documentation.
Is this not an issue for ADL?
Is, for example, GZipping large files an acceptable approach now, or am I going to run into the same issues of being unable to parallelise my jobs due to the choice of compression codec?
Thanks
Please note that Azure Data Lake Analytics is not based on Hadoop.
RojoSam is correct that GZip is a bad compression format to parallelize over.
U-SQL does recognize .gz files automatically and decompresses them. However, there is a 4GB limit on the size of the compressed file (since we cannot split and parallelize processing it), and we recommend that you use files in the range of a few hundred MB to 1GB.
We are working on adding Parquet support. If you need other compression formats such as BZip: please file a request at http://aka.ms/adlfeedback.
It is not possible to start reading a GZip file from a random position; you always have to start reading from the beginning.
So if you have a big GZip file (or another non-splittable compression format), you cannot read/process blocks of it in parallel, and you end up processing the whole file sequentially on a single machine.
The main idea of Hadoop (and other big data alternatives) relies on processing data in parallel on different machines. A big GZip file doesn't fit that approach.
There are some data formats, like Parquet, that allow compressing data pages using GZip while keeping the file splittable (each page can be processed on a different machine, but each GZip block still has to be processed on a single machine).
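For example, with Spark one could pay the sequential read once and re-encode the data as Parquet so that later jobs parallelize; a minimal sketch, assuming an existing SparkSession named `spark` and made-up paths:

// Read the non-splittable gzipped CSV once (sequentially), then rewrite it as Parquet
// with gzip-compressed pages so that later jobs can process it in parallel.
spark.read.option("header", "true").csv("/data/big_input.csv.gz")
  .write.option("compression", "gzip")
  .parquet("/data/big_input.parquet")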
I am working with some legacy systems in the investment banking domain which are very unfriendly, in the sense that the only way to extract data from them is through a file export/import. Lots of trading takes place and a large number of transactions are stored on these systems.
The question is how to read a large number of large files on NFS and dump them onto a system where analytics can be done with something like Spark or Samza.
Back to the issue: due to the nature of the legacy systems, we are extracting data and dumping it into files. Each file is hundreds of gigabytes in size.
I feel the next step is to read these and dump them to Kafka or HDFS, or maybe even Cassandra or HBase, the reason being that I need to run some financial analytics on this data. I have two questions:
How do I efficiently read a large number of large files which are located on one or several machines?
Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...
IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See: https://www-03.ibm.com/systems/z/os/zos/apache-spark.html My understanding is that z/OS can be a peer with other machines in a Spark cluster.
The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.
On the topic of using Hadoop with z/OS, there's a number of references out there, including an IBM Redpaper: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf
You'll note that in there they make mention of using the Co:Z toolkit, which I believe is available for free.
However you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.
But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.
Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.
The simplest way to do it is zconnector, an IBM product for data ingestion from the mainframe to a Hadoop cluster.
I managed to find an answer. The biggest bottleneck is that reading a file is essentially a serial operation; that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears the best way is to make sure that the source the data is coming from dumps files into multiple NFS folders. From that point onward I can run multiple processes to load data into HDFS or Kafka, since those are highly parallelized.
How to load it? One good way is to mount the NFS into the Hadoop infrastructure and use distcp. Other possibilities open up once we make sure files are available from a large number of NFS mounts. Otherwise, remember that reading a file is a serial operation. Thanks.
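To make the "one serial reader per file, many files in parallel" point concrete, here is a rough Scala sketch of a loader (the NFS mount path, topic name and broker address are made up, and a real loader would need batching, error handling and back-pressure):

import java.io.File
import java.util.Properties
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object NfsToKafkaLoader extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092") // made-up broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  // Each individual file is still read serially, but many files spread across
  // NFS folders can be loaded concurrently, one task per file.
  val files = new File("/mnt/nfs/export").listFiles().toList.filter(_.isFile)
  val loads = files.map { file =>
    Future {
      val producer = new KafkaProducer[String, String](props)
      val source = Source.fromFile(file)
      try {
        source.getLines().foreach { line =>
          producer.send(new ProducerRecord[String, String]("legacy-trades", file.getName, line))
        }
      } finally {
        source.close()
        producer.close()
      }
    }
  }
  Await.result(Future.sequence(loads), Duration.Inf)
}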
I am curious whether VoltDB compresses the data on disk/at rest.
If it does, what is the algorithm used, and are there options for third-party compression methods (e.g. a proprietary video-stream compression algorithm where some loss is permitted)?
VoltDB uses Snappy compression when writing Snapshots to disk. Snappy is an algorithm optimized for speed, but it still has pretty good compression. There aren't any options for configuring or customizing a different compression method.
Data stored in VoltDB (e.g. when you insert records) is stored 100% in RAM and is not compressed. There is a sizing worksheet built in to the web interface that can help estimate the RAM required based on the specific datatypes of the tables, and whatever indexes you may define.
One of the datatypes that is supported is VARBINARY which stores byte arrays, i.e. any binary data. You could store pre-compressed data in VARBINARY columns, or use a third-party java compression library within stored procedures to compress and decompress inputs. There is a maximum size limit of 1MB per column, and 2MB per record, however a procedure could store larger sized binary data by splitting it across multiple records. There is a maximum size of 50MB for the inputs to or the results from a stored procedure. You could potentially store and retrieve larger sized binary data using multiple transactions.
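For instance, a client could gzip a payload before writing it into a VARBINARY column along these lines; this is only a sketch, and the table name in the final comment is hypothetical:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Gzip a payload client-side before storing it in a VARBINARY column
// (the compressed result must still fit the 1MB-per-column limit).
def gzip(raw: Array[Byte]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz  = new GZIPOutputStream(bos)
  try gz.write(raw) finally gz.close()
  bos.toByteArray
}

// Decompress again after reading it back (readAllBytes requires Java 9+).
def gunzip(compressed: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try in.readAllBytes() finally in.close()
}

// The Array[Byte] returned by gzip(...) would then be passed as the VARBINARY parameter of an
// insert call, e.g. client.callProcedure("VIDEO_CHUNKS.insert", id, gzip(frameBytes)),
// where VIDEO_CHUNKS is a hypothetical table with a VARBINARY column.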
I saw you posted the same question in our forum, if you'd like to discuss more back and forth, that is the best place. We'd also like to talk to you about your solution, so if you like I can contact you at the email address from your VoltDB Forum account.