Binary file conversion in distributed manner - Spark, Flume or any other option? - apache-spark

We have a scenario with a continuous incoming stream of binary files (ASN.1 type, to be exact). We want to convert these binary files to a different format, say XML or JSON, and write them to a different location. I was wondering what the best architectural design would be to handle this kind of problem. I know we could use a Spark cluster for CSV, JSON, and Parquet files, but I'm not sure whether we could use it for binary file processing. Alternatively, we could use Apache Flume to move files from one place to another and even use an interceptor to convert the contents.
Ideally, we could switch the ASN.1 decoder whenever we have performance concerns (for example, to a C++-, Python-, or Java-based decoder library) without changing the underlying distributed-processing framework.

In terms of scalability, reliability, and future-proofing your solution, I'd look at Apache NiFi rather than Flume. You could start by developing your own ASN.1 processor, or try the patch that's already available but not yet part of a released version.
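If you do end up prototyping the Spark route the question mentions, Spark can read whole binary files with binaryFiles and hand them to whatever decoder you choose, which keeps the decoder pluggable. A minimal PySpark sketch, where decode_asn1 and the paths are placeholders for your own decoder library and locations:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("asn1-convert").getOrCreate()
sc = spark.sparkContext

def decode_asn1(raw_bytes):
    """Placeholder for the real ASN.1 decoder.

    Swapping the decoder (a Java, Python, or C++-backed library) only
    changes this function; the surrounding pipeline stays the same.
    Here we just emit the payload size so the sketch runs end to end.
    """
    return json.dumps({"size_bytes": len(raw_bytes)})

# binaryFiles yields (file_path, file_contents_as_bytes) pairs
files = sc.binaryFiles("/data/incoming/*.ber")   # path is an assumption

# Decode every file and write the JSON out to a different location
files.mapValues(decode_asn1) \
     .map(lambda kv: kv[1]) \
     .saveAsTextFile("/data/decoded-json")
```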

Related

Distributed Rules Engine

We have been using Drools engine for a few years now, but our data has grown, and we need to find a new distributed solution that can handle a large amount of data.
We have complex rules that look over a few days of data, which is why Drools was a great fit for us: we just had our data in memory.
Do you have any suggestions for something similar to Drools but distributed/scalable?
I did some research on the matter, but I couldn't find anything that answers our requirement.
Thanks.
Spark can apply Drools rules to data faster than traditional single-node applications. The reference architecture for a Drools-Spark integration could be along the following lines. In addition, HACEP is a scalable and highly available architecture for Drools Complex Event Processing; HACEP combines Infinispan, Camel, and ActiveMQ. Please refer to the following article for more on HACEP with Drools.
You can find a reference implementation of Drools - Spark integration in the following GitHub repository.
In my experience, Drools can be applied efficiently even to huge volumes of data (some tuning may be needed depending on your requirements), and it integrates easily with Apache Spark. Loading your rule file into memory for Spark processing takes very little memory, and Drools can be used with both Spark Streaming and Spark batch jobs.
See my complete article for reference and give it a try.
An alternative might be JESS:
JESS implements the Rete engine and accepts rules in multiple formats, including CLIPS and XML.
Jess uses an enhanced version of the Rete algorithm to process rules. Rete is a very efficient mechanism for solving the difficult many-to-many matching problem.
Jess has many unique features including backwards chaining and working memory queries, and of course Jess can directly manipulate and reason about Java objects. Jess is also a powerful Java scripting environment, from which you can create Java objects, call Java methods, and implement Java interfaces without compiling any Java code.
Try it yourself.
Maybe this could be helpful to you. It is a new project developed as part of the Drools ecosystem. https://github.com/kiegroup/openshift-drools-hacep
It seems Databricks is also working on a rules engine, so if you are using the Databricks version of Spark, it's something to look into.
https://github.com/databrickslabs/dataframe-rules-engine
Take a look at https://www.elastic.co/blog/percolator
What you can do is convert your rules into Elasticsearch queries. You can then percolate your data against the percolator, which will return the rules that match the provided data.
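As a rough illustration of that flow, here is a minimal sketch using the elasticsearch Python client against Elasticsearch 7.x; the index name, field names, and the sample rule are made up for the example:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# An index whose documents are stored queries (the "rules")
es.indices.create(index="rules", body={
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "amount": {"type": "double"},   # field the rules refer to
        }
    }
})

# Store one rule as a query document
es.index(index="rules", id="high-value", body={
    "query": {"range": {"amount": {"gt": 1000}}}
})
es.indices.refresh(index="rules")

# Percolate an incoming event: which rules match it?
result = es.search(index="rules", body={
    "query": {"percolate": {"field": "query", "document": {"amount": 2500}}}
})
for hit in result["hits"]["hits"]:
    print("matched rule:", hit["_id"])
```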

Connecting to the logical replication/streaming from node or go?

Is there a way to connect/subscribe to Postgres logical replication/streaming replication using Node or Go? I know it's a TCP/IP connection, but I don't know exactly where to start. I also know there is a package for this; I was wondering about a more vanilla solution, for the sake of understanding.
I'm not certain what you want, but maybe you are looking for “logical decoding”.
If you want to directly speak the replication protocol with the server, you'll have to implement it in your code, but that information is pretty useless, as it only contains the physical changes to the data files.
If you want logical decoding, there is the test_decoding module provided by PostgreSQL, and here are some examples of how it can be used.
Mind that test_decoding is for testing. For real-world use cases, you will want to use a logical decoding plugin that fits your needs, for example wal2json.
If that's what you want to consume, you'll have to look up the documentation for the logical decoding plugin you want to use to learn the format in which it provides the information.
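The question asks about Node or Go, but to illustrate the overall flow, here is a minimal sketch of consuming wal2json output over the streaming replication protocol with Python's psycopg2; the connection string and slot name are assumptions, and wal2json must be installed on the server. Any client library that speaks the replication protocol follows the same steps: create a slot, start replication, consume messages, and acknowledge progress.

```python
import psycopg2
import psycopg2.extras

# Connect as a role with REPLICATION privilege (connection string is an assumption)
conn = psycopg2.connect(
    "dbname=mydb user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create a logical replication slot bound to the wal2json output plugin
cur.create_replication_slot("demo_slot", output_plugin="wal2json")

# Start streaming changes from the slot, decoding payloads as text
cur.start_replication(slot_name="demo_slot", decode=True)

def consume(msg):
    # msg.payload is a JSON string describing the change set
    print(msg.payload)
    # Acknowledge so the server can recycle WAL up to this point
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

# Blocks and invokes consume() for every incoming message
cur.consume_stream(consume)
```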

Design application using Apache Spark

It's a bit of an architectural question. I need to design an application using Spark and Scala as the primary tools, and I want to minimise manual intervention as much as possible.
I will receive a zip containing multiple files with different structures as input at a regular interval, say daily. I need to process it using Spark and, after transformations, move the data to a back-end database.
I wanted to understand the best way to design the application.
What would be the best way to process the zip?
Can Spark Streaming be considered an option, given the frequency of the files?
What other options should I take into consideration?
Any guidance would be really appreciated.
It's a broad question; there are batch options and stream options, and I'm not sure of your exact requirements. You may start your research here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
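To make the batch option concrete, one common way to handle zip archives is to read them as whole binary files and unpack them in a flatMap. A rough PySpark sketch (the same binaryFiles-plus-java.util.zip pattern translates directly to Scala); the paths and the downstream write are assumptions:

```python
import io
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-zip-ingest").getOrCreate()
sc = spark.sparkContext

def unpack(path_and_bytes):
    """Yield (inner_file_name, text) pairs for every entry in one zip archive."""
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            yield name, zf.read(name).decode("utf-8", errors="replace")

# Each daily drop is assumed to land under /data/incoming/<date>/*.zip
archives = sc.binaryFiles("/data/incoming/*/*.zip")
entries = archives.flatMap(unpack)

# From here, branch on the inner file name/structure, build a DataFrame per
# structure, apply the transformations, and write to the back-end database
# (for example via the JDBC data source with df.write.jdbc(...)).
```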

Bulk load XML files into Cassandra

I'm looking into using Cassandra to store 50M+ documents that I currently have in XML format. I've been hunting around but I can't seem to find anything I can really follow on how to bulk load this data into Cassandra without needing to write some Java (not high on my list of language skills!).
I can happily write a script to convert this data into any format if it would make the loading easier although CSV might be tricky given the body of the document could contain just about anything!
Any suggestions welcome.
Thanks
Si
If you're willing to convert the XML to a delimited format of some kind (e.g., CSV), then here are a couple of options:
The COPY command in cqlsh. This actually got a big performance boost in a recent version of Cassandra.
The cassandra-loader utility. This is a lot more flexible and has a bunch of different options you can tweak depending on the file format.
If you're willing to write code other than Java (for example, Python), there are Cassandra drivers available for a bunch of programming languages. No need to learn Java if you've got another language you're better with.
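For example, here is a minimal sketch with the DataStax Python driver that converts each XML file into a row; the keyspace, table schema, and XML element names are assumptions for illustration:

```python
# pip install cassandra-driver
import glob
import xml.etree.ElementTree as ET
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")        # keyspace name is an assumption

# Hypothetical table: documents(id text PRIMARY KEY, title text, body text)
insert = session.prepare(
    "INSERT INTO documents (id, title, body) VALUES (?, ?, ?)"
)

for path in glob.glob("docs/*.xml"):           # hypothetical input directory
    root = ET.parse(path).getroot()
    session.execute(insert, (
        root.findtext("id"),
        root.findtext("title"),
        root.findtext("body"),                 # body can hold arbitrary text
    ))

cluster.shutdown()
```

For 50M+ documents you'd want to parallelise the inserts with execute_async or cassandra.concurrent.execute_concurrent_with_args rather than issuing one synchronous insert at a time.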

Spark on a single node: speed improvement

Is there any sense to use Spark (in particular, MLlib) on a single node (besides the goal of learning this technology)?
Is there any improvement in speed?
Are you comparing this to using a non-Spark machine learning system?
It really depends on the capabilities of the other library you might use.
If, for example, you've got all your training data stored in Parquet files, then Spark makes it very easy to read those files in and work with them, whether that's on 1 machine or 100.
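For what it's worth, running Spark on a single node is just a matter of setting the master to local[*]. A minimal sketch reading Parquet into an MLlib model, where the path and column names (a "features" vector column and a "label" column) are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = (SparkSession.builder
         .master("local[*]")                 # use all cores of the single node
         .appName("single-node-mllib")
         .getOrCreate())

# Assumed schema: a vector column "features" and a numeric "label" column
train = spark.read.parquet("data/train.parquet")

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients)

spark.stop()
```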
