Processing log files: Apache Storm or Spark

I have a requirement to process log file data. It is relatively trivial. I have 4 servers with 2 web applications running on each, for a total of 8 log files. These get rotated on a regular basis. I'm writing data into these log files in the following format:
Source Timestamp :9340398;39048039;930483;3940830
where the numbers are identifiers in a data store. I want to set up a process that reads these logs and, for each id, updates a count according to the number of times that id has been logged. It can be either real time or batch. My interface language to the data store is Java. The process runs in production, so it needs to be robust, but it also needs a relatively simple architecture so that it is maintainable. We also run ZooKeeper.
My initial thought was to do this as a batch whenever a log file is rotated, running an Apache Spark job on each server. However, I then got to looking at log aggregators such as Apache Flume, Kafka and Storm, but these seem like overkill.
Given the multitude of choices has anyone got any good suggestions as to which tools to use to handle this problem based on experience?

8 log files don't seem to warrant any "big data" technology. If you do want to play with or get started with this type of technology, I'd recommend starting with Spark and/or Flink. Both have relatively similar programming models and both can handle "business real-time" (Flink is better at streaming, but either would seem to work in your case). Storm is relatively rigid (it is hard to change topologies) and has a more complex programming model.
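To illustrate how little machinery the batch variant needs, here is a minimal plain-Java sketch that counts ids in rotated log files. The class and method names are made up for illustration, and it assumes the ids are the semicolon-separated values after the last colon on each line (so a timestamp containing colons does not confuse it) and that every line carries an id list. Writing the counts to the data store is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper, not part of any framework.
final class LogIdCounter {

    /** Adds the ids found on one log line to the running counts. */
    static void countLine(String line, Map<String, Long> counts) {
        int sep = line.lastIndexOf(':');          // assumption: ids follow the last colon
        if (sep < 0) {
            return;                               // no id list on this line
        }
        for (String id : line.substring(sep + 1).split(";")) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                counts.merge(trimmed, 1L, Long::sum);
            }
        }
    }

    /** Counts ids across any sequence of log lines. */
    static Map<String, Long> countLines(Iterable<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            countLine(line, counts);
        }
        return counts;
    }

    /** Counts ids in one rotated log file, streaming it line by line. */
    static Map<String, Long> countFile(Path logFile) throws IOException {
        try (Stream<String> lines = Files.lines(logFile)) {
            return countLines(lines::iterator);
        }
    }
}
```

A job triggered on rotation could call `countFile` for each rotated file and then apply the resulting increments to the data store in one pass.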

Can Kafka-Spark Streaming pair be used for both batch+real time data?

Hi all,
I am currently developing an architecture that should be able to handle both real-time and batch data (coming from disparate sources and point solutions, i.e. third-party tools). The existing architecture is old school and uses mostly RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines - one for batch data (sqoop/spark/hive) and the other for real-time data (kafka-spark stream).
But I have been told to use the kafka-spark streaming pair for handling all kinds of data.
If anyone has experience using the kafka-spark streaming pair for all kinds of data, could you please tell me briefly whether this could be a viable solution, and whether it is better than having two different pipelines?
Thanks in advance!
"What I have come up with is two different pipelines - one for batch data (sqoop/spark/hive) and the other for real-time data (kafka-spark stream)."
Pipeline 1: Sqoop is a good choice for batch loads, but it will be slow because the underlying architecture is still MapReduce. There are options to run Sqoop on Spark, but I haven't tried them. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. Having said that, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time. You can also do batch transformations in Spark. I would say this is a good solution.
Pipeline 2: Kafka and Spark Streaming are the most obvious choice, and a good one. But if you are using the Confluent distribution of Kafka, you could replace most of the Spark transformations with KSQL or Kafka Streams, which perform the transformations in real time.
I would say it's good to have one system for batch and a separate one for real time. This is what the lambda architecture is. But if you are looking for a more unified framework, you can try Apache Beam, which provides a unified framework for both batch and real-time processing. You can choose from multiple runners to execute your pipeline.
Hope this helps :)
Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.

Transparent Streaming & Batch processing

I'm still quite new to the world of stream and batch processing and am trying to understand the concepts and terminology. It is admittedly very possible that the answer to my question is well known, easy to find or even answered a hundred times here on SO, but I was not able to find it.
The background:
I am working in a big scientific project (nuclear fusion research), and we are producing tons of measurement data during experiment runs. Those data are mostly streams of samples tagged with a nanosecond timestamp, where a sample can be anything from a single-byte ADC value, through an array of such values or deeply structured data (with up to hundreds of entries, from 1-bit booleans to 64-bit double-precision floats), to raw HD video frames or even text messages. If I understand the common terminology correctly, I would regard our data as "tabular data", for the most part.
We mostly work with self-made software solutions, from data acquisition through simple online (streaming) analysis (like scaling, subsampling and such) to our own data storage, management and access facilities.
In view of the scale of the operation and the effort for maintaining all those implementations, we are investigating the possibilities to use standard frameworks and tools for more of our tasks.
My question:
In particular, at this stage we are facing the need for more and more sophisticated (automated and manual) data analytics on live/online/real-time data as well as "after the fact" offline/batch analytics of "historic" data. In this endeavor, I am trying to understand if and how existing analytics frameworks like Spark, Flink, Storm etc. (possibly supported by message queues like Kafka, Pulsar,...) can support a scenario where
data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Simply streaming the online data into storage and querying from there does not seem to be an option, as we need both raw and analysed data for live monitoring and real-time feedback control of the experiment.
Also, making the user query a live input signal differently from a historic batch in storage would not be ideal, as our physicists are mostly not data scientists and we would like to keep such "technicalities" away from them. Ideally, the exact same algorithms should be used for analysing new real-time data and old stored data from previous experiments.
Side notes:
we are talking about peak data loads in the range of tens of gigabits per second, coming in bursts of increasing length from seconds up to minutes - could this be handled by the candidates?
we are using timestamps in nanosecond resolution, and are even thinking about picoseconds - this poses some limitations on the list of possible candidates, if I understand correctly?
I would be very grateful if anyone could understand my question and shed some light on the topic for me :-)
Many Thanks and kind regards,
Beppo
I don't think anyone can say "yes, framework X can definitely handle your workload", because it depends a lot on what you need out of your message processing, e.g. regarding messaging reliability, and how your data streams can be partitioned.
You may be interested in "Benchmarking Distributed Stream Processing Engines". The paper is using versions of Storm/Flink/Spark that are a few years old (looks like they were released in 2016), but maybe the authors would be willing to let you use their benchmark to evaluate newer versions of the three frameworks?
A very common setup for streaming analytics is to go data source -> Kafka/Pulsar -> analytics framework -> long term data store. This decouples processing from data ingest, and lets you do stuff like reprocessing historical data as if it were new.
I think the first step for you should be to see if you can get the data volume you need through Kafka/Pulsar. Either generate a test set manually, or grab some data you think could be representative from your production environment, and see if you can put it through Kafka/Pulsar at the throughput/latency you need.
Remember to consider partitioning of your data. If some of your data streams could be processed independently (i.e. ordering doesn't matter), you should not be putting them in the same partitions. For example, there is probably no reason to mix sensor measurements and the video feed streams. If you can separate your data into independent streams, you are less likely to run into bottlenecks both in Kafka/Pulsar and the analytics framework. Separate data streams would also allow you to parallelize processing in the analytics framework much better, as you could run e.g. video feed and sensor processing on different machines.
Once you know whether you can get enough throughput through Kafka/Pulsar, you should write a small example for each of the 3 frameworks. To start, I would just receive and drop the data from Kafka/Pulsar, which should let you know early whether there's a bottleneck in the Kafka/Pulsar -> analytics path. After that, you can extend the example to do something interesting with the example data, e.g. do a bit of processing like what you might want to do in production.
You also need to consider which kinds of processing guarantees you need for your data streams. Generally you will pay a performance penalty for guaranteeing at-least-once or exactly-once processing. For some types of data (e.g. the video feed), it might be okay to occasionally lose messages. Once you decide on a needed guarantee, you can configure the analytics frameworks appropriately (e.g. disable acking in Storm), and try benchmarking on your test data.
Just to answer some of your questions more explicitly:
The live data analysis/monitoring use case sounds like it fits the Storm/Flink systems fairly well. Hooking it up to Kafka/Pulsar directly, and then doing whatever analytics you need sounds like it could work for you.
Reprocessing of historical data is going to depend on what kind of queries you need to do. If you simply need a time interval + id, you can likely do that with Kafka plus a filter or appropriate partitioning. Kafka lets you start processing at a specific timestamp, and if your data is partitioned by id, or you filter it as the first step in your analytics, you could start at the provided timestamp and stop processing when you hit a message outside the time window. This only applies if the timestamp you're interested in is when the message was added to Kafka, though. I also don't believe Kafka supports sub-millisecond resolution on the timestamps it generates.
If you need to do more advanced queries (e.g. you need to look at timestamps generated by your sensors), you could look at using Cassandra or Elasticsearch or Solr as your permanent data store. You will also want to investigate how to get the data from those systems back into your analytics system. For example, I believe Spark ships with a connector for reading from Elasticsearch, while Elasticsearch provides a connector for Storm. You should check whether such a connector exists for your data store/analytics system combination, or be willing to write your own.
Edit: Elaborating to answer your comment.
I was not aware that Kafka or Pulsar supported timestamps specified by the user, but sure enough, they both do. I don't see that Pulsar supports sub-millisecond timestamps though?
The idea you describe can definitely be supported by Kafka.
What you need is the ability to start a Kafka/Pulsar client at a specific timestamp, and read forward. Pulsar doesn't seem to support this yet, but Kafka does.
You need to guarantee that when you write data into a partition, they arrive in order of timestamp. This means that you are not allowed to e.g. write first message 1 with timestamp 10, and then message 2 with timestamp 5.
If you can make sure you write messages in order to Kafka, the example you describe will work. Then you can say "Start at timestamp 'last night at midnight'", and Kafka will start there. As live data comes in, it will receive it and add it to the end of its log. When the consumer/analytics framework has read all the data from last midnight to current time, it will start waiting for new (live) data to arrive, and process it as it comes in. You can then write custom code in your analytics framework to make sure it stops processing when it reaches the first message with timestamp 'tomorrow night'.
With regard to support of sub-millisecond timestamps, I don't think Kafka or Pulsar will support it out of the box, but you can work around it reasonably easily. Just put the sub-millisecond timestamp in the message as a custom field. When you want to start at e.g. timestamp 9ms 10ns, you ask Kafka to start at 9ms, and use a filter in the analytics framework to drop all messages between 9ms and 9ms 10ns.
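The workaround can be sketched in plain Java without any broker code. The `Sample` class, its fields and both method names are hypothetical. The sketch assumes messages are written to a partition in timestamp order (as required above) and that the sensor's nanosecond timestamp travels in the message as a custom field. `seekMillis` gives the value you would hand to the broker's "start at timestamp" lookup, and `window` is the filter step that would run in the analytics framework.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "seek in ms, filter in ns" workaround.
final class NanoWindowFilter {

    /** Hypothetical message shape: sensor timestamp in ns carried as a custom field. */
    static final class Sample {
        final long sensorNanos;
        final String payload;

        Sample(long sensorNanos, String payload) {
            this.sensorNanos = sensorNanos;
            this.payload = payload;
        }
    }

    /** Millisecond timestamp to ask the broker for: the start time rounded down. */
    static long seekMillis(long startNanos) {
        return Math.floorDiv(startNanos, 1_000_000L);
    }

    /**
     * Keeps the samples in [startNanos, endNanos) from a stream that begins at
     * the millisecond seek point and is ordered by sensor timestamp.
     */
    static List<Sample> window(Iterable<Sample> fromSeekPoint, long startNanos, long endNanos) {
        List<Sample> out = new ArrayList<>();
        for (Sample s : fromSeekPoint) {
            if (s.sensorNanos < startNanos) {
                continue;   // arrived between the ms seek point and the real start
            }
            if (s.sensorNanos >= endNanos) {
                break;      // ordered input: nothing later can fall inside the window
            }
            out.add(s);
        }
        return out;
    }
}
```

For the "9ms 10ns" example, `seekMillis(9_000_010L)` yields 9, and `window` then drops everything stamped before 9,000,010 ns. For an open-ended (live) window, `Long.MAX_VALUE` can serve as `endNanos`.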
Allow me to add the following suggestions on how Apache Pulsar might help address some of your requirements. Food for thought as it were.
"data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such"
You might want to look at Pulsar Functions, which allow you to write simple functions (in Java or Python) that get executed on each individual message published to a topic. They are ideal for this type of data augmentation use case.
"the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier"
Pulsar has recently added tiered storage, which allows you to retain event streams in S3, Azure Blob Store, or Google Cloud Storage. This would allow you to keep the data for years in a cheap and reliable data store.
"analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition"
Apache Pulsar has also added integration with the Presto query engine, which would allow you to query the data over a given time period (including data from tiered-storage) and place it into a topic for processing.

Streaming analytics using Apache Kafka

We are collecting streaming data from devices (Android, iOS). The data flow is websocket -> logstash -> kafka -> spark -> cassandra. The machine has 16 GB of RAM. Our app is built on an OTT platform, and while a video is streaming it sends events to Kafka for analytics purposes. The current situation is that memory fills up quickly when 4 or 5 videos are playing in parallel.
What might be the issue? Is it a configuration mistake? Is there a better approach for our requirement?
I'll answer your broad question with a broad answer.
Is Logstash / Kafka / Spark / Cassandra a 'correct' architecture?
There's nothing particularly wrong with that approach. It depends on what processing you're doing, and why you're landing it to Cassandra. You'll find plenty of people taking this approach, whilst others may use different stream processing e.g. Kafka Streams, as well as not always using a data store (since Apache Kafka persists data) - depends on what's consuming the data afterwards.
Can my system handle more than 10,000 user activities at a time with this architecture?
Yes. No. It depends, on way too many factors to give an answer. 10,000 users doing a simple activity with small volumes of data is hugely different from 10,000 users requiring complex processing on large volumes of data.
The only way to get an answer to this, and to evaluate your architectural choice in general, is to analyse the behaviour of your system as you increase [simulated] user numbers. Do particular bottlenecks appear that indicate the need for greater hardware scale, or even different technology choices?

How to read a large number of large files on NFS and dump them to HDFS

I am working with some legacy systems in the investment banking domain, which are very unfriendly in the sense that the only way to extract data from them is through a file export/import. Lots of trading takes place and a large number of transactions are stored on these systems.
The question is how to read a large number of large files on NFS and dump them onto a system on which analytics can be done by something like Spark or Samza.
Back to the issue. Due to the nature of the legacy systems, we are extracting data and dumping it into files. Each file is hundreds of gigabytes in size.
I feel the next step is to read these and dump them to Kafka or HDFS, or maybe even Cassandra or HBase, the reason being that I need to run some financial analytics on this data. I have two questions:
How to efficiently read a large number of large files which are located on one or numerous machines
Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...
IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See: https://www-03.ibm.com/systems/z/os/zos/apache-spark.html My understanding is that z/OS can be a peer with other machines in a Spark cluster.
The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.
On the topic of using Hadoop with z/OS, there are a number of references out there, including an IBM Redpaper: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf
You'll note that in there they make mention of using the CO:z toolkit, which I believe is available for free.
However you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.
But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.
Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.
The simplest way to do this better is zconnector, an IBM product for data ingestion from a mainframe into a Hadoop cluster.
I managed to find an answer. The biggest bottleneck is that reading a file is essentially a serial operation, since that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears the best way is to make sure that the source the data is coming from dumps its files into multiple NFS folders. From that point onward I can run multiple processes to load data into HDFS or Kafka, since they are highly parallelized.
How to load? One good way is to mount NFS into the Hadoop infrastructure and use distcp. There are other possibilities too, which open up once we make sure the files are available from a large number of NFS mounts. Otherwise remember, reading a file is a serial operation. Thanks.
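As a rough sketch of the "one serial reader per file, many files at once" idea, the following plain Java uses a fixed thread pool in which each task copies one whole file. The class and method names are made up, and a local target directory stands in for the real sink (HDFS or Kafka), whose client APIs are not shown here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical loader: each file is read serially by exactly one thread,
// while several files are processed concurrently.
final class ParallelFileLoader {

    static List<Path> copyAll(List<Path> sources, Path targetDir, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path src : sources) {
                // One task per file; the copy inside the task is a plain serial read.
                pending.add(pool.submit(() -> {
                    Path dst = targetDir.resolve(src.getFileName());
                    return Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }));
            }
            List<Path> copied = new ArrayList<>();
            for (Future<Path> f : pending) {
                copied.add(f.get());   // rethrows any I/O failure from the task
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps depends on the files living on different NFS mounts or disks. Several threads pulling from the same spindle can end up slower than a single serial read.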

Parallelism of Streams in Spark Streaming Context

I have multiple input sources (~200) coming in on Kafka topics - the data for each is similar, but each must be run separately because there are differences in schemas - and we need to perform aggregate health checks on the feeds (so we can't simply throw them all into 1 topic without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it only runs the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time - is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that hasn't worked as desired. Any design suggestions are also welcome, if there is no easy technical solution.
Thanks
The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default it is FIFO... it only worked for me once I explicitly wrote the file (couldn't set it programmatically for some reason).
