Replace LogStash with Spark Streaming - apache-spark

My requirement is to read log data from multiple machines.
LogStash - As far as i understand, LogStash agents to be installed on all the machines and LogStash can push data to Kafka as and when it arrives i.e. even if a new line is added to a file, LogStash reads only that not the entire file again.
Questions
Now i it possible to achieve the same with Spark Streaming?
If So, whats the advantage\disadvantage of using Spark Streaming over
LogStash?

LogStash agents to be installed on all the machines
Yes, you need some agent on all machines. The solution in the ELK stack is actually FileBeat, not Logstash agents. Logstash is more of a server/message-bus in this scenario.
Similarly, some Spark job would need running to read a file. Personally, I would have anything else tail-ing a log file (even literally just tail -f file.log piping out a network socket). Needing to write and distribute a Spark JAR + config files is a clear disadvantage. Especially when you need to have Java installed on each of those machines you are collecting logs on.
Flume or Fluentd are other widely used options for distributed log collection with Kafka destinations
LogStash can push data to Kafka
The Beats framework has a Kafka Output, but you can also ship to Logstash first.
It's not clear if you are using LogStash purely for Kafka, or also using ElasticSearch here, but Kafka Connect provides a file-source (and Elasticsearch output).
reads only that not the entire file again
Whatever tool you use (including Spark Streaming's File source) will typically be watching directories of files (because if you aren't rotating log files, you're doing it wrong). As files come in, or bytes written to a file, that framework will need to commit some type of marker internally to indicate what elements have been consumed so far. To reset the agent, this metadata should be able to be removed/reset to start from the beginning

Related

integration of csv file with flume vs spark

I have a project, is to integrate a CSV files from servers of partners to our Hadoop cluster.
To do that I found Flume and Spark can do it.
I know that Spark is preferred when you need to perform data transformations.
My question is what's the difference between Flume and Spark in integration logic?
Is there a performance difference between them in importing CSV files?
Flume is a constantly running process that watches paths or executes functions on files. It is more comparable to Logstash or Fluentd because it's config file driven, not programmed as well as deployed and tuned.
Preferably, you would parse said CSV files while you are reading them, then covert to a more self-describing format such as Avro, then put it into HDFS. See Morphlines Flume processors
Spark on the other hand, you'd have to manually write all that code from end to end. While Spark Streaming can do the same thing, you generally would not run it the same way as Flume, rather you run in within YARN or other clustered scheduler, where you have no control which server it's running on because at the end of the day, you should only care if there's resource constraints.
Other alternatives still exist such as Apache Nifi or Streamsets, which allow more visual pipeline building rather than writing code

ELK apache spark application log

How to configure Filebeats to read apache spark application log. The logs generated is moved to history server, in non readable format as soon as the application is completed. What is the ideal way here.
You can configure Spark logging via Log4J. For a discussion around some edge cases for setting up log4j configuration, see SPARK-16784, but if you simply want to collect all application logs coming off a cluster (vs logs per job) you shouldn't need to consider any of that.
On the ELK side, there was a log4j input plugin for logstash, but it is deprecated.
Thankfully, the documentation for the deprecated plugin describes how to configure log4j to write data locally for FileBeat, and how to set up FileBeat to consume this data and sent it to a Logstash instance. This is now the recommended way to ship logs from systems using log4j.
So in summary, the recommended way to get logs from Spark into ELK is:
Set the Log4J configuration for your Spark cluster to write to local files
Run FileBeat to consume from these files and sent to logstash
Logstash will send data into Elastisearch
You can search through your indexed log data using Kibana

how to stream from kafka to cassandra and increment counters

I have apache access log file and i want to store access counts (total/daily/hourly) of each page in a cassandra table.
I am trying to do it by using kafka connect to stream from log file to a kafka topic. In order to increment metrics counters in Cassandra can I use Kafka Connect again? Otherwise which other tool should be used here e.g. kafka streams, spark, flink, kafka connect etc?
You're talking about doing stream processing, which Kafka can do - either with Kafka's Streams API, or KSQL. KSQL runs on top of Kafka Streams, and gives you a very simple way to build the kind of aggregations that you're talking about.
Here's an example of doing aggregations of streams of data in KSQL
SELECT PAGE_ID,COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID
See more at : https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
You can take the output of KSQL which is actually just a Kafka topic, and stream that through Kafka Connect e.g. to Elasticsearch, Cassandra, and so on.
You mention other stream processing tools, they're valid too - depends in part on existing skills and language preferences (e.g. Kafka Streams is Java library, KSQL is … KSQL, Spark Streaming has Python as well as Java, etc), but also deployment preferences. Kafka Streams is just a Java library to deploy within your existing application. KSQL is deployable in a cluster, and so on.
This can be easily done with Flink, either as a batch or streaming job, and either with or without Kafka (Flink can read from files and write to Cassandra). This sort of time windowed aggregation is easily done with Flink's SQL api; see the examples here.

Apache Spark: Yarn logs Analysis

I am having a spark-streaming application, and I want to analyse the logs of the job using Elasticsearch-Kibana. My job is run on yarn cluster, so the logs are getting written to HDFS as I have set yarn.log-aggregation-enable to true. But, when I try to do this :
hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>
I am seeing some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use logstash to read this?
Also, if there is a better approach to analyse Spark logs, I am open to your suggestions.
Thanks.
The format is called a TFile, and it is a compressed file format.
Yarn however chooses to write the application logs into a TFile!! For those of you who don’t know what a TFile is (and I bet a lot of you don’t), you can learn more about it here, but for now this basic definition should suffice “A TFile is a container of key-value pairs. Both keys and values are type-less bytes”.
Splunk / Hadoop Rant
There may be a way to edit YARN and Spark's log4j.properties to send messages to Logstash using SocketAppender
However, that method is being deprecated

stack for loading log files into cassandra

I would like to periodically (hourly) load my application logs into Cassandra for analysis using pig.
How is this typically done? Are there project(s) that focus on this?
I see mumakil is commonly used to bulk-load data. I could write a cron job built around that, but was hoping for something more robust than the job I would whip-up.
I'm also willing to modify the applications to store the data in another format (like syslog or directly to Cassandra) if that is preferable. Though in that case I would be worried about data-loss should Cassandra be unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you are wanting to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge. This prevents the terrible IO performance you get from doing lots of disk seeks and network IO. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to have much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the storm cluster using storm-kafka
Log line is parsed and inserted into Cassandra using custom Storm bolts (It's extremely easy to write Storm bolts). There is also a storm-cassandra bolt already available.
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.

Resources