I have a project whose goal is to integrate CSV files from our partners' servers into our Hadoop cluster.
While researching this, I found that both Flume and Spark can do it.
I know that Spark is preferred when you need to perform data transformations.
My question is: what is the difference between Flume and Spark in terms of integration logic?
Is there a performance difference between them in importing CSV files?
Flume is a constantly running process that watches paths or executes functions on files. It is more comparable to Logstash or Fluentd, because it is driven by a config file rather than code that has to be written, deployed, and tuned.
Preferably, you would parse those CSV files while you are reading them, then convert them to a more self-describing format such as Avro, and then put them into HDFS. See the Morphlines Flume processors.
With Spark, on the other hand, you would have to write all of that code yourself, end to end. While Spark Streaming can do the same thing, you generally would not run it the same way as Flume; rather, you run it within YARN or another cluster scheduler, where you have no control over which server it runs on, because at the end of the day you should only care about whether there are resource constraints.
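For contrast, here is a minimal sketch of what that end-to-end Spark code might look like, following the suggestion above to convert CSV into Avro before landing it in HDFS. The paths, CSV options, and object name are made up for illustration, and writing Avro assumes the external spark-avro module is on the classpath (e.g. via --packages).

import org.apache.spark.sql.SparkSession

object CsvToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-avro-ingest")
      .getOrCreate()

    // Read the partner CSV files; supply an explicit schema in real use
    // if you need stable column types instead of inference.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///landing/partner_csv/*.csv")

    // Write out as Avro (a self-describing format), appending to the target directory.
    df.write
      .mode("append")
      .format("avro")
      .save("hdfs:///warehouse/partner_data/")

    spark.stop()
  }
}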
Other alternatives still exist, such as Apache NiFi or StreamSets, which allow building pipelines visually rather than writing code.
Related
Is there a way to run Spark or Flink on a distributed file system (say, Lustre) other than HDFS or S3?
We are able to create a distributed file system on a Unix cluster. Can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS provided every node has access to common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
Things like Lustre tend to do that, and/or have a specific Hadoop filesystem client library which wraps/extends that.
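As a rough illustration of the file:/// approach, here is a minimal Spark sketch, assuming a shared mount (the hypothetical /mnt/shared, e.g. a Lustre mount point) is visible at the same path on every node:

import org.apache.spark.sql.SparkSession

object SharedFsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-fs-example")
      .getOrCreate()

    // file:// URIs are resolved on each executor, so the path must exist
    // (and hold the same data) cluster-wide, which a shared mount provides.
    val lines = spark.read.textFile("file:///mnt/shared/source_data/")

    // Trivial transformation, then write results back to the shared filesystem.
    val nonEmpty = lines.filter(_.nonEmpty)
    nonEmpty.write.text("file:///mnt/shared/output/non_empty/")

    spark.stop()
  }
}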
I have custom C++ binaries which read a raw data file and write a derived data file. The files are around 100 GB in size. Moreover, I would like to process multiple 100 GB files in parallel and generate a materialized view of the derived metadata. Hence, the map-reduce paradigm seems more scalable.
I am a newbie in the Hadoop ecosystem. I have used Ambari to set up a Hadoop cluster on AWS. I have built my custom C++ binaries on every data node and loaded the raw data files into HDFS. What are my options for executing this binary on the HDFS files?
Hadoop streaming is the simplest way to run non-Java applications as MapReduce.
Refer to Hadoop Streaming for more details.
My requirement is to read log data from multiple machines.
LogStash - As far as I understand, LogStash agents have to be installed on all the machines, and LogStash can push data to Kafka as and when it arrives, i.e. even if a new line is added to a file, LogStash reads only that line, not the entire file again.
Questions
Now, is it possible to achieve the same with Spark Streaming?
If so, what are the advantages/disadvantages of using Spark Streaming over LogStash?
"LogStash agents have to be installed on all the machines"
Yes, you need some agent on all machines. The solution in the ELK stack is actually FileBeat, not Logstash agents. Logstash is more of a server/message-bus in this scenario.
Similarly, some Spark job would need to be running to read a file. Personally, I would have anything else tail-ing a log file (even literally just tail -f file.log piped out over a network socket). Needing to write and distribute a Spark JAR + config files is a clear disadvantage, especially when you need to have Java installed on each of the machines you are collecting logs from.
Flume and Fluentd are other widely used options for distributed log collection with Kafka destinations.
"LogStash can push data to Kafka"
The Beats framework has a Kafka Output, but you can also ship to Logstash first.
It's not clear if you are using LogStash purely for Kafka, or also using ElasticSearch here, but Kafka Connect provides a file-source (and Elasticsearch output).
"reads only that line, not the entire file again"
Whatever tool you use (including Spark Streaming's file source) will typically be watching directories of files (because if you aren't rotating log files, you're doing it wrong). As files come in, or bytes are written to a file, the framework needs to commit some type of marker internally to indicate which elements have been consumed so far. To reset the agent, this metadata should be removable so that consumption can start again from the beginning.
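As a concrete illustration of that marker/commit idea, here is a minimal sketch using Spark Structured Streaming's file source (paths are placeholders). The checkpointLocation is where Spark records which files it has already consumed; clearing it makes the query start over from the beginning of the directory.

import org.apache.spark.sql.SparkSession

object LogDirStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("log-dir-stream")
      .getOrCreate()

    // New files appearing in this directory are picked up automatically.
    val lines = spark.readStream.text("hdfs:///logs/incoming/")

    val query = lines.writeStream
      .format("parquet")
      .option("path", "hdfs:///logs/parsed/")
      // The internal "marker": which files/offsets have been consumed so far.
      .option("checkpointLocation", "hdfs:///logs/_checkpoints/")
      .start()

    query.awaitTermination()
  }
}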
While exploring various tools like NiFi, Gobblin, etc., I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark (Scala) based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs.
Now that we are planning to make our application available to clients, we expect any type and number of files (mainly CSV, JSON, XML, etc.) from any data source (FTP, SFTP, any relational or NoSQL database) of huge size (ranging from GBs to PBs).
Keeping this in mind, we are looking for options which could be used for data on-boarding and data sanity checks before pushing data into HDFS.
Options we are considering, in order of priority:
1) Spark for data ingestion and sanity checks: As our application is already written for and running on a Spark cluster, we plan to use it for the data ingestion and sanity tasks as well.
We are a bit worried about Spark's support for many data sources/file types, etc. Also, if we try to copy data from, say, an FTP/SFTP server, will all workers write data to HDFS in parallel? Is there any limitation in using it this way? Does Spark maintain any audit trail during such a data copy?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used for any data source and for any size of file? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GBs to PBs of data and perform certain sanity checks on that data before pushing it into HDFS?
3) Gobblin in clustered mode: I would like to hear answers to the same questions as for NiFi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering the fact that using NiFi or Gobblin would demand more infrastructure cost, I have started testing Spark for data on-boarding.
So far I have tried using a Spark job to import data (present at a remote staging area/node) into my HDFS, and I was able to do that by mounting that remote location on all my Spark cluster worker nodes. Doing this made that location local to those workers, hence the Spark job ran properly and the data was on-boarded into my HDFS.
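A minimal sketch of that approach, assuming the remote staging area is mounted at a hypothetical /mnt/staging path on every worker and the incoming files are CSV; the "sanity" step here simply drops rows that do not parse, and a real check would be more involved:

import org.apache.spark.sql.SparkSession

object OnboardFromStaging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("staging-onboard")
      .getOrCreate()

    // The mount makes the staging area look local to every worker,
    // so a plain file:// path can be read in parallel.
    val raw = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED") // discard rows that do not match the parsed schema
      .csv("file:///mnt/staging/incoming/*.csv")

    println(s"rows accepted after sanity check: ${raw.count()}")

    // Land the cleaned data in HDFS.
    raw.write
      .mode("append")
      .parquet("hdfs:///data/onboarded/")

    spark.stop()
  }
}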
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra. So far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost (where cost could be a major constraint), go for a Spark job for data on-boarding.
I would like to periodically (hourly) load my application logs into Cassandra for analysis using Pig.
How is this typically done? Are there project(s) that focus on this?
I see that mumakil is commonly used to bulk-load data. I could write a cron job built around that, but was hoping for something more robust than the job I would whip up.
I'm also willing to modify the applications to store the data in another format (like syslog, or directly to Cassandra) if that is preferable, though in that case I would be worried about data loss should Cassandra be unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you want to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge. This prevents the terrible IO performance you get from doing lots of disk seeks and network IO. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to have much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the Storm cluster using storm-kafka
Each log line is parsed and inserted into Cassandra using custom Storm bolts (it's extremely easy to write Storm bolts; a rough sketch follows this list). There is also a storm-cassandra bolt already available.
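For what it's worth, here is a rough Scala sketch of such a bolt, assuming a recent Storm release (org.apache.storm packages) and the DataStax Java driver rather than the storm-cassandra bolt mentioned above. The class name, contact point, keyspace, table, and "parsing" logic are all made up for illustration.

import com.datastax.driver.core.{Cluster, PreparedStatement, Session}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.Tuple

class LogToCassandraBolt extends BaseBasicBolt {
  // Lazily create one driver session per bolt instance (i.e. per executor JVM);
  // @transient keeps the non-serializable session out of the serialized bolt.
  @transient private lazy val session: Session =
    Cluster.builder().addContactPoint("cassandra-host").build().connect()
  @transient private lazy val insert: PreparedStatement =
    session.prepare("INSERT INTO logs.raw_lines (source, line) VALUES (?, ?)")

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    // Very naive "parse": treat the first whitespace-delimited token as the source host.
    val line = tuple.getString(0)
    val source = line.takeWhile(!_.isWhitespace)
    session.execute(insert.bind(source, line))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    // Terminal bolt: nothing is emitted downstream.
  }
}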
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.