How to execute a custom C++ binary on HDFS files - apache-spark

I have custom C++ binaries that read a raw data file and write a derived data file. The files are around 100 GB each. Moreover, I would like to process multiple 100 GB files in parallel and generate a materialized view of the derived metadata. Hence, the map-reduce paradigm seems like the more scalable approach.
I am a newbie in the Hadoop ecosystem. I have used Ambari to set up a Hadoop cluster on AWS, built my custom C++ binaries on every data node, and loaded the raw data files onto HDFS. What are my options for executing this binary against the files on HDFS?

Hadoop Streaming is the simplest way to run non-Java applications as MapReduce jobs: your binary reads records from stdin and writes results to stdout, and Hadoop takes care of parallelizing the work over the files in HDFS.
Refer to the Hadoop Streaming documentation for more details.
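Since the question is also tagged apache-spark, a comparable option inside Spark is RDD.pipe(), which streams each partition through an external command via stdin/stdout. Below is a minimal PySpark sketch, not the asker's actual setup: the binary path /usr/local/bin/derive and the HDFS directories are placeholders, and it assumes the binary is installed at the same path on every worker and can read input lines from stdin and write output lines to stdout.

    # Minimal sketch: stream HDFS text data through a local C++ binary on each worker.
    # /usr/local/bin/derive is a hypothetical binary present on every node that
    # reads records from stdin and writes derived records to stdout.
    from pyspark import SparkContext

    sc = SparkContext(appName="pipe-through-cpp-binary")

    raw = sc.textFile("hdfs:///data/raw/")            # one or more large input files on HDFS
    derived = raw.pipe("/usr/local/bin/derive")       # each partition is piped through the binary
    derived.saveAsTextFile("hdfs:///data/derived/")   # write the derived output back to HDFS

If the binary only works on whole files rather than line-oriented streams, Hadoop Streaming with a suitable input format, or reworking the binary's I/O, would be needed instead.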

Related

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system, say Lustre, or anything other than HDFS or S3?
We are able to create a distributed file system framework using a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS provided every node has access to common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
Things like Lustre tend to do that, and/or have a specific Hadoop filesystem client library which wraps/extends that.
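As a small illustration of the file:/// approach, here is a hedged PySpark sketch; it assumes /shared/data is a placeholder path mounted identically on the driver and every worker (e.g. via NFS or Lustre):

    # Sketch: reading and writing through a shared mount instead of HDFS or S3.
    # /shared/data is assumed to be mounted at the same path on every node.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-fs-example").getOrCreate()

    df = spark.read.csv("file:///shared/data/input", header=True)  # read from the common path
    df.write.parquet("file:///shared/data/output")                 # write results back to the shared mount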

Integration of CSV files with Flume vs Spark

I have a project to integrate CSV files from our partners' servers into our Hadoop cluster.
I found that both Flume and Spark can do this.
I know that Spark is preferred when you need to perform data transformations.
My question is: what is the difference between Flume and Spark in terms of integration logic?
Is there a performance difference between them when importing CSV files?
Flume is a constantly running process that watches paths or executes functions on files. It is more comparable to Logstash or Fluentd, because it is config-file driven: it is not so much programmed as deployed and tuned.
Preferably, you would parse the CSV files while you are reading them, then convert them to a more self-describing format such as Avro, and then put them into HDFS. See the Morphlines Flume processors.
With Spark, on the other hand, you would have to write all of that code manually, end to end. While Spark Streaming can do the same thing, you generally would not run it the same way as Flume; rather, you run it within YARN or another cluster scheduler, where you have no control over which server it runs on, because at the end of the day you should only care whether there are resource constraints.
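To make the "write it yourself" route concrete, here is a minimal sketch of a Spark Structured Streaming job that watches a landing directory for CSV files and writes them out as Parquet on HDFS; the directory names and the two-column schema are made up for illustration (Parquet is used instead of Avro because it is a built-in sink):

    # Sketch: pick up CSV files dropped into a landing directory and write Parquet to HDFS.
    # Paths and schema are placeholders, not a real deployment.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("csv-ingest").getOrCreate()

    schema = StructType([StructField("id", StringType()),
                         StructField("value", StringType())])

    csv_stream = (spark.readStream
                  .schema(schema)                  # streaming file sources need an explicit schema
                  .csv("file:///landing/csv/"))    # directory the partner files arrive in

    query = (csv_stream.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/ingested/")
             .option("checkpointLocation", "hdfs:///checkpoints/csv-ingest/")
             .start())

    query.awaitTermination()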
Other alternatives exist, such as Apache NiFi or StreamSets, which allow more visual pipeline building rather than writing code.

How to set Cassandra as my Distributed Storage (File System) for my Spark Cluster

I am new to big data and Spark (pyspark).
Recently I set up a Spark cluster and wanted to use the Cassandra File System (CFS) on it to help upload files.
Can anyone tell me how to set it up and briefly introduce how to use CFS (e.g. how to upload files, and from where)?
By the way, I don't even know how to use HDFS (I downloaded the pre-built spark-bin-hadoop package, but I can't find Hadoop on my system).
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most distributed-file applications. It is primarily focused on being a substitute for HDFS for map/reduce jobs and for small, temporary but distributed files.
To use it, you just use the cfs:// URI and make sure you launch your application with dse spark-submit.
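As a small, hedged illustration (assuming a DSE cluster and a file already placed in CFS at a made-up path), reading it from PySpark only differs from HDFS in the URI scheme:

    # Sketch: reading a file that already lives in CFS; the path is a placeholder.
    # The script itself would be launched with `dse spark-submit cfs_read.py`.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cfs-example").getOrCreate()

    df = spark.read.csv("cfs:///user/data/input.csv", header=True)  # cfs:// instead of hdfs://
    df.show(5)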

Can we use Apache Spark to store data? Or is it only a data processing tool?

I am new to Apache Spark. I would like to know whether it is possible to store data using Apache Spark, or is it only a processing tool?
Thanks for spending your time,
Satya
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that is not persistent storage.
In a real-life use case you usually have a database or data repository from which you access the data in Spark.
Spark can access data that is in:
SQL databases (anything that can be connected to using a JDBC driver)
Local files
Cloud storage (e.g. Amazon S3)
NoSQL databases
The Hadoop Distributed File System (HDFS)
and many more... (a couple of these are sketched below)
A detailed description can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
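Here is a hedged PySpark sketch of a few of those connectors; the JDBC URL, credentials, table name, bucket, and paths below are all placeholders, not a working deployment:

    # Sketch: the same DataFrame API pointed at different storage backends.
    # All connection details and paths are illustrative placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sources-example").getOrCreate()

    # SQL database over JDBC (the JDBC driver jar must be on the classpath)
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://db-host:5432/mydb")
               .option("dbtable", "public.events")
               .option("user", "reader")
               .option("password", "secret")
               .load())

    # Cloud storage (requires the S3A connector)
    s3_df = spark.read.parquet("s3a://my-bucket/events/")

    # HDFS
    hdfs_df = spark.read.json("hdfs:///data/events/")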
Apache Spark is primarily a processing engine. It works with underlying file systems such as HDFS, S3, and other supported file systems, and it also has the capability to read data from relational databases. But primarily it is an in-memory distributed processing tool.
As you can read on Wikipedia, Apache Spark is defined as:
an open-source cluster computing framework
When we talk about computing, we are talking about a processing tool; in essence it lets you work in a pipeline scheme (essentially ETL): you read the dataset, you process the data, and then you store the processed data, or models that describe the data.
If your main objective is to distribute your data, there are some good alternatives, such as HDFS (the Hadoop Distributed File System), among others.
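A minimal read/process/store sketch of that pipeline idea in PySpark; the paths, column name, and aggregation are placeholders chosen for illustration:

    # Sketch of the read -> process -> store pipeline described above.
    # Input/output paths and the aggregation are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-example").getOrCreate()

    raw = spark.read.csv("hdfs:///data/raw/", header=True)             # read the dataset
    summary = raw.groupBy("category").agg(F.count("*").alias("rows"))  # process it
    summary.write.mode("overwrite").parquet("hdfs:///data/summary/")   # store the result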

Distributed storage for Spark

Official guide says:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or anything else? Or can I just copy the input to all nodes and not bother with NFS, HDFS, etc.?
Spark does not depend on a distributed file system for shuffle. Unlike traditional MapReduce, Spark doesn't need to write to HDFS (or a similar system); instead, Spark achieves resiliency by tracking the lineage of the data and, in the event of a node failure, re-computing any data that was on that node.
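You can see the lineage Spark tracks for an RDD, i.e. the information it would use to recompute lost partitions, with toDebugString(); a tiny sketch with placeholder data:

    # Sketch: inspecting the lineage Spark keeps for recomputation after node failures.
    from pyspark import SparkContext

    sc = SparkContext(appName="lineage-example")

    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
    print(rdd.toDebugString().decode("utf-8"))  # prints the chain of transformations behind this RDD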

Resources