Is Apache Spark recommended to run on Windows?

I have a requirement to run Spark on Windows in a production environment. I would like advice on whether Apache Spark on Windows is recommended. If not, I would like to understand why.

Related

Importing vs installing Spark

I am new to the spark world and to some extent coding.
This question might seem too basic but please clear my confusion.
I know that we have to import Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing the application, I can also run it and see the output with "Run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run my application?
Also, what is the need for it to be installed on the cluster, since we can just submit the jar file, and a JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually, you don't really need to install Spark on your machine if you are writing in, for example, Scala or Java: you can just import spark-core (or any other dependencies) into your project, and when you start your Spark job from your main class it will create a standalone Spark runner inside your JVM and run the job there (with the master set to local[*]).
There are still several reasons for having Spark installed on your machine.
One of them is running jobs with PySpark, which requires the Spark and Python libraries plus a runner (a local[*] master or a remote cluster).
Another reason is if you want to run your job on-premise.
It might be easier to create a cluster in your local datacenter: appoint your machine as the master and connect the other machines to it as workers. (This solution might be a bit naive, but you asked for basics, so this might spark your curiosity to read more about the infrastructure design of data processing systems.)
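To make the import-vs-install distinction concrete, here is a minimal sketch. The jar name, package name, and host name are placeholders, and it assumes a fat jar built with `sbt assembly` whose code sets `master("local[*]")`:

```shell
# Run the job locally: spark-core is bundled inside the fat jar and the
# code asks for a local[*] master, so no Spark installation is needed --
# everything runs inside this single JVM.
java -jar target/scala-2.12/myapp-assembly-0.1.jar

# Run the same jar on a cluster: this *does* require an installed Spark
# distribution, because spark-submit and the cluster daemons ship with
# the distribution, not with your application.
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.Main \
  target/scala-2.12/myapp-assembly-0.1.jar
```

The first command is why "Run" in IntelliJ works without any installation; the second is why the cluster machines need more than just a JVM.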

How to use docker to setup kafka and spark-streaming on a Mac?

I just want to learn Kafka and Spark Streaming on my local machine (macOS Sierra).
Maybe Docker is a good idea?
Seems like what you need is described here
If you’ve always wanted to try Spark Streaming but never found the time
to give it a shot, this post provides easy steps for getting a
development setup with Spark and Kafka using Docker
Example application here
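If you go the Docker route, a sketch along these lines gets a single-broker Kafka running for local experiments. The image names, versions, and ports are assumptions on my part; check the images' own documentation for the exact environment variables:

```shell
# ZooKeeper and Kafka from the Confluent images -- a single-broker
# setup intended only for local experiments, not production.
docker run -d --name zookeeper \
  -e ZOOKEEPER_CLIENT_PORT=2181 \
  confluentinc/cp-zookeeper

docker run -d --name kafka --link zookeeper \
  -p 9092:9092 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
  -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
  confluentinc/cp-kafka
```

Your Spark Streaming application can then run on the Mac itself in local[*] mode and point at localhost:9092 as its bootstrap server, so you only need Kafka (not Spark) inside Docker.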

Suggest a similar installer for apache spark and a notebook?

I am new to big data analytics. I am trying to install Apache Spark and a notebook to execute code, like IPython. Is there an installer that comes with both Spark and a good notebook tool built in? I come from a background in PHP and Apache, and I am used to tools like XAMPP and WAMP that install multiple services in one click. Can anyone suggest a similar installer for Apache Spark and a notebook? I have Windows.
If IPython is not a mandatory requirement and you can work with the Zeppelin notebook and Apache Spark, I think you will want Sparklet. It is similar to what you seek: a XAMPP-like installer for the Spark engine and the Zeppelin tool.
You can see details here - Sparklet
It supports Windows. Let me know if it solves your problem.

Running Spark on a Cluster of machines

I want to run Spark on four computers. I have read the theory of running Spark on a cluster using Mesos, YARN, and SSH, but I want a practical method and tutorial for this. The operating systems of these machines are macOS and Ubuntu. I've written my code in Scala using IntelliJ IDEA.
Can anybody help me?
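For a small ad-hoc cluster like this, Spark's built-in standalone mode is the most practical option: install the same Spark distribution on all four machines, then start the daemons by hand. The host names below are placeholders, and in Spark versions before 3.0 the worker script is named start-slave.sh instead of start-worker.sh:

```shell
# On the machine chosen as master:
$SPARK_HOME/sbin/start-master.sh
# The master's web UI at http://master-host:8080 shows the
# spark://master-host:7077 URL that workers and jobs connect to.

# On each of the other three machines:
$SPARK_HOME/sbin/start-worker.sh spark://master-host:7077

# From your development machine, submit the jar built with sbt/IntelliJ:
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.Main \
  myapp-assembly.jar
```

This avoids installing Mesos or YARN entirely; the only prerequisites are a JVM and the unpacked Spark distribution on each machine, plus network access between them.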

How to set up Spark cluster on Windows machines?

I am trying to set up a Spark cluster on Windows machines.
The way to go here is using the Standalone mode, right?
What are the concrete disadvantages of not using Mesos or YARN? And how much pain would it be to use either one of those? Does anyone have some experience here?
FYI, I got an answer in the user-group: https://groups.google.com/forum/#!topic/spark-users/SyBJhQXBqIs
The standalone mode is indeed the way to go. Mesos does not work under Windows, and YARN probably doesn't either.
Quick note: YARN should eventually work on Windows via the Hortonworks Data Platform (the version 2.0 beta runs on YARN, but it is Linux-only at this time). Another potential route is to run against Hadoop 1.1 (Hortonworks Data Platform for Windows 1.1) - but your approach of running in standalone mode is definitely the easiest way to get off the ground.
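On Windows there are no .sh launcher scripts, but a common workaround is to start the standalone daemons directly through spark-class.cmd, which does ship with the distribution. The host name below is a placeholder:

```shell
REM On the master machine, from the Spark installation directory:
bin\spark-class.cmd org.apache.spark.deploy.master.Master

REM On each worker machine, pointing at the master's URL:
bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://master-host:7077
```

Once the master and workers are up, jobs are submitted with bin\spark-submit.cmd against the same spark://master-host:7077 URL.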
