Running a Python 3 streaming job in EMR without bootstrapping - python-3.x

I haven't had a chance to try it yet, but I am wondering whether EMR instances have Python 3 installed out of the box. From my experiments I know they have Python 2 for sure. If there is an easy way to check the default (installed) packages for a given AMI, that would be great to know.

According to "AMI versions supported", Python 3 is not supported yet.
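One way to check empirically, rather than from the AMI docs, is to run a trivial streaming step whose mapper does nothing but report which interpreter it ran under; this is a minimal sketch, and the python3 shebang is exactly the assumption being tested (if Python 3 is absent, the step fails immediately):

    #!/usr/bin/env python3
    # version_check.py - streaming mapper that ignores its input and emits
    # the interpreter version, so the step output shows which Python ran.
    import sys

    for _ in sys.stdin:  # drain stdin so the streaming step completes normally
        pass
    print(sys.version.replace("\n", " "))

Pointing a streaming step at this script (with a small dummy input) and reading the step output shows what the nodes actually provide, without any bootstrap action.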

Related

psycopg2 fails on AWS Glue on subpackage _psycopg

I am trying to get a Glue Spark job running with Python to talk to a Redshift cluster.
But I have trouble getting psycopg2 to run ... has anybody got this going? It complains about the sub-package _psycopg.
Help please! Thanks.
AWS Glue has trouble with modules that aren't pure Python libraries. Try using pg8000 as an alternative.
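If you go the pg8000 route, a minimal connection sketch looks like the following; pg8000 is pure Python, so there is no compiled _psycopg extension for Glue to choke on. The host, credentials, and query below are placeholders:

    import pg8000

    # Placeholders: point these at your Redshift cluster and credentials.
    conn = pg8000.connect(
        host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        database="dev",
        user="awsuser",
        password="secret",
    )
    cur = conn.cursor()
    cur.execute("SELECT current_date")
    print(cur.fetchone())
    conn.close()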
Now with Glue version 2 you can pass Python libraries to Glue jobs as parameters via --additional-python-modules. I used psycopg2-binary instead of psycopg2 and it worked for me. Then in the code I did import psycopg2.
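As a sketch of what that looks like when defining the job programmatically (the job name, role ARN, and script location below are placeholders), the module list goes into the job's default arguments:

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="redshift-loader",                              # placeholder job name
        Role="arn:aws:iam::123456789012:role/GlueJobRole",   # placeholder role
        GlueVersion="2.0",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/redshift_loader.py",
            "PythonVersion": "3",
        },
        DefaultArguments={
            # Pure-wheel build, so Glue does not try to compile _psycopg.
            "--additional-python-modules": "psycopg2-binary",
        },
    )

The same key/value pair can also be set under the job parameters in the Glue console.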

Zookeeper fails to start after Native Library and Intel MKL Parcel Activation in CDH 6.0.0

I have a 7-node cluster managed by Cloudera Manager with CDH 6.0.0. I am trying to run matrix multiplication in Spark using a native library that relies on BLAS, which is why I downloaded two parcels: 1. GPLEXTRAS and 2. Intel MKL.
However, whenever I activate the two parcels across the cluster, ZooKeeper fails to start with the following error:
Error found before invoking supervisord: No parcel provided required tags: set([u'cdh'])
What is the reason for this failure, and how do I get rid of this error?
Thanks in advance!
Chandan, take a look at this forum post - the last few posts on it discuss a solution to ZooKeeper failing to start properly. It's not a pretty workaround, but it may help get you going.

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt so far (but for Spark v2.0.0) that I have found, but which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about Jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
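For the client-mode part of the question: in Spark 2.4, client mode against Kubernetes means the driver runs wherever the SparkSession is created (e.g. a Jupyter kernel on a laptop) and the executors run as pods that must be able to reach it. A minimal sketch, in which the API server URL, image, namespace, and driver host/port are all placeholders for your environment:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("k8s://https://my-cluster-api.example.com:6443")
        .appName("client-mode-test")
        .config("spark.kubernetes.container.image", "my-registry/spark-py:2.4.0")
        .config("spark.kubernetes.namespace", "spark")
        .config("spark.executor.instances", "2")
        # The driver runs here, so executors must be able to connect back
        # to this host/port (a routable address or a headless service).
        .config("spark.driver.host", "my-laptop.example.com")
        .config("spark.driver.port", "29413")
        .getOrCreate()
    )

    print(spark.range(1000000).selectExpr("sum(id)").collect())
    spark.stop()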
See https://hub.helm.sh/charts/microsoft/spark - this is based off https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via a REST API. You can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
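On the Livy route mentioned above, submitting from a laptop is just an HTTP call; a rough sketch, where the Livy URL and the script path are placeholders for whatever the chart exposes in your cluster:

    import json
    import requests

    LIVY_URL = "http://livy.example.com:8998"  # placeholder endpoint

    payload = {
        # Script path as seen from inside the cluster/image, not the laptop.
        "file": "local:///opt/spark/examples/src/main/python/pi.py",
        "args": ["100"],
    }

    resp = requests.post(
        LIVY_URL + "/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    batch = resp.json()
    print("Submitted batch", batch["id"], "state:", batch["state"])

Polling GET /batches/<id> afterwards reports progress and the final state of the job.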

Installing Hadoop on Linux Mint

I have started a course on Hadoop on Udemy. The instructor is using Windows, installs VirtualBox, and then runs a Hortonworks Sandbox image for using Hadoop.
I am using Linux Mint, and after doing some research on installing Hadoop on Linux I found out (click for ref) that we can install the VM on Linux, download the Hortonworks Sandbox image, and run it.
I also found another method which does not use a VM (click for ref). I am confused as to which is the best way to install Hadoop.
Should I use the VM or the second method? Which is better for learning and development?
Thanks a lot for help!
can install the VM on linux
You can use a VM on any host OS... That's the point of a VM.
The last link is only Hadoop, whereas Hortonworks has much, much more, like Spark, Hive, HBase, Pig, etc. - things you'd otherwise need to install and configure yourself.
Which is better for learning and development?
I would strongly suggest using a VM (or containers) overall:
1) rather than messing up your local OS trying to get Hadoop working, and
2) the Hortonworks documentation has lots of tutorials that can really only be run in the sandbox with the pre-installed datasets.

What is the difference between the package types of Spark on the download page?

What's the difference between the download package types of Spark:
1) Pre-built for Hadoop 2.6.0 and later, and
2) Source code (can build several Hadoop versions)?
Can I install "pre-built for Hadoop 2.6.0 and later" but work without using Hadoop, HDFS, or HBase?
PS: Hadoop 2.6.0 is already installed on my machine.
The last answer only addressed Q1, so I am writing this.
The answer to your Q2 is yes: you can work with Spark without Hadoop components installed, even if you use Spark prebuilt against a specific Hadoop version. Spark will throw a bunch of errors while starting up the master/workers, which you (and Spark) can blissfully ignore as long as you see them up and running.
In terms of applications, it's never a problem.
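As a quick sanity check of that point (assuming a prebuilt Spark download with pyspark importable), a local-mode session runs fine with no HDFS or HBase configured; reads and writes simply use file:// paths on the local filesystem:

    from pyspark.sql import SparkSession

    # local[*] keeps everything on this machine; no Hadoop daemons are contacted.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("no-hdfs-check")
        .getOrCreate()
    )

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    df.show()

    # Local filesystem instead of hdfs://
    df.write.mode("overwrite").parquet("file:///tmp/no_hdfs_check")

    spark.stop()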
The difference is the version of the Hadoop API they are built against. To interoperate with a Hadoop installation, Spark needs to be built against that API - e.g. the dreaded conflict of org.apache.hadoop.mapred vs org.apache.hadoop.mapreduce.
If you're using Hadoop 2.6, get the binary version that matches your Hadoop installation.
You can also build Spark from source; that's what the Source code download is for. If you want to build from source, follow the instructions listed here: https://spark.apache.org/docs/latest/building-spark.html
