How to upload files to Amazon EMR?

How to upload files to Amazon EMR? - apache-spark

My code is as follows:
# test2.py
from pyspark import SparkContext, SparkConf, SparkFiles
conf = SparkConf()
sc = SparkContext(
appName="test",
conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
with open(SparkFiles.get("test_warc.txt")) as f:
print("opened")
sc.stop()
It works when I run it locally with:
spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py
But after adding step to EMR claster:
spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py
I am getting error:
FileNotFoundError: [Errno 2] No such file or directory:
'/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'
Path to the file is correct, but it is not uploading from s3 for some reason.
I tried to load from executor:
from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add
conf = SparkConf()
sc = SparkContext(
appName="test",
conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
def f(_):
a = 0
with open(SparkFiles.get("test_warc.txt")) as f:
a += 1
print("opened")
return a#test_module.test()
count = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(count) # printing 2
sc.stop()
And it works without errors.
Looking like --files argument uploading files to executors only. How can I upload to master ?

Your understanding is correct.
--files argument is uploading files to executors only.
See this in the spark documentation
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP
file server, and every executor pulls the file from the driver HTTP
server.
You can read more about this at advanced-dependency-management
Now coming back to your second question
How can I upload to master?
There is a concept of bootstrap-action in EMR. From the official documentation it means the following:
You can use a bootstrap action to install additional software or
customize the configuration of cluster instances. Bootstrap actions
are scripts that run on cluster after Amazon EMR launches the instance
using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions
run before Amazon EMR installs the applications that you specify when
you create the cluster and before cluster nodes begin processing data.
How do I use it in my case?
While spawning the cluster you can specify your script in BootstrapActions JSON Something like the following along with other custom configurations:
BootstrapActions=[
{'Name': 'Setup Environment for downloading my script',
'ScriptBootstrapAction':
{
'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
}
}]
The content of the download-file.sh should look something like below:
#!/bin/bash
set -x
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir
Now in your python script, you can use the file workingDir/test_warc.txt to read the file.
There is also an option to execute your bootstrap action on the master node only/task node only or a mix of both. bootstrap-actions/run-if is the script that we can use for this case. More reading on this can be done at emr-bootstrap-runif

Related

ModuleNotFoundError: No module named 'pyspark.sql' when using with EMR Serverless

I'm getting the following error when I'm trying to run a job on EMR serverless -
ModuleNotFoundError: No module named 'pyspark.sql'. Please refer to user guide on how to use python libraries with EMR Serverless.
It happens when I try to import pyspark.sql in a python file located within a zip package.
The file -
pyspark.zip
|--__init__.py
|--spark.py
The content -
#__init__.py
from .spark import *
#spark.py
from pyspark.sql import SparkSession
def run():
print("Create Spark Session")
spark_session = SparkSession\
.builder\
.appName("First pyspark project")\
.getOrCreate()
The spark property I gave to the job -
--conf spark.submit.pyFiles=s3://my-bucket/pyspark.zip
--conf spark.executorEnv.PYSPARK_PYTHON=python
I am afraid I missed something. Should I install it or something?
All I did was compress the project into a zip file and upload it to S3.

Setting spark.local.dir in Pyspark/Jupyter

I'm using Pyspark from a Jupyter notebook and attempting to write a large parquet dataset to S3.
I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up.
I want to now edit spark.local.dir to point to a directory that has space.
How can I set this parameter?
Most solutions I found suggested setting it when using spark-submit. However, I am not using spark-submit and just running it as a script from Jupyter.
Edit: I'm using Sparkmagic to work with an EMR backend.I think spark.local.dir needs to be set in the config JSON, but I am not sure how to specify it there.
I tried adding it in session_configs but it didn't work.

The answer depends on where your SparkContext comes from.
If you are starting Jupyter with pyspark:
PYSPARK_DRIVER_PYTHON='jupyter'\
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark
then your SparkContext is already initialized when you receive your Python kernel in Jupyter. You should therefore pass a parameter to pyspark (at the end of the command above): --conf spark.local.dir=...
If you are constructing a SparkContext in Python
If you have code in your notebook like:
import pyspark
sc = pyspark.SparkContext()
then you can configure the Spark context before creating it:
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.local.dir', '...')
sc = pyspark.SparkContext(conf=conf)
Configuring Spark from the command line:
It's also possible to configure Spark by editing a configuration file in bash. The file you want to edit is ${SPARK_HOME}/conf/spark-defaults.conf. You can append to it as follows (creating it if it doesn't exist):
echo 'spark.local.dir /foo/bar' >> ${SPARK_HOME}/conf/spark-defaults.conf

Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a spark application to a standalone spark(2.1.1) cluster 3 VM running in GCP from my Python 3 notebook(running in local laptop) but for some reason spark session is throwing error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".
Environment Details: IPython and Spark Master are running in one GCP VM called "sparkmaster". 3 additional GCP VMs are running Spark workers and Cassandra Clusters. I connect from my local laptop(MBP) using Chrome to GCP VM IPython notebook in "sparkmaster"
Please note that terminal works:
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000
Running it from Python Notebook:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark=SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate() #This step works if make .master('local')
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
.option("subscribe", "gene") \
.load()
so far I have tried these:
I have tried to change spark master node spark-defaults.conf and spark-env.sh to add SPARK_MASTER_IP.
Tried to find the STANDALONE_SPARK_MASTER_HOST=hostname -f setting so that I can remove "-f". For some reason my spark master ui shows FQDN:7077 not hostname:7077
passed FQDN as param to .master() and os.environ["PYSPARK_SUBMIT_ARGS"]
Please let me know if you need more details.

After doing some more research I was able to resolve the conflict. It was due to a simple environment variable called SPARK_HOME. In my case it was pointing to Conda's /bin(pyspark was running from this location) whereas my spark setup was present in a diff. path. The simple fix was to add
export SPARK_HOME="/home/<<your location path>>/spark/" to .bashrc file( I want this to be attached to my profile not to the spark session)
How I have done it:
Step 1: ssh to master node in my case it was same as ipython kernel/server VM in GCP
Step 2:
cd ~
sudo nano .bashrc
scroll down to the last line and paste the below line
export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
ctrlX and Y and enter to save the changes
Note: I have also added few more details to the environment section for clarity.

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
spark = SparkSession \
.builder \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()\
df = spark.read.csv('/home/hadoop/observations_temp.csv, header=True)
When I run the script raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// in the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv, header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think is because the file// extension just read the file locally and it does not distribute the file across the other nodes.
Do you know how can I read the csv file and make it available to all the other nodes?

You are right about the fact that your file is missing from your worker nodes thus that raises the error you got.
Here is the official documentation Ref. External Datasets.
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions :
You copy your file into each worker before starting the job;
Or you'll upload in HDFS with something like : (recommended solution)
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with :
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it. (with the proper credentials of course)
Some suggest that the --files tag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small but then you won't need Spark.
Alternatively, I would stick with HDFS (or any distributed file system).

I think what you are missing is explicitly setting the master node while initializing the SparkSession, try something like this
spark = SparkSession \
.builder \
.master("local") \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
and then read the file in the same way you have been doing
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...

Might be useful for someone running zeppelin on mac using Docker.
Copy files to custom folder : /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')

Read files sent with spark-submit by the driver

I am sending a Spark job to run on a remote cluster by running
spark-submit ... --deploy-mode cluster --files some.properties ...
I want to read the content of the some.properties file by the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.
The ways around this problem that I know of are:
Upload the file to HDFS
Store the file in the app jar
Both are inconvenient since this file is frequently changed on the submitting dev machine.
Is there a way to read the file that was uploaded using the --files flag during the driver code main method?

Yes, you can access files uploaded via the --files argument.
This is how I'm able to access files passed in via --files:
./bin/spark-submit \
--class com.MyClass \
--master yarn-cluster \
--files /path/to/some/file.ext \
--jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
/path/to/app.jar file.ext
and in my Spark code:
val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size
I do believe these files are downloaded onto the workers in the same directory as the jar is placed, which is why simply passing the filename and not the absolute path to Source.fromFile works.

After the investigation, I found one solution for above issue. Send the any.properties configuration during spark-submit and use it by spark driver before and after SparkSession initialization. Hope it will help you.
any.properties
spark.key=value
spark.app.name=MyApp
SparkTest.java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
public class SparkTest{
public Static void main(String[] args){
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
Config conf = loadConf();
System.out.println(conf.getString("spark.key"));
// Initialize SparkContext and use configuration from properties
SparkConf sparkConf = new SparkConf(true).setAppName(conf.getString("spark.app.name"));
SparkSession sparkSession =
SparkSession.builder().config(sparkConf).config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().getOrCreate();
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
}
public static Config loadConf() {
String configFileName = "any.properties";
System.out.println(configFileName);
Config configs = ConfigFactory.load(ConfigFactory.parseFile(new java.io.File(configFileName)));
System.out.println(configs.getString("spark.key")); // get value from properties file
return configs;
}
}
Spark Submit:
spark-submit --class SparkTest --master yarn --deploy-mode client --files any.properties,yy-site.xml --jars ...........

use spark-submit --help, will find that this option is only for working directory of executor not driver.
--files FILES: Comma-separated list of files to be placed in the working directory of each executor.

The --files and --archives options support specifying file names with the # , just like Hadoop.
For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into Spark worker directory, but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
this works for my spark streaming application in both yarn/client and yarn/cluster mode.

In pyspark, I find it really interesting to achieve this easily, first arrange your working directory like this:
/path/to/your/workdir/
|--code.py
|--file.txt
and then in your code.py main function, just read the file as usual:
if __name__ == "__main__":
content = open("./file.txt").read()
then submit it without any specific configurations as follows:
spark-submit code.py
it runs correctly which amazes me. I suppose the submit process archives any files and sub-dir files altogether and sends them to the driver in pyspark, while you should archive them yourself in scala version. By the way, both --files and --archives options are working in worker not the driver, which means you can only access these files in RDD transformations or actions.

Here's a nice solution I developed in Python Spark in order to integrate any data as a file from outside to your Big Data platform.
Have fun.
# Load from the Spark driver any local text file and return a RDD (really useful in YARN mode to integrate new data at the fly)
# (See https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html)
def parallelizeTextFileToRDD(sparkContext, localTextFilePath, splitChar):
localTextFilePath = localTextFilePath.strip(' ')
if (localTextFilePath.startswith("file://")):
localTextFilePath = localTextFilePath[7:]
import subprocess
dataBytes = subprocess.check_output("cat " + localTextFilePath, shell=True)
textRDD = sparkContext.parallelize(dataBytes.split(splitChar))
return textRDD
# Usage example
myRDD = parallelizeTextFileToRDD(sc, '~/myTextFile.txt', '\n') # Load my local file as a RDD
myRDD.saveAsTextFile('/user/foo/myTextFile') # Store my data to HDFS

A way around the problem is that you can create a temporary SparkContext simply by calling SparkContext.getOrCreate() and then read the file you passed in the --files with the help of SparkFiles.get('FILE').
Once you read the file retrieve all necessary configuration you required in a SparkConf() variable.
After that call this function:
SparkContext.stop(SparkContext.getOrCreate())
This will distroy the existing SparkContext and than in the next line simply initalize a new SparkContext with the necessary configurations like this.
sc = SparkContext(conf=conf).getOrCreate()
You got yourself a SparkContext with the desired settings

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string