How to distcp between a MAPR filesystem and a HDInsight Blob Storage - azure

I'm trying to execute the distcp command below; however, it is throwing an exception:
hadoop distcp date_load=201901* wasb://dev3-spark@clusterdev.blob.core.windows.net/luiz/producao/performance/performance_report
The thrown exception is as follows:
19/02/06 13:34:53 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
19/02/06 13:34:53 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
19/02/06 13:34:53 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
19/02/06 13:34:53 ERROR tools.DistCp: Invalid arguments:
org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: Container dev3-spark in account clusterdev.blob.core.windows.net not found, and we can't create it using anoynomous credentials.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:938)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:438)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1048)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2693)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:98)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2773)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2755)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:411)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:309)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:216)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Caused by: org.apache.hadoop.fs.azure.AzureException: Container dev3-spark in account clusterdev.blob.core.windows.net not found, and we can't create it using anoynomous credentials.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:730)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:933)
... 12 more
Invalid arguments: org.apache.hadoop.fs.azure.AzureException: Container dev3-spark in account clusterdev.blob.core.windows.net not found, and we can't create it using anoynomous credentials.

You can distcp from your on-premises cluster to your Azure storage account:
% hadoop distcp hdfs://<yourHostName>:9001/user/<yourUser>/<yourDirectory> wasbs://<yourStorageContainer>@<YourStorageAccount>.blob.core.windows.net/<yourDestinationDirectory>/
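Because the exception complains about anonymous credentials, the storage account key also has to be known to Hadoop. A minimal sketch, assuming the key is passed inline with -D (it can equally be set once in core-site.xml); <storage-account-key> and the source path are placeholders:
hadoop distcp \
  -D fs.azure.account.key.clusterdev.blob.core.windows.net=<storage-account-key> \
  hdfs://<yourHostName>:9001/path/to/date_load=201901* \
  wasbs://dev3-spark@clusterdev.blob.core.windows.net/luiz/producao/performance/performance_report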
Hope this helps.

Related

gcloud spark submit: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;

I am trying to use gcloud to submit my spark job from Airflow.
This is my gcloud command: gcloud dataproc jobs submit spark --cluster=xxx --region=us-central1 --class=com.xxx --jars=gs://xxx/xxx/xxx.jar -- xxx -- xxx -- xxx -- gs://xxx/xxx/xxx
I am getting this exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;
Is anything wrong with my command?
This error can be resolved by disabling the flat glob algorithm in the GCS connector. Set these Hadoop properties during cluster creation: core:fs.gs.glob.flatlist.enable=false and core:fs.gs.glob.concurrent.enable=false. Additionally, upgrade the GCS connector to the latest version by passing --metadata GCS_CONNECTOR_VERSION=2.2.6.
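A minimal sketch of a cluster creation command with those settings; the cluster name and region are placeholders:
gcloud dataproc clusters create <cluster-name> \
  --region=us-central1 \
  --properties=core:fs.gs.glob.flatlist.enable=false,core:fs.gs.glob.concurrent.enable=false \
  --metadata=GCS_CONNECTOR_VERSION=2.2.6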

Pyspark not creating SparkContext (Yarn). bad gateway or network traffic blocked?

Here is some context on my installation of the pyspark binary.
In my company, we use a Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it's an image built from a specific Dockerfile, and inside this Dockerfile the CDH binaries and configuration are installed.
Now I wish to use those configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode to deploy very small resources for the webapps.
What I have done is tar.gz all binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/.
Then I untar.gz them in a container or in a WSL2 instance.
Instead of unpacking everything in /var/ or /opt/ as I should, I've put them in $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $USER/etc/client-config/*. Why did I do this? Because I might want to use a mounted volume in my Kubernetes cluster and share the binaries between containers.
I've sed'ed and modified all the configuration files to adapt the paths:
spark-env.sh
topology.py
Any *.txt, *.sh, *.py
So I managed to run beeline, hadoop, hdfs, and hbase by pointing them at the hadoop-conf folder. I can use pyspark, but in local mode only. What I really want is to use pyspark with YARN.
So I set a bunch of env variables to make this work:
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
Anyway, all of the paths exist and work. And since I've sed'ed all the config files, they also produce the same paths as the exported ones.
I launch my pyspark binary like this:
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
FYI, it is using pyspark 2.4.0, and I've installed Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one I found on the CDSW instance. I added the keystore with the company's public certificate, and I also generated a keytab for the Kerberos auth. Both of them are working, since I can use hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf.
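For reference, the kind of check that confirms the keytab and config work looks roughly like this (the keytab file name, principal, and realm are placeholders):
export HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf
# obtain a Kerberos ticket from the generated keytab
kinit -kt $HOME/user.keytab user@COMPANY.REALM
# list a directory over HDFS to confirm the ticket and config are picked up
hdfs dfs -ls /user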
In verbose mode I can see all the details and configuration from Spark. When I compare it with the CDSW session, they are nearly identical (apart from the modified paths). For example:
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
After a few seconds it fails to create a SparkSession:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
From what I understand, it fails for a reason I'm not sure about and then tries to fall back to another mode, which fails too.
In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that it is using a zookeeper:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
Could it be that the YARN cluster does not accept traffic from a random IP (a Kubernetes IP or a personal IP from my computer)? I suspect the IP I'm working from is not on the whitelist, but at the moment I cannot ask to have it added. How can I know for sure I'm looking in the right direction?
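One way to check from the container is to probe the addresses declared in yarn-conf directly (a sketch, assuming nc is available; the hosts and ports are whatever yarn-site.xml specifies):
# ZooKeeper quorum from yarn.resourcemanager.zk-address
nc -vz corporate.machine.node1.name.net 9999
# ResourceManager address from yarn-site.xml
nc -vz <resourcemanager-host> <resourcemanager-port>
If these time out, the traffic is most likely being filtered before it reaches the cluster.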
Edit 1:
As said in the comments, the URI of pyspark.zip was wrong. I've modified my PYSPARK_ARCHIVES_PATH to point at the real location of pyspark.zip.
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
Now I get an UnknownHostException:
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...

spark job submitted via Livy throws GSSException: No valid credentials provided (Mechanism level: Failed to find any kerberos tgt)

I am trying to launch my Spark batch job using Livy. From the logs, I see that the job starts running but fails when it tries to access the Hive metastore with the following Kerberos error:
GSSException: No valid credentials provided (Mechanism level: Failed
to find any kerberos tgt)
The same job runs fine when I launch it using a spark-submit command. However, in the spark-submit command I pass the keytab and principal (--keytab, --principal).
I tried passing the keytab and principal in the Livy REST call using the parameters spark.yarn.keytab and spark.yarn.principal. Adding these options throws the following error:
Error: only one of --proxy-user or --principal can be provided
even though I do not provide proxyUser parameter in my curl request.
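For reference, the kind of batch request described above looks roughly like this (the Livy host, jar path, class name, keytab path, and principal are placeholders):
curl -X POST -H "Content-Type: application/json" \
  -d '{
        "file": "hdfs:///path/to/job.jar",
        "className": "com.example.MyBatchJob",
        "conf": {
          "spark.yarn.keytab": "/path/to/user.keytab",
          "spark.yarn.principal": "user@REALM"
        }
      }' \
  http://<livy-host>:8998/batches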
Kindly let me know if you know how to resolve this issue.

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using the AWS CLI. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step failed. This is the log file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31- 31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on SandBox Hortonworks HDP 2.5) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Checking the Application Master, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the S3 path mentioned above, "s3://tracceale/params/configS3.txt", to the function 'fromFile' like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
What you have there is a URL to an object store, reachable through the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything on the local disk.
Use SparkContext.hadoopRDD() as the operation to convert the path into an RDD.
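A minimal sketch of that idea, assuming sc is the application's existing SparkContext (sc.textFile goes through the same Hadoop input-format machinery, so the s3:// path is resolved by the cluster's filesystem connector rather than the local filesystem):
// read the configuration object from S3 through the Hadoop filesystem layer
val configLines = sc.textFile("s3://tracceale/params/configS3.txt").collect()
// iterate over the lines exactly as with scala.io.Source
for (line <- configLines) {
  // ... parse each configuration line as before ...
}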
The file may be missing from the expected location. You might be able to see it after SSHing into the EMR cluster, but the step command still can't find it by itself and throws the file-not-found exception.
In this scenario, what I did was:
Step 1: Check that the file exists in the project directory copied to EMR.
For example, mine was in `/usr/local/project_folder/`.
Step 2: Copy the script you expect to run onto the EMR cluster.
For example, I copied `/usr/local/project_folder/script_name.sh` to `/home/hadoop/`.
Step 3: Execute the script from /home/hadoop/ by passing its absolute path to command-runner.jar:
command-runner.jar bash /home/hadoop/script_name.sh
That got my script running. Hope this is helpful to someone.
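For completeness, a sketch of submitting that script as an EMR step, mirroring the add-steps syntax used earlier in the question; the cluster id and script name are placeholders:
aws emr add-steps --cluster-id <ID_CLUSTER> --steps Name=RunScript,Jar=command-runner.jar,Args=[bash,/home/hadoop/script_name.sh],ActionOnFailure=CONTINUE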

new Spark StreamingContext fails with hdfs errors

I'm using DC/OS installed via Azure ACS, and I installed HDFS and Spark via the dcos tool with default options.
Creating a SparkStreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the spark package with dcos package install --options=, but I can't figure out what the hdfs.config-url should be. The https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs docs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect
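A sketch of passing that URL back to the Spark package; the option key layout is an assumption based on the hdfs.config-url name from the question, and [port] is whatever port the HDFS service reports:
cat > options.json <<'EOF'
{
  "hdfs": {
    "config-url": "http://hdfs.marathon.mesos:[port]/v1/connect"
  }
}
EOF
dcos package install spark --options=options.json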
