How do we copy files from Hadoop to ABFS (Azure) remotely?

How do we copy files from Hadoop to ABFS (Azure Blob File System)?
I want to copy from the Hadoop filesystem to the ABFS filesystem, but it throws an error.
This is the command I ran:
hdfs dfs -ls abfs://....
ls: No FileSystem for scheme "abfs"
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
Any idea how this can be done?

In core-site.xml you need to add a config property fs.abfs.impl with the value org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem, and then add any other related authentication configuration it may need.
More details on installation/configuration here - https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
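For older releases where the binding is not wired up out of the box, the entry is a minimal core-site.xml fragment like the following (the authentication properties for your storage account still have to be added alongside it):

```xml
<property>
  <name>fs.abfs.impl</name>
  <value>org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</value>
</property>
```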

The abfs binding is already in core-default.xml for any release with the ABFS client present. However, the hadoop-azure JAR and its dependencies are not in the Hadoop common/lib directory where they are needed (they are in HDI and CDH, but not in the Apache release).
You can tell the hadoop script to pick them up by setting the HADOOP_OPTIONAL_TOOLS environment variable; you can do this in ~/.hadoop-env, but just try it on your command line first:
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
After doing that, download the latest cloudstore JAR and use its storediag command to attempt to connect to an abfs URL; it's the place to start debugging classpath and config issues:
https://github.com/steveloughran/cloudstore
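Putting it together, a first debugging session might look like this (the cloudstore jar name/version and the abfs URL are placeholders for whatever you downloaded and your own container/account):

```shell
# make the hadoop scripts add the optional tool jars (hadoop-azure and friends) to the classpath
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"

# check that the abfs filesystem class is now found
hdfs dfs -ls abfs://mycontainer@myaccount.dfs.core.windows.net/

# run cloudstore's storediag against the same URL to probe classpath and config
hadoop jar cloudstore-1.0.jar storediag abfs://mycontainer@myaccount.dfs.core.windows.net/
```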

Related

"Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher" when running spark-submit or PySpark

I am trying to run the spark-submit command on my Hadoop cluster. Here is a summary of my Hadoop cluster:
The cluster is built using 5 VirtualBox VMs connected on an internal network.
There is 1 namenode and 4 datanodes.
All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
I am trying to run one of the Spark examples using the following spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
environment variables: HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ in spark-defaults.conf
I have uploaded the jars into hdfs (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ )
The logs accessible via the web interface (i.e. http://hadoop-namenode:8042 ) do not provide any further details about the error.
This section of the Spark documentation seems relevant to the error, since the YARN libraries should be included by default, but only if you've installed the appropriate Spark version:
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn’s classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
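If you are running a with-hadoop build and suspect this is the cause, the override described above is a single spark-defaults.conf entry:

```
spark.yarn.populateHadoopClasspath true
```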
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths on each YARN server where JARs are available to all YARN applications (spark.yarn.jars or extra packages get layered on top of this).
Another problem could be file permissions. You probably shouldn't put the Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put them under hdfs:///apps/spark/<version>, then give that 744 HDFS permissions.
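As a sketch, that shared layout could be created like this (the paths and version directory are hypothetical; note that, as in POSIX, the execute bit is what allows traversing into an HDFS directory, so 755 may be needed on the directories themselves):

```shell
hdfs dfs -mkdir -p /apps/spark/3.0.3
hdfs dfs -put $SPARK_HOME/jars/* /apps/spark/3.0.3/
hdfs dfs -chmod -R 744 /apps/spark   # 744 as suggested above; use 755 if directory traversal fails
```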
In the Spark / YARN UI, it should show the complete classpath of the application for further debugging
I figured out why I was getting this error. It turns out that I made an error while specifying spark.yarn.jars in spark-defaults.conf
The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
Basically, we need to point this property at the jar files themselves, not at the folder containing them.
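With that fix, the spark-defaults.conf entries from the question become:

```
spark.master          yarn
spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging
spark.yarn.jars       hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
```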

Running Spark2.3 on Kubernetes with remote dependency on S3

I am running spark-submit to run on Kubernetes (Spark 2.3). My problem is that the InitContainer does not download my jar file if it's specified as an s3a:// path, but it does work if I put the jar on an HTTP server and use http://. The Spark driver fails, of course, because it can't find my class (and the jar file is in fact not in the image).
I have tried two approaches:
specifying the s3a path to jar as the argument to spark-submit and
using --jars to specify the jar file's location on s3a, but both fail in the same way.
edit: Also, using local:///home/myuser/app.jar does not work, with the same symptoms.
On a failed run (with the dependency on s3a), I logged into the container and found the directory /var/spark-data/spark-jars/ to be empty. The init-container logs don't indicate any error.
Questions:
What is the correct way to specify remote dependencies on S3A?
Is S3A not supported yet? Only http(s)?
Any suggestions on how to further debug the InitContainer to determine why the download doesn't happen?
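For context, the shape of the failing submission is roughly the following (the class name, image, and bucket paths are placeholders, not the actual values used):

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:<port> \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --conf spark.kubernetes.container.image=<my-spark-image> \
  --jars s3a://my-bucket/deps/helper.jar \
  s3a://my-bucket/jars/app.jar
```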

How to configure Executor in Spark Local Mode

In Short
I want to configure my application to use lz4 compression instead of snappy, what I did is:
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.getOrCreate();
but looking at the console output, it's still using snappy in the executor
org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
and
[Executor task launch worker-0] compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [.snappy]
According to this post, what I did here only configures the driver, not the executor. The solution in that post is to change the spark-defaults.conf file, but I'm running Spark in local mode and I don't have that file anywhere.
Some more detail:
I need to run the application in local mode (for the purpose of unit testing). The tests work fine locally on my machine, but when I submit them to a build engine (RHEL5_64), I get the error
snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found
I did some research, and it seems the simplest fix is to use lz4 instead of snappy as the codec, so I tried the above solution.
I have been stuck on this issue for several hours; any help is appreciated, thank you.
what I did here only configure the driver, but not the executor.
In local mode there is only one JVM which hosts both driver and executor threads.
the spark-defaults.conf file, but I'm running spark in local mode, I don't have that file anywhere.
Mode is not relevant here. Spark in local mode uses the same configuration files. If you go to the directory where you keep the Spark binaries, you should see a conf directory:
spark-2.2.0-bin-hadoop2.7 $ ls
bin conf data examples jars LICENSE licenses NOTICE python R README.md RELEASE sbin yarn
In this directory there is a bunch of template files:
spark-2.2.0-bin-hadoop2.7 $ ls conf
docker.properties.template log4j.properties.template slaves.template spark-env.sh.template
fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template
If you want to set a configuration option, copy spark-defaults.conf.template to spark-defaults.conf and edit it according to your requirements.
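For example, run from the Spark installation directory shown above:

```shell
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
# then add the desired option, e.g.:
echo "spark.io.compression.codec org.apache.spark.io.LZ4CompressionCodec" >> conf/spark-defaults.conf
```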
Posting my solution here. user8371915's answer does address the question, but it did not solve my problem, because in my case I can't modify the property files.
What I ended up doing is adding another configuration:
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.config("spark.sql.parquet.compression.codec", "uncompressed")
.getOrCreate();

How to access files in Hadoop HDFS?

I have a .jar file (containing a Java project that I want to modify) in my Hadoop HDFS that I want to open in Eclipse.
When I type hdfs dfs -ls /user/... I can see that the .jar file is there; however, when I open up Eclipse and try to import it, I can't seem to find it anywhere. I do see a hadoop/hdfs folder in my file system, which takes me to two folders, namenode and namesecondary, but neither has the file I'm looking for.
Any ideas? I have been stuck on this for a while. Thanks in advance for any help.
As HDFS is virtual storage spanned across the cluster, you can see only the metadata in your local file system; you can't see the actual data.
Try downloading the jar file from HDFS to your local file system and making the required modifications there.
You can access HDFS using its web UI:
Open your browser and go to localhost:50070. In the HDFS web UI, move to the Utilities tab on the right side and click "Browse the file system"; you can see the list of files in your HDFS.
Follow the steps below to download the file to your local file system:
Open browser --> localhost:50070 --> Utilities --> Browse the file system --> open the required file's directory --> click on the file (a pop-up will open) --> click Download
The file will be downloaded to your local file system, and you can make your required modifications.
The HDFS filesystem and the local filesystem are different.
You can copy the jar file from the HDFS filesystem to a preferred location in your local filesystem with this command:
bin/hadoop fs -copyToLocal locationOfFileInHDFS locationWhereYouWantToCopyFileInYourFileSystem
For example:
bin/hadoop fs -copyToLocal file.jar /home/user/file.jar
I hope this helps you.
1) Get the file from HDFS to your local system
bin/hadoop fs -get /hdfs/source/path /localfs/destination/path
2) You can then manage it in Eclipse this way:
New Java Project -> Java settings -> Source -> Link source (Source folder).
You can install a plugin for Eclipse that can browse HDFS:
http://hdt.incubator.apache.org
or you can mount HDFS via FUSE:
https://wiki.apache.org/hadoop/MountableHDFS
You cannot directly import files that are in HDFS into Eclipse. First you have to move the file from HDFS to your local drive; only then can you use it in any utility.
hadoop fs -copyToLocal hdfsLocation localDirectoryPath

Spark jar package dependency file

I want to do some IP-to-location computation on Spark. After exploring the net, I found IPLocator https://github.com/miraclesu/IPLocator.
The IP-to-location lookup needs a file which contains the mapping information.
After packaging the jar, I can run it with plain local Java; the package runs fine with IPLocator.jar and qqwry.dat in the same directory.
But I want to use this jar with Spark. I tried to use --jars IPLocator.jar qqwry.dat when starting spark-shell, but when launching, the function still cannot read the file.
The file-reading code is like:
QQWryFile.class.getClassLoader().getResource("qqwry.dat")
I also tried to package the qqwry.dat file into the jar, and it did not work.
You need to use --files and then SparkFiles.get inside of your program
Try using a comma delimiter, and check whether IPLocator.jar and qqwry.dat are distributed to the Spark staging folder (.sparkStaging/application_xxx):
--jars IPLocator.jar,qqwry.dat
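Combining both suggestions, the launch becomes:

```shell
spark-shell --jars IPLocator.jar --files qqwry.dat
```

Inside the program, the distributed copy can then be located with SparkFiles.get("qqwry.dat") rather than through the classloader.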
