Spark read csv file submitted from --files - apache-spark

I'm submitting a Spark job to a remote Spark cluster on YARN and including a file with spark-submit --files. I want to read the submitted file as a DataFrame, but I'm confused about how to do this without having to put the file in HDFS:
spark-submit \
--class com.Employee \
--master yarn \
--files /User/employee.csv \
--jars SomeJar.jar
val spark: SparkSession = SparkSession.builder().getOrCreate() // create the Spark session
val df = spark.read.csv("/User/employee.csv")

spark.sparkContext.addFile("file:///your local file path ")
Add the file using addFile so that it is available on your worker nodes, since you want to read a local file in cluster mode.
You may need to make slight changes depending on the Scala and Spark versions you are using.

employee.csv is in the executor's working directory, so just read it as follows:
val df = spark.read.csv("employee.csv")

Related

df.show() prints empty result while in hdfs it is not empty

I have a PySpark application that is submitted to YARN with multiple nodes and reads Parquet from HDFS.
In my code, I have a DataFrame that is read directly from HDFS:
df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")
When I call df.show(n=2) directly after the code above, it outputs:
+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+
But when I manually go to the HDFS path, the data is not empty.
What have I tried?
1- At first I thought I might have allocated too few cores and too little memory to my executors and driver, so I doubled them, and nothing changed.
2- Then I thought the path might be wrong, so I gave it a wrong HDFS path, and it threw an error saying that the path does not exist.
What am I assuming?
1- I think this may have something to do with the driver and executors.
2- It may have something to do with YARN.
3- It may be the configs provided when using spark-submit.
Current config:
spark-submit \
--master yarn \
--queue my_queue_name \
--deploy-mode cluster \
--jars some_jars \
--conf spark.yarn.dist.files=some_files \
--conf spark.sql.catalogImplementation=in-memory \
--properties-file some_zip_file \
--py-files some_py_files \
main.py
What I am sure of:
The data is not empty. The same HDFS path is used in another project, which works fine.
So the problem was with the jar files I was providing.
The Hadoop version was 2.7.2; I changed it to 3.2.0 and it works fine.
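A mismatch like this is easier to spot if the job logs the Hadoop version that the driver JVM actually loaded. The following is only a sketch: the helper names hadoop_version and same_major_minor are made up for illustration, and spark.sparkContext._jvm is an internal PySpark attribute (the same gateway used elsewhere on this page to reach the Hadoop FileSystem classes), so treat it as a debugging aid rather than a stable API.

```python
def hadoop_version(spark):
    """Return the Hadoop version string reported by the driver JVM.

    Goes through the internal py4j gateway (spark.sparkContext._jvm),
    which is not part of the public PySpark API.
    """
    jvm = spark.sparkContext._jvm
    return jvm.org.apache.hadoop.util.VersionInfo.getVersion()


def same_major_minor(actual, expected):
    """True when two version strings share the same major.minor prefix."""
    return actual.split(".")[:2] == expected.split(".")[:2]


# Usage inside the job (requires a live SparkSession named `spark`):
#   found = hadoop_version(spark)
#   if not same_major_minor(found, "3.2.0"):
#       print("Unexpected Hadoop version on the classpath:", found)
```

Printing this at the start of main.py shows whether the jars passed with --jars pulled in a different Hadoop client than the cluster expects.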

How to create Pyspark application

My requirement is to read data from HDFS using PySpark, select only the required columns, remove the NULL values, and then write the processed data back to HDFS. Once these steps are completed, we need to delete the raw, dirty data from HDFS. Here is my script for each operation.
Import the Libraries and dependencies
#Spark Version = > version 2.4.0-cdh6.3.1
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
import pyspark.sql.functions as F
Read the Data from HDFS
df_load_1 = sparkSession.read.csv('hdfs:///cdrs/file_path/*.csv', sep = ";")
Select only the required columns
col = [ '_c0', '_c1', '_c2', '_c3', '_c5', '_c7', '_c8', '_c9', '_c10', '_C11', '_c12', '_c13', '_c22', '_C32', '_c34', '_c38', '_c40',
'_c43', '_c46', '_c47', '_c50', '_c52', '_c53', '_c54', '_c56', '_c57', '_c59', '_c62', '_c63','_c77', '_c81','_c83']
df1=df_load_1.select(*[col])
Check for NULL values and, if there are any, remove them
df_agg_1 = df1.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df1.columns])
df_agg_1.show()
df1 = df1.na.drop()
Writing the pre-processed data to HDFS, same cluster but different directory
df1.write.csv("hdfs://nm/pyspark_cleaned_data/py_in_gateway.csv")
Deleting the original raw data from HDFS
def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
Execute it by passing the absolute HDFS path
delete_path(sparkSession, '/cdrs//cdrs/file_path/')
I am able to do all the operations successfully from the pyspark prompt.
Now I want to develop the application and submit the job using spark-submit.
For example:
spark-submit --master yarn --deploy-mode client project.py (for local)
spark-submit --master yarn --deploy-mode cluster project.py (for cluster)
At this point I am stuck. I am not sure what parameter I am supposed to pass in place of yarn in spark-submit, and I am not sure whether simply copying and pasting all of the above commands into a .py file will work. I am very new to this technology.
Basically, your Spark job will run on a cluster. Spark 2.4.4 supports YARN, Kubernetes, Mesos, and Spark standalone clusters (see the Spark documentation).
--master yarn specifies that you are submitting your spark job to a yarn cluster.
--deploy-mode specifies whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
spark-submit --master yarn --deploy-mode client project.py for client mode
spark-submit --master yarn --deploy-mode cluster project.py for cluster mode
spark-submit --master local project.py for local mode
You can provide other arguments when submitting your Spark job, such as --driver-memory, --executor-memory, and --num-executors; see spark-submit --help for the full list.
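As for whether the prompt commands can simply be pasted into a .py file: roughly yes, as long as the script creates its own SparkSession (the pyspark prompt does that for you). Below is a sketch of how project.py might be laid out. The --input, --output and --sep arguments and the function names are illustrative, not part of the original script, and the column selection and raw-data deletion steps are left out for brevity.

```python
import argparse


def parse_args(argv=None):
    """Parse the job's own arguments (everything after project.py on the command line)."""
    parser = argparse.ArgumentParser(description="Clean raw CSV files stored in HDFS")
    parser.add_argument("--input", required=True, help="HDFS glob of the raw CSV files")
    parser.add_argument("--output", required=True, help="HDFS directory for the cleaned data")
    parser.add_argument("--sep", default=";", help="field separator of the raw files")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Imported here so that the module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    # No .master(...) call: the master is taken from spark-submit --master.
    spark = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
    try:
        df = spark.read.csv(args.input, sep=args.sep)
        cleaned = df.na.drop()  # drop rows that contain any NULL value
        cleaned.write.csv(args.output)
    finally:
        spark.stop()


# In project.py, finish the file with:
#   if __name__ == "__main__":
#       main()
```

It would then be submitted with something like spark-submit --master yarn --deploy-mode cluster project.py --input 'hdfs:///cdrs/file_path/*.csv' --output hdfs:///some/output/dir, where the output directory is a placeholder.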

How to pass an external resource yml/property file while running a Spark job on a cluster?

I am using spark-sql 2.4.1 version, jackson jars & Java 8.
In my Spark program/job I am reading a few configurations/properties from an external "conditions.yml" file, which is placed in the "resources" folder of my Java project, as below:
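ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
try {
    driverConfig = mapper.readValue(
        Configuration.class.getClassLoader().getResourceAsStream("conditions.yml"), Configuration.class);
} catch (IOException e) { // java.io.IOException
    throw new RuntimeException("Could not read conditions.yml", e);
}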
If I want to pass the "conditions.yml" file from outside when submitting the Spark job, how do I pass it, and where should it be placed?
In my program I am reading from the "resources" directory, i.e. .getResourceAsStream("conditions.yml"). If I pass this property file via spark-submit, will the job take it from resources or from the external path?
If I want to pass it as an external file, do I need to change the code above?
Updated question:
In my Spark driver program I am reading the property file from the program arguments,
which is loaded as below:
Config props = ConfigFactory.parseFile(new File(args[0]));
When running my Spark job from a shell script,
I submit it as below:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties condition.yml
Error:
The properties are not loaded. What is wrong here? What is the correct way to pass program arguments to a Spark job written in Java?
You will have to pass the path to your file with --files in the spark-submit command to be able to ship any files.
The syntax is:
--files /home/user/config/my-file.yml
If the file is on HDFS, provide the HDFS path instead.
This should copy the file to the classpath, and your code should be able to find it from the driver.
Reading the file can be implemented with something like this:
import java.util.Properties
import scala.io.Source

def readProperties(propertiesPath: String): Properties = {
  // Look the file up on the classpath
  val url = getClass.getResource("/" + propertiesPath)
  assert(url != null, s"Could not create URL to read $propertiesPath properties file")
  val source = Source.fromURL(url)
  val properties = new Properties
  properties.load(source.bufferedReader)
  properties
}
Hope that is what you are looking for.
You can add:
spec:
args:
--deploy-mode
cluster

replace default application.conf file in spark-submit

My code works like:
val config = ConfigFactory.load
It gets the key-value pairs from application.conf by default. Then I use -Dconfig.file= to point to another conf file.
It works fine for command below:
dse -u cassandra -p cassandra spark-submit \
--class packagename.classname --driver-java-options \
-Dconfig.file=/home/userconfig.conf /home/user-jar-with-dependencies.jar
But now I need to split userconfig.conf into two files. I tried the command below, but it doesn't work.
dse -u cassandra -p cassandra spark-submit \
--class packagename.classname --driver-java-options \
-Dconfig.file=/home/userconfig.conf,env.conf \
/home/user-jar-with-dependencies.jar
By default Spark will look in spark-defaults.conf, but you can 1) specify another file using --properties-file, 2) pass individual key-value properties using --conf, or 3) set up the configuration programmatically in your code using the SparkConf object.
Does this help or are you looking for the akka application.conf file?
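To make the first two options concrete, here is a small sketch of how they map onto spark-submit arguments. The helper build_submit_args and the file /home/my-spark.conf are hypothetical; only the --properties-file and --conf flags come from spark-submit itself. Note that these settings configure Spark and do not change which file ConfigFactory.load reads.

```python
def build_submit_args(app, conf=None, properties_file=None):
    """Assemble a spark-submit argument list.

    properties_file -> "--properties-file <path>"      (option 1)
    conf            -> one "--conf key=value" per entry (option 2)
    """
    args = ["spark-submit"]
    if properties_file:
        args += ["--properties-file", properties_file]
    # Sorted so the generated command line is stable between runs.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


cmd = build_submit_args(
    "/home/user-jar-with-dependencies.jar",
    conf={"spark.executor.memory": "2g", "spark.app.name": "demo"},
    properties_file="/home/my-spark.conf",  # hypothetical path
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run as-is, or joined into a single line for a shell script.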

Read files sent with spark-submit by the driver

I am sending a Spark job to run on a remote cluster by running
spark-submit ... --deploy-mode cluster --files some.properties ...
I want to read the content of the some.properties file by the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.
The ways around this problem that I know of are:
Upload the file to HDFS
Store the file in the app jar
Both are inconvenient since this file is frequently changed on the submitting dev machine.
Is there a way to read the file that was uploaded using the --files flag during the driver code main method?
Yes, you can access files uploaded via the --files argument.
This is how I'm able to access files passed in via --files:
./bin/spark-submit \
--class com.MyClass \
--master yarn-cluster \
--files /path/to/some/file.ext \
--jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
/path/to/app.jar file.ext
and in my Spark code:
val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size
I do believe these files are downloaded onto the workers in the same directory as the jar is placed, which is why simply passing the filename and not the absolute path to Source.fromFile works.
After some investigation, I found one solution for the issue above: send the any.properties configuration file with spark-submit and use it in the Spark driver before and after SparkSession initialization. Hope it helps.
any.properties
spark.key=value
spark.app.name=MyApp
SparkTest.java
import java.io.File;

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkTest {
    public static void main(String[] args) {
        String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
        Config conf = loadConf();
        System.out.println(conf.getString("spark.key"));
        // Initialize SparkContext and use configuration from properties
        SparkConf sparkConf = new SparkConf(true).setAppName(conf.getString("spark.app.name"));
        SparkSession sparkSession =
            SparkSession.builder().config(sparkConf).config("spark.sql.warehouse.dir", warehouseLocation)
                .enableHiveSupport().getOrCreate();
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
    }

    public static Config loadConf() {
        // Relative path: resolved against the driver's working directory
        String configFileName = "any.properties";
        System.out.println(configFileName);
        Config configs = ConfigFactory.load(ConfigFactory.parseFile(new File(configFileName)));
        System.out.println(configs.getString("spark.key")); // get value from properties file
        return configs;
    }
}
Spark Submit:
spark-submit --class SparkTest --master yarn --deploy-mode client --files any.properties,yy-site.xml --jars ...........
If you run spark-submit --help, you will find that this option only applies to the working directory of the executors, not the driver.
--files FILES: Comma-separated list of files to be placed in the working directory of each executor.
The --files and --archives options support specifying file names with a #, just like Hadoop.
For example, you can specify --files localtest.txt#appSees.txt. This uploads the file you have locally named localtest.txt into the Spark worker directory, but it is linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
This works for my Spark Streaming application in both yarn/client and yarn/cluster mode.
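The naming rule described here (use the alias after # if there is one, otherwise the file's base name) can be sketched as a small helper. localized_name is a made-up name for illustration; it only computes the name the application would open and does not talk to Spark or YARN.

```python
import posixpath
from urllib.parse import urlsplit


def localized_name(files_entry):
    """Return the name an application should use for one --files entry.

    "path#alias" -> "alias"; otherwise the base name of the path.
    """
    parts = urlsplit(files_entry)
    if parts.fragment:
        return parts.fragment
    return posixpath.basename(parts.path)


# A comma-separated --files value maps to one name per entry.
files_value = "localtest.txt#appSees.txt,/User/employee.csv,hdfs://files/condition.yml"
names = [localized_name(entry) for entry in files_value.split(",")]
print(names)
```

For the entries above this yields appSees.txt, employee.csv and condition.yml, which are the relative names to open from the container's working directory.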
In PySpark, I found this surprisingly easy to achieve. First, arrange your working directory like this:
/path/to/your/workdir/
|--code.py
|--file.txt
and then in the main section of your code.py, just read the file as usual:
if __name__ == "__main__":
    content = open("./file.txt").read()
then submit it without any specific configuration, as follows:
spark-submit code.py
It runs correctly, which amazes me. I suppose that in PySpark the submit process archives the files and subdirectory files together and sends them to the driver, while in the Scala version you have to archive them yourself. By the way, both the --files and --archives options work on the workers, not the driver, which means you can only access these files in RDD transformations or actions.
Here's a nice solution I developed in Python Spark in order to integrate any data as a file from outside to your Big Data platform.
Have fun.
# Load any local text file on the Spark driver and return an RDD (really useful in YARN mode to integrate new data on the fly)
# (See https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html)
import os

def parallelizeTextFileToRDD(sparkContext, localTextFilePath, splitChar):
    localTextFilePath = localTextFilePath.strip(' ')
    if localTextFilePath.startswith("file://"):
        localTextFilePath = localTextFilePath[7:]
    # Read the file directly rather than shelling out to cat, so the content is text (not bytes) on Python 3
    with open(os.path.expanduser(localTextFilePath)) as f:
        data = f.read()
    textRDD = sparkContext.parallelize(data.split(splitChar))
    return textRDD

# Usage example
myRDD = parallelizeTextFileToRDD(sc, '~/myTextFile.txt', '\n') # Load my local file as an RDD
myRDD.saveAsTextFile('/user/foo/myTextFile') # Store my data to HDFS
One way around the problem is to create a temporary SparkContext by calling SparkContext.getOrCreate() and then read the file you passed with --files with the help of SparkFiles.get('FILE').
Once you have read the file, put all the configuration you need into a SparkConf() variable.
After that, stop the temporary context:
SparkContext.getOrCreate().stop()
This destroys the existing SparkContext; in the next line, simply initialize a new SparkContext with the necessary configuration, like this:
sc = SparkContext(conf=conf)
You now have a SparkContext with the desired settings.
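The configuration-loading half of this approach might look like the sketch below. It assumes the submitted file holds plain key=value lines (like the any.properties example earlier on this page); parse_properties and build_context are hypothetical helper names, and properties_path stands for whatever path SparkFiles.get returned for your file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def build_context(properties_path):
    """Read the properties file, then start a SparkContext configured from it.

    Call this only after the temporary context has been stopped.
    """
    # Imported lazily so parse_properties can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    with open(properties_path) as f:
        props = parse_properties(f.read())
    conf = SparkConf().setAll(list(props.items()))
    return SparkContext.getOrCreate(conf=conf)
```

Keeping the parsing separate from the context creation makes it easy to check the file handling on its own, without a cluster.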

Resources