Spark standalone cluster cannot read files from the local filesystem - apache-spark

I have a Spark standalone cluster with 2 worker nodes and 1 master node.
Using spark-shell, I was able to read data from a file on the local filesystem, apply some transformations, and save the final RDD to /home/output (let's say).
The RDD was saved successfully, but only on one worker node; on the master node only the _SUCCESS file was present.
Now, if I try to read this output back from /home/output, I get no data: the read finds 0 records on the master, and I assume it is not checking the other worker nodes for that path.
It would be great if someone could shed some light on why Spark is not reading from all the worker nodes, or on the mechanism Spark uses to read data from the worker nodes.
scala> sc.textFile("/home/output/")
res7: org.apache.spark.rdd.RDD[(String, String)] = /home/output/ MapPartitionsRDD[5] at wholeTextFiles at <console>:25
scala> res7.count
res8: Long = 0

By default, SparkContext (i.e. sc) resolves paths using the Hadoop configuration found in HADOOP_CONF_DIR. The default filesystem there is generally set to hdfs://, which means that when you call sc.textFile("/home/output/") Spark looks for hdfs:///home/output, which in your case does not exist on HDFS. The file:// scheme points to the local filesystem.
Try sc.textFile("file:///home/output"), thus explicitly telling Spark to read from the local filesystem.

You should put the file on every worker machine, at the same path and with the same name.
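Alternatively, if an HDFS (or other shared) filesystem is reachable from the cluster, writing the output there avoids scattering it across the workers' local disks in the first place. A minimal spark-shell sketch, assuming a hypothetical namenode address and input path:

val data = sc.textFile("file:///home/input/data.txt")               // read from the local filesystem (hypothetical path)
val result = data.map(_.toUpperCase)                                 // any transformation
result.saveAsTextFile("hdfs://namenode:8020/user/spark/output")      // every executor writes its part to shared storage
sc.textFile("hdfs://namenode:8020/user/spark/output").count          // now readable from any node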

Related

Is writing to a database done by the driver or the executors in a Spark cluster

I have a Spark cluster set up with 1 master node and 2 worker nodes.
I am running a PySpark application on this Spark standalone cluster, where one job writes the transformed data to a MySQL database.
So my question is: is writing to the database done by the driver or by the executors?
I ask because when writing to a text file, it seems to be done by the driver, since my output file gets created on the driver.
Updated
Adding below the code I have used to write to a text file:
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    sc = SparkContext(master="spark://IP:PORT", appName="word_count_application")
    words = sc.textFile("book_2.txt")
    word_count = words.flatMap(lambda a: a.split(" ")).map(lambda a: (a, 1)).reduceByKey(lambda a, b: a + b)
    word_count.saveAsTextFile("book2_output.txt")
If the writing is done using the Dataset/DataFrame API, like this:
df.write.csv("...")
then it is done by the executors. That is why Spark produces multiple files in the output: each executor writes the partitions assigned to it.
The driver is used for scheduling work across the executors, not for doing the actual work (reading, transforming and writing), which is done by the executors.
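To make this concrete, a small spark-shell sketch (the output path is hypothetical; in a cluster it should be on shared storage such as HDFS): a DataFrame with four partitions produces four part files plus a _SUCCESS marker, one file per partition, written by whichever executor ran that task.

val demo = spark.range(0, 1000).repartition(4)               // 4 partitions => 4 output tasks
demo.write.mode("overwrite").csv("hdfs:///tmp/parts_demo")   // hypothetical path
// hdfs dfs -ls /tmp/parts_demo  ->  part-00000 ... part-00003 and _SUCCESS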
saveAsTextFile() is also distributed: each executor writes its own files. Your driver will never write any files since, as #Abdennacer Lachiheb already mentioned, it is responsible for scheduling, the Spark UI and more.
Your path refers to a local filesystem, so your files are not saved on your driver but on the machine your driver runs on. The path could also be an object store like S3, or HDFS.
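For the MySQL case specifically, the DataFrame JDBC writer behaves the same way: the driver only plans the job, and each executor opens its own connection and writes its own partitions. A minimal sketch, assuming df holds the transformed data as a DataFrame, the MySQL Connector/J JAR is on the classpath, and the connection details are hypothetical:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark_user")                 // hypothetical credentials
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

// Each executor writes the partitions it holds; the driver never touches the rows
df.write.mode("append").jdbc("jdbc:mysql://db-host:3306/mydb", "word_counts", props)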

Problem with data locality when running a Spark query with local nature on Apache Hadoop

I have a Hadoop cluster that uses Apache Spark to query parquet files saved on Hadoop. For example, I use the following PySpark code to find a word in the parquet files:
df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df['word'] == "jhon").show()
After running this code, I go to the Spark application UI, Stages tab. I see that the locality level summary is set to Any. In contrast, because of this query's nature, it should run locally, at NODE_LOCAL locality level at least. When I check the network IO of the cluster while this runs, I find that the query uses the network (network IO increases while the query is running). The strange part is that the number shown in the Spark UI's shuffle section is minimal.
With Russell Spitzer's help on the Apache Spark mailing list, I ran the following code to find each partition's preferred locations and get closer to the root cause. I found out that the preferred locations are hostnames rather than IPs, while Spark uses the executors' IP addresses to match against the preferred locations and achieve data locality.
scala> def getRootRdd( rdd:RDD[_] ): RDD[_] = { if(rdd.dependencies.size == 0) rdd else getRootRdd(rdd.dependencies(0).rdd)}
getRootRdd: (rdd: org.apache.spark.rdd.RDD[_])org.apache.spark.rdd.RDD[_]
scala> val rdd = spark.read.parquet("hdfs://test/parquets/*").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[38] at rdd at <console>:24
scala> val scan = getRootRdd(rdd)
scan: org.apache.spark.rdd.RDD[_] = FileScanRDD[33] at rdd at <console>:24
scala> scan.partitions.map(scan.preferredLocations)
res12: Array[Seq[String]] = Array(WrappedArray(datanode-1, datanode-2, datanode-3), WrappedArray(datanode-2, datanode-3, datanode-4), WrappedArray(datanode-3, datanode-4, datanode-5),...
Now I am trying to find a way to make Spark first resolve the hostnames and then match them with the executors' IPs. Are there any suggestions?
This problem arises because Spark's preferred locations for the partitions, obtained from Hadoop, are datanode hostnames, while the Spark workers registered with the Spark master by IP. Spark tries to schedule each task on the executor that holds the local partition, but because executors are identified by IP and partitions by hostname, the scheduler cannot match the two, and tasks always run at the "Any" locality level.
To solve this, run the Spark workers with the -h [hostname] flag. The workers then register with the master by hostname instead of IP, which resolves the problem.
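A quick way to confirm the mismatch, reusing sc and the scan FileScanRDD from the spark-shell session above, is to compare the hosts the executors registered with against the preferred locations reported for the partitions; if the two sets do not overlap, NODE_LOCAL scheduling is impossible. A sketch:

// Hosts (or IPs) under which the executors' block managers registered
val executorHosts = sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSet

// Hosts that HDFS reports as holding the blocks of each partition
val preferredHosts = scan.partitions.flatMap(scan.preferredLocations).toSet

// Empty intersection => the scheduler can only ever achieve the "Any" locality level
println(preferredHosts.intersect(executorHosts))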

Saving a DataFrame to the local filesystem results in empty output

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
Writing df to HDFS with the same write call works fine, as reading the result back confirms:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local parquet or CSV file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times, for both CSV and parquet, and on two different EMR servers: the same behavior is exhibited in all cases.
Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works with Spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because you then have a shared filesystem).
A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local filesystem.
Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the filesystem), but depending on the commit algorithm, it might not even be finalized (moved from the temporary directory).
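If what you actually need is a single file on the driver's local filesystem, one option is to make the collection explicit yourself: bring the (small) result to the driver and write it with plain JVM I/O. A minimal sketch, only sensible when the DataFrame comfortably fits in driver memory:

import java.io.PrintWriter

val rows = df.collect()                          // pulls every row to the driver
val out  = new PrintWriter("/tmp/topVendors.csv")
try {
  rows.foreach(r => out.println(r.mkString(","))) // naive CSV: no quoting or escaping
} finally {
  out.close()
}

Alternatively, write to HDFS as before and copy the result down with hdfs dfs -get (or -getmerge).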
This error usually occurs when you try to read an empty directory as parquet.
You could check:
1. whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it;
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try running it in client mode if you are currently running in cluster mode.

Apache Spark is not giving correct output

I am a beginner and want to learn about Spark. I am working with spark-shell and doing some experiments to get fast results, and I want the results to come from the Spark worker nodes.
I have two machines in total: the driver and one worker are on a single machine, and another worker is on the other machine.
When I run a count, the result does not come from both nodes. I have a JSON file to read and am doing some performance checking.
Here is the code:
spark-shell --conf spark.sql.warehouse.dir=C:\spark-warehouse --master spark://192.168.0.31:7077
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfs = sqlContext.read.json("file:///C:/order.json")
dfs.count
The order.json file is present on both machines, but I am still getting different output.
1. If you are running Spark on different nodes, then you should use an S3 or HDFS path and make sure each node can access your data source.
val dfs = sqlContext.read.json("file:///C:/order.json")
Change to
val dfs = sqlContext.read.json("HDFS://order.json")
2. If your data source is pretty small, you can use a Spark broadcast variable to share the data with the other nodes, so that each node sees consistent data (see the sketch after this list): https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables
3. To print your logs in the console, configure the log4j file in your Spark conf folder.
For details, see Override Spark log4j configurations.
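For point 2, a minimal broadcast sketch (it assumes the small order.json file is readable on the driver machine): the driver loads the file once and broadcasts its contents, so every task sees the same copy regardless of which worker it runs on.

import scala.io.Source

// Driver side: read the small file locally and broadcast it
val src = Source.fromFile("C:/order.json")
val orderJson = try src.mkString finally src.close()
val orderBc = sc.broadcast(orderJson)

// Executor side: each task reads the broadcast value instead of a worker-local file
val sizes = sc.parallelize(1 to 4).map(_ => orderBc.value.length).collect()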

What will happen in a cluster environment when I do spark.textFile("hdfs://...log.txt")?

I am new to Spark and have learnt some of its basic concepts. Although I now have some understanding of concepts such as partitions, stages, tasks and transformations, I find it a bit difficult to connect these dots.
Assume the file has 4 lines (each line takes 64 MB, so it is the same as the default partition size) and that I have one master node and 4 slave nodes.
val input = spark.textFile("hdfs://...log.txt")
// whatever transformation here
val splitedLines = input.map(line => line.split(" "))
.map(words => (words(0), 1))
.reduceByKey{(a,b) => a + b}
I am wondering what will happen on the master node and on the slave nodes.
Here is my understanding; please correct me if I am wrong.
When I start the SparkContext, each worker starts an executor, according to this post: What is a task in Spark? How does the Spark worker execute the jar file?
Then the application gets pushed to the slave nodes.
Will each of the 4 slave nodes read one line from the file? If so, does that mean an RDD is generated on each slave node? Then a DAG will be generated based on the RDD, stages will be built, and tasks will be identified as well. In this case, each slave node holds one RDD with one partition.
Or will the master node read the entire file and build an RDD, then the DAG, then the stages, and only push the tasks to the slave nodes, so that the slave nodes only process tasks such as map, filter or reduceByKey? But if this is the case, how would the slave nodes read the file? How is the file, or the RDD, distributed among the slaves?
What I am looking for is to understand the flow step by step, and where each step happens: on the master node or on the slave nodes.
Thank you for your time.
Cheers
Will each of the 4 slave nodes read one line from the file?
Yes. Since the file is split into blocks, it is read in parallel (how finely it is split is tunable, for example via the minPartitions argument of textFile).
How is the file or RDD distributed among the slaves?
HDFS takes care of the splitting, and the Spark workers are responsible for reading their splits.
Source: https://github.com/jaceklaskowski/mastering-apache-spark-book
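To see the block-to-partition mapping concretely, a small spark-shell sketch (the HDFS path is hypothetical): with the default HDFS block size each block becomes one partition, and the optional minPartitions argument of textFile can request a finer split.

val input = sc.textFile("hdfs:///logs/access.log")
println(input.getNumPartitions)        // roughly file size / HDFS block size

val finer = sc.textFile("hdfs:///logs/access.log", 16)
println(finer.getNumPartitions)        // typically at least 16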
