PySpark reading Caffe models from HDFS - apache-spark

I am using the Caffe library for image detection with PySpark. I can run the Spark program in local mode, where the model is on the local file system.
But I don't know the correct way to deploy it in cluster mode. I have tried the following approach:
Adding the files to HDFS, and using sc.addFile or --files when submitting the job:
sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")
Reading the model in each worker node using
model_weight = SparkFiles.get('test.caffemodel')
net = caffe.Net(model_define, model_weight, caffe.TEST)
SparkFiles.get() returns the file's local path on the worker node (not the HDFS path), so I can reconstruct my model from the path it returns. This approach also works fine in local mode; in distributed mode, however, it results in the following error:
ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
It seems the data may be too large to shuffle, as discussed in "Apache Spark: network errors between executors". However, the model is only around 1 MB.
Updated:
I found that if the path in sc.addFile(path) is on HDFS, the error does not appear. However, when the path is on the local file system, the error appears.
My questions are:
1. Is there any possible cause of the above exception other than the size of the file? (Spark is running on YARN, and I use the default shuffle service, not the external shuffle service.)
2. If I do not add the file when submitting the job, how do I read the model file from HDFS using PySpark, so that I can reconstruct the model using the Caffe API? Or is there any way to get the path other than SparkFiles.get()?
Any suggestions will be appreciated!!
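For the second question, one possible approach is to skip addFile and copy the model from HDFS to local disk inside each task, then hand the local path to Caffe. The sketch below assumes the hdfs command-line client is installed on every worker node. The names fetch_from_hdfs, predict_partition, and load_model are hypothetical helpers for illustration, not part of PySpark or Caffe.

```python
import os
import subprocess
import tempfile


def fetch_from_hdfs(hdfs_path, local_dir=None, runner=subprocess.check_call):
    """Copy one file from HDFS to local disk and return the local path.

    Assumption: the `hdfs` CLI is on the PATH of every worker.
    `runner` is injectable so the helper can be tried without a cluster.
    """
    local_dir = local_dir or tempfile.mkdtemp(prefix="model-")
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    if not os.path.exists(local_path):  # skip the copy if it is already there
        runner(["hdfs", "dfs", "-get", hdfs_path, local_path])
    return local_path


def predict_partition(rows, hdfs_model, load_model, fetch=fetch_from_hdfs):
    """Load the model once per partition, then score each row lazily.

    `load_model` is a hypothetical callable: it takes a local path and
    returns a function that scores a single row.
    """
    score = load_model(fetch(hdfs_model))
    for row in rows:
        yield score(row)


# Hypothetical usage on a cluster (model_define and the forward pass are placeholders):
#   def load_model(weights):
#       net = caffe.Net(model_define, weights, caffe.TEST)
#       return lambda image: ...  # run the forward pass for one image
#   results = images_rdd.mapPartitions(
#       lambda rows: predict_partition(
#           rows, "hdfs:///caffe-public/dataset/test.caffemodel", load_model))
```

Loading inside mapPartitions rather than map means the model file is fetched and parsed once per partition instead of once per record.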

Related

How to make sure that spark.write.parquet() writes the data frame on to the file specified using relative path and not in HDFS, on EMR?

My problem is as below:
A PySpark script that runs perfectly on a local machine and on an EC2 instance has been ported to EMR for scaling up. A config file specifies relative locations for the outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup/
    feature_generator.py
    model_execution.py
    config.py
    utils.py
    logs/
    models/
    resources/
    outputs/
The code reads the config, generates features, and writes them to the path mentioned above. On EMR, this output is saved to HDFS (spark.write.parquet writes to HDFS; on the other hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script reads the same path from the config, tries to read the parquet from that location, and fails.
How can I make sure that the outputs are created in the relative path that is specified in the code?
If that's not possible, how can I make sure that I read them from HDFS in the subsequent steps?
I referred to these discussions (HDFS paths), but it's still not very clear to me. Can someone help me with this?
Thanks.
Short answer to your question:
Writing with Pandas and writing with Spark are two different things. Pandas doesn't use Hadoop to process, read, or write; it writes to the local file system of the EMR node, which is not HDFS. Spark, on the other hand, distributes work across multiple machines at the same time and is built on top of Hadoop, so by default, when you write using Spark on EMR, it writes to HDFS.
While writing from EMR, you can choose to write either into
EMR local filesystem,
HDFS, or
EMRFS (that is, S3 buckets).
Refer to the AWS documentation.
If, at the end of your job, you are writing a Pandas dataframe and you want to write it to an HDFS location (maybe because your next Spark job reads from HDFS, or for some other reason), you might have to use PyArrow for that.
If, at the end of your job, you are writing to HDFS using a Spark dataframe, the next step can read it back with a path like hdfs://<feature_outputs>.
Also, when saving data to HDFS on EMR, keep in mind that the default EMR storage is volatile: all the data is lost once the EMR cluster is terminated. If you want to keep your data stored in EMR, you might have to attach an external EBS volume that can also be used by other EMR clusters, or use some other storage solution that AWS provides.
If you need your data to persist, the best way is to write it to S3 instead of EMR.
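To make the choice of file system explicit rather than relying on the default, the configured path can be given a scheme before it is passed to Spark. This is a minimal sketch: qualify is a hypothetical helper, and the bucket name in the usage comments is a placeholder.

```python
def qualify(path, target, bucket=None):
    """Return `path` with an explicit scheme so the target file system is unambiguous.

    target: "hdfs" (cluster HDFS), "local" (node-local disk) or "s3" (EMRFS).
    """
    path = "/" + path.lstrip("/")  # normalise to a single leading slash
    if target == "hdfs":
        return "hdfs://" + path          # empty authority: use the cluster's default namenode
    if target == "local":
        return "file://" + path
    if target == "s3":
        if not bucket:
            raise ValueError("an S3 bucket name is required for target='s3'")
        return "s3://" + bucket + path
    raise ValueError("unknown target: %r" % (target,))


# Hypothetical usage with the config value from the question:
#   feature_outputs = "/outputs/features/"
#   df.write.parquet(qualify(feature_outputs, "hdfs"))
#   features = spark.read.parquet(qualify(feature_outputs, "hdfs"))
#   df.write.parquet(qualify(feature_outputs, "s3", bucket="my-bucket"))  # persistent
```

Note that a file:// path written by Spark on a multi-node cluster ends up on each executor's own local disk, so it is rarely what you want for data that a later step must read back.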

PySpark job only partly running

I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call toPandas() on a PySpark DataFrame and afterwards I have various manipulations of the DataFrame, finishing in a call to to_csv() to write results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and afterwards, however only the log entries before the call show up on the spark-submit console output. I have thought that maybe this is due to the rest of the code being run in a separate executor process by Spark, so the logs don't show on the console. If this is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true but this seems to show only internal events, not my actual application log statements.
Whether or not the assumption about executor logs above is true, I don't see the CSV file written to the path that I expect (/tmp). Further, the history server says "No completed applications found!" when I start it, having configured it to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, and the only completed job listed there is for my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
When you use toPandas() to convert your Spark dataframe to a pandas dataframe, it's actually a heavy action, because it pulls all the records to the driver.
Remember that Spark is a distributed computing engine that computes in parallel. Your data is therefore distributed across different nodes, which is completely different from a pandas dataframe: pandas works on a single machine, but Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post, it actually covers 2 questions:
Why there are no logs after toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only saves the details of jobs that appear in the Spark computation DAG. Other, non-Spark logs are not saved in the Spark event log; if you really need those logs, use a logging library such as Python's logging module to collect them on the driver.
Why there is no CSV saved in the /tmp dir: As you mentioned, the event log shows an incomplete application but not a failed one, so I believe your dataframe is so huge that the collection has not finished and your transformations on the pandas dataframe have not even started. You can try to collect a few records, say df.limit(20).toPandas(), to see whether it works. If it works, the dataframe you are converting to pandas is simply very large and takes time. If it doesn't work, maybe you can share more about the error traceback.
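Both suggestions can be combined in a small driver-side sketch: configure a logger that writes to the console, then log around a limited toPandas() call. The helper names get_driver_logger and to_pandas_sample are hypothetical, and the logger name and format are arbitrary choices.

```python
import logging
import sys


def get_driver_logger(name="my_app", stream=None):
    """Return a logger that writes to `stream` (stderr by default) on the driver."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False  # keep messages out of any root-logger configuration
    return logger


def to_pandas_sample(df, n, log):
    """Collect at most `n` rows to the driver, logging before and after."""
    log.info("before toPandas, limit=%d", n)
    pdf = df.limit(n).toPandas()
    log.info("after toPandas, rows=%d", len(pdf))
    return pdf


# Hypothetical usage in the driver script:
#   log = get_driver_logger()
#   pdf = to_pandas_sample(spark_df, 20, log)
#   pdf.to_csv("/tmp/sample.csv", index=False)
```

If the "after" line shows up for 20 rows but never for the full dataframe, that points to the collection itself being the slow step rather than the pandas code after it.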

Spark concurrent writes on same HDFS location

I have Spark code which saves a dataframe to an HDFS location (a date-partitioned location) in JSON format using append mode.
df.write.mode("append").format('json').save(hdfsPath)
sample hdfs location : /tmp/table1/datepart=20190903
I am consuming data from upstream in a NiFi cluster. Each node in the NiFi cluster creates a flow file for the consumed data, and my Spark code processes that flow file. As NiFi is distributed, my Spark code is executed from different NiFi nodes in parallel, all trying to save data into the same HDFS location.
I cannot store the output of the Spark job in different directories, as my data is partitioned by date.
This process has been running once a day for the last 14 days, and my Spark job has failed 4 times with different errors.
First Error:
java.io.IOException: Failed to rename FileStatus{path=hdfs://tmp/table1/datepart=20190824/_temporary/0/task_20190824020604_0000_m_000000/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json; isDirectory=false; length=0; replication=3; blocksize=268435456; modification_time=1566630365451; access_time=1566630365034; owner=hive; group=hive; permission=rwxrwx--x; isSymlink=false} to hdfs://tmp/table1/datepart=20190824/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json
Second Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190825/_temporary/0 does not exist.
Third Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190901/_temporary/0/task_20190901020450_0000_m_000000 does not exist.
Fourth Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190903/_temporary/0 does not exist.
These are my problems/questions:
I am not able to recreate this scenario. How can I do that?
On all 4 occasions, the errors are related to the _temporary directory. Is it because 2 or more jobs are trying to save data into the same HDFS location in parallel, and while doing that Job A might have deleted the _temporary directory of Job B? (Because the location is the same and all folders have the common name /_temporary/0/.)
If it is a concurrency problem, then I can run all the NiFi processors from the primary node, but then I will lose performance.
Need your expert advice.
Thanks in advance.
It seems the problem is that two Spark jobs (running from different nodes) are independently trying to write to the same place, causing conflicts: the fastest one cleans up the working directory before the second one is done with it.
The most straightforward solution may be to avoid this.
As I understand how you use NiFi and Spark, the node where NiFi runs also determines the node where Spark runs (is there a 1-1 relationship?).
If that is the case, you should be able to solve this by routing the work in NiFi to nodes that do not interfere with each other. Check out the load balancing strategy (a property of the queue) that depends on attributes. Of course you would need to define the right attribute, but something like directory or table name should go a long way.
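The suspected race can be illustrated on a local disk without Spark. The following is a toy model only: it mimics two jobs that share one _temporary/0 working directory and clean it up on commit. It is not Hadoop's or Spark's committer code, and start_job, commit_job, and simulate are hypothetical names.

```python
import os
import shutil
import tempfile


def start_job(output_dir, task_name):
    """Create a task directory under the shared _temporary/0 and write one part file."""
    task_dir = os.path.join(output_dir, "_temporary", "0", task_name)
    os.makedirs(task_dir, exist_ok=True)
    with open(os.path.join(task_dir, "part-00000.json"), "w") as fh:
        fh.write("{}\n")
    return task_dir


def commit_job(output_dir, task_dir):
    """Move the task's files into the output directory, then delete all of _temporary."""
    for name in os.listdir(task_dir):  # FileNotFoundError if another job already cleaned up
        target = os.path.join(output_dir, os.path.basename(task_dir) + "-" + name)
        os.rename(os.path.join(task_dir, name), target)
    shutil.rmtree(os.path.join(output_dir, "_temporary"))


def simulate():
    """Run two jobs against the same output directory; job A commits first."""
    out = tempfile.mkdtemp(prefix="datepart-")
    try:
        task_a = start_job(out, "task_A")
        task_b = start_job(out, "task_B")
        commit_job(out, task_a)  # job A finishes and removes the shared _temporary
        try:
            commit_job(out, task_b)
        except FileNotFoundError:
            return "job B failed: _temporary was deleted by job A"
        return "both jobs committed"
    finally:
        shutil.rmtree(out, ignore_errors=True)


print(simulate())
```

In this toy model job B always fails, because the ordering is forced. In a real cluster the outcome depends on timing, which would be consistent with the job failing on only 4 of 14 days.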
Try enabling output committer v2:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
It doesn't use a shared temp directory for files, but creates independent .sparkStaging-<...> temp directories for each write.
It also speeds up writes, but allows some rare hypothetical cases of partial data writes.
Check this doc for more info:
https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#recommended-settings-for-writing-to-object-stores

PySpark: pull data to driver and then upload to dataframe

I am trying to create a pyspark dataframe from data stored in an external database. I use the pyodbc module to connect to the database and pull the required data, after which I use spark.createDataFrame to send my data to the cluster for analysis.
I run the script using --deploy-mode client, so the driver runs on the master node, but the executors can be distributed to other machines. The problem is pyodbc is not installed on any of the worker nodes (this is fine since I don't want them all querying the database anyway), so when I try to import this module in my scripts, I get an import error (unless all the executors happen to be on the master node).
My question is how can I specify that I want a certain portion of my code (in this case, importing pyodbc and querying the database) to run on the driver only? I am thinking something along the lines of
if __name__ == '__driver__':
    <do stuff>
else:
    <wait until stuff is done>
The imports in your Python driver script DO run only on the driver. The only time you will see errors on your executors about missing imports is when a function that runs on the executors (for example, one passed to an RDD/DataFrame operation) references some object or function from one of those imports. I would look carefully at any Python code you are running in RDD/DataFrame calls for unintended references. If you post your code, we can give you more specific guidance.
Also, routing data through your driver is usually not a great idea because it will not scale well. If you have lots of data, you are forcing it all through a single point, which defeats the purpose of distributed processing!
Depending on which database you are using, there is probably a Spark connector implemented to load it directly into a dataframe. If you are using ODBC, then maybe you are using SQL Server? In that case you should be able to use JDBC drivers, as for example in this post:
https://stephanefrechette.com/connect-sql-server-using-apache-spark/#.Wy1S7WNKjmE
This is not how Spark is supposed to work. Spark collections (RDDs or DataFrames) are inherently distributed. What you're describing is creating a dataset locally, by reading the whole dataset into the driver's memory, and then sending it over to the executors for further processing by creating an RDD or DataFrame out of it. That does not make much sense.
If you want to make sure that there is only one connection from spark to your database, then set the parallelism to 1. You can then increase the parallelism in further transformation steps.
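If the data is small enough that pulling it through the driver is acceptable, the driver-only part can be kept in plain functions that are never passed to an RDD or DataFrame operation. The sketch below is one way to structure that. fetch_rows and load_on_driver are hypothetical names, the connection string and query are placeholders, and sqlite3 appears only as a local stand-in for pyodbc, since both follow the Python DB-API.

```python
import sqlite3  # stand-in used only by the local demo below


def fetch_rows(connect, query, *connect_args):
    """Run `query` through any DB-API 2.0 `connect` function; return (columns, rows)."""
    conn = connect(*connect_args)
    try:
        cur = conn.cursor()
        cur.execute(query)
        columns = [d[0] for d in cur.description]
        rows = [tuple(r) for r in cur.fetchall()]
    finally:
        conn.close()
    return columns, rows


def load_on_driver(spark, connection_string, query):
    """Query the database on the driver, then hand the rows to Spark."""
    import pyodbc  # imported here, so only the driver ever needs the module
    columns, rows = fetch_rows(pyodbc.connect, query, connection_string)
    return spark.createDataFrame(rows, columns)


def demo_with_sqlite():
    """Local check of fetch_rows that needs neither pyodbc nor Spark."""
    return fetch_rows(sqlite3.connect, "SELECT 1 AS id, 'a' AS name", ":memory:")


# Hypothetical usage in the driver script:
#   df = load_on_driver(spark, "DSN=my_dsn", "SELECT id, name FROM some_table")
```

As long as nothing from pyodbc is referenced inside a function shipped to the executors, the workers never try to import it.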

Where does spark look for text files?

I thought that loading text files is done only from workers / within the cluster (you just need to make sure all workers have access to the same path, either by having that text file available on all nodes, or by using some shared folder mapped to the same path)
e.g. spark-submit / spark-shell can be launched from anywhere, and connect to a spark master, and the machine where you launched spark-submit / spark-shell (which is also where our driver runs, unless you are in "cluster" deploy mode) has nothing to do with the cluster. Therefore any data loading should be done only from the workers, not on the driver machine, right? e.g. there should be no way that sc.textFile("file:///somePath") will cause spark to look for a file on the driver machine (again, the driver is external to the cluster, e.g. in "client" deploy mode / standalone mode), right?
Well, this is what I thought too...
Our cast
machine A: where the driver runs
machine B: where both spark master and one of the workers run
Act I - The Hope
When I start a spark-shell from machine B to spark master on B I get this:
scala> sc.master
res3: String = spark://machinB:7077
scala> sc.textFile("/tmp/data/myfile.csv").count()
res4: Long = 976
Act II - The Conflict
But when I start a spark-shell from machine A, pointing to spark master on B I get this:
scala> sc.master
res2: String = spark://machineB:7077
scala> sc.textFile("/tmp/data/myfile.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/data/myfile.csv
And indeed /tmp/data/myfile.csv does not exist on machine A, but machine A is not on the cluster, it's just where the driver runs
Act III - The Amazement
What’s even weirder is that if I make this file available on machine A, it doesn’t throw this error anymore. (Instead it creates a job, but no tasks, and just fails due to a timeout, which is another issue that deserves a separate question)
Is there something in the way Spark behaves that I'm missing? I thought that a Spark shell connected to a remote master has nothing to do with the machine you are running it on. So why does the error stop when I make that file available on machine A? Does it mean that sc.textFile also looks in the location where spark-shell or spark-submit was initiated (in my case, also where the driver runs)? This makes zero sense to me, but again, I'm open to learning new things.
Epilogue
tl;dr - sc.textFile("file:/somePath") running from a driver on machine A to a cluster on machines B,C,D... (driver not part of cluster)
It seems like it's looking for path file:/somePath also on the driver, is that true (or is it just me)? is that known? is that as designed?
I have a feeling that this is some weird network / VPN topology issue unique to my workplace network, but still this is what happens to me, and I'm utterly confused whether it is just me or a known behavior. (or I'm simply not getting how Spark works, which is always an option)
So the really short version of the answer is: if you reference "file://...", the file should be accessible on all nodes in your cluster, including the driver program. Some bits of the work happen on the driver (the "Input path does not exist" check in your Act II is one of them). Generally the way around this is to not use local files at all, and instead use something like S3, HDFS, or another network filesystem. There is also the sc.addFile method, which can be used to distribute a file from the driver to all of the other nodes (you then use SparkFiles.get to resolve the download location).
Spark can look for files either locally or on HDFS.
If you'd like to read a file using sc.textFile() and take advantage of its RDD format, the file should sit on HDFS. If you just want to read a file the normal way, do it as you usually would in the API you are using (Scala, Java, Python).
If you submit a local file with your driver, addFile() distributes the file to each node, and SparkFiles.get() returns the path of the local copy on that node.
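A PySpark sketch of the addFile() / SparkFiles.get() pattern follows. The helper count_lines_in_shipped_file is a hypothetical name, and its resolve parameter exists only so the function can be tried locally without a cluster. On a real cluster it falls back to SparkFiles.get.

```python
import os
import tempfile


def count_lines_in_shipped_file(file_name, resolve=None):
    """Open a file shipped with sc.addFile() by its bare name and count its lines.

    Runs on whichever node calls it; `resolve` maps the name to a local path.
    """
    if resolve is None:
        from pyspark import SparkFiles  # imported lazily, on the node doing the work
        resolve = SparkFiles.get
    with open(resolve(file_name)) as fh:
        return sum(1 for _ in fh)


def local_demo():
    """Exercise the helper without Spark, using a temp folder as the 'download' dir."""
    folder = tempfile.mkdtemp()
    with open(os.path.join(folder, "myfile.csv"), "w") as fh:
        fh.write("a\nb\nc\n")
    return count_lines_in_shipped_file(
        "myfile.csv", resolve=lambda name: os.path.join(folder, name))


# Hypothetical usage against a cluster:
#   sc.addFile("/tmp/data/myfile.csv")  # path as seen from the driver machine
#   counts = (sc.parallelize(range(4), 4)
#               .map(lambda _: count_lines_in_shipped_file("myfile.csv"))
#               .collect())
```

Each task reads its node's own local copy, so this suits small side files (lookup tables, models) rather than large datasets that should be split across executors.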
