I'm just getting started with Apache Spark. I'm running in cluster mode and I want to process a big file. I am using the textFile method from SparkContext, which reads a file from a local file system that has to be available on all nodes.
Because my file is really big, it is a pain to copy it onto each cluster node. My question is: is there any way to keep this file in a single location, like a shared folder?
Thanks a lot
You can keep the file in HDFS or on S3.
Then you can give that path to the textFile method itself.
For S3:
val data = sc.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
For HDFS:
val hdfsRDD = sc.textFile("hdfs://...")
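Alternatively, the S3 credentials can be set on the Hadoop configuration rather than embedded in the URL. A minimal sketch, with the keys and bucket name as placeholders:

// Set the s3n credentials once on the SparkContext's Hadoop configuration...
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "yourAccessKey")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "yourSecretKey")
// ...and then reference the bucket without credentials in the path.
val data = sc.textFile("s3n://your-bucket/path/")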
My problem is as follows:
A PySpark script that runs perfectly on a local machine and on an EC2 instance has been ported to EMR for scaling up. There's a config file that specifies relative locations for the outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup/
    feature_generator.py
    model_execution.py
    config.py
    utils.py
    logs/
    models/
    resources/
    outputs/
The code reads the config, generates features, and writes them to the path mentioned above. On EMR, this is getting saved into HDFS (spark.write.parquet writes into HDFS; on the other hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script reads the same path from the config, tries to read the parquet from that location, and fails.
How do I make sure that the outputs are created in the relative path that is specified in the config?
If that's not possible, how can I make sure that I read from HDFS in the subsequent steps?
I referred to these discussions on HDFS paths, but it's still not very clear to me. Can someone help me with this?
Thanks.
Short Answer to your question:
Writing with Pandas and writing with Spark are two different things. Pandas doesn't use Hadoop to process, read, or write; it writes into the standard EMR file system, which is not HDFS. Spark, on the other hand, uses distributed computing to spread the work across multiple machines at the same time, and it's built on top of Hadoop, so by default when you write using Spark it writes into HDFS.
While writing from EMR, you can choose to write into one of the following:
EMR local filesystem,
HDFS, or
EMRFS (which is backed by S3 buckets).
Refer to the AWS documentation.
If at the end of your job you are writing a Pandas dataframe and you want to write it into an HDFS location (maybe because your next-step Spark job reads from HDFS, or for some other reason), you might have to use PyArrow for that; refer to this.
If at the end of your job you are writing into HDFS using a Spark dataframe, then in the next step you can read it back by using a path like hdfs://<feature_outputs>.
Also, while you are saving data into EMR HDFS, keep in mind that the default EMR storage is volatile, i.e. all the data will be lost once the EMR cluster goes down (gets terminated). If you want to keep your data stored with EMR, you might have to attach an external EBS volume that can also be used with another EMR cluster, or use some other storage solution that AWS provides.
The best way, if you are writing data that needs to be persisted, is to write it into S3 instead of EMR storage.
The use case is to load a local file into HDFS. Below are two approaches to do the same; please suggest which one is more efficient.
Approach 1: Using the hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach 2: Using Spark
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql(s"insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed while loading the file.
table_nm is a Hive external table pointing to /user/table_nm/.
Assuming the local .parquet files are already built, using -put will be faster, as there is no overhead of starting a Spark application.
If there are many files, there is still simply less work to do via -put.
I have a small Spark cluster with one master and two workers. I have a Kafka streaming app which streams data from Kafka and writes to a directory in parquet format in append mode.
So far I am able to read from the Kafka stream and write it to a parquet file using the following key line:
val streamingQuery = mydf.writeStream
  .format("parquet")
  .option("path", "/root/Desktop/sampleDir/myParquet")
  .outputMode(OutputMode.Append)
  .option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint")
  .start()
I have checked on both of the workers. There are 3-4 snappy parquet files created, with file names prefixed like part-00006-XXX.snappy.parquet.
But when I try to read this parquet data using the following command:
val dfP = sqlContext.read.parquet("/root/Desktop/sampleDir/myParquet")
it shows file-not-found exceptions for some of the parquet part files. The strange thing is that those files are present on one of the worker nodes.
On checking the logs further, I observed that Spark is trying to get all the parquet files from only ONE worker node, and since not all the parquet files are present on a single worker, it hits the exception that those files were not found at the given parquet path.
Am I missing some critical step in the streaming query or while reading data?
NOTE: I don't have a Hadoop infrastructure. I want to use the file system only.
You need a shared file system.
Spark assumes the same file system is visible from all nodes (driver and workers).
If you are using the plain local file system, then each node sees its own file system, which is different from the file systems of the other nodes.
HDFS is one way of getting a common, shared file system; another would be to use a common NFS mount (i.e. mount the same remote file system from all nodes at the same path). Other shared file systems also exist.
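As an illustration, a minimal sketch assuming every node (driver and workers) mounts the same NFS export at /mnt/shared; the mount point and paths are placeholders:

// Both the data directory and the checkpoint directory live on the shared
// mount, so every node resolves the same files at the same path.
val streamingQuery = mydf.writeStream
  .format("parquet")
  .option("path", "file:///mnt/shared/myParquet")
  .option("checkpointLocation", "file:///mnt/shared/myCheckPoint")
  .outputMode(OutputMode.Append)
  .start()

// Reading back later also goes through the shared mount.
val dfP = spark.read.parquet("file:///mnt/shared/myParquet")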
Is it possible to create an RDD using data from the master or a worker? I know there is the option sc.textFile(), which sources the data from the local system (the driver); similarly, can we use something like "master:file://input.txt"? I am asking because I am accessing a remote cluster, my input data size is large, and I cannot log in to the remote cluster.
I am not looking for S3 or HDFS. Please suggest if there is any other option.
Data in an RDD is always controlled by the Workers, whether it is in memory or located in a data source. To retrieve the data from the Workers into the Driver you can call collect() on your RDD.
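For instance, a trivial sketch (rdd here stands for whatever RDD you have built):

// collect() pulls the entire RDD into the driver's memory,
// so it is only sensible for small results.
val localData = rdd.collect()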
You should put your file on HDFS or a filesystem that is available to all nodes.
The best way to do this is, as you stated, to use sc.textFile. To do that you need to make the file available on all nodes in the cluster. Spark provides an easy way to do this via the --files option of spark-submit. Simply pass the option followed by the path to the file that you need copied; see the sketch below.
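A minimal sketch of that workflow; the file name, application jar, and the exact path at which the copy lands on each node are assumptions and can depend on your deploy mode:

// Submit the job with the file shipped to every node:
//   spark-submit --files /local/path/input.txt --class MyApp myApp.jar
import org.apache.spark.SparkFiles

// SparkFiles.get returns the local path of the distributed copy on the node
// where it is called.
val inputPath = SparkFiles.get("input.txt")
val rdd = sc.textFile("file://" + inputPath)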
You can access the Hadoop file system by creating a Hadoop configuration.
import org.apache.spark.deploy.SparkHadoopUtil
import java.io.{File, FileInputStream, FileOutputStream, InputStream}
val hadoopConfig = SparkHadoopUtil.get.conf
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(fileName), hadoopConfig)
val fsPath = new org.apache.hadoop.fs.Path(fileName)
Once you get the path you can copy, delete, move, or perform any other operation, as sketched below.
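For example, a minimal sketch of a copy, a rename, and a delete using the fs handle from above; all paths are placeholders:

import org.apache.hadoop.fs.{FileUtil, Path}

// Copy within the same FileSystem (deleteSource = false, overwrite = true).
FileUtil.copy(fs, new Path("/user/data/input.txt"),
              fs, new Path("/user/archive/input.txt"),
              false, true, hadoopConfig)

// Move (rename) a file, then delete a directory (the boolean enables recursion).
fs.rename(new Path("/user/archive/input.txt"), new Path("/user/archive/input-old.txt"))
fs.delete(new Path("/user/tmp/old-output"), true)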
I have an input folder that contains many files. I would like to do a batch operation on them, like copying or moving them to a new path.
I would like to do this using Spark.
Please suggest how to proceed with this.
You can read them using val myfile = sc.textFile("file://file-path") if it is a local directory, and save them using myfile.saveAsTextFile("new-location"). It's also possible to save with compression (link to ScalaDoc); a sketch follows below.
What Spark will do is read all the files, batch them up, and save them to the new location (HDFS/local).
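A minimal sketch of that, assuming the input directory is visible at the same path on every node and using gzip as one possible compression codec; all paths are placeholders:

import org.apache.hadoop.io.compress.GzipCodec

// Read every file in the input directory as text and write the whole batch
// to the new location, gzip-compressed.
val myfile = sc.textFile("file:///data/input-dir")
myfile.saveAsTextFile("file:///data/output-dir", classOf[GzipCodec])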
Make sure you have the same directory available on each worker node of your Spark cluster.
In the above case you have to have the local files at that path on each worker node.
If you want to get rid of that requirement, you can use a distributed file system like the Hadoop file system (HDFS).
In this case you have to give a path like this:
hdfs://nodename-or-ip:port/path-to-directory