I have an external Hive table defined with a location in S3:
LOCATION 's3n://bucket/path/'
When writing to this table at the end of a PySpark job that aggregates a bunch of data, the write to Hive is extremely slow because only one executor/container is used for the write. When writing to an HDFS-backed table, the write happens in parallel and is significantly faster.
I've tried defining the table using the s3a path, but my job fails with some vague errors.
This is on Amazon EMR 5.0 (Hadoop 2.7) with PySpark 2.0, but I have experienced the same issue with previous versions of EMR/Spark.
Is there a configuration or alternative library that I can use to make this write more efficient?
I guess you're using Parquet. The DirectParquetOutputCommitter has been removed to avoid potential data loss. The change was actually made in April 2016.
This means the data you write to S3 is first saved in a _temporary folder and then "moved" to its final location. Unfortunately, on S3 "moving" means "copying and deleting", which is rather slow. To make it worse, this final "move" is done only by the driver.
Unless you want to fight to add that class back, you will have to write to local HDFS and then copy the data over (I do recommend this). In HDFS, "moving" is essentially "renaming", so it takes almost no time.
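A minimal sketch of the HDFS-then-copy approach, assuming the job runs on an EMR node where the s3-dist-cp command is available on the PATH. The function names and the example paths are hypothetical, and df stands for whatever DataFrame the job produces.

```python
import subprocess


def build_s3_dist_cp_cmd(hdfs_path, s3_path):
    """Build the s3-dist-cp invocation that copies hdfs_path to s3_path."""
    return ["s3-dist-cp", "--src", hdfs_path, "--dest", s3_path]


def write_via_hdfs(df, hdfs_path, s3_path):
    """Write a Spark DataFrame to HDFS in parallel, then copy it to S3."""
    # The commit-time rename is cheap on HDFS, so the driver is not a bottleneck.
    df.write.mode("overwrite").parquet(hdfs_path)
    # s3-dist-cp runs as a distributed job, so the copy to S3 is parallel too.
    subprocess.run(build_s3_dist_cp_cmd(hdfs_path, s3_path), check=True)


# Hypothetical paths, for illustration only.
cmd = build_s3_dist_cp_cmd("hdfs:///tmp/agg_output/", "s3://bucket/path/")
print(" ".join(cmd))
```

At the end of the job this would be called as write_via_hdfs(df, hdfs_path, s3_path). If the target table is partitioned, any new partitions still have to be registered with Hive afterwards.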
My problem is as below:
A PySpark script that runs perfectly on a local machine and on an EC2 instance has been ported to EMR for scaling up. There's a config file that specifies relative locations for outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup
feature_generator.py
model_execution.py
config.py
utils.py
logs/
models/
resources/
outputs/
The code reads the config, generates features, and writes them to the path mentioned above. On EMR, this gets saved to HDFS (spark.write.parquet writes to HDFS; on the other hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script reads the same path from the config, tries to read the Parquet data from that location, and fails.
How can I make sure that the outputs are created in the relative path that is specified in the code?
If that's not possible, how can I make sure that I read from HDFS in the subsequent steps?
I referred to existing discussions on HDFS paths, but the answer is still not very clear to me. Can someone help me with this?
Thanks.
Short answer to your question:
Writing with Pandas and writing with Spark are two different things. Pandas doesn't use Hadoop to process, read, or write; it writes to the local file system of the EMR node, which is not HDFS. Spark, on the other hand, distributes the work across multiple machines at the same time and goes through the Hadoop file system APIs, so by default when you write using Spark on EMR it writes to HDFS.
When writing from EMR, you can choose to write to one of:
EMR local filesystem,
HDFS, or
EMRFS (which is S3 buckets).
Refer to the AWS documentation.
If, at the end of your job, you are writing with a Pandas dataframe and you want to write to an HDFS location (maybe because your next Spark step reads from HDFS, or for some other reason), you might have to use PyArrow for that.
If, at the end of your job, you are writing to HDFS using a Spark dataframe, the next step can read it using a path like hdfs://<feature_outputs>.
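As a rough illustration of making the target file system explicit, the sketch below prefixes the config value from the question with a scheme. The helper name with_scheme is hypothetical, and it assumes the configured path is absolute.

```python
def with_scheme(path, scheme):
    """Prefix an absolute path with an explicit file-system scheme."""
    if "://" in path:
        return path  # already fully qualified, leave untouched
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    return "%s://%s" % (scheme, path)


feature_outputs = "/outputs/features/"  # value from the config in the question

hdfs_uri = with_scheme(feature_outputs, "hdfs")   # HDFS on the cluster
local_uri = with_scheme(feature_outputs, "file")  # local disk of a node

print(hdfs_uri)   # hdfs:///outputs/features/
print(local_uri)  # file:///outputs/features/
```

The Spark step would then read with spark.read.parquet(hdfs_uri). Note that file:// paths are local to each node, so they only suit single-node runs or data that exists on every node.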
Also, when saving data to EMR HDFS, keep in mind that default EMR storage is volatile: all the data is lost once the EMR cluster is terminated. If you want to keep your data stored in EMR, you might have to attach an external EBS volume that can also be used by other EMR clusters, or use some other storage solution that AWS provides.
If you need your data to be persisted, the best way is to write it to S3 instead of EMR.
I am trying to understand the pros and cons of an approach that I hear about frequently at my workplace.
When writing data to a Hive table (insertInto), Spark does the following:
Write to the staging folder
Move data to the Hive table using the output committer.
Now I see people complaining that the above two-step approach is time-consuming and hence resorting to:
1) Write files directly to HDFS
2) Refresh metastore for Hive
And I see people reporting a lot of improvement with this approach.
But somehow I am not yet convinced that it's the safe and right method. Isn't it a trade-off against atomicity (either a full table write or no write)?
What if the executor that's writing a file to HDFS crashes? I don't see a way to completely revert the half-done writes.
I also think Spark would have done this if it were the right way to do it, wouldn't it?
Are my questions valid? Do you see any good in the above approach? Please comment.
It is not 100% right, because in Hive v3 you can only access Hive data through the Hive driver, so as not to break the new transaction mechanism.
Even if you are using Hive 2, you should at least keep in mind that you will not be able to access the data directly once you upgrade.
I have a bunch of data (on S3) that I am copying to a local HDFS (on Amazon EMR). Right now I'm doing that using org.apache.hadoop.fs.FileUtil.copy, but it's not clear whether this distributes the file copy to the executors. There's certainly nothing showing up in the Spark History Server.
Hadoop DistCp seems like the thing (note I'm on S3, so it's actually supposed to be s3-dist-cp, which is built on top of DistCp), except that it's a command-line tool. I'm looking for a way to invoke this from a Scala script (aka, Java).
Any ideas / leads?
cloudcp is an example of using Spark to do the copy; the list of files is turned into an RDD, with each row being one copy. That design is optimised for upload from HDFS, as it tries to schedule the upload close to the files in HDFS.
For download, you want to
use listFiles(path, recursive) for maximum performance in listing an object store.
randomise the list of source files so that you don't get throttled by AWS
randomise the placement across the HDFS cluster so that the blocks end up scattered evenly round the cluster
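The randomisation steps above can be sketched without any Spark dependency. This is only an illustration, in Python for consistency with the rest of the thread: plan_copies and the object keys are hypothetical, and in a real job the shuffled list would be handed to Spark (for example via sc.parallelize) so that each task copies its share.

```python
import random


def plan_copies(source_paths, num_partitions, seed=None):
    """Shuffle the source list and deal it out into roughly equal partitions."""
    paths = list(source_paths)
    random.Random(seed).shuffle(paths)  # avoid walking one S3 prefix in order
    # Round-robin deal so each partition gets a mix of the shuffled paths.
    return [paths[i::num_partitions] for i in range(num_partitions)]


# Hypothetical object keys; in practice these come from listFiles(path, recursive).
sources = ["s3a://bucket/data/part-%05d" % i for i in range(10)]
plan = plan_copies(sources, num_partitions=3, seed=42)
for partition in plan:
    print(len(partition), "files")
```

Because the copy tasks run on executors spread across the cluster, the HDFS blocks they write should also end up spread across nodes rather than concentrated on one.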
I have an S3 location with the below directory structure with a Hive table created on top of it:
s3://<Mybucket>/<Table Name>/<day Partition>
Let's say I have a Spark program which writes data into above table location spanning multiple partitions using the below line of code:
Df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")
If another program such as "Hive SQL query" or "AWS Athena Query" started reading data from the table at the same time:
Do they consider temporary files being written?
Does Spark lock the data file while writing into the S3 location?
How can we handle such concurrency situations using Spark as an ETL tool?
No locks. Not implemented in S3 or HDFS.
The process of committing work in HDFS is not atomic; there's some renaming going on in job commit, which is fast but not instantaneous.
With S3, things are pathologically slow with the classic output committers, which assume rename is atomic and fast.
The Apache S3A committers avoid the renames and only make the output visible in job commit, which is fast but not atomic.
Amazon EMR now has its own S3 committer, but it makes files visible when each task commits, so it exposes readers to incomplete output for longer.
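For reference, here is a sketch of the Spark settings usually quoted for switching on the S3A "directory" staging committer. It assumes a Spark build that includes the spark-hadoop-cloud module and output paths that use the s3a:// scheme; the helper to_submit_args is hypothetical and only renders the settings as spark-submit arguments.

```python
# Settings taken from Spark's cloud-integration documentation; they only
# take effect if the spark-hadoop-cloud classes are on the classpath.
s3a_committer_conf = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_submit_args(conf):
    """Render a conf dict as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "%s=%s" % (key, conf[key])]
    return args


print(" ".join(to_submit_args(s3a_committer_conf)))
```

These settings do not apply to EMR's own s3:// connector, which has its own committer as noted above.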
Spark writes the output in a two-step process. First, it writes the data to a _temporary directory; then, once the write operation has completed successfully, it moves the files to the output directory.
Do they consider temporary files being written?
As files starting with _ are hidden files, you cannot read them from Hive or AWS Athena.
Does Spark lock the data file while writing into the S3 location?
Locking or any other concurrency mechanism is not required because of Spark's simple two-step write process.
How can we handle such concurrency situations using Spark as an ETL tool?
Again, by using the simple mechanism of writing to a temporary location.
One more thing to note: in your example above, after writing output to the output directory you need to add the partition to the Hive external table using the ALTER TABLE <tbl_name> ADD PARTITION (...) command or the MSCK REPAIR TABLE tbl_name command; otherwise the data won't be available in Hive.
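A small sketch of generating those two statements. The table name, partition value, and bucket are hypothetical, and the values are interpolated without escaping, so this is only suitable for trusted inputs.

```python
def add_partition_sql(table, column, value, location):
    """Build a Hive statement that registers one partition directory."""
    return (
        "ALTER TABLE {t} ADD IF NOT EXISTS PARTITION ({c}='{v}') "
        "LOCATION '{loc}'"
    ).format(t=table, c=column, v=value, loc=location)


def repair_table_sql(table):
    """Build the statement that discovers all partition directories at once."""
    return "MSCK REPAIR TABLE {t}".format(t=table)


# Hypothetical table name and partition value, for illustration only.
stmt = add_partition_sql(
    "orders", "orderdate", "2020-01-01",
    "s3://my-bucket/orders/orderdate=2020-01-01/",
)
print(stmt)
print(repair_table_sql("orders"))
```

Either statement can be run from Hive directly, or from the Spark job with spark.sql(stmt) when the session has Hive support enabled.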
Currently, I am trying to adopt big data to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS or the file system, so Hive as a data warehouse does not seem to be a must. However, I can still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the pros/benefits of introducing a Hive layer rather than directly loading the CSV files into a Spark DataFrame?
Thanks.
You can always look at and get a feel for the data using the tables.
Ad hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you need not specify the schema of the data separately.
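To make the schema point concrete, here is a minimal sketch contrasting the two read paths. The function names are hypothetical; spark is assumed to be an existing SparkSession (created with Hive support for the second function), and the schema is passed as a DDL string such as "id INT, amount DOUBLE".

```python
def load_from_csv(spark, path, schema_ddl):
    """Read raw CSV files; the caller has to supply the schema every time."""
    return spark.read.csv(path, schema=schema_ddl, header=True)


def load_from_hive(spark, table_name):
    """Read a Hive table; the schema comes from the metastore."""
    return spark.table(table_name)
```

With the Hive layer, every job that needs the data calls load_from_hive with just the table name, whereas each CSV reader has to carry (and keep in sync) its own copy of the schema.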