Is there a way to use Apache Hudi on AWS Glue?

I am trying to explore Apache Hudi for doing an incremental load, using S3 as the source and then saving the output to a different S3 location through an AWS Glue job.
Are there any blogs/articles that can help here as a starting point?

There is another possible approach (as per Robert's answer): include custom JARs in the Glue job. They will then be loaded into your Glue job and available as in any other Hadoop/Spark environment.
The steps required for this approach are as follows (at least they work for my PySpark jobs; please correct me if any information is incomplete or you run into trouble, and I will update my answer):
Note 1: The below is for batch writes; I did not test it for Hudi streaming.
Note 2: Glue job type: Spark, Glue version: 2.0, ETL language: Python
Get all the respective JARs required by Hudi and put them into S3:
hudi-spark-bundle_2.11
httpclient-4.5.9
spark-avro_2.11
When creating the Glue job (see Note 2), specify:
Dependent jars path = comma-delimited S3 paths for the JARs from step 1 (e.g. s3://your-bucket/some_prefix/hudi-spark-bundle...jar,s3://your-bucket/some_prefix/http...jar,s3://your-bucket/some_prefix/spark-avro....jar)
Create your script according to the Hudi documentation and enjoy! (A minimal sketch is shown after the notes below.)
Last note:
make sure to assign proper permissions to your Glue job
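For orientation, here is a minimal sketch of such a batch-write script, assuming the JARs above are attached to the job. The bucket names, table name and the record/precombine/partition field names are placeholders you would replace with your own; this is a sketch, not a drop-in job.

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Hudi recommends Kryo serialization; set it before the context is created.
conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)
spark = GlueContext(sc).spark_session

# Placeholder source data already sitting in S3.
df = spark.read.json("s3://your-bucket/input-prefix/")

hudi_options = {
    "hoodie.table.name": "my_hudi_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://your-bucket/output-prefix/my_hudi_table/"))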

Related

How to make sure that spark.write.parquet() writes the data frame to the file specified using a relative path, and not to HDFS, on EMR?

My problem is as below:
A PySpark script that runs perfectly on a local machine and on an EC2 instance has been ported to EMR for scaling up. There's a config file that specifies relative locations for the outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup
feature_generator.py
model_execution.py
config.py
utils.py
logs/
models/
resources/
outputs/
The code reads the config, generates features and writes them into the path mentioned above. On EMR, this is getting saved into HDFS (spark.write.parquet writes into HDFS; on the other hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script reads the same path mentioned in the config, tries to read the parquet from that location, and fails.
How do I make sure that the outputs are created in the relative path that is specified in the code?
If that's not possible, how can I make sure that I read it from HDFS in the subsequent steps?
I referred to related discussions on HDFS paths, but it's not very clear to me. Can someone help me with this?
Thanks.
Short answer to your question:
Writing with Pandas and writing with Spark are two different things. Pandas doesn't use Hadoop to process, read and write; it writes into the standard EMR local file system, which is not HDFS. Spark, on the other hand, uses distributed computing to spread the work across multiple machines, and it's built on top of Hadoop, so by default when you write using Spark it writes into HDFS.
While writing from EMR, you can choose to write into:
the EMR local filesystem,
HDFS, or
EMRFS (which is S3 buckets).
Refer to the AWS documentation.
If at the end of your job you are writing a Pandas dataframe and you want to write it into an HDFS location (maybe because your next-step Spark job reads from HDFS, or for some other reason), you might have to use PyArrow for that; refer to the PyArrow documentation.
If at the end of your job you are writing into HDFS using a Spark dataframe, the next step can read it with a fully qualified path like hdfs://<feature_outputs>.
Also, while saving data into EMR HDFS, keep in mind that the default EMR storage is volatile: all the data will be lost once the EMR cluster goes down, i.e. gets terminated. If you want to keep data stored with EMR, you might have to attach an external EBS volume that can also be used with another EMR cluster, or use some other storage solution that AWS provides.
The best approach, if your data needs to be persisted, is to write it into S3 instead of EMR.
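To make this concrete, here is a small PySpark sketch; the bucket and path names are placeholders. Writing with fully qualified URIs removes the ambiguity of relative paths on EMR.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://your-bucket/input/")  # placeholder input

# Explicit URIs make the destination unambiguous:
df.write.parquet("hdfs:///outputs/features/")           # cluster HDFS (lost when the cluster terminates)
df.write.parquet("s3://your-bucket/outputs/features/")  # EMRFS/S3 (persists after the cluster is gone)

# A later step reads back with the same fully qualified URI:
features = spark.read.parquet("s3://your-bucket/outputs/features/")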

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.;

In my Spark job, I tried to overwrite a table in each micro-batch of Structured Streaming:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;
I knew that in Spark 2.x the way to solve this issue was to add the following option:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
It works well in Spark 2.x. However, this option was removed in Spark 3.0.0. So how should we solve this issue in Spark 3.0.0?
Thanks!
It looks like you run your test data generation and your actual test in the same process - can you just replace these with createOrReplaceTempView to save them to Spark's in-memory catalog instead of into a Hive catalog?
Something like: batchDF.createOrReplaceTempView("mytable")
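A minimal PySpark sketch of that suggestion, assuming a foreachBatch sink; the rate source below is just a stand-in for whatever streaming source the real job uses.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder streaming source.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

def process_batch(batch_df, batch_id):
    # Register the micro-batch in Spark's in-memory catalog instead of calling
    # saveAsTable(), which fails when the managed table's location already exists.
    batch_df.createOrReplaceTempView("mytable")
    spark.sql("SELECT count(*) AS cnt FROM mytable").show()

query = stream_df.writeStream.foreachBatch(process_batch).start()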

Spark 2.3.3 outputing parquet to S3

A while back I had the problem that writing Parquet output directly to S3 isn't really feasible, and I needed a caching layer before finally copying the Parquet files to S3 (see this post).
I know that HADOOP-13786 should fix this problem, and it seems to be implemented in Hadoop 3.1.0 and later.
Now the question is how do I use it in Spark 2.3.3? As far as I understand it, Spark 2.3.3 comes with HDFS 2.8.5. I usually use flintrock to orchestrate my cluster on AWS. Is it just a matter of setting HDFS to 3.1.1 in the flintrock config, and then I get all the goodies? Or do I still, for example, have to set something in code like I did before? For example like this:
conf = SparkConf().setAppName(appname)\
    .setMaster(master)\
    .set('spark.executor.memory','13g')\
    .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version','2')\
    .set('fs.s3a.fast.upload','true')\
    .set('fs.s3a.fast.upload.buffer','disk')\
    .set('fs.s3a.buffer.dir','/tmp/s3a')
(I know this is the old code and probably no longer relevant)
You'll need Hadoop 3.1, and a build of Spark 2.4 which has this PR applied: https://github.com/apache/spark/pull/24970
Some downstream products with their own Spark builds do this (HDP-3.1), but it's not (yet) in the apache builds.
With that you then need to configure Parquet to use the new bridging committer (Parquet only allows subclasses of the Parquet committer), and choose which of the three S3A committers to use (long story). The Staging committer is the one I'd recommend as it's (a) based on the one Netflix uses and (b) the one I've tested the most.
There's no fundamental reason why the same PR can't be applied to Spark 2.3, just that nobody has tried.
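For reference, roughly what the committer wiring looks like in PySpark, per the Hadoop S3A committer documentation. This is a sketch that assumes a Spark build which bundles the cloud committer bindings (the PR above); appname and master are placeholders as in the snippet earlier.

from pyspark import SparkConf

appname = "my-app"   # placeholder
master = "local[*]"  # placeholder

conf = SparkConf().setAppName(appname) \
    .setMaster(master) \
    .set('spark.hadoop.fs.s3a.committer.name', 'directory') \
    .set('spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a',
         'org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory') \
    .set('spark.sql.sources.commitProtocolClass',
         'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol') \
    .set('spark.sql.parquet.output.committer.class',
         'org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter')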

Why is difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false',sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me when to use which of these. Is there a clear distinction between them?
Why is difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using format parameter.
Depending on your Spark version (1.6 vs 2.x), you may or may not need to load an external Spark package to have support for the csv format.
As of Spark 2.0 you no longer have to load spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain why you got confused, as you may have been using Spark 1.6.x and had not loaded the Spark package for csv support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1 when spark-csv Spark package was not part of Spark. It happened in Spark 2.0.
It is not clear to me when to use which of these. Is there a clear distinction between these?
There's none actually iff you use Spark 2.x.
If however you use Spark 1.6.x, spark-csv has to be loaded separately using --packages option (as described in Using with Spark shell):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use com.databricks.spark.csv format explicitly in Spark 2.x as it's recognized internally.
The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use the third-party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example provide the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with the spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...), or equivalently sqlContext.read.load(..., format="formatName").
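To illustrate the equivalence, a small PySpark sketch assuming Spark 2.x with the built-in csv source; the S3 path is the same placeholder as in the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "s3a://bucket-name/file_name"

# On Spark 2.x these three calls are equivalent ways to read the CSV:
df1 = spark.read.csv(path, header=True, sep=",")
df2 = spark.read.format("csv").option("header", "true").load(path)
df3 = spark.read.load(path, format="csv", header="true", sep=",")

# On Spark 1.6 you would instead pass format='com.databricks.spark.csv'
# after adding the spark-csv package with --packages.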

How can I set the number of partitions when using the Bigquery Connector in Apache Spark?

I am reading the documentation both for Google Cloud Dataproc and generally for Apache Spark, and am unable to figure out how to manually set the number of partitions when using the BigQuery connector.
The RDD is created using newAPIHadoopRDD, and my strong suspicion is that this can be set via the config which is passed to this function. But I can't actually figure out what the possible values for the config are. Neither the Spark documentation nor the Google documentation seems to specify or link to the Hadoop job configuration specification.
Is there a way to set the partitions upon the creation of this RDD or do I just need to repartition it as the next step?
You need to repartition in your Spark code, for example:
val REPARTITION_VALUE = 24
val rdd = sc.newAPIHadoopRDD(conf, classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
rdd.map(x => f(x))
  .repartition(REPARTITION_VALUE)
  .groupBy(_._1)
  .map(tup2 => f(tup2._1, tup2._2.toSeq))
  .repartition(REPARTITION_VALUE)
And so on ...
When you work with an RDD you need to handle the partitioning yourself.
Solution: the best option is to work with the Dataset or DataFrame API.
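For completeness, a PySpark sketch of the DataFrame route. This assumes the newer spark-bigquery connector is available to the job (rather than the Hadoop InputFormat used above); the table name and partition count are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table through the DataFrame API, then control partitioning explicitly.
df = spark.read.format("bigquery").option("table", "project.dataset.table").load()
df = df.repartition(24)  # explicit control over the number of partitions downstream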
