History Server running with different Spark versions

I have a use case where a Spark application runs on one Spark version and publishes its event data to S3, and the History Server is then started against the same S3 path but with a different Spark version. Will this cause any problems?

No, it will not cause any problem as long as you can read from the S3 bucket in that specific format. Spark versions are mostly compatible in this respect; as long as you can figure out how to work with the specific version, you're good.
EDIT:
Spark will write to the S3 bucket in whatever data format you specify. For example, on a PC, if you create a txt file, any computer can open it. Similarly on S3, once you've created a Parquet file, any Spark version can open it; just the API may differ.
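For illustration, a minimal sketch of the configuration involved (the bucket name and path are placeholders, not from the question). The application writes its event log to S3, and the History Server, even one on a different Spark version, reads from the same path; e.g. in conf/spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir s3a://my-bucket/spark-events
spark.history.fs.logDirectory s3a://my-bucket/spark-events
The History Server is then started with sbin/start-history-server.sh. The event log itself is plain JSON, which is why a newer History Server can generally replay logs written by an older application.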

Related

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is taking too long (about 30 minutes) to write the files. On investigating further, I found that the majority of the time is spent after all tasks have written the data to a temporary folder (which happens within 20s); the rest of the time is spent by the master node moving the files from the _temporary folder to the destination folder, renaming them, etc. (similar to: Spark: long delay between jobs)
Some other details on the job configuration, file format, etc. are below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text
Now, although the EMRFS output committer is enabled by default in the job, it is not taking effect, since we are using RDDs and the text file format, which it supports only from EMR 6.4.0 onward. One way I can think of to optimize the time taken for the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile. Is there a simpler solution to reduce the time taken by the job?
Is there a simpler solution to reduce the time taken by the job?
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems, even before worker failures result in invalid output.
Options:
- Upgrade. The committers were added for a reason.
- Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards (see the sketch below).
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR, as it is Amazon's own fork. You might be able to get an S3A connector Amazon ships to work, but you'd be on your own if it didn't.
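As a rough sketch of the second option (the paths and bucket name are placeholders, and rdd stands for the job's output RDD): write to HDFS, where the final rename is cheap, then copy the completed output up to S3, e.g. with EMR's s3-dist-cp tool:
# Write to the cluster filesystem; the commit-time rename is a cheap
# metadata operation on HDFS, unlike the copy-based "rename" on S3.
rdd.saveAsTextFile("hdfs:///tmp/job-output")
# Then upload the finished directory from the master node, e.g.:
#   s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/job-output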

java.lang.NumberFormatException: in Pyspark when writing to S3

I am trying to read a compressed log file from an S3 bucket using PySpark on an EC2 instance.
The EC2 instance has read permission on the S3 bucket, as I am able to download the file manually using the AWS CLI.
This is what my code looks like:
file_path= 's3a://<bucket_name>/<path_of_file>'
rdd1 = sc.textFile(file_path)
rdd1.take(3)
But I am getting the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.partitions.
: java.lang.NumberFormatException: For input string: "64M"
Can somebody help me out?
You are mixing versions of hadoop-common with an older version of hadoop-aws.
The S3A connector added support for using a unit when declaring the multipart block size in 2016, eight years ago, in https://issues.apache.org/jira/browse/HADOOP-13680.
hadoop-common JAR versions 2.8+ set it to "64M".
If the version of the S3A connector you are using can't cope with that, it means it is at least eight years old.
Please:
- Upgrade your hadoop-* JARs to a recent version, ideally 3.3.0+.
- Make sure they are all the same version, unless you enjoy seeing stack traces.
- Use the exact same aws-sdk-bundle JAR which Hadoop was built with, unless you want to see different stack traces.
This is not an opinion; these are instructions from the hadoop-aws maintenance team.
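As a quick sanity check, here is a sketch that prints the Hadoop version Spark is actually running against; the hadoop-aws JAR on the classpath must match it exactly. It uses PySpark's private _jvm gateway, so treat it as a debugging trick rather than a supported API:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hadoop-version-check").getOrCreate()
# Print the version of hadoop-common on Spark's classpath;
# hadoop-aws and aws-sdk-bundle must line up with this.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())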

How to use the new Hadoop Parquet magic committer with a custom S3 server from Spark

I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer that allows writing Parquet files to S3 consistently, I've set these values in conf/spark-defaults.conf:
spark.sql.sources.commitProtocolClass com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
When using this configuration I end up with the exception:
java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
My question is twofold: first, do I understand correctly that Hadoop 3.1.1 allows writing Parquet files to S3 consistently?
Second, if I did understand correctly, how do I use the new committer properly from Spark?
Edit:
OK, I have two server instances, one being a bit old now. I've attempted to use the latest version of MinIO with these parameters:
sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")
I'm able to write so far without trouble.
However, my Swift server, which is a bit older, with this config:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
does not seem to support the partitioned committer properly.
Regarding "Hadoop S3guard":
It is not possible currently, Hadoop S3guard that keep metadata of the S3 files must be enable in Hadoop. The S3guard though rely on DynamoDB a proprietary Amazon service.
There's no alternative now like a sqlite file or other DB system to store the metadata.
So if you're using S3 with minio or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how works S3guard
Kiwy: that's my code; I can help you with this. Some of the classes haven't made it into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).
You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.
All of the new committer configuration documentation I've read to date is missing one fundamental fact:
Spark 2.x.x does not have the support classes needed to make the new S3A committers function.
They promise those cloud integration libs will be bundled with Spark 3.0.0, but for now you have to add the libraries yourself.
Under the cloud integration Maven repos there are multiple distributions supporting the committers; I found one working with the directory committer, but not the magic one.
In general, the directory committer is recommended over magic, as it has been well tested and tried. It requires a shared filesystem (the magic committer does not require one, but needs S3Guard), such as HDFS or NFS (we use AWS EFS), to coordinate Spark worker writes to S3.
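For reference, a sketch of what the directory committer wiring can look like in spark-defaults.conf. The class names below come from Spark's own spark-hadoop-cloud module (an assumption on my part; the Hortonworks cloud-integration builds use different package names, as seen in the question above):
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name directory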

Integration of CSV files with Flume vs Spark

I have a project to integrate CSV files from partners' servers into our Hadoop cluster.
Looking into this, I found that both Flume and Spark can do it.
I know that Spark is preferred when you need to perform data transformations.
My question is: what's the difference between Flume and Spark in integration logic?
Is there a performance difference between them when importing CSV files?
Flume is a constantly running process that watches paths or executes functions on files. It is more comparable to Logstash or Fluentd, because it is config-file driven rather than programmed; you deploy and tune it.
Preferably, you would parse said CSV files while you are reading them, then convert them to a more self-describing format such as Avro, then put them into HDFS. See the Morphlines Flume processors.
With Spark, on the other hand, you'd have to write all of that code yourself, end to end. While Spark Streaming can do the same thing, you generally would not run it the same way as Flume; rather, you run it within YARN or another cluster scheduler, where you have no control over which server it runs on, because at the end of the day you should only care whether there are resource constraints.
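For a sense of what that Spark code looks like, here is a minimal sketch of the read-CSV, convert-to-Avro, land-in-HDFS flow described above (the paths are placeholders, and the spark-avro package, org.apache.spark:spark-avro_2.12, is assumed to be on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-avro").getOrCreate()

# Parse the CSV while reading it, letting Spark infer a schema.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///landing/partner_csv/"))

# Convert to the self-describing Avro format and land it in HDFS.
df.write.format("avro").save("hdfs:///warehouse/partner_avro/")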
Other alternatives still exist, such as Apache NiFi or StreamSets, which allow more visual pipeline building rather than writing code.

How to export the output of apache spark program into a csv or text file when using Spark on Amazon EMR

I would like to know how we can print the output of running an SVM algorithm to a CSV file. I am hosting my Spark cluster on AWS EMR, so any files I access are to be saved to and accessed from S3 only. When I use the saveAsTextFile command and specify an AWS path, I don't see the output file(s) being stored in S3. Any suggestions in this regard?
You can use Spark's saveAsTextFile action to write the results to a file.
An example is available here.
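For illustration, a sketch of the end-to-end flow (the bucket name and input path are placeholders, and the MLlib SVMWithSGD setup is an assumption, not taken from the question):
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="svm-to-s3")

# Train on LibSVM-formatted data read from S3.
data = MLUtils.loadLibSVMFile(sc, "s3a://my-bucket/input/sample_libsvm_data.txt")
model = SVMWithSGD.train(data, iterations=100)

# Emit one "label,prediction" line per record and save to S3.
results = data.map(lambda p: "{},{}".format(p.label, model.predict(p.features)))
results.saveAsTextFile("s3a://my-bucket/svm-output/")
Note that saveAsTextFile writes a directory of part-* files rather than a single CSV file, which is worth checking before concluding that the output is missing.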
