Data Pipeline failing for EMR Activity - apache-spark

I am trying to run a Spark step on AWS Data Pipeline and I am getting the following exception:
amazonaws.datapipeline.taskrunner.TaskExecutionException: Failed to complete EMR transform.
    at amazonaws.datapipeline.activity.EmrActivity.runActivity(EmrActivity.java:67)
    at amazonaws.datapipeline.objects.AbstractActivity.run(AbstractActivity.java:16)
    at amazonaws.datapipeline.taskrunner.TaskPoller.executeRemoteRunner(TaskPoller.java:136)
    at amazonaws.datapipeline.taskrunner.TaskPoller.executeTask(TaskPoller.java:105)
    at amazonaws.datapipeline.taskrunner.TaskPoller$1.run(TaskPoller.java:81)
    at private.com.amazonaws.services.datapipeline.poller.PollWorker.executeWork(PollWorker.java:76)
    at private.com.amazonaws.services.datapipeline.poller.PollWorker.run(PollWorker.java:53)
    at java.lang.Thread.run(Thread.java:748)
Caused by: amazonaws.datapipeline.taskrunner.TaskExecutionException: EMR job '#DefaultEmrActivity1_2017-11-20T12:13:08_Attempt=1' with jobFlowId 'j-2E7PU1OK3GIJI' is failed with status 'FAILED' and reason 'Cluster ready after last step completed.'. Step 'df-0693981356F3KEDFQ6GG_#DefaultEmrActivity1_2017-11-20T12:13:08_Attempt=1' is in status 'FAILED' with reason 'null'
    at amazonaws.datapipeline.cluster.EmrUtil.runSteps(EmrUtil.java:286)
    at amazonaws.datapipeline.activity.EmrActivity.runActivity(EmrActivity.java:63)
    ... 7 more
The cluster itself is getting spun up correctly.
Here is a screenshot of the pipeline:
I think there is some issue with the 'step' field in the activity. Any input would be helpful.

The issue was twofold:
1) The step should have been comma-separated, something like:
command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.amazon.Main
Link: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html
2) EmrActivity does not support staging, so we cannot use ${INPUT1_STAGING_DIR} in the step instruction. For now, I have replaced it with hardcoded S3 URLs.
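For illustration, a step value combining both fixes might look like the line below; the application jar location and the input/output prefixes are hypothetical placeholders, not values from the original pipeline:
command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.amazon.Main,s3://my-bucket/jars/app.jar,s3://my-bucket/input/,s3://my-bucket/output/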

Related

Error writing data to BigQuery using Databricks PySpark

I run a daily job that writes data to BigQuery using Databricks PySpark. A recent configuration update for Databricks (https://docs.databricks.com/data/data-sources/google/bigquery.html) caused the job to fail. I followed all the steps in the docs. Reading data works again, but writing throws the following error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration directly in the code (as advised for similar Spark errors), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name
(import_df.write.format('bigquery')
.mode('overwrite')
.option('project', 'xxxxx-test-project')
.option('parentProject', 'xxxxx-test-project')
.option('temporaryGcsBucket', 'xxxxx-testing-bucket')
.option('table', upload_table)
.save()
)
You need to install the GCS connector on your cluster first.
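The missing class, com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS, lives in that connector JAR (Maven artifact com.google.cloud.bigdataoss:gcs-connector), so setting Hadoop properties alone cannot fix a ClassNotFoundException. Once the library is installed, a minimal sketch of the related Hadoop settings, reusing the placeholder key file from the question (whether these settings are still needed depends on the Databricks runtime):
hconf = spark._jsc.hadoopConfiguration()
# FileSystem and AbstractFileSystem implementations provided by the GCS connector
hconf.set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
hconf.set('fs.AbstractFileSystem.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS')
# Service-account authentication, with the same placeholder key file path as above
hconf.set('fs.gs.auth.service.account.enable', 'true')
hconf.set('google.cloud.auth.service.account.json.keyfile', '<path-to-key.json>')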

Error when brace expansion includes more than ~25 files in an Amazon S3 read from Spark

I have just upgraded from Spark 2.4 to Spark 3.
The following code ran fine in Spark 2.4:
df = spark.read.parquet('s3a://bucket/path/{'+
'file1,'+
'file2,'+
'file3,'+
'file4,'+
'file5,'+
'file6,'+
'file7,'+
'file8,'+
'file9,'+
'file10,'+
'file11,'+
'file12,'+
'file13,'+
'file14,'+
'file15,'+
'file16,'+
'file17,'+
'file18,'+
'file19,'+
'file20,'+
'file21,'+
'file22,'+
'file23,'+
'file24,'+
'file25'+
'}')
but in Spark 3 I get this error:
Py4JJavaError: An error occurred while calling o944.parquet.
: org.apache.hadoop.fs.s3a.AWSS3IOException: getFileStatus on s3a://
...
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: aaa), S3 Extended Request ID:
If I reduce the number of files to fewer than about 24, the query completes successfully in Spark 3.
I can't find any reference to a limit on the number of files in a brace expansion like this for S3. What might be going wrong? How can it be fixed?
Why not just give Spark the entire directory to process and let it scan the files?
df = spark.read.parquet('s3a://bucket/path/')
There is a 1024-character limit on an AWS query. Somehow, this was not a problem in Spark 2.
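If listing the whole directory is not an option, another workaround is to drop the brace expansion and pass each file as its own argument, since DataFrameReader.parquet accepts multiple paths. A sketch using the anonymized file names from the question:
base = 's3a://bucket/path/'
# hypothetical/anonymized names standing in for the real files
files = ['file1', 'file2', 'file3', 'file25']
df = spark.read.parquet(*[base + f for f in files])  # one path per argument, no brace expansion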

ORA-01555: snapshot too old: rollback segment number with name “” too small - SonarQube

I am getting an error while publishing results on SonarQube.
Error querying database. Cause: org.apache.ibatis.executor.result.ResultMapException: Error attempting to get column 'RAWLINEHASHES' from result set. Cause: java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 2 with name "_SYSSMU2_111974964$" too small
The pipeline executed for 2 hours 30 minutes.
Can you please help?
The error that you are getting, ORA-01555, is an Oracle error message. Your pipeline is executing something against an Oracle database which, after running for a long time, raises this error.
For ways to avoid this error see: https://blog.enmotech.com/2018/09/10/ora-01555-snapshot-old-error-ways-to-avoid-ora-01555-snapshot-too-old-error/

Kafka Connect error while starting with AvroConverter

I'm trying to sink data from one topic to Cassandra. That topic has a schema in Schema Registry for Avro records created by a Kafka Streams application.
This is my connect-standalone.properties:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=/usr/share/java
This is my cassandra-sink.properties:
connector.class=io.confluent.connect.cassandra.CassandraSinkConnector
cassandra.contact.points=ip.to.endpoint
cassandra.port=9042
cassandra.keyspace=my_keyspace
cassandra.write.mode=Update
tasks.max=1
topics=avro-sink
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.avro.AvroConverter
name=cassandra-sink-connector
internal.key.converter=org.apache.kafka.connect.storage.StringConverter
internal.value.converter=org.apache.kafka.connect.avro.AvroConverter
transforms=createKey
transforms.createKey.fields=id,timestamp
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
And this is the error:
[2019-03-15 16:32:05,238] ERROR Failed to create job for /etc/kafka/cassandra-sink.properties (org.apache.kafka.connect.cli.ConnectStandalone:108)
[2019-03-15 16:32:05,238] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:119)
java.util.concurrent.ExecutionException: org.apache.kafka.connect.runtime.rest.errors.BadRequestException: Connector configuration is invalid and contains the following 1 error(s):
Invalid value org.apache.kafka.connect.avro.AvroConverter for configuration value.converter: Class org.apache.kafka.connect.avro.AvroConverter could not be found.
You can also find the above list of errors at the endpoint `/{connectorType}/config/validate`
at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:79)
at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:66)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:116)
Caused by: org.apache.kafka.connect.runtime.rest.errors.BadRequestException: Connector configuration is invalid and contains the following 1 error(s):
Invalid value org.apache.kafka.connect.avro.AvroConverter for configuration value.converter: Class org.apache.kafka.connect.avro.AvroConverter could not be found.
You can also find the above list of errors at the endpoint `/{connectorType}/config/validate`
at org.apache.kafka.connect.runtime.AbstractHerder.maybeAddConfigErrors(AbstractHerder.java:423)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:188)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:113)
The broker is the one that can be found in https://github.com/confluentinc/cp-docker-images.
Any ideas? Thanks!
I solved it by removing all the converter properties, since the defaults on the Confluent Docker container are already Avro for the value and String for the key.
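As a note for anyone who does want to set the Avro converter explicitly: the class name in the failing config uses the wrong package. Confluent's converter is io.confluent.connect.avro.AvroConverter (it ships with the Confluent Platform, not with plain Apache Kafka, so it also has to be on the worker's plugin path). A sketch of the two properties that would change, reusing the Schema Registry URL already in the config:
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081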

How to avoid spark-submit cache

A spark-submit job is deployed on CDH, and something weird is happening. It always complains about a query (XXX below), but this query is not in the current application; it is an OLD query that was used before and has since been deleted. It looks like there is a cache somewhere.
The code is simple: var extract = sqlContext.sql(".....")
How can I fix it? Thanks.
16/11/13 22:12:29 INFO DAGScheduler: Job 1 finished: aggregate at InferSchema.scala:41, took 3.032230 s
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'XXX' (string and boolean).;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
Thanks.
You may need to delete the old application jar and rebuild it before submitting again. spark-submit itself does not cache queries, so a stale jar that still contains the old SQL is the most likely cause.
