Spark 1.6 kafka streaming on dataproc py4j error - apache-spark

I get the following error:
Py4JError(u'An error occurred while calling o73.createDirectStreamWithoutMessageHandler. Trace:\npy4j.Py4JException: Method createDirectStreamWithoutMessageHandler([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist\n\tat py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)\n\tat py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)\n\tat py4j.Gateway.invoke(Gateway.java:252)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:209)\n\tat java.lang.Thread.run(Thread.java:745)\n\n',)
I am using spark-streaming-kafka-assembly_2.10-1.6.0.jar (which is present in the /usr/lib/hadoop/lib/ folder on all my nodes + master)
(EDIT)
The actual error was: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.util.Apps.crossPlatformify(Ljava/lang/String;)Ljava/lang/String;
This was due to a wrong Hadoop version. Spark therefore has to be compiled against the correct Hadoop version:
mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 -DskipTests clean package
This will result in a jar in the external/kafka-assembly/target folder.
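For context, the PySpark call that ends up invoking createDirectStreamWithoutMessageHandler over py4j is KafkaUtils.createDirectStream; a minimal sketch (the broker address and topic below are placeholders, not from the original post):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="DirectKafkaExample")
ssc = StreamingContext(sc, 2)  # 2-second batches

# When no messageHandler is given, PySpark calls the JVM helper
# createDirectStreamWithoutMessageHandler provided by the assembly jar.
stream = KafkaUtils.createDirectStream(
    ssc,
    ["test"],                                  # topics (placeholder)
    {"metadata.broker.list": "broker:9092"})   # Kafka params (placeholder)

stream.pprint()
ssc.start()
ssc.awaitTermination()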

Using Dataproc image version 1, I've successfully run the PySpark streaming / Kafka wordcount example.
In each of these examples, "ad-kafka-inst" is my test Kafka instance with a 'test' topic.
Using a cluster with no initialization actions:
$ gcloud dataproc jobs submit pyspark --cluster ad-kafka2 --properties spark.jars.packages=org.apache.spark:spark-streaming-kafka_2.10:1.6.0 ./kafka_wordcount.py ad-kafka-inst:2181 test
Using initialization actions with a full kafka assembly:
Download / unpack spark-1.6.0.tgz
Build with:
$ mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 package
Upload spark-streaming-kafka-assembly_2.10-1.6.0.jar to a new GCS bucket (MYBUCKET for example).
Create the following initialization action in the same GCS bucket (e.g., gs://MYBUCKET/install_spark_kafka.sh):
#!/bin/bash
gsutil cp gs://MYBUCKET/spark-streaming-kafka-assembly_2.10-1.6.0.jar /usr/lib/hadoop/lib/
chmod 755 /usr/lib/hadoop/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar
Start a cluster with the above initialization action:
$ gcloud dataproc clusters create ad-kafka-init --initialization-actions gs://MYBUCKET/install_spark_kafka.sh
Start the streaming word count:
$ gcloud dataproc jobs submit pyspark --cluster ad-kafka-init ./kafka_wordcount.py ad-kafka-inst:2181 test
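For reference, kafka_wordcount.py is essentially the streaming example that ships with Spark 1.6 (examples/src/main/python/streaming/kafka_wordcount.py); a condensed sketch, with the ZooKeeper quorum and topic taken from the command line as in the submissions above:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# usage: kafka_wordcount.py <zk_quorum> <topic>, e.g. ad-kafka-inst:2181 test
zk_quorum, topic = sys.argv[1:]

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)

lines = KafkaUtils.createStream(ssc, zk_quorum, "spark-streaming-consumer", {topic: 1}) \
    .map(lambda kv: kv[1])
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()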

Related

How do I add bigquery property to a dataproc (jobs submit spark) through gcp shell?

I am trying to load Google patent data using BigQuery with the following code (Python):
df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
    .load()
Then I get the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause of this error is that the BigQuery connector is missing from my cluster (I checked its configuration). I tried to add it through the gcloud shell:
$ gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
but got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the BigQuery connector when using gcloud dataproc jobs submit spark, ideally on Dataproc? I don't know what to pass as --class.
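For what it's worth, the --class error above most likely appears because --jar designates the main application jar to execute (gcloud looks for a Main-Class in its manifest), while dependency jars belong in --jars or in Spark properties. Below is a minimal sketch of attaching the connector from the PySpark job itself via spark.jars.packages; the Maven coordinate, version, and table name are assumptions, not taken from the original post:

from pyspark.sql import SparkSession

# Hypothetical connector coordinates: pick the artifact and version matching
# your cluster's Scala version from the spark-bigquery-connector releases.
spark = (SparkSession.builder
         .appName("patents-bigquery")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2")
         .getOrCreate())

df = (spark.read
      .format("bigquery")
      .option("table", "bigquery-public-data.patents.publications")  # placeholder table
      .load())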

gcloud spark submit: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;

I am trying to use gcloud to submit my Spark job from Airflow.
This is my gcloud command:
gcloud dataproc jobs submit spark --cluster=xxx --region=us-central1 --class=com.xxx --jars=gs://xxx/xxx/xxx.jar -- xxx -- xxx -- xxx -- gs://xxx/xxx/xxx
I am getting this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;
Is anything wrong with my command?
This error can be resolved by disabling the flat glob algorithm in the GCS connector. Set these Hadoop properties during cluster creation: core:fs.gs.glob.flatlist.enable=false and core:fs.gs.glob.concurrent.enable=false. Additionally, upgrade the GCS connector to the latest version with --metadata GCS_CONNECTOR_VERSION=2.2.6.
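If recreating the cluster is not an option, the same connector flags may also work when set per application through Spark's spark.hadoop.* property pass-through; a minimal PySpark sketch (the property names are the ones from the answer above, everything else is illustrative):

from pyspark.sql import SparkSession

# Illustrative only: the GCS connector flags from the answer above, applied
# per job via Spark's spark.hadoop.* pass-through into the Hadoop configuration.
spark = (SparkSession.builder
         .appName("gcs-glob-workaround")
         .config("spark.hadoop.fs.gs.glob.flatlist.enable", "false")
         .config("spark.hadoop.fs.gs.glob.concurrent.enable", "false")
         .getOrCreate())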

spark gcp dataproc dependency error with guava

Our project uses Gradle and Scala to build a Spark app. I've added the GCP KMS library, and now when the app runs on Dataproc it fails with a missing Guava method:
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
I followed the recommendations on the following guide to shade the google libraries:
https://cloud.google.com/blog/products/data-analytics/managing-java-dependencies-apache-spark-applications-cloud-dataproc
My shadowJar definition in the Gradle build:
shadowJar {
    zip64 true
    // relocate the Google libraries so they don't clash with the older versions provided by Hadoop/Spark
    relocate 'com.google', 'shadow.com.google'
    relocate 'com.google.protobuf', 'shadow.com.google.protobuf'
    relocate 'google.cloud', 'shadow.google.cloud'
    exclude 'META-INF/**'
    exclude "LICENSE*"
    // merge META-INF/services descriptors from all dependencies instead of keeping only one
    mergeServiceFiles()
    archiveFileName = "myjar"
}
Running jar tf on the compiled fat jar shows the Guava classes, including Preconditions, relocated under shadow.
But it still fails when running the Dataproc Spark submit, and at runtime it still seems to pick up the old versions from Hadoop. Below is the stack trace, starting from my KmsSymmetric class, which uses GCP KMS decrypt:
Exception in thread "main" java.lang.NoSuchMethodError: shadow.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;CLjava/lang/Object;)V
at io.grpc.Metadata$Key.validateName(Metadata.java:629)
at io.grpc.Metadata$Key.<init>(Metadata.java:637)
at io.grpc.Metadata$Key.<init>(Metadata.java:567)
at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:742)
at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:737)
at io.grpc.Metadata$Key.of(Metadata.java:593)
at io.grpc.Metadata$Key.of(Metadata.java:589)
at shadow.com.google.api.gax.grpc.GrpcHeaderInterceptor.<init>(GrpcHeaderInterceptor.java:60)
at shadow.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:221)
at shadow.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:194)
at shadow.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:186)
at shadow.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:155)
at shadow.com.google.cloud.kms.v1.stub.GrpcKeyManagementServiceStub.create(GrpcKeyManagementServiceStub.java:370)
at shadow.com.google.cloud.kms.v1.stub.KeyManagementServiceStubSettings.createStub(KeyManagementServiceStubSettings.java:333)
at shadow.com.google.cloud.kms.v1.KeyManagementServiceClient.<init>(KeyManagementServiceClient.java:155)
at shadow.com.google.cloud.kms.v1.KeyManagementServiceClient.create(KeyManagementServiceClient.java:136)
at shadow.com.google.cloud.kms.v1.KeyManagementServiceClient.create(KeyManagementServiceClient.java:127)
at mycompany.my.class.path.KmsSymmetric.decrypt(KmsSymmetric.scala:31)
My dataproc submit is:
gcloud dataproc jobs submit spark \
--cluster=${CLUSTER_NAME} \
--project ${PROJECT_ID} \
--region=${REGION} \
--jars=gs://${APP_BUCKET}/${JAR} \
--class=${CLASS} \
--app args --arg1 val1 etc
I'm using dataproc image version 1.4
What am I missing?

spark 3.x on HDP 3.1 in headless mode with hive - hive tables not found

How can I configure Spark 3.x on HDP 3.1 to interact with Hive, using the headless (https://spark.apache.org/docs/latest/hadoop-provided.html) version of Spark?
First, I have downloaded and unzipped the headless spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note the version and use it below to replace 3.1.x.x-xxx
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
+---------+
|namespace|
+---------+
| default|
+---------+
spark.conf.get("spark.sql.catalogImplementation")
res2: String = in-memory // I want to see "hive" here - how? How do I get the Hive jars onto the classpath?
NOTE
This is an updated version of "How can I run spark in headless mode in my custom version on HDP?" for Spark 3.x on HDP 3.1, and of "custom spark does not find hive databases when running on yarn".
Furthermore, I am aware of the problems with ACID Hive tables in Spark. For now, I simply want to be able to see the existing databases.
edit
We must get the hive jars onto the class path. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e., the line export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}" had no effect (the same failure occurs if it is not set at all).
As noted above and in custom spark does not find hive databases when running on yarn, the Hive JARs are needed. They are not supplied in the headless version.
I was unable to retrofit them.
Solution: instead of fighting this, simply use the Spark build pre-built for Hadoop 3.2 (it works on HDP 3.1).
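A quick sanity check that Hive support is actually available, written in PySpark for brevity (the spark-shell equivalents behave the same; nothing below is taken from the original post):

from pyspark.sql import SparkSession

# The pre-built Hadoop 3.2 distribution bundles the Hive support classes, so
# enableHiveSupport() succeeds and the catalog switches from in-memory to hive.
spark = (SparkSession.builder
         .enableHiveSupport()
         .getOrCreate())

print(spark.conf.get("spark.sql.catalogImplementation"))  # expect "hive"
spark.sql("show databases").show()                        # existing Hive databases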

Why Zeppelin notebook is not able to connect to S3

I have installed Zeppelin on my AWS EC2 machine to connect to my Spark cluster.
Spark Version:
Standalone: spark-1.2.1-bin-hadoop1.tgz
I am able to connect to the Spark cluster, but I get the following error when trying to access a file in S3 in my use case.
Code:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
val file = "s3n://<bucket>/<key>"
val data = sc.textFile(file)
data.count
file: String = s3n://<bucket>/<key>
data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
I have built Zeppelin with the following command:
mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests
When I try to build with the Hadoop profile "-Phadoop-1.0.4", it gives a warning that the profile doesn't exist.
I have also tried -Phadoop-1, which the Spark build documentation lists as the profile for Hadoop 1.x to 2.1.x, but got the same error.
Please let me know what I am missing here.
The following installation worked for me (I also spent many days on this problem):
Spark 1.3.1 pre-built for Hadoop 2.3, set up on an EC2 cluster
git clone https://github.com/apache/incubator-zeppelin.git (date: 25.07.2015)
Installed Zeppelin via the following command (per the instructions at https://github.com/apache/incubator-zeppelin):
mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests
Changed the port to 8082 via conf/zeppelin-site.xml (Spark uses port 8080)
After these installation steps my notebook worked with S3 files:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
I think the S3 problem is not completely resolved in Zeppelin version 0.5.0, so cloning the current git repo did it for me.
Important information: the job only worked for me with the Zeppelin Spark interpreter setting master=local[*] (instead of spark://master:7777).
For me it worked in two steps:
1. Creating the sqlContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. Reading S3 files like this:
val performanceFactor = sqlContext
    .read
    .parquet("s3n://<accessKey>:<secretKey>@mybucket/myfile/")
where you need to supply the access key and secret key.
In step 2 I am using the s3n protocol with the access and secret keys in the path itself.
