stop hive's RetryingHMSHandler logging to databricks cluster

stop hive's RetryingHMSHandler logging to databricks cluster - apache-spark

I'm using azure databricks 5.5 LTS with spark 2.4.3 and scala 2.11. Almost every request going to the databricks cluster is coming up with the following error log
ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named global_temp)
at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:487)
at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:498)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
While this isn't affecting the end-result of what we're trying to do, our logs are constantly getting filled with this and isn't very pleasant to go through. I've tried turning it off by setting the following property to the driver and executor
log4j.level.org.apache.hadoop.hive.metastore.RetryingHMSHandler=OFF
only to, later on, realize the class RetryingHMSHandler actually uses slf4j logger, is there an elegant way to overcome this?

Maybe late, but faced the same issue with Databricks cluster 9.1 LTS (Apache Spark 3.1.2, Scala 2.12). Solved by using a init script that added the following two properties
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL, publicFile
log4j.additivity.org.apache.hadoop.hive.metastore.RetryingHMSHandler=false
to driver's log4j.properties.
My goal was to remove all verbose logs from the "log4j-active.log" file that can be downloaded from a job UI. By following https://learn.microsoft.com/en-us/azure/databricks/kb/clusters/overwrite-log4j-logs, I decided to add/overwrite some property values within driver's log4j.properties (first I had a look at its content, of course).
Added that two properties, I was able to remove also RetryingHMSHandler (the only third-party log call that was still surviving)
Hope it helps ;)

Related

Reading CSV file with Spark runs sometimes forever

i'm using Spark 2.4.8 with the gcs-connector from com.google.cloud.bigdataoss in version hadoop2-2.1.8. For development i'm using a Compute Engine VM with my IDE. I try to consume some CSV files from a GCS bucket natively with the Spark .csv(...).load(...) functionality. Some files are loaded successfully, but some are not. Then in the Spark UI i can see that the load job runs forever until a timeout fires.
But the weird thing is, that when i run the same application packaged to a Fat-JAR in Dataproc cluster, all the same files can be consumed successfully.
What i am doing wrong?

#JanOels, As you have mentioned in the comment, using gcs-connector in version hadoop2-2.2.8 will resolve this issue and the latest version of hadoop2 is hadoop2-2.2.10.
For more information about all the versions of hadoop2 to use gcs-connector from com.google.cloud.bigdataoss this document can be referred.
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm working through getting things setup so I can run these tutorials in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is version mismatches (either for the Spark, or Scala versions in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
What else should I check to confirm that this is setup correctly?

Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
Solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init()
findspark.init(spark_home='/path/to/desired/home')
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip / mambaforge this will also deploy a second SPARK_HOME - this can create dependency / library confusion.
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that just because you're running a script from one home that you're running spark IN that home.

Flink-Cassandra connector throws exception (flink-connector-cassandra_2.11-1.10.0)

I am trying to upgrade flink 1.7.2 to flink 1.10 and I am having problem with cassandra connector. Everytime I start a job that is using it the following exception is thrown:
com.datastax.driver.core.exceptions.TransportException: [/xx.xx.xx.xx] Error writing
at com.datastax.driver.core.Connection$10.operationComplete(Connection.java:550)
at com.datastax.driver.core.Connection$10.operationComplete(Connection.java:534)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at com.datastax.driver.core.Connection$Flusher.run(Connection.java:870)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.shaded.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Direct buffer memory
at com.datastax.shaded.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:643)
Also the following message was printed when the job was run locally (not in YARN):
13:57:54,490 ERROR com.datastax.shaded.netty.util.ResourceLeakDetector - LEAK: You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the JVM,so that only a few instances are created.
All jobs that do not use cassandra connector are working properly
Can someone help?

UPDATE: The bug is still reproducible and I think this is the reason: https://issues.apache.org/jira/browse/FLINK-17493.
I had an old configuration (from flink 1.7) where classloader.parent-first-patterns.additional: com.datastax. was configured and my cassadndra-flink connector was in flink/lib folder ( this was done because of other problems related to shaded netty I had with Cassandra-flink connector). Now with the migration to flink 1.10 the following problem was hit. Once removing this configuration - classloader.parent-first-patterns.additional: com.datastax., including flink-connector-cassandra_2.12-1.10.0.jar in my jar and removing it from /usr/lib/flink/lib/ the problem was no longer reproducible.

Unable to access S3 data using Spark 2.2

I get a lot of data uploaded to an S3 bucket that I want so analyze/visualize using Spark and Zeppelin. Yet, I am still stuck at loading data from S3.
I did some reading in order to get this together and spare me gory details. I am using the docker container p7hb/docker-spark as Spark installation and my basic test for reading data from S3 is derived from here:
I start the container and a master and a slave process within. I can validate this works by looking at the Spark Master WebUI, exposed on port 8080. This page does list the worker and keeps a log of all my failed attempts under the headline "Completed Applications". All of those are in the state FINISHED.
I open a bash inside that container and do the following:
a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.
b) start spark-shell. In order to access S3 one seems to need to load some extra packages. Browsing through SE I found especially this, which teaches me, that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5(, for arbitrary combinations of versions).
c) I run the following code
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
val sonnets=sc.textFile("s3a://my-bucket/my.file")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
And then I get all kinds of different Error messages, depending on the versions I choose in 2b).
I suppose there is nothing wrong with 2a), b/c I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply those. This is a known error new users seem to make.
While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.
Whatever! I tried thefollowing combinations
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error.
Many problems can lead to that error, my attempt as described above containseverything I was able to find on this page. The description above contains s3-eu-central-1.amazonaws.com as endpoint, while other places use s3.eu-central-1.amazonaws.com. According to enter link description here, both endpoint names are supposed to work. I did try both.
--packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case, I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecuto
r;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, I also get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
And, for completeness sake, when I don't provide the --packages parameter, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
Currently nothing seems to work. Yet, there are so many Q/As on this topic, who knows what's the way du jour of doing this. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?
Edit 1:
So I put another day into this, without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop doesn't have built in support for S3 anymore, but it as to be loaded through additional libraries, which are not part of Hadoop and entirely managed by itself. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here andit carries what I would call authoritative information:
The versions of hadoop-common and hadoop-aws must be identical.
The important thing about this information is, that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding jar file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
After that it gets murky. Hadoop versions 2.7.x have something going on internally, so that they are not compatible with more recent versions of aws-java-sdk, which is a library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest to using version 1.7.14 for 2.7.x.
So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsover, I always end up with error 400, Bad Request.
My Hadoop installation ships joda-time 2.9.4. I read the problem was resolved with Hadoop 2.8. I suppose I will just go ahead and build my own docker containers with more recent versions.
Edit 2
Moved to Hadoop 2.8.3. It just works now. Turns out you don't even have to mess around with JARs at all. Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARS in that directory, execute my code as stated above and now it works.

Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs Hadoop was built with, and the version of Jackson AWS was built with. Oh, and don't try mixing any of (different amazon-* JARs, different hadoop-* JARs, different jackson-* JARs); they all go in lock-sync.
For Spark 2.2.0 and Hadoop 2.7, use AWS 1.7.4 artifacts, and make sure that if you are on Java 8, that Joda time is > 2.8.0, such as 2.9.4. That can lead to 400 "bad auth problems".
Otherwise, try Troubleshooting S3A

Azure data factory HDInsight on-demand cluster 'Unable to instantiate SessionHiveMetaStoreClient'

I'm deploying an Azure data factory by deploying an ARM template using Visual Studio, basically following this Azure tutorial exactly, step by step.
The template defines a data factory, with an Azure Storage linked service (for reading and writing source and output data), an input dataset and an output data set, an HDInsight on-demand linked service, and a pipeline which runs an HDInsight HIVE activity to run a HIVE script which processes the input datasets into an output data set.
Everything deploys successfully and the pipeine activity starts. However I get the following error from the activity:
Exception in thread "main" java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:445)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675) at
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:619) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.hadoop.util.RunJar.run(RunJar.java:221) at
org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I have found various posts such as this one and this one suggesting the porblem is a known bug caused by a dash or hyphen in the HIVE metastore database name.
My problem is that using an ARM template to deploy an HDInsigh cluster on demand, I have no access to the cluster itself, so I can't make any manual config changes (the idea of on-demand is that it is transient, only created to serve a set of demands and then deletes itself).
The issue can be reproduced easily simply by following the tutorial step by step.
The only possible glimmer of hope I have found is by setting the hcatalogLinkedServiceName as documented here, which is designed to allow you to use your own Azure SQL database as the hive metastore. However, this doesn't work either - if I use that property, I get:
‘JamesTestTutorialARMDataFactory/HDInsightOnDemandLinkedService’
failed with message ‘HCatalog integration is not enabled for this
subscription.’
My subscription is unrestricted, and should have all the features of Azure available. So now I'm completely stuck. It seems that currently, using Hive with on-demand HDInsight is basically impossible?
If anyone can think of anything to try, I'm all ears!
Thanks

I recently studied the tutorial, and revised the tutorial. Here is my version, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-create-linux-clusters-adf/. I didn't see the error. The Hive table name doesn't have hyphen. I think mine is a bit easier to follow. I made a few minor changes on the template itself.

I managed to get in contact with the author of the tutorial on GIT - who contacted the Azure product team, and this was their response:
...this is a known issue with Linux based HDI clusters when you use
them with ADF. The HDI team has fixed the issue and will be deploying
the fix in the coming weeks. In the meantime, you have to use Window
based HDI clusters with ADF.
Please use windows as osType for now. I have fixed the article in GIT
and it will go live some time tomorrow.
The tutorial I link to has indeed been changed to use windows instead of linux. I've tried this and it works.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string