Some streams terminated before this command could finish! Structured Streaming - Azure

I am trying to read streaming data into Azure Databricks coming from Azure Event Hubs.
This is the code I've been using:
connectionString = "Connection string"
ehConf = {
  'eventhubs.connectionString' : connectionString
}

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

query = df \
  .writeStream \
  .outputMode("append") \
  .format("console") \
  .start()
And it's giving me an error saying:
ERROR: Some streams terminated before this command could finish!
I understand that the Azure Event Hubs JAR has to match both the Databricks runtime and the Spark version.
My Spark version is 2.4.5 and my Databricks runtime is 6.6, and the JAR I used for this combination is azure-eventhubs-spark_2.12-2.3.17.jar, as specified.
But I'm still facing the same "Some streams terminated before this command could finish!" error. Can anyone please help me with this?
Thanks

When I started working on this, I first faced the same issue you are experiencing:
ERROR: Some streams terminated before this command could finish!
After making the changes below, it works perfectly with the following configuration:
Databricks Runtime: 6.6 (includes Apache Spark 2.4.5, Scala 2.11)
Azure EventHub library: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17
Step 1: Install the library.
You can try installing "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17" using the Install Library option. Note that Databricks Runtime 6.6 uses Scala 2.11, so you need the _2.11 artifact rather than the azure-eventhubs-spark_2.12 JAR.
Step 2: Change the configuration related to Azure Event Hubs.
If you use "ehConf = {'eventhubs.connectionString' : connectionString}" with version 2.3.15 or above, you will receive the error message below.
java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
Note: All configuration relating to Event Hubs happens in your Event Hubs configuration dictionary. The configuration dictionary must contain an Event Hubs connection string:
connectionString = "YOUR.CONNECTION.STRING"
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
For version **2.3.15** and above, the configuration dictionary requires that the connection string be encrypted:
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
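Putting the pieces together, here is a minimal end-to-end sketch of the setup described above, assuming the _2.11 connector library is installed on the cluster and with the connection string left as a placeholder:
# Assumes com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17 is installed on the cluster
connectionString = "YOUR.CONNECTION.STRING"  # placeholder

# From connector version 2.3.15 onwards, the connection string must be passed encrypted
ehConf = {
  'eventhubs.connectionString':
      sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
}

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

query = df \
  .writeStream \
  .outputMode("append") \
  .format("console") \
  .start()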

Related

403 Error while accessing s3a using Spark

ISSUE:
I am able to successfully download the file using the AWS CLI as well as boto3.
However, while using the S3A connector of Hadoop/Spark, I receive the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o24.parquet.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: BCFFD14CB2939D68, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: MfT8J6ZPlJccgHBXX+tX1fpX47V7dWCP3Dq+W9+IBUfUhsD4Nx+DcyqsbgbKsPn8NZzjc2U
Configuration:
Running this on my local machine
Spark Version 2.4.4
Hadoop Version 2.7
Jars added:
hadoop-aws-2.7.3.jar
aws-java-sdk-1.7.4.jar
Hadoop Config:
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.session.token", session_key)
hadoop_conf.set("fs.s3a.endpoint", "s3-us-west-2.amazonaws.com") # yes, I am using central eu server.
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
Code to Read the file:
from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext.getOrCreate()
hadoop_conf=sc._jsc.hadoopConfiguration()
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet(path)
print(df.head())
Set AWS credentials provider to profile credentials:
hadoopConf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
I faced the same issue as above. It turns out that I needed to add the session token to my configuration. See the documentation.
configuration.set("fs.s3a.session.token", System.getenv("AWS_SESSION_TOKEN"))
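Pulling both answers together, here is a minimal PySpark sketch of the temporary-credentials setup; it assumes the key, secret, and session token are available as the standard AWS environment variables, and the endpoint and bucket path are placeholders to adjust:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# Temporary (STS) credentials need the session token in addition to the key pair
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoop_conf.set("fs.s3a.session.token", os.environ["AWS_SESSION_TOKEN"])
hadoop_conf.set("fs.s3a.endpoint", "s3-us-west-2.amazonaws.com")  # match your bucket's region

df = spark.read.parquet("s3a://your-bucket/path/to/data.parquet")  # placeholder path
print(df.head())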

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

Trying to read files from S3 using hadoop-aws. The command used to run the code is mentioned below.
Please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", accessKeyId)
hadoop_conf.set("fs.s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to PySpark, I am unaware of these dependencies, and the error is not easy to understand.
I am getting this error:
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with spark 3.0.0 / hadoop 3.2.
What worked for me was to replace the hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
Check your Spark guava JAR version. If you downloaded Spark from Amazon like me, from the link (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz) in their documentation, you can see that the included guava version is guava-14.0.1.jar while their container is using guava-21.0.jar.
I have reported the issue to them and they will repack their Spark to include the correct version. If you are interested in the bug itself, here is the link: https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem
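For reference, a minimal sketch consistent with the first answer, pinning hadoop-aws to 3.2.0 (the version that answer found to work against the bundled Hadoop 3.2 jars); the credentials file path, section names, and bucket path are taken from the question and are placeholders:
# Run with a matching hadoop-aws version, e.g.:
#   spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0 connect_s3_using_keys.py
import configparser  # Python 3 name of the ConfigParser module
from pyspark.sql import SparkSession

config = configparser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")  # path from the question

spark = SparkSession.builder.appName("Deepak_1ST_job").getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", config.get("aws_keys", "access_key"))
hadoop_conf.set("fs.s3a.secret.key", config.get("aws_keys", "secret_key"))

df = spark.read.json("s3a://bucket_name/logs/20191117log.json")  # placeholder bucket/key
df.show()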

How to access a file in windows using Spark and winutils?

I am running Spark on Windows using winutils.
In spark-shell I am trying to load a CSV file, but it says the path does not exist, even though I have a file at location E:/data.csv.
I am executing:
scala> val df = spark.read.option("header","true").csv("E:\\data.csv")
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/E:/data.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
I can't figure out why it is appending a "/" before "E:", whereas it should have been only E:.
How should I access the file?
In my case I am able to read the file as below:
val input = spark.sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .option("quoteAll", "true")
  .option("inferSchema", "false")
  .load("C:/Work/test.csv")
  .toDF()

SaveAsTextFile() results in Mkdirs failed - Apache Spark

Hi, I currently have a Spark cluster with 3 worker nodes. I also have an NFS server mounted on /var/nfs with 777 permissions for testing. I'm trying to run the following code to count the words in a text file:
root@master:/home/usuario# MASTER="spark://10.0.0.1:7077" spark-shell
val inputFile = sc.textFile("/var/nfs/texto.txt")
val counts = inputFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.toDebugString
counts.cache()
counts.count()
counts.saveAsTextFile("/home/usuario/output");
But Spark gives me the following error:
Caused by: java.io.IOException: Mkdirs failed to create
file:/var/nfs/output-4/_temporary/0/_temporary/attempt_20170614094558_0007_m_000000_20
(exists=false, cwd=file:/opt/spark/work/app-20170614093824-0005/2)
I have searched many websites but I cannot find a solution for my case. Any help is appreciated.
When you start spark-shell with MASTER set to a valid cluster master URL, and not local[*], Spark treats all paths as HDFS paths and performs IO operations only against the underlying HDFS, not the local filesystem.
You have mounted the locations on the local filesystem, and those paths do not exist in HDFS.
That's why the error says: exists=false
Same issue here. Check the ownership of your directory again:
sudo chown -R owner-user:owner-group directory

SQOOP is not able to load SAP HANA driver

I am trying to import data from an SAP HANA database onto Azure Data Lake Store using Sqoop.
For this, I've downloaded the HDB client to connect to the HANA database, but I'm looking for the location to copy 'ngdbc.jar' to ($SQOOP_HOME/lib). On the HDInsight cluster I am not able to see the environment variable $SQOOP_HOME/lib; it seems to be blank. Can anybody point me to the right location on the HDP-based HDInsight cluster?
Currently, I am encountering the following error.
sshadmin#hn0-busea2:~$ sqoop import --connect 'jdbc:sap://XXXXXXX0004.ms.XXXXXXX.com:30015/?database=HDB&user=XXXXXXXXX&password=XXXXXXXXXXXXX' --driver com.sap.db.jdbc.Driver \
--query 'select * from XXX.TEST_HIERARCHY where $CONDITIONS' \
--target-dir 'adl://XXXXXXXXXXXXX.azuredatalakestore.net:443/hdi-poc-dl/SAP_TEST_HIERARCHY' \
--m 1;
Warning: /usr/hdp/2.4.2.4-5/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/01/18 10:34:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.4-5
17/01/18 10:34:26 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
17/01/18 10:34:26 INFO manager.SqlManager: Using default fetchSize of 1000
17/01/18 10:34:26 INFO tool.CodeGenTool: Beginning code generation
17/01/18 10:34:26 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.sap.db.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.sap.db.jdbc.Driver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForQuery(SqlManager.java:234)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:304)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1845)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1645)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
Try this path: /usr/hdp/current/sqoop-client/lib/
