Missing library to ingest data into Azure Data Explorer with PySpark

I am trying to ingest data into Azure Data Explorer through PySpark, using the PyCharm IDE. However, I keep running into missing-library errors when I run my code.
According to the Azure Data Explorer connector's page, I need to install the connector's JAR and its two dependency JARs, kusto-ingest and kusto-data.
After downloading all three JARs and importing them into PySpark, I still can't proceed with the ingestion: it keeps returning missing-library errors. The first was the azure-storage library. After I installed and imported that JAR, it asked for the adal4j library; I did the same and it asked for the oauth2 library, then the json library, the azure-client-authentication library, the javax mail library, and so on.
I've installed more than 10 JARs and I still can't run this ingestion. Am I doing something wrong?
My PySpark version is 2.4. You can see my code below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .appName("Teste") \
    .config('spark.jars', 'kusto-spark_2.4_2.11-2.5.2.jar,kusto-data-2.4.1.jar,kusto-ingest-2.4.1.jar,azure-storage-8.3.0.jar,json-20180813.jar,adal4j-1.6.5.jar') \
    .getOrCreate()
# loading a test csv file
df = spark.read.csv('MOCK_DATA.csv', header=True, sep=',')
df.write.format("com.microsoft.kusto.spark.datasource") \
    .option("kustoCluster", "myclustername") \
    .option("kustoDatabase", "mydatabase") \
    .option("kustoTable", "mytable") \
    .option("kustoAadAppId", "myappid") \
    .option("kustoAadAppSecret", "mysecret") \
    .option("kustoAadAuthorityID", "myautorityid") \
    .mode("Append") \
    .save()

When you are not working with a Maven-based installation, you need to use a JAR that bundles all of the dependencies.
You can get it from the GitHub releases:
https://github.com/Azure/azure-kusto-spark/releases
or, if it is missing for your specific version, build it yourself by cloning the repo and running
mvn assembly:single
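
As a minimal sketch of how the session setup might look once a single assembly JAR is available: the file name in FAT_JAR is an assumption (check the actual asset name on the releases page or in your build output), and build_spark_jars and create_session are hypothetical helpers, not part of the connector.

```python
from typing import Iterable


def build_spark_jars(jar_paths: Iterable[str]) -> str:
    """Join JAR paths into the comma-separated form that spark.jars expects."""
    return ",".join(p.strip() for p in jar_paths if p.strip())


# Assumed file name for the assembly JAR; the real name depends on the release.
FAT_JAR = "kusto-spark_2.4_2.11-2.5.2-jar-with-dependencies.jar"


def create_session(jars: str):
    """Create a local SparkSession that loads the given spark.jars value."""
    # Imported lazily so build_spark_jars can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")
        .appName("Teste")
        .config("spark.jars", jars)
        .getOrCreate()
    )
```

Calling create_session(build_spark_jars([FAT_JAR])) would then stand in for the long hand-maintained JAR list in the question, since the transitive dependencies travel inside the one file.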

Related

import org.apache.spark.streaming.kafka._ Cannot resolve symbol kafka

I have created a Spark application to integrate with Kafka and read a stream of data from it.
But when I try to import org.apache.spark.streaming.kafka._, I get the error "Cannot resolve symbol kafka". What should I do to import this library?
Depending on your Spark and Scala versions, you need to add the Spark-Kafka integration library to your dependencies.
Spark Structured Streaming
If you plan to use Spark Structured Streaming you need to add the following to your dependencies as described here:
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.0.1
Please note that to use the headers functionality, your Kafka client version should be version 0.11.0.0 or up. For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below. For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell. Also, see the Deploying subsection below.
Spark Streaming
If you plan to work with Spark Streaming (Direct API), you can follow the guidance given here:
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact (see Linking section in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.12
version = 3.0.1
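
Since the artifact name encodes the Scala version and the version field follows the Spark version, the two have to be kept in sync. A rough sketch of composing the coordinate in the group:artifact:version form that spark-submit --packages (or the spark.jars.packages setting) accepts; kafka_package is a hypothetical helper name, not a Spark API.

```python
def kafka_package(scala_version: str, spark_version: str, structured: bool = True) -> str:
    """Return the Maven coordinate of the Spark-Kafka 0.10 integration artifact.

    structured=True  -> spark-sql-kafka-0-10 (Structured Streaming)
    structured=False -> spark-streaming-kafka-0-10 (Spark Streaming, Direct API)
    """
    artifact = "spark-sql-kafka-0-10" if structured else "spark-streaming-kafka-0-10"
    return f"org.apache.spark:{artifact}_{scala_version}:{spark_version}"
```

For the versions quoted above, kafka_package("2.12", "3.0.1") yields the Structured Streaming coordinate, and passing structured=False yields the Direct API one.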

Read/Load avro file from s3 using pyspark

Using an AWS Glue developer endpoint. Spark version: 2.4. Python version: 3.
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
Getting the following error message while trying to read avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they did not help me resolve my issue:
Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to include this package; its version should match your Spark version:
org.apache.spark:spark-avro_2.11:2.4.0
Check which version you need here
Have you imported the package while starting the shell? If not, you need to start the shell as shown below. The package below is applicable for Spark 2.4+.
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also write the following inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: For PySpark you need to write 'com.databricks.spark.avro' instead of 'avro'.
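
For the built-in module that the error message refers to, here is a minimal sketch, assuming the module's version tracks the Spark version as the Apache Avro Data Source Guide describes. avro_package and read_avro are hypothetical helper names, and the Scala 2.11 default is an assumption about a stock Spark 2.4 build.

```python
def avro_package(spark_version: str, scala_version: str = "2.11") -> str:
    """Maven coordinate of the external spark-avro module for a given Spark build."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


def read_avro(path: str, spark_version: str = "2.4.0"):
    """Start a session that pulls in spark-avro, then load an Avro file."""
    # Imported lazily so avro_package can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("avro-read")
        # Only takes effect if no session/JVM is already running.
        .config("spark.jars.packages", avro_package(spark_version))
        .getOrCreate()
    )
    return spark.read.format("avro").load(path)
```

On a managed endpoint where the session already exists, the setting in read_avro would probably be ignored, so there the coordinate from avro_package would have to be supplied at launch instead (for example through --packages).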

java.lang.NoSuchMethodError: org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.withHttpClient

I am following this example to get data from Solr to my Scala Spark program. Below is my code:
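import com.lucidworks.spark.rdd.SelectSolrRDD
import org.apache.solr.client.solrj.SolrQuery

val solrURL = "someurl"
val collectionName = "somecollection"
// sc is the active SparkContext
val solrRDD = new SelectSolrRDD(solrURL, collectionName, sc)
val solrQuery = new SolrQuery("somequery")
solrQuery.setTimeAllowed(0)
val solrDataRDD = solrRDD.query(solrQuery)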
When I run this code on my local Spark cluster, I get the following exception at the new SelectSolrRDD line:
java.lang.NoSuchMethodError: org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.withHttpClient(Lorg/apache/http/client/HttpClient;)Lorg/apache/solr/client/solrj/impl/SolrClientBuilder;
I looked at some other answers on StackOverflow but nothing worked.
The problem is with your packaging and deployment (your pom.xml, assuming you are using Maven). The issue is that the Solr client libraries are not being loaded when you run your Spark app. You need to package your app and any dependencies into an "uber jar" for deployment to a cluster.
Take a look at how spark-solr has it set up. They use the maven-shade-plugin to generate the uber jar.
My cluster already had spark-solr JARs present, and they conflicted with the JARs I was using. After removing those JARs, my code worked correctly.
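
A NoSuchMethodError like this one usually means two versions of the same library are on the classpath. A rough way to spot that is to list the JAR file names from the cluster's jars folder together with your own and look for one artifact under several versions. The sketch below assumes Maven-style file names (name-version.jar); find_conflicting_jars is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Splits "name-1.2.3.jar" into artifact name and version (Maven-style names assumed).
_JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_conflicting_jars(jar_files: Iterable[str]) -> Dict[str, Set[str]]:
    """Return every artifact that appears with more than one version."""
    versions: Dict[str, Set[str]] = defaultdict(set)
    for jar in jar_files:
        # Look only at the file name, not the directory it lives in.
        match = _JAR_RE.match(jar.rsplit("/", 1)[-1])
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: found for name, found in versions.items() if len(found) > 1}
```

This only catches conflicts that are visible in file names. A shaded JAR that bundles the Solr classes internally will not show up, so an empty result does not rule out a clash.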

Load external jars to Zeppelin from s3

Pretty simple objective: load my custom/local JARs from S3 into a Zeppelin notebook (using Zeppelin from AWS EMR).
Location of the JAR:
s3://my-config-bucket/process_dataloader.jar
Following the Zeppelin documentation, I opened the interpreter settings and added a property named spark.jars with the value s3://my-config-bucket/process_dataloader.jar
I restarted the interpreter, and then in the notebook I tried to import from the JAR using the following:
import com.org.dataloader.DataLoader
but it throws the following
<console>:23: error: object org is not a member of package com
import com.org.dataloader.DataLoader
Any suggestions for solving this problem?
A bit late, but for anyone else who might need this in the future, try the option below.
"https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar" is basically your S3 object URL.
%spark.dep
z.reset()
z.load("https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar")
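
If you only have the s3:// form of the location, here is a small sketch of turning it into an HTTPS object URL. It assumes the virtual-hosted style on the global S3 endpoint and an object that is actually reachable over HTTPS; regional endpoints and private buckets will need a different URL or credentials. s3_uri_to_https is a hypothetical helper.

```python
from urllib.parse import quote, urlparse


def s3_uri_to_https(s3_uri: str) -> str:
    """Convert s3://bucket/key into https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(s3_uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid s3://bucket/key URI: {s3_uri!r}")
    # Percent-encode the key but keep the path separators intact.
    return f"https://{parsed.netloc}.s3.amazonaws.com/{quote(key)}"
```

The resulting string is what you would pass to z.load(...) in place of the s3:// URI.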

Query Hive table created with built-in Serde from Spark app

I have a Hadoop cluster deployed using Hortonworks' HDP 2.2 (Spark 1.2.1 & Hive 0.14).
I have developed a simple Spark app that is supposed to retrieve the content of a Hive table, perform some actions, and output to a file. The Hive table was imported using Hive's built-in SerDe.
When I run the app on the cluster I get the following exception:
ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1982)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
...
Basically, Spark doesn't find Hive's SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde).
I didn't find any jar to include at the app's execution and no mention of a similar problem anywhere. I have no idea how to tell Spark where to find it.
Make a shaded JAR of your application that includes the hive-serde JAR. Refer to this.
Add the JAR file to the Spark config property spark.driver.extraClassPath.
Any external JAR must be added there; the Spark environment will then load it automatically.
Or use the spark-shell --jars option.
Example:
spark.executor.extraClassPath /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
The .jar was in Hive's lib folder; I just had to add it on launch with --jars and know where to look!
--jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar
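
To put the two approaches above side by side, here is a rough sketch that assembles a spark-submit command line. spark_submit_args is a hypothetical helper, and the JAR paths you pass in are your own; nothing here looks them up. It uses --jars by default, which ships the files with the job, and can use the extraClassPath properties instead, which expect the files to already exist at the same path on every node.

```python
from typing import List, Sequence


def spark_submit_args(app: str, extra_jars: Sequence[str],
                      use_extra_classpath: bool = False) -> List[str]:
    """Build a spark-submit command line that makes extra JARs visible to the job."""
    args = ["spark-submit"]
    if use_extra_classpath:
        # Classpath entries are ':'-separated on Linux.
        classpath = ":".join(extra_jars)
        args += ["--conf", f"spark.driver.extraClassPath={classpath}",
                 "--conf", f"spark.executor.extraClassPath={classpath}"]
    else:
        # --jars takes a comma-separated list.
        args += ["--jars", ",".join(extra_jars)]
    # Options must come before the application file.
    args.append(app)
    return args
```

The returned list can be handed to subprocess.run, or joined with spaces to paste into a shell.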
