import org.apache.spark.streaming.kafka._ Cannot resolve symbol kafka - apache-spark

I have created a Spark application to integrate with Kafka and get a stream of data from Kafka.
But when I try to import org.apache.spark.streaming.kafka._, I get the error Cannot resolve symbol kafka. What should I do to import this library?

Depending on your Spark and Scala version, you need to add the spark-kafka integration library to your dependencies.
Spark Structured Streaming
If you plan to use Spark Structured Streaming you need to add the following to your dependencies as described here:
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.0.1
Please note that to use the headers functionality, your Kafka client version should be 0.11.0.0 or higher. For Python applications, you need to add the above library and its dependencies when deploying your application. The same applies when experimenting in spark-shell: add the library and its dependencies when invoking spark-shell. See the Deploying subsection of the guide for details.
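As a concrete illustration, in an SBT build those coordinates become a single libraryDependencies entry. This is a minimal sketch; the 3.0.1 / Scala 2.12 versions follow the guide above and should be matched to your own Spark and Scala versions.
// build.sbt (sketch; adjust versions to your cluster)
scalaVersion := "2.12.12"
libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.12) automatically
  "org.apache.spark" %% "spark-sql"            % "3.0.1" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"
)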
Spark Streaming
If you plan to work with Spark Streaming (the Direct API), you can follow the guidance given here:
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact (see Linking section in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.12
version = 3.0.1
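In SBT form this is again one line (a sketch with the guide's versions; adjust to your build). Note that with the 0-10 integration the import path is org.apache.spark.streaming.kafka010, not org.apache.spark.streaming.kafka.
// build.sbt (sketch)
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.1"

// in the application code
import org.apache.spark.streaming.kafka010._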

Related

Apache Spark Connector - where to install on Databricks

The Apache Spark connector: SQL Server & Azure SQL article from the Azure team describes how to use this connector.
Question: If you want to use the above connector in Azure Databricks, where do you install it?
Remarks: The above article tells you to install it from here and import it in, say, your notebook using com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, but it does not say where to install it. I'm probably not understanding the article correctly. I need to use it in Azure Databricks and would like to know where to install the connector (compiled) jar file.
You can do this in the cluster setup. See this documentation: https://databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html
In short, when setting up the cluster, you can add third party libraries by their Maven coordinates - "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0" is an example of a Maven coordinate.
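If you prefer to script the install rather than use the cluster UI, the (legacy) Databricks CLI can attach a Maven library to a running cluster. The command below is only a sketch: the cluster ID is a placeholder and the exact flags are assumptions to verify against your CLI version's help output.
# sketch: legacy Databricks CLI; <your-cluster-id> is a placeholder
databricks libraries install \
  --cluster-id <your-cluster-id> \
  --maven-coordinates com.microsoft.azure:spark-mssql-connector_2.12:1.2.0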

Read/Load avro file from s3 using pyspark

Using an AWS Glue development endpoint; Spark version: 2.4; Python version: 3
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
I'm getting the following error message while trying to read the avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they did not help resolve my issue:
Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to include the spark-avro package, e.g. com.databricks:spark-avro_2.11:4.0.0 (the built-in source uses org.apache.spark:spark-avro_2.11 with the version matching your Spark release).
Check which version you need here.
Have you included the package when starting the shell? If not, you need to start the shell as shown below. The package below is applicable for Spark 2.4+:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also, use the following inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: with the com.databricks package, you need to write 'com.databricks.spark.avro' as the format instead of 'avro'.

java.lang.NoSuchMethodError: org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.withHttpClient

I am following this example to get data from Solr into my Scala Spark program. Below is my code:
val solrURL = "someurl"
val collectionName = "somecollection"
val solrRDD = new SelectSolrRDD(solrURL,collectionName,sc)
val solrQuery=new SolrQuery("somequery")
solrQuery.setTimeAllowed(0)
val solrDataRDD=solrRDD.query(solrQuery)
When I run this code on my local Spark cluster, I get the following exception at the new SelectSolrRDD line:
java.lang.NoSuchMethodError: org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.withHttpClient(Lorg/apache/http/client/HttpClient;)Lorg/apache/solr/client/solrj/impl/SolrClientBuilder;
I looked at some other answers on StackOverflow but nothing worked.
The problem is with your packaging and deployment (your pom.xml, assuming you are using Maven). The issue is that the Solr client libraries are not being loaded when you run your Spark app. You need to package your app and its dependencies into an "uber jar" for deployment to a cluster.
Take a look at how spark-solr is set up; it uses the maven-shade-plugin to generate the uber jar.
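As a rough sketch of what that looks like in a pom.xml (the plugin version here is just an example; spark-solr's own pom is the authoritative reference):
<!-- pom.xml, inside <build><plugins> ... </plugins></build> (sketch) -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <!-- build the uber jar during mvn package -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
Running mvn package then produces a shaded jar that bundles the Solr client classes, and that is the jar you submit to the cluster instead of the thin application jar.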
My cluster already had spark-solr jars present, which conflicted with the jars I was using. After removing those jars, my code worked correctly.

Providing Hive support to a deployed Apache Spark

I need to use Hive-specific features in Spark SQL, but I have to work with an already deployed Apache Spark instance that, unfortunately, doesn't have Hive support compiled in.
What would I have to do to include Hive support for my job?
I tried using the spark.sql.hive.metastore.jars setting, but then I always get these exceptions:
DataNucleus.Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
and
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
In that setting I am providing a fat jar of spark-hive (with spark-core and spark-sql excluded) together with all its optional Hadoop dependencies (CDH-specific versions of hadoop-archives, hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-yarn-api, hadoop-yarn-client and hadoop-yarn-common).
I am also setting spark.sql.hive.metastore.version to 1.2.1.
I am using CDH 5.3.1 (with Hadoop 2.5.0) and Spark 1.5.2 on Scala 2.10.
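For context, here is a sketch of how those settings are typically wired up in Spark 1.x code; the jar path is a placeholder for the fat jar described above, and this only illustrates the configuration being discussed, not a fix.
// Spark 1.5.x sketch (pre-SparkSession); the jars path is a placeholder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("hive-support-example")
  .set("spark.sql.hive.metastore.version", "1.2.1")
  .set("spark.sql.hive.metastore.jars", "/path/to/hive-metastore-jars/*")
val sc = new SparkContext(conf)
// HiveContext itself requires spark-hive (and its DataNucleus deps) on the classpath
val hiveContext = new HiveContext(sc)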

loading Mllib of Apache Spark

I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program:
Object Mllib is not a member of package org.apache.spark
Please look at the package name - it is lowercase mllib, not Mllib:
import org.apache.spark.mllib._
And follow the guide here.
https://spark.apache.org/docs/1.1.0/mllib-guide.html
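If the import still does not resolve, the usual cause in a standalone build is that spark-mllib is missing from the dependencies; in SBT that is one extra line (1.1.0 matches the guide linked above, so substitute your own Spark version):
// build.sbt (sketch)
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.0"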

Resources