Connecting Pyspark to Oracle SQL - apache-spark

I am fairly new to Spark. I want to connect PySpark to Oracle SQL, and I am using the following PySpark code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
import os
spark_config = SparkConf().setMaster("local").setAppName("Project_SQL")
sc = SparkContext(conf = spark_config)
sqlctx = SQLContext(sc)
os.environ['SPARK_CLASSPATH'] = "C:\Program Files (x86)\Oracle\SQL Developer 4.0.1\jdbc\lib.jdbc6.jar"
df = sqlctx.read.format("jdbc").options(url="jdbc:oracle:thin:#<>:<>:<>"
, driver = "oracle.ojdbc6.jar.OracleDriver"
, dbtable = "account"
, user="...."
, password="...").load()
But I get the following error:
An error occurred while calling o29.load.:
java.lang.ClassNotFoundException: oracle.ojdbc6.jar.OracleDriver
I searched a lot and tried several of the ways I found to change/correct the path to the driver, but I still get the same error.
Could anyone please help me with this?

oracle.ojdbc6.jar.OracleDriver is not a valid driver class name for the Oracle JDBC driver. The name of the driver class is oracle.jdbc.driver.OracleDriver. Just make sure that the JAR file of the Oracle driver is on the classpath.
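For example, here is a minimal sketch of the question's read with the corrected driver class; the connection details are placeholders, and the Oracle JAR is assumed to have been made available to Spark already (for instance via --jars / --driver-class-path at launch):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_config = SparkConf().setMaster("local").setAppName("Project_SQL")
sc = SparkContext(conf=spark_config)
sqlctx = SQLContext(sc)

df = (sqlctx.read.format("jdbc")
      .options(url="jdbc:oracle:thin:@<host>:<port>:<sid>",  # placeholder connection details
               driver="oracle.jdbc.driver.OracleDriver",     # valid Oracle driver class name
               dbtable="account",
               user="....",
               password="....")
      .load())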

Try placing the Oracle JDBC driver JAR in the jars folder under your Spark installation.
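Once the JAR sits in Spark's jars folder (or is passed with --jars at launch), the SPARK_CLASSPATH environment variable is not needed at all. As a quick sanity check that the driver class is actually visible to the JVM, something like the following should succeed (a sketch, assuming the SparkContext sc from the question is already running):
# If this raises a Py4J error, the Oracle driver JAR is still not on the driver's classpath.
sc._jvm.java.lang.Class.forName("oracle.jdbc.driver.OracleDriver")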

Related

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders while reading from Azure Blob Storage

I am trying to read a CSV file stored in an Azure Storage Account. For that, I have installed Spark on my virtual machine and am trying to read the CSV file into a DataFrame from PySpark.
I read somewhere how to do that, followed the steps, and copied the latest hadoop-azure and azure-storage JAR files into my jars directory. Then I ran into this error:
NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
I searched for this error and found that I need to use hadoop-azure-2.8.5.jar instead of the latest hadoop-azure JAR. So I swapped in hadoop-azure-2.8.5.jar and executed my PySpark code again.
After executing my code, I encountered another error:
: java.lang.NoSuchMethodError:
org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
Below is my PySpark code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
storage_account_name = "<storage_account_name>"
storage_account_access_key = "<storage_account_access_key>"
spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",storage_account_access_key)
spark._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.my_account.blob.core.windows.net", "storage_account_access_key")
df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://<container_name>#<storage_account_name>.blob.core.windows.net/<path_to_csv>/sample_file.csv")
df.show()
I searched for this and tried various hadoop-azure JAR versions. The one that worked for me was hadoop-azure-2.7.0.jar.
With that JAR version, I was able to read the CSV file from Blob Storage.
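In other words, the hadoop-azure JAR generally has to match the Hadoop version Spark was built against, and the azure-storage JAR has to match what that hadoop-azure release expects. A sketch of the read once the matching JARs are in place; the placeholders mirror the question, and the azure-storage version is deliberately left open:
from pyspark.sql import SparkSession

# Assumes the matching JARs were copied into Spark's jars directory, e.g.:
#   hadoop-azure-2.7.0.jar      (the version that worked here)
#   azure-storage-<version>.jar (whichever version hadoop-azure 2.7.0 depends on)
spark = SparkSession.builder.appName("read_blob_csv").getOrCreate()

# Hand the account key to the underlying Hadoop configuration, as in the question.
spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<storage_account_name>.blob.core.windows.net",
    "<storage_account_access_key>")

df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("wasbs://<container_name>@<storage_account_name>"
            ".blob.core.windows.net/<path_to_csv>/sample_file.csv"))
df.show()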

Do we need any external jar for xml parsing in Spark?

I'm trying to parse XML in Spark and I am getting the error below. Could you please help me?
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object TestSpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "book")
      .load("c:\\sample.xml")
  }
}
Error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.xml.
No other external JARs are required apart from the Databricks spark-xml package. You need to add the dependency below for Spark 2.0+; if you are using an older Spark, you need to use the version built for it.
You need to use
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
Match the Scala version to that of Spark. Starting with version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users need to download the Spark source package and build it with Scala 2.10 support.
These may help:
Compatibility issue with Scala and Spark for compiled jars
spark-xml
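For reference, once the package is on the classpath the same data source can be used from PySpark as well. A sketch, assuming spark-xml_2.11:0.4.1 was supplied at launch (for example with --packages) and assuming each record is a <book> element; note that for reading it is the rowTag option that selects the record element, while rootTag applies to writing:
from pyspark.sql import SparkSession

# Assumes spark-xml_2.11:0.4.1 was supplied when Spark was launched
# (e.g. via --packages or by dropping the JARs into the jars folder).
spark = SparkSession.builder.appName("Test").getOrCreate()

df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")   # assumption: one <book> element per record
      .load("c:\\sample.xml"))
df.printSchema()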

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in Databricks in order to get a Spark session and use it to open a CSV file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (you will understand better if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it to work by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking at the PySpark code, since I found that reading a CSV worked in the interactive shell.
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available in Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
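A quick way to confirm is to print the runtime version; the sketch below assumes sc (the SparkContext) is predefined in the Databricks notebook. On Spark 2.x the session can then be built explicitly instead of relying on a predefined spark variable:
# Check which Spark version the cluster is actually running.
print(sc.version)   # assumes the notebook predefines sc

# On Spark 2.x this import succeeds and the session can be created explicitly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
fireServiceCallsDF = spark.read.csv(
    '/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv',
    header=True, inferSchema=True)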

Can't access Spark 2.0 Temporary Table from beeline

With Spark 1.5.1, I was already able to access spark-shell temporary tables from Beeline using the Thrift server; I figured out how by reading answers to related questions on Stack Overflow.
However, after upgrading to Spark 2.0 I can't see temporary tables from Beeline anymore. Here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 --conf spark.sql.hive.thriftServer.singleSession=true
Once the spark-shell is ready, I enter the following lines to launch the Thrift server and create a temporary view from a DataFrame whose source is a JSON file:
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table; it runs fine.
However, when I start Beeline and connect to my Thrift server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something in my Spark upgrade from 1.5.1 to 2.0? How can I regain access to my temporary tables?
This worked for me after upgrading to Spark 2.0.1:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// sparkMaster and hdfsDataUri are defined elsewhere in the application.
val sparkConf =
  new SparkConf()
    .setAppName("Spark Thrift Server Demo")
    .setMaster(sparkMaster)
    .set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")

// Enable Hive support so the Thrift server can serve the session's catalog.
val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!
Although you haven't mentioned your Spark version or the details of the exception...
Please see PHOENIX-2287, which has been fixed; it says:
Environment: HBase 1.1.1 running in standalone mode on OS X, Spark 1.5.0, Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56
Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
