Setting up ignite context in pyspark - apache-spark

How do I set up an Ignite context in Spark using Python?
igniteContext = IgniteContext(sparkContext, "c:/desktop/config.xml")
ImportError: cannot import IgniteContext

There are only Java and Scala APIs available for IgniteRDD, so IgniteContext cannot be imported from PySpark.

Related

How to run CreateIndex function in Hyperspace (spark)

I am trying to create an index using Hyperspace in PySpark, but I am getting an error:
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id', 'name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)
hyperspace.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running on an Azure Databricks environment:
Spark 3.0.0, Scala 2.12
When I try to do the same on Spark 2.4.2 with Scala 2.12 or Scala 2.11, I get an error in the same function (createIndex), but this time it is:
Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest a solution?
Per the last comment of https://github.com/microsoft/hyperspace/discussions/285, this is a known issue with the Databricks runtime.
If you use open-source Spark, it should work.
A solution is being sought with the Databricks team.
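For reference, a minimal sketch of the same flow on open-source Spark. It assumes the hyperspace-core jar has been supplied at launch (for example via spark-submit --packages with a coordinate such as com.microsoft.hyperspace:hyperspace-core_2.12:0.4.0, to be verified against your Spark/Scala build) and that the Hyperspace Python bindings are installed:
from pyspark.sql import SparkSession
from hyperspace import Hyperspace, IndexConfig

spark = SparkSession.builder.appName("hyperspace-demo").getOrCreate()

# Re-create the sample parquet table from the question and index it
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ["id", "name"]).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")

hyperspace = Hyperspace(spark)
hyperspace.createIndex(df, IndexConfig("index", ["id"], ["name"]))

# List the registered indexes to confirm creation
hyperspace.indexes().show()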

Read /Write delta lake tables on S3 using AWS Glue jobs

I am trying to access Delta Lake tables stored on S3 from AWS Glue jobs, but I am getting the error "Module Delta not defined".
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0").getOrCreate()
from delta.tables import *
data = spark.range(0, 5)
data.write.format("delta").save("S3://databricksblaze/data")
I have also added the necessary jar (delta-core_2.11-0.6.0.jar) to the dependent jars of the Glue job.
Can anyone help me with this?
Thanks
I have had success using Glue + Delta Lake. I added the Delta Lake dependencies to the "Dependent jars path" section of the Glue job.
Here is the list of them (I am using Delta Lake 0.6.1):
com.ibm.icu_icu4j-58.2.jar
io.delta_delta-core_2.11-0.6.1.jar
org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar
org.antlr_antlr4-4.7.jar
org.antlr_antlr4-runtime-4.7.jar
org.antlr_antlr-runtime-3.5.2.jar
org.antlr_ST4-4.0.8.jar
org.glassfish_javax.json-1.0.4.jar
Then in your Glue job you can use the following code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
# Ship the Delta Lake jar to the workers so the delta Python module can be imported
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")
from delta.tables import *
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Write a small Delta table to S3 and load it back as a DeltaTable
delta_path = "s3a://your_bucket/folder"
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)
deltaTable = DeltaTable.forPath(spark, delta_path)
You also need to pass the additional configuration properties:
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Setting spark.jars.packages in SparkSession.builder.config doesn't work. spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit, so it must be passed as an argument to the spark-submit or pyspark script. By the time SparkSession.builder.config is called, SparkSubmit has already done its job, so spark.jars.packages is a no-op at that point. See https://issues.apache.org/jira/browse/SPARK-21752 for more details.
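For a plain PySpark script outside Glue, one workaround is to set the submit arguments in the environment before the JVM is launched; a minimal sketch:
import os

# Must be set before the SparkContext/JVM starts; PYSPARK_SUBMIT_ARGS has to
# end with "pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.delta:delta-core_2.11:0.6.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

from delta.tables import *  # resolves once the package jars are on the classpath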

Connecting Pyspark to Oracle SQL

I am fairly new to Spark. I want to connect PySpark to Oracle SQL, and I am using the following PySpark code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
import os
spark_config = SparkConf().setMaster("local").setAppName("Project_SQL")
sc = SparkContext(conf = spark_config)
sqlctx = SQLContext(sc)
os.environ['SPARK_CLASSPATH'] = "C:\Program Files (x86)\Oracle\SQL Developer 4.0.1\jdbc\lib.jdbc6.jar"
df = sqlctx.read.format("jdbc").options(url="jdbc:oracle:thin:#<>:<>:<>"
, driver = "oracle.ojdbc6.jar.OracleDriver"
, dbtable = "account"
, user="...."
, password="...").load()
But I get the following error:
An error occurred while calling o29.load.:
java.lang.ClassNotFoundException: oracle.ojdbc6.jar.OracleDriver
I searched a lot and tried several ways that I found to change/correct the path to the driver, but I still get the same error.
Could anyone help me with this please?
oracle.ojdbc6.jar.OracleDriver is not a valid driver class name for the Oracle JDBC driver. The driver class is oracle.jdbc.driver.OracleDriver. Just make sure that the jar file of the Oracle driver is on the classpath.
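A minimal corrected sketch using the SparkSession API, with a hypothetical jar path and placeholder connection details (replace them with your own):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Project_SQL")
    .master("local")
    # Put the Oracle driver jar on the classpath (path is hypothetical)
    .config("spark.jars", r"C:\path\to\ojdbc6.jar")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
    .option("driver", "oracle.jdbc.driver.OracleDriver")  # correct driver class
    .option("dbtable", "account")
    .option("user", "....")
    .option("password", "....")
    .load()
)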
Try placing the Oracle JDBC connectivity jar in the jars folder under Spark.

Do we need any external jar for xml parsing in Spark?

I'm trying to parse XML in Spark, but I am getting the error below. Could you please help me?
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object TestSpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "book")
      .load("c:\\sample.xml")
  }
}
Error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.xml.
No other external jars are required apart from the Databricks spark-xml package. You need to add the dependency for Spark 2.0+; if you are using an older Spark version, you need to use this.
You need to use
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
Match the Scala version to that of Spark. Starting with version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users need to download the Spark source package and build it with Scala 2.10 support.
These may help:
Compatibility issue with Scala and Spark for compiled jars
spark-xml
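For reference, a minimal PySpark sketch that supplies the package at launch (the same --packages coordinate works with spark-submit for the Scala code above; the file path and rowTag value are hypothetical):
import os

# Must be set before the SparkContext/JVM starts
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.11:0.4.1 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rootTag", "book")   # as in the question
    .option("rowTag", "page")    # hypothetical element name for the records
    .load("c:\\sample.xml")
)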

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in Databricks in order to get a Spark session and use it to open a CSV file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (you will understand better if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it working by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the PySpark code, as I found that reading CSV was working in the interactive shell.
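If spark is still not defined after the imports, a minimal sketch of creating the session explicitly (Spark 2.x) and reading the same file:
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession instead of relying on a predefined `spark` variable
spark = SparkSession.builder.appName("fire-calls").getOrCreate()

fireServiceCallsDF = spark.read.csv(
    '/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv',
    header=True,
    inferSchema=True,
)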
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available in Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
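A quick sketch of checking the version in an environment where only the 1.x-style sc is available:
# SparkContext.version reports the running Spark version
print(sc.version)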

Resources