Saving data in Elasticsearch using PySpark [duplicate] - apache-spark

I have a program that takes a dataframe and should save it into Elasticsearch. Here's what it looks like when I save the dataframe:
model_df.write.format("org.elasticsearch.spark.sql") \
    .option("pushdown", True) \
    .option("es.nodes", "example.server:9200") \
    .option("es.index.auto.create", True) \
    .mode("append") \
    .save("EPTestIndex/")
When I run my program, I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o96.save.
: java.lang.ClassNotFoundException: Failed to find data source:
org.elasticsearch.spark.sql. Please find packages at
http://spark.apache.org/third-party-projects.html
I did some research and thought I needed a jar, so I added these configurations to my SparkSession:
spark = SparkSession.builder.config("jars", "/Users/public/ProjectDirectory/lib/elasticsearch-spark-20_2.11-6.0.1.jar")\
.getOrCreate()
sqlContext = SQLContext(spark)
I initialize the SparkSession in main and write to ES in another package. The package takes the dataframe and runs the write command above. However, even with this I am still getting the same ClassNotFoundException. What might be the issue?
I am running this program in PyCharm. How can I make it so that PyCharm is able to run it?

Elasticsearch exposes a JSON API, and a pandas dataframe is not a JSON-supported type.
If you had to insert it, you could serialize the dataframe using dataframe.to_json().
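To make the serialization idea concrete, here is a minimal sketch that turns rows into the newline-delimited JSON body that Elasticsearch's bulk endpoint accepts. It assumes the rows are already plain Python dicts (for a pandas dataframe, something like dataframe.to_dict(orient="records") would produce them). The helper name to_bulk_ndjson, the sample rows, and the index name ep_test_index are made up for illustration, and the sketch only builds the request body; it does not send anything.

```python
import json


def to_bulk_ndjson(records, index_name):
    """Build an Elasticsearch _bulk request body (NDJSON) from a list of dicts."""
    lines = []
    for record in records:
        # Action line: asks Elasticsearch to index the following document.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself as a single-line JSON object.
        lines.append(json.dumps(record))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows standing in for dataframe.to_dict(orient="records").
rows = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.75}]
body = to_bulk_ndjson(rows, "ep_test_index")
print(body)
```

The resulting string could then be POSTed to the cluster's _bulk endpoint with the Content-Type header set to application/x-ndjson.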

Related

How to resolve invalid column name on parquet file read itself in PySpark

I set up a standalone Spark cluster and a standalone HDFS.
I installed PySpark and was able to create a Spark session.
I uploaded one Parquet file to HDFS under /data: hdfs://localhost:9000/data
I tried to create a dataframe from this directory using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting an invalid column name error even with withColumnRenamed.
I also tried the following code, but I got the same error:
from pyspark.sql.functions import col
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have the means to change the column names manually (with pandas) or to use a different file entirely, but I want to know if there is a way to solve this problem in Spark.
What am I doing wrong?
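The exact error text is not shown above. Assuming it comes from Spark's check that rejects Parquet column names containing any of the characters space, comma, semicolon, braces, parentheses, newline, tab, or equals sign, the manual renaming route mentioned in the question could be sketched like this. The helper names sanitize_column_name and rename_map and the sample column names other than "Wafer ID" are hypothetical.

```python
import re

# Characters that Spark's Parquet support is assumed to reject in column names.
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name, replacement="_"):
    """Replace every assumed-invalid character with `replacement`."""
    return _INVALID_CHARS.sub(replacement, name)


def rename_map(columns):
    """Map each original column name to its sanitized form."""
    return {name: sanitize_column_name(name) for name in columns}


# "Lot;No" and "value" are made-up names for illustration.
mapping = rename_map(["Wafer ID", "Lot;No", "value"])
print(mapping)
```

With pandas, the mapping could be applied via df.rename(columns=rename_map(df.columns)) before the file is written again, so that Spark never sees the original names.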

Loading Data from Azure Synapse Database into a DataFrame with Notebook

I am attempting to load data from Azure Synapse DW into a dataframe as shown in the image.
However, I'm getting the following error:
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Traceback (most recent call last):
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Any thoughts on what I'm doing wrong?
That particular method has been renamed to synapsesql (as per the notes here) and, as I understand it, is currently Scala-only. The correct syntax would therefore be:
%%spark
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
It is possible to share the Scala dataframe with Python via the createOrReplaceTempView method, but I'm not sure how efficient that is. Mixing and matching is described here. So for your example you could mix and match Scala and Python like this:
Cell 1
%%spark
// Get table from dedicated SQL pool and assign it to a dataframe with Scala
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
// Save the dataframe as a temp view so it's accessible from PySpark
df.createOrReplaceTempView("someTable")
Cell 2
%%pyspark
## Scala dataframe is now accessible from PySpark
df = spark.sql("select * from someTable")
## !!TODO do some work in PySpark
## ...
The above linked example shows how to write the dataframe back to the dedicated SQL pool too if required.
This is a good article on importing and exporting data with Synapse notebooks, and the limitation is described in the Constraints section:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export#constraints

Error when attempting to read Parquet in Spark

I am using PySpark with Spark 2.4.3.
I read the CSV, make a dataframe from it, and write it to Parquet just fine. The third line is what breaks.
df = spark.read.csv("file.csv", header=True)
df.write.parquet("result_parquet")
parquetFile = spark.read.parquet("result_parquet")
I am getting this:
Py4JJavaError: An error occurred while calling o1312.parquet.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
What am I doing wrong? I got the line straight from the Spark documentation https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#loading-data-programmatically
The problem was that I was using Java 11, which Spark 2.4.3 does not fully support. I uninstalled it, installed Java 8, and now it works.
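The number in the error message points at the Java release directly: since Java 5 (class file major version 49), each release has raised the major version by one, so 52 is Java 8 and 55 is Java 11. A small sketch of that arithmetic follows; the function names are made up for illustration.

```python
import re


def java_release_for_class_version(major):
    """Return the Java release that produces class files with this major version."""
    if major < 49:
        raise ValueError("class file versions below 49 predate Java 5")
    # From Java 5 (major version 49) onward, release = major version - 44.
    return major - 44


def release_from_error(message):
    """Pull the major version out of an 'Unsupported class file major version N' message."""
    match = re.search(r"major version (\d+)", message)
    if match is None:
        return None
    return java_release_for_class_version(int(match.group(1)))


error = "java.lang.IllegalArgumentException: Unsupported class file major version 55"
print(release_from_error(error))
```

Running java -version in the same environment that launches Spark shows which release is actually being picked up.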

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things about those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the following error in my code:
cannot resolve symbol phoenix
I have included the following jars in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find any example of the suggested method. I have not been able to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of load, as shown below:
val df = sparkSession.sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("zkUrl", "localhost:2181")
  .option("table", "TABLE1")
  .load()
It's late to answer, but here's what I did to solve a similar problem (a different method not found, plus the deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to the project. Also add the Phoenix Spark jars. Make sure to use compatible versions of Spark and phoenix-spark: Spark 2.0+ is compatible with phoenix-spark 4.10+ (maven-central-link). This resolved the NoSuchMethodError.
2.) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame instead. For reference, see "Load as a DataFrame directly using a Configuration object".

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (such as df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but for one reason or another it is not available on your system (or classpath).
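The deferred read described above explains why the schema shows up fine while show() fails. The toy class below mimics that behaviour in plain Python; it is only an analogy and not how Spark is implemented. LazyFrame, broken_loader, and the sample schema are all made up for illustration.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (not Spark's implementation)."""

    def __init__(self, schema, loader):
        self.schema = schema    # known up front, like the Avro schema
        self._loader = loader   # callable that actually reads the rows

    def show(self):
        # Only an action triggers the read, so a missing dependency surfaces here.
        for row in self._loader():
            print(row)


def broken_loader():
    # Hypothetical failure standing in for the missing AvroWrapper class.
    raise ImportError("org/apache/avro/mapred/AvroWrapper")


df = LazyFrame(schema=["id", "name"], loader=broken_loader)  # construction succeeds
print(df.schema)

try:
    df.show()
except ImportError as exc:
    print("failed at show():", exc)
```

Construction succeeds because nothing has touched the data yet; the error only appears once an action forces the loader to run, which matches the behaviour reported in the question.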
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. I'm not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.
