How to Convert Great Expectations DataFrame to Apache Spark DataFrame - apache-spark

The following code converts an Apache Spark DataFrame to a Great Expectations DataFrame. For example, to convert the Spark DataFrame spkDF to a Great Expectations DataFrame I would do the following:
from great_expectations.dataset import SparkDFDataset
ge_df = SparkDFDataset(spkDF)
Can someone let me know how to convert a Great Expectations DataFrame to a Spark DataFrame?
In other words, what would I need to do to convert the new Great Expectations DataFrame ge_df back to a Spark DataFrame?

According to the official documentation, the class SparkDFDataset holds the original PySpark DataFrame:
This class holds an attribute spark_df which is a spark.sql.DataFrame.
So you should be able to access it with:
ge_df.spark_df
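For completeness, a minimal round-trip sketch based on the answer above (assuming spkDF is an existing Spark DataFrame; the example expectation and the column name "id" are illustrative only):
from great_expectations.dataset import SparkDFDataset

ge_df = SparkDFDataset(spkDF)          # wrap the existing Spark DataFrame for validation
ge_df.expect_column_to_exist("id")     # example expectation; "id" is an assumed column name
spark_df_again = ge_df.spark_df        # the original spark.sql.DataFrame, unchanged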

Related

Converting between Spark df, parquet object and pandas df

I converted a parquet file to pandas without issue, but had issues converting parquet to a Spark df and converting the Spark df to pandas.
After creating a Spark session, I ran this code:
spark_df = spark.read.parquet('summarydata.parquet')
spark_df.select('*').toPandas()
It returns an error.
Alternatively, with a parquet object (pd.read_table('summary data.parquet')), how can I convert it to a Spark df?
The reason I need both a Spark df and a pandas df is that for some smaller DataFrames I want to easily use various pandas EDA functions, but for some bigger ones I need to use Spark SQL. And turning parquet into pandas first and then into a Spark df seems like a bit of a detour.
To convert a pandas DataFrame into a Spark DataFrame and vice versa, you can use PyArrow, an in-memory columnar data format that Spark uses to efficiently transfer data between JVM and Python processes.
Arrow is available as an optimization when converting a Spark DataFrame to a pandas DataFrame with toPandas() and when creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these calls, users first need to set the Spark configuration spark.sql.execution.arrow.enabled to true. This is disabled by default.
In addition, optimizations enabled by spark.sql.execution.arrow.enabled can automatically fall back to a non-Arrow implementation if an error occurs before the actual computation within Spark. This is controlled by spark.sql.execution.arrow.fallback.enabled.
For more details, refer to the PySpark Usage Guide for Pandas with Apache Arrow.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()

XML parsing on Spark Structured Streaming

I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
I created a DataFrame as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load()
Later I converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract a few fields from the df.xml_data column using XPath. Can you please suggest any possible solution?
If I create a DataFrame directly from these XML files, as in xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata"), I'm able to query using XPath:
xml_df.select("Analytics.Amount1").show()
But I'm not sure how to extract elements similarly on a Spark Streaming DataFrame where the data is in text format.
Are there any XML functions to convert text data using a schema? I saw an example for JSON data using from_json.
Is it possible to use spark.read on a DataFrame column?
I need to find the aggregated "Amount1" for every 5-minute window.
Thanks for your help.
You can use com.databricks.spark.xml.XmlReader to read XML data from a column, but it requires an RDD, which means you need to transform your df to an RDD using df.rdd, which may impact performance.
Below is untested code (Scala-style):
import com.databricks.spark.xml.XmlReader
val xmlRdd = kinDF.select("xml_data").rdd.map(r => r.getString(0))
val xmlDf = new XmlReader().xmlRdd(spark, xmlRdd)
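Alternatively, a hedged PySpark sketch: Spark SQL ships built-in xpath_* functions (for example xpath_string) that can pull fields out of an XML string column without dropping down to an RDD, which also keeps the code usable on a streaming DataFrame. The XPath expression and the event-time column below are assumptions, not taken from the question:
from pyspark.sql.functions import expr, window, col

# Pull "Amount1" out of the XML text with Spark SQL's built-in xpath_string;
# the XPath '/Consumers/Analytics/Amount1' is an assumption about the document layout.
parsed = df.select(
    col("approximateArrivalTimestamp").alias("event_time"),  # Kinesis source column, assumed present
    expr("xpath_string(xml_data, '/Consumers/Analytics/Amount1')").cast("double").alias("Amount1"),
)

# Aggregate Amount1 over 5-minute windows, as asked in the question
# (a watermark may be required depending on the output mode).
agg = parsed.groupBy(window(col("event_time"), "5 minutes")).sum("Amount1")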

How to convert a Spark DataFrame into a Databricks Koalas DataFrame?

I know that you can convert a Spark DataFrame df into a pandas DataFrame with
df.toPandas()
However, this takes very long, so I found out about the Koalas package in Databricks that could enable me to use the data like a pandas DataFrame (for instance, being able to use scikit-learn) without actually having a pandas DataFrame. I already have the Spark DataFrame, but I cannot find a way to make it into a Koalas one.
To go straight from a PySpark DataFrame (I am assuming that is what you are working with) to a Koalas DataFrame you can use:
import databricks.koalas as ks
koalas_df = ks.DataFrame(your_pyspark_df)
Here I've imported Koalas as ks.
Well, first of all, you have to understand the reason why toPandas() takes so long:
Spark DataFrames are distributed across different nodes, and when you run toPandas() it pulls the distributed DataFrame back to the driver node (that's the reason it takes a long time).
You are then able to use pandas or scikit-learn on the single (driver) node for faster analysis and modeling, because it's like modeling on your own PC.
Koalas is the pandas API on Spark, and when you convert to a Koalas DataFrame it's still distributed, so it will not shuffle data between different nodes, and you can use pandas-like syntax for distributed DataFrame transformations.
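A minimal sketch of that point, assuming spark_df is an existing PySpark DataFrame:
import databricks.koalas as ks

kdf = ks.DataFrame(spark_df)       # wrap the Spark DataFrame; the data stays distributed
print(kdf.describe())              # pandas-style summary, computed by Spark under the hood
spark_df_again = kdf.to_spark()    # convert back to a plain Spark DataFrame when needed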

How to save pandas DataFrame to Hive within `foreachPartition` (PySpark)

I have a PySpark DataFrame. I want to perform some function for each partition with foreachPartition and then save each result to Hive. The result is a pandas DataFrame (within each partition). What is the best way to do this?
I have tried the following without success (it gives a serialization error):
def processData(x):
    # do something that produces a pandas DataFrame for this partition
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.mode("append").format("parquet").saveAsTable("db.table_name")

original_spark_df.rdd.foreachPartition(processData)
I guess one solution would be to turn the pandas DataFrame into an RDD and return it (using mapPartitions instead of foreachPartition), and then use rdd.toDF() and saveAsTable().
Is there some solution to save the pandas DataFrame to Hive within foreachPartition?
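For reference, a minimal, untested sketch of the mapPartitions route described above; myProcessing, the column names, and the table name are placeholders, not from the question:
def processPartition(rows):
    # Hypothetical per-partition work that produces a pandas DataFrame
    pandas_df = myProcessing(list(rows))            # myProcessing is a placeholder
    # Yield plain tuples so Spark can rebuild an RDD of records
    for record in pandas_df.itertuples(index=False):
        yield tuple(record)

result_rdd = original_spark_df.rdd.mapPartitions(processPartition)
result_df = result_rdd.toDF(["col1", "col2"])       # column names are placeholders
result_df.write.mode("append").format("parquet").saveAsTable("db.table_name")
This keeps spark.createDataFrame and the Hive write on the driver, which is the likely fix for the serialization error caused by using the SparkSession inside executor code.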

Spark: How to create a DataFrame from a list of files in another DataFrame

In Apache Spark, if I have a DataFrame containing a list of CSV file paths, how can I create a DataFrame from the content of all the files listed in the first DataFrame?
From your description, I think the number of files should be small. You can just collect the file paths to the driver and use them to create the DataFrame. E.g.,
val filePathDF = sc.parallelize(Seq("a.txt", "b.txt", "c.txt")).toDF("path")
val df = sqlContext.read.text(filePathDF.collect().map(_.getString(0)): _*)
df.show()
text is a 1.6 API. If you are using a pre-1.6 Spark, you can use format("text").load(...) instead.
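For anyone working in PySpark rather than Scala, a rough equivalent sketch (assuming the path DataFrame, here called file_path_df, has a column named "path"):
# Collect the (small) list of file paths to the driver, then read them all at once
paths = [row["path"] for row in file_path_df.collect()]
df = spark.read.text(paths)
df.show()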
