PySpark pandas converting Excel to Delta Table fails

I am using the pyspark.pandas read_excel function to import data and saving the result to the metastore with to_table. It works fine with format='parquet'. However, the job hangs with format='delta'. The cluster sits idle after writing the Parquet files and never gets to writing the _delta_log (at least that is what it looks like).
Does anyone have a clue what might be happening?
I'm using Databricks Runtime 11.3, Spark 3.3.
I have also tried importing the Excel file with regular pandas, converting the pandas DataFrame to a Spark DataFrame with spark.createDataFrame, and then calling write.saveAsTable, again without success when the format is delta.
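For reference, a minimal sketch of the approach described above, with placeholder file and table names (the real paths and names are not given in the question):
import pyspark.pandas as ps
# Read the Excel file into a pandas-on-Spark DataFrame (path is a placeholder)
psdf = ps.read_excel("/dbfs/FileStore/input.xlsx")
# Writing to the metastore as Parquet completes normally
psdf.to_table("my_schema.my_table_parquet", format="parquet", mode="overwrite")
# Writing the same data with format='delta' is where the job hangs
psdf.to_table("my_schema.my_table_delta", format="delta", mode="overwrite")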

Related

Apache Spark Dataframes Not Being Created with Databricks

When I read in data from SQL in one notebook, a Spark DataFrame is created, but when I read in the same data from a different notebook I don't see a DataFrame.
So when I run the following in one notebook I get a dataframe as follows:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword}"
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')
That notebook shows the DataFrame output as expected.
However, when I run the same code in a different notebook I only get how long it took to run the code, but no DataFrame.
Any thoughts?
I should mention that the notebook where the DataFrame isn't appearing is on the Community Edition of Databricks. However, I don't think that should be the reason why I'm not seeing a DataFrame or schema.
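One possible explanation (an assumption, not confirmed in the post): spark.read.jdbc only defines the DataFrame lazily, so nothing is rendered unless the notebook cell explicitly displays it, for example:
# Connection settings as in the question
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword}"
# spark.read.jdbc is lazy: it defines the DataFrame but renders nothing by itself
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')
# Explicitly ask for output so the notebook shows the schema and some rows
my_sales.printSchema()
my_sales.show(20)
# display(my_sales)  # Databricks-specific rendering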

Upsert into table using Merge in Scala using Databricks

With Databricks Delta Table you can upsert data from a source table, view, or DataFrame into a target Delta table using the merge operation. This operation is similar to the SQL MERGE INTO command but has additional support for deletes and extra conditions in updates, inserts, and deletes.
I can successfully carryout a Merge using the following Python code:
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, delta_path)
(deltaTable
.alias("t")
.merge(loanUpdates.alias("s"), "t.loan_id = s.loan_id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
However, I need to use Scala, so can someone provide the code that will do the same in Scala? Basically, I need help converting the Python code to Scala.
There are examples provided here, https://docs.databricks.com/delta/delta-update.html#language-scala, however I would like to be able to reuse the Python code above.
Based on your comment, loanUpdates is a string, but it needs to be a DataFrame. You can load a CSV into Spark using:
val loanUpdatesDf = spark.read.csv(loanUpdates)
You will probably need to pass further options to read the CSV correctly.

converting between spark df, parquet object and pandas df

I converted a Parquet file to pandas without issue, but had trouble converting Parquet to a Spark DataFrame and converting the Spark DataFrame to pandas.
After creating a Spark session, I ran this code:
spark_df=spark.read.parquet('summarydata.parquet')
spark_df.select('*').toPandas()
It returns an error.
Alternatively, with a parquet object (pd.read_table('summary data.parquet')), how can I convert it to a Spark DataFrame?
The reason I need both a Spark DataFrame and a pandas DataFrame is that for some smaller DataFrames I want to easily use various pandas EDA functions, but for some bigger ones I need to use Spark SQL. And turning Parquet into pandas first and then into a Spark DataFrame seems like a bit of a detour.
To convert a pandas DataFrame into a Spark DataFrame (and vice versa) efficiently, you can use PyArrow, an in-memory columnar data format that Spark uses to transfer data between the JVM and Python processes.
Arrow is available as an optimization when converting a Spark DataFrame to a pandas DataFrame with toPandas() and when creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these calls, first set the Spark configuration spark.sql.execution.arrow.enabled to true. This is disabled by default.
In addition, the optimizations enabled by spark.sql.execution.arrow.enabled can automatically fall back to a non-Arrow implementation if an error occurs before the actual computation within Spark. This behaviour is controlled by spark.sql.execution.arrow.fallback.enabled.
For more details, refer to the PySpark Usage Guide for Pandas with Apache Arrow.
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a Pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
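Applied to the question itself, a short sketch using the file from the post (whether it fixes the original error depends on what that error actually was):
# Read the Parquet file directly with Spark -- no pandas detour needed
spark_df = spark.read.parquet('summarydata.parquet')
# Convert to pandas only when the result is small enough to fit on the driver
pdf = spark_df.toPandas()
# Going the other way: a pandas DataFrame loaded from Parquet -> a Spark DataFrame
import pandas as pd
pdf_small = pd.read_parquet('summarydata.parquet')
spark_df2 = spark.createDataFrame(pdf_small)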

How to convert a spark dataframe into a Databricks koalas dataframe?

I know that you can convert a spark dataframe df into a pandas dataframe with
df.toPandas()
However, this is taking very long, so I found out about the Koalas package in Databricks that could enable me to use the data like a pandas dataframe (for instance, being able to use scikit-learn) without having a pandas dataframe. I already have the Spark dataframe, but I cannot find a way to make it into a Koalas one.
To go straight from a pyspark dataframe (I am assuming that is what you are working with) to a koalas dataframe you can use:
koalas_df = ks.DataFrame(your_pyspark_df)
Here I've imported koalas as ks.
Well, first of all, you have to understand why toPandas() takes so long:
A Spark DataFrame is distributed across the nodes of the cluster. When you run toPandas(), the distributed DataFrame is pulled back to the driver node, which is why it takes a long time. You are then able to use pandas or scikit-learn on that single (driver) node for faster analysis and modeling, because it is like modeling on your own PC.
Koalas is the pandas API on Spark. When you convert a Spark DataFrame to a Koalas DataFrame, it stays distributed, so no data is shuffled between nodes, and you can use pandas-like syntax for distributed DataFrame transformations.
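A self-contained sketch of the first answer's approach (the DataFrame below is a stand-in for your own; on newer Databricks runtimes, pyspark.pandas has since replaced Koalas):
import databricks.koalas as ks
# A stand-in Spark DataFrame -- replace with your own
your_pyspark_df = spark.range(100).withColumnRenamed("id", "value")
# Wrap the Spark DataFrame in a Koalas DataFrame; the data stays distributed
koalas_df = ks.DataFrame(your_pyspark_df)
# pandas-like syntax now runs against the distributed data
print(koalas_df["value"].describe())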

Spark dataframe saveAsTable not truncating data from Hive table

I am using Spark 2.1.0 and the Java SparkSession to run my Spark SQL.
I am trying to save a Dataset<Row> named ds into a Hive table named schema_name.tbl_name using overwrite mode.
But when I run the statement below,
ds.write().mode(SaveMode.Overwrite)
.option("header","true")
.option("truncate", "true")
.saveAsTable(ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME));
the table gets dropped after the first run.
When I rerun it, the table is created with the data loaded.
Even using the truncate option didn't resolve my issue. Does saveAsTable support truncating the data instead of dropping and recreating the table? If so, what is the correct way to do it in Java?
This is the Apache JIRA ticket for my question; it seems to be unresolved to date:
https://issues.apache.org/jira/browse/SPARK-21036
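One possible workaround (a sketch in PySpark syntax, though the question uses Java; the equivalent DataFrameWriter methods exist in the Java API): insertInto with overwrite mode replaces the data in an existing table while keeping its definition, instead of dropping and recreating it, whereas the truncate option is honored by the JDBC writer rather than by saveAsTable.
# Overwrite the data in an existing Hive table without dropping the table itself.
# Note: insertInto matches columns by position, so the DataFrame's column order
# must match the table's column order.
ds.write.mode("overwrite").insertInto("schema_name.tbl_name")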
