PySpark with a Teradata connection - apache-spark

Currently I'm using the following method to convert a SQL query into a pandas DataFrame:
import teradatasql as tsql
import pandas as pd

with tsql.connect(host="xxx",
                  user="xxx",
                  password="xxx") as connect:
    dataframe = pd.read_sql(query, connect)
I'm connected to a Teradata database that lives in a virtual machine. Can I do this with PySpark instead of Pandas? I tried following this post but always get an error related to the drivers, although I think this should be easy using teradatasql.
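For reference, the usual route for this in PySpark is Spark's JDBC data source together with the Teradata JDBC driver, rather than teradatasql itself. A minimal sketch, assuming the driver jar (terajdbc4.jar) has been downloaded separately and that the host, credentials, and query are placeholders:

from pyspark.sql import SparkSession

# Assumption: path to the Teradata JDBC driver jar obtained separately
spark = (SparkSession.builder
         .appName("teradata-read")
         .config("spark.jars", "/path/to/terajdbc4.jar")
         .getOrCreate())

query = "SELECT * FROM some_table"  # placeholder query

df = (spark.read.format("jdbc")
      .option("url", "jdbc:teradata://xxx")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("user", "xxx")
      .option("password", "xxx")
      .option("query", query)
      .load())

The driver class name and URL format follow Teradata's JDBC documentation; verify them against the driver version you are using.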

Related

PySpark pandas converting Excel to Delta Table Failed

I am using the pyspark.pandas read_excel function to import data and saving the result in the metastore using to_table. It works fine if format='parquet'. However, the job hangs if format='delta'. The cluster idles after creating the parquet files and does not proceed to write _delta_log (at least that's how it seems).
Do you have any clue what might be happening?
I'm using Databricks 11.3, Spark 3.3.
I have also tried importing the Excel file using regular pandas, converting the pandas DF to a Spark DF with spark.createDataFrame, and then calling write.saveAsTable, without success when the format is delta.
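For reference, a minimal sketch of the flow being described (the path and table name below are placeholders, not from the original post):

import pyspark.pandas as ps

# Placeholder path and table name
psdf = ps.read_excel("/dbfs/FileStore/input.xlsx")

# Works with format="parquet"; the post reports a hang with format="delta"
psdf.to_table("my_schema.my_table", format="delta", mode="overwrite")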

Can we load the data from a pandas dataframe to a Databricks table without spark.sql

I have a requirement to write data from a CSV/pandas dataframe to a Databricks table.
My Python code may not be running on a Databricks cluster; it may be running on an isolated standalone node. I am using the Databricks Python connector to select data from a Databricks table, and selects are working, but I am unable to load data from a CSV or pandas dataframe into Databricks.
Can I use the Databricks Python connector to bulk-load the data in a CSV/pandas dataframe into a Databricks table?
Below is the code snippet for getting the Databricks connection and performing selects on the standalone node using the databricks-python connector.
from databricks import sql

conn = sql.connect(server_hostname=self.server_name,
                   http_path=self.http_path,
                   access_token=self.access_token)

try:
    with conn.cursor() as cursor:
        cursor.execute(qry)
        return cursor.fetchall_arrow().to_pandas()
except Exception as e:
    print("Exception Occurred:" + str(e))
Note: my CSV file is on Azure ADLS Gen2 storage. I am reading this file to create a pandas dataframe. All I need is to either load the data from pandas into a Databricks Delta table, or read the CSV file and load the data into a Delta table. Can this be achieved using the databricks-python connector instead of Spark?
Can this be achieved using databricks-python connector instead of using spark?
The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Databricks clusters and Databricks SQL warehouses.
So the Databricks SQL Connector for Python offers no way to convert a pandas DataFrame into a Delta table.
Coming to the second part of your question: is there any other way to convert a pandas DataFrame into a Delta table without using spark.sql?
Since Delta Lake is tied to Spark, as far as I know there is no way to convert a pandas DataFrame into a Delta table without using Spark.
Alternatively, I suggest reading the file as a Spark DataFrame and then converting it into Delta format using the code below.
val file_location = "/mnt/tables/data.csv"

val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("sep", ",")
  .load(file_location)

df.write.mode("overwrite").format("delta").saveAsTable(table_name)
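Since the question is about Python, a rough PySpark equivalent of the Scala snippet above (same placeholder path and table name) would be:

file_location = "/mnt/tables/data.csv"

df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("sep", ",")
      .load(file_location))

df.write.mode("overwrite").format("delta").saveAsTable(table_name)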

(PySpark) Problem when reading data from a local computer

When I use PySpark to read data (a 4 GB DAT file) from my computer everything is fine, but when I use PySpark to read data from another computer in my company connected by LAN, the following error occurs:
Py4JJavaError: An error occurred while calling o304.csv.
: java.io.IOException: No FileSystem for scheme: null
If I use pandas.read_csv to read the file from that computer, everything is fine (the problem only occurs with PySpark). Please help with this case. Thanks!
My code to read data on my computer (no problem occurs):
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
path='V04R-V04R-SQLData.dat'
df = spark.read.option("delimiter", "\t").csv(path)
My code to read data from the other computer (the problem occurs):
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
path='//8LWK8X1/Data/Subfolder1/V04R-V04R-SQLData.dat'
df = spark.read.option("delimiter", "\t").csv(path)
Note:
8LWK8X1 is the name of the computer on the LAN
Read with pandas and convert that to a PySpark DataFrame - easy solution :)
Loading into a pandas DF:
import pandas as pd

# the .dat file above is tab-delimited
gam_charge_item_df = pd.read_csv(path, delimiter="\t")
Creating a PySpark DataFrame:
spark_df = spark.createDataFrame(gam_charge_item_df)
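Alternatively, the "No FileSystem for scheme: null" error suggests that Spark/Hadoop cannot infer a filesystem scheme from the UNC-style path. A hedged sketch of another workaround, assuming the share \\8LWK8X1\Data has first been mapped to a Windows drive letter (Z: here is an assumption), is to use an explicit file: scheme:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Assumption: \\8LWK8X1\Data was mapped to drive Z: beforehand
path = "file:///Z:/Subfolder1/V04R-V04R-SQLData.dat"
df = spark.read.option("delimiter", "\t").csv(path)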

converting between spark df, parquet object and pandas df

I converted a Parquet file to pandas without issue, but had issues converting Parquet to a Spark DF and converting the Spark DF to pandas.
After creating a Spark session, I ran this code:
spark_df=spark.read.parquet('summarydata.parquet')
spark_df.select('*').toPandas()
It returns an error.
Alternatively, with a Parquet object (pd.read_table('summary data.parquet')), how can I convert it to a Spark DF?
The reason I need both a Spark DF and a pandas DF is that for some smaller DataFrames I want to easily use various pandas EDA functions, but for some bigger ones I need Spark SQL. And turning Parquet into pandas first and then into a Spark DF seems a bit of a detour.
To convert a pandas DataFrame into a Spark DataFrame and vice versa efficiently, you can use PyArrow, an in-memory columnar data format that Spark uses to transfer data between JVM and Python processes.
Arrow is available as an optimization when converting a Spark DataFrame to a Pandas DataFrame using the call toPandas() and when creating a Spark DataFrame from a Pandas DataFrame with createDataFrame(pandas_df). To use Arrow when executing these calls, users need to first set the Spark configuration spark.sql.execution.arrow.enabled to true. This is disabled by default.
In addition, optimizations enabled by spark.sql.execution.arrow.enabled could fallback automatically to non-Arrow optimization implementation if an error occurs before the actual computation within Spark. This can be controlled by spark.sql.execution.arrow.fallback.enabled.
For more details, refer to the PySpark Usage Guide for Pandas with Apache Arrow.
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a Pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()

Is it possible to use PySpark to insert data into couchbase?

Is it possible to use PySpark to load CSV data into Couchbase? I was on the Couchbase website, but I didn't see support for PySpark, only plain Python.
