Glue not able to recognize Delta Lake Python Library

I am trying to use the Delta Lake Python library in my Glue job. However, the job does not recognize it, and I get the error "NameError: name 'DeltaTable' is not defined". Per the Glue Delta Lake documentation, I added the parameter --datalake-formats = delta and also updated the required Spark configuration:
.config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog")
My code fails at the line below:
deltaTable = DeltaTable.forPath(self.spark,self.dest_path_sdad)
Any ideas?

These configuration properties set up Glue for the Delta Lake file format, so you can write spark.read.format("delta").load(...) or df.write.format("delta").save(...). But they don't provide the Python API, which is available as the delta-spark package. It can be made available to Glue with the --additional-python-modules option.

I was missing the import statement:
from delta.tables import *
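As a minimal sketch of the fix, the helper below imports DeltaTable explicitly instead of using a wildcard import, so a missing module surfaces as an ImportError rather than a NameError. The function name load_delta_table is hypothetical, and the sketch assumes the delta module is importable in the job (for example through the options discussed above).

```python
def load_delta_table(spark, path):
    """Return a DeltaTable handle for an existing Delta table at `path`.

    Hypothetical helper. The import is done inside the function so that a
    missing delta-spark package fails here with a clear ImportError.
    """
    from delta.tables import DeltaTable  # explicit name instead of `import *`

    return DeltaTable.forPath(spark, path)


# Usage inside the job (names taken from the question):
# deltaTable = load_delta_table(self.spark, self.dest_path_sdad)
```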

Related

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
This error usually occurs when the folder already contains data in another format, for example if you previously wrote Parquet or CSV files into it. Remove the folder completely and try again.
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`
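The error message says that no transaction log is present. Delta Lake keeps that log in a _delta_log subdirectory of the table path, so one way to see which case applies before writing is to inspect the folder. The sketch below is only an illustration: classify_folder is a hypothetical helper, and it assumes the folder is reachable through a local filesystem path (on Databricks that would typically mean a /dbfs/... path, which is an assumption about your setup).

```python
from pathlib import Path


def classify_folder(path):
    """Classify a target folder as 'empty', 'delta', or 'non-delta'.

    Hypothetical helper: assumes `path` is visible on the local filesystem.
    """
    target = Path(path)
    # A missing or empty folder is safe to write a new Delta table into.
    if not target.is_dir() or not any(target.iterdir()):
        return "empty"
    # A Delta table keeps its transaction log under _delta_log.
    if (target / "_delta_log").is_dir():
        return "delta"
    # Files exist but there is no log: this is the case that triggers the error.
    return "non-delta"
```

A result of "non-delta" matches the situation in the answers above: either remove the folder or convert the existing Parquet files in place.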

How to get File/Files create by Spark df.write?

I have a requirement to capture the parquet files created as the outcome of a df.write.parquet("s3://bkt/folder", mode="append") command.
I am running this on AWS EMR pyspark.
I can achieve this with awswrangler using wr.s3.to_parquet(), but that does not really fit my EMR Spark use case.
Is there such functionality?
I want a list of the files in s3://bkt/folder that Spark wrote.
Thx all
If you want a list of the files that Spark wrote to a particular S3 path, you can use either of the approaches below.
Use input_file_name, which gives the path of the file each record originates from, and select the distinct file names:
from pyspark.sql.functions import input_file_name
df = spark.read.parquet("s3://bkt/folder")
df.select(input_file_name().alias("filename")).distinct().show(truncate=False)
Or use boto3 to list the files:
from boto3 import client
conn = client('s3')  # assumes AWS credentials are already configured
for key in conn.list_objects(Bucket='bucket_name')['Contents']:
    print(key['Key'])
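Both approaches list every file under the path, not only the ones added by the latest append. One possible way to narrow it down, sketched here under assumptions, is to list the keys before and after the write and take the difference. The helper names list_keys and new_parquet_keys are hypothetical, the client is expected to be a boto3 S3 client, and the result is only reliable if no other job writes to the same prefix at the same time.

```python
def list_keys(s3_client, bucket, prefix):
    """Return the set of object keys under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


def new_parquet_keys(before, after):
    """Return the sorted parquet keys present in `after` but not in `before`."""
    return sorted(k for k in after - before if k.endswith(".parquet"))


# Usage sketch (bucket and prefix taken from the question):
# import boto3
# s3 = boto3.client("s3")
# before = list_keys(s3, "bkt", "folder/")
# df.write.parquet("s3://bkt/folder", mode="append")
# after = list_keys(s3, "bkt", "folder/")
# print(new_parquet_keys(before, after))
```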

Pyspark: Delta table as stream source, How to do it?

I am facing an issue with readStream on a Delta table.
The expected behaviour, per the following link:
https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-stream-source
Ex:
spark.readStream.format("delta").table("events")  # as expected, should work fine
The issue: I have tried the same in the following way:
df.write.format("delta").saveAsTable("deltatable")  # saved the DataFrame as a Delta table
spark.readStream.format("delta").table("deltatable")  # called readStream
Error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'DataStreamReader' object has no attribute 'table'
Note:
I am running it on localhost, using the PyCharm IDE.
I installed the latest version of pyspark: Spark version 2.4.5, Scala version 2.11.12.
The DataStreamReader.table and DataStreamWriter.table methods are not in Apache Spark yet. Currently you need to use a Databricks notebook in order to call them.
Try again with the Delta Lake 0.7.0 release, which provides support for registering your tables with the Hive metastore. As mentioned in a comment, most of the Delta Lake examples used a folder path, because metastore support wasn't integrated before this.
Also note that for the open-source version of Delta Lake it is best to follow the docs at https://docs.delta.io/latest/index.html
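The folder-path style mentioned above can be sketched as follows. This is a minimal illustration, not a tested recipe for Spark 2.4.5: read_delta_stream is a hypothetical helper, the path in the usage comment is made up, and the sketch assumes the session already has a compatible Delta Lake package on its classpath.

```python
def read_delta_stream(spark, table_path):
    """Open a Delta table as a streaming source by folder path.

    Uses load(path) instead of table(name), so it does not rely on
    DataStreamReader.table being available.
    """
    return spark.readStream.format("delta").load(table_path)


# Usage sketch (hypothetical path):
# df.write.format("delta").save("/tmp/delta/deltatable")
# stream_df = read_delta_stream(spark, "/tmp/delta/deltatable")
```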

reading a csv file from azure blob storage with PySpark

I'm trying to do a machine learning project using a PySpark HDInsight cluster on Microsoft Azure. To operate on my cluster I use a Jupyter notebook. Also, I have my data (a csv file) stored on Azure Blob storage.
According to the documentation the syntax of the path to my file is:
path = 'wasb[s]://springboard#6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
However, when I try to read the csv file with the following command:
csvFile = spark.read.csv(path, header=True, inferSchema=True)
I get the following error:
'java.net.URISyntaxException: Illegal character in scheme name at index 4: wasb[s]://springboard#6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
Here is a screenshot of what the error looks like in the notebook:
Any ideas on how to fix this?
It is either (unencrypted):
wasb://...
or (encrypted):
wasbs://...
not
wasb[s]://...
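The "[s]" in the documentation is notation for an optional letter, not part of the URI. A small check like the one below can catch the literal form before the path reaches Spark. It is only a sketch: wasb_scheme is a hypothetical helper, and the container and account names in the usage comment are placeholders.

```python
import re

# Accept exactly "wasb://" or "wasbs://" at the start of the path.
_WASB_SCHEME = re.compile(r"^(wasbs?)://")


def wasb_scheme(path):
    """Return 'wasb' or 'wasbs' for a valid path, else raise ValueError."""
    match = _WASB_SCHEME.match(path)
    if match is None:
        raise ValueError(
            "path must start with wasb:// or wasbs://, got: " + path
        )
    return match.group(1)


# Usage sketch (placeholder names):
# wasb_scheme("wasbs://container@account.blob.core.windows.net/data.csv")
# wasb_scheme("wasb[s]://...")  raises ValueError
```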

load csv and set parameters in jupyter notebook on Azure ML

I'm using a Python 3.4 Jupyter notebook to load a dataset in Azure ML which is stored in the cloud as a dataset in the Azure ML project environment. But using the default template created by Azure ML, I can't load the data due to a mixed datatypes error.
from azureml import Workspace
import pandas as pd
ws = Workspace()
ds = ws.datasets['rossmann-train.csv']
df = ds.to_dataframe()
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/kernel/main.py:6: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
In my local environment I just import the dataset as follows:
df = pd.read_csv('train.csv',low_memory=False)
But I'm not sure how to do this in azure using the ds object.
df = pd.read_csv(ds)
and
pd.DataFrame.from_csv(ds)
raise the error:
OSError: Expected file path name or file-like object, got type
*edit: more info on the ds object:
In [1]: type(ds)
Out [1]: azureml.SourceDataset
In [2]: print (ds)
Out [2]: rossmann-train.csv
First of all, I am not sure from your question what the ds object is. But I'm pretty sure it is not a CSV file, since if it were, you would have processed it yourself and would not be asking this question.
Now, I am not sure whether pandas has a native way of dealing with Azure, but this piece of documentation indicates that you must first download the data from Azure, using their package, and save it to your local file system.
But that assumes the data you downloaded is already in CSV format. If not, use the appropriate reader (or parse it by hand) to tabulate the data for a pandas.DataFrame.
According to the docs for the azureml library, one workaround would be to import the file as text and then parse it as CSV, but this seems unnecessary since the data is already recognised as having a CSV structure.
text_data = ds.read_as_text()
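The text workaround can be completed by wrapping the string in a file-like object, which is what pd.read_csv expects, and passing the same low_memory=False option used locally. This is a sketch under the assumption that read_as_text() returns the whole CSV as a single string; text_to_dataframe is a hypothetical helper.

```python
import io

import pandas as pd


def text_to_dataframe(csv_text):
    """Parse CSV text into a DataFrame, reading the whole input at once.

    low_memory=False makes pandas infer each column's type from all rows,
    which avoids the mixed-types DtypeWarning from chunked inference.
    """
    return pd.read_csv(io.StringIO(csv_text), low_memory=False)


# Usage with the objects from the question:
# text_data = ds.read_as_text()
# df = text_to_dataframe(text_data)
```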
