Unable to read Databricks Delta / Parquet File with Delta Format - apache-spark

I am trying to read a delta / parquet file in Databricks using the following code:
df3 = spark.read.format("delta").load('/mnt/lake/CUR/CURATED/origination/company/opportunities_final/curorigination.presentation.parquet')
However, I'm getting the following error:
A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: curorigination.presentation.parquet
This seemed very straightforward, but I'm not sure why I'm getting the error.
Any thoughts?
The file structure looks like the following

The error shows that Delta Lake thinks you have incorrect partition path naming.
If your delta table has any partition columns, for example year, month and day, the path on disk will look like /mnt/lake/CUR/CURATED/origination/company/opportunities_final/year=yyyy/month=mm/day=dd/curorigination.presentation.parquet, and you only need to do df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final"), i.e. load the table root, not a file inside it.
If you just want to read it as parquet, you can simply do df = spark.read.parquet("/mnt/lake/CUR/CURATED/origination/company/opportunities_final"), because you don't need to point at the absolute path of an individual parquet file.
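For illustration, here is a minimal sketch of loading the table root and filtering on partition columns, reusing the year/month/day example from above (these are hypothetical partition names, not details taken from the actual table):

# Load the Delta table by its root directory, not by a file inside it
df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final")

# Filtering on partition columns (hypothetical names) lets Delta prune partitions automatically
df_filtered = df.filter("year = 2023 AND month = 1")
df_filtered.show()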

The above error mainly happens because of the incorrect path format: curorigination.presentation.parquet points at a file rather than the table root. Please check your delta location and also check whether the delta table has been created or not:
%fs ls /mnt/lake/CUR/CURATED/origination/company/opportunities_final/
I reproduced the same thing in my environment. First, I created a dataframe from a parquet file.
df1 = spark.read.format("parquet").load("/FileStore/tables/")
display(df1)
After that I converted the parquet data into delta format and saved it to this location: /mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1.
df1.coalesce(1).write.format('delta').mode("overwrite").save("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
# Reading the delta file
df3 = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
display(df3)
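As a quick check (a sketch, assuming the same mount path as above), you can confirm the directory really is a Delta table by looking for the _delta_log folder:

# A Delta table directory always contains a _delta_log folder with the transaction log
files = dbutils.fs.ls("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
print(any(f.name.rstrip("/") == "_delta_log" for f in files))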

Related

reading delta table specific file in folder

I am trying to read a specific file from a folder which contains multiple delta files. Please refer to the attached screenshot.
The reason is that I am looking to read the delta file based on the schema version; the folder mentioned above contains files with different schema structures.
Code snippet for writing a file:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/home/games/Documents/test_delta/")
Code for reading a delta file:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
path_to_data = '/home/games/Documents/test_delta/_delta_log/00000000000000000001.json'
df = spark.read.format("delta").load(path_to_data)
df.show()
Error:
org.apache.spark.sql.delta.DeltaAnalysisException: /home/games/Documents/test_delta/_delta_log/ is not a Delta table.
You should point at the table root (not a file under _delta_log) and use the versionAsOf option:
df = spark.read.format("delta").option("versionAsOf", 0).load("/home/games/Documents/test_delta/")
You can specify other versions instead of 0, depending upon how many times you have overwritten the data. You can also use timestamps. Please see the Delta quickstart for more info.
Also, the _delta_log folder actually contains the delta transaction log in JSON format, not the actual data. The data is in the parent folder (test_delta in your case). The files starting with part-0000 are the ones that contain the actual data; these are .parquet files. There are no files with a .delta extension.
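For illustration, a minimal sketch (assuming the table lives at /home/games/Documents/test_delta/ as in the question) of reading two different versions of the same table to compare their schemas:

# Read the table as it looked at version 0 (the first write)
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/home/games/Documents/test_delta/")

# Read the current version of the table
df_latest = spark.read.format("delta").load("/home/games/Documents/test_delta/")

# Compare the schemas of the two versions
df_v0.printSchema()
df_latest.printSchema()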

delta log (delta lake) - convert parquet to delta table issues

I'm trying to convert a parquet table to a delta table in Spark.
I'm able to convert the existing table into a delta table using the below command:
import io.delta.tables._
val deltaTable = DeltaTable.convertToDelta(spark, "parquet.`path`")
This created a _delta_log folder in the path, containing:
_last_checkpoint
00000000000000000000.checkpoint.parquet
00000000000000000000.json
We are good till now.
Next I try to insert data into the table; it inserts without any errors.
But I don't see any new .json file or a new checkpoint parquet file (after 10 inserts) in the _delta_log folder.
Instead, I see all the inserts in new parquet files.
They should be merged into one, right? And a new json file should be created after every insert.
What am I missing here? Any specific configuration?
Spark - 3.2.1
delta - delta-core_2.12-1.2.1
also using --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
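For illustration (shown in PySpark for consistency with the rest of this page), a minimal sketch of checking whether each insert actually produced a new commit in the transaction log; the path value is a placeholder for the same table location used in convertToDelta above:

from delta.tables import DeltaTable

# Placeholder: the same table location that was passed to convertToDelta
path = "/path/to/your/delta/table"
deltaTable = DeltaTable.forPath(spark, path)

# Each successful write should appear as a new version in the table history
deltaTable.history().select("version", "timestamp", "operation").show(truncate=False)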

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back to the same location. But when I write the file, I am getting a lot of files with different names.
Like this:
I am not sure why these files are created in the location I specified.
Also, another file with the name "new_location" was created after I performed the write operation
What I want is that, after reading the file from Azure Blob Storage, I should write it back to the same location with the same name as the original. But I am unable to do so. Please help me out as I am new to PySpark.
I have already mounted the storage, and now I am reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
Spark will save a partial csv file for each partition of your dataset. To generate a single csv file, you can convert it to a pandas dataframe, and then write it out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)
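If you prefer to stay in Spark, a rough alternative (a sketch, assuming the DBFS mount paths from the question; the temporary folder name is made up) is to write a single part file and then copy it to the exact name you want with dbutils.fs:

# Write the dataframe as a single part file into a temporary folder
tmp_dir = "/mnt/ndemo/nsalman/new_location_tmp"
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

# Find the generated part file and copy it to the desired file name
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, "/mnt/ndemo/nsalman/addresses.csv")
dbutils.fs.rm(tmp_dir, recurse=True)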

PySpark/DataBricks: How to read parquet files using 'file:///' and not 'dbfs'

I am trying to use petastorm in a different manner which requires that I tell it where my parquet files are stored through one of the following:
hdfs://some_hdfs_cluster/user/yevgeni/parquet8, or file:///tmp/mydataset, or s3://bucket/mydataset, or gs://bucket/mydataset. Since I am on DataBricks and given other constraints, my option is to use the file:/// option.
However, I am at a loss as to how to specify the location of my parquet files. I keep getting rejected with a "Path does not exist" error:
Here is what I am doing:
# save spark df to parquet
dbutils.fs.rm('dbfs:/mnt/team01/assembled_train.parquet', recurse=True)
assembled_train.write.parquet('dbfs:/mnt/team01/assembled_train')
# look at files
display(dbutils.fs.ls('mnt/team01/assembled_train/'))
# results
path name size
dbfs:/mnt/team01/assembled_train/_SUCCESS _SUCCESS 0
dbfs:/mnt/team01/assembled_train/_committed_2150262571233317067 _committed_2150262571233317067 856
dbfs:/mnt/team01/assembled_train/_started_2150262571233317067 _started_2150262571233317067 0
dbfs:/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet 578991
dbfs:/mnt/team01/assembled_train/part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet 579640
dbfs:/mnt/team01/assembled_train/part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet 580675
dbfs:/mnt/team01/assembled_train/part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet 579483
dbfs:/mnt/team01/assembled_train/part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet 578807
dbfs:/mnt/team01/assembled_train/part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet 580942
dbfs:/mnt/team01/assembled_train/part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet 579202
dbfs:/mnt/team01/assembled_train/part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet 579810
While testing with a basic dataframe load from the file structure, like so:
df1 = spark.read.option("header", "true").parquet('file:///mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')
I get a "file does not exist" error.
You just need to specify the path as it is, no need for 'file:///':
df1 = spark.read.option("header", "true").parquet('/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')
If this doesn't work, try the methods in https://docs.databricks.com/applications/machine-learning/load-data/petastorm.html#configure-cache-directory
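If a library genuinely requires a local file:// path, as petastorm does, one option is the DBFS FUSE mount that Databricks exposes at /dbfs on the driver. A minimal sketch (assuming the mount path from the question and that the FUSE mount is available in your workspace):

import os

# DBFS paths are also visible on the driver's local filesystem under /dbfs
local_path = "/dbfs/mnt/team01/assembled_train"
print(os.path.exists(local_path))  # sanity check that the mount is visible locally

# Build a file:// URL for libraries that cannot read dbfs:/ paths directly
dataset_url = "file://" + local_path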

Loading Parquet File into a Hive Table stored as Parquet fail(values are null)

I am simply trying to create a table in Hive that is stored as a parquet file, transform the CSV file that holds the data into a parquet file, and then load it into the HDFS directory to insert the values. Below is the sequence I am following, but to no avail:
First I created a table in Hive:
CREATE external table if not EXISTS db1.managed_table55 (dummy string)
stored as parquet
location '/hadoop/db1/managed_table55';
Then I loaded a parquet file into the above HDFS location using this Spark code:
df=spark.read.csv("/user/use_this.csv", header='true')
df.write.save('/hadoop/db1/managed_table55/test.parquet', format="parquet")
It loads, but here is the output (all null values):
Here are the original values in the use_this.csv file that I transformed into a parquet file:
This shows that the table's folder (managed_table55) and the file (test.parquet) were created at the specified location:
Any ideas or suggestions as to why this keeps happening? I know there's probably a minor tweak, but I can't seem to identify it.
Since you are writing the parquet file to the location /hadoop/db1/managed_table55/test.parquet, try creating the table at that same location and then read the data from the Hive table.
Create Hive Table:
hive> CREATE external table if not EXISTS db1.managed_table55 (dummy string)
stored as parquet
location '/hadoop/db1/managed_table55/test.parquet';
Pyspark:
df=spark.read.csv("/user/use_this.csv", header='true')
df.write.save('/hadoop/db1/managed_table55/test.parquet', format="parquet")
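As a quick check (a sketch, assuming the table and paths from the answer above), you can confirm that the written files and the Hive table line up:

# Verify the parquet files were written where the table expects them
df_check = spark.read.parquet("/hadoop/db1/managed_table55/test.parquet")
df_check.show()

# Refresh the table metadata and query it through Spark SQL
spark.sql("REFRESH TABLE db1.managed_table55")
spark.sql("SELECT * FROM db1.managed_table55 LIMIT 10").show()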
