How can we view the column names and other metadata for a Databricks Delta table?

A Spark DataFrame has the .columns attribute:
dataFrame.columns
A Delta table does not. Note that the latter is backed by a directory of Parquet files, and Parquet is self-describing, so the column info is available at least in the files themselves. The column info should therefore be accessible from the Delta table; I just have not been able to find it, even by going deep into the table's protected/private attributes with a debugger. What is the right way to work with these constructs?

One way that I know of is using SQL syntax, as below; you could also run it through spark.sql.
DESCRIBE TABLE EXTENDED tablename
Executing the above command will give you all the details: the column names, data types, comments, the physical location of the Parquet files, partitioning information if any, and more.

Related

How can I extract information from parquet files with Spark/PySpark?

I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.
For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.
I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.
Any suggestions on how to create the index would be greatly appreciated.
This is already built into Spark SQL. Use DISTRIBUTE BY in SQL, or partitionBy in PySpark, before writing, and Spark will group the data as you wish on your behalf. Even if you don't use a partitioning strategy, Parquet supports predicate pushdown, which does lower-level filtering. (Actually, if you are using AWS you likely don't want partitioning and should stick with large files that rely on predicate pushdown, specifically because S3 scanning of directories is slow and should be avoided.)
Basically, great idea, but this is already in place.

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of the output (Parquet) files. Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has an embedded GUID/UUID:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
I would like to know if I can obtain this GUID/UUID value from the PySpark or SparkSQL function at run-time, to log/save/display this value in a text file?
I need to log this GUID/UUID value because I may need to later remove the files with this value as part of their names, for a manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column but then I end up with too many partitions, so it hurts performance. What I need is to somehow tag the files, for each data load job, so I can identify and delete them easily from S3, hence GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "s3a specific committer"? If so, it means they're using Netflix's trick of putting a GUID in each written file name to avoid eventual-consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or, for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
Spark already writes files with a UUID in their names. Instead of creating too many partitions you can set up custom file naming (e.g. add some ID). Maybe this is a solution for you - https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning) - https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag each file with the job ID (or whatever) and then run a cleanup job against those tags.
Both solutions lead to performing multiple S3 list-objects API requests to check tags/filenames and deleting the files one by one.
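For the logging part, pulling the UUID out of a written part-file name needs nothing Spark-specific. A plain-Python sketch (the job_uuid helper is hypothetical, not a Spark API):

```python
import re
from typing import Optional

# Spark part-file names embed a UUID for the write job, e.g.:
name = "part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet"

# Sketch: extract that UUID after a write so it can be logged and later
# used to find (and delete) every file written by the same job.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def job_uuid(filename: str) -> Optional[str]:
    m = UUID_RE.search(filename)
    return m.group(0) if m else None

print(job_uuid(name))  # -> 4fb6c57e-d43b-42bd-afe5-3970b3ae941c
```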

Spark HiveContext : Insert Overwrite the same table it is read from

I want to apply SCD1 and SCD2 using PySpark in HiveContext. In my approach, I am reading incremental data and target table. After reading, I am joining them for upsert approach. I am doing registerTempTable on all the source dataframes. I am trying to write final dataset into target table and I am facing the issue that Insert overwrite is not possible in the table it is read from.
Please suggest some solution for this. I do not want to write intermediate data into a physical table and read it again.
Is there any property or way to store the final dataset without keeping the dependency on the table it is read from? That way, it might be possible to overwrite the table.
Please suggest.
You should never overwrite a table from which you are reading. It can result in anything between data corruption and complete data loss in case of failure.
It is also important to point out that a correctly implemented SCD2 should never overwrite a whole table and can be implemented as a (mostly) append operation. As far as I am aware, SCD1 cannot be implemented efficiently without mutable storage, and is therefore not a good fit for Spark.
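To illustrate the append-only SCD2 idea, here is a minimal plain-Python sketch (the row layout with valid_from/valid_to columns is an assumption for illustration, not the asker's schema); in Spark, the closed and new versions would be appended to the target rather than overwriting it:

```python
from datetime import date

# Each row: (key, value, valid_from, valid_to); valid_to=None marks current.
def scd2_apply(dim, incoming, today):
    out, current_keys = [], set()
    for key, value, start, end in dim:
        if end is None and key in incoming and incoming[key] != value:
            out.append((key, value, start, today))         # close old version
            out.append((key, incoming[key], today, None))  # append new version
        else:
            out.append((key, value, start, end))           # keep row unchanged
        if end is None:
            current_keys.add(key)
    for key, value in incoming.items():                    # brand-new keys
        if key not in current_keys:
            out.append((key, value, today, None))
    return out

dim = [("c1", "addr-old", date(2020, 1, 1), None)]
incoming = {"c1": "addr-new", "c2": "addr-2"}
result = scd2_apply(dim, incoming, date(2021, 6, 1))
for row in result:
    print(row)
```

No existing row is mutated: a changed key gets its old version closed (valid_to set) and a new current version appended, which is what keeps the operation append-only.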
I was going through the Spark documentation when one property caught my attention.
Since my table was Parquet, I had Spark use the Hive metastore reader by setting this property to false:
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
This solution is working fine for me.

spark: dataframe.count yields way more rows than printing line by line or show()

New to Spark; using Databricks. Really puzzled.
I have this dataFrame: df.
df.count() yields Long = 5460
But if I print line by line:
df.collect.foreach(println) I get only 541 rows printed out. Similarly, df.show(5460) only shows 1017 rows. What could be the reason?
A related question: how can I save "df" with Databricks? And where does it save to? -- I tried to save before but couldn't find the file afterwards. I load the data by mounting an S3 bucket, if that's relevant.
Regarding your first question, Databricks output truncates by default. This applies both to text output in cells and to the output of display(). I would trust .count().
Regarding your second question, there are four types of places you can save on Databricks:
To Hive-managed tables using df.write.saveAsTable(). These will end up in an S3 bucket managed by Databricks, which is mounted to /user/hive/warehouse. Note that you will not have access to the AWS credentials to work with that bucket. However, you can use the Databricks file utilities (dbutils.fs.*) or the Hadoop filesystem APIs to work with the files, should you need to.
Local SSD storage. This is best done with persist() or cache() but, if you really need to, you can write to, for example, /tmp using df.write.save("/dbfs/tmp/...").
Your own S3 buckets, which you need to mount.
To /FileStore/, which is the only "directory" you can download from directly from your cluster. This is useful, for example, for writing CSV files you want to bring into Excel immediately: you write the file and output a "Download File" HTML link into your notebook.
For more details see the Databricks FileSystem Guide.
The difference could be bad source data. Spark is lazy by nature, so it doesn't parse and materialize every column just to count rows. The data may fail to parse only when you actually execute against it, or some rows may be null; your schema may disallow nulls in columns that turn out to be null once the data is fully parsed; or you may be modifying the data between your count, collect, and show. There is not enough detail here to tell for sure. You can open a Spark shell, create a small piece of data, and test those conditions by turning it into a DataFrame: change the schema to allow and disallow nulls, add nulls to the source data, or make the source data strings while the schema requires integers.
As far as saving your DataFrame: you create a DataFrameWriter with write, then define the file type you want to save as and the file name. This example saves a Parquet file; many other file types and write options are permitted here.
df.write.parquet("s3://myfile")

Updating values in apache parquet file

I have a quite hefty Parquet file where I need to change values in one of the columns. One way to do this would be to update those values in the source text files and recreate the Parquet file, but I'm wondering whether there is a less expensive and overall easier solution.
Lets start with basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
Does parquet support append operations?
Does the file system (namely, HDFS) allow append on files?
Can the job framework (Spark) implement append operations?
Answers:
parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (Not sure but this could potentially change in other implementations -- parquet design does support append)
HDFS allows append on files using the dfs.support.append property
The Spark framework does not support appending to existing parquet files, with no plans to; see this JIRA
It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
More details are here:
http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
http://bytepadding.com/linux/understanding-basics-of-filesystem/
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create Parquet files. You need to optimize how many rows of data go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which ones need to be updated. Generate a new Parquet file with amended data for each modified row group. It is more memory-efficient to work with one row group's worth of data at a time than with everything in the file.
C. Rebuild the original Parquet file by appending the unmodified row groups and the modified row groups generated by reading in one Parquet file per row group.
It's surprisingly fast to reassemble a Parquet file using row groups.
In theory it should be easy to append to an existing Parquet file: strip the footer (stats info), append new row groups, and write a new footer with updated stats. But there isn't an API/library that supports it.
Look at this nice blog which can answer your question and provide a method to perform updates using Spark(Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
Sometimes we need to edit data in our data structures (Parquet) that are immutable.
You can add partitions to Parquet files, but you can't edit the data in place.
But ultimately we can mutate the data; we just need to accept that we won't be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
If you want to incrementally append data in Parquet (you didn't ask this question, but it may still be useful to other readers),
Refer this well written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I haven't written those blogs; I just read them and found they might be useful for others.
You must re-create the file, this is the Hadoop way. Especially if the file is compressed.
Another approach (very common in big data) is to write the update to another Parquet (or ORC) file, then JOIN / UNION at query time.
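The JOIN/UNION-at-query-time pattern can be sketched with any SQL engine; here in-memory SQLite stands in for Spark SQL (the table and column names are made up):

```python
import sqlite3

# Big immutable "base" data plus a small corrections table written
# separately; at query time the corrected rows win.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE base (id INTEGER, price REAL)")
con.execute("CREATE TABLE updates (id INTEGER, price REAL)")
con.executemany("INSERT INTO base VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])
con.execute("INSERT INTO updates VALUES (2, 99.0)")

rows = con.execute("""
    SELECT id, price FROM base b
    WHERE NOT EXISTS (SELECT 1 FROM updates u WHERE u.id = b.id)
    UNION ALL
    SELECT id, price FROM updates
    ORDER BY id
""").fetchall()
print(rows)  # -> [(1, 10.0), (2, 99.0), (3, 30.0)]
```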
Well, in 2022, I strongly recommend using a lakehouse solution such as Delta Lake or Apache Iceberg. They take care of that for you.