Databricks: Incompatible format detected (temp view) - apache-spark

I am trying to create a temp view from a number of parquet files, but it does not work so far. As a first step, I am trying to create a dataframe by reading parquet files from a path. I want to load all of the parquet files into the df, but so far I don't even manage to load a single one, as you can see in the screenshot below. Can anyone help me out here? Thanks.
Info: batch_source_path is the string in column "path", row 1

Your data is in Delta format, and this is how you must read it:
data = spark.read.load('your_path_here', format='delta')
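Building on that answer, a minimal sketch of reading the directory as Delta and registering the temp view the question asks for (the path and view name are placeholders, not from the question):

data = spark.read.load('/mnt/your/delta/path', format='delta')  # assumption: the directory holds a Delta table
data.createOrReplaceTempView('my_temp_view')                    # hypothetical view name
spark.sql('SELECT * FROM my_temp_view LIMIT 10').show()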

Related

External table from existing data with additional column

This is my first question ever so thanks in advance for answering me.
I want to create an external table with Spark in Azure Databricks. I already have the data in my ADLS; it is automatically extracted from different sources every day. The folder structure on the storage is like ../filename/date/file.parquet.
I do not want to duplicate the files by saving their copy on another folder/container.
My problem is that I want to add a date column, extracted from the folder path, to the table without copying or changing the source files.
I am using Spark SQL to create the table.
CREATE TABLE IF NOT EXISTS my_ext_tbl
USING parquet
OPTIONS (path "/mnt/some-dir/source_files/")
Is there any proper way to add such a column in one easy and readable step, or do I have to read the raw data into a DataFrame, add the column and then save it as an external table to a different location?
I am aware that unmanaged tables store only metadata in DBFS. However, I am wondering whether this is even possible.
Hope it's clear.
EDIT:
Since it seems like there is no viable solution for this without copying or interfering with the source files, I would like to ask how you are handling such challenges.
EDIT2:
I think that link might provide a solution. The difference in my case is that the date inside the folder path is not a real partition; it's just a date added during the pipeline that extracts data from the external source.
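One possible approach (not from the thread; the path, regex and view name are assumptions) is to derive the date from the file path at read time with input_file_name() and expose the result as a view, so nothing is copied:

from pyspark.sql import functions as F

# Read the raw parquet files in place; no copy of the data is made
df = spark.read.parquet("/mnt/some-dir/source_files/")

# Derive the date from the folder path, e.g. .../filename/2023-01-15/file.parquet
df = df.withColumn(
    "source_date",
    F.to_date(F.regexp_extract(F.input_file_name(), r"/(\d{4}-\d{2}-\d{2})/", 1))
)

# Expose it as a temp view instead of materialising a new table
df.createOrReplaceTempView("my_ext_view")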

Big Query is not able to convert String to Timestamp

I have a BigQuery table where one of the columns (publishTs) is a TIMESTAMP. I am trying to upload a parquet file into the same table using the GCP UI BQ upload option; the file has the same column name (publishTs) but with a String datatype (e.g. "2021-08-24T16:06:21.122Z"), and BQ is complaining with the following error:
I am generating the parquet file using Apache Spark. I tried searching on the internet but could not find the answer.
Try generating this column as INT64 - link
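In Spark terms, a minimal sketch of what that could look like (the dataframe and output path are made up for illustration): cast the string to a real timestamp and have Parquet store it as INT64 micros rather than the legacy INT96 default, which BigQuery can then load into a TIMESTAMP column.

from pyspark.sql import functions as F

# Write timestamps as INT64 (timestamp-micros) instead of the legacy INT96 default
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

df = spark.createDataFrame([("2021-08-24T16:06:21.122Z",)], ["publishTs"])

# Cast the ISO-8601 string to a Spark timestamp before writing Parquet
df = df.withColumn("publishTs", F.to_timestamp("publishTs"))

df.write.mode("overwrite").parquet("/tmp/bq_upload")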

How to find out whether Spark table is parquet or delta?

I have a database with some tables in parquet format and others in delta format. If I want to append data to a table, I need to specify the format when the table is in delta format (the default is parquet).
How can I determine a table's format?
I tried show tblproperties <tbl_name> but this gives an empty result.
According to the Delta Lake API docs, you can check:
from delta.tables import DeltaTable
DeltaTable.isDeltaTable(spark, "path")
Please see the note in the documentation
This uses the active SparkSession in the current thread to read the table data. Hence, this throws an error if the active SparkSession has not been set, that is, if SparkSession.getActiveSession() is empty.
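As a usage sketch (not from the answer; the helper name, the incoming df and the ambient spark session are assumptions), the check can drive the append the question describes:

from delta.tables import DeltaTable

def append_to_table(df, path):
    # Choose the write format based on what is actually stored at `path`
    fmt = "delta" if DeltaTable.isDeltaTable(spark, path) else "parquet"
    df.write.format(fmt).mode("append").save(path)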

Databricks delta table truncating column data containing '-'

I am using a delta table to load data from my dataframe. I am observing that column values which have a '-' in them are getting truncated. I checked the records in the dataframe that I am loading by writing them to a csv file, and I don't see this issue in the csv file.
Even on doing a DESCRIBE DETAIL DB_NAME.TABLE_NAME, I can see that the createdAt and lastModified columns have this same issue, as shown in the attached screenshot. This seems like some issue with the way the table data is being displayed. Can anyone let me know how to get this fixed?

Load single column from csv file

I have a csv file that contains a large number of columns. I want to load just one column from that file using spark.
I know that we can use a select statement to filter a column. But what I want is for the read operation itself to load just that one column.
That way, I should be able to avoid the extra memory used by the other columns. Is there any way to do this?
Spark will load the complete file and parse it for columns. As you mentioned, you can use select to restrict the columns in the dataframe, so the dataframe will have only one column.
Spark will load the complete file into memory and will filter down to the column you want with the help of the select statement you mentioned.
This is because every read operation in Spark reads and scans the whole file, as a distributed stream reader gets created (the reader gets instantiated on every node where the data is stored).
If your problem is to read the data column-wise, then you can store the file in parquet format and read that file. Parquet is columnar storage and is meant exactly for this type of use case (you can verify it using explain).
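A small sketch of both points (the paths and column name are placeholders): select one column straight after reading the CSV, or convert once to Parquet so later reads only scan that column.

# Select a single column right after reading the CSV (the whole file is still scanned)
df = (spark.read
      .option("header", "true")
      .csv("/mnt/data/big_file.csv")
      .select("my_column"))

# If you can convert once to Parquet, later reads prune down to just that column
spark.read.option("header", "true").csv("/mnt/data/big_file.csv") \
    .write.mode("overwrite").parquet("/mnt/data/big_file_parquet")

only_one = spark.read.parquet("/mnt/data/big_file_parquet").select("my_column")
only_one.explain()  # ReadSchema in the plan shows only my_column being read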