Databricks delta table truncating column data containing '-' - apache-spark

I am using a delta table to load data from my dataframe, and I am observing that column values which contain a '-' are getting truncated. I checked the records in the dataframe I am loading by writing them to a CSV file, and I don't see this issue in the CSV.
Even when running a DESCRIBE DETAIL DB_NAME.TABLE_NAME, I can see that the createdAt and lastModified columns have this same issue, as shown in the attached screenshot. This seems like an issue with the way the table data is being displayed. Can anyone let me know how to get this fixed?
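One way to narrow this down is to check whether the stored values are actually truncated or only clipped by the display layer. A minimal PySpark sketch, reusing the question's DB_NAME.TABLE_NAME; the column name is hypothetical:

from pyspark.sql import functions as F

df = spark.table("DB_NAME.TABLE_NAME")  # table name from the question
# show() clips long strings to 20 characters by default; disable the clipping
df.select("some_column").show(truncate=False)  # "some_column" is hypothetical
# compare the maximum string length against what the source dataframe holds
df.select(F.max(F.length("some_column"))).show()

If the full values appear with truncate=False, the stored data is intact and only the notebook's rendering is clipping it.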

Related

Athena query displays data differently than in S3

The Athena query is changing a few data points to 0.
During a data sanity check I found that a particular column showed a huge difference between the dashboard and the S3 files: the value displayed on the dashboard was around 40k, while reading it after downloading the file from S3 gave around 80k.
Since I am querying the data directly from S3 using Athena, the data source is the same for Athena and for the file download. I am wondering why this is happening; any help would be appreciated.
E.g.: Athena query results vs. the data in S3 (screenshots omitted).
I queried the data through a simple select query:
SELECT "orderid", "orderdate", "total tax"
FROM gbc_owss
The data type in Athena for the "total tax" column was double.
EDIT: Solved the above issue. It was indeed a delimiter issue that was pushing values into the next column, which made it look like Athena was changing values, but that wasn't the case.
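For anyone hitting the same symptom, the failure mode is easy to reproduce: an unquoted delimiter inside a field shifts every later value one column to the right. A minimal, generic Python sketch (illustrative data, not the asker's):

import csv
import io

# one data row whose middle field contains an unquoted comma
raw = 'orderid,desc,total tax\n1,a,b,80000\n'
for row in csv.reader(io.StringIO(raw)):
    print(row)
# ['orderid', 'desc', 'total tax']
# ['1', 'a', 'b', '80000']  <- 'b' lands under "total tax",
#                              and 80000 spills into an extra column

Quoting fields that can contain the delimiter, or switching to a delimiter that never appears in the data, avoids the shift.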

External Table in Databricks is showing only future date data

I have a delta table in Databricks whose data lives in ADLS. The data is partitioned by a date column, and Parquet data is available in ADLS from 01-06-2022 onwards, but when I query the table in Databricks I can only see the latest day's data; older data is not displayed. Every day, data is overwritten to the table path with the partitioned date column.
df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))
Using overwrite mode deletes the past data and adds only the new data. This is the reason for your issue.
df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))
Using append mode appends the new data after the existing data. This keeps your existing data, so when you execute a query it will return the past records as well.
You need to use append mode in place of overwrite mode.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
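Note that if the same day's data can be re-delivered, a plain append will duplicate those rows. One common middle ground in Delta Lake is the replaceWhere option, which overwrites only the matching partition and leaves older dates untouched. A hedged sketch, assuming a hypothetical partition column named date_col and reusing the question's df, DELTALAKE_PATH and table:

df.write.format('delta') \
    .mode('overwrite') \
    .option('replaceWhere', "date_col = '2022-06-01'") \
    .save('{}/{}'.format(DELTALAKE_PATH, table))  # replaces only that day's partition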

Databricks: Incompatible format detected (temp view)

I am trying to create a temp view from a number of parquet files, but it has not worked so far. As a first step, I am trying to create a dataframe by reading the parquet files from a path. I want to load all the parquet files into the df, but so far I can't even load a single one, as you can see in the screenshot below. Can anyone help me out here? Thanks
Info: batch_source_path is the string in column "path", row 1
Your data is in Delta format, and this is how you must read it:
data = spark.read.load('your_path_here', format='delta')
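From there, the temp view the question asks for can be created directly from the dataframe. A minimal follow-up sketch; the view name is hypothetical:

data.createOrReplaceTempView('batch_source_view')  # hypothetical view name
spark.sql('SELECT * FROM batch_source_view LIMIT 10').show()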

Excel Data Queries - Ignore missing table / assign specific table number for every query

I am having a bit of trouble creating an automated report based on an HTML file. The file contains tables with data scraped from the web page, and I simply create Excel tables from the tables it recognizes. So far this does what I need, but sometimes one or more tables are missing from the HTML file, which causes the remaining tables to shuffle: if table 0 is missing, table 1 takes its place and breaks the entire sheet, because the wrong table ends up where table 0 should be.
What I wanted to know is whether there is a way to assign each query to a specific table number, so that Table 0 gets its value from the specified query rather than from whichever one comes first in the list of queries. The code so far in the Power Query Editor is:
let
    Source = Web.Page(File.Contents("D:\AUTO.html")),
    Data0 = Source{0}[Data]
in
    Data0
I use this code because the columns and rows will not always be the same; sometimes one can be missing, and if I use the original code that is generated when getting the data from the page, it throws errors and does not load the table when a column or row is missing.
Any help is appreciated.
MissingField.Ignore
When you use functions like Table.SelectColumns, Table.RenameColumns, or Table.ReorderColumns, you can pass the MissingField.Ignore option so that a missing field does not raise an error and stop your query.
E.g.:
= Table.SelectColumns(#"blah",{"column1", "column2", "column3"}, MissingField.Ignore)
documentation:
https://learn.microsoft.com/en-us/powerquery-m/missingfield-error

Altering CSV Rows in Azure Data Factory

I've tried to use the 'Alter Rows' transformation within a Data Flow in Azure Data Factory to remove rows that match a condition from a CSV dataset.
The Data Preview shows that the matched rows will be deleted; however, the next step, the 'sink', seems to ignore that and writes the original rows to the CSV file output.
Is it not possible to use Alter Rows on a CSV dataset, and if not, is there a workaround?
Alter Rows policies such as delete only take effect against sinks that support row-level operations (such as databases); a file sink simply writes whatever rows reach it. For a CSV output, drop the unwanted rows from the stream instead:
Firstly, use a 'union' transformation to combine your CSV files as the source.
Then, use a 'filter' transformation to remove the rows matching your date-time-stamp condition before the sink.
