I am ingesting a csv file from a mounted blob storage to delta live table and here's my initial query:
CREATE INCREMENTAL LIVE TABLE table_raw
COMMENT "Ingesting data from /mnt/foo"
TBLPROPERTIES ("quality" = "bronze")
AS
SELECT * FROM cloud_files("/mnt/foo/", "csv")
I need/want to transform its column names (e.g. remove punctuations, trim whitespaces, etc.). How should I approach it?
Related
I am making a data pipeline in Azure Synapse. I want to copy a 500 GB CSV file from a Blob container file and convert it into an Azure Data Lake Storage Gen2 table. Before I copy it into the table, I want to make some changes to the data using a Data Flow block, to change some column names and other transformations.
Is it possible to copy the data and make transformations implicitly, without a staging Parquet store ?
If yes, how do I make the transformations implicitly ? Ex: Remove dashes ("-") from all column names ?
You can use rule-based mapping in the select transformation to remove the Hyphen symbol from all the column names.
Matching condition: true()
Output column name expression: replace($$, '-','')
Select output:
I have a use case where the table columns will be changing [addition/deletion] at each refresh [currently its weekly refresh]. It is stored as delta format. is there any way that can we trach the version of these column addition/deletion like a kind of meta store.
Is there any where I can find such information in delta table or delta file format ?
I have a parquet file and created a new External table, but the performance is very slow as compare to a normal table in the synapse. Can you please let me know how to over come this.
Very broad question. So I'll give broad answer:
Use normal table. Hard to beat performance of "normal table" with external tables. "normal table" means a table created in a Dedicated SQL pool using CREATE TABLE. If you're querying data from one or more tables repeatedly and each query is different (group-by, join, selected columns) then you can't get beat performance of "normal" table with external tables.
Understand and apply basic best practices:
Use parquet format, which you're doing.
Pick right partition column and partition your data by storing partitions to different folders or file names.
If a query targets a single large file, you'll benefit from splitting it into multiple smaller files.
Try to keep your CSV (if using csv) file size between 100 MB and 10 GB.
Use correct data type.
Manually create statistics for CSV files
Use CETAS to enhance query performance and joins
...and many more.
a) The first step is to partition your Parquet File using a relevant partition column, such as Year, Month, and Date.
b) I recommend using a View rather than an external table as a second recommendation. External Tables don't support Partition Prunning and won't use the partition columns to eliminate unnecessary files during the read.
c) Assure that data types are enforced, and that string types are being used appropriately.
d) If possible, convert your Parquet file to Delta format. Synapse is able to read Partition columns from Delta without the need for the filepath() and filename() functions. External tables do not support Delta, only views.
Note: External tables doesn't support Parquet partition columns.
SELECT *,
CAST(fct.filepath(1) AS SMALLINT) AS SalesOrderPathYear,
CAST(fct.filepath(2) AS TINYINT) AS SalesOrderPathMonth,
CAST(fct.filepath(3) AS DATE) AS SalesOrderPathDate
FROM
OPENROWSET
(
BULK 'conformed/facts/factsales/*/*/*/*.parquet',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'Parquet'
) AS fct
WITH
(
ColA as String(10),
ColB as Integer,
ColC as ...
)
Ref: https://www.serverlesssql.com/certification/mastering-dp-500-exam-querying-partitioned-sources-in-azure-storage/
I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However my attempt failed since the actual files reside in S3 and even if I drop a hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or the only solution will be to drop the actual data and reload it with a newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out there is no need to drop the table. In fact the strongly recommended approach by databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
in spark SQL, This can be done easily by
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition
partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
df = spark.read.table(tn)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")
I have data stored in a parquet files and hive table partitioned by year, month, day. Thus, each parquet file is stored in /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.
I could use sql context and a send hive query with some select statement with where on the year, month and day columns to select only data from partitions i am interested in. However, i'd rather avoid constructing SQL query in python as I am very lazy and don't like reading SQL.
I have two questions:
what is the optimal way (performance-wise) to read in the data stored as parquet, where information about year, month, day is not present in the parquet file, but is only included in the path to the file? (either send hive query using sqlContext.sql('...'), or use read.parquet,... anything really.
Can i somehow extract the partitioning columns when using the
approach i outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there's partitions under it. However, it wouldn't know what to name the partitions without the directory structure /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the Dataframe API instead, e.g
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way optimal for parquet so you'd have to load files one by one and add the dates
Alternatively, you can move the files to a directory structure fit for parquet
( e.g. .../table/year=2018/month=10/day=29/file.parquet)
then you can read the parent directory (table) and filter on year, month, and day (and spark will only read the relevant directories) also you'd get these as attributes in your dataframe