I am trying to load a CSV file (without a header) into a Delta table following the "Load the sample data from cloud storage into the table" guideline, but I cannot find any instructions on how to define the source file schema/header.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Based on the delta-copy-into and FORMAT_OPTIONS docs, I assume enforceSchema would be the right option, but how do I provide the schema definition using the SQL API?
If the files don't have a header, then Spark will assign names automatically, like _c0, _c1, etc., and put them into the table. If you want to give them meaningful names, you need to use slightly different syntax: a SELECT inside COPY INTO, which gives you the ability to rename columns and do type casting if necessary. Like this (the cast is just an example):
COPY INTO my_table FROM (
SELECT _c0 as col1, cast(_c1 as date) as date, _c2 as col3, ...
FROM '/path/to/files'
)
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
P.S. I'm not sure that inferSchema is useful here, as you may need to do the casts anyway.
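For reference, here is a minimal sketch of the same approach without inferSchema, run from a notebook via spark.sql; the table name, column names, and types are placeholders rather than your actual schema:

# Sketch: COPY INTO with explicit casts instead of inferSchema.
# my_table, col1/col3 and the chosen types are placeholders.
spark.sql("""
COPY INTO my_table FROM (
  SELECT cast(_c0 as string) as col1,
         cast(_c1 as date)   as date,
         cast(_c2 as double) as col3
  FROM '/path/to/files'
)
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true')
""")

Because every column is cast explicitly, the inferSchema format option isn't needed.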
Related
I have created a new table from a CSV file with the following code:
%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;
create table if not exists catalog.schema.tablename;
COPY INTO catalog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
but now I have a new file with additional data. How can I load that? Please guide.
Thanks
Need to load a new data file into the Delta table
I tried to reproduce the same scenario in my environment and got the results below.
Make sure the table schema and the CSV file's data types match; otherwise you will get an error.
Follow the syntax below to insert data from a CSV file:
%sql
copy into <catalog>.<schema>.<table_name>
from "<file_loaction>/file_3.csv"
FILEFORMAT = csv
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true');
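Since the new file has additional columns, a hedged sketch of handling that is to keep 'mergeSchema' in COPY_OPTIONS so the Delta table's schema can evolve; COPY INTO is idempotent, so pointing it at the source folder only ingests files that have not been loaded before. The table name and path below are placeholders:

# Sketch: re-run COPY INTO against the source folder; only new files are loaded,
# and mergeSchema in COPY_OPTIONS lets the extra columns be added to the table.
# catalog.schema.tablename and the S3 path are placeholders.
spark.sql("""
COPY INTO catalog.schema.tablename
FROM 's3://bucket/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
""")

Note that schema evolution is controlled by 'mergeSchema' in COPY_OPTIONS; the same key in FORMAT_OPTIONS only merges the schema across the source files.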
I am trying to use the COPY INTO statement in Databricks to ingest CSV files from Cloud Storage.
The problem is that the folder name has a space in it (/AP Posted/), and when I try to refer to the path the code execution raises the error below:
Error in SQL statement: URISyntaxException: Illegal character in path at index 70: abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/
I googled the error and found articles advising to replace the space with "%20", but that solution is not effective.
So, does someone know how to solve this? Or is the only solution indeed to avoid spaces in folder names?
This is my current Databricks SQL Code:
COPY INTO prod_gbs_gpdi.bronze_data.my_table
FROM 'abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/'
FILEFORMAT = CSV
VALIDATE 500 ROWS
PATTERN = 'AP_SAPEX_KPI_001 - Posted Invoices in 2021_3.CSV'
FORMAT_OPTIONS(
'header'='true',
'delimiter'=';',
'skipRows'='8',
'mergeSchema'='true', --Whether to infer the schema across multiple files and to merge the schema of each file
'encoding'='UTF-8',
'enforceSchema'='true', --Whether to forcibly apply the specified or inferred schema to the CSV files
'ignoreLeadingWhiteSpace'='true',
'ignoreTrailingWhiteSpace'='true',
'mode'='PERMISSIVE' --Parser mode around handling malformed records
)
COPY_OPTIONS (
'force' = 'true', --If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
'mergeSchema'= 'true' --If set to true, the schema can be evolved according to the incoming data.
)
Trying the path where one of the folders has a space gave me the same error.
To overcome this, you can specify the folder in the PATTERN parameter as follows:
%sql
COPY INTO table1
FROM '/mnt/repro/op/'
FILEFORMAT = csv
PATTERN='has space/sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Alternatively, giving the path as path/has?space/ also works. But if there are multiple folders like has space, hasAspace, hasBspace, etc., then this would not work as expected.
%sql
COPY INTO table2
FROM '/mnt/repro/op/has?space/'
FILEFORMAT = csv
PATTERN='sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Another alternative is to copy the file to DBFS using dbutils.fs.cp() and then use the DBFS path in COPY INTO:
dbutils.fs.cp('/mnt/repro/op/has space/sample1.csv','/FileStore/tables/mycsv.csv')
%sql
COPY INTO table3
FROM '/FileStore/tables/'
FILEFORMAT = csv
PATTERN='mycsv.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
So, I'm trying to load Avro files into DLT, create pipelines, and so forth.
As a plain DataFrame in Databricks, I can read and unpack the Avro files using spark.read.json / rdd.map / a lambda function, then create a temp view, run a SQL query, and select the fields I want.
# example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
# selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am trying to replicate the process, but using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", which would then allow me to write queries.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions to make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working; when I try to select a column, it is not recognised.
# unpacked data
@dlt.view(name="eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data

# trying to query the data, but it does not recognise the field names, even when I select "data" only
@dlt.view(name="eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help would be much appreciated. Thank you.
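Not a definitive fix, but for comparison, here is a minimal sketch of how the unpacking can be expressed with DataFrame-only operations (from_json with an explicit schema) instead of the rdd.map round-trip, which DLT cannot track; the schema below is an assumption based on the fields used in the question:

import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Assumed shape of the JSON payload in Body; adjust to the real event schema.
payload_schema = StructType([
    StructField("data", StructType([
        StructField("field", StructType([
            StructField("test1", StringType()),
            StructField("test2", StringType()),
            StructField("fieldgrp", StructType([
                StructField("city", StringType()),
            ])),
        ])),
    ])),
])

@dlt.view(name="eventdata_v")
def eventdata_v():
    # Parse Body as JSON with an explicit schema instead of going through an RDD.
    return (
        dlt.read("bronze_file_list")
        .withColumn("payload", F.from_json(F.col("Body").cast("string"), payload_schema))
    )

@dlt.view(name="eventdata2_v")
def eventdata2_v():
    return dlt.read("eventdata_v").select(
        F.col("payload.data.field.test1").alias("col1"),
        F.col("payload.data.field.test2").alias("col2"),
        F.col("payload.data.field.fieldgrp.city").alias("city"),
    )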
I have set up a Synapse workspace and imported the COVID-19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine. So far I have only found a way to query data from an exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function, as shown below, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read with and without a filter on that column.
CREATE VIEW MyView
As
SELECT
*, filepath(1) as country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
SELECT * FROM MyView WHERE country_region = 'Afghanistan'
My S3 Bucket has multiple sub-directories that store data for multiple websites based on the day.
example:
bucket/2020-01-03/website1, and within this is where the CSVs are stored.
I am able to create tables based on each of the objects, but I want to create one consolidated table for all sub-directories/objects/data stored within the prefix bucket/2020-01-03, for all websites as well as all other dates.
I used the code below to create one table for a single website.
# Athena configuration
import boto3

athena = boto3.client('athena', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY,
                      region_name='us-west-2')
s3_input = 's3://bucket/2020-01-03/website1'
database = 'database1'
table = 'consolidated_table'
# Athena database and table definition
create_table = \
"""CREATE EXTERNAL TABLE IF NOT EXISTS `%s.%s` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
) LOCATION '%s'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1576774420');""" % ( database, table, s3_input )
athena.start_query_execution(QueryString=create_table,
WorkGroup = 'user_group',
QueryExecutionContext={'Database': 'database1'},
ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})
I also want to overwrite this table with new data from S3 every time I run it.
You can have a consolidated table for the files from different "directories" on S3 only if all of them adhere to the same data schema. As I can see from your CREATE EXTERNAL TABLE, each file contains 4 columns: website_id, user, action, and date. So you can simply change LOCATION to point to the root of your S3 "directory structure":
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
)
LOCATION 's3://bucket' -- instead of restricting it to s3://bucket/2020-01-03/website1
TBLPROPERTIES (
'skip.header.line.count'='1'
);
In this case, each Athena query will scan all files under the s3://bucket location, and you can use website_id and date in the WHERE clause to filter results. However, if you have a lot of data, you should consider partitioning; it will save you not only query execution time but also money (see this post).
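As an illustration only (the partition column names dt and website and the exact layout are assumptions), a partitioned variant could look like the following; because the folders are not in Hive-style key=value form, each partition has to be registered explicitly with ALTER TABLE ... ADD PARTITION:

# Sketch: partitioned variant of the consolidated table, reusing the boto3
# Athena client from the question. Partition columns dt/website are assumptions.
create_partitioned_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table_partitioned` (
  `website_id` string,
  `user` string,
  `action` string,
  `date` string
)
PARTITIONED BY (`dt` string, `website` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('escapeChar'='\\"', 'separatorChar'=',')
LOCATION 's3://bucket/'
TBLPROPERTIES ('skip.header.line.count'='1');
"""

# The folders are not in key=value form, so register each partition explicitly
# (one ALTER per date/website combination).
add_partition = """
ALTER TABLE `database1`.`consolidated_table_partitioned`
ADD IF NOT EXISTS PARTITION (dt = '2020-01-03', website = 'website1')
LOCATION 's3://bucket/2020-01-03/website1/';
"""

for query in (create_partitioned_table, add_partition):
    athena.start_query_execution(
        QueryString=query,
        WorkGroup='user_group',
        QueryExecutionContext={'Database': 'database1'},
        ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})

After that, a query with WHERE dt = '2020-01-03' AND website = 'website1' scans only the matching prefix.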
I also want to overwrite this table with new data from S3 every time I run it.
I assume you mean that every time you run an Athena query, it should scan the files on S3 even if they were added after you executed CREATE EXTERNAL TABLE. Note that CREATE EXTERNAL TABLE simply defines metadata about your data, i.e. where it is located on S3, the columns, etc. Thus, a query against a table with LOCATION 's3://bucket' (without partitioning) will always include all of your S3 files.
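For example (just an illustration, reusing the boto3 client from the question): once new objects land under s3://bucket, you simply run the query again and the new files are included, with no DDL needed to "refresh" the table.

# The table definition is only metadata: every query lists the files currently
# under LOCATION, so newly added objects are picked up automatically.
athena.start_query_execution(
    QueryString='SELECT website_id, COUNT(*) AS events FROM database1.consolidated_table GROUP BY website_id',
    WorkGroup='user_group',
    QueryExecutionContext={'Database': 'database1'},
    ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})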