U-SQL Azure Data Lake Analytics search files by date - azure

I have U-SQL script that uses file pattern to find files in Azure Data Lake and extracts some data from them:
DECLARE @input_file string = @"\data\{*}\{*}\{*}.avro";

@data =
    EXTRACT Column1 string,
            Column2 double
    FROM @input_file
    USING new MyExtractors.AvroExtractor();
File pattern is:
data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}
Problem: the custom extractor executes very slowly. I have many files in the Data Lake, and a single run takes 15 hours to process and costs USD $600. Too slow and too expensive.
I only need to extract fresh data from files that are no more than 90 days old. How can I filter out old files using the file pattern, the file's last-modified date, or any other technique?

You can leverage the GetMetadata activity in Azure Data Factory to retrieve the lastModifiedTime of the files.
Reference doc:
Get Metadata activity in Azure Data Factory
And there's a relevant post about incremental copy:
Azure data factory | incremental data load from SFTP to Blob
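For reference, a minimal sketch of what such a Get Metadata activity definition could look like in pipeline JSON (the activity and dataset names here are assumptions; lastModified and itemName are among the supported metadata fields):
{
    "name": "GetFileLastModified",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "AvroFileDataset", "type": "DatasetReference" },
        "fieldList": [ "lastModified", "itemName" ]
    }
}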

You could use the .AddDays method of DateTime.Now, although whether or not this actually filters out all your files is (I think) dependent on your custom extractor, e.g.
//DECLARE @input_file string = @"\data\{*}\{*}\{*}.csv";
DECLARE @input_file string = @"\data\{Namespace}-{EventHub}-{PartitionId}\{xdate:yyyy}-{xdate:MM}-{xdate:dd}\{Hour}-{Minute}-{Second}.csv";
// data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}

@input =
    EXTRACT Column1 string,
            Column2 double,
            xdate DateTime,
            Namespace string,
            EventHub string,
            PartitionId string,
            Hour int,
            Minute int,
            Second int
    FROM @input_file
    USING Extractors.Csv();
    //USING new MyExtractors.AvroExtractor();

@output =
    SELECT Column1,
           Column2
    FROM @input
    WHERE xdate > DateTime.Now.AddDays(-90);

OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
In my simple tests with .Csv this worked to reduce the input from 4 streams to 3, but as mentioned I'm not sure whether this will work with your custom extractor.

Related

Delta Live Tables and ingesting AVRO

So, I'm trying to load Avro files into DLT and create pipelines and so forth.
As a simple data frame in Databricks, I can read and unpack the Avro files using spark.read.json / rdd.map / a lambda function, where I can create a temp view, then do a SQL query and select the fields I want.
--example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
--selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am trying to replicate the process, but using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", which would then allow me to write queries against it.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions to make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working; when I try to select a column, it does not recognise it.
--unpacked data
@dlt.view(name=f"eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data
--trying to query the data, but it does not recognise field names, even when I select "data" only
@dlt.view(name=f"eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1 ")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help will be much appreciated. Thank you.
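A rough sketch of the kind of view described above, assuming the Event Hubs Body column carries JSON and that an explicit schema is supplied via from_json instead of the RDD-based spark.read.json (RDD operations are generally not available inside a DLT pipeline); the payload schema and field types below are assumptions to adjust to the real data:
import dlt
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Assumed schema for the JSON carried in the Body column; adjust to the real payload.
payload_schema = StructType([
    StructField("field", StructType([
        StructField("test1", StringType()),
        StructField("test2", StringType()),
        StructField("fieldgrp", StructType([
            StructField("city", StringType()),
        ])),
    ])),
])

@dlt.view(name="eventdata_v")
def eventdata_v():
    # Read the bronze table from the question and parse Body with an explicit schema
    # rather than spark.read.json over an RDD.
    return (
        dlt.read("bronze_file_list")
           .select(from_json(col("Body").cast("string"), payload_schema).alias("data"))
    )

@dlt.view(name="eventdata2_v")
def eventdata2_v():
    return dlt.read("eventdata_v").select(
        col("data.field.test1").alias("col1"),
        col("data.field.test2").alias("col2"),
        col("data.field.fieldgrp.city").alias("city"),
    )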

Resolve datatype conflicts with Copy command in databricks

I have a requirement to load data from a CSV file in ADLS Gen2 into a Delta table. I am using the COPY command for this. One of the columns in the target table, 'IS_ACTIVE', is defined as TINYINT. When triggering the code below, it fails with the following error.
Failed to merge fields 'IS_ACTIVE' and 'IS_ACTIVE'. Failed to merge incompatible data types ByteType and StringType
COPY INTO metadata.md_config_master
FROM 'abfss://{container}@{storage_account}.dfs.core.windows.net/table_folder/'
WITH (CREDENTIAL (AZURE_SAS_TOKEN = '<sas_token_string>')
)
FILEFORMAT = CSV
FILES = ('MD_CONFIG_MASTER.csv')
FORMAT_OPTIONS ('mergeSchema'='true', 'header' = 'true', 'inferSchema'='true')
COPY_OPTIONS ('force' = 'true', 'mergeSchema'= 'true')
When I did not use the 'inferSchema'='true' option in FORMAT_OPTIONS, it also failed with a data type mismatch on an integer column; when I used 'inferSchema'='true', that error disappeared.
But I still have the issue with the TINYINT column conversion.
When I create the target table with all string columns, the command succeeds.
Is there a way to make this code run? I did not define ByteType at all in my target table; maybe it is treating TINYINT as ByteType, I am not sure.
Note: I can read the ADLS file into a pandas DataFrame, convert it to a Spark DataFrame, and load the data into the target table, but that is not what I am looking for. I want the COPY command to work, hence I'm looking for a solution specifically for this COPY command.
Sample Schema:
CREATE TABLE IF NOT EXISTS metadata.MD_CONFIG_MASTER(
CONFIG_ID INT,
CLIENT_NAME STRING,
TARGET_DATABASE STRING,
TARGET_DATABASE_MODULE_NAME STRING,
TARGET_DATABASE_DRIVER_CLASS_NAME STRING,
TARGET_DATABASE_CNX_INFO STRING,
EXECUTION_PLATFORM STRING,
IS_ACTIVE TINYINT,
INSERT_DTS TIMESTAMP
) USING DELTA;
This error occurs when your data has any string values in that column.
I reproduced this with sample data by putting a string value in the Age column.
As you can see, I got the same error when I tried the COPY INTO code.
So, check whether your data has any string values in that column.
When I removed the string value and put an integer value in that column, I was able to copy the data.
The data was successfully copied to the Delta table.
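For illustration, a hypothetical reproduction of that situation (the column names and values below are made up, not taken from the original file): a non-numeric string in a numeric column makes inferSchema read the whole column as StringType, which then cannot be merged with the numeric target type.
-- CSV that triggers the merge error (string value in the numeric Age column)
Id,Name,Age
1,Alice,25
2,Bob,twenty
-- corrected CSV that copies successfully
Id,Name,Age
1,Alice,25
2,Bob,20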

Azure Synapse TSQL

I am new to Azure Synapse and have a question about how the files are set up in storage when creating an external table from a select. Would the files be overwritten, or would one need to truncate the files every time a CREATE EXTERNAL TABLE script is run? For example, if I run the following script:
CREATE EXTERNAL TABLE [dbo].[PopulationCETAS] WITH (
LOCATION = 'populationParquet/',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
) AS
SELECT
*
FROM
OPENROWSET(
BULK 'csv/population-unix/population.csv',
DATA_SOURCE = 'sqlondemanddemo',
FORMAT = 'CSV', PARSER_VERSION = '2.0'
) WITH (
CountryCode varchar(4),
CountryName varchar(64),
Year int,
PopulationCount int
) AS r;
Would the file created at
LOCATION = 'populationParquet/',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
be overwritten every time the script is run? Can this be specified at the time of setup or within the query options?
I would love to be able to drop the files in storage with a DELETE or TRUNCATE operation but this feature doesn’t currently exist within T-SQL. Please vote for this feature.
In the meantime you will need to use outside automation like an Azure Data Factory pipeline.
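As a rough sketch of the re-run pattern this implies (the drop and delete steps are assumptions about your setup, not something CETAS does for you): CETAS will not overwrite an existing table or a non-empty target folder, and dropping the external table removes only the metadata, so the files under LOCATION have to be cleared separately.
IF OBJECT_ID('dbo.PopulationCETAS') IS NOT NULL
    DROP EXTERNAL TABLE [dbo].[PopulationCETAS]; -- removes metadata only; the Parquet files remain
-- then delete the 'populationParquet/' folder outside T-SQL
-- (e.g. an Azure Data Factory Delete activity or a storage SDK/CLI step)
-- and only then re-run the CREATE EXTERNAL TABLE ... AS SELECT from the question.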

Azure Data lake analysis job failed reading data from Data lake store

I have a CSV file copied from Azure Blob storage to Azure Data Lake Store. The pipeline was established successfully and the file was copied.
I'm trying to write a U-SQL script, starting from the sample here:
Home -> datalakeanalysis1 -> Sample scripts -> New job
It shows me the default script.
//Define schema of file, must map all columns
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
TO @"/Samples/Output/SearchLog_output.tsv"
USING Outputters.Tsv();
Note: my file in the Data Lake Store is here:
Home -> dls1 -> Data explorer -> rdl1
How can I give the path of my CSV file in the script (my CSV file is stored in the Data Lake Store)?
Also, I would like to keep my destination (output) file in the Data Lake Store as well.
How can I modify my script to refer to the Data Lake Store path?
Edit:
I have changed my script as below:
//Define schema of file, must map all columns
@searchlog =
    EXTRACT ID1 int,
            ID2 int,
            Date DateTime,
            Rs string,
            Rs1 string,
            Number string,
            Direction string,
            ID3 int
    FROM @"adl://rdl1.azuredatalakestore.net/blob1/vehicle1_09142014_JR.csv"
    USING Extractors.Csv();

OUTPUT @searchlog
TO @"adl://rdl1.azuredatalakestore.net/blob1/vehicle1_09142014_JR1.csv"
USING Outputters.Csv();
However, my job fails with the attached error:
Moreover, I'm attaching the CSV file that I want to use in the job:
Sample CSV file
Is there anything wrong with the CSV file, or with my script?
Please help. Thanks.
I believe that while extracting data from the file you can pass in some additional parameters to ignore the header row
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extractor-parameters-u-sql#skipFirstNRows
@searchlog =
    EXTRACT ID1 int,
            ID2 int,
            Date DateTime,
            Rs string,
            Rs1 string,
            Number string,
            Direction string,
            ID3 int
    FROM @"adl://rdl1.azuredatalakestore.net/blob1/vehicle1_09142014_JR.csv"
    USING Extractors.Csv(skipFirstNRows:1);
Modifying the input file may not be possible in all scenarios, especially if the input file is being dropped by stakeholders you cannot control.
I followed your steps and reproduced your issue.
My sample data:
ID1,ID2,Date,Rs,Rs1,Number,Direction,ID3
1,1,9/14/2014 0:00,46.81006,-92.08174,51,S,1
1,2,9/14/2014 0:00,46.81006,-92.08174,13,NE,1
1,3,9/14/2014 0:00,46.81006,-92.08174,48,NE,1
1,4,9/14/2014 0:00,46.81006,-92.08174,30,W,1
Based on the error log, I found that it can't parse the header row. So I removed the header row and everything works fine.
Modified data:
1,1,9/14/2014 0:00,46.81006,-92.08174,51,S,1
1,2,9/14/2014 0:00,46.81006,-92.08174,13,NE,1
1,3,9/14/2014 0:00,46.81006,-92.08174,48,NE,1
1,4,9/14/2014 0:00,46.81006,-92.08174,30,W,1
U-SQL script:
//Define schema of file, must map all columns
@searchlog =
    EXTRACT ID1 int,
            ID2 int,
            Date DateTime,
            Rs string,
            Rs1 string,
            Number string,
            Direction string,
            ID3 int
    FROM @"/test/data.csv"
    USING Extractors.Csv();

OUTPUT @searchlog
TO @"/testOutput/dataOutput.csv"
USING Outputters.Csv();
Output:
Hope it helps you.

virtual file set column and rowset variable U-SQL

I'm having an issue with scheduling a job in Data Factory.
I'm trying to set up a job scheduled per hour which will execute the same script each hour with a different condition.
Consider that I have a bunch of Avro files spread across Azure Data Lake Store with the following pattern:
/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}{date:MM}{date:dd}__{date:H}
Each hour new files are added to Data Lake Store.
In order to process the files only once, I decided to handle them with the help of a U-SQL virtual file set column and a SyncTable which I created in Data Lake Store.
My query looks like the following:
DECLARE @file_set_path string = /Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H};

@result =
    EXTRACT [Id] long,
            ....
            date DateTime
    FROM @file_set_path
    USING someextractor;

@rdate =
    SELECT MAX(ProcessedDate) AS ProcessedDate
    FROM dbo.SyncTable
    WHERE EntityName == "SomeEntity";

@finalResult =
    SELECT [Id], ...
    FROM @result
         CROSS JOIN @rdate AS r
    WHERE date >= r.ProcessedDate;
Since I can't use a rowset variable in the WHERE clause, I'm cross joining the single row with the set; however, even in this case U-SQL won't find the correct files and always returns the whole file set.
Is there any workaround or another approach?
I think this approach should work unless there is something not quite right somewhere, i.e. can you confirm the datatypes of the dbo.SyncTable table? Dump out @rdate and make sure the value you get there is what you expect.
I put together a simple demo which worked as expected. My copy of SyncTable had one record with the value of 01/01/2018:
@working =
    SELECT *
    FROM (
        VALUES
        ( (int)1, DateTime.Parse("2017/12/31") ),
        ( (int)2, DateTime.Parse("2018/01/01") ),
        ( (int)3, DateTime.Parse("2018/02/01") )
    ) AS x ( id, someDate );

@rdate =
    SELECT MAX(ProcessedDate) AS maxDate
    FROM dbo.SyncTable;

//@output =
//    SELECT *
//    FROM @rdate;

@output =
    SELECT *, (w.someDate - r.maxDate).ToString() AS diff
    FROM @working AS w
         CROSS JOIN
         @rdate AS r
    WHERE w.someDate >= r.maxDate;

OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
I did try this with a filepath (full script here). The thing to remember is that the custom date format H represents the hour as a number from 0 to 23. If your SyncTable date does not have a time component when you insert it, it will default to midnight (0), meaning the whole day will be collected. Your file structure should look something like this according to your pattern:
"D:\Data Lake\USQLDataRoot\Data\SomeEntity\2017\12\31\SomeEntity_2017_12_31__8\test.csv"
I note your filepath has underscores in the second section and a double underscore before the hour section (which will be between 0 and 23, a single digit up until hour 10). I also notice your fileset path does not have a file type or quotes - I've used test.csv in my tests. My results:
Basically I think the approach will work, but there is something not quite right, maybe in your file structure, the value in your SyncTable, the datatype, etc. You need to go over the details, dumping out intermediate values to check, until you find the problem.
Doesn't the gist of wBob's full script resolve your issue? Here is a very slightly edited version of wBob's full script to address some of the issues you raised:
Ability to filter on SyncTable,
Last part of the pattern is a file name and not a folder. Sample file and structure: \Data\SomeEntity\2018\01\01\SomeEntity_2018_01_01__1
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";

@input =
    EXTRACT [Id] long,
            date DateTime
    FROM @file_set_path
    USING Extractors.Text();

// in lieu of creating actual table
@syncTable =
    SELECT * FROM
        ( VALUES
          ( "SomeEntity", new DateTime(2018,01,01,01,00,00) ),
          ( "AnotherEntity", new DateTime(2018,01,01,01,00,00) ),
          ( "SomeEntity", new DateTime(2018,01,01,00,00,00) ),
          ( "AnotherEntity", new DateTime(2018,01,01,00,00,00) ),
          ( "SomeEntity", new DateTime(2017,12,31,23,00,00) ),
          ( "AnotherEntity", new DateTime(2017,12,31,23,00,00) )
        ) AS x ( EntityName, ProcessedDate );

@rdate =
    SELECT MAX(ProcessedDate) AS maxDate
    FROM @syncTable
    WHERE EntityName == "SomeEntity";

@output =
    SELECT *,
           date.ToString() AS dateString
    FROM @input AS i
         CROSS JOIN
         @rdate AS r
    WHERE i.date >= r.maxDate;

OUTPUT @output
TO "/output/output.txt"
ORDER BY Id
USING Outputters.Text(quoting : false);
Also please note that file sets cannot perform partition elimination on dynamic joins, since the values are not known to the optimizer during the preparation phase.
I would suggest passing the sync point as a parameter from ADF to the processing script. Then the value is known to the optimizer and file set partition elimination will kick in. In the worst case, you would have to read the value from your sync table in a previous script and use it as a parameter in the next.
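A rough sketch of that suggestion, assuming a parameter named @syncPoint (the name and default value are illustrative; DECLARE EXTERNAL lets a caller such as the ADF U-SQL activity supply its own DECLARE to override the default):
// Overridable default; an ADF U-SQL activity can inject its own DECLARE @syncPoint.
DECLARE EXTERNAL @syncPoint DateTime = new DateTime(2018, 01, 01, 01, 00, 00);

DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";

@input =
    EXTRACT [Id] long,
            date DateTime
    FROM @file_set_path
    USING Extractors.Text();

// Comparing against a value known before execution lets the optimizer
// eliminate non-matching file set partitions up front.
@output =
    SELECT *
    FROM @input
    WHERE date >= @syncPoint;

OUTPUT @output
TO "/output/output.txt"
USING Outputters.Text(quoting : false);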
