How to specify a Date Range in U-SQL Extract statement


My input files are in a month directory, with the naming pattern
_.csv
I can create an extract to grab all files
@InputFile_Daily + "{*}.json"
However, I now need to create a file set covering a specific range of dates, e.g. Today -> Today-3.
Is there a way to specify this kind of range, by regex or otherwise, within the U-SQL EXTRACT? Or, as I've seen elsewhere, do I have to extract all the data and then filter the result down to the range I'm interested in? That is not ideal, as cost is a factor.

In U-SQL you extract all files, as you said (@InputFile_Daily + "{*}.json"), and then apply your date filter in the first SELECT. Internally, U-SQL then extracts only the data it needs.
Example:
DECLARE @input string = @"/temp/stackoverflow.json";
// @referenceDate, @dateFrom and @dateTo are assumed to be declared earlier in the script.
// Read input file
@inputData =
    EXTRACT Account string,
            Alias string,
            Company string,
            date DateTime,
            Json string
    FROM @input
    USING Extractors.Text(delimiter : '\n', quoting : false);
// Keep only the rows inside the requested date range
@extractedFields =
    SELECT Account,
           Alias,
           Company,
           date,
           Json
    FROM @inputData
    WHERE @referenceDate == DateTime.MinValue OR (date >= @dateFrom AND date <= @dateTo);
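If you have 1 million files and your filter matches only the most recent ones, say 5 files, only those 5 files are extracted. You can confirm this in the U-SQL job graph, which shows how many files were actually read. Note that this pruning relies on the filtered column being populated from the file path pattern (for example {date:yyyy}/{date:MM}/{date:dd}), so that the filter can be evaluated against the file names before any file is opened.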
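To make the idea concrete, here is a minimal conceptual sketch in Python (not U-SQL) of what pruning a file set by date range amounts to: the date is known from the path alone, so files outside the range are never opened. The file listing, the path layout and the helper name paths_in_range are hypothetical and only serve the illustration.

```python
from datetime import date, timedelta

def paths_in_range(paths_by_date, today, days_back):
    """Return only the paths whose date falls in [today - days_back, today]."""
    start = today - timedelta(days=days_back)
    return [p for d, p in sorted(paths_by_date.items()) if start <= d <= today]

# Hypothetical listing: one JSON file per day, keyed by the date in its path.
files = {date(2024, 1, d): f"/input/2024/01/{d:02d}.json" for d in range(1, 11)}

# "Today -> Today-3" selects four files; the other six are never read.
selected = paths_in_range(files, today=date(2024, 1, 10), days_back=3)
print(selected)
```

Only the four selected paths would then be handed to whatever reads the data, which is the behaviour the job graph should confirm for the U-SQL script.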

Related

Reformatting date values when using them as URL parameters in a PowerQuery API request

I have two dates in my Excel table with the following format: "dd-mm-yyyy". These dates need to be sent as URL query parameters to an API endpoint for getting some data using PowerQuery. However, the API endpoint does not accept dates in that format. Therefore, I need to convert them to the format "mm-dd-yyyy" instead for it to work.
For getting the values from my table, I use the following code:
let GetNamedRange=(NamedRange) =>
let
name = Excel.CurrentWorkbook(){[Name=NamedRange]}[Content],
value = name{0}[Column1]
in
value
in
GetNamedRange
This function, called "GetValue", is then called when inserting URL query parameters in my GET request:
Csv.Document(Web.Contents("my.api/leave/leavecsv", [Query = [periodStart = GetValue("periodStart"), periodEnd = GetValue("periodEnd"), department = GetValue("department")]]),[Delimiter=";", Columns=14, Encoding=1252, QuoteStyle=QuoteStyle.None])
Currently the cells for my dates are in Text format. I tried using Date.FromText(...) to format the dates, but I get an error saying the datetime format is invalid.
https://learn.microsoft.com/en-us/powerquery-m/date-fromtext
How can I properly format my date values before inserting them as URL query parameters using PowerQuery?

Ensure your dates are real dates and set to type date. Then you can use the Date.ToText function:
let
theDate = #date(2022,12,7),
output = Date.ToText(theDate,"MM-dd-yyyy")
in
output
If, for some reason, you must maintain your dates as text strings (I'd like to know why, if that's the case), you can convert them first to a "real" date, and then create the string:
let
theDate = "13-12-2022",
output = Date.ToText(Date.FromText(theDate, "en-150"),"MM-dd-yyyy")
in
output

Make sure you pass in both a culture and a format, e.g.
Date.FromText([Column1], [Format="dd-MM-yyyy", Culture="en-GB"])
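The same parse-with-an-explicit-format, then re-format idea can be sketched outside Power Query. The Python snippet below is only an illustration of the principle, not a replacement for the M code above; the helper name reformat_date is hypothetical.

```python
from datetime import datetime

def reformat_date(text, src="%d-%m-%Y", dst="%m-%d-%Y"):
    """Parse text using the explicit source format, then render it in the target format."""
    return datetime.strptime(text, src).strftime(dst)

print(reformat_date("13-12-2022"))  # 12-13-2022
```

Stating the source format explicitly avoids ambiguity for dates such as 07-12-2022, where day and month could otherwise be swapped silently.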

Join files from different paths in USQL

My data is saved on a daily basis in the following path: "/Data/{year}/{month}/{day}/mydata.json"
So, e.g. "/Data/2018/10/1/mydata.json" , "/Data/2018/10/2/mydata.json", "/Data/2018/11/1/mydata.json", "/Data/2018/12/5/mydata.json", etc.
I would like to combine all the months and days in one file using USQL. Is it possible to do it in an easy way without mentioning each path separately (otherwise it's crazy to do it for all the days of the year)?
At the moment I use this:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@a =
    EXTRACT EventCategory string
          , EventAction string
          , EventLabel string
    FROM "/Data/2018/10/2/mydata.json"
    USING new JsonExtractor()
    UNION ALL
    EXTRACT EventCategory string
          , EventAction string
          , EventLabel string
    FROM "/Data/2018/11/2/mydata.json"
    USING new JsonExtractor();
OUTPUT @a
TO "/Output/mydata.Csv"
USING Outputters.Csv(outputHeader:true);

Yes! You can do this using file set patterns. A basic example:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
DECLARE @input string = "/Data/2018/{*}/2/mydata.json";
USING Microsoft.Analytics.Samples.Formats.Json;
@a =
    EXTRACT EventCategory string
          , EventAction string
          , EventLabel string
    FROM @input
    USING new JsonExtractor();
OUTPUT @a
TO "/Output/mydata.Csv"
USING Outputters.Csv(outputHeader:true);
This loads the data for the second day of every month in 2018.
Other variations:
DECLARE @input string = "/Data/2018/{*}/{*}/mydata.json"; // processes all files from 2018
DECLARE @input string = "/Data/{*}/12/{*}/mydata.json"; // processes all files from the 12th month of every year
If you want to recover the date parts from the file path, you can do:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@a =
    EXTRACT EventCategory string
          , EventAction string
          , EventLabel string
          , date DateTime
    FROM "/Data/{date:yyyy}/{date:MM}/{date:dd}/mydata.json"
    USING new JsonExtractor();
OUTPUT @a
TO "/Output/mydata.Csv"
USING Outputters.Csv(outputHeader:true);
As you can see there is now an additional column date of type DateTime that can be used in the query and/or included in the output.
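As a rough illustration of what the {date:yyyy}/{date:MM}/{date:dd} pattern does, the Python sketch below recovers a date from a path with the same layout. It is a conceptual stand-in, not how U-SQL implements file sets; the regular expression and the helper name date_from_path are assumptions made for the example, and the day and month are allowed to be unpadded because the paths in the question look like /Data/2018/10/1/mydata.json.

```python
import re
from datetime import date

# Matches /Data/<year>/<month>/<day>/mydata.json with 1- or 2-digit month and day.
PATH_RE = re.compile(r"^/Data/(\d{4})/(\d{1,2})/(\d{1,2})/mydata\.json$")

def date_from_path(path):
    """Return the date encoded in the path, or None if the path does not match."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    year, month, day = map(int, m.groups())
    return date(year, month, day)

for p in ["/Data/2018/10/1/mydata.json", "/Data/2018/12/5/mydata.json", "/Data/other.json"]:
    print(p, "->", date_from_path(p))
```

Once the date is available as a value, it can be filtered on or written to the output just like any other column.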

Python format incomplete date to YYYYMM

As a start, I am extremely new at Python.
I am receiving an Excel file where the date field is incomplete. The value displays as "190808" (YYMMDD) instead of "2019-08-08".
Part of my automation attempt is to move the file to a different location, where the file is renamed. I want to use the date field to change the file name to the file description and date (e.g. "Sales figures 201908").
The code I have only works if the date format is
str(df['Bank date'][0].strftime("%Y%m"))
I have tried dateparser with the following:
dateparser.parse(df['Bank date'][0].strftime("%Y.%m"))
The error I am receiving is 'numpy.int64' object has no attribute 'strftime'
Any help will do.
Thanks.

I modified it slightly and built my own date-string using slicing.
vOldDate = str(df['Bank date'][0])
vNewDate = '20' + vOldDate[:2] + '.' + vOldDate[2:4]

Numpy is interpreting the date as an integer. To use dateparser, you need to convert that value into a string first, then parse that string, and then format the result:
dateparser.parse(str(df['Bank date'][0])).strftime("%Y.%m")
Since the input format is known in advance, you should specify it to ensure you get the right date:
>>> dateparser.parse(str(190808), date_formats=['%y%m%d']).strftime("%Y.%m")
'2019.08'
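If the format is always YYMMDD, the standard library alone may be enough. The sketch below assumes the cell arrives as an integer (as the numpy.int64 error suggests) and that all dates are in the 2000s; the helper name yymmdd_to_yyyymm is hypothetical.

```python
from datetime import datetime

def yymmdd_to_yyyymm(value):
    """Convert a YYMMDD value (int or str) to a 'YYYYMM' string."""
    # zfill restores a leading zero lost when the value was read as an integer,
    # e.g. 10203 -> "010203" (3 February 2001).
    text = str(value).zfill(6)
    return datetime.strptime(text, "%y%m%d").strftime("%Y%m")

new_name = f"Sales figures {yymmdd_to_yyyymm(190808)}"
print(new_name)  # Sales figures 201908
```

With the data frame from the question, the call would look like yymmdd_to_yyyymm(df['Bank date'][0]). Using strptime also rejects impossible dates such as 191340 with a ValueError instead of producing a wrong file name.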

How can I define multiple input file patterns in USQL?

I have U-SQL script where I need to process some data. The data is stored in blob, with ~100 files per day in this folder structure: /{year}/{month}/{day}/{hour}/filenames.tsv
Getting one day of data is easy, just put a wildcard in the end and it will pick out all the files for all the hours for the day.
However, in my script I want to read out the current day and the last 2 hours of the previous day. The naive way is with 3 extract statements in this way:
DECLARE @input1 = @"/data/2017/10/08/22/{*}.tsv";
DECLARE @input2 = @"/data/2017/10/08/23/{*}.tsv";
DECLARE @input3 = @"/data/2017/10/09/{*}.tsv";
@x1 = EXTRACT .... FROM @input1 USING Extractors.Tsv();
@x2 = EXTRACT .... FROM @input2 USING Extractors.Tsv();
@x3 = EXTRACT .... FROM @input3 USING Extractors.Tsv();
But in my case each extract statement is very long and complicated (~50 columns) and uses the AvroExtractor, so I would really prefer to specify the columns and extractor only once instead of 3 times. Also, with 3 fixed inputs it's not possible for the caller to decide how many hours of the previous day should be read.
My question is how can I define this in a convenient way, ideally using only one extract statement?

You could wrap your logic up into a U-SQL stored procedure so it is encapsulated. Then you need only make a few calls to the proc. A simple example:
CREATE PROCEDURE IF NOT EXISTS main.getContent(@inputPath string, @outputPath string)
AS
BEGIN;
    @output =
        EXTRACT
        ...
        FROM @inputPath
        USING Extractors.Tsv();
    OUTPUT @output
    TO @outputPath
    USING Outputters.Tsv();
END;
Then to call it (untested):
main.getContent (
    @"/data/2017/10/08/22/{*}.tsv",
    @"/output/output1.tsv"
);
main.getContent (
    @"/data/2017/10/08/23/{*}.tsv",
    @"/output/output2.tsv"
);
main.getContent (
    @"/data/2017/10/09/{*}.tsv",
    @"/output/output3.tsv"
);
That might be one way to go about it?
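If the script is submitted by a caller that can generate parameters, the list of input patterns could also be computed rather than hard-coded. The Python sketch below is one possible way to do that, assuming the /data/{year}/{month}/{day}/{hour}/ layout from the question; the function name input_patterns is hypothetical, and how the resulting patterns are passed to the U-SQL job is left open.

```python
from datetime import datetime, timedelta

def input_patterns(day, hours_before):
    """Patterns for the last `hours_before` hours of the previous day, plus all of `day`."""
    patterns = []
    for h in range(hours_before, 0, -1):
        t = day - timedelta(hours=h)  # e.g. midnight minus 2 hours = 22:00 the day before
        patterns.append(t.strftime("/data/%Y/%m/%d/%H/{*}.tsv"))
    patterns.append(day.strftime("/data/%Y/%m/%d/{*}.tsv"))
    return patterns

# `day` is midnight at the start of the current day.
for p in input_patterns(datetime(2017, 10, 9), hours_before=2):
    print(p)
```

For 9 October 2017 and two hours this yields the same three patterns as in the question, and the caller can change hours_before without touching the extract logic.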

Importing data from Excel into Access using DAO and WHERE clause

I need to import certain information from an Excel file into an Access DB and in order to do this, I am using DAO.
The user gets the excel source file from a system, he does not need to directly interact with it. This source file has 10 columns and I would need to retrieve only certain records from it.
I am using this to retrieve all the records:
Set destinationFile = CurrentDb
Set dbtmp = OpenDatabase(sourceFile, False, True, "Excel 8.0;")
DoEvents
Set rs = dbtmp.OpenRecordset("SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536]")
My problem comes when I want to retrieve only certain records using a WHERE clause. The name of the field where I want to apply the clause is 'Date (UCT)' (remember that the user gets this source file from another system) and I can not get the WHERE clause to work on it. If I apply the WHERE clause on another field, whose name does not have ( ) or spaces, then it works. Example:
Set rs = dbtmp.OpenRecordset("SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536] WHERE Other = 12925")
The previous instruction will retrieve only the number of records where the field Other has the value 12925.
Could anyone please tell me how I can achieve the same result with a field name that has spaces and parentheses, i.e. 'Date (UCT)'?
Thank you very much.
Octavio

Try enclosing the field name in square brackets:
SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536] WHERE [Date (UCT)] = 12925
or if it's a date we are looking for:
SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536] WHERE [Date (UCT)] = #02/14/13#;
To use a date literal, you must enclose it in # characters and write the date in MM/DD/YY format, regardless of the regional settings on your machine.
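When the query string is assembled programmatically, both rules (square brackets around awkward names, #-delimited month-first date literals) can be applied in one place. The Python sketch below only illustrates the string building; the helper names are hypothetical, and the resulting text would still be passed to OpenRecordset from VBA as in the question. It writes a four-digit year to avoid any two-digit-year ambiguity.

```python
from datetime import date

def bracket(name):
    """Wrap a sheet range or field name in square brackets."""
    if "]" in name:
        raise ValueError("names containing ']' are not handled by this sketch")
    return f"[{name}]"

def access_date_literal(d):
    """Format a date as an Access literal: #MM/DD/YYYY#, month first."""
    return d.strftime("#%m/%d/%Y#")

def build_query(sheet_range, field, d):
    return (f"SELECT * FROM {bracket(sheet_range)} "
            f"WHERE {bracket(field)} = {access_date_literal(d)}")

sql = build_query("EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536", "Date (UCT)", date(2013, 2, 14))
print(sql)
```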
