I want to use a WHERE clause with two variables substituted into it. I've done research on this, looking at How to use variables in SQL statements in Databricks and Inserting Variables Using Python, Not Working. I've tried to implement the solutions provided, but it's not working.
a = 17091990
b = 30091990
df = spark.sql(' SELECT * FROM table WHERE date between "a" AND "b" ')
You can use Python's formatted string literals (f-strings):
df = spark.sql(f"SELECT * FROM table WHERE date between {a} AND {b} ")
For more about formatted string literals, you can refer to https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep498
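As a minimal sketch of what the substitution produces (plain Python, no Spark session needed; the f-string is fully resolved before spark.sql ever sees it):

```python
a = 17091990
b = 30091990

# the f-string substitutes the variable values into the SQL text itself
query = f"SELECT * FROM table WHERE date between {a} AND {b}"
print(query)  # SELECT * FROM table WHERE date between 17091990 AND 30091990
```

Note that string values would additionally need quotes inside the SQL text, and for untrusted input a parameterized approach is safer than string formatting.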
I'm having an issue with scheduling a job in Data Factory.
I'm trying to set up an hourly scheduled job which executes the same script each hour with a different condition.
Consider that I have a bunch of Avro files spread across Azure Data Lake Store with the following pattern:
/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}{date:MM}{date:dd}__{date:H}
Each hour new files are added to Data Lake Store.
In order to process the files only once, I decided to handle them with the help of the U-SQL virtual file set column and a SyncTable which I created in Data Lake Store.
My query looks like the following:
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";

@result =
    EXTRACT [Id] long,
            ....
            date DateTime
    FROM @file_set_path
    USING someextractor;

@rdate =
    SELECT MAX(ProcessedDate) AS ProcessedDate
    FROM dbo.SyncTable
    WHERE EntityName == "SomeEntity";

@finalResult =
    SELECT [Id], ...
    FROM @result
    CROSS JOIN @rdate AS r
    WHERE date >= r.ProcessedDate;
Since I can't use a rowset variable in a WHERE clause, I'm cross joining the single row with the set. However, even in this case U-SQL won't find the correct files and always returns the full file set.
Is there any workaround or other approach?
I think this approach should work unless there is something not quite right somewhere, i.e. can you confirm the data types of the dbo.SyncTable table? Dump out @rdate and make sure the value you get there is what you expect.
I put together a simple demo which worked as expected. My copy of SyncTable had one record with the value 01/01/2018:
@working =
    SELECT *
    FROM (
        VALUES
            ( (int)1, DateTime.Parse("2017/12/31") ),
            ( (int)2, DateTime.Parse("2018/01/01") ),
            ( (int)3, DateTime.Parse("2018/02/01") )
    ) AS x ( id, someDate );

@rdate =
    SELECT MAX(ProcessedDate) AS maxDate
    FROM dbo.SyncTable;

//@output =
//    SELECT *
//    FROM @rdate;

@output =
    SELECT *, (w.someDate - r.maxDate).ToString() AS diff
    FROM @working AS w
    CROSS JOIN @rdate AS r
    WHERE w.someDate >= r.maxDate;

OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
I did try this with a filepath (full script here). The thing to remember is that the custom date format H represents the hour as a number from 0 to 23. If your SyncTable date does not have a time component when you insert it, it will default to midnight (0), meaning the whole day will be collected. Your file structure should look something like this, according to your pattern:
"D:\Data Lake\USQLDataRoot\Data\SomeEntity\2017\12\31\SomeEntity_2017_12_31__8\test.csv"
I note your filepath has underscores in the second section and a double underscore before the hour section (which will be a number between 0 and 23, a single digit below hour 10). I notice your file set path does not have a file type or quotes; I've used test.csv in my tests.
Basically, I think the approach will work, but something is not quite right, maybe in your file structure, the value in your SyncTable, the data type, etc. You need to go over the details, dumping out intermediate values to check, until you find the problem.
Doesn't the gist of wBob's full script resolve your issue? Here is a very slightly edited version of wBob's full script to address some of the issues you raised:
ability to filter on SyncTable;
the last part of the pattern is a file name and not a folder. Sample file and structure: \Data\SomeEntity\2018\01\01\SomeEntity_2018_01_01__1
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";

@input =
    EXTRACT [Id] long,
            date DateTime
    FROM @file_set_path
    USING Extractors.Text();

// in lieu of creating the actual table
@syncTable =
    SELECT *
    FROM (
        VALUES
            ( "SomeEntity",    new DateTime(2018, 01, 01, 01, 00, 00) ),
            ( "AnotherEntity", new DateTime(2018, 01, 01, 01, 00, 00) ),
            ( "SomeEntity",    new DateTime(2018, 01, 01, 00, 00, 00) ),
            ( "AnotherEntity", new DateTime(2018, 01, 01, 00, 00, 00) ),
            ( "SomeEntity",    new DateTime(2017, 12, 31, 23, 00, 00) ),
            ( "AnotherEntity", new DateTime(2017, 12, 31, 23, 00, 00) )
    ) AS x ( EntityName, ProcessedDate );

@rdate =
    SELECT MAX(ProcessedDate) AS maxDate
    FROM @syncTable
    WHERE EntityName == "SomeEntity";

@output =
    SELECT *,
           date.ToString() AS dateString
    FROM @input AS i
    CROSS JOIN @rdate AS r
    WHERE i.date >= r.maxDate;

OUTPUT @output
TO "/output/output.txt"
ORDER BY Id
USING Outputters.Text(quoting:false);
Also please note that file sets cannot perform partition elimination on dynamic joins, since the values are not known to the optimizer during the preparation phase.
I would suggest passing the sync point as a parameter from ADF to the processing script. Then the value is known to the optimizer and file set partition elimination will kick in. In the worst case, you would have to read the value from your sync table in a previous script and use it as a parameter in the next.
I read the post:
Turning a Comma Separated string into individual rows
And really like the solution:
SELECT A.OtherID,
Split.a.value('.', 'VARCHAR(100)') AS Data
FROM
( SELECT OtherID,
CAST ('<M>' + REPLACE(Data, ',', '</M><M>') + '</M>' AS XML) AS Data
FROM Table1
) AS A CROSS APPLY Data.nodes ('/M') AS Split(a);
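The trick in that query is to rewrite a,b,c as <M>a</M><M>b</M><M>c</M>, cast the result to XML, and emit one row per <M> node. As an illustration of that transformation only (plain Python, not the SQL Server syntax):

```python
import xml.etree.ElementTree as ET

data = "x,y,z"
# same REPLACE as in the query: wrap every token in <M>...</M>
as_xml = "<root><M>" + data.replace(",", "</M><M>") + "</M></root>"
tokens = [m.text for m in ET.fromstring(as_xml).iter("M")]
print(tokens)  # ['x', 'y', 'z']
```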
But it did not work when I tried to apply the method in Teradata for a similar question. Here is the summarized error:
Select failed. 3707: expected something between '.' and the 'value' keyword.
So is the code only valid in SQL Server? Would anyone help me make it work in Teradata or SAS SQL? Your help will be really appreciated!
This is SQL Server syntax.
In Teradata there's a table UDF named STRTOK_SPLIT_TO_TABLE,
e.g.
SELECT * FROM dbc.DatabasesV AS db
JOIN
(
SELECT token AS DatabaseName, tokennum
FROM TABLE (STRTOK_SPLIT_TO_TABLE(1, 'dbc,systemfe', ',')
RETURNS (outkey INTEGER,
tokennum INTEGER,
token VARCHAR(128) CHARACTER SET UNICODE)
) AS d
) AS dt
ON db.DatabaseName = dt.DatabaseName
ORDER BY tokennum;
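For readers unfamiliar with STRTOK_SPLIT_TO_TABLE, the shape of its output (one row per token, numbered) can be sketched in plain Python; the function below is a hypothetical illustration mirroring the RETURNS clause above, not a Teradata API:

```python
def strtok_split_to_table(outkey, text, delimiter):
    """Return (outkey, tokennum, token) rows, one per delimited token."""
    return [(outkey, tokennum, token)
            for tokennum, token in enumerate(text.split(delimiter), start=1)]

print(strtok_split_to_table(1, "dbc,systemfe", ","))
# [(1, 1, 'dbc'), (1, 2, 'systemfe')]
```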
Or see my answer to this similar question.
I'm looking for a SAS Code that works just like "VLOOKUP" function in Excel.
I have two tables:
Table_1 has an ID column and some other columns, with 10 rows. Table_2 has two columns, ID and Definition, with 50 rows. I want to create a new variable "Definition" in table_1 by looking up the ID values in table_2.
I haven't really tried anything other than merge, but merge keeps all the extra 40 rows from table_2, and that's not what I want.
Thanks, SE
The simplest way is to use the keep= option on your merge statement.
data result;
merge table_1 (in=a) table_2 (in=b keep=id definition);
by id;
if a;
run;
An alternative, which means you don't have to sort your datasets, is to use proc sql.
proc sql;
create table result as
select a.*,
b.definition
from table_1 a
left join table_2 b on a.id = b.id;
quit;
Finally, there is the hash table option, if table_2 is small enough to fit in memory:
data result;
if _n_ = 1 then do;
declare hash b(dataset:'table_2');
b.definekey('id');
b.definedata('definition');
b.definedone();
call missing(definition);
end;
set table_1;
b.find();
run;
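The hash-object pattern above is the same idea as a keyed dictionary lookup; here is a rough Python analogy with made-up sample data (not the poster's actual tables):

```python
# table_2 loaded once into a lookup structure (like the _n_ = 1 block)
table_2 = {1: "first definition", 2: "second definition"}

# each table_1 row keeps its own columns and gains the looked-up definition
table_1 = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
result = [dict(row, definition=table_2.get(row["id"])) for row in table_1]
```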
Here is one very useful (and often very fast) method specifically for 1:1 matching, which is what VLOOKUP does. You create a format or informat from the match variable and the lookup result, then put or input the match variable in the master table.
data class_income;
set sashelp.class(keep=name);
income = ceil(12*ranuni(7));
run;
data for_format;
set class_income end=eof;
retain fmtname 'INCOMEI';
start=name;
label=income;
type='i'; *i=informat numeric, j=informat character, n=format numeric, c=format character;
output;
if eof then do;
hlo='o'; *hlo contains some flags, o means OTHER for nonmatching records;
start=' ';
label=.;
output;
end;
run;
proc format cntlin=for_format;
quit;
data class;
set sashelp.class;
income = input(name,INCOMEI.);
run;
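The hlo='o' record effectively makes the informat a lookup with a default for non-matching keys. A small Python analogy with hypothetical names and values (the default None mirrors label=., a missing value, for the OTHER row):

```python
income_by_name = {"Alfred": 5, "Barbara": 9}

# .get's default plays the role of the hlo='o' (OTHER) record
def lookup_income(name):
    return income_by_name.get(name, None)

print(lookup_income("Alfred"))  # 5
print(lookup_income("Zelda"))   # None
```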
This is a part of a macro:
%let mvTableName = "MyTable";
proc append base = &mvTableName data = TEMP_TABLE;
run;
And I can't find the table in WORK :\
After that, I checked creation of the table:
data &mvTableName;
run;
And I see in the log: Dataset MyTable ...
But when I change the line to %let mvTableName=MyTable;
I see this log: Dataset WORK.MyTable ...
How can this be explained?
If you are going to use mvTableName as an input for a DATA= option, don't include double quotes: when the name is quoted, SAS treats "MyTable" as a physical file path rather than a one-level WORK dataset name, which is why the log shows no WORK libref.
Assuming MyTable and Temp_table are SAS data sets in the WORK library, this should work:
%Let mvTableName=MyTable;
Proc Append base=&mvTableName data=temp_table;
run;
Also,
Data &mvTableName;
Run;
Creates an empty data set, so the table referenced by &mvTableName would be overwritten with an empty one.