I have a U-SQL script where I need to process some data. The data is stored in blob storage, with ~100 files per day, in this folder structure: /{year}/{month}/{day}/{hour}/filenames.tsv
Getting one day of data is easy: just put a wildcard at the end and it will pick up all the files for all the hours of that day.
However, in my script I want to read the current day and the last 2 hours of the previous day. The naive way is with 3 EXTRACT statements, like this:
DECLARE @input1 string = @"/data/2017/10/08/22/{*}.tsv";
DECLARE @input2 string = @"/data/2017/10/08/23/{*}.tsv";
DECLARE @input3 string = @"/data/2017/10/09/{*}.tsv";

@x1 = EXTRACT ... FROM @input1 USING Extractors.Tsv();
@x2 = EXTRACT ... FROM @input2 USING Extractors.Tsv();
@x3 = EXTRACT ... FROM @input3 USING Extractors.Tsv();
But in my case each EXTRACT line is very long and complicated (~50 columns, using the AvroExtractor), so I would really prefer to specify the columns and extractor only once instead of three times. Also, with 3 fixed inputs the caller cannot decide how many hours of the previous day should be read.
My question is: how can I define this in a convenient way, ideally using only one EXTRACT statement?
You could wrap your logic up in a U-SQL stored procedure so it is encapsulated; then you need only make a few calls to the proc. A simple example:
CREATE PROCEDURE IF NOT EXISTS main.getContent(@inputPath string, @outputPath string)
AS
BEGIN

    @output =
        EXTRACT
            ...
        FROM @inputPath
        USING Extractors.Tsv();

    OUTPUT @output
    TO @outputPath
    USING Outputters.Tsv();

END;
Then to call it (untested):
main.getContent (
    @"/data/2017/10/08/22/{*}.tsv",
    @"/output/output1.tsv"
);

main.getContent (
    @"/data/2017/10/08/23/{*}.tsv",
    @"/output/output2.tsv"
);

main.getContent (
    @"/data/2017/10/09/{*}.tsv",
    @"/output/output3.tsv"
);
That might be one way to go about it?
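Alternatively, since the dates and hours are encoded in your paths, U-SQL file sets with virtual columns may get you down to the single EXTRACT you asked for. A hedged sketch, assuming the folder structure from the question; @fromDay and @fromHour are invented parameters, and the column list is elided as in your example:

DECLARE @fromDay DateTime = DateTime.Parse("2017-10-08");
DECLARE @fromHour int = 22;

@data =
    EXTRACT ...,           // the ~50 real columns, declared once
            date DateTime, // virtual column built from the path
            hour string    // virtual column for the hour folder
    FROM "/data/{date:yyyy}/{date:MM}/{date:dd}/{hour}/{*}.tsv"
    USING Extractors.Tsv();

@filtered =
    SELECT *
    FROM @data
    WHERE date > @fromDay OR (date == @fromDay AND int.Parse(hour) >= @fromHour);

The caller then only changes @fromDay/@fromHour to control how much of the previous day is read, and the predicate on the virtual date column should let the optimizer skip non-matching folders rather than reading everything.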
I have had to look up hundreds (if not thousands) of free-text answers on Google, making notes in Excel along the way and inserting SAS code around the answers as a last step.
The resulting output contains an unnecessary number of blank spaces, which seems to confuse SAS's search to the point where the observations can't be properly located.
It works if I manually erase the superfluous spaces, but that would probably take hours. Is there an automated fix for this, either in SAS or in Excel?
I tried using the STRIP function, to no avail:
else if R_res_ort_txt=strip(" arild ") and R_kom_lan=strip(" skåne ") then R_kommun=strip(" Höganäs " );
If you want to generate a string like:
if R_res_ort_txt="arild" and R_kom_lan="skåne" then R_kommun="Höganäs";
from three variables, let's call them A, B, and C, then just use code like:
string=catx(' ','if R_res_ort_txt=',quote(trim(A))
,'and R_kom_lan=',quote(trim(B))
,'then R_kommun=',quote(trim(C)),';') ;
Or, if you are just writing that string to a file, use this PUT statement syntax:
put 'if R_res_ort_txt=' A :$quote. 'and R_kom_lan=' B :$quote.
'then R_kommun=' C :$quote. ';' ;
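For context, a minimal sketch of wrapping that PUT in a data step that writes the generated code to a text file; the dataset name answers and the output path are assumptions:

filename gencode 'recode_rules.sas';  /* assumed output path */
data _null_;
    set answers;                      /* assumed dataset holding A, B, C */
    file gencode;
    put 'if R_res_ort_txt=' A :$quote. 'and R_kom_lan=' B :$quote.
        'then R_kommun=' C :$quote. ';' ;
run;

The generated file can then be pulled into a later data step with %include 'recode_rules.sas';.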
A saner solution would be to keep using the free-text answers as data and apply your matching criteria for the transformations with a left join:
proc import out=answers datafile='my-free-text-answers.xlsx' dbms=xlsx replace;
run;

data have;
    attrib R_res_ort_txt R_kom_lan length=$100;
    input R_res_ort_txt ...;
datalines4;
... whatever all those transforms will be performed on ...
;;;;

proc sql;
    create table want as
    select
        have.*,
        answers.R_kommun_answer as R_kommun
    from have
    left join answers
        on  have.R_res_ort_txt = answers.res_ort_answer
        and have.R_kom_lan     = answers.kom_lan_answer
    ;
quit;
I solved this by adding the quotes in Excel using the Flash Fill function:
https://www.youtube.com/watch?v=nE65QeDoepc
My input files are in a month directory, with the naming pattern
_.csv
I can create an extract to grab all the files:
@InputFile_Daily + "{*}.json"
However, I now need to be able to create a file set for a specific range of dates, e.g. Today -> Today-3.
Is there a way to specify this kind of range, be it regex or otherwise, within the U-SQL EXTRACT? Or, as I've seen elsewhere, do I have to extract all the data and then filter the result down to the range I'm interested in? That is not ideal, as cost is a factor.
In U-SQL you extract all the files as you said (@InputFile_Daily + "{*}.json"), and then in the first SELECT you apply your date filter; internally it only extracts the needed data.
Example:
DECLARE @input string = @"/temp/stackoverflow.json";
// Example filter values (assumed; adjust to the window you need)
DECLARE @dateFrom DateTime = DateTime.Parse("2017-10-06");
DECLARE @dateTo DateTime = DateTime.Parse("2017-10-09");
DECLARE @referenceDate DateTime = DateTime.MinValue; // MinValue acts as "no filter"

// Read input file
@inputData =
    EXTRACT Account string,
            Alias string,
            Company string,
            date DateTime,
            Json string
    FROM @input
    USING Extractors.Text(delimiter : '\n', quoting : false);

@extractedFields =
    SELECT Account,
           Alias,
           Company,
           date,
           Json
    FROM @inputData
    WHERE @referenceDate == DateTime.MinValue OR (date >= @dateFrom AND date <= @dateTo);
If you have 1 million files and your filter only matches the most recent ones, for example 5 files, it will extract only those 5. You can then confirm on the U-SQL job graph how many files were actually extracted.
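When the date is encoded in the folder names, as in your month directory, a file set with a virtual column gives the optimizer an even stronger hint, since non-matching files can be eliminated outright. A hedged sketch, assuming a /data/{year}/{month}/{day} layout and invented column names:

DECLARE @dateFrom DateTime = DateTime.Parse("2017-10-06");
DECLARE @dateTo DateTime = DateTime.Parse("2017-10-09");

@inputData =
    EXTRACT Account string,
            Json string,
            date DateTime // virtual column built from the path
    FROM "/data/{date:yyyy}/{date:MM}/{date:dd}/{*}.json"
    USING Extractors.Text(delimiter : '\n', quoting : false);

@filtered =
    SELECT Account, Json, date
    FROM @inputData
    WHERE date >= @dateFrom AND date <= @dateTo;

Because the predicate is on the virtual date column with constant bounds, only the matching folders should be opened at all.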
I have a BUNCH of fixed-width text files that contain multiple transaction types, with only 3 that I care about (121, 122, 124).
Sample File:
D103421612100188300000300000000012N000002000001000032021420170012260214201700122600000000059500000300001025798
D103421612200188300000300000000011000000000010000012053700028200004017000000010240000010000011NNYNY000001000003N0000000000 00
D1034216124001883000003000000000110000000000300000100000000000CS00000100000001200000033NN0 00000001200
So what I need to do is read these files line by line and look for the lines that have 121, 122, or 124 at startIndex = 9 with length = 3.
Each line needs to be parsed based on a data dictionary I have, and the output needs to be grouped by transaction type into three different files.
I have a process that works, but it's very inefficient: it basically reads each line 3 times. The code I have is something like this:
@t121 =
    EXTRACT col1 string,
            col2 string,
            col3 string // etc...
    FROM @inputFile
    USING new MyCustomExtractor(
        new SQL.MAP<string, string> {
            {"col1", "2"},
            {"col2", "6"},
            {"col3", "3"} // etc...
        });

OUTPUT @t121
TO "/output/121.csv"
USING Outputters.Csv();
And I have the same code for 122 and 124. My custom extractor takes the SQL.MAP, returns the parsed line, and skips all lines that don't contain the transaction type I'm looking for.
This approach means I'm running through all the lines in each file 3 times, which is obviously not as efficient as it could be.
What I'm looking for is a high-level concept of the most efficient way to read a line, determine whether it's a transaction I care about, and output it to the correct file.
Thanks in advance.
How about pulling out the transaction type early using the Substring method of the String datatype? Then you can do some work with it, filtering etc. A simple example:
// Test data
@input = SELECT *
FROM (
VALUES
( "D103421612100188300000300000000012N000002000001000032021420170012260214201700122600000000059500000300001025798" ),
( "D103421612200188300000300000000011000000000010000012053700028200004017000000010240000010000011NNYNY000001000003N0000000000 00" ),
( "D1034216124001883000003000000000110000000000300000100000000000CS00000100000001200000033NN0 00000001200" ),
( "D1034216999 0000000000000000000000000000000000000000000000000000000000000000000000000000000 00000000000" )
) AS x ( rawData );
// Pull out the transaction type
@working =
    SELECT rawData.Substring(8,3) AS transactionType,
           rawData
    FROM @input;
// !!TODO do some other work here
@output =
    SELECT *
    FROM @working
    WHERE transactionType IN ("121", "122", "124"); // NB: note the case-sensitive IN clause

OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
As of today, there is no specific U-SQL function that can define the output location of a tuple on the fly.
wBob presented an approach to a potential workaround. I'd extend that solution in the following way to address your need (see the sketch after this list):
1. Read the entire file, adding a new column that helps you identify the transaction type.
2. Create 3 rowsets (one for each output file) using a WHERE clause on the specific transaction type (121, 122, 124) against the column created in the previous step.
3. Output each rowset created in the previous step to its individual file.
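A minimal sketch of those three steps, reusing the @working rowset and column names from wBob's example above; the output paths are assumptions:

@t121 = SELECT rawData FROM @working WHERE transactionType == "121";
@t122 = SELECT rawData FROM @working WHERE transactionType == "122";
@t124 = SELECT rawData FROM @working WHERE transactionType == "124";

OUTPUT @t121 TO "/output/121.csv" USING Outputters.Csv();
OUTPUT @t122 TO "/output/122.csv" USING Outputters.Csv();
OUTPUT @t124 TO "/output/124.csv" USING Outputters.Csv();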
If you have more feedback or needs, feel free to create an item (and vote for others) on our UserVoice site: https://feedback.azure.com/forums/327234-data-lake. Thanks!
I'll begin by saying I am really not good at programming, especially at extracting data, so please bear with me. I think my problem is simple, I just can't figure out how to do it.
My problem is that I want to extract part of the data from a series of Excel files stored in the same folder. To be specific, let's say I have 10 Excel files with 1000 data points in each (A1:A1000). I want to extract the first 100 data points (A1:A100) from each file and store them in a single variable of size 10x100 (each row representing one file).
I would really appreciate it if any of you could help me. This would make my data processing a lot faster.
EDIT: I have figured out the code, but my next problem is to create another loop so that it re-reads the 10 files, this time extracting A101:A200, and so on until A901:A1000.
Here's the code I've written:
for k = 1:1:10
    file = ['', int2str(k), '.xlsx'];
    data = (xlsread(file, 'A1:A100'))';
    z(k,:) = data(1,:);
end
I'm not sure how to edit the part data=(xlsread(file,'A1:A100'))' to do the loop I want.
my next problem is to create another loop such that it will reread again the 10 files but this time extract A101:A200 until A901:A1000.
Why? Why not extract A1:A1000 in one block and then reshape or otherwise split up the data?
data(k,:)=(xlsread(file,'A1:A1000'))';
Then the A1:A100 data is in data(k,1:100), and so on. If you do this:
data = reshape(data, [10 100 10]);
Then data(:,:,1) should be your A1:A100 values as in your original loop, and so on until data(:,:,10).
This should do it:
for sec = 1:1:10
    for k = 1:1:10
        file = ['', int2str(k), '.xlsx'];
        section = ['A', num2str(1 + 100*(sec-1)), ':A', num2str(100*sec)];
        data = (xlsread(file, section))';
        z(k,:) = data(1,:);
    end
    output{sec} = z; % cell array: z is a matrix, so it cannot go into output(sec)
end
Here's a suggestion for looping through the different cells to read. Obviously, you can change how you arrange the collected data in z. I have done it with the first index representing the different cell ranges (1 for 1:100, 2 for 101:200, etc.), the second index being the file number (as in your original code), and the third index the data (100 data points).
% pre-allocate data
z = zeros(10,10,100);
for kk = 1:10
    cells_to_read = ['A' num2str(kk*100-99) ':A' num2str(kk*100)];
    for k = 1:10
        file = ['', int2str(k), '.xlsx'];
        data = (xlsread(file, cells_to_read))';
        z(kk,k,:) = data(1,:);
    end
end
I have a variable that is created by a loop. The variable is large enough and in a complicated enough form that I want to save the variable each time it comes out of the loop with a different name.
PM25 is my variable, but I want to save it as PM25_year, where the year changes based on str = fname(13:end).
PM25 = permute(reshape(E',[c,r/nlay,nlay]),[2,1,3]); % Reshape and permute to achieve the right shape. Each face of the 3D should be one day
str = fname(13:end); % The year
% Third dimension is organized so that the data for each site is on a face
save('PM25_str', 'PM25_Daily_US.mat', '-append')
The str would be a year, like 2008, so the variables saved would be PM25_2008, then PM25_2009, etc., as they are created.
Defining new variables based on data isn't considered best practice, but you can store your data more efficiently using a cell array. Even a large, complicated variable like your PM25 fits within a single cell. Here's how you could go about doing it:
Place your PM25 data for each year into the cell array C using your loop:
for i = 1:numberOfYears
    C{i} = PM25;
end
Resulting in something like this:
C = { PM25_2005, PM25_2006, PM25_2007 };
Now let's say you want to obtain your variable for the year 2006. This is easy (assuming you aren't skipping years). The first year of your data will correspond to position 1, the second year to position 2, etc. So to find the index of the year you want:
minYear = 2005;
yearDesired = 2006;
index = yearDesired - minYear + 1;
PM25_2006 = C{index};
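If you do need the year in the saved variable's name, a struct with a dynamic field name avoids eval entirely. A hedged sketch using the str from the question; save's '-struct' option writes each field out as its own variable:

S = struct();
S.(['PM25_', str]) = PM25;  % e.g. creates the field PM25_2008
save('PM25_Daily_US.mat', '-append', '-struct', 'S');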
You can do this using eval, but note that it's often not considered good practice. eval may be a security risk, as it allows user input to be executed as code. A better way to do this may be to use a cell array or an array of objects.
That said, I think this will do what you want:
for year = 2008:2014
    eval(sprintf('PM25_%d = permute(reshape(E'',[c,r/nlay,nlay]),[2,1,3]);', year)); % note the escaped quote in E''
    save('PM25_Daily_US.mat', sprintf('PM25_%d', year), '-append');
end
I do not recommend setting variables like this, since there is no way to track them and it completely prevents all the error checking that MATLAB does beforehand. This kind of code is handled entirely at runtime.
Anyway, in case you have a really good reason for doing this, I recommend using the function assignin:
assignin('caller', ['myvar',num2str(1)], 63);
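A brief note on usage: assignin only makes sense from inside a function, where 'caller' refers to the invoking workspace. A minimal sketch applied to this question (the function name is an assumption):

function storePM25ByYear(PM25, str)
    % Creates e.g. PM25_2008 in the workspace that called this function
    assignin('caller', ['PM25_', str], PM25);
end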