How to write .dsv file into dynamic folders in U-SQL script - azure

I am new to U-SQL and am trying to read multiple .dsv files from a folder and write the output as .dsv files into dynamic folders and subfolders. So far I can read multiple .dsv files and write the output to a single, predefined location, but I cannot write the .dsv files into multiple folders and subfolders. The folders and subfolders are derived from the file name. For example, if the file name is ABC_20200421#015814.dsv, the script should read the data from that file and write it to the folder structure test/ABC/2020/04/21/File_Data.dsv
The code I have so far is :
DECLARE @ImportFile string = "*.dsv";
-- To read all the .dsv files and move them to folder locations
DECLARE @Code String = @ImportFile.Substring(0, 3);
DECLARE @CommanName String = @ImportFile.Substring(4, 14);
DECLARE @Year String = @ImportFile.Substring(18, 4);
DECLARE @Month String = @ImportFile.Substring(22, 2);
DECLARE @File_Date String = @ImportFile.Substring(24, 2);
@result=
SELECT col1,col2,col3
FROM Table1;
//Writing to dsv file:
OUTPUT @result
TO "test/"+@Code+"/"+@Year+"/"+@Month+"/"+@File_Date+"/File_Data.dsv"
USING Outputters.Text(delimiter : '|', quoting: false);
On running the code, I get an error:
E_CSC_USER_EXPRESSIONNOTCONSTANTFOLDABLE: Expression cannot be constant folded.
Details:
at token '"/test/"', near the ###:
OUTPUT @result TO ### "/test/"+@Code+"/"+@Year+"/"+@Month+"/"+@File_Date+"/STAY_REVENUES.dsv" USING Outputters.Text(delimiter : '|', quoting : false)
Any help would be appreciated. Thanks in advance.
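To make the intended mapping from file name to target folder concrete, here is a small Python sketch of the path derivation only. It is not a U-SQL solution and does not address the constant-folding error. The function name target_path, the regular expression, and the assumption that every file name starts with letters, an underscore, and an eight-digit date (YYYYMMDD) are hypothetical choices made for illustration.

```python
import re

def target_path(file_name, root="test", out_name="File_Data.dsv"):
    """Derive root/CODE/YYYY/MM/DD/out_name from a name like CODE_YYYYMMDD...dsv."""
    # Assumed layout: letters, underscore, eight-digit date, then anything else.
    match = re.match(r"^([A-Za-z]+)_(\d{4})(\d{2})(\d{2})", file_name)
    if match is None:
        raise ValueError("unexpected file name: " + file_name)
    code, year, month, day = match.groups()
    return "/".join([root, code, year, month, day, out_name])

print(target_path("ABC_20200421#015814.dsv"))
# test/ABC/2020/04/21/File_Data.dsv
```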

Related

Stripping a string from newline markers

I store configuration data (paths to specific files) inside a file named app.cfg that looks like this :
path/to/config.json
path/to/default/folder
and I read those items with the following Python code:
with open("app.cfg","r",newline='') as config:
    data = config.readlines()
    PathToConfig = data[0]
    DefaultPath = data[1]
config.close()
But when I use PathToConfig in my script, the path stored in this variable cannot be used because there is \n at the end of the string.
I tried to fix this issue by using this PathToConfig = data[0].rstrip() but there still is \n at the end of the string.
How can I strip the newline marker from this string?
You should be able to solve it with .rstrip to strip "\n":
create app.cfg:
with open("app.cfg","w",newline='') as config:
    config.writelines("""path/to/config.json
path/to/default/folder""")
app.cfg looks like this:
path/to/config.json
path/to/default/folder
read contents from file:
with open("app.cfg","r",newline='') as config:
    data = config.readlines()
PathToConfig = data[0].rstrip("\n")
DefaultPath = data[1]
output:
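As a variation on the answer above, the line endings can be removed from every line in one step. One thing worth knowing: when a file is opened with newline='', Python does not translate line endings, so a file saved with "\r\n" endings would still carry a "\r" after rstrip("\n"). The sketch below uses str.splitlines(), which drops "\n", "\r\n" and "\r" alike. The helper name read_config_paths and the temporary demo file are assumptions made for illustration.

```python
import os
import tempfile

def read_config_paths(cfg_path):
    """Return the non-empty lines of cfg_path with their line endings removed."""
    with open(cfg_path, "r", newline="") as config:
        # splitlines() removes "\n", "\r\n" and "\r" terminators.
        return [line for line in config.read().splitlines() if line.strip()]

# Demonstration with a throwaway file that uses Windows-style line endings.
with tempfile.TemporaryDirectory() as tmp:
    cfg = os.path.join(tmp, "app.cfg")
    with open(cfg, "w", newline="") as handle:
        handle.write("path/to/config.json\r\npath/to/default/folder\r\n")
    path_to_config, default_path = read_config_paths(cfg)

print(repr(path_to_config))  # 'path/to/config.json'
print(repr(default_path))    # 'path/to/default/folder'
```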

Python Glob - Get Full Filenames, but no directory-only names

This code works, but it's returning directory names and filenames. I haven't found a parameter that tells it to return only files or only directories.
Can glob.glob do this, or do I have to call os.something to test if I have a directory or file. In my case, my files all end with .csv, but I would like to know for more general knowledge as well.
In the loop, I'm reading each file, so currently bombing when it tries to open a directory name as a filename.
files = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
for loop_full_filename in files:
    print(loop_full_filename)
Results:
c:\Demo\WatchDir\
c:\Demo\WatchDir\2202
c:\Demo\WatchDir\2202\07
c:\Demo\WatchDir\2202\07\01
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_51.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_52.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_53.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_54.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_55.csv
c:\Demo\WatchDir\2202\07\05
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_00.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_01.csv
Results needed:
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_51.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_52.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_53.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_54.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_55.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_00.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_01.csv
For this specific program, I can just check whether the file name contains .csv, but I would like to know the general approach for future reference.
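Replace the line:
files = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
with the line:
files = sorted(glob.glob(input_watch_directory + "/**/*.*", recursive=True))
Note that this pattern matches any name containing a dot, so it skips files without an extension and would still return a directory whose name contains a dot.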
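For the general case the question asks about, glob itself has no files-only switch, so one option is to filter its results with os.path.isfile. The sketch below assumes that approach. The helper name list_files_recursive and the temporary demo tree (including the directory named archive.old and the extensionless file notes) are made up for illustration.

```python
import glob
import os
import tempfile

def list_files_recursive(root):
    """Return every regular file below root, sorted, skipping directories."""
    pattern = os.path.join(root, "**")
    return sorted(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as watch_dir:
    day_dir = os.path.join(watch_dir, "2202", "07", "01")
    os.makedirs(day_dir)
    # A directory whose name contains a dot; it must not be returned.
    os.makedirs(os.path.join(watch_dir, "archive.old"))
    for name in ("polygonData_2022_07_01__15_51.csv", "notes"):
        with open(os.path.join(day_dir, name), "w") as handle:
            handle.write("demo\n")
    found = [os.path.relpath(p, watch_dir) for p in list_files_recursive(watch_dir)]

print(found)
```

Only the two files are listed. The intermediate directories and archive.old are filtered out, and the extensionless file notes is kept.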

Reading parquet file by spark using wildcard

I have many parquet files in S3 directory. The directory structure may vary based on vid. something like this:
bucketname/vid=123/year=2020/month=9/date=12/hf1hfw2he.parquet
bucketname/vid=456/year=2020/month=8/date=13/34jbj.parquet
bucketname/vid=876/year=2020/month=9/date=15/ghg76.parquet
I have a list which contains all the vid something like this
vid_list = ['123','456','876']
How can I read all the files for month=9 at once without hurting performance?
current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3a://bucketname' + 'vid={}/year=2020/month={}/*/*.parquet'.format(*vid_list,current_month))
This gives me the error Path does not exist: file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;. Is there an efficient way to achieve this?
Try the following code:
vid_list = '(' + '|'.join(['123','456','876']) + ')'
current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3://bucketname/' + 'vid={}/year=2020/month={}/*/*.parquet'.format(vid_list,current_month))
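# URL should look like: s3://bucketname/vid=(123|456|876)/year=2020/month=9/*/*.parquet
Error in your code: the month value is 456 when it should be 9. Because *vid_list unpacks the list, format() receives '123' and '456' as its first two arguments, so they fill the vid and month placeholders and current_month is never used. That produces the path in the error message:
file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;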
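One caveat on the pattern syntax: Hadoop-style path globs, which Spark relies on when resolving input paths, document curly-brace alternation such as {123,456,876} rather than (123|456|876). The sketch below only builds the path string under that assumption. The helper name build_month_glob is hypothetical, and the Spark call is left as a comment because it needs a live session and real data.

```python
def build_month_glob(bucket, vids, year, month):
    """Build one glob path that covers every vid for the given year and month."""
    # Curly braces express alternation in Hadoop-style globs: {a,b,c}.
    vid_alternation = "{" + ",".join(vids) + "}"
    return "s3a://{}/vid={}/year={}/month={}/*/*.parquet".format(
        bucket, vid_alternation, year, month
    )

vid_list = ['123', '456', '876']
current_month = 9
path = build_month_glob("bucketname", vid_list, 2020, current_month)
print(path)
# s3a://bucketname/vid={123,456,876}/year=2020/month=9/*/*.parquet

# With a Spark session available, the path would be passed on unchanged:
# temp_df = sqlContext.read.option("mergeSchema", "false").parquet(path)
```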

How to get the name of the directory from the name of the directory + the file

In an application, I can get the path to a file which resides in a directory as a string:
"/path/to/the/file.txt"
In order to write another file into that same directory, I want to remove the "file.txt" part from the string "/path/to/the/file.txt" so that I end up with
"/path/to/the/"
as a string
I could use
string = "/path/to/the/file.txt"
string.split('/')
and then glue all the terms (except the last one) back together with a loop.
Is there an easy way to do it?
You can use os.path.basename to get the last part of the path and then remove it with replace.
import os
path = "/path/to/the/file.txt"
delete = os.path.basename(os.path.normpath(path))
print(delete) # will return file.txt
#Remove file.txt in path
path = path.replace(delete,'')
print(path)
OUTPUT :
file.txt
/path/to/the/
Say you have a list of .txt file names. You can build all of their paths like this:
new_path = ['file2.txt','file3.txt','file4.txt']
for get_new_path in new_path:
    print(path + get_new_path)
OUTPUT :
/path/to/the/file2.txt
/path/to/the/file3.txt
/path/to/the/file4.txt
Here is what I finally used:
iter = len(string.split('/'))-1
directory_path_str = ""
for i in range(0,iter):
    directory_path_str = directory_path_str + string.split('/')[i] + "/"
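For comparison, the standard library already covers this case: os.path.dirname returns everything before the last path separator, without the trailing slash. The sketch below is a minimal example of that call. The sibling file name other.txt is made up for illustration.

```python
import os

path = "/path/to/the/file.txt"

# dirname drops the final component and the separator before it.
directory = os.path.dirname(path)
print(directory)             # /path/to/the

# Add the trailing slash back if the exact string from the question is needed.
directory_with_slash = directory + "/"
print(directory_with_slash)  # /path/to/the/

# Path of another (hypothetical) file in the same directory.
sibling = directory_with_slash + "other.txt"
print(sibling)               # /path/to/the/other.txt
```

Unlike the replace-based approach, this does not depend on the file name appearing only once in the path.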

How to find missing files?

I have several files (with the same dim) in a folder called data for certain dates:
file2011001.bin named like this "fileyearday"
file2011009.bin
file2011020.bin
.
.
file2011322.bin
Certain dates (files) are missing. What I need is to loop through these files:
if file2011001.bin exists, fine; if not, copy any file in the directory and name it file2011001.bin
if file2011002.bin exists, fine; if not, copy any file in the directory and name it file2011002.bin, and so on until file2011365.bin
I can list them in R:
dir<- list.files("/data/", "*.bin", full.names = TRUE)
I wonder if it is possible thru R or any other language!
Pretty much what you'd expect:
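AllFiles = file.path("/data", sprintf("file2011%03d.bin", 1:365))
Existing = AllFiles[file.exists(AllFiles)]
for(file in AllFiles)
{
    if(!file.exists(file))
    {
        ## missing: copy any existing file under the missing name
        file.copy(Existing[1], file)
    }
}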
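Since the question is open to other languages, the same loop can be sketched in Python. This is only an illustration under stated assumptions: the helper name fill_missing_files is hypothetical, "any file" is interpreted as the first existing .bin file in sorted order, and the demonstration runs on a temporary folder for days 1 to 10 rather than on /data for the full year.

```python
import os
import shutil
import tempfile

def fill_missing_files(folder, year=2011, days=range(1, 366)):
    """Copy an existing .bin file under every missing fileYYYYDDD.bin name."""
    existing = sorted(f for f in os.listdir(folder) if f.endswith(".bin"))
    if not existing:
        raise FileNotFoundError("no .bin file available to copy in " + folder)
    template = os.path.join(folder, existing[0])  # "any file": here the first one
    created = []
    for day in days:
        name = "file{}{:03d}.bin".format(year, day)  # zero-padded day of year
        target = os.path.join(folder, name)
        if not os.path.exists(target):
            shutil.copyfile(template, target)
            created.append(name)
    return created

# Small demonstration: only days 1 and 9 exist at the start.
with tempfile.TemporaryDirectory() as data_dir:
    for day in (1, 9):
        with open(os.path.join(data_dir, "file2011{:03d}.bin".format(day)), "wb") as handle:
            handle.write(b"\x00" * 4)
    created = fill_missing_files(data_dir, days=range(1, 11))
    total = len(os.listdir(data_dir))

print(created[:3], total)
# ['file2011002.bin', 'file2011003.bin', 'file2011004.bin'] 10
```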
