I want a script that copies the most recently modified file (based on "Date modified" in Windows Explorer).
For example:
D:\Source\A1.txt  Date modified: 12/24/2022
D:\Source\A2.txt  Date modified: 12/23/2022
After running the script, only A1.txt should be copied to the destination path (e.g. F:\destination).
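A minimal Python sketch of what this could look like (using the source and destination paths from the example; adjust as needed):

import shutil
from pathlib import Path

source = Path(r"D:\Source")
destination = Path(r"F:\destination")

# Find the file with the newest "Date modified" timestamp and copy only that one.
files = [f for f in source.iterdir() if f.is_file()]
newest = max(files, key=lambda f: f.stat().st_mtime)
shutil.copy2(newest, destination / newest.name)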
In my Azure Data Factory I need to copy data from an SFTP source that has structured the data into date-based directories with the following hierarchy:
year -> month -> date -> file
I have created a linked service and a binary dataset where the dataset "filesystem" points to the host and "Directory" points to the folder that contains the year directories. Ex: host/exampledir/yeardir/
with yeardir containing the year directories.
When I manually write into the dataset that I want the folder "2015", it copies the entirety of the 2015 folder; however, if I put a parameter for the directory and then input the same folder path from a copy activity, it creates a file called "2015" inside my blob storage that contains no data.
My current workaround is a nested sequence of Get Metadata activities and ForEach loops that drill into each folder and subfolder and copy the individual files at the end. However, the desired result is to have the single binary dataset copy each folder without needing Get Metadata.
Is this possible within the scope of the data factory?
Edit:
Screenshots: manual filepath that works; parameterized filepath; properties used in copy activity.
To add further context: I have tried manually writing the file path into the copy activity as shown in the screenshot. I have also attempted to use variables, dynamic content for the parameter (using the base file path and concat), and putting the base file path into the dataset alongside @dataset().filePath. None of these has worked for me so far; they either copy nothing or create the empty file I mentioned earlier.
The sink is a binary dataset linked to Azure Data Lake Storage Gen2.
Screenshot: sink filepath.
Update:
The accepted answer is the solution. My problem was that the source path, when retrieved and passed as a parameter, had a newline at the end. I used concat to clean this up and it has worked since then.
Since giving exampledir/yeardir/2015 worked perfectly for you and you want to copy all the folders present in exampledir/yeardir, you can follow the below procedure:
I have used a Get Metadata activity to get the child items of the folder exampledir/yeardir/ (in my demonstration, I have taken the path as 'maindir/yeardir').
This will give you all the year folders present. I have taken only 2020 and 2021 as an example.
Now, with only one ForEach activity, whose items value is the child items output of the Get Metadata activity, I have directly used a copy activity.
@activity('Get Metadata1').output.childItems
Now, inside the ForEach I have my copy data activity. For both source and sink, I have created a dataset parameter for the paths. I have given the following dynamic content for the source path:
maindir/yeardir/@{item().name}
For sink, I have given the output directory as follows:
outputDir/@{item().name}
Since giving the path manually as exampledir/yeardir/2015 worked, we get the list of year folders using the Get Metadata activity, loop through each of them, and copy each folder with the source path as exampledir/yeardir/<current_iteration_year_folder>.
Based on how I have given my sink path, the data will be copied with contents. The following is a reference image.
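Outside Data Factory, the same Get Metadata + ForEach + Copy pattern looks roughly like this in plain Python (an illustration of the loop only; the local paths stand in for the source and sink datasets):

import shutil
from pathlib import Path

source_root = Path("maindir/yeardir")   # stand-in for the folder read by Get Metadata
sink_root = Path("outputDir")           # stand-in for the sink path

# "Get Metadata": list the child items (the year folders).
year_folders = [child for child in source_root.iterdir() if child.is_dir()]

# "ForEach" + "Copy data": copy each year folder, keeping its name in the sink.
for folder in year_folders:
    shutil.copytree(folder, sink_root / folder.name, dirs_exist_ok=True)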
I need to get last modified dates of all Folders and Files in DBFS mount point (of ADLS Gen1) under Azure Databricks.
Folder structure is like:
Empty folders, not containing any files:
/dbfs/mnt/ADLS1/LANDING/parent/child/subfolder1
/dbfs/mnt/ADLS1/LANDING/parent/child/subfolder2/subfolder3
Folders containing some files:
/dbfs/mnt/ADLS1/LANDING/parent/XYZ/subfolder4/File1.txt
/dbfs/mnt/ADLS1/LANDING/parent/XYZ/subfolder5/subfolder6/File2.txt
I used the following Python code to get the last modified dates:
from pathlib import Path
from datetime import datetime
from os.path import getmtime

root_dir = "/dbfs/mnt/ADLS1/LANDING/parent"

def get_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            print(child, datetime.fromtimestamp(getmtime(child)).date())
        else:
            print(child, datetime.fromtimestamp(getmtime(child)).date())
            get_directories(child)  # recurse into subfolders
From the above code I am getting the correct modified date for all folders containing files.
But for empty folders it gives the current date, not the last modified date.
Whereas when I hardcode the path of an empty folder, it gives the correct modified date:
print(datetime.fromtimestamp(getmtime("/dbfs/mnt/ADLS1/LANDING/parent/child/subfolder1")).date())
Can someone please help me out? What am I missing in the loop?
It seems the issue was with processing time.
I added a wait time: time.sleep(.000005).
It worked as expected.
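For reference, a sketch of where such a wait could sit in the loop (the exact placement is an assumption; the answer only states that a short sleep was added):

import time
from pathlib import Path
from datetime import datetime
from os.path import getmtime

def get_directories(root_dir):
    for child in Path(root_dir).iterdir():
        time.sleep(.000005)  # brief pause before reading the timestamp (assumed placement)
        print(child, datetime.fromtimestamp(getmtime(child)).date())
        if child.is_dir():
            get_directories(child)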
I have a requirement to copy a few files from one ADLS Gen1 location to another ADLS Gen1 location, but I have to create the folder based on the file name.
I have the following files in the source ADLS:
ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz
Scenario-1
I have to copy only the csv files into the destination ADLS as below, creating the folder from the file name (if the folder exists, copy to that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
Scenario-2
I have to copy only the csv and json files into the destination ADLS as below, creating the folder from the file name (if the folder exists, copy to that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
|-ABCD_20200914_AB03_Part01.json.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
|-ABCD_20200914_AB04_Part01.json.gz
Is there any way to achieve this in Data Factory?
Appreciate any leads!
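For reference, the naming rule in both scenarios maps each file to a folder named after its third underscore-separated token; a minimal Python illustration of that rule only (not a Data Factory solution):

# Derive the destination folder (e.g. "AB01") from a name like
# ABCD_20200914_AB01_Part01.csv.gz by taking the third underscore-separated token.
def folder_for(file_name: str) -> str:
    return file_name.split("_")[2]

print(folder_for("ABCD_20200914_AB01_Part01.csv.gz"))  # -> AB01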
So I am not sure if this will entirely help, but I had a similar situation where we had 1 zip file and I had to copy its files out into their own folders.
What you can do is use parameters in the sink dataset you would be using, plus a Set Variable activity where you do a substring.
The pipeline below is more for a delta job, but I think it has enough in it to help. It can be divided into 3 sections.
The first, orange section gets the latest file name date from the ADLS Gen1 folder that you want to copy; that value is then used in the orange block.
On the bottom I get the latest file name based on the ADLS Gen1 date, and then I do a substring to take out the date portion of the file name. In your case you might be able to use an array and capture all of the folder names that you need.
Screenshots: getting the file name; getting the substring.
In the top section I first extract and unzip that file into a test landing zone.
Screenshots: source; sink.
I then get the names of all the files that were in that zip file, to then be used in the ForEach activity. These file names will then become folders for the copy activity.
Get File names from initial landing zone:
I then pass those childItems from "Get list of staged files" into the ForEach:
In that ForEach activity I have one copy activity. For that I made two datasets: one to grab the files from the initial landing zone that we have created. For this example let's call it Staging (forgive the MS Paint drawing):
The purpose of this is to go to that dummy folder and grab each file that was just copied into there. From that 1 zip file we expect 5 files.
In the Sink section I created a new dataset with a parameter for the folder and file name. In that dataset I am putting the data into the same container, but into a new folder called "Stage" concatenated with the item name. I also added a "replace" to remove the ".txt" from the file name.
What this does is give whatever file name is coming from that dummy staging area its own folder, one per file. Based on your requirements I am not sure if that is exactly what you want to do, but you can always rework it to be more specific.
For the item name I basically take the same file name, replace the ".txt", concat the date value, and only after that add the ".txt" extension. Otherwise I would have ended up with ".txt" twice in the file name.
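In plain Python, the folder/file naming described above works roughly like this (an illustration only; the real pipeline uses replace and concat in the dataset's dynamic content, and the sample names and date value are assumptions):

item_name = "File1.txt"    # a file name coming out of the unzipped staging folder (assumed)
date_value = "20220101"    # the date value captured earlier in the pipeline (assumed format)

folder_name = "Stage/" + item_name.replace(".txt", "")              # one folder per file
file_name = item_name.replace(".txt", "") + date_value + ".txt"     # date inserted before the extension
print(folder_name, file_name)  # -> Stage/File1 File120220101.txt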
In the end I created a Delete activity that is used to delete all the staged files (I am not sure if I have set that up properly, so feel free to adjust).
Hopefully the description above gave you an idea on how to use parameters for your files. Let me know if this helps you in your situation.
I have some simple code that imports a CSV file of Windows file paths, e.g. "C:/Folder/SubFolder/folder1", and copies the contents and any subfolders to a new directory.
When I run the code, about 1/3 of my sample returns "doesn't exist" or "not a regular file", while the others are copied successfully.
The files causing problems are either .docx or .pdf, but equally many of those were copied over successfully.
What could be causing this issue on a local Windows 10 machine and how do I debug it further?
import os
from distutils.dir_util import copy_tree

for Submission in FilePaths.itertuples():
    # Create the destination path
    FolderName = Submission.Group + "-" + Submission.ID + "-" + Submission.FirstName + Submission.Surname
    DestinationPath = DestinationBasePath + Submission.Category + "\\" + Submission.Value + "\\" + FolderName
    # Copy the source folder tree and contents to the destination
    try:
        copy_tree(Submission.FullFilePath, DestinationPath, verbose=1)
    except Exception as Ex:
        print(Ex)
        os.listdir(Submission.FullFilePath)  # inspect the source folder when the copy fails
I suspect it was connected to syncing or file locking issues which prevented the files from being copied.
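As a quick way to narrow this down, one could pre-check each source path before attempting the copy (a sketch, assuming FilePaths is the same DataFrame as in the question):

from pathlib import Path

# Report which source paths the OS can actually see before copy_tree runs.
for Submission in FilePaths.itertuples():
    src = Path(Submission.FullFilePath)
    if not src.exists():
        print("Missing:", src)          # path not visible (e.g. still syncing or locked)
    elif not src.is_dir():
        print("Not a directory:", src)  # copy_tree expects a folder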
I have a folder which has log files from June until now. I want to retrieve the files created in the last week of August and store them at a different path. I can get the creation date from the file name, which is in this format:
Run_Merge_BJDSBC_20190901-093
I want to code this in Python; so far I have got to here:
for filename in glob.glob(r"C:\Users\chke01\Desktop\PentahoLogs\BAF\*201908[0-9]*.log"):
But I don't know how to select these file names and write them to another folder. Can someone help me here?
glob.glob('path') gives you a list of files. You can iterate over that list and use the shutil module to move and rename the files:
import glob
import shutil
for file in glob.glob('*'):
    shutil.move(file, new_folder_path)
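Applied to the pattern from the question, it could look like this (the destination folder is a placeholder, and the position of the date in the file name is assumed from the example Run_Merge_BJDSBC_20190901-093):

import glob
import os
import shutil

destination = r"C:\Users\chke01\Desktop\PentahoLogs\AugustLastWeek"   # placeholder target folder
os.makedirs(destination, exist_ok=True)

for file in glob.glob(r"C:\Users\chke01\Desktop\PentahoLogs\BAF\*201908*.log"):
    # The date is the first 8 characters of the fourth underscore-separated token.
    date_part = os.path.basename(file).split("_")[3][:8]
    if int(date_part[6:8]) >= 25:        # keep only the last week of August
        shutil.move(file, destination)   # or shutil.copy2(file, destination) to keep the originals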