Amazon S3 - How to recursively rename files? - node.js

I'm trying to fetch my files via the s3.getObject() method in my node.js backend.
Trouble is, upon uploading the files to my bucket, I failed to replace special characters, dashes, and white-space. So any file whose Key contains those characters ends up with an awkward endpoint (e.g., a Key with the value of 10th Anniversary Party (Part 1) 1-23-04 has an endpoint of 10th+Anniversary+Party+(Part+1)+1-23-04).
This becomes troublesome when trying to encode the URI for fetching. I'd like to replace all dashes, white-space, and special chars with a simple underscore. I've seen some possible conventions using the aws-cli; however, I am unsure what the best command for this is. Any advice would be greatly appreciated.

You could write a program that:
Lists the contents of the bucket
Calls CopyObject() to copy the object to a new Key
Calls DeleteObject() to delete the previous copy
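A minimal sketch of such a program in Node.js, using the AWS SDK for JavaScript v2 (the bucket name and the underscore rule are assumptions, so adjust them to your own keys):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const Bucket = 'my-bucket'; // hypothetical bucket name

async function renameAll() {
  let ContinuationToken;
  do {
    // 1. List a page of objects
    const page = await s3.listObjectsV2({ Bucket, ContinuationToken }).promise();
    for (const obj of page.Contents || []) {
      // Replace whitespace, dashes and other special chars with underscores,
      // keeping dots and slashes so extensions and "folders" survive
      const newKey = obj.Key.replace(/[^A-Za-z0-9./]/g, '_');
      if (newKey === obj.Key) continue; // nothing to rename
      // 2. Copy the object to the new Key
      await s3.copyObject({
        Bucket,
        CopySource: `${Bucket}/${encodeURIComponent(obj.Key)}`,
        Key: newKey,
      }).promise();
      // 3. Delete the previous copy
      await s3.deleteObject({ Bucket, Key: obj.Key }).promise();
    }
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
}

renameAll().catch(console.error);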
Or, you could take advantage of the fact that the AWS CLI offers an aws s3 mv command that will Copy + Delete for you.
I often simply create an Excel spreadsheet with the existing names, and a formula for determining what name I'd like. Then, I create a third column with:
aws s3 mv [Column 1] [Column 2]
Use Copy Down on the rows to get all the mv commands. Then, copy the column of commands, paste them into the command-line and it will rename all the objects in Amazon S3! (Test with 1-2 lines first, in case there is an error in the formula.)
This might seem primitive, but it's a very quick way to make the changes.
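If a spreadsheet isn't handy, the same column of commands could also be generated with a few lines of Node.js and pasted into the command-line in the same way (the bucket name and the second key below are made-up examples):

const bucket = 's3://my-bucket'; // hypothetical bucket
const keys = [
  '10th Anniversary Party (Part 1) 1-23-04',
  'Summer Picnic 7-04-05', // made-up example key
];

for (const key of keys) {
  const clean = key.replace(/[^A-Za-z0-9.]/g, '_');
  // Quote both paths because the original key contains spaces and parentheses
  console.log(`aws s3 mv "${bucket}/${key}" "${bucket}/${clean}"`);
}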

Related

Copy a set of files using ADF

I have 10 files in a folder and want to move 4 of them to a different location.
I tried 2 approaches to achieve this:
using a Lookup to retrieve the filenames from a JSON file, then feeding it to a ForEach iterator
using Get Metadata to get the file names from the source folder and then adding an If Condition inside a ForEach to copy the files.
But in both cases, all the files in the source folder get copied.
Any help would be appreciated.
Thanks!!
There are 3 ways you can consider for selecting your files, depending on the requirement or blockers.
Check out the official MS doc: Copy activity properties
1. Dynamic content for the FilePath property in the Source Dataset.
2. You can use wildcard characters in the source folder and file path in the source Dataset.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has wildcard or this escape char inside. See more examples in Folder and file filter examples.
3. List of Files
Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When using this option, do not specify file name in dataset. See more examples in File list examples.
Example:
Parameterize the source dataset and set the source file name to the value that passes the expression evaluation in the If Condition activity.

Azure Data Factory removing spaces from column names of csv file

I'm a bit new to azure data factory so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits.
So the situation is that we have an existing pipeline that will take the path to a csv file and pass this in as a delimited data set. As a sink it is using a parquet data set. This is a generic process that we can pass any delimited file into and it will output it as parquet.
This has been working well but now we have started receiving files with spaces and special characters in the header which causes the output to parquet to fail. Unfortunately we don't have control over the format of the files we receive so I can't handle this at source.
What I would like to do is, on ingestion of the file, replace any spaces and other special characters in the header with an underscore. If I were doing this on premises I could quickly create a PowerShell script to do it. I had thought about creating a custom task in ADF to call a PowerShell script to do this in the blob storage, but that seems more complicated than it should be. Is there something else I can do to get this process working while keeping it generic?
As @Joel Cochran mentioned, you can use the expression below in a Select transformation to replace spaces and special characters in the header.
regexReplace($$,'[^a-zA-Z]','_')
In the Select transformation, remove the auto mappings and add a new rule-based mapping that uses this expression.
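As a quick sanity check of the pattern (plain JavaScript here, purely for illustration; the header value is a made-up example): note that [^a-zA-Z] also replaces digits, so something like [^a-zA-Z0-9] may be preferable if numbers in the header should survive.

const header = 'Order Total (USD) 2021'; // hypothetical header
console.log(header.replace(/[^a-zA-Z]/g, '_'));    // Order_Total__USD______
console.log(header.replace(/[^a-zA-Z0-9]/g, '_')); // Order_Total__USD__2021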
You cannot change the output filename directly in the Copy activity, assuming you are using that activity.
The workaround is to use a parameter for the output filename that you can clean up.
You can use the Get Metadata activity to get all filenames from the source csv files.
Then loop over these files with a foreach activity.
Within the ForEach activity you can set the output filename to the cleaned value.
The function could look like this:
@replace(item().name, ' ', '_')
More information on the replace function

run ibm data stage job with different file in same job

I created a job to input Excel data into a database. I need the job to be reusable for different Excel versions. The columns of the Excel file will be the same; only the values will change. It's like inserting the newest Excel values into the database.
For example, the files sales_report_january.xlsx and sales_report_february.xlsx both have the same columns and only the row values differ. I need the job to be able to process both files without changing anything except the file path, because recreating a different job that is identical in everything (except the file path) for the same task seems inefficient.
Is it possible to do this in IBM DataStage, or do I need to remap everything even though nothing else needs to change? I already tried changing the file path manually, but it raised an error.
In a word: Parameter
Construct your job using a job parameter for the pathname of the Excel workbook.
Whichever stage you are using to read the worksheet will have the workbook name set up as reference(s) to that parameter.
Tip: Use two parameters; one for the dirname part of the pathname and one for the actual name of the workbook. This is a more flexible design in the long run.
I can think of at least four ways to do this. Usually, if the files are all in the same directory, we use looping in the sequence job to process a list of the file names obtained through an appropriate command (such as ls -m pattern for UNIX/Linux). Capture the output, convert the newlines to a delimiter such as comma if necessary, and use that list in the StartLoop activity.

Unable to copy file from SFTP in Azure Data Factory when using wildcard(*) in the filename

I am unable to copy csv files from an SFTP connection to blob storage when using the wildcard(*) in the filename.
More specifically, I receive csv files in the SFTP on a daily basis, and they are of the format: "ddMMyyyyxxxxxx.csv", where "xxxxxx" is the timestamp. More concretely, my csv file for the 13th of March is: "13032019083647.csv", while for the 14th of March it is: "14032019083556.csv". Obviously, the timestamp is different every day, so I want to copy the file independently of whatever string exists between the date and the file extension.
In the "File" subfield of the "File path" of the "Connection" tab of my subset, I give as input: "13032019*.csv", as instructed by the help icon next to the field:
When I do so, my Debug run fails with:
{"errorCode": "2200", "message":
"ErrorCode=UserErrorInvalidCopyBehaviorBlobNameNotAllowedWithPreserveOrFlattenHierarchy,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot
adopt copy behavior PreserveHierarchy when copying from folder to a
single file.,Source=Microsoft.DataTransfer.ClientLibrary}
I receive a similar error no matter which type of copy behaviour I choose. I have also tried experimenting with the fileFilter parameter (even though ADF warns that the same behaviour can be achieved with the fileName option), but I still end up getting the same error.
For further clarification, I am attaching the code segment that ADF produces for this configuration.
I should also mention, that when using the full fileName in the corresponding field, namely the value: "13032019083647.csv", copying works normally.
Any help would be greatly appreciated!
My guess is that it might match two files with the wildcard operation.
In such cases we need to use a Get Metadata activity, a Filter activity and a ForEach activity to copy these files.
1. Get Metadata activity: use a dataset in this activity to point to the particular location of the files and pass the child items as the parameter.
2. Filter activity: use a filter to select the files based on your needs.
3. ForEach activity: in the ForEach activity, get the items from the previous activity and add a Copy activity inside the ForEach.
In the Copy activity, the source dataset file name should be @item().name.
I hope this will solve your issue.
What worked for me was the following: I kept the same regex for the input file, but I set "Copy behaviour" to "Merge Files". Since, as mentioned, there is only 1 file that satisfies the regex condition, only 1 file was created as output. I am aware that this is a sort of "dirty" solution, but it did the trick for me.

Not using colnames when reading .xls files with RODBC

I have another puzzling problem.
I need to read .xls files with RODBC. Basically I need a matrix of all the cells in one sheet, and then use greps and strsplits etc. to get the data out. As each sheet contains multiple tables in a different order, and some text fields with other options in between, I need something that functions like readLines(), but for Excel sheets. I believe RODBC is the best way to do that.
The core of my code is the following function:
.read.info.default <- function(file, sheet) {
  fc <- odbcConnectExcel(file)  # file connection
  tryCatch({
    x <- sqlFetch(fc,
                  sqtable = sheet,
                  as.is = TRUE,
                  colnames = FALSE,
                  rownames = FALSE
    )
  },
  error = function(e) { stop(e) },
  finally = close(fc)
  )
  return(x)
}
Yet, whatever I tried, it always takes the first row of the mentioned sheet as the variable names of the returned data frame. No clue how to get that solved. According to the documentation, colnames=FALSE should prevent that.
I'd like to avoid the xlsReadWrite package. Edit: also the gdata package; the client doesn't have Perl on the system and won't install it.
Edit:
I gave up and went with read.xls() from the xlsReadWrite package. Apart from the name problem, it turned out RODBC can't really read cells with special signs like slashes. A date in the format "dd/mm/yyyy" just gave NA.
Looking at the source code of sqlFetch, sqlQuery and sqlGetResults, I realized the problem is more than likely in the drivers. Somehow the first line of the sheet is seen as some column feature instead of an ordinary cell. So instead of colnames, they're equivalent to DB field names. And that's an option you can't set...
Can you use the Perl-based solution in the gdata package instead? That happens to be portable too...
