run ibm data stage job with different file in same job - excel

I created a job to input excel data into database. I need the job to be reusable for different excel version. The columns of the excel will be the same but only the values will change, it's like inserting newest excel values version to the database.
Example, the file of sales_report_january.xlsx , sales_report_february.xlsx both have same columns and only the row values is different. I need the job to be able to process both files without changing anything else except the file path. Because recreating different job with the same everything(except for the filepath) for the same task seems inefficient.
Is it available to do this in ibm data stage or do i need to remap everything despite it doesn't need any change? i already tried it by changing the file path manually but it raised error.

In a word: Parameter
Construct your job using a job parameter for the pathname of the Excel workbook.
Whichever stage you are using to read the worksheet will have the workbook name set up as reference(s) to that parameter.
Tip: Use two parameters; one for the dirname part of the pathname and one for the actual name of the workbook. This is a more flexible design in the long run.

I can think of at least four ways to do this. Usually, if the files are all in the same directory, we use looping in the sequence job to process a list of the file names obtained through an appropriate command (such as ls -m pattern for UNIX/Linux). Capture the output, convert the newlines to a delimiter such as comma if necessary, and use that list in the StartLoop activity.

Related

Azure Data Factory removing spaces from column names of csv file

I'm a bit new to azure data factory so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits.
So the situation is that we have an existing pipeline that will take the path to a csv file and pass this in as a delimited data set. As a sink it is using a parquet data set. This is a generic process that we can pass any delimited file into and it will output it as parquet.
This has been working well but now we have started receiving files with spaces and special characters in the header which causes the output to parquet to fail. Unfortunately we don't have control over the format of the files we receive so I can't handle this at source.
What I would like to do is on ingestion of the file replace any spaces and other special characters in the header with an underscore. If I were doing this on premise I could quickly create a powershell script to do it. I had thought about creating a custom task in AFD to call a powershell script to do this in the blob storage but that seems more complicated than it should be. Is there something else I can do to get this process working while keeping it generic?
As #Joel Cochran mentioned, you can use the below expression in Select transformation to replace space and special characters in the header.
regexReplace($$,'[^a-zA-Z]','_')
Source:
In Select transformation, remove the auto mappings and add new rule base mapping to use this expression.
preview:
You can change the output filename not directly in the Copy activity, assuming you are using this activity.
The workaround is to use a parameter for the filename output that you can cleanup.
You can use the Get Metadata activity to get all filenames from the source csv files.
Then loop over these files with a foreach activity.
Within the foreach activity you can set the output filename with the new name with the cleaned value.
The function could look like this:
#replace(item().name, ' ', '_')
More information on the replace function

Unable to copy file from SFTP in Azure Data Factory when using wildcard(*) in the filename

I am unable to copy csv files from an SFTP connection to blob storage when using the wildcard(*) in the filename.
More specifically, I receive csv files in the SFTP on a daily basis, and they are of the format: "ddMMyyyyxxxxxx.csv", where "xxxxxx" is the timestamp. More concretely, my csv file for the 13th of March is: "13032019083647.csv", while for the 14th of March: "14032019083556.csv". Obviously, the timestamp is different for every day, thus I want to copy the file independently of whatever strings exists between the date and the the file extenstion.
In the "File" subfield of the "File path" of the "Connection" tab of my subset, I give as input: "13032019*.csv", as instructed by the help icon next to the field:
When I do so, my Debug run fails with:
{"errorCode": "2200", "message":
"ErrorCode=UserErrorInvalidCopyBehaviorBlobNameNotAllowedWithPreserveOrFlattenHierarchy,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot
adopt copy behavior PreserveHierarchy when copying from folder to a
single file.,Source=Microsoft.DataTransfer.ClientLibrary}
I receive a similar error no matter which type of copy behaviour I choose. I have also tried experimenting with the fileFilter parameter (even though ADF warns that the same behaviour can be achieved with the fileName option), but I still end up getting the same error.
For further clarification, I am attaching the Code segment that ADF produces for this configuration:
I should also mention, that when using the full fileName in the corresponding field, namely the value: "13032019083647.csv", copying works normally.
Any help would be greatly appreciated!
My guess it might get two files with wildcard operation.
In such cases we need to use metadata activity, filter activity and for-each activity to copy these files.
1.Metadata activity : Use data-set in these activity to point the particular location of the files and pass the child Items as the parameter.
2.Filter activity : Use filter to filter the files based on your needs.
3.For-each activity : In the For-each activity get Items from the previous activity and add copy activity inside the for-each.
In copy activity the source data set should be #item().name.
I hope this will solve your issue.
What worked for me was the following: I kept the same regex for the input file, but I defined as "Copy behaviour: Merge Files". Since as mentioned, there is only 1 file that satisfies the regex condition, only 1 file was created as output. I am aware that this is a sort of "dirty" solution, but it did the trick for me.

Skip element in BizTalk flat file assembly?

I've been tasked to map an input xml (actually an SAP idoc xml), and to generate a number of flat files. Each input xml may yield multiple output files (one output file per lot number), so I will be using xsl:key and the key() function in my mapping, based on the lot number
The thing is, the lot number itself will not be in the file itself, but the output file name needs to contain that lot number value.
So the question really is: can I map the lot number to the xml and have the flat file assembler skip it when it produces the file? Or is there another way the lot number can be applied as file name by the assembly without having it inside the file itself?
In your orchestration you can set a context property for each output message:
msgOutput(FILE.ReceivedFileName) = "DynamicStuff";
msgOutput then goes to the send shape.
In your send port you set the output file like this:
FixedStuff_%SourceFileName%.xml
The result:
FixedStuff_DynamicStuff.xml
If the value is not required in the message content, don't map it. That's it.
To insert at value in the file name, lot number in this case, you will need to promote that value to the FILE.ReceivedFileName Context Property. Then, you can use the %SourceFileName% Macro as part of the name setting in the Send Port. You can set FILE.ReceivedFileName by either Property Promotion or xpath() in an Orchestration.
Bonus: Sorting and Grouping in xslt is rather unwieldy, which is why I don't do that anymore. Instead, you can use SQL: BizTalk: Sorting and Grouping Flat File Data In SQL Instead of XSL

Create automated report from web data

I have a set of multiple API's I need to source data from and need four different data categories. This data is then used for reporting purposes in Excel.
I initially created web queries in Excel, but my Laptop just crashes because there is too many querie which have to be updated. Do you guys know a smart workaround?
This is an example of the API I will source data from (40 different ones in total)
https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all
The data points I need are:
EstimatedMonthlyVisits, TopOrganicKeywords, OrganicSearchShare, TrafficSources
Any ideas how I can create an automated report which queries the above data on request?
Thanks so much.
If Excel is crashing due to the demand, and that doesn't surprise me, you should consider using Python or R for this task.
install.packages("XML")
install.packages("plyr")
install.packages("ggplot2")
install.packages("gridExtra")
require("XML")
require("plyr")
require("ggplot2")
require("gridExtra")
Next we need to set our working directory and parse the XML file as a matter of practice, so we're sure that R can access the data within the file. This is basically reading the file into R. Then, just to confirm that R knows our file is in XML, we check the class. Indeed, R is aware that it's XML.
setwd("C:/Users/Tobi/Documents/R/InformIT") #you will need to change the filepath on your machine
xmlfile=xmlParse("pubmed_sample.xml")
class(xmlfile) #"XMLInternalDocument" "XMLAbstractDocument"
Now we can begin to explore our XML. Perhaps we want to confirm that our HTTP query on Entrez pulled the correct results, just as when we query PubMed's website. We start by looking at the contents of the first node or root, PubmedArticleSet. We can also find out how many child nodes the root has and their names. This process corresponds to checking how many entries are in the XML file. The root's child nodes are all named PubmedArticle.
xmltop = xmlRoot(xmlfile) #gives content of root
class(xmltop)#"XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) #give name of node, PubmedArticleSet
xmlSize(xmltop) #how many children in node, 19
xmlName(xmltop[[1]]) #name of root's children
To see the first two entries, we can do the following.
# have a look at the content of the first child entry
xmltop[[1]]
# have a look at the content of the 2nd child entry
xmltop[[2]]
Our exploration continues by looking at subnodes of the root. As with the root node, we can list the name and size of the subnodes as well as their attributes. In this case, the subnodes are MedlineCitation and PubmedData.
#Root Node's children
xmlSize(xmltop[[1]]) #number of nodes in each child
xmlSApply(xmltop[[1]], xmlName) #name(s)
xmlSApply(xmltop[[1]], xmlAttrs) #attribute(s)
xmlSApply(xmltop[[1]], xmlSize) #size
We can also separate each of the 19 entries by these subnodes. Here we do so for the first and second entries:
#take a look at the MedlineCitation subnode of 1st child
xmltop[[1]][[1]]
#take a look at the PubmedData subnode of 1st child
xmltop[[1]][[2]]
#subnodes of 2nd child
xmltop[[2]][[1]]
xmltop[[2]][[2]]
The separation of entries is really just us, indexing into the tree structure of the XML. We can continue to do this until we exhaust a path—or, in XML terminology, reach the end of the branch. We can do this via the numbers of the child nodes or their actual names:
#we can keep going till we reach the end of a branch
xmltop[[1]][[1]][[5]][[2]] #title of first article
xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] #same command, but more readable
Finally, we can transform the XML into a more familiar structure—a dataframe. Our command completes with errors due to non-uniform formatting of data and nodes. So we must check that all the data from the XML is properly inputted into our dataframe. Indeed, there are duplicate rows, due to the creation of separate rows for tag attributes. For instance, the ELocationID node has two attributes, ValidYN and EIDType. Take the time to note how the duplicates arise from this separation.
#Turning XML into a dataframe
Madhu2012=ldply(xmlToList("pubmed_sample.xml"), data.frame) #completes with errors: "row names were found from a short variable and have been discarded"
View(Madhu2012) #for easy checking that the data is properly formatted
Madhu2012.Clean=Madhu2012[Madhu2012[25]=='Y',] #gets rid of duplicated rows
Here is a link that should help you get started.
http://www.informit.com/articles/article.aspx?p=2215520
If you have never used R before, it will take a little getting used to, but it's worth it. I've been using it for a few years now and when compared to Excel, I have seen R perform anywhere from a couple hundred percent faster to many thousands of percent faster than Excel. Good luck.

Pass DB Connection parameters to a Kettle a.k.a PDI table Input step dynamically from Excel

I have a requirement such that whenever i run my Kettle job, the database connection parameters must be taken dynamically from an excel source on each run.
Say i have an excel with column names : HostName, Username, Database, Password.
i want to pass these connection parameters to my table input step dynamically whenever the job runs.
This is what i was trying to do.
You can achieve this by
reading the DB connection parameters from a source (e.g. Excel or in my example a CSV file)
storing the parameters in variables
using the variables in your connection setting.
Proceed as follows
Create another transformation for setting the variables (you cannot do this in the same transformation that uses it):
In the Set Variables element configure the variables:
In the element reading/writing your data create a new connection and set the connection parameters using ${variable_name}. Note that you have to blindly write ${password} into the appropriate field. Also note that this may be a security issue because the value may show up as plain text in log files!
In your job call the variable transformation first and then the functional part:
All you need is the XLS input and the Set Variables step. Define your variables as being valid in the Root job and you can use them in subsequent jobs, as long as they're called by the same root job, when defining the connection.
The "Copy rows to result" and "Get rows from result" are used to send information (rows of data) from one transformation to the next transformation or job in the same parent job. They're not used to send data between steps, that's what the hops are for.

Resources