How to read in multiple CSV files from an XSLT file and output a single XML file - c#-4.0

I plan to use Saxon for an XSLT problem. My program needs to run on a schedule, and each time it runs it needs to pick up all the CSV files in a directory. The number of files varies from run to run, but once they have been processed another process clears them from the folder. Originally there was only one CSV file with a fixed name, so referencing it in the XSLT wasn’t a problem. I could also set the filename programmatically at runtime, so everything worked well. My XSLT now needs to know about all the files so that I can output a single XML file. I’m not sure whether I can pass in a directory path and let the XSLT read in all the files at that location. Is there a command to do this, or is there a better way? Remember that I don’t know how many CSV files will be in the folder when the XSLT is run.

See www.saxonica.com/documentation/sourcedocs/intro.xml. You can use the collection() function to read in all the files from a directory, e.g.
<!-- '\r?\n' also splits files that use Windows (CRLF) line endings -->
<xsl:for-each select="collection('file:///C:/dir/subdir?select=*.csv;unparsed=yes')/tokenize(., '\r?\n')">
  <line><xsl:value-of select="."/></line>
</xsl:for-each>
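If doing the merge outside XSLT is acceptable (the "better way" part of the question), the same idea can be sketched in Python using only the standard library. This is not Saxon or XSLT; it is a different technique for the same task. The function name csv_dir_to_xml and the element names files, file, line and field are hypothetical, chosen only for illustration, and the sketch assumes the CSV files are UTF-8 encoded.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET


def csv_dir_to_xml(directory, pattern="*.csv"):
    """Merge every CSV file found in `directory` into one XML tree."""
    root = ET.Element("files")
    # sorted() keeps the output order stable from one run to the next
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        file_el = ET.SubElement(root, "file", name=os.path.basename(path))
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                line_el = ET.SubElement(file_el, "line")
                for value in row:
                    ET.SubElement(line_el, "field").text = value
    return ET.ElementTree(root)
```

A call such as csv_dir_to_xml(r"C:\dir\subdir").write("merged.xml", encoding="utf-8", xml_declaration=True) would then write the single output file, however many CSV files happen to be in the folder at that moment.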

Related

import multiple excel files to database in pentaho 6

I want to import multiple Excel files into my database in a loop. For example, I want to iterate over all the Excel files and import each one into the database in turn.
I ask because when I try to import all the files in the folder at once, I can import at most 2 files. With three files I get errors related to RAM.
Thank you in advance.
You can use a Get file names step as an input to collect all the Excel files.
Feed the output of the Get file names step into the Microsoft Excel input step, which has a checkbox to accept filenames from a previous step.
For this to work, all the Excel files must have the same structure. If their structures differ, you'll have to inject metadata describing the differences in each file, and you'll have to build logic in earlier transformations to determine the metadata to inject.

Are excel files stored internally as XML files?

I have come to understand that Excel files (.xlsx) are essentially archives of XML files internally. I even tried verifying this by extracting an .xlsx file on my local machine.
If that's the case, how exactly are Excel files stored, what is their structure, and how do they work?
I also know they can be parsed with the SAX parser of the Apache POI API.
Please help.
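The observation in the question can be checked from code: an .xlsx file is a ZIP package whose parts are mostly XML, typically including [Content_Types].xml, xl/workbook.xml, xl/sharedStrings.xml and one xl/worksheets/sheetN.xml per sheet. A minimal sketch using Python's standard zipfile module follows; the helper names list_xlsx_parts and read_xlsx_part are hypothetical, and the sketch assumes the requested part is UTF-8 encoded XML.

```python
import zipfile


def list_xlsx_parts(path):
    """Return the names of the parts stored inside an .xlsx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_xlsx_part(path, part_name):
    """Return one part (for example "xl/workbook.xml") decoded as text."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name).decode("utf-8")
```

Printing list_xlsx_parts("book.xlsx") for a real workbook shows how the sheets, shared strings and styles are split across separate XML parts.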

azure data factory: iterate over millions of files

Previously I had a problem merging several JSON files into one single file,
which I was able to resolve with the answer to this question.
At first, I tried with just some of the files by using wildcards in the file name in the connection section of the input dataset. In theory, removing the file name should load all of the files in all folders recursively, since I checked the copy recursively option in the source section of the copy activity.
The problem is that when I manually trigger the pipeline after removing the file name from the input dataset, only some of the files get loaded: the task ends successfully after loading only around 400+ files, although each folder has 1M+ files. I want to create big CSV files by merging all the small JSON files from the source (I was already able to create a CSV file by mapping the schemas in the copy activity).
It is probably stopping due to a timeout or an out-of-memory exception.
One solution is to loop over the contents of the directory using
Directory.EnumerateFiles(searchDir)
This way you can process all the files without holding the full file list, or the contents of every file, in memory at the same time.
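The same lazy-enumeration idea can be sketched in Python, where os.scandir plays the role of Directory.EnumerateFiles. This is only an illustration under stated assumptions: each JSON file is assumed to hold one flat object, and the names iter_json_files, merge_json_to_csv and fieldnames are hypothetical. It runs on a local or mounted file system and is not a Data Factory activity.

```python
import csv
import json
import os


def iter_json_files(search_dir):
    """Yield paths of .json files one at a time, recursing into subfolders."""
    with os.scandir(search_dir) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_json_files(entry.path)
            elif entry.is_file() and entry.name.endswith(".json"):
                yield entry.path


def merge_json_to_csv(search_dir, out_path, fieldnames):
    """Write one CSV row per JSON file; only one file is loaded at a time."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        # Keys not listed in fieldnames are ignored; missing keys become ""
        writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for path in iter_json_files(search_dir):
            with open(path, encoding="utf-8") as handle:
                writer.writerow(json.load(handle))
            count += 1
    return count
```

Because iter_json_files is a generator, the directory listing is never materialized as a list, so memory use stays roughly constant regardless of how many files a folder contains.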

How can I use python to edit docx and/or doc file tags on a windows system?

I have a folder with a large number of .doc and .docx files. I would like to develop a Python script to edit the tags of each file so that I can find a file in the folder by its tags, making my life a little easier.
I am unsure how to even start and was hoping someone could point me to a library or provide some sample code to help me get started.
I am not sure whether the file extension matters, because this seems to be a Windows property (right-click file > Properties > Details > Tags > type in tags), but if the extension matters I can change all the files to .docx.
The python-docx package provides methods to access most of the metadata in a Word file. In particular, the class docx.opc.coreprops.CoreProperties lets you modify author, category, etc. I didn't see tags mentioned, but with some more research I'm sure you can find it.
docx.opc.coreprops.CoreProperties.keywords can be used to update the tags of a .docx file.
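As a minimal sketch of the keywords approach, assuming the third-party python-docx package is installed: the helper names format_keywords and set_docx_tags are hypothetical, and joining tags with "; " is an assumption about how Windows splits the Tags field, not something python-docx requires. python-docx reads .docx files only, so .doc files would need converting first.

```python
def format_keywords(tags):
    """Join tags into one keywords string, dropping blanks and duplicates."""
    seen = []
    for tag in tags:
        tag = tag.strip()
        if tag and tag not in seen:
            seen.append(tag)
    # Assumption: a semicolon-separated list is shown as separate tags
    return "; ".join(seen)


def set_docx_tags(path, tags):
    """Write `tags` to the keywords core property of the .docx at `path`."""
    # Imported here so format_keywords works without python-docx installed
    from docx import Document  # third-party: pip install python-docx

    document = Document(path)
    document.core_properties.keywords = format_keywords(tags)
    document.save(path)  # overwrites the original file in place
```

Looping over the folder and calling set_docx_tags(path, ["invoice", "2019"]) on each .docx file would tag them in bulk; it is worth trying this on copies first, since the file is saved in place.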

How could I access the source code of a .one OneNote file?

I've tried renaming the .one file to .zip, as can be done with .docx files to access their source, but .one doesn't seem to work like that.
I've also tried opening it with Notepad++, but it isn't in a plain-text format.
I regard this as a programming question because:
I'm using content-editing automation scripts (e.g. RegEx-based find-and-replace scripts). Accessing the source code of .one files would help me apply bulk automated edits to their content using RegEx.
.one files aren't technically source code. They contain the data that describes the pages in a section and their content.
Opening them as text won't show you anything meaningful, because they are binary data.
Microsoft has published how the data in .one files is structured in the following documentation. You can use it to parse the binary file and obtain the information you need.
https://msdn.microsoft.com/en-us/library/dd924743(v=office.12).aspx
https://support.office.com/en-us/article/File-format-changes-in-OneNote-2016-for-Windows-a9129622-1755-470b-91e7-b2a461194036
The .one file format is very complicated because it has to store images and all revisions, so it is binary rather than XML-based like the rest of the Office suite.
That said, if you do want to see the XML structure of the notebook or of specific page content, you can use OMSpy:
https://blogs.msdn.microsoft.com/johnguin/2011/07/28/onenote-spy-omspy-for-onenote-2010/
It works fine with OneNote 2016 Desktop.
