I want to create a BigQuery table out of some files that are present in our organization's secured SharePoint space.
The files are added on a weekly basis, and I need to set up a pipeline that I can use to ingest the data into BigQuery. So far I have been following a manual process of downloading the files and uploading them to a GCS bucket, but that no longer seems feasible.
Any help will be appreciated.
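To take the manual steps out, one option is a small script run on a weekly schedule (e.g. Cloud Scheduler triggering a Cloud Function, or Cloud Composer) that does the GCS upload and the BigQuery load for you. Below is a rough sketch using the google-cloud-storage and google-cloud-bigquery client libraries; it assumes the file has already been pulled down from SharePoint (e.g. via the Microsoft Graph API, not shown here), and the bucket, dataset and table names are placeholders.

```python
from google.cloud import bigquery, storage


def upload_and_load(local_path: str, bucket_name: str, table_ref: str) -> None:
    # 1. Upload the file downloaded from SharePoint to the GCS bucket
    storage_client = storage.Client()
    blob = storage_client.bucket(bucket_name).blob(f"weekly/{local_path.rsplit('/', 1)[-1]}")
    blob.upload_from_filename(local_path)

    # 2. Load the GCS object into BigQuery, appending to the existing table
    bq_client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # adjust if the files are not CSV
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = bq_client.load_table_from_uri(
        f"gs://{bucket_name}/{blob.name}", table_ref, job_config=job_config
    )
    load_job.result()  # block until the load job finishes


upload_and_load("orders_week_01.csv", "my-ingest-bucket", "my_dataset.weekly_orders")
```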
I have created a JMeter test that randomly selects orders from a pool of 20K .json data files.
I need to upload the .json files along with the .jmx file; however, the Azure Load Testing UI allows the upload of at most 10 files.
I have read the documentation and I could not find anything relevant on how to upload the 20k data files.
Is there a way to upload the 20k files to my test in one go?
Thanks,
P.
I just found out that there is an Azure Load Testing API that I can use to upload my data files.
More information can be found in the following link:
https://learn.microsoft.com/en-us/rest/api/loadtesting/dataplane/test/upload-test-file?tabs=HTTP
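For example, a loop like the one below could push the 20K .json files one by one through that upload-test-file operation. This is only a sketch: the data-plane URI, test id, api-version and token scope are placeholders/assumptions that should be checked against the linked documentation.

```python
import pathlib

import requests
from azure.identity import DefaultAzureCredential

DATA_PLANE_URI = "<resource>.<region>.cnt-prod.loadtesting.azure.com"  # your resource's data-plane URL
TEST_ID = "my-jmeter-test"                                  # id of the test that owns the .jmx
API_VERSION = "2022-11-01"                                  # assumed; use the version from the docs
SCOPE = "https://cnt-prod.loadtesting.azure.com/.default"   # assumed token scope

token = DefaultAzureCredential().get_token(SCOPE).token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/octet-stream"}

for path in sorted(pathlib.Path("data").glob("*.json")):
    # PUT /tests/{testId}/files/{fileName} uploads one file to the test
    url = f"https://{DATA_PLANE_URI}/tests/{TEST_ID}/files/{path.name}"
    with open(path, "rb") as f:
        resp = requests.put(url, params={"api-version": API_VERSION}, headers=headers, data=f)
    resp.raise_for_status()
```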
Thanks,
P.
We have files partitioned in the data lake and are using an Azure Synapse SQL serverless pool to query them through external tables before visualising them in Power BI.
Files are stored in the following partition format {source}/{year}/{month}/{filename}_{date}.parquet
We then have an external table that loads all files for that source.
For sources whose files increment each day this is working great, as we want all files to be included. However, for some integrations we want to return only the latest file (i.e. the latest file sent to us is the current state that we want to load into Power BI).
Is it possible in the external table statement to only return the latest file? Or do we have to add extra logic?
We could load all the files in, and then filter for the latest filename and save that in a new location. Alternatively we could try to create an external table that changes every day.
Is there a better way to approach this?
If you are using dedicated pools, then I would alter the location of your table to point at the latest files folder.
Load each day into a new folder and then alter the LOCATION of the external table to look at the current/latest day, but you might need additional logic in a control table to track what the latest successful load date is.
Unfortunately I have not found a better way to do this myself.
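A rough sketch of scripting that daily re-point is below, done here by dropping and recreating the external table, since changing an external table's LOCATION in place isn't supported everywhere. It assumes a pyodbc connection; the table, columns, data source (datalake_ds), file format (parquet_ff) and control-table names are all placeholders.

```python
from datetime import date

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=<db>;UID=<user>;PWD=<password>"
)
latest_folder = f"source_a/{date.today():%Y/%m/%d}/"  # or read the latest folder from your control table

cur = conn.cursor()
cur.execute("IF OBJECT_ID('dbo.latest_snapshot') IS NOT NULL DROP EXTERNAL TABLE dbo.latest_snapshot;")
cur.execute(f"""
CREATE EXTERNAL TABLE dbo.latest_snapshot (
    order_id   INT,
    order_date DATE,
    amount     DECIMAL(18, 2)   -- placeholder columns; match your parquet schema
)
WITH (
    LOCATION    = '{latest_folder}',
    DATA_SOURCE = datalake_ds,
    FILE_FORMAT = parquet_ff
);
""")
# record the successful re-point so you know which day is currently loaded
cur.execute(
    "INSERT INTO dbo.load_control (loaded_folder, loaded_at) VALUES (?, SYSUTCDATETIME());",
    latest_folder,
)
conn.commit()
```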
I'm new to the Azure environment and have been watching tutorials and reading documentation, but I'm trying to figure out how to set up a flow for the process described below. The starting point is a set of reports in .xlsx format produced monthly by the Marketing Dept: the requirement is to bring them into an Azure SQL DB so the data can be stored and analysed. So far I have managed to put those files (previously converted to .csv manually) into Blob storage and build an ADF pipeline that copies each file into a table in the SQL DB.
The problem is that, as far as I understand, ADF cannot directly handle .xlsx files, and I'm wondering how to set up an automated procedure that converts the files from .xlsx to .csv and saves them to Blob storage. I was thinking of adding a Python script/Databricks notebook to the pipeline to convert the format, but I'm not sure that is the best solution. Any hint/reference to existing tutorials or resources would be much appreciated.
I found a tutorial which uses Logic Apps to do the conversion.
Datanovice indirectly suggested using a Custom activity to run either a C# or Python application to do the conversion for you.
The least expensive solution would be to do the conversion before uploading to blob, like Datanovice said.
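As a rough sketch of that "convert before uploading" route: the snippet below uses pandas to turn each monthly .xlsx report into a .csv and pushes it to Blob storage for the existing ADF pipeline to pick up; the same code could also run inside a Custom activity or a Databricks notebook. The folder, container and connection-string values are placeholders, and pandas needs openpyxl installed to read .xlsx.

```python
import pathlib

import pandas as pd
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-account-connection-string>"
CONTAINER = "marketing-reports"

container_client = BlobServiceClient.from_connection_string(CONN_STR).get_container_client(CONTAINER)

for xlsx_path in pathlib.Path("reports").glob("*.xlsx"):
    # read the first sheet; pass sheet_name=... if the report has several
    df = pd.read_excel(xlsx_path)
    csv_path = xlsx_path.with_suffix(".csv")
    df.to_csv(csv_path, index=False)

    # upload the converted file so the ADF copy pipeline can load it into SQL DB
    with open(csv_path, "rb") as data:
        container_client.upload_blob(name=csv_path.name, data=data, overwrite=True)
```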
I have some dynamically created files in a blob storage container. I want to send them through email as a single attachment.
The total file size is less than 5 MB.
The difficulty I am facing is that when I try to compress the files using the Copy Data activity's compression options, the compressed/zipped file is not created properly when it contains multiple files.
If I zip a single file by giving its full path and filename, it works fine. But when I give a folder name to compress all the files in that folder, it does not work correctly.
Please note that here I am not using any kind of external C# code or libraries.
Any help appreciated
Thank you
You can reference my settings in the Data Factory Copy activity:
Source settings:
Source dataset settings:
Sink settings:
Sink dataset settings:
Pipeline works ok:
Check the zip file in container containerleon:
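If it helps, a quick way to check the result outside the portal is to download the zip blob and list its entries. This is just a sketch; the blob name and connection string are placeholders, and the container name is the one above.

```python
import io
import zipfile

from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-account-connection-string>"

blob_client = BlobServiceClient.from_connection_string(CONN_STR).get_blob_client(
    container="containerleon", blob="output.zip"  # blob name is a placeholder
)
data = blob_client.download_blob().readall()

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    # each source file should appear as a separate entry in the archive
    for name in zf.namelist():
        print(name)
```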
Hope this helps.
I'm in the process of evaluating the possibilities of lifting and shifting my SSIS packages to ADFv2, but without testing I'm finding it hard to see whether all SSIS functionality is supported.
For example, my package unzips files, modifies the contents of files (script task) saving new versions to a different directory, loads the modified files to the DB, updates data, etc.
What I'm not sure about is unzipping the files (I don't want to transfer unzipped files from on-prem) and also modifying files with the script task. I believe these would have to be moved outside of SSIS and created as ADF activities, leaving only the loading of files, updating of data, etc. in my SSIS package? Probably with the files stored in Blob storage?
Or can all this still be done directly in SSIS?
Thanks
What you currently do using SSIS on premises, you could also do using SSIS in ADF. For example, you could install additional (un)zip programs using custom setup and utilize the %TEMP% folder/current working directory (".") of your SSIS IR to modify files, see
https://learn.microsoft.com/en-us/azure/data-factory/how-to-configure-azure-ssis-ir-custom-setup
https://learn.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-files-file-shares?view=sql-server-2017