I'm trying to implement a process using Data Factory and Databricks to ingest data into Data Lake and convert it all to a standard format, i.e. Parquet, so we'll have a raw data tier and a clean/standardized data tier.
When the source system is a DB or delimited files it's (relatively) easy, but in some cases we will have Excel sources. I've been testing the conversion process with com.crealytics.spark.excel, which is OK because we can infer the schema, but it's not able to iterate through multiple sheets or get the list of sheet names to enable me to iterate through each one and convert them into a single file.
I need this to be as dynamic as possible so that we can ingest almost any file regardless of its type or schema.
Does anyone know of any alternative methods of doing this? I'm open to moving away from Databricks if necessary, such as Azure Batch with a custom C# script.
Thanks in advance!
Since you are aiming to store the data in Azure Data Lake, another approach may be to use Azure Data Lake Analytics with a custom Excel extractor. U-SQL then can convert it into Parquet. See here for a sample Excel extractor.
How much variability do you expect with the Excel sheets?
The main problem here will be that it is hard to be completely schema-agnostic, especially if you have many columns. To handle variability of the schema, you could change the extractor to output the columns either as key/value pairs or - if the number of columns and the size of a row are reasonable - as a SqlMap (or a few, for different target types). You would probably have to pivot it back into a column format before creating the Parquet, though, which would require either a second script to generate the pivoting script or a custom outputter (instead of the built-in Parquet outputter).
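If you would rather stay on Databricks, a rough workaround (a sketch only, not a drop-in implementation) is to list the sheet names on the driver with openpyxl and loop over them with spark-excel's per-sheet dataAddress option, assuming a spark-excel version that supports it and that openpyxl is installed on the cluster; paths and the schema compatibility of the sheets are assumptions here:

```python
# Sketch only: assumes an ADLS/DBFS mount and that the sheets share a
# compatible schema so they can be appended into a single Parquet output.
# `spark` is the SparkSession provided by the Databricks notebook environment.
from openpyxl import load_workbook
from pyspark.sql.functions import lit

# The same mount is seen by local file APIs under /dbfs and by Spark under /mnt.
local_path = "/dbfs/mnt/raw/report.xlsx"     # hypothetical source file
spark_path = "/mnt/raw/report.xlsx"
target_path = "/mnt/clean/report.parquet"    # hypothetical output location

# openpyxl only needs the workbook metadata to list the sheet names
sheet_names = load_workbook(local_path, read_only=True).sheetnames

for sheet in sheet_names:
    df = (spark.read
          .format("com.crealytics.spark.excel")
          .option("dataAddress", f"'{sheet}'!A1")   # per-sheet read (newer spark-excel versions)
          .option("header", "true")
          .option("inferSchema", "true")
          .load(spark_path))
    # Record which sheet each row came from, then append everything to one output
    df.withColumn("source_sheet", lit(sheet)) \
      .write.mode("append").parquet(target_path)
```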
Context:
I have a large number of CSV files containing logged information.
Some of the log files may "overlap" in that the same information may exist in two or more CSV files.
Not all of the information in the log files is required, so a Python script is used to extract only the relevant information for insertion into a SQLite database for easy querying.
The information in the log file that is "relevant" is a serial number, error ID, timestamp of event start, timestamp of event end, error description, and longitude and latitude at the time of the error event.
Problem:
I want to ensure that information duplicated in the CSV files is not fed into the SQL database.
Since the data has a serial number, timestamp and location, this should aid in filtering repeated events.
I did think that I could create a hash of the relevant information from the CSV file as Python parses it, and use this to determine whether the "same" record already exists in the SQL database being added to, but maybe this isn't very efficient?
I guess the alternative is for SQL to only add the information if it doesn't already exist, but I'm not entirely sure how to do this.
Which would be the most efficient way of achieving this?
I know how to hash the data (by putting it into a tuple) in Python and to not add a record if a hash already exists but I'm not sure whether SQL can already do this for me.
If you have a unique identifier in your various CSV files which can help you to filter duplicated information, it's quite easy to build a table with this ID as primary key and use the ON CONFLICT clause in your insert query so the same row is not inserted several times. Here is an example of the table; of course you need other columns for the remaining data:
CREATE TABLE data (
id TEXT PRIMARY KEY
);
Then you can safely deduplicate the data with an insert clause such as:
INSERT INTO data (id)
VALUES (?)
ON CONFLICT DO NOTHING
Duplicated data will just be ignored.
You can read the SQLite documentation page on the INSERT statement: https://www.sqlite.org/lang_insert.html
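Tying the question and the answer together, here is a minimal Python sketch that uses a hash of the identifying fields as the primary key and lets SQLite drop duplicates. Table, file and column names are placeholders; ON CONFLICT needs SQLite 3.24 or newer, and INSERT OR IGNORE is the older equivalent:

```python
import csv
import hashlib
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS data (
        id          TEXT PRIMARY KEY,   -- hash of the identifying fields
        serial_no   TEXT,
        error_id    TEXT,
        start_ts    TEXT,
        end_ts      TEXT,
        description TEXT,
        longitude   REAL,
        latitude    REAL
    )
""")

def row_key(row):
    """Deterministic hash over the fields that define the 'same' event."""
    parts = (row["serial_no"], row["error_id"], row["start_ts"],
             row["end_ts"], row["longitude"], row["latitude"])
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

with open("log_001.csv", newline="") as f:          # placeholder file name
    rows = [(row_key(r), r["serial_no"], r["error_id"], r["start_ts"],
             r["end_ts"], r["description"], r["longitude"], r["latitude"])
            for r in csv.DictReader(f)]

# ON CONFLICT DO NOTHING requires SQLite 3.24+; INSERT OR IGNORE works on older versions.
conn.executemany("""
    INSERT INTO data (id, serial_no, error_id, start_ts, end_ts,
                      description, longitude, latitude)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    ON CONFLICT DO NOTHING
""", rows)
conn.commit()
```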
I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to ELT (Extract, Load, Transform). Doesn't extracting this data, transforming it and loading it into a table get us right back where we started?
IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is a product called Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also with blob storage, SQL Server on a VM (IaaS), the two PaaS database offerings (SQL Database and SQL Data Warehouse) and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .NET, for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch job and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, i.e. they don't yet have endpoints / aren't visible; e.g. Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH
By definition a data lake is a huge repository storing raw data in its native format until needed. Lakes use a flat architecture rather than a nested one (http://searchaws.techtarget.com/definition/data-lake). Data in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ETL process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panopoly for a discussion of ELT vs ETL.) For example, you want to see customer data from 2010. When you query a data lake for that you will get everything from accounting data, CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.
To me, the answer is "money" and "resources"
(and probably correlated to use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms and it comes down to the cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems often are based either on some analytics software or homegrown system that, over time, has a deeply embedded expectation for clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing the exact same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because you're often talking about hundreds to thousands of folks who are used to using Excel, with some knowing SQL. In my experience, few end users and few analysts I've worked with know how to program. Statisticians and data engineers tend towards R/Python, and developers with Java/C# experience tend towards Scala/Python.
Data Types are a clincher for what tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "Data Rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to get a result operationalized... because support is hard to find for Spark.
I have a question about the approach to a solution in Azure: how to decide what technologies to use and how to find the best combination of them.
Let's suppose I have two data sets, which are growing daily:
I have a CSV file which arrives daily in my ADL store and contains weather data for all possible latitude and longitude combinations and the zip codes for them, together with 50 different weather variables.
I have another dataset with POS (point of sales), which also comes as a daily CSV file to my ADL storage. It contains sales data for all retail locations.
The desired output is to have the files "shredded" in a way that the data is prepared for AzureML forecasting of sales based on weather, with the forecasting done per retail location and delivered via a Power BI dashboard to each one of them. A requirement is not to allow any location to see the forecasts for other locations.
My questions are:
How do I choose the right set of technologies?
How do I append the incoming daily data?
How do I create a separate ML forecasting results for each location?
Any general guidance on the architecture topic is appreciated, and any more specific ideas on comparison of different suitable solutions is also appreciated.
This is way too broad a question.
I will only answer your ADL-specific question #2 and give you a hint on #3 that is not related to Azure ML (since I don't know what that format is):
If you just use files, add date/time information to your file path name (either in folder or filename). Then use U-SQL File sets to query the ranges you are interested in. If you use U-SQL Tables, use PARTITIONED BY. For more details look in the U-SQL Reference documentation.
If you need to create more than one file as output, you have two options:
a. If you know all the file names, write an OUTPUT statement for each file, selecting only the relevant data for it.
b. Otherwise you have to dynamically generate a script and then execute it. Similar to this.
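Not U-SQL, but as a rough analogue of the same two ideas in PySpark (hypothetical paths and column names, assuming the store is mounted in a Spark/Databricks environment): date information in the folder path lets you read only the range you need, and partitioning the output by a location column gives you one folder per retail location without generating a script per file.

```python
# Sketch only: paths, column names and the zip_code join key are assumptions.
# `spark` is the SparkSession provided by the notebook environment.
daily_pos = (spark.read
             .option("header", "true")
             .csv("/mnt/adls/pos/2017/05/*/pos.csv"))       # date range selected via path

weather = (spark.read
           .option("header", "true")
           .csv("/mnt/adls/weather/2017/05/*/weather.csv"))

# Join on a shared zip code and write one folder per retail location, which
# keeps each location's forecast input physically separated.
(daily_pos.join(weather, on="zip_code")
          .write.mode("overwrite")
          .partitionBy("location_id")
          .parquet("/mnt/adls/prepared/pos_weather"))
```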
Does anyone have an idea of the best way to implement continuous replication of some DB tables from Azure SQL DB to Azure SQL DB (PaaS) in an incremental way?
I have tried Data Sync preview (the schema does not load even after a couple of hours), and
Data Factory (Copy Data), which is fast but always copies the entire data set (duplicate records) rather than working incrementally.
Please suggest.
What is the business requirement behind this request?
1 - Do you have some reference data in database 1 and want to replicate that data to database 2?
If so, then use cross database querying if you are in the same logical server. See my article on this for details.
2 - Can you have a duplicate copy of the database in a different region? If so, use active geo-replication to keep the database in sync. See my article on this for details.
3 - If you just need a couple tables replicated and the data volume is low, then just write a simple PowerShell program (workflow) to trickle load the target from the source.
Schedule the program in Azure Automation on a timing of your choice. I would use a flag to indicate which records have been replicated.
Place the insert into the target and update of the source flag in a transaction to guarantee consistency. This pattern is a row by agonizing row pattern.
You can even batch the records. Look into using the SqlBulkCopy class in the System.Data.SqlClient library of .NET. (A rough Python sketch of this flag-based pattern is shown after these suggestions.)
4 - Last but not least, Azure SQL Database now supports the OPENROWSET command. Unfortunately, in the cloud this feature is a read-only, from-blob-storage file pattern. The older on-premises versions of the command allow you to write to a file.
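A minimal Python sketch of the flag-based trickle load from suggestion 3 (the original proposal is a PowerShell workflow; connection strings, table and column names are placeholders):

```python
import pyodbc

# Placeholder connection strings for the source and target Azure SQL databases.
SRC_CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=src...;Database=...;"
TGT_CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=tgt...;Database=...;"

src = pyodbc.connect(SRC_CONN_STR, autocommit=False)
tgt = pyodbc.connect(TGT_CONN_STR, autocommit=False)
src_cur = src.cursor()
tgt_cur = tgt.cursor()

# Pull only the rows that have not been replicated yet (flag column assumed).
src_cur.execute("SELECT id, payload FROM dbo.Orders WHERE replicated = 0")
rows = src_cur.fetchall()

try:
    for row_id, payload in rows:
        tgt_cur.execute("INSERT INTO dbo.Orders (id, payload) VALUES (?, ?)",
                        row_id, payload)
        src_cur.execute("UPDATE dbo.Orders SET replicated = 1 WHERE id = ?", row_id)
    tgt.commit()
    src.commit()   # source flags are only flipped once the target commit succeeds
except Exception:
    tgt.rollback()
    src.rollback()
    raise
```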
I hope these suggestions help.
Happy Coding.
John
The Crafty DBA
If you wanted to use Azure Data Factory, in order to do incremental updates, you would need to change your query to look at a created/modified date on the source table. You can then take that data and put it into a "Staging Table" on the destination side, then use a stored proc activity to do your insert/update into the "Real table" and finally truncate the staging table.
Hope this helps.
I was able to achieve cloud-to-cloud migration using Data Sync Preview from the Azure ASM Portal.
Below are the limitations:
Maximum number of sync groups any database can belong to: 5
Characters that cannot be used in object names: the names of objects (databases, tables, columns) cannot contain the printable characters period (.), left square bracket ([) or right square bracket (]).
Supported limits on DB Dimensions
Reference: http://download.microsoft.com/download/4/E/3/4E394315-A4CB-4C59-9696-B25215A19CEF/SQL_Data_Sync_Preview.pdf
I'm looking for detailed answers to the question: what are the pros and cons of using an Excel file as a database?
One of the pros seems to be that users are familiar with Excel and can work with the tables without needing to know about databases. There are however many reasons not to use Excel as a database.
- Even though you can do some validation in Excel, it is no match for any good database program.
- When importing data from an Excel file into, for instance, a SQL database, you often run into problems because of the misinterpretation of the value types.
- Also, when importing dates, the interpretation may fail.
- Strings like 000234 will most likely be read as numbers and end up as 234.
- As stated before, the sharing of the database is very limited.
- But one of my main concerns about using Excel as a database is the fact that it is a single file that can easily be copied to various locations, which may leave you with several versions of it containing different data.
Cons: size/performance, sharing
Pro: none
P.S. If VBA is an issue, why not Access?
I wouldn't really suggest that Excel is, or can properly act like, a database, as it lacks the features, data protection and security to act as such.
If the reason to use it is ease of use and end-user familiarity, it is quite easy to connect Excel as a front end to a database, using it as a reading and writing device whilst taking advantage of the speed and stability of a 'true' database.
Pros:
Very familiar
VBA makes it easy to create fairly simple-to-use sheets
Lots of functions to manipulate data
Cons:
Slow and VERY clunky with large data sets
Hard to validate imported data
Prone to crashing with large data sets
Lacks the ability to use intelligent queries or views
Many more..