Kukreja in “Data Engineering with Apache Spark, Delta Lake, and Lakehouse” says that a Kappa architecture has no data lake. Microsoft in https://learn.microsoft.com/en-us/azure/architecture/data-guide/big-data (see picture) mentions a “long term store” without saying what it actually is. It uses that data to “re-compute”. For me this is a data lake.
Does a Kappa Architecture use a data lake or not?
No, it does not.
Others, however, take a more pragmatic view. Note that the Microsoft diagram mentions re-computing: if a mistake has been made in processing, it's handy to be able to correct it by replaying from the long-term store.
Related
If a data lake is a repository with unstructured, semi-structured and structured data, is it physically implemented in a single DB technology? Which ones support all three types of data?
That's a broad question, but Delta Lake supports all of these data types. Of course many things are dependent on the specific access patterns, but it's all doable with Delta.
I am new to Databricks and have the following doubt -
Databricks proposes 3 layers of storage: Bronze (raw data), Silver (clean data) and Gold (aggregated data). It is clear what these layers are meant to store, but my doubt is how they are actually created or identified. How do we specify, when retrieving data, that it comes from Silver or Gold? Are these different databases, different formats, or something else?
Please help me in getting this concept clear.
These are logical layers:
the Bronze layer stores the original data without modification - the most common change is just the data format, e.g. taking input data as CSV and storing it as Delta. The main goal of the Bronze layer is to make sure you keep the original data, so you can rebuild the Silver & Gold data if necessary, for example if you find errors in the code that produces the Silver layer. Whether you need a Bronze layer depends heavily on the source of the data. For example, if your data comes from a database, you can expect it to be clean already, in which case you can ingest it directly into the Silver layer. The Bronze layer usually isn't accessed directly by end users
the Silver layer is created from Bronze by applying transformations, enrichment, and cleanup procedures. For example, if data in some column must be non-null, or fall in a certain range, you can add code like bronze_df.filter("col1 is not null") and store the result. The Silver layer can be regenerated from Bronze if you find an error in your transformations or need to add an additional check. The Silver layer is usually accessible to end users who need detailed data at the row level
the Gold layer is usually some kind of aggregated data used for reporting, dashboards, etc. There can be multiple tables in the Gold layer, generated from one or more Silver tables.
Databricks usually recommends using Delta Lake for all these layers, as it makes it easier to process data incrementally between layers, typically with Structured Streaming. But you're not limited to that. I've seen many customers output the results of the Gold layer into an Azure SQL database, NoSQL databases, or other systems, from which the data can be consumed by applications that only work with those systems.
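A minimal sketch of the flow between the three layers, using pandas in place of Spark/Delta so it runs anywhere (the column names and values are invented for illustration; on Databricks the same steps would be Spark DataFrame operations writing Delta tables):

```python
import pandas as pd

# Bronze: raw data as ingested, stored without modification
# (in Databricks you would write this out as a Delta table)
bronze = pd.DataFrame({
    "customer": ["a", "b", None, "a"],
    "amount":   [10.0, 5.0, 7.0, 2.5],
})

# Silver: cleaned copy -- drop rows that violate constraints, e.g.
# customer must be non-null (the Spark equivalent would be
# bronze_df.filter("customer is not null"))
silver = bronze[bronze["customer"].notna()]

# Gold: aggregated view for reporting, e.g. total amount per customer
gold = silver.groupby("customer", as_index=False)["amount"].sum()
print(gold)
```

Because Bronze is kept untouched, both Silver and Gold can be rebuilt from it at any time if the transformation logic changes.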
I started exploring ADX a few days back. I imported my data from Azure SQL into ADX using an ADF pipeline, but when I query that data it takes a long time. Looking for a workaround, I researched table data partitioning, and I am now fairly clear on partition types and tricks.
The problem is, I couldn't find any sample (Kusto syntax) that guides me in defining partitioning on ADX database tables. Can anyone please help me with this syntax?
The partition operator is probably what you are looking for:
T | partition by Col1 ( top 10 by MaxValue )
T | partition by Col1 { U | where Col2=toscalar(Col1) }
ADX doesn't currently have the notion of partitioning a table, though it may be added in the future.
That said, with the lack of technical information currently provided, it's somewhat challenging to understand how you reached the conclusion that partitioning your table is required and is the appropriate solution, as opposed to the (many) other directions ADX does allow you to pursue.
If you would be willing to detail what actions you're performing, the characteristics of your data & schema, and which parts are performing slower than expected, that may help in providing a more meaningful and helpful answer.
[If you aren't keen on exposing that information publicly, it's OK to open a support ticket with these details (through the Azure portal).]
(Update: the functionality has been available for a while now. Read more at https://yonileibowitz.github.io/blog-posts/data-partitioning.html)
I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to ELT (Extract, Load, Transform). Doesn't extracting this data, transforming it, and loading it into a table get us right back where we started?
IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is another product, Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also with blob storage, SQL Server on a VM (IaaS), the two PaaS database offerings (SQL Database and SQL Data Warehouse), and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .NET, for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch output and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, i.e. they don't yet have endpoints / aren't visible - e.g. Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH
By definition, a data lake is a huge repository storing raw data in its native format until needed. Lakes use a flat architecture rather than a nested one (http://searchaws.techtarget.com/definition/data-lake). Each piece of data in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ETL process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panoply for a discussion of ELT vs ETL.) For example, say you want to see customer data from 2010. When you query a data lake for that you will get everything from accounting data to CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.
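A toy illustration of that schema-on-read idea in Python (the record shapes here are invented for the example): the raw records are loaded into the "lake" as-is, and the transform — picking out customer records from 2010 — happens only when the question is asked:

```python
import json

# Load: raw, heterogeneous records land in the "lake" untransformed
raw_lake = [
    '{"source": "crm", "customer": "acme", "year": 2010}',
    '{"source": "email", "from": "bob@example.com", "year": 2009}',
    '{"source": "accounting", "customer": "acme", "year": 2010}',
]

# Transform at query time (schema-on-read): parse and filter only
# when a question is asked, e.g. "customer data from 2010"
def query_customers(lake, year):
    results = []
    for line in lake:
        rec = json.loads(line)
        if rec.get("year") == year and "customer" in rec:
            results.append(rec)
    return results

print(query_customers(raw_lake, 2010))
```

In a warehouse, by contrast, the filtering and shaping would have been done once, up front, before the data was loaded.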
To me, the answer is "money" and "resources"
(and it probably correlates with the use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms, and it comes down to cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems often are based either on some analytics software or homegrown system that, over time, has a deeply embedded expectation for clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing the exact same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because you're often talking about hundreds to thousands of people who are used to Excel, with some knowing SQL. In my experience, few end users — and few analysts I've worked with — know how to program. Statisticians and data engineers tend towards R/Python, and developers with Java/C# experience tend towards Scala/Python.
Data Types are a clincher for what tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "Data Rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to get a result operationalized... because support is hard to find for Spark.
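That "turn it into a table" step often looks something like this in practice — a pandas sketch with made-up event records, flattening semi-structured data into the tabular shape Excel/SQL-oriented consumers expect:

```python
import pandas as pd

# Semi-structured events as they might arrive from an API or a log
events = [
    {"id": 1, "user": {"name": "ann", "dept": "sales"}, "value": 3},
    {"id": 2, "user": {"name": "bob", "dept": "ops"},   "value": 7},
]

# Flatten the nested structure into a "data rectangle":
# the nested "user" dict becomes user.name / user.dept columns
df = pd.json_normalize(events)
print(df)
```

Once the data is rectangular, the much larger population of SQL/Excel users can work with it without needing Spark skills.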
A new project has come into my hands, and it looks interesting to me.
I need to store all the incoming data from industrial PLCs (which control the machinery inside a factory); every event in the PLC generates an output that needs to be saved for later data analysis.
I was wondering what would be the perfect match for this type of data (time series), to build a whole architecture that manages data I/O and, for the moment, only queries it for graphics (later, machine learning analysis will be applied for predictive maintenance).
I don't know if I'm working in the correct direction, and it would be good to have some input from experts on the subject.
IO producer (this is our own project and cannot be changed)
IO events layer --> Is Apache Kafka an option for managing a large volume of signals coming from many different computers (connected to the PLCs), and also for managing the saving of that data to a NoSQL database? (Is it suitable for that? Any better option?)
NoSQL database --> This point is clearer: we are choosing Cassandra for time-series storage.
Querying NoSQL data --> We are choosing Spark for making fast queries and, later on, some data analysis.
The layer I have the most doubts about is the one that administers the I/O data before storing it, and I have serious doubts that Kafka is the correct option.
Thanks for reading, and sorry for my bad English ;) Feel free to give your point of view.
We have a similar project based on sensor data, with about 30 GB of data coming in per day. We use Kafka to stream the data and store it in HDFS, and we have a setup of Python (NumPy, pandas and PySpark) along with Spark for data crunching, basically for the prediction part.
As far as your doubts about Kafka go... it's capable enough to handle large datasets. The other benefit is that Kafka can handle multiple sources and will be easier to scale.
As far as data storage is concerned, I would recommend you go with HDFS, as it can be consumed in multiple ways. You can leverage Hive or HBase if required in the future.
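Whichever store you pick (Cassandra, HDFS/Hive, HBase), time-series data is usually keyed by device plus a time bucket, so one machine's entire history doesn't land in a single giant partition. A small Python sketch of that idea (the event fields and IDs are invented for illustration):

```python
from collections import defaultdict
from datetime import datetime, timezone

def partition_key(event):
    """Bucket events by (plc_id, day) -- a typical time-series
    partitioning scheme for stores like Cassandra."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return (event["plc_id"], ts.strftime("%Y-%m-%d"))

events = [
    {"plc_id": "plc-1", "ts": 1700000000, "signal": 0.42},
    {"plc_id": "plc-1", "ts": 1700000060, "signal": 0.43},
    {"plc_id": "plc-2", "ts": 1700000000, "signal": 1.10},
]

# Group events into partitions as a storage layer would
partitions = defaultdict(list)
for e in events:
    partitions[partition_key(e)].append(e)

# plc-1's two readings share a partition; plc-2 gets its own
print({k: len(v) for k, v in partitions.items()})
```

In Cassandra this maps directly onto the partition key of the table, with the event timestamp as a clustering column, so queries like "signals for this machine over this time range" hit a single partition.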