Which Data Lake Platform support three types of data? - delta-lake

If data lake is a repository with unstructured, semi-structured and structured data, is it physically implemented in a single DB technology? Which ones support the three types of data?

That's a broad question, but Delta Lake supports all of these data types. Of course many things are dependent on the specific access patterns, but it's all doable with Delta.

Related

Does a Kappa Architecture use a data lake or not?

Kukreja in “Data Engineering with Apache Spark, Delta Lake, and Lakehouse” says that a Kappa architecture has no data lake. Microsoft in https://learn.microsoft.com/en-us/azure/architecture/data-guide/big-data (see picture) mentions a “long term store” without saying what it actually is. It uses that data to “re-compute”. For me this is a data lake.
Does a Kappa Architecture use a data lake or not?
No, it does not.
But others have a more realistic perception. It states recompute... Imagine a mistake has been made, handy to be able to correct. From the long term store.

What are the benefits of using Hyperspace indexes over Z-ordering in deltaLake?

I have streaming data in Azure Databricks being stored in delta table format. For optimization,I am currently using Z-ordering. Are there any benefits of using Hyperspace indexing subsytem over Z-ordering?
Disclaimer: I didn't use Hyperspace myself, just read documentation & code examples.
Hyperspace by functionality is closer to the Data Skipping functionality of the Databricks Delta implementation - it allows to read only that data that is necessary. But on Databricks, indexing of data happens automatically when they are written, while with Hyperspace you need to build indexes & maintain them.
ZOrder is a different functionality - it optimizes placement of the data, so there is a higher probability that data that are used often together are really placed together, so you'll read less files. Hyperspace doesn't have that - it just indexes data, and placement of the data is defined by the underlying file format.
P.S. Here is the good blog post from Databricks about Data Skipping and ZOreder.

Table relationship in hive, Spark Sql and Azure USQL

Is there any way i can maintain table relationship in Hive, Spark SQL and Azure U SQL. Does any of these support creating relationships like in Oracle or SQL Server. In oracle i can get the relationships using user_constraints table. Looking for something similar to that for processing in ML.
I don't believe any of the big data platforms support integrity constraints. They are designed to load data as quickly as possible. Having constraints will only slow down the import. Hope this makes sense
None of the big data tools maintain constraints. If we just consider hive it doesn't even bother while you are writing data to the table whether the table schema is maintained or not as Hive follows schema on read approach.
While reading the data if you want to establish any relation, one has to work with joins.

What is the point of a table in a data lake? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to LET (Load, Extract, Transform). Doesn't extracting this data, transforming and loading it into a table get us right back where we started?
IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is a product Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also blob storage, SQL on a VM (IaaS) and the two PaaS database offerings, SQL Database and SQL Data Warehouse and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .net for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch output and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, ie they don't as yet have endpoints / aren't visible, eg Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH
By definition a data lake is a huge repository storing raw data in it's native format until needed. Lakes use a flat architecture rather than nested (http://searchaws.techtarget.com/definition/data-lake). Data in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ELT process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panopoly for a discussion of ELT vs ETL.) For example, you want to see customer data from 2010. When you query a data lake for that you will get everything from accounting data, CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.
To me, the answer is "money" and "resources"
(and probably correlated to use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms and it comes down to the cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems often are based either on some analytics software or homegrown system that, over time, has a deeply embedded expectation for clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing the exact same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because your often talking about hundreds to thousands of folks who are used to using Excel, with some knowing SQL. Few end-users, in my experience, and few Analysts I've worked with know how to program. Statisticians and Data Engineers tend towards R/Python. And developers with Java/C# experience tend towards Scala/Python.
Data Types are a clincher for what tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "Data Rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to get a result operationalized... because support is hard to find for Spark.

Cassandra store data in BLOB

We are using Cassandra 3 and have come up with a modelling based on the initial requirements. Since there have been very frequent requirements changes, this model has subsequently changed many times as well. Hence considering these requirements and model changes, there has been no major improvement in terms of development. The team have decided to go with the BLOB data type and store the entire data in the BLOB. Can you please share the drawback to use BLOB such a scenario. Thanks in Advance.
We migrated from Astyanax Cassandra 1.1 to CQL Cassandra 3.0 directly, so we still have a lot of column families which have value as BLOB.
Major issues we face right now are:
1) Difficult to visualize data directly from database: Biggest advantage of CQL is it supports SQL like queries, hence logging into cql terminal and getting results directly from there is saves a lot of time normally. If you use BLOB you will not be able to do all such things.
2) CQL performs better when your table has a well defined schema instead of using blob to store big chunk of data together.
If you are creating a new table, I will suggest to use Collections for your use case. You will be able to store different type of data and performance will also be good.
Nice slides comparing performance of schemaless tables and tables with scehma and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016

Resources