What is the point of a table in a data lake? [closed] - azure

I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to LET (Load, Extract, Transform). Doesn't extracting this data, transforming and loading it into a table get us right back where we started?

IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is a product called Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also with blob storage, SQL Server on a VM (IaaS), the two PaaS database offerings (SQL Database and SQL Data Warehouse), and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .NET, for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch job, and you want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
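For illustration, a minimal U-SQL sketch of that pattern (the input path, column names and table name are all hypothetical):

    // Read semi-structured CSV files from the lake (hypothetical path/columns).
    @readings =
        EXTRACT deviceId string,
                readingTime DateTime,
                reading double
        FROM "/raw/sensors/{*}.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    // Persist the structured intermediate output in an ADLA database table,
    // choosing the clustered index and hash distribution up front.
    CREATE DATABASE IF NOT EXISTS SensorDb;
    CREATE TABLE IF NOT EXISTS SensorDb.dbo.Readings
    (
        deviceId string,
        readingTime DateTime,
        reading double,
        INDEX idx_readings CLUSTERED (deviceId ASC)
        DISTRIBUTED BY HASH (deviceId)
    );

    INSERT INTO SensorDb.dbo.Readings
    SELECT deviceId, readingTime, reading FROM @readings;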
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, i.e. they don't as yet have endpoints / aren't visible; e.g. Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH

By definition a data lake is a huge repository storing raw data in its native format until needed. Lakes use a flat architecture rather than a nested one (http://searchaws.techtarget.com/definition/data-lake). Data in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ETL process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panopoly for a discussion of ELT vs ETL.) For example, say you want to see customer data from 2010. When you query a data lake for that you will get everything from accounting data to CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.

To me, the answer is "money" and "resources"
(and probably correlated to use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms and it comes down to the cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems are often based on either some analytics software or a homegrown system that, over time, has developed a deeply embedded expectation of clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing the exact same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because you're often talking about hundreds to thousands of folks who are used to using Excel, with some knowing SQL. In my experience, few end users and few analysts I've worked with know how to program. Statisticians and data engineers tend towards R/Python, and developers with Java/C# experience tend towards Scala/Python.
Data Types are a clincher for what tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "Data Rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to get a result operationalized... because support is hard to find for Spark.

Related

Which Data Lake Platform supports three types of data?

If a data lake is a repository with unstructured, semi-structured and structured data, is it physically implemented in a single DB technology? Which ones support the three types of data?
That's a broad question, but Delta Lake supports all of these data types. Of course many things are dependent on the specific access patterns, but it's all doable with Delta.
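As a rough sketch of what that can look like with Delta (Spark SQL; the table and column names are made up), structured columns can sit alongside a semi-structured JSON payload:

    -- Structured columns alongside a raw JSON payload column.
    CREATE TABLE IF NOT EXISTS events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        raw_json   STRING
    ) USING DELTA;

    -- Semi-structured fields can still be queried with JSON functions.
    SELECT event_id,
           get_json_object(raw_json, '$.customer.id') AS customer_id
    FROM events
    WHERE event_time >= '2010-01-01';

For truly unstructured content (images, PDFs and the like), a common approach is a BINARY column or just a path pointing back to the file in storage.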

Precalculate OLAP cube inside Azure Synapse

We have a dimensional model with fact tables of 100-300 GB each in Parquet. We build PBI reports on top of Azure Synapse (DirectQuery) and experience performance issues when slicing/dicing and especially when calculating multiple KPIs. At the same time, the data volume is too expensive to keep in Azure Analysis Services. Because of the number of dimensions, the fact table can't be aggregated significantly, so PBI import mode or a composite model isn't an option either.
Azure Synapse Analytics facilitates OLAP operations like GROUP BY ROLLUP/CUBE/GROUPING SETS.
How can I benefit from Synapse's OLAP operations support?
Is it possible to pre-calculate OLAP cubes inside Synapse in order to boost PBI report performance? How?
If the answer is yes, is it recommended to pre-calculate KPIs? That would mean moving KPI definitions to the DWH OLAP cube level - is that an anti-pattern?
P.S. Using separate aggregations for each PBI visualisation is not an option; it's more an exception to the rule. Synapse is clever enough to take advantage of a materialized view aggregation even when querying the base table, but that way you can't implement RLS, and managing that number of materialized views also looks cumbersome.
Update for @NickW
Could you please answer the following sub-questions:
Have I got it right - OLAP operations support is mainly for downstream cube providers, not for Warehouse performance?
Is populating the warehouse with materialized views in order to boost performance considered a common practice or an anti-pattern? I've found (see the link) that Power BI can create materialized views automatically based on query patterns. Still, I'm afraid it won't be able to provide a stable, testable solution, and there is the RLS support issue again.
Is KPI pre-calculation on the warehouse side considered a common approach or an anti-pattern? As I understand it, this is usually done on the cube provider side, but what if I don't have one?
Do you see any other options to boost performance? I can only think of reducing query parallelism by using a PBI composite model and importing all dimensions into PBI. Not sure if it would help.
Synapse Result Set Caching and Materialized Views can both help.
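For example, result set caching is a database-level setting on the dedicated SQL pool (a sketch; the database name is a placeholder):

    -- Run against the master database of the dedicated SQL pool; identical
    -- repeat queries can then be served from cache while the data is unchanged.
    ALTER DATABASE [MySynapseDW] SET RESULT_SET_CACHING ON;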
In the future, the creation and maintenance of Materialized Views will be automated.
Azure Synapse will automatically create and manage materialized views for larger Power BI Premium datasets in DirectQuery mode. The materialized views will be based on usage and query patterns. They will be automatically maintained as a self-learning, self-optimizing system. Power BI queries to Azure Synapse in DirectQuery mode will automatically use the materialized views. This feature will provide enhanced performance and user concurrency.
https://learn.microsoft.com/en-us/power-platform-release-plan/2020wave2/power-bi/synapse-integration
Power BI Aggregations can also help. If there are a lot of dimensions, select the most commonly used to create aggregations.
to hopefully answer some of your questions...
You can't pre-calculate OLAP cubes in Synapse; the closest you could get is creating aggregate tables and you've stated that this is not a viable solution
OLAP operations can be used in queries but don't "pre-build" anything that can be used by other queries (ignoring CTEs, sub-queries, etc.). So if you have existing queries that don't use these functions then re-writing them to use these functions might improve performance - but only for each specific query
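For example, a single query using GROUPING SETS (hypothetical fact and dimension names) computes several aggregation levels in one pass, but the result is only available to that query:

    SELECT d.CalendarYear,
           s.Region,
           SUM(f.SalesAmount) AS SalesAmount
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate   AS d ON f.DateKey  = d.DateKey
    JOIN dbo.DimStore  AS s ON f.StoreKey = s.StoreKey
    GROUP BY GROUPING SETS
    (
        (d.CalendarYear, s.Region),  -- detail level
        (s.Region),                  -- subtotal per region
        ()                           -- grand total
    );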
I realise that your question was about OLAP but the underlying issue is obviously performance. Given that OLAP is unlikely to be a solution to your performance issues, I'd be happy to talk about performance tuning if you want?
Update 1 - Answers to additional numbered questions
I'm not entirely sure I understand the question, so this may not be an answer: the OLAP functions are there so that it is possible to write queries that use them. There can be an infinite number of reasons why people might need to write queries that use these functions.
Performance is the main (only?) reason for creating materialised views (a minimal example appears after these answers). They are very effective for creating datasets that will be used frequently, i.e. when base data is at day level but lots of reports are aggregated at week/month level. As stated by another user in the comments, Synapse can manage this process automatically, but whether it can actually create aggregates that are useful for a significant proportion of your queries is obviously entirely dependent on your particular circumstances.
KPI pre-calculation. In a DW, any measures that can be calculated in advance should be (by your ETL/ELT process). For example, if you have reports that use Net Sales Amount (Gross Sales - Tax) and your source system is only providing Gross Sales and Tax amounts, then you should be calculating Net Sales as a measure when loading your fact table. Obviously there are KPIs that can't be calculated in advance (e.g. probably anything involving averages) and these need to be defined in your BI tool.
Boosting Performance: I'll cover this in the next section as it is a longer topic
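As a minimal illustration of the materialised view point above (dedicated SQL pool syntax; the table and column names are hypothetical), queries against the base table at this grain can then be answered from the view automatically:

    CREATE MATERIALIZED VIEW dbo.mv_SalesByDateRegion
    WITH (DISTRIBUTION = HASH(RegionKey))
    AS
    SELECT DateKey,
           RegionKey,
           SUM(SalesAmount) AS SalesAmount,
           COUNT_BIG(*)     AS RowCnt   -- row count per group, kept alongside SUM
    FROM dbo.FactSales
    GROUP BY DateKey, RegionKey;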
Boosting Performance
Performance tuning is a massive subject - some areas are generic and some will be specific to your infrastructure; this is not going to be a comprehensive review but will highlight a few areas you might need to consider.
Bear in mind a couple of things:
There is always an absolute limit on performance - based on your infrastructure - so even in a perfectly tuned system there is always going to be a limit that may not be what you hoped to achieve. However, with modern cloud infrastructure the chances of you hitting this limit are very low
Performance costs money. If all you can afford is a Mini then regardless of how well you tune it, it is never going to be as fast as a Ferrari
Given these caveats, a few things you can look at:
Query plan. Have a look at how your queries are executing and whether there are any obvious bottlenecks you can then focus on. This link gives some further information: Monitor SQL Workloads
Scale up your Synapse SQL pool. If you throw more resources at your queries they will run quicker. Obviously this is a bit of a "blunt instrument" approach but worth trying once other tuning activities have been tried. If this does turn out to give you acceptable performance you'd need to decide if it is worth the additional cost. Scale Compute
Ensure your statistics are up to date
Check whether the distribution mechanism (round robin, hash) you've used for each table is still appropriate and, on a related topic, check the skew on each table (see the sketch after this list)
Indexing. Adding appropriate indexes will speed up your queries though they also have a storage implication and will slow down data loads. This article is a reasonable starting point when looking at your indexing: Synapse Table Indexing
Materialised Views. Covered previously but worth investigating. I think the automatic management of MVs may not be out yet (or is only in public preview) but may be something to consider down the line
Data Model. If you have some fairly generic facts and dimensions that support a lot of queries then you might need to look at creating additional facts/dimensions just to support specific reports. I would always (if possible) derive them from existing facts/dimensions but you can create new tables by dropping unused SKs from facts, reducing data volumes, sub-setting the columns in tables, combining tables, etc.
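A brief sketch of the distribution, indexing and statistics checks mentioned above (CTAS into a new table; object names are placeholders):

    -- Rebuild the fact hash-distributed on a high-cardinality join key, with a
    -- clustered columnstore index (the usual choice for large fact tables).
    CREATE TABLE dbo.FactSales_v2
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
    AS SELECT * FROM dbo.FactSales;

    -- Check space usage and skew across the 60 distributions.
    DBCC PDW_SHOWSPACEUSED('dbo.FactSales_v2');

    -- Keep statistics current after large loads.
    UPDATE STATISTICS dbo.FactSales_v2;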
Hopefully this gives you at least a starting point for investigating your performance issues.

Cassandra store data in BLOB

We are using Cassandra 3 and came up with a data model based on the initial requirements. Since there have been very frequent requirement changes, this model has subsequently changed many times as well. Given these requirement and model changes, there has been no major improvement in terms of development. The team has decided to go with the BLOB data type and store the entire record in a BLOB. Can you please share the drawbacks of using a BLOB in such a scenario? Thanks in advance.
We migrated directly from Astyanax Cassandra 1.1 to CQL Cassandra 3.0, so we still have a lot of column families whose values are stored as BLOBs.
Major issues we face right now are:
1) Difficult to visualize data directly from the database: the biggest advantage of CQL is that it supports SQL-like queries, so logging into the cqlsh terminal and getting results directly from there normally saves a lot of time. If you use BLOBs you will not be able to do any of that.
2) CQL performs better when your table has a well-defined schema instead of a blob storing a big chunk of data together.
If you are creating a new table, I would suggest using collections for your use case. You will be able to store different types of data and performance will also be good.
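A rough CQL sketch of that suggestion, with typed columns for the stable fields and collections for the parts that change often (keyspace, table and column names are made up):

    CREATE TABLE IF NOT EXISTS app.customer_events (
        customer_id uuid,
        event_time  timestamp,
        event_type  text,
        attributes  map<text, text>,   -- free-form key/value pairs
        tags        set<text>,
        PRIMARY KEY (customer_id, event_time)
    );

    -- Individual fields stay visible and queryable from cqlsh,
    -- unlike a single blob value.
    SELECT event_type, attributes FROM app.customer_events
    WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;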
Nice slides comparing the performance of schemaless tables with tables that have a schema and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016

What's faster for Stata: manipulating data in a flat database (i.e. Excel) or in a relational database?

I'm an entry-level optimization analyst at a company that publishes risk ratings data for various companies. We have tons of data (to the point where our history is currently solely limited by the number of rows possible in Excel).
We currently use many .do files in Stata to perform all manipulations and statistical analyses (the largest production we run takes 9 hours, with one insheet taking half a minute). I'm trying to convince the company to move away from using a flat database to using a relational database but have been having trouble finding information online about whether flat or relational is better in Stata. So--which is better, and why?
I would hypothesise that you answered your own question by emphasising that the limitations of Excel prevent you from capitalising on the full potential of your data. Excel is not a proper analytical tool or data warehousing solution, and as such there is no point in using it for analytical projects involving anything more complex than basic sums for a small business or household.
To answer your question:
Flat file databases are an archaic technology dating to the beginnings of computer science: they were never designed to meet modern analytical needs of working with Big Data, live data streams, etc.
Relational databases
help to avoid data duplication
help to avoid inconsistent records
are easier when changing the data format

Querying big data [closed]

I am working with a system that takes in a 10 KB write stream at 50 writes/s, running 24 hours a day. The data is ingested via a messaging system into a SQL database, and then used in an overnight aggregation that takes around 15 hours to produce queryable data for an application.
This is currently all in sql, but we are moving to a new architecture.
The plan is to move the ingested writes into a distributed database like Cassandra or DynamoDB, and then perform the aggregation in Hadoop. This makes those parts of the system scalable.
My question is: when people have this architecture, where do they put the data after the writes and aggregation have been performed, so that it can be queried?
In more detail:
The query model our application uses is quite complicated. To make the data queryable in Cassandra, we would have to denormalise it for all queries; this is possible, but it would mean a massive growth in data size. Is this normal practice, or would you prefer to move the data back into SQL?
We could move the data into Redshift, but that seems to be more for ad hoc data analytics, and its purpose is not to be the backend for a data analytics application. I also think the queries are too complicated in their current form to be written in an ORM, which is what would be required for Redshift.
Does this mean that I still need to put the data into SQL Server?
I am looking for examples of what people are doing at the moment.
I am sorry this question is a bit abstract; please do not close it, I will add more detail. I have read a lot about big data, but most articles are about ingesting data using messaging/workers and distributed databases; I have not found any that show what is done with this ingested data and how it is queried from the application.
*answer to JosefN's comment: Yes, we are not planning to denormalise into a SQL DB. The choice is: denormalise into Cassandra for all clients and queries, which would probably mean 100x the current data size, as there would be so much duplication in the denormalised model. The other option is to store it as it is now, so that it is queryable, but then is my only option a SQL DB?
*after more research I have more information. The best options at the moment seem to be:
store back in SQL
denormalise in Cassandra
use one of the real-time SQL engines on top of Hadoop/HDFS, like Impala
DRPC with Storm
I do not have any experience with Impala or DRPC with Storm, so if anyone has any info on latency and the type of queries that can be performed with these, that would be great.
Please do not refer me to documentation or blog posts; I know how these technologies work. I only want to know if someone has used them in production and has their own information on this subject. Thanks.
I would suggest moving the aggregated data into HDFS. Using Hive, which provides a relational view over data stored in HDFS, you can run ad hoc SQL-like queries very well. At the same time you will benefit from the parallelism of the MapReduce jobs that get invoked when you use Hive. This would help you decrease the query latencies you would have while using an RDBMS. Also think about doing the aggregation jobs in Hadoop itself.
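A sketch of that setup in HiveQL, assuming the aggregation job writes partitioned files under a hypothetical HDFS path:

    -- External table over the aggregated output already sitting in HDFS.
    CREATE EXTERNAL TABLE IF NOT EXISTS daily_agg (
        customer_id  BIGINT,
        metric_name  STRING,
        metric_value DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC
    LOCATION '/warehouse/daily_agg';

    -- Register new partitions as the job produces them (or use MSCK REPAIR TABLE).
    ALTER TABLE daily_agg ADD IF NOT EXISTS PARTITION (dt = '2013-12-25');

    -- Ad hoc query; partition pruning on dt limits the scan.
    SELECT metric_name, SUM(metric_value) AS total
    FROM daily_agg
    WHERE dt = '2013-12-25'
    GROUP BY metric_name;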
Since the data after aggregation is small and you are looking for good latency, keeping it in HDFS and querying it with Hive is not preferable.
I have seen people use HBase to store aggregated data and query it, but as you mentioned earlier you will have to denormalize the data. In this case I would recommend writing the aggregated data back to MySQL and querying it there if the aggregated data is not big.
I think one traditional approach is to run your Hadoop/Hive jobs to aggregate across all possible dimensions, then store the results in a key/value store like HBase and look them up at runtime with a key based on the aggregation done (i.e. /state=NJ/dt=20131225/). This can cause an explosion in size, especially if there are many dimensions to roll up.
If you want/need a more real-time solution as well, take a look at Twitter's Summingbird.
