We are in the process of migrating a Hadoop workload to Azure Databricks. In the existing Hadoop ecosystem, we have some HBase tables which contain some data (not big). Since Azure Databricks does not support HBase, we were planning to replace the HBase tables with Delta tables.
Is this technically feasible? If yes, are there any challenges or issues we might face during the migration or in the target system?
It all comes down to the access patterns. HBase is an OLTP system where you usually operate on individual records (read/insert/update/delete) and expect subsecond (or millisecond) response times. Delta Lake, on the other hand, is an OLAP system designed for efficient processing of many records together, but it could be slower when you read individual records, and especially when you update or delete them.
If your application needs subsecond queries, especially with updates, then it makes sense to set up a test to check whether Delta Lake is the right choice for that - you may want to look into Databricks SQL, which does a lot of optimizations for fast data access.
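As a rough sketch of such a test (assuming a hypothetical `events` Delta table keyed by `event_id`, running in a Databricks notebook where `spark` is already defined), you could time a point read and a point update and compare the latencies against your SLA:

```python
import time
from delta.tables import DeltaTable

# Hypothetical Delta table and key column; adjust path/schema to your data.
path = "/mnt/delta/events"

# Point read: filter on the key column. Z-ORDERing the table by this column
# first (OPTIMIZE ... ZORDER BY (event_id)) helps data skipping.
start = time.time()
row = spark.read.format("delta").load(path).where("event_id = '42'").collect()
print(f"point read took {time.time() - start:.3f}s")

# Point update: rewrites the affected data files, so it is much heavier
# than an HBase put.
start = time.time()
DeltaTable.forPath(spark, path).update(
    condition="event_id = '42'",
    set={"status": "'processed'"},
)
print(f"point update took {time.time() - start:.3f}s")
```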
If it won't fulfill your requirements, then you may look into other products in the Azure ecosystem, such as Azure Redis or Azure Cosmos DB, that are designed for OLTP-style data processing.
I am using HBase as the backend for JanusGraph. I have to migrate to Cassandra as the backend. What is the best way to migrate the old data?
One way to go about it is to read the data from HBase and put it into Cassandra using Java code.
Migrating data out of JanusGraph is not well supported, so I would prefer to start from copies of the data that were made before ingesting it into JanusGraph. If that is not an option, your suggestion of using Java code to read from one graph and ingest into the other comes first.
Naturally, you want to parallelize this, because millions of operations on a single thread and process take too long to be practical. Although JanusGraph supports OLAP traversals for reading vertices and edges in parallel, JanusGraph OLAP has its own problems and you are probably better off segmenting the data using a mixed index in JanusGraph and having each process/thread read the segment assigned to it using an OLTP traversal.
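A rough sketch of that segmented approach, assuming a hypothetical indexed numeric property `segmentId` that can be range-filtered through a mixed index, the gremlinpython client, and a hypothetical `write_to_target_graph` sink for the Cassandra-backed graph:

```python
from concurrent.futures import ThreadPoolExecutor
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.traversal import P

def copy_segment(lo, hi):
    # Each worker opens its own connection and reads only its slice of
    # vertices with a normal OLTP traversal on the indexed property.
    conn = DriverRemoteConnection("ws://janusgraph-host:8182/gremlin", "g")
    g = traversal().withRemote(conn)
    try:
        for vertex in g.V().has("segmentId", P.between(lo, hi)).valueMap(True).toList():
            write_to_target_graph(vertex)  # hypothetical sink into the Cassandra-backed graph
    finally:
        conn.close()

# Split the key space into ranges and fan the segments out over workers.
ranges = [(i, i + 1000) for i in range(0, 100_000, 1000)]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda r: copy_segment(*r), ranges))
```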
I know ACID transactions are one of the important features of Delta Lake while performing reads and writes. Is this also true for merge operations? What if two pipelines try to perform an update operation based on different conditions on the same record? Can it cause any data inconsistency?
Well, it depends.
Delta Lake uses optimistic concurrency control to handle concurrency. This means that it would likely work IF you're writing to HDFS, since Delta needs the underlying storage to support "compare-and-swap" operations, or a way for a writer to fail if two writers are trying to overwrite each other's log entries, and HDFS supports that.
On S3, this is not supported:
Delta Lake has built-in support for S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees. This is because S3 currently does not provide mutual exclusion, that is, there is no way to ensure that only one writer is able to create a file.
On the proprietary Delta Engine, Databricks does support multi-cluster writes to S3 using a proprietary service that handles those calls.
So to sum it up:
It should be possible if you're writing to HDFS.
On S3, it won't work unless you're using the paid version of Delta Lake.
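To make the merge question concrete: when two writers do conflict under optimistic concurrency, the later commit fails with a concurrent-modification error instead of silently producing inconsistent data, and you can catch it and retry. A minimal sketch, assuming a recent open-source delta-spark package, an existing `spark` session, and a made-up table keyed by `id`:

```python
from delta.tables import DeltaTable
from delta.exceptions import ConcurrentAppendException, ConcurrentDeleteReadException

def merge_with_retry(source_df, path, retries=3):
    target = DeltaTable.forPath(spark, path)
    for attempt in range(retries):
        try:
            (target.alias("t")
                   .merge(source_df.alias("s"), "t.id = s.id")
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())
            return
        except (ConcurrentAppendException, ConcurrentDeleteReadException):
            # Another pipeline committed first and the table version moved on;
            # recompute/re-read the source against the new version and retry.
            continue
    raise RuntimeError("merge still conflicting after retries")
```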
Given the quote from the official Apache Kudu documentation (https://kudu.apache.org/overview.html):
Kudu isn't designed to be an OLTP system, but if you have some subset of data which fits in memory, it offers competitive random access performance. We've measured 99th percentile latencies of 6ms or below using YCSB with a uniform random access workload over a billion rows. Being able to run low-latency online workloads on the same storage as back-end data analytics can dramatically simplify application architecture.
Does this statement imply that Kudu can be used for replication from a JDBC source - the simplest form possible?
Elsewhere I have used Kudu for replicating from SAP and other COTS systems, so that reports could run against the Kudu tables as opposed to HANA. That was an architecture decided upon by others.
For pure replication of data, primarily for subsequent extraction from a data lake, for data with embellished history and a size < 1 TB, this is feasible as well. Cloudera confirmed this after discussion. Even though Kudu has a columnar format and a row format would be desirable, it simply works as well.
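For the simple JDBC-source replication the question asks about, a hedged sketch using Spark with the kudu-spark connector (connection details, table names and options here are placeholders, and the Kudu table is assumed to already exist with a primary key):

```python
from pyspark.sql import SparkSession

# Requires the kudu-spark connector matching your Spark version on the classpath.
spark = SparkSession.builder.appName("jdbc-to-kudu").getOrCreate()

# 1. Pull the source table over JDBC (placeholder connection details).
src = (spark.read.format("jdbc")
       .option("url", "jdbc:sqlserver://source-host:1433;databaseName=erp")
       .option("dbtable", "dbo.orders")
       .option("user", "reader")
       .option("password", "***")
       .load())

# 2. Append into the pre-created Kudu table; Kudu tables need a primary key,
#    so the table is typically created up front via Impala or the Kudu API.
(src.write.format("kudu")
    .option("kudu.master", "kudu-master:7051")
    .option("kudu.table", "impala::default.orders")
    .mode("append")
    .save())
```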
I have millions of streaming parquet files being written. I want to support running ad hoc interactive queries for debugging and analytics purposes (an added bonus if I can run streaming queries for some real-time monitoring of key metrics as well).
What is a scalable solution for supporting this?
The two ways I have observed are running Spark SQL interactively over millions of parquet files (I am not too familiar with the Spark ecosystem, but does this mean running a Spark job for every SQL query a user submits, or do I need to run some streaming job and submit queries somehow?) and using a Presto SQL engine on top of parquet (I am not exactly sure how Presto ingests new incoming parquet files).
Any recommendations or pros and cons of either approach? Any better solutions considering I have > ~10 TB of data produced every day?
Let me address your use cases:
Support running ad hoc interactive queries for debugging and analytics purposes
I would recommend building a Presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.
Reason: Presto would run fully in memory without touching disk (in most cases).
A Spark cluster can also do the job; however, it won't be as fast as Presto. The advantage of Spark over Presto is its fault-tolerance capability and its ability to spill to disk in case of out-of-memory conditions, which may be important for you given how much data you have.
Run streaming queries for some real-time monitoring of key metrics as well
As long as you have basic queries, you can build dashboards on top of Presto which could run these queries every x minutes.
If you have a considerable amount of processing, that may be a good reason to look at Spark Streaming when real-time monitoring is important.
If it isn't, then you could build an ETL job (using Spark) that calculates your metrics, stores the result as a new Hive table, and then exposes it for querying via Presto/Spark SQL again, as in the sketch below.
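A minimal sketch of that ETL path (paths, metric names and the table name are illustrative), computing a metric with Spark and saving it as a Hive table that Presto or Spark SQL can then query:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("metrics-etl")
         .enableHiveSupport()   # register the output in the shared Hive Metastore
         .getOrCreate())

# Read the raw streaming parquet output (illustrative path).
events = spark.read.parquet("hdfs:///data/events/")

# Aggregate a key metric per hour and event type.
metrics = (events
           .groupBy(F.window("event_time", "1 hour").alias("hour"), "event_type")
           .agg(F.count("*").alias("event_count"),
                F.avg("latency_ms").alias("avg_latency_ms")))

# Persist as a Hive table; Presto sees it through the same Metastore.
metrics.write.mode("overwrite").saveAsTable("analytics.key_metrics_hourly")
```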
How does Presto ingest new incoming parquet files?
I'm not aware of your architecture, but in any case, you need to provide Presto with a Hive connection (a Hive Metastore, to be precise).
Hive provides Presto with the schemas attached to the directories where you ingest your data. Presto sees the new data dynamically by default. Spark is no different, by the way.
Presto has nothing to do with data ingestion. It only starts its job once the data is there.
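To illustrate (directory layout, schema and names are assumptions): you declare an external table over the ingestion directory once in the Metastore; files that land under already-registered partitions are visible on the next query, and only brand-new partition directories need to be registered.

```python
# Assumes a SparkSession named `spark` created with Hive support, sharing
# the same Metastore that Presto is configured against.

# One-off: declare an external table over the parquet ingestion directory.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.events (
        event_id   STRING,
        event_time TIMESTAMP,
        payload    STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events/'
""")

# When new partition directories (e.g. dt=2024-01-15) appear, register them
# so both Presto and Spark SQL can see them.
spark.sql("MSCK REPAIR TABLE raw.events")
```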
I just got this question while designing the storage part for a Hadoop-based platform. If we want data scientists to have access to tables which have already been stored in a relational database (e.g. SQL Server on an Azure Virtual Machine), will there be any particular benefit if we import the tables from SQL Server to HDFS (e.g. WASB) and create Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet, etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has already been asked; I have done some research but could not find a comparison between the two methodologies.
I think there will be a big performance improvement, as the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored on HDFS). With JDBC, if the actions/processing performed are interactive, the user has to wait for the data to be loaded through JDBC from another machine (network latency and I/O throughput), whereas if that load is done upfront, the user (data scientist) can concentrate on performing the actions straight away.
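As a hedged sketch of that "load upfront" approach (connection details, table names and partitioning bounds are placeholders): the table is pulled over JDBC once, persisted as Parquet on HDFS/WASB and registered in Hive, and subsequent Spark SQL queries hit the local columnar copy instead of going back to SQL Server.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One-off (or scheduled) import over JDBC -- placeholder connection details.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://sqlvm.example.net:1433;databaseName=sales")
          .option("dbtable", "dbo.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .option("numPartitions", 8)            # parallel JDBC reads
          .option("partitionColumn", "order_id")
          .option("lowerBound", 1)
          .option("upperBound", 10000000)
          .load())

# Persist locally in a columnar format and register it as a Hive table.
orders.write.mode("overwrite").format("parquet").saveAsTable("staging.orders")

# Later interactive queries read the local Parquet copy, not SQL Server.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM staging.orders
    GROUP BY customer_id
""").show()
```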