On-premise delta lake - delta-lake

Is it possible to implement a delta lake on-premise ? if yes, what softwares/tools needs to be installed?
I'm trying to implement a delta lake on premise to analyze some log files and database tables. My current machine is loaded with ubuntu, apache spark. Not sure what other tools are required.
Are there any other tool suggestions to implement on-premise data lake concept?

Yes, you can use Delta Lake on-premise. It's just a matter of the using correct version of the Delta library (0.6.1 for Spark 2.4, 0.8.0 for Spark 3.0). Or running the spark-shell/pyspark as following (for Spark 3.0):
pyspark --packages io.delta:delta-core_2.12:0.8.0
then you can write data in Delta format, like this:
spark.range(1000).write.format("delta").mode("append").save("1.delta")
It can work with local files as well, but if you need to build a real data lake, then you need to use something like HDFS that is also supported out of the box.

Related

Spark SQL vs Databricks SQL

I recently started working with spark and was eager to know if I have to perform queries which would be better spark sql or databricks sql and why?
We need to distinguish two things here:
Spark SQL as a dialect of the SQL language. Originally started as Shark & Hive on Spark projects (blog), it's now going close to ANSI SQL.
Spark SQL as execution engine inside Spark.
As was mentioned in this answer, Databricks SQL as language is primarily based on Spark SQL with some additions specific to Delta Lake tables (like CREATE TABLE CLONE, ...). ANSI compatibility in Databricks SQL is controlled with ANSI_MODE setting, and will be enabled by default in the future.
But when it comes to the execution, Databricks SQL is different from Spark SQL engine because it uses Photon engine heavily optimized for modern hardware and BI/DW workloads. With Photon you can get significant speedup (2-3x) compared to standard Spark SQL engine on the complex queries that process a lot of data.
In basic nut shell, you can download Apache Spark with pre-built Hadoop. You need to download the package from free. Additionally you can add Delta Lake and other third-party software.
Now Databricks is platform where you have to pay, it contains Apache SPARK + Delta Lake + many built in extras.
As expected, performance and SQL dialect between Hadoop and Delta Lake are different since they are different databases.
You can install Delta Lake in Apache Spark so you compare Hadoop vs Delta Lake

Delta Lake independent of Apache Spark?

I have been exploring the data lakehouse concept and Delta Lake. Some of its features seem really interesting. Right there on the project home page https://delta.io/ there is a diagram showing Delta Lake running on "your existing data lake" without any mention of Spark. Elsewhere it suggests that Delta Lake indeeds runs on top of Spark. So my question is, can it be run independently from Spark? Can I, for example, set up Delta Lake with S3 buckets for storage in Parquet format, schema validation etc, without using Spark in my architecture?
You might keep an eye on this: https://github.com/delta-io/delta-rs
It's early and currently read-only, but worth watching as the project evolves.
tl;dr No
Delta Lake up to and including 0.8.0 is tightly integrated with Apache Spark so it's impossible to have Delta Lake without Spark.

Difference Between Spark SQL and Hive

Can you please help me to understand the difference between Spark SQl and Hive?
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Built on top of Apache Hadoop, Hive provides the following features:
Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
A mechanism to impose structure on a variety of data formats
Where as, Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing.
Spark SQL is a Spark module for structured data processing, in which in-memory processing is its core. Using Spark SQL, can read the data from any structured sources, like JSON, CSV, parquet, avro, sequencefiles, jdbc , hive etc.
Spark SQL can also be used to read data from an existing Hive installation. Thus, Spark SQL is the generalized module which can be used to process any structured data-source.

Possibilities of Hadoop with MSSQL Reporting

I have been evaluating Hadoop on azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that the I need to integrate with MSSQL Reporting Services as that is what our application already uses. We are very short on developer resources so the more I can make this into an engineering exercise the better. What I have tried so far
Use an ODBC connection from MSSQL mapped to the Hive on HDInsight.
Use an ODBC connection from MSSQL using HBASE on HDInsight.
Use SPARKQL locally on the azure HDInsight Remote desktop
What I have found is that HBASE and Hive are far slower to use with our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds. I ran the query on the hive query console and on the ODBC connection and found that it took over a minute to execute. Spark was faster (30 seconds) but there is no way to connect to it externally since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is, am I looking for Hadoop to do something it is not designed to do and are there ways to make this faster?I have considered caching results and periodically refreshing them, but it sounds like a management nightmare. Kylin looks promising but we are pretty married to windows azure, so I am not sure that is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically look at ORC and using Tez. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should be much more performant then.
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it and the performance is not comparable to what you'll get with MSSQL, other RDBMS or something like ElasticSearch but it does work pretty reliably.

For Hadoop, which data storage to choose, Amazon S3 or Azure Blob Store?

I am working on a Hadoop project and generating lots of data in my local cluster. Sooner later I will be using cloud based Hadoop solution because my Hadoop cluster is very small comparative to real work load, however I dont have a choice as of now which one I will be using i.e. Windows Azure based, EMR or something else. I am generating lots of data locally and want to store this data to some cloud based storage based on the fact that I will use this data with Hadoop later but very soon.
I am looking for suggestion to decided which cloud store to choose based in someone experience. Thanks in advance.
First of all it is a great question. Let's try to understand "How data is processed in Hadoop":
In Hadoop all the data is processed on Hadoop cluster means when you process any data, that data is copied from its sources to HDFS, which is an essential component of Hadoop.
When data is copied to HDFS only after your run Map/Reduce jobs in it to get your results.
That means it does not matter what and where your data sources is(Amazon S3, Azure Blob, SQL Azure, SQL Server, on premise source etc), you will have to move/transfer/copy your data from source to HDFS, within the limits of Hadoop.
Once data is processed in Hadoop cluster, the result will be stored the location you would have configured in your job. The output data source can be HDFS or an outside location accessible from Hadoop Cluster
Once you have data copied to HDFS you can keep it one HDFS as long as you want but you will have to pay the price to use the Hadoop cluster.
In some cases when you are running Hadoop Job between some interval and data move/copy can be done faster, it is good to have a strategy to 1) acquire Hadoop cluster 2) copy data 3) run job 4) release cluster.
So based on above details, when you choose a data source in Cloud for your Hadoop Cluster you would have to consider the following:
If you have large data (which is normal with Hadoop clusters) to process, consider different data sources and the time it will take to copy/move data from those data source to HDFS because this will be your first step.
You would need to choose a data source which must have the lowest network latency so you can get data in and out, as fast as possible.
You also need to consider how you will move large amount of data from your current location to any cloud store. The best option would be to have a storage where you can send your data disk (HDD/Tape etc) because uploading multiple TB data will take great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure in CTP) and Google (BigQuery in Preview, based on Google Dremel) provides pre-configured Hadoop clusters in cloud so you can choose where you would want to run your Hadoop job then you can consider the cloud storage.
Even if you choose one cloud data storage and decide to move to other because you want to use other Hadoop cluster in cloud, you sure can transfer the data however consider the time and data transfer support available to you.
For example, with HadooponAzure you can connect various data sources i.e. Amazon S3, Azure Blob Storage, SQL Server and SQL Azure etc so a variety of data sources are the best with any cloud Hadoop cluster.

Resources