If I have data in a Hadoop cluster or SQL Elastic DB, is ML bringing that data onto ML servers, or leaving it on Hadoop/SQL and running its analysis there?
Currently, Azure Machine Learning will bring that data onto ML servers.
Execution for each of the modules is done on AzureML's backend servers. But you can connect to databases either through "Reader" modules or through, say, Python code that uses ODBC to issue queries to a database and get the results back, in which case the query is executed on the database servers and only the results are sent to AzureML. This is useful if you want to run an aggregation query in Hive or SQL to reduce the size of your dataset before bringing it into AzureML.
Related
I want to use Spark SQL (installed on Machine 1) with connectors for different data stores like HBase, Hive, Cassandra, and MySQL (installed on Machine 2) to perform simple analytics like min/max, averaging, etc.
My question: is the processing of these queries done on Machine 1, or does Spark SQL just act as an interface while the analytics run on the data store end (i.e. Machine 2)?
Yes and no. It depends on your Spark job.
Spark SQL is a separate implementation and is datastore agnostic. When you run a Spark SQL job, Spark transforms it into a DAG (directed acyclic graph).
This is similar to a database query plan, but it runs entirely on the Spark cluster.
In the case of a simple min/max, it might be translated into a direct query against the underlying store. But it might also be translated into something that preselects a bunch of records and then does its own processing in Spark. This also makes it possible to join and aggregate data from different data sources.
You can analyze the Spark SQL plan with the usual explain statement, or via the Spark web UI.
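For illustration, here is a minimal sketch (the MySQL host, table, and credentials are made up) that registers a JDBC-backed table and prints the plan, so you can see what gets pushed down to the store on Machine 2 and what runs on the Spark cluster on Machine 1:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()

// Hypothetical MySQL table living on "Machine 2"; URL, table, and credentials are placeholders.
val readings = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://machine2:3306/metrics")
  .option("dbtable", "sensor_readings")
  .option("user", "spark_user")
  .option("password", "secret")
  .load()

readings.createOrReplaceTempView("sensor_readings")

val stats = spark.sql(
  "SELECT MIN(value) AS min_v, MAX(value) AS max_v, AVG(value) AS avg_v FROM sensor_readings")

// The physical plan shows which parts (filters, sometimes parts of the aggregation)
// end up inside the JDBC scan and which run as Spark operators.
stats.explain(true)
stats.show()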
I'm looking, with no success, for how to read an Azure Synapse table from Scala Spark. On https://learn.microsoft.com I found connectors for other Azure databases with Spark, but nothing for the new Azure Data Warehouse.
Does anyone know if it is possible?
It is now directly possible, with trivial effort (there is even a right-click option added in the UI for this), to read data from a DEDICATED SQL pool in Azure Synapse (the new Analytics workspace, not just the DWH) from Scala (and unfortunately, ONLY Scala right now).
Within the Synapse workspace (there is of course a write API as well, sketched below after the imports):
val df = spark.read.sqlanalytics("<DBName>.<Schema>.<TableName>")
If you are outside of the integrated notebook experience, you need to add the imports:
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._
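For completeness, the write direction looks roughly like this. This is a sketch based on the same connector's write API; Constants.INTERNAL targets a managed table in the dedicated pool, and df and the table name are placeholders:

// Sketch: write an existing DataFrame `df` back to the dedicated SQL pool as a managed table.
df.write.sqlanalytics("<DBName>.<Schema>.<TableName>", Constants.INTERNAL)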
It sounds like they are working on expanding this to the SERVERLESS SQL pool, as well as to other SDKs (e.g. Python).
Read the top portion of this article as a reference: https://learn.microsoft.com/en-us/learn/modules/integrate-sql-apache-spark-pools-azure-synapse-analytics/5-transfer-data-between-sql-spark-pool
Maybe I misunderstood your question, but normally you would use a JDBC connection in Spark to use data from a remote database.
Check this doc:
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
Keep in mind that Spark would have to ingest the data from the Synapse tables into memory for processing and perform the transformations there; it is not going to push operations down into Synapse.
Normally you want to run the SQL query against the source database and only bring the results of that SQL into a Spark DataFrame, for example:
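As a rough sketch of that pattern (the JDBC URL, credentials, and query are placeholders; the query option requires Spark 2.4+), the aggregation runs in Synapse and only the small result set lands in the DataFrame:

// Push the heavy aggregation down to Synapse; only the aggregated rows come back to Spark.
val summary = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=mydw")
  .option("query", "SELECT region, SUM(amount) AS total FROM dbo.Sales GROUP BY region")
  .option("user", "spark_reader")
  .option("password", "secret")
  .load()

summary.show()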
I have multiple terabyte-sized files that need to be loaded into a database which sits on top of a high-performance Azure SQL server in the cloud.
For now I'm trying to load these files via an SSIS package, and it is taking more than 12 hours to complete for 5 files.
I believe HDInsight/Databricks are in Azure to do big data ETL and analyze data using Ambari and other UIs. But is it possible to use the same (HDInsight or Databricks) to load the huge data files into a SQL table/database? (For example, using clusters to load multiple files in parallel.)
Any suggestion/help is much appreciated.
Since you mentioned SSIS, I was wondering if you have considered Azure Data Factory (personally I consider it to be the next version of SSIS in the cloud). The copy activity should do the trick, and it does support parallel execution. Since you are considering Azure SQL, we also need to consider congestion on the sink side, i.e. the scenario where all the terabytes of files try to write to the SQL table at the same time.
I just got this question while designing the storage part of a Hadoop-based platform. If we want data scientists to have access to tables that are already stored in a relational database (e.g. SQL Server on an Azure Virtual Machine), is there any particular benefit to importing the tables from SQL Server into HDFS (e.g. WASB) and creating Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet, etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has been asked before; I have done some research but could not find a comparison between the two approaches.
I think there will be a big performance improvement, since the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored on HDFS). With JDBC, if the processing is interactive, the user has to wait for the data to be loaded over JDBC from another machine (network latency and I/O throughput), whereas if that load is done upfront, the user (data scientist) can concentrate on running the analysis straight away.
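As a rough sketch of the "persist once, query locally" approach (connection details, paths, and table names are made up): read the table over JDBC a single time, write it to WASB/HDFS as Parquet, and let all subsequent Spark SQL work hit the local copy:

// One-time ingest over JDBC (placeholders for the SQL Server on the Azure VM).
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlvm.example.com:1433;database=sales")
  .option("dbtable", "dbo.Orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// Persist in a columnar format local to the Hadoop cluster.
orders.write.mode("overwrite").parquet("wasb:///data/orders_parquet")

// Interactive analysis then reads the local Parquet copy instead of going back over JDBC.
val local = spark.read.parquet("wasb:///data/orders_parquet")
local.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS big_orders FROM orders WHERE amount > 100").show()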
I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this an engineering exercise the better. What I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower to use with our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds. I ran the query in the Hive query console and over the ODBC connection and found that it took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I asking Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but that sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure it is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and using Tez. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should be much more performant then.
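As a rough illustration (the table names are hypothetical, and this issues the statement through Spark SQL against the shared Hive metastore rather than the Hive CLI), converting an existing text-format table into ORC is a single CTAS-style statement:

// Hypothetical tables: rewrite a text-format Hive table as ORC so subsequent queries read the columnar copy.
spark.sql("CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_text")

// Reports and ad hoc queries then target the ORC table.
spark.sql("SELECT COUNT(*) FROM sales_orc").show()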
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it, and while the performance is not comparable to what you'll get with MSSQL, another RDBMS, or something like Elasticsearch, it does work pretty reliably.