I have been using Databricks Community Edition for over 4 years and suddenly I'm unable to create a single node cluster as I have always been doing.
I keep getting the message 'Only professional or enterprise tier customers can create autoscaling clusters' (see image), but I don't see an option to not create an autoscaling cluster.
Have Databricks pulled the plug on Databricks Community Edition users? Or is there something that I'm doing wrong?
I just want to use the Community Edition in a simple manner, as I have for the past 4 years: with a single node.
SparkSessionExtensions injectFunction works locally, but I can't get it working in the Databricks environment.
The itachi project defines Catalyst expressions, like age, that I can successfully use locally via spark-sql:
bin/spark-sql --packages com.github.yaooqinn:itachi_2.12:0.1.0 --conf spark.sql.extensions=org.apache.spark.sql.extra.PostgreSQLExtensions
spark-sql> select age(timestamp '2000', timestamp'1990');
10 years
I'm having trouble getting this working in the Databricks environment.
I started up a Databricks community cluster with the spark.sql.extensions=org.apache.spark.sql.extra.PostgreSQLExtensions configuration option set.
Then I attached the library.
The array_append function that's defined in itachi isn't accessible like I expected it to be:
Confirm configuration option is properly set:
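A check along these lines (run in a notebook cell, using the `spark` session Databricks injects) should show whether the setting took effect:

```scala
// Read back the extensions setting from the running session.
// If the cluster config was applied, this should return
// "org.apache.spark.sql.extra.PostgreSQLExtensions".
spark.conf.get("spark.sql.extensions")
```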
spark-alchemy has another approach that works in the Databricks environment. Do we need to mess around with Spark internals to get this working in the Databricks environment? Or is there a way to get injectFunction working in Databricks?
spark.sql.extensions works just fine on full Databricks (unless the extension goes too deep into Spark internals, where incompatibilities sometimes appear), but not on Community Edition. The problem is that spark.sql.extensions is applied during session initialization, while a library specified in the UI is installed afterwards, so installation happens after (or in parallel with) initialization. On full Databricks this is worked around by using an init script to install the library before the cluster starts, but that functionality is not available on Community Edition.
The workaround would be to register functions explicitly, like this:
%scala
import org.apache.spark.sql.catalyst.expressions.postgresql.{Age, ArrayAppend, ArrayLength, IntervalJustifyLike, Scale, SplitPart, StringToArray, UnNest}
import org.apache.spark.sql.extra.FunctionAliases
spark.sessionState.functionRegistry.registerFunction(Age.fd._1, Age.fd._2, Age.fd._3)
spark.sessionState.functionRegistry.registerFunction(FunctionAliases.array_cat._1, FunctionAliases.array_cat._2, FunctionAliases.array_cat._3)
spark.sessionState.functionRegistry.registerFunction(ArrayAppend.fd._1, ArrayAppend.fd._2, ArrayAppend.fd._3)
spark.sessionState.functionRegistry.registerFunction(ArrayLength.fd._1, ArrayLength.fd._2, ArrayLength.fd._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyDays._1, IntervalJustifyLike.justifyDays._2, IntervalJustifyLike.justifyDays._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyHours._1, IntervalJustifyLike.justifyHours._2, IntervalJustifyLike.justifyHours._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyInterval._1, IntervalJustifyLike.justifyInterval._2, IntervalJustifyLike.justifyInterval._3)
spark.sessionState.functionRegistry.registerFunction(Scale.fd._1, Scale.fd._2, Scale.fd._3)
spark.sessionState.functionRegistry.registerFunction(SplitPart.fd._1, SplitPart.fd._2, SplitPart.fd._3)
spark.sessionState.functionRegistry.registerFunction(StringToArray.fd._1, StringToArray.fd._2, StringToArray.fd._3)
spark.sessionState.functionRegistry.registerFunction(UnNest.fd._1, UnNest.fd._2, UnNest.fd._3)
After that it works:
It's not as handy as extensions, but that's a limitation of Community Edition.
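Since every registration passes the same (identifier, info, builder) triple, the boilerplate above can be collapsed into a loop; a sketch, assuming the same itachi imports as in the snippet above:

```scala
import org.apache.spark.sql.catalyst.expressions.postgresql.{Age, ArrayAppend, ArrayLength, Scale, SplitPart, StringToArray, UnNest}

// Each *.fd is a (FunctionIdentifier, ExpressionInfo, FunctionBuilder)
// triple, so all of them can be registered in a single pass.
Seq(Age.fd, ArrayAppend.fd, ArrayLength.fd, Scale.fd,
    SplitPart.fd, StringToArray.fd, UnNest.fd)
  .foreach { case (name, info, builder) =>
    spark.sessionState.functionRegistry.registerFunction(name, info, builder)
  }
```

The loop body is identical to the explicit calls above, so it can be extended with the justify* and alias triples the same way.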
My goal is to migrate from freeipa v3 to v4. Both versions are a cluster of two nodes.
v3 is centos 6 and v4 is centos 7.
I want to migrate the dns entries from the old cluster to the new one. Both have the same dns zone(s) and after all dns entries are on both clusters I will migrate all hosts to the new cluster.
The users will be created manually; the goal is to have a fresh FreeIPA environment.
Which commands do I need to know or use to achieve that?
An export/import function would also do the trick.
This is all documented in the official documentation in the "MIGRATING IDENTITY MANAGEMENT FROM RED HAT ENTERPRISE LINUX 6 TO VERSION 7" section:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/linux_domain_identity_authentication_and_policy_guide/migrate-6-to-7
The key is not to start from scratch, but rather to add CentOS 7 replicas to the CentOS 6 deployment, move services over to the CentOS 7 replicas, and then decommission the CentOS 6 machines. This is, in general, much easier and more reliable than starting from scratch by importing older data.
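A minimal sketch of that replica-based flow (host names are placeholders, and the exact options depend on your CA and DNS setup; follow the linked guide for the full procedure):

```shell
# On the existing CentOS 6 (IPA v3) master: prepare a replica file
# for the new host and copy it over.
ipa-replica-prepare replica1.example.com --ip-address 192.0.2.10
scp /var/lib/ipa/replica-info-replica1.example.com.gpg \
    root@replica1.example.com:/var/lib/ipa/

# On the new CentOS 7 (IPA v4) host: install as a replica,
# replicating DNS and the CA as well, so the zones come along.
ipa-replica-install --setup-dns --setup-ca \
    /var/lib/ipa/replica-info-replica1.example.com.gpg

# Once the CentOS 7 replicas carry all services, remove the old
# CentOS 6 masters from the topology and decommission them.
ipa-replica-manage del master-v3.example.com
```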
There is an old (year 2014) talk on YouTube where the speaker visualized a query plan right inside a Databricks notebook. Here is the screenshot:
I am using Databricks Runtime 5.5 LTS ML, and whenever I try to call viz on a query plan, I get this kind of error:
error: value viz is not a member of org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
Has this feature been deprecated in Spark 2.0+ or do I need to install/import other libraries to get this feature?
I don't think that visualization exists anymore, but you can explain SQL queries, which I think is what you are looking for.
spark.sql("your SQL query").explain(true)
or
yourDataframe.explain(true)
I saw an email indicating the sunset of support for Apache Spark 1.6 within IBM Cloud. I am pretty sure my version is 2.x, but I wanted to confirm. I couldn't find anywhere in the UI that indicated the version, and the bx CLI command that I thought would show it didn't.
[chrisr#oc5287453221 ~]$ bx service show "Apache Spark-bc"
Invoking 'cf service Apache Spark-bc'...
Service instance: Apache Spark-bc
Service: spark
Bound apps:
Tags:
Plan: ibm.SparkService.PayGoPersonal
Description: IBM Analytics for Apache Spark for IBM Cloud.
Documentation url: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
Dashboard: https://spark-dashboard.ng.bluemix.net/dashboard
Last Operation
Status: create succeeded
Message:
Started: 2018-01-22T16:08:46Z
Updated: 2018-01-22T16:08:46Z
How do I determine the version of spark that I am using? Also, I tried going to the "Dashboard" URL from above, and I got an "Internal Server Error" message after logging in.
The information found on How to check the Spark version doesn't seem to help, because it seems to be related to locally installed Spark instances. I need to find out the information from IBM Cloud (i.e. Bluemix) using either the UI or the Bluemix CLI. Another possibility would be running a command from a Jupyter notebook in Data Science Experience (part of IBM Cloud).
The answer was given by ptitzler above, just adding an answer as requested by the email I was sent.
The Spark service itself is not version specific. To find out whether
or not you need to migrate you need to inspect the apps/tools that
utilize the service. For example if you've created notebooks in DSX
you associated them with a kernel that was bound to a specific Spark
version and you'd need to open each notebook to find out which Spark
version they are utilizing. – ptitzler Jan 31 at 16:32
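For the notebook case specifically, a kernel attached to the Spark service already exposes a session, so the version can be printed directly; a sketch, assuming the usual `spark`/`sc` handles that the kernel provides:

```scala
// In a notebook cell attached to the Spark service:
// the version string of the Spark build the kernel is bound to.
spark.version

// On older kernels that only expose a SparkContext:
sc.version
```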
What I'm doing
I'm working on Microsoft Azure, and here is the thing: I'm trying to create an R cluster on Azure with Hadoop 3.6, but I need some default tools like NiFi, Kafka and Storm, which are available on an HDF.
Problem
When I create the cluster, I can't choose the Ambari instance, so I create the cluster with a template that I activate every morning, and another one to delete the cluster every night. I was wondering if it's possible to choose the Ambari instance while using the template.
Does anyone have an idea?
AFAIK, you cannot change the version of Ambari, since it comes bundled with each HDInsight version.
You can find more details in this documentation regarding the Hadoop components available with different HDInsight versions.