Can we configure Presto's database connector information from its GUI?

I am using Presto version 179, and I need to manually create a database.properties file in /etc/presto/catalog through the CLI.
Can I do the same from Presto's GUI?

Presto's built-in web interface does not provide any configuration capabilities.
Usually, such things are handled as part of deployment/configuration management on a cluster, so catalog configuration is provided by external means, just like the Presto installation itself.
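By way of illustration, a catalog is registered by dropping a properties file into /etc/presto/catalog and restarting the server. A minimal sketch for a MySQL catalog (the host, port, and credentials below are placeholders) might look like:

```
connector.name=mysql
connection-url=jdbc:mysql://db.example.net:3306
connection-user=presto
connection-password=secret
```

The file name (minus the .properties extension) becomes the catalog name visible in queries.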

Creating catalog/schema/table in prestosql/presto container

I would like to use the prestosql/presto container for automated tests. For this purpose I want to be able to create catalogs/schemas/tables programmatically. Unfortunately, I didn't find an option to do this via Docker environment variables, and if I try to do it via the JDBC connector, I get the following error: "This connector does not support creating tables".
How can I create schemas or tables using the prestosql/presto container?
If you are writing tests in Java (as suggested by the JDBC tag), you can use the Testcontainers library. It comes with a Presto module, which:
uses the prestosql/presto container under the hood, and
comes with the Presto memory connector pre-installed, so you can create schemas and tables there.
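Once the container is up, creating schemas and tables in the pre-installed memory catalog is plain SQL; a sketch of what you might run over JDBC (the schema, table, and column names here are arbitrary examples):

```sql
CREATE SCHEMA memory.test;
CREATE TABLE memory.test.events (id bigint, name varchar);
INSERT INTO memory.test.events VALUES (1, 'signup');
SELECT * FROM memory.test.events;
```

Because the memory connector supports CREATE TABLE, these statements succeed where the earlier "This connector does not support creating tables" error came from a connector that does not.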

Alternative to presto with fallback mechanism

I am using Presto as a querying layer over Cassandra for various aggregations, but I am facing an issue where, if a node goes down or a timeout occurs for some reason, the running query fails. I need some kind of fallback mechanism.
Is there any alternative to Presto with which I can implement a fallback mechanism, or is there any way to implement it in Presto itself?
Some vendors like Qubole have implemented a retry mechanism specifically for these kinds of issues (which are more visible in the cloud, especially if you use spot nodes on AWS or pre-emptible VMs on GCP).
Note: this works only with Qubole's managed Presto service.
Disclaimer: I work for Qubole

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only a Python boilerplate is supported. There are also a few pieces of JNI beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive use?
Can I create something like a pipeline (where the output of one stage goes to another) with Bluemix, and do I need to write code for it?
I would appreciate any help with the above questions.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code/batch programs with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
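As a rough sketch of what a batch submission looks like (the class name, jar, and input path below are hypothetical, and the exact submission mechanism for the Bluemix service is described in its own documentation):

```
spark-submit \
  --class com.example.MyBatchJob \
  my-batch-job.jar \
  input/records.txt
```

The main class of the jar receives the remaining arguments, so a billion-record text file can be processed by a Java Spark job submitted this way rather than interactively from a notebook.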

How do you install custom software on worker nodes in Azure HDInsight?

I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will setup all the workers, but I'm not convinced that this is the best solution. Are there better options?
Edit: with AWS Elastic MapReduce there is an option to install additional software via a bootstrap action defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software
The short answer is that you don't. It's not ideal from a caching perspective, but you ought to be able to bundle all your job dependencies into the MapReduce jar, which is distributed across the cluster for you by YARN (part of Hadoop). Broadly speaking, this is transparent to the end user, as it's all handled through the job submission process.
If you need something large which is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage and reference it in a class path, but that might add complexity if you are, for instance, using the .NET streaming API.
I've just heard from a colleague that I need to update my Azure PowerShell, because a new cmdlet, Add-AzureHDInsightScriptAction, was recently added, and it does just that.
Customize HDInsight clusters using Script Action
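A sketch of how that cmdlet might be used with the classic Azure PowerShell module of that era (the cluster name, node roles, and script URI below are placeholders; the exact parameter set should be checked against the cmdlet's documentation):

```powershell
# Build a cluster configuration and attach a Script Action that runs on every node
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$config = Add-AzureHDInsightScriptAction -Config $config `
    -Name "InstallCustomSoftware" `
    -ClusterRoleCollection HeadNode,DataNode `
    -Uri "https://mystorage.blob.core.windows.net/scripts/install.ps1"
New-AzureHDInsightCluster -Config $config -Name "mycluster" -Location "North Europe"
```

The referenced script is downloaded and run on each node during provisioning, which plays the same role as an EMR bootstrap action.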

HDInsight persistent Hive settings

Every few days the Azure HDInsight cluster is (randomly?) restarted by Microsoft, and in the process any custom changes to hive-site.xml (such as adding a JsonSerDe) are lost without any prior warning; as a result, the Hive queries from Excel/PowerPivot start breaking.
How are you supposed to deal with this scenario? Are we forced to store our data as CSV files?
To preserve customizations across OS updates or node re-images, consider using a Script Action. Here is the link: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
If you specify the Hive config parameter with a custom configuration object at the time of cluster creation, it should persist. The link here http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management has some more details on creating a cluster with custom configuration.
This blog post on MSDN has a table showing which customizations are supported via the different methods, as well as examples of using PowerShell or the SDK to create a cluster with custom Hive configuration parameters (lines 62-64 in the PowerShell example): http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx
This is the only way to persist these settings because the cluster nodes can be reset for Azure servicing events such as security updates, and the configurations are set back to the initial values when this occurs.
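A minimal sketch of passing custom Hive settings at provisioning time with the classic Azure PowerShell cmdlets (the SerDe jar path and property value are placeholders, and the exact object and parameter names should be verified against the linked documentation):

```powershell
# Hypothetical example: attach custom Hive configuration values to the cluster config
$hiveConfig = New-Object Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration
$hiveConfig.Configuration = @{ "hive.aux.jars.path" = "wasb:///lib/json-serde.jar" }

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Add-AzureHDInsightConfigValues -Hive $hiveConfig |
    New-AzureHDInsightCluster -Name "mycluster" -Location "North Europe"
```

Because these values are part of the cluster's provisioning state rather than edits to hive-site.xml on the nodes, they are reapplied automatically when a node is re-imaged.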
