Cannot configure a GCP project when using DataProcPySparkOperator - python-3.x

I am using a Cloud Composer environment to run workflows in a GCP project. One of my workflows creates a Dataproc cluster in a different project using the DataprocClusterCreateOperator, and then attempts to submit a PySpark job to that cluster using the DataProcPySparkOperator from the airflow.contrib.operators.dataproc_operator module.
When creating the cluster, I can specify a project_id parameter so that it is created in another project, but DataProcPySparkOperator seems to ignore this parameter. For example, I expect to be able to pass a project_id, but I end up with a 404 error when the task runs:
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

t1 = DataProcPySparkOperator(
    project_id='my-gcp-project',
    main='...',
    arguments=[...],
)
How can I use DataProcPySparkOperator to submit a job in another project?

The DataProcPySparkOperator from the airflow.contrib.operators.dataproc_operator module doesn't accept a project_id kwarg in its constructor, so it always submits Dataproc jobs to the project the Cloud Composer environment lives in. If you pass a project_id anyway, it is ignored, and the task fails with a 404 error because the operator polls for the job under an incorrect cluster path.
One workaround is to copy the operator and hook and modify them to accept a project ID. An easier solution, if you are on a version of Airflow that supports them, is to use the newer operators from the airflow.providers packages, since many airflow.contrib operators are deprecated in newer Airflow releases.
Below is an example. Note that this module also provides a newer DataprocSubmitPySparkJobOperator, but it is itself deprecated in favor of DataprocSubmitJobOperator, so use the latter; it accepts a project ID.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

t1 = DataprocSubmitJobOperator(
    project_id='my-gcp-project-id',
    location='us-central1',
    job={...},
)
If you are running an environment with Composer 1.10.5+, Airflow version 1.10.6+, and Python 3, the providers are preinstalled and can be used immediately.
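For completeness, here is roughly what the job payload could look like. This is only a sketch: the cluster name, bucket path, and region below are placeholders, not values from the question.

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# The Dataproc Job resource itself carries the target project and cluster,
# so the job is submitted independently of the Composer environment's project.
pyspark_job = {
    "reference": {"project_id": "my-gcp-project-id"},
    "placement": {"cluster_name": "my-existing-cluster"},  # placeholder cluster name
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/main.py",  # placeholder path
        "args": ["--run-date", "2021-01-01"],
    },
}

t1 = DataprocSubmitJobOperator(
    task_id="submit_pyspark",
    project_id="my-gcp-project-id",
    location="us-central1",
    job=pyspark_job,
)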

Related

Is it possible to stream Cloud Build logs with the Node.js library?

Some context: Our Cloud Build process relies on manual triggers and about 8 substitutions to customize deploys to various firebase projects, hosting sites, and preview channels. Previously we used a bash script and gcloud to automate the selection of these substitution options, the "updating" of the trigger (via gcloud beta builds triggers import: our needs require us to use a single trigger, it's a long story), and the "running" of the trigger.
This bash script was hard to work with and improve, and through the import-run shenanigans actually led to some faulty deploys that caused all kinds of chaos: not great.
However, recently I found a way to pass substitution variables as part of a manual trigger operation using the Node.js library for Cloud Build (runTrigger with subs passed as part of the request)!
Problem: So I'm converting our build utility to Node, which is great, but as far as I can tell there isn't a native way to stream build logs from a running build in the console (except maybe with exec, but that feels hacky).
Am I missing something? Or should I be looking at one of the logging libraries?
I've tried my best scanning Google's docs and APIs (Cloud Build REST, the Node client library, etc.) but to no avail.

Why are Cloud Function Runtime Environment Variables being deleted on deploy?

I recently (2 days ago) upgraded the node runtime engine on our Cloud Functions instance from Node 10 to 12. (Not sure that is a factor, but it is a recent change.)
Since the upgrade I have been using the Cloud Functions project without trouble. Today is the first time I have done a deploy SINCE the deployment to change the node engine. After I did the deploy, ALL of the Runtime Environment Variables were deleted except one labeled FIREBASE_CONFIG.
As a test, I added another test environment variable via the Cloud Functions console UI. I refreshed the page to ensure the variable was there. Then, I ran another deploy, using this command:
firebase use {project_name} && firebase deploy --only functions:{function_name}
After the deploy completed, I refreshed the environment variables page and found that the test variable I had created was now missing.
I'm quite stumped. Any ideas? Thank you!
It is true that the Firebase CLI manages environment configuration and does not allow us to set the runtime's ENV variables during deployment. This has been explained in other posts as well, like this one.
I guess you are aware of the difference between the Cloud Functions Runtime Variables and the Firebase Environment configuration, so I will just leave it here as a friendly reminder.
Regarding the actual issue (New deployments erasing previously set "Cloud Functions Runtime Variables"), I believe this must have been something they fixed already, because I have tested with version 9.10.2 of the firebase CLI, and I could not replicate the issue on my end.
I recommend checking the CLI version that you have (firebase --version) and, if you still experience the same issue, provide us with the steps you took.
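As a friendly illustration of the two mechanisms (the function name and keys below are made up, not taken from the question):

# Firebase environment configuration, managed by the Firebase CLI:
firebase functions:config:set someservice.key="SOME_VALUE"

# Cloud Functions runtime environment variables, managed outside the Firebase CLI,
# e.g. with gcloud (add the function's usual deploy flags as needed) or the console UI:
gcloud functions deploy my-function --update-env-vars FOO=bar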

How to run Spark processes in develop environment using a cluster?

I'm implementing different Apache Spark solutions using IntelliJ IDEA, Scala and SBT; however, each time I want to run my implementation I need to go through the following steps after creating the jar:
Amazon: send the .jar to the master node over SSH, and then run spark-shell from the command line.
Azure: I'm using the Databricks CLI, so each time I want to upload a jar I uninstall the old library, remove the jar stored on the cluster, and finally upload and install the new .jar (these steps are sketched below).
So I was wondering whether it is possible to do all of this in one click, for example using the IntelliJ IDEA Run button, or with some other method that makes it all simpler. I was also thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.
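Not a full answer, but the Azure steps described above can at least be scripted with the Databricks CLI. This is a rough sketch only; the cluster ID and jar paths are placeholders, and exact flags can differ between CLI versions:

# build the jar (e.g. with sbt package or sbt assembly)
sbt assembly

# replace the jar stored on DBFS
databricks fs cp target/scala-2.12/my-app.jar dbfs:/jars/my-app.jar --overwrite

# detach the old library, restart so the uninstall takes effect, then attach the new jar
databricks libraries uninstall --cluster-id 1234-567890-abcde123 --jar dbfs:/jars/my-app.jar
databricks clusters restart --cluster-id 1234-567890-abcde123
databricks libraries install --cluster-id 1234-567890-abcde123 --jar dbfs:/jars/my-app.jar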

/bin/sh: 1: gcloud: not found

I have my NodeJS service running on Cloud App Engine. From this NodeJS service, I want to execute a gcloud command. I am getting the error below, and my App Engine NodeJS service fails to run the gcloud command.
/bin/sh: 1: gcloud: not found
Connect to your instance and check if you have the gcloud SDK installed in the default runtime image supplied by Google.
If it isn't installed (not impossible - it doesn't appear to be included in the standard environment either; see System Packages Included in the Node.js Runtime), then you could try treating it just like any other non-Node.js dependency and build a custom runtime with it - see Google App Engine - specify custom build dependencies
If it is installed check if you need to tweak your app's environment to access it.
But in general the gcloud command isn't really designed to be executed on the deployed instances. Depending on what exactly you're trying to achieve, there may be better suited/more direct/programmatic API alternatives (which, probably in most cases, is what the gcloud command invokes under the hood as well).
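As an illustration of that last point (in Python for brevity; equivalent Node.js client libraries exist), something like shelling out to gcloud/gsutil to list buckets can be replaced by a client-library call that runs on the instance with the App Engine service account's default credentials. The bucket listing here is just an arbitrary example, not something from the question:

# pip install google-cloud-storage
from google.cloud import storage

# Uses Application Default Credentials; no gcloud binary required on the instance.
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)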

How to get Selenium working with Jenkins2 in GCP

I'm trying to get Selenium Grid and Jenkins working together in GKE.
I found the Selenium plugin (https://plugins.jenkins.io/selenium) for Jenkins, but I'm not sure it can be used to get what I want.
I stood Jenkins up by following the steps here:
https://github.com/GoogleCloudPlatform/kube-jenkins-imager
( I changed the image for the jenkins node to use Jenkins 2.86 )
This creates an instance of Jenkins running in kubernetes that spawns slaves into the cluster as needed.
But I don't believe that this is compatible with the Selenium plug-in. What's the best way to take what I have and get it working with this instance of Jenkins?
I was also able to get an instance of Selenium up and going in the same cluster using this:
https://gist.github.com/elsonrodriguez/261e746cf369a60a5e2d
( I dropped the version 2.x from the instances to pull in the latest containers. )
I had to bump the k8s nodes up to n1-standard-2 (2 vCPUs, 7.5 G Memory ) to get those containers to run.
For this proof of concept, the SE nodes don't need to be ephemeral. But I'm unsure what kind of permanent node container image I can deploy in k8s that would have the necessary SE drivers.
On the other hand, maybe it would be easier to just use the stand-alone SE containers that I found. If so, how do I use them with Jenkins2?
Has anyone else gone down this path?
Edit: I'm not interested in third-party selenium services at this time.
SauceLabs is a Selenium grid in the cloud.
I wrote Saucery to make integrating with it from C# or Java, with NUnit 2, NUnit 3 or JUnit 4, easy.
You can see the source code here, here and here, or take a look at the GitHub Pages site here for more information.
Here is what I figured out.
I saw many indications that it was a hassle to run your own instance of Selenium grid. Enough time may have passed for this to be a little easier than it used to be. There seem to be a few ways to do it.
Jenkins itself has a plugin that is supposed to turn your Jenkins cluster into a Selenium 3 grid: https://plugins.jenkins.io/selenium . The problem I had with this is that I'm planning on hosting these instances in the cloud, and I wanted the Jenkins slaves to be ephemeral. I was unable to figure out how to get the plugin to work with ephemeral slaves.
I was trying to get this done as quickly as I could, so I only spent three days total on this project.
These are the forked repos that I'm using for the Jenkins solution:
https://github.com/jnorment-q2/kube-jenkins-imager
which basically implements this:
https://github.com/jnorment-q2/continuous-deployment-on-kubernetes
I'm pointing to my own repos to give a reference to exactly what I used in late October 2017 to get this working. Those repos are forked from the main repos, and it should be easy to compare the differences.
I had contacted Google support with a question; they responded that this link might actually be a bit clearer:
https://cloud.google.com/solutions/jenkins-on-container-engine-tutorial
From what I can tell, this is a manual version of the more automated scripts I referenced.
To stand up Selenium, I used this:
https://github.com/jnorment-q2/selenium-on-k8s
This is a project I built from a gist referenced in the Readme, which references a project maintained by SeleniumHQ.
The main trick here is that Selenium is resource hungry. I had to use the second tier of Google Compute Engine machine types to get it to deploy on Kubernetes. I adapted the script I used to stand up Jenkins to deploy Selenium Grid in a similar fashion.
Also of note, there appear to be only Firefox and Chrome options in the project from SeleniumHQ. I have yet to determine if it is even possible to run an instance of Safari.
For now, this is what we're going to go with.
The piece left is how to make a call to the Selenium grid from Jenkins. It turns out that the Selenium Python bindings can be pip-installed onto the ephemeral slaves, and webdriver.Remote can be used to make the call.
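Roughly, the call from a slave looks like the sketch below (Selenium 3 API; the hub URL is a placeholder for whatever gets passed in as SE_GRID_SERVER):

# pip install selenium
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Placeholder hub URL; in the demo script it comes from the SE_GRID_SERVER parameter.
driver = webdriver.Remote(
    command_executor='http://selenium-hub.example:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.CHROME,
)
try:
    driver.get('https://www.google.com')
    print(driver.title)
finally:
    driver.quit()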
Here is the demo script that I wrote to prove that everything works:
https://github.com/jnorment-q2/demo-se-webdriver-pytest/blob/master/test/testmod.py
It has a Jenkinsfile, so it should work with a fresh instance of Jenkins. Just create a new pipeline, change the definition to 'Pipeline script from SCM' with Git and https://github.com/jnorment-q2/demo-se-webdriver-pytest, then scroll up, click 'run with parameters', and add the parameter SE_GRID_SERVER with the full URL (including port) of the SE grid server.
It should run three tests and fail on the third. ( The third test requires additional parameters for TEST_URL and TEST_URL_TITLE )
