Read Large Data Set to Jupyter Notebook and Manipulate - python-3.x

I am trying to load data from BigQuery to Jupyter Notebook, where I will do some manipulation and plotting. The datasets is 25 millions rows with 10 columns, which definitely exceeds my machine's memory capacity(16 GB).
I have read this post about using HDFStore, but the problem here is that I still need to read the data to Jupyter Notebook to do the manipulation.
I am using Google Cloud Platform, so setting a huge cluster in Dataproc might be an option, though that could be costly.
Anyone gets similar issue and has a solution?

Concerning products within Google Cloud Platform you can create a Datalab instance to run your notebooks and specify the desired machine type with the --machine-type flag (docs). You can use a high-memory machine if needed.
Of course, you can also use Dataproc as you already proposed. For easier setup you can use the predefined initialization action with the following parameter upon cluster creation:
--initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh
Edit
As you are using a GCE instance, you can also use a script to autoshutdown the VM when you are not using it. You can edit ~/.bash_logout so that it checks if it's the last session and, if so, stops the VM
if [ $(who|wc -l) == 1 ];
then
gcloud compute instances stop $(hostname) --zone $(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/zone 2>\dev\null | cut -d/ -f4) --quiet
fi
Or, if you prefer a curl approach:
curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" https://www.googleapis.com/compute/v1/projects/$(gcloud config get-value project 2>\dev\null)/zones/$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/zone 2>\dev\null | cut -d/ -f4)/instances/$(hostname)/stop -d ""
Keep in mind that you might need to update Cloud SDK components to get the gcloud command to work. Either use:
gcloud components update
or
sudo apt-get update && sudo apt-get --only-upgrade install kubectl google-cloud-sdk google-cloud-sdk-datastore-emulator google-cloud-sdk-pubsub-emulator google-cloud-sdk-app-engine-go google-cloud-sdk-app-engine-java google-cloud-sdk-app-engine-python google-cloud-sdk-cbt google-cloud-sdk-bigtable-emulator google-cloud-sdk-datalab -y
You can include one of these and ~/.bash_logout edits in your startup-script.

Related

Databricks init scripts not working sometimes

Ok, it is very strange. I have some init scripts that I would like to run when a cluster starts
cluster has the init script , which is in a file (in dbfs)
basically this
dbfs:/databricks/init-scripts/custom-cert.sh
Now , when I make the init script like this, it works (no ssl errors for my endpoints. Also, the event logs for the cluster shows the duration as 1 second for the init script
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in an bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes , as per the event log but the execution duration is 0 sec.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/ , it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and it least to the naked eye, I can't figure out any difference
i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both the scenarios. What is the problem with the 2nd case?
EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is it that update-ca-certificates found in the first case but not when I put the same script in a sh script and upload it to dbfs (instead of executing the dbutils.fs.put within a notebook) ?
EDIT 2: In response to the first answer. After running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh and then I restart the cluster with the init script location as dbfs:/databricks/init-scripts/custom-cert.sh and then it works. So, it is essentially the same content that the init script is reading (which is the generated sh script). Why can't it read it if I do not use dbfs put but just put the contents in bash file and upload it during the CI/CD process?
As we aware, An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM start. case-2 When you run bash
command by using of %sh magic command means you are trying to execute this command in Local driver node. So that workers nodes is not able to access . But based on
case-1 , By using of %fs magic command you are trying run copy command (dbutils.fs.put )from root . So that along with driver node , other workers node also can access path .
Ref : https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
It seems that my observations I made in the comments section of my question is the way to go.
I now create the init script using a databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook), the cluster is a job cluster which is just temporary . Of course , in my case , even this job creation is automated using a powershell script.
I then call this Databricks job in the release pipeline using again a Powershell script.
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I do not know (or still understand), why can't the same script file be just part of a repo and uploaded during the release process (instead of it being this Databricks job calling a notebook). I would love to know the reason . The other answer on this question does not hold true as you can see, that the cluster script is created by a job cluster and then accessed from another cluster as part of its init script.
It simply boils down to how the init script gets created.
But I get my job done. Just if it helps someone get their job done too.
I have raised a support case though to understand the reason.

The environment variable AWS_ACCESS_KEY_ID must be set

I am using Linux 18.04 and I want to lunch a spark cluster on EC2.
I used the export command to set environment variables
export AWS_ACCESS_KEY_ID=MyAccesskey
export AWS_SECRET_ACCESS_KEY=Mysecretkey
but when I run the command to lunch the spark cluster I get
ERROR: The environment variable AWS_ACCESS_KEY_ID must be set
I put all the commands I used in case I made a mistake:
sudo mv ~/Downloads/keypair.pem /usr/local/spark/keypair.pem
sudo mv ~/Downloads/credentials.csv /usr/local/spark/credentials.csv
# Make sure the .pem file is readable by the current user.
chmod 400 "keypair.pem"
# Go into the spark directory and set the environment variables with the credentials information
cd spark
export AWS_ACCESS_KEY_ID=ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=SECRET_KEY
# To install Spark 2.0 on the cluster:
sudo spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
I am new to these things and any help is really appreciated
You can also retrieve the value AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by using the get subcommand for aws configure:
AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id)
AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key)
In command line:
sudo AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id) AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key) spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
source: AWS Command Line Interface User Guide
Environment variables can be simply passed after sudo in form ENV=VALUE and they'll be accepted by followed command. It's not known to me if there are restrictions to this usage, so my example problem can be solved with:
sudo AWS_ACCESS_KEY_ID=ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=SECRET_KEY spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch

Proper way for keep process running on a container

I'm not aware if this could be considered as a duplicate since it's a problem for an specific case.
Currently, I have created a docker outside docker image for handling my Jenkins agent which will perform auto restarts without using supervisor as a solution ( lack of python 3.7 support ), and by that, since I'm using openjdk:slim as base image and I don't want to install any additional dependencies I opted to compensate the lack of tools like lsof and ps, or others for checking if the process is running or not, by writing the started process pid on a file which will be used for validating if the process exists or not under the path /proc/pid/status. Currently this works and the main reason of creating this solution for handling the auto start of the agents.
But my question is, Is this the best or more appropriated approach?
Please find the following code with the implementation:
#!/bin/bash
set -e
agent_runner() {
while :
do
if [ ! -f "/proc/$(cat /tmp/agent.pid)/status" ]
then
curl $JNLP_AGENT_DOWNLOAD_URL -o agent.jar
java \
-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 \
-Dhttps.protocols=TLSv1.2 \
-jar agent.jar \
-jnlpUrl $JNLP_AGENT_URL \
-secret $JENKINS_SECRET \
-workDir "$JENKINS_WORKDIR" &
echo $! > /tmp/agent.pid
else
:
fi
sleep 10
done
}
while :
do
if [ cat < /dev/tcp/"$TARGET" ]; then
echo "Starting Agent"
agent_runner
else
echo "Jenkins master is offline, waiting...."
fi
sleep 10
done
Link for the repository: https://github.com/thcp/jenkins-agent-dod
If the main process in the container dies, you should let the container die with it.
Docker and the various layers above it have functionality to restart whole containers. There is a docker run --restart option for the basic Docker CLI, and equivalent Docker Compose option, and restarting dying containers after some backoff is the default behavior for Kubernetes pods.
So, if you just let a container die on its own, you’ll have out-of-the-box support for the container engine to restart itself, without adding any special support into your image; just set the CMD to the thing you actually need the container to do and go. This approach also has the benefit that if you detect your environment has become unstable (“I depend on a database and it’s unreachable”) the process can choose to abort itself and let it be restarted later when hopefully the environment has improved.

How to get status of Spark jobs on YARN using REST API?

A spark application can run many jobs. My spark is running on yarn. Version 2.2.0.
How to get job running status and other info for a given application id, possibly using REST API?
job like follows:
enter image description here
This might be late but putting it for convenience. Hope it helps. You can use below Rest API command to get the status of any jobs running on YARN.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343/state'
O/P - {"state":"RUNNING"}
Throughout the job cycle the state will vary from NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
You can use jq for a formatted output.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343'| jq .app.state
O/P - "RUNNING"
YARN has a Cluster Applications API. This shows the state along with other information. To use it:
$ curl 'RMURL/ws/v1/cluster/apps/APP_ID'
with your application id as APP_ID.
It provides:

Compile a node's catalog to JSON without a master?

I'm looking to improve my org's Puppet work lifecycle by comparing differences for actual node catalogs. We came across this project which compiles catalogs for nodes and creates a diff for them, but it seems to require an online master.
I need to be able to do what this tool does, albeit without a master - I'd just like to compile a deterministic JSON or YAML blob which describes all of the resources that would be managed by Puppet for a given node and given a set of facts.
Is there a way for me to do this without an online master?
If you have Rspec-puppet set up, there is an easy way to do this. Just add a File.write statement inside one of your it blocks:
require 'spec_helper'
describe 'myclass' do
it {
File.write(
'myclass.json',
PSON.pretty_generate(catalogue)
)
#is_expected.to compile.with_all_deps
}
end
I have more information in a blog post on this here.
If you can't use Rspec-puppet to do it (recommended), have a look at another blog post I wrote, Compiling a puppet catalog – on a laptop.
I have been looking for it for a long time.
In the end, Vagrant in combination with Puppet brought me the solution.
The only thing you need is to install puppet.
First you have to set the Facts. After that you can compile your catalog.
FACTER_server_role='webstack' \
... \
FACTER_hostname='hostname' \
FACTER_fqdn='hostnamne.fqdn' \
puppet catalog compile hostnamne.fqdn \
--modulepath "./modules" \
--hiera_config "./hiera.yaml" \
--environmentpath ./environments/ \
--environment production
If you want to get a clean json file, pipe the output in sed and send the output to a file.
| sed -n '1!p' > $file

Resources