Attach an AWS EMR cluster to a remote Jupyter notebook using sparkmagic - python-3.x

I am trying to connect and attach an AWS EMR cluster (emr-5.29.0) to a Jupyter notebook that I am working on from my local Windows machine. I have started a cluster with Hive 2.3.6, Pig 0.17.0, Hue 4.4.0, Livy 0.6.0, and Spark 2.4.4, and the subnets are public. I found that this can be done with Azure HDInsight, so I was hoping something similar could be done with EMR. The issue I am having is passing the correct values in the config.json file. How should I attach an EMR cluster?
I could work with the EMR notebooks native to AWS, but I thought I would go the develop-locally route and have hit a roadblock.
{
  "kernel_python_credentials" : {
    "username": "{IAM ACCESS KEY ID}", # not sure about the username for the cluster
    "password": "{IAM SECRET ACCESS KEY}", # I use putty to ssh into the cluster with the pem key, so again not sure about the password for the cluster
    "url": "ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com", # as per the AWS blog, when Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "{IAM ACCESS KEY ID}",
    "password": "{IAM SECRET ACCESS KEY}",
    "url": "{Master public DNS}",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "{}",
    "password": "{}",
    "url": "{}"
  },
Update 1/4/2021
On 1/4, I got sparkmagic to work in my local Jupyter notebook. I used these documents as references (ref-1, ref-2 & ref-3) to set up local port forwarding (if possible, avoid using sudo).
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Configuration details
Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Hive 2.3.7, Livy 0.7.0, JupyterHub 1.1.0, Spark 2.4.7, Zeppelin 0.8.2
Updated config file
{
  "kernel_python_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },
  "kernel_scala_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },
  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },
  "authenticators": {
    "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
    "None": "sparkmagic.auth.customauth.Authenticator",
    "Basic_Access": "sparkmagic.auth.basic.Basic"
  },
  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 60,
  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  "ignore_ssl_errors": false,
  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },
  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  "heartbeat_refresh_seconds": 5,
  "livy_server_heartbeat_timeout_seconds": 60,
  "heartbeat_retry_seconds": 1,
  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {},
  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}
Second update 1/9
Back to square one. I keep getting this error and have spent days trying to debug it. I'm not sure what I did previously to get things going. I also checked my security group config and it looks fine (SSH on port 22).
An error was encountered:
Error sending http request and maximum retry encountered.

I created a local port forward (SSH tunnel) to the Livy server on port 8998 and it works like magic.
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
I did not change my config.json file from the 1/4 update.
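With the tunnel up, Livy on the EMR master answers at http://localhost:8998, so a quick way to confirm the URL in config.json is reachable is to hit the Livy REST API directly. Below is a minimal sketch using the documented /sessions endpoint (the requests library is assumed to be installed; the session it creates is just a throwaway check):

import json
import requests

# The SSH tunnel forwards localhost:8998 to Livy on the EMR master node.
LIVY_URL = "http://localhost:8998"
HEADERS = {"Content-Type": "application/json"}

# A 200 response here means the tunnel and Livy itself are working.
resp = requests.get(f"{LIVY_URL}/sessions", headers=HEADERS)
print(resp.status_code, resp.json())

# Optionally start a PySpark session through the REST API -- the same thing
# sparkmagic does under the hood when a notebook cell runs.
resp = requests.post(f"{LIVY_URL}/sessions", data=json.dumps({"kind": "pyspark"}), headers=HEADERS)
print(resp.json())  # shows the new session id and its state ("starting", then "idle")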

Related

How to provide cluster name in Azure Databricks Notebook Run Now JSON

I am able to use the JSON below through Postman to run my Databricks notebook.
I want to be able to give a name to the cluster that is created through the "new_cluster" options.
Is there any such option available?
{
  "tasks": [
    {
      "task_key": "Job_Run_Api",
      "description": "To see how the run and trigger api works",
      "new_cluster": {
        "spark_version": "9.0.x-scala2.12",
        "node_type_id": "Standard_E8as_v4",
        "num_workers": "1",
        "custom_tags": {
          "Workload": "Job Run Api"
        }
      },
      "libraries": [
        {
          "maven": {
            "coordinates": "net.sourceforge.jtds:jtds:1.3.1"
          }
        }
      ],
      "notebook_task": {
        "notebook_path": "/Shared/POC/Job_Run_Api_POC",
        "base_parameters": {
          "name": "Junaid Khan"
        }
      },
      "timeout_seconds": 2100,
      "max_retries": 0
    }
  ],
  "job_clusters": null,
  "run_name": "RUN_API_TEST",
  "timeout_seconds": 2100
}
When the above API call is done, the cluster created has a name like "job-5975-run-2", which is not super explanatory.
I have tried to use a "cluster_name" field inside the "new_cluster" object, but I got an error saying that I can't do that, like this:
{
  "error_code": "INVALID_PARAMETER_VALUE",
  "message": "Cluster name should not be provided for jobs."
}
Appreciate any help here
Cluster names for jobs are generated automatically and can't be changed. If you want to somehow track specific jobs, use tags.
P.S. If you want more "advanced" tracking capability, look into the Overwatch project.
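Building on the tags suggestion: the "custom_tags" block already present under "new_cluster" is attached to the job cluster, and those tags can be used to find it later even though the cluster name stays auto-generated. A rough sketch against the Clusters API (the workspace URL and token are placeholders, and it assumes the clusters/list response exposes the tags in its "custom_tags" field):

import requests

# Placeholder workspace URL and personal access token -- substitute your own.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# List clusters and pick out the ones carrying the tag set in the job payload.
resp = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS)
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    tags = cluster.get("custom_tags", {})
    if tags.get("Workload") == "Job Run Api":
        # cluster_name is still the auto-generated "job-...-run-..." value,
        # but the tag tells us which workload the cluster belongs to.
        print(cluster["cluster_id"], cluster["cluster_name"], tags)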

Is it possible for someone to log into MongoDB without the correct password if authentication is enabled?

I recently set up my first MongoDB database in a production environment. I looked up some guides for deployment and followed them.
I had the following in my config:
# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
#  engine:
#  mmapv1:
#  wiredTiger:

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

# network interfaces
net:
  port: 27017
  bindIp: 127.0.0.1

# how the process runs
processManagement:
  timeZoneInfo: /usr/share/zoneinfo

security:
  authorization: "enabled"
And I created an admin user (the only user) that looks like this:
{
  "_id" : "admin.admin",
  "userId" : UUID("6dfe010f-1e62-4801-9c07-5a408b8c75c6"),
  "user" : "admin",
  "db" : "admin",
  "credentials" : {
    [omitted, but contains SCRAM-SHA-1 and SCRAM-SHA-256 hashes]
  },
  "roles" : [
    {
      "role" : "userAdminAnyDatabase",
      "db" : "admin"
    },
    {
      "role" : "readWriteAnyDatabase",
      "db" : "admin"
    },
    {
      "role" : "dbAdminAnyDatabase",
      "db" : "admin"
    }
  ]
}
I also switched the outward-facing port for the database (through nginx) to a nearby port that wasn't the default.
With all of that, I still got hacked, and I was greeted with this when I got onto NoSQLBooster.
Fortunately for me, I wasn't storing any sensitive information (just an aggregation of data pulled from a variety of other services) and all of the data can easily be regenerated. However, I'd rather not have this type of thing happen again.
I did some digging in the logs and found the moment they connected:
{
  "t": {
    "$date": "2021-02-04T22:10:38.614+00:00"
  },
  "s": "I",
  "c": "ACCESS",
  "id": 20250,
  "ctx": "conn75191",
  "msg": "Successful authentication",
  "attr": {
    "mechanism": "SCRAM-SHA-256",
    "principalName": "admin",
    "authenticationDatabase": "admin",
    "client": "127.0.0.1:39722"
  }
}
{
  "t": {
    "$date": "2021-02-04T22:11:21.521+00:00"
  },
  "s": "I",
  "c": "NETWORK",
  "id": 51800,
  "ctx": "conn75192",
  "msg": "client metadata",
  "attr": {
    "remote": "127.0.0.1:39918",
    "client": "conn75192",
    "doc": {
      "driver": {
        "name": "PyMongo",
        "version": "3.11.2"
      },
      "os": {
        "type": "Linux",
        "name": "Linux",
        "architecture": "x86_64",
        "version": "5.8.0-41-generic"
      },
      "platform": "CPython 3.8.6.final.0"
    }
  }
}
Shortly after that login, I can see them drop my database and insert the note. Funny enough, it looks like they never saved the data, so the whole thing is obviously a scam. I also checked the auth.log for the server to ensure that nobody logged into the server itself, so I'm pretty sure they haven't tampered with the filesystem unless they did some magic through nginx.
I did some testing with authentication off and found that you can have an "authenticated" connection with the incorrect password if authentication is off. At this point, my question is: Are there any ways to get in without knowing the password if my config is set as specified above and my mongo server has been restarted since the last configuration change? Or is the only possibility that they have my password? I'm completely stumped. Around a week before the attack, I tried logging in with incorrect passwords to ensure that authorization was enabled correctly. I got denied as expected.
In case it's relevant, here's the rule for my MongoDB port in Nginx:
stream {
  server {
    listen 27018;
    proxy_connect_timeout 1s;
    proxy_timeout 3s;
    proxy_pass stream_mongo_backend;
  }

  upstream stream_mongo_backend {
    server 0.0.0.0:27017;
  }
}
Yes, when authentication is enabled you can connect to the Mongo database without any credentials. However, apart from harmless commands like db.help(), db.getMongo(), db.listCommands(), db.version(), etc., you can't execute anything.
Obviously the hacker connected from localhost with valid credentials, so it looks like he got access to your machine. Maybe he read your application's Python script, which contains the password.
NB: you write that only the admin user was created. You should use the admin account only for administrative tasks and keep its password private. The application should not run under such an admin account; it should use a dedicated account that has only the permissions required to run the application.
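To make that last point concrete, here is a minimal sketch of creating such a dedicated application account with PyMongo (the database name "mydb", the user "appuser", and the passwords are placeholders; the same can be done from the mongo shell with db.createUser):

from pymongo import MongoClient

# Authenticate as the admin user once, only to create the restricted account.
client = MongoClient("mongodb://admin:<admin-password>@localhost:27017/?authSource=admin")

# Placeholder application database and user -- the app should connect as this
# user, which can read and write one database and nothing else.
app_db = client["mydb"]
app_db.command(
    "createUser",
    "appuser",
    pwd="<strong-app-password>",
    roles=[{"role": "readWrite", "db": "mydb"}],
)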

AWS ECS create service waits for user input while showing output on console

I am using a CircleCI job to create an ECS service. Below is the AWS CLI command that I'm using to create the ECS service.
aws ecs create-service --cluster "test-cluster" --service-name testServiceName \
--task-definition testdef:1 \
--desired-count 1 --launch-type EC2
When this command executes, the following error occurs and the CircleCI job fails.
(press RETURN)
{
    "service": {
        "serviceArn": "arn:aws:ecs:*********:<account-id>:service/testServiceName",
        "serviceName": "testServiceName",
        "clusterArn": "arn:aws:ecs:*********:<account-id>:cluster/test-cluster",
        "loadBalancers": [],
        "serviceRegistries": [],
        "status": "ACTIVE",
        "desiredCount": 1,
        "runningCount": 0,
        "pendingCount": 0,
        "launchType": "EC2",
        "taskDefinition": "arn:aws:ecs:*********:<account-id>:task-definition/testdef*********:1",
        "deploymentConfiguration": {
            "maximumPercent": 200,
            "minimumHealthyPercent": 100
        },
        "deployments": [
            {
                "id": "ecs-svc/1585305191116328179",
                "status": "PRIMARY",
:
Too long with no output (exceeded 10m0s): context deadline exceeded
Running the command locally in a minimized terminal window gives the following output:
{
    "service": {
        "serviceArn": "arn:aws:ecs:<region>:<account-id>:service/testServiceName",
        "serviceName": "testServiceName",
        "clusterArn": "arn:aws:ecs:<region>:<account-id>:cluster/test-cluster",
        "loadBalancers": [],
        "serviceRegistries": [],
        "status": "ACTIVE",
        "desiredCount": 1,
        "runningCount": 0,
        "pendingCount": 0,
        "launchType": "EC2",
        "taskDefinition": "arn:aws:ecs:<region>:<account-id>:task-definition/testdef:1",
        "deploymentConfiguration": {
            "maximumPercent": 200,
            "minimumHealthyPercent": 100
        },
        "deployments": [
            {
                "id": "ecs-svc/8313453507891259676",
                "status": "PRIMARY",
                "taskDefinition": "arn:aws:ecs:<region>:<account-id>:task-definition/testdef:1",
                "desiredCount": 1,
:
Further execution stops until I hit some key. This is the reason the CircleCI job fails after the 10-minute threshold. When I run the command locally in a full-screen terminal, it does not wait and shows the output.
Is there any way to run the command so that it does not wait for a key press and execution completes, so that the pipeline does not fail? Please note that the ECS service is created successfully.
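The pause is most likely the AWS CLI's client-side pager: AWS CLI v2 pipes long output through a pager, which waits for a key press when the terminal is small. In v2 it can be disabled with the global --no-cli-pager flag or by exporting AWS_PAGER set to an empty string. Another way to avoid any pager in CI is to create the service from Python with boto3; below is a minimal sketch with the same parameters as the CLI command above (the region is an assumption):

import boto3

# The region is an assumption -- use whichever region hosts the cluster.
ecs = boto3.client("ecs", region_name="us-east-1")

# boto3 returns the response as a plain dict and never invokes a pager,
# so nothing blocks waiting for input in a CI job.
response = ecs.create_service(
    cluster="test-cluster",
    serviceName="testServiceName",
    taskDefinition="testdef:1",
    desiredCount=1,
    launchType="EC2",
)
print(response["service"]["serviceArn"])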

In iOS, how to sync CBL with the server through the terminal without using the CouchDB dmg?

I am trying to run the Sync Gateway code in the terminal, but I don't understand how it works, and I get an unhelpful response. See the code below:
{
  "log": ["HTTP+"],
  "databases": {
    "grocery-sync": {
      "server": "http://localhost:8091",
      "bucket": "grocery-sync",
      "users": {
        "GUEST": { "disabled": false, "admin_channels": ["*"] }
      }
    }
  }
}
But I am getting the response below, and I can't understand what exactly I need to do for auto replication.

creating my first image for rocket (serviio with java dependency)

I have CoreOS stable (1068.10.0) installed and I want to create a Serviio streaming media server image for rocket.
This is my manifest file:
{
  "acVersion": "1.0.0",
  "acKind": "ImageManifest",
  "name": "tux-in.com/serviio",
  "app": {
    "exec": [
      "/opt/serviio/bin/serviio.sh"
    ],
    "user": "serviio",
    "group": "serviio"
  },
  "labels": [
    {
      "name": "version",
      "value": "1.0.0"
    },
    {
      "name": "arch",
      "value": "amd64"
    },
    {
      "name": "os",
      "value": "linux"
    }
  ],
  "ports": [
    {
      "name": "serviio",
      "protocol": "tcp",
      "port": 8895
    }
  ],
  "mountPoints": [
    {
      "name": "serviio-config",
      "path": "/config/serviio",
      "kind": "host",
      "readOnly": false
    }
  ],
  "environment": {
    "JAVA_HOME": "/opt/jre1.8.0_102"
  }
}
I couldn't find on Google how to add a Java package dependency, so I just downloaded the JRE, extracted it to /rootfs/opt, and set a JAVA_HOME environment variable. Is that the right way to go?
Welp... because I configured Serviio to run under a user and group called serviio, I created /etc/group with serviio:x:500:serviio and /etc/passwd with serviio:x:500:500:Serviio:/opt/serviio:/bin/bash. Is this OK? Should I have added and configured the users differently?
Then I created a rocket image with actool build serviio serviio-1.0-linux-amd64.aci, signed it, and ran it with rkt run serviio-1.0-linux-amd64.aci. Then, with rkt list, I see that the container started and exited immediately.
UUID APP IMAGE NAME STATE CREATED STARTED NETWORKS
bea402d9 serviio tux-in.com/serviio:1.0.0 exited 11 minutes ago 11 minutes ago
rkt status bea402d9 returns:
state=exited
created=2016-09-03 12:38:03.792 +0000 UTC
started=2016-09-03 12:38:03.909 +0000 UTC
pid=15904
exited=true
app-serviio=203
I have no idea how to debug this issue further. How can I see the output of the sh command that was executed? Is there any other error-related information?
Have I configured things properly? I'm pretty lost, so any information regarding the issue would be greatly appreciated.
Thanks!
