Remote access to leaf node when using memsql-spark-connector - apache-spark

I'm trying to test memsql-spark-connector, and for this I created a single-node MemSQL cluster on AWS (https://docs.memsql.com/docs/quick-start-with-amazon-webservices).
On my laptop I want to run a Spark application in local mode. This application should simply create a DataFrame for a table and collect all the rows. Here is the code:
val conf = new SparkConf()
  .setAppName("Test App")
  .setMaster("local[*]")
  .set("memsql.host", "x.x.x.x")
  .set("memsql.port", "3306")
  .set("memsql.user", "root")
  .set("memsql.password", "1234")
  .set("memsql.defaultDatabase", "dataframes_test")
val sc = new SparkContext(conf)
val memsql = new MemSQLContext(sc)
val df = memsql.table("person")
df.collect().foreach(println)
where x.x.x.x is the address of my AWS instance.
The problem is that although I can connect to the MemSQL server from my laptop, memsql-spark-connector tries to access the leaf node directly (i.e. it connects to port 3307 instead of 3306). When this happens I get the following error:
java.sql.SQLException: Access denied for user 'root'@'108.208.196.149' (using password: YES)
But the root user actually does have all permissions:
memsql> show grants for 'root'@'%';
+--------------------------------------------------------------------------------------------------------------------------------+
| Grants for root@% |
+--------------------------------------------------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY PASSWORD '*A49656EC00D74D3524072F3452C1FBA7A1F3B561' WITH GRANT OPTION |
+--------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Is it possible to grant permissions to leaf nodes so that this connection to x.x.x.x:3307 is successful as well?
I realize that it's probably not the way it's designed to be used, but I want to do it this way only for testing. It's convenient to debug when everything is in a single JVM, and I don't want to bother with a Spark installation for now. I could install MemSQL locally to solve my problem, but I can't do this on a Mac (is this right, BTW?).
Any help appreciated!
UPDATE: I just tried to connect locally on the server and it still doesn't work:
ubuntu@ip-x-x-x-x:~$ memsql -P 3307 -u root -p
Enter password:
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
The password I'm providing is correct; on AWS it's the instance ID, so it's very hard to make a mistake.
This means that it wouldn't work even if I had a Spark executor on the same instance as the leaf node. It feels like something is wrong with my setup, but I didn't actually change any settings; everything is at the defaults.
Are the master node and the leaf node supposed to use the same credentials? Is there a way to set them up separately for the leaf?

That error means that the login was denied, i.e. an incorrect username/password (not that the user doesn't have enough permissions). Make sure the password you are using in the Spark connector matches the password on all the nodes.
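To narrow this down, you can test the leaf login directly from your laptop, bypassing the connector entirely. This is only a sketch (it assumes the pymysql package is installed; the host and credentials are the placeholders from the question):

import pymysql

# Connect with the same credentials the connector uses, but against the leaf
# port (3307). No database is selected, since this only tests authentication.
conn = pymysql.connect(host="x.x.x.x", port=3307, user="root", password="1234")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    conn.close()

If this fails with the same "Access denied" error, the leaf's password simply doesn't match what you're passing to the connector.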

Related

How to connect to Cassandra Database using Python code

I followed the steps given in https://docs.datastax.com/en/developer/python-driver/3.25/getting_started/ to connect to the Cassandra database using Python code, but after running the code snippet I am still getting
NoHostAvailable: ('Unable to connect to any servers', {'host:port': OperationTimedOut('errors=None, last_host=None'),
Python version 2.7 and 3 (classpath is set for both the python versions)
Java 1.8 (class path has been set)
Apache cassandra 3.11.6 (apache home classpath has been set)
I tend to use a very simple app to test connectivity to a Cassandra cluster:
from cassandra.cluster import Cluster
cluster = Cluster(['10.1.2.3'], port=45678)
session = cluster.connect()
row = session.execute("SELECT release_version FROM system.local").one()
if row:
    print(row[0])
Then run it:
$ python HelloCassandra.py
4.0.6
In your comment you mentioned that you're getting OperationTimedOut which indicates that the driver never got a response back from the node within the client timeout period. This usually means (a) you're connecting to the wrong IP, (b) you're connecting to the wrong CQL port, or (c) there's a network connectivity issue between your app and the cluster.
Make sure that you're using the IP address that you've set as rpc_address in cassandra.yaml. Also make sure that the node is listening for CQL clients on the right port. You can easily verify this by checking the output of a Linux utility like netstat or lsof, for example:
$ sudo lsof -nPi -sTCP:LISTEN
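From the client side, a quick TCP-level check can also rule out basic connectivity problems before involving the driver at all. A minimal sketch (the IP and port are the same placeholders as in the example above):

import socket

# Try a plain TCP connection to the CQL port; a timeout or refusal here means
# the driver will never get a response either.
try:
    sock = socket.create_connection(("10.1.2.3", 45678), timeout=5)
    print("TCP connection to the CQL port succeeded")
    sock.close()
except (socket.error, socket.timeout) as exc:
    print("Cannot reach the node:", exc)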
Cheers!
So that error message suggests that the host/port combination either does not have Cassandra running on it or is under heavy load and unable to respond.
Can you edit your question to include the Cassandra connection portion of your code, as well as maybe how you're calling it? I have a test script which I use (and you're welcome to check it out), and here is the connection portion:
import sys

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

protocol = 4
hostname = sys.argv[1]
username = sys.argv[2]
password = sys.argv[3]

nodes = [hostname]
auth_provider = PlainTextAuthProvider(username=username, password=password)
cluster = Cluster(nodes, auth_provider=auth_provider, protocol_version=protocol)
session = cluster.connect()
I call it like this:
$ python3 testCassandra.py 127.0.0.1 aaron notReallyMyPassword
local
One thing you might try too, would be to run a nodetool status on the cluster just to make sure it's running ok.
Edit
local variable 'session' referenced before assignment
So this sounds to me like you're attempting a session.execute before session = cluster.connect(). Have a look at my Git repo (linked above) to see the correct order for instantiating session.
I am not using default port
In that case, make sure the port is being set in the cluster definition. Ex:
port = 19099
cluster = Cluster(nodes, auth_provider=auth_provider, port=port)

Influxdb says not authorized to execute statement

I'm facing some issues when I try to run a simple SELECT query on influxdb via the Python library.
I'm trying to run the following query:
influx_client.query('SELECT * FROM "measurements" LIMIT 10;')
Of course I switched to the corresponding database (and connected to the server) before executing the query. I also tried these variants of the query:
influx_client.query("SELECT * FROM \"measurements\" LIMIT 10;")
influx_client.query("SELECT * FROM 'measurements' LIMIT 10;")
influx_client.query('SELECT * FROM \'measurements\' LIMIT 10;')
influx_client.query('SELECT * FROM {0} LIMIT 10;'.format("measurements"))
influx_client.query("SELECT * FROM {0} LIMIT 10;".format("measurements"))
However, they all lead to the same issue.
The result (or rather, the error) that I get is the following:
influxdb.exceptions.InfluxDBClientError: 403: {"error":"error authorizing query: myuser not authorized to execute statement 'SELECT * FROM \"measurements\" LIMIT 10', requires READ on True"}
I know that my user has the required permissions, because when connecting to the DB with the CLI I can execute the query. On top of that, I checked the permissions with SHOW GRANTS and I could see that all requirements are satisfied (the user actually does have all privileges).
I have already seen some similar issues (for instance this issue); however, they don't fit my case since I'm quoting the query.
Information about the environment:
InfluxDB version: 1.8.0
InfluxDB-python version: 5.3.1
Python version: 3.6.8
Operating system version: CentOS 7
Any ideas?
There are two things you need to check for this authentication issue:
The HTTPS configuration with the given private key and password certificate (see Link).
Passing the user credentials when the InfluxDB connection is created (check the case sensitivity as well); see the sketch below.
I have used Influx, and these are the key configuration points that lead to authentication issues.
Using the CLI, you need to give the user permission on the given database:
USE <your-database>
GRANT ALL PRIVILEGES TO <username>
Grant Permission To User
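For the second point, here is a minimal sketch of passing the credentials (and the database name as a string) when creating the client with influxdb-python; the host, port, user, password, and database below are placeholders:

from influxdb import InfluxDBClient

# Credentials and the target database are passed explicitly; note that the
# "database" argument must be the database name as a string.
influx_client = InfluxDBClient(host="localhost", port=8086,
                               username="myuser", password="mypassword",
                               database="mydatabase")

# Equivalent to "USE mydatabase" in the CLI:
influx_client.switch_database("mydatabase")

result = influx_client.query('SELECT * FROM "measurements" LIMIT 10;')
print(result)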

Kerberos: Spark UGI credentials are not getting passed down to Hive

I'm using Spark 2.4 and I have a Kerberos-enabled cluster where I'm trying to run a query via the spark-sql shell.
The simplified setup basically looks like this: spark-sql shell running on one host in a YARN cluster -> external hive-metastore running on one host -> S3 to store table data.
When I launch the spark-sql shell with DEBUG logging enabled, this is what I see in the logs:
> bin/spark-sql --proxy-user proxy_user
...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user against hive/_HOST#REALM.COM at thrift://hive-metastore:9083
DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_host#REALM.COM (auth:KERBEROS) from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130)
This means that Spark made a call to fetch the delegation token from the Hive metastore and then added it to the list of credentials for the UGI. This is the piece of code in Spark which does that. I also verified in the metastore logs that the get_delegation_token() call was being made.
Now when I run a simple query like create table test_table (id int) location "s3://some/prefix"; I get hit with an AWS credentials error. I modified the hive metastore code and added this right before the file system in Hadoop is initialized (org/apache/hadoop/hive/metastore/Warehouse.java):
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
  ...
  try {
    // get the current user
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    LOG.info("UGI information: " + ugi);
    Collection<Token<? extends TokenIdentifier>> tokens = ugi.getCredentials().getAllTokens();
    // print all the tokens it has
    for (Token token : tokens) {
      LOG.info(token.toString());
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  ...
}
In the metastore logs, this does print the correct UGI information:
UGI information: proxy_user (auth:PROXY) via hive/hive-metastore#REALM.COM (auth:KERBEROS)
but there are no tokens present in the UGI. It looks like the Spark code adds the token with the alias hive.server2.delegation.token, but I don't see it in the UGI. This makes me suspect that somehow the UGI scope is isolated and not shared between spark-sql and the Hive metastore. How do I go about solving this?
Spark is not picking up your Kerberos identity; instead it asks each filesystem to issue a "delegation token" which lets the caller interact with that service and that service alone. This is more restricted and so more secure.
The problem here is that Spark collects delegation tokens from every filesystem which can issue them, and as your S3 connector isn't issuing any, nothing is coming down.
Now, Apache Hadoop 3.3.0's S3A connector can be set to issue your AWS credentials inside a delegation token, or, for bonus security, ask AWS for session credentials and send only those over. But (a) you need a Spark build with those dependencies, and (b) Hive needs to be using those credentials to talk to S3.
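For reference, the Hadoop 3.3.0 behaviour described above is switched on with a single S3A property; the binding class name below is from the Hadoop S3A delegation token documentation as I remember it, so treat it as an assumption and verify it against your Hadoop build:
fs.s3a.delegation.token.binding = org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding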

Spark RDD.pipe run bash script as a specific user

I noticed that RDD.pipe(Seq("/tmp/test.sh")) runs the shell script as the yarn user. That is problematic because it allows the Spark user to access files that should only be accessible to the yarn user.
What is the best way to address this?
Calling sudo -u sparkuser is not a clean solution. I would hate to even consider that.
I am not sure whether it is Spark's fault for treating pipe() differently, but I opened a similar issue on JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-26101
Now on to the problem. Apparently, in a YARN cluster, Spark's pipe() asks for a container, and whether your Hadoop is non-secure or secured by Kerberos determines whether the container runs as the yarn/nobody user or as the user who launched the container (your actual user).
Either use Kerberos to secure your Hadoop or, if you don't want to go through securing it, set two configs in YARN which use Linux users/groups to launch the container. Note that you must have the same users/groups across all the nodes in your cluster, otherwise this won't work (perhaps use LDAP/AD to sync your users/groups).
Set these:
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false
yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
Source: https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html
(this is the same even in Hadoop 3.0)
This fix worked on the latest Cloudera CDH 5.15.1 (yarn-site.xml):
http://community.cloudera.com/t5/Batch-Processing-and-Workflow/YARN-force-nobody-user-on-all-jobs-and-so-they-fail/m-p/82572/highlight/true#M3882
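For reference, the same two settings expressed in yarn-site.xml form would look something like this:

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>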
Example:
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at repartition at <console>:25
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[5] at pipe at <console>:25
c: Array[String] = Array(maziyar)
This will return the username of whoever started the Spark session, once those configs are set in yarn-site.xml and the users/groups are synced across all the nodes.

How to reset a lost Cassandra admin user's password?

I have full access to the Cassandra installation files and a PasswordAuthenticator configured in cassandra.yaml. What do I have to do to reset admin user's password that has been lost, while keeping the existing databases intact?
The hash has changed for Cassandra 2.1:
Switch to authenticator: AllowAllAuthenticator
Restart cassandra
UPDATE system_auth.credentials SET salted_hash = '$2a$10$H46haNkcbxlbamyj0OYZr.v4e5L08WTiQ1scrTs9Q3NYy.6B..x4O' WHERE username='cassandra';
Switch back to authenticator: PasswordAuthenticator
Restart cassandra
Login as cassandra/cassandra
CREATE USER and ALTER USER to your heart's content.
Solved with the following steps:
Change authenticator in cassandra.yaml to AllowAllAuthenticator and restart Cassandra
cqlsh
update system_auth.credentials set salted_hash='$2a$10$vbfmLdkQdUz3Rmw.fF7Ygu6GuphqHndpJKTvElqAciUJ4SZ3pwquu' where username='cassandra';
Exit cqlsh
Change authenticator back to PasswordAuthenticator and restart Cassandra
Now you can log in with
cqlsh -u cassandra -p cassandra
and change the password to something else.
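If you would rather set a password of your own choosing instead of reusing the stock 'cassandra' hash above, you can generate the salted_hash value yourself. A sketch using the Python bcrypt package (the assumption being that PasswordAuthenticator stores bcrypt hashes with the $2a$ prefix and a work factor of 10):

import bcrypt

# Generate a salted hash for a password of your choice, using the same scheme
# assumed above (bcrypt, "$2a$" prefix, work factor 10).
new_password = b"MyNewSecret"
salted_hash = bcrypt.hashpw(new_password, bcrypt.gensalt(rounds=10, prefix=b"2a"))
print(salted_hash.decode("ascii"))

Use the printed value in the UPDATE statement above in place of the stock hash.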
As of Cassandra 2.0, you can simply run:
ALTER USER cassandra WITH PASSWORD 'password';
If you want to add a user.
// CREATE USER uname WITH PASSWORD 'password'; // add new user
// GRANT all ON ALL KEYSPACES to uname; // grant permissions to new user
Verify your existing users with LIST USERS;
EDIT
Oh boy, this is gonna be fun! So, I found one hacktastic way, but it requires changing the source code.
First a high level overview:
Edit source so you can make changes to the system_auth.credentials column family
Change the authenticator to AllowAllAuthenticator
Start C*
Log in with cqlsh without needing a password
Update the cassandra user's hash password
Undo the source changes and change back to PasswordAuthenticator.
Step 1 - edit source
Open the C* source and go to package org.apache.cassandra.service.ClientState;
Find the validateLogin() and ensureNotAnonymous() functions and comment out all the contained code so you end up with:
public void validateLogin() throws UnauthorizedException
{
    // if (user == null)
    //     throw new UnauthorizedException("You have not logged in");
}

public void ensureNotAnonymous() throws UnauthorizedException
{
    validateLogin();
    // if (user.isAnonymous())
    //     throw new UnauthorizedException("You have to be logged in and not anonymous to perform this request");
}
Step 2 - Change to AllowAllAuthenticator in cassandra.yaml
Steps 3 & 4 - Simple!
Step 5 - Execute this insert statement from cqlsh:
insert into system_auth.credentials (username, options, salted_hash)
VALUES ('cassandra', null, '$2a$10$vbfmLdkQdUz3Rmw.fF7Ygu6GuphqHndpJKTvElqAciUJ4SZ3pwquu');
Note: step 5 will work assuming the user named 'cassandra' has already been created. If you have another user created, just switch the username you are inserting (this procedure resets a password; it doesn't add a new user).
Step 6 - Fix the source by uncommenting validateLogin() and ensureNotAnonymous(), and switch back to PasswordAuthenticator in cassandra.yaml. You should now have access to cqlsh via ./cqlsh -u cassandra -p cassandra
