Pivotal Greenplum - gpload issue with Talend - Linux

When I try to run the gpload process from the Talend ETL server, I first need to configure the tGreenplumGPload component. While configuring the component, it looks for files on the remote Greenplum server instead of local files on the Windows-based Talend ETL server.
ENV details:
Talend server: Windows Server 2012
Greenplum cluster: CentOS 7
Main cause:
The Greenplum database server (Linux) is remote to the Talend ETL server (Windows). Hence, when I run the job from the Windows server, the Greenplum DB server is remote to it. Also, I am not able to configure the tGreenplumGPload component.
Screenshot of the tGreenplumGPload settings: [image not included]
More detail:
1) The gpfdist program is running on the Greenplum master host.
[gpadmin@mdw ~]$ ps -A | grep gpfdist
20071 pts/0 00:00:00 gpfdist
[gpadmin@mdw ~]$
2) Checked the merge operation from the GPDB command line - the following process runs on the Greenplum server:
[gpadmin@mdw ~]$ gpload -f gpload.yml
2017-02-25 20:20:48|INFO|gpload session started 2017-02-25 20:20:48
2017-02-25 20:20:48|INFO|started gpfdist -p 8081 -P 8082 -f "/home/gpadmin/demo/gp_RevenueReport_stg0.txt" -t 30
2017-02-25 20:20:48|INFO|running time: 0.20 seconds
2017-02-25 20:20:48|INFO|rows Inserted = 0
2017-02-25 20:20:48|INFO|rows Updated = 3
2017-02-25 20:20:48|INFO|data formatting errors = 0
2017-02-25 20:20:48|INFO|gpload succeeded
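For reference, here is a minimal sketch of what a merge-mode control file like the gpload.yml above could look like. Only the file path and gpfdist port come from the log output; the database, table, and column names are assumptions for illustration:
VERSION: 1.0.0.1
DATABASE: demo                 # assumed database name
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        PORT: 8081
        FILE:
          - /home/gpadmin/demo/gp_RevenueReport_stg0.txt
    - FORMAT: text
    - DELIMITER: '|'           # assumed delimiter
  OUTPUT:
    - TABLE: public.revenue_report   # assumed target table
    - MODE: merge
    - MATCH_COLUMNS:
        - report_id            # assumed key column
    - UPDATE_COLUMNS:
        - revenue              # assumed column to update
MODE: merge is what produces the "rows Inserted / rows Updated" counts shown in the log above.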
Q1:
How do I set up a shared folder on Linux that Windows can access, so that we can use it in the tGreenplumGPload settings? Or is there an alternate way to do this?
Any help would be much appreciated!

gpfdist will run on the ETL server, not on the master host.
You will have to add the ETL server's IP address and hostname to the /etc/hosts file on all of the nodes in the Greenplum cluster. You will then need to make sure the ETL server can communicate directly with the segment hosts on Greenplum's private network. This requires either connecting the 10Gb private switches used by Greenplum to your 10Gb LAN and creating a VLAN so you can reach the nodes, or running 10Gb cables from your ETL server to open ports on the 10Gb switches and assigning it an IP address that doesn't conflict with the existing hosts.
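As a concrete sketch, the /etc/hosts entry might be appended like this on the master and every segment host (the IP address and hostname below are placeholders, not values from the question):
echo "192.168.1.50  talend-etl" | sudo tee -a /etc/hosts   # placeholder IP/hostname for the ETL server
With that reachability in place, and assuming the Greenplum load tools are installed on the Windows server, gpload run on the ETL server starts gpfdist there and the segment hosts pull the local Windows files directly from it.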

Related

Open a port on mac for locally running Spark

I am running a stand-alone Spark 3.2.1 locally on my mac, installed via brew. This is for low-cost (free) unit-testing purposes. I start this instance via the pyspark command from the terminal and am able to access the instance's web UI.
I am also trying to run spark-submit locally (from the same mac) to run a pyspark script on the pyspark instance described above. When specifying --master with :7077, I get a "connection refused" error. It does not look like port 7077 is open on my mac.
How do I open the port 7077 on my mac such that I can access it from my mac via spark-submit, but other machines on the same network cannot?
Could someone share clear steps with explanations?
Much appreciated :)
Michael
Check that your Spark master process is running.
The output should look like the following:
jps
$PID Master
$PID Worker
If the Spark processes are not running,
run the script $SPARK_HOME/sbin/start-master.sh in your shell first,
then $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 (the worker script takes the master URL).
Then check whether a process is listening on port 7077 with the following command:
sudo lsof -nP -i:7077 | grep LISTEN
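To address the second half of the question (reachable from the same mac but not from other machines on the network), one option is to bind the master to the loopback interface. A minimal sketch, assuming the standard standalone scripts; your_job.py is a placeholder name:
# bind the standalone master to loopback so only local clients can connect
export SPARK_MASTER_HOST=127.0.0.1
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://127.0.0.1:7077
# submit against the loopback master URL
spark-submit --master spark://127.0.0.1:7077 your_job.py   # your_job.py is a placeholder
Since the master only listens on 127.0.0.1, other hosts on the LAN cannot reach port 7077 at all.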

Connecting to Cassandra Instance Remotely using Linux Shell Script

I want to connect to Cassandra installed on a remote server from my dev environment. The dev environment doesn't have Cassandra installed, and hence it is not letting me do the below to connect to my Cassandra server running on a different machine.
Client System - Dev System without Cassandra
Destination System - Prod Environment where Cassandra is installed
I am trying the below command from my dev terminal to connect to the prod Cassandra.
/opt/cassandra/dse-4.8.7/bin/cqlsh -e "select * from \"IasService\".\"Table\" limit 10" remote.stress.py1.s.com 9160 -u test -p test2;
Any leads would be helpful.
tl;dr:
Remove the 9160 from your command.
It would be easier to help you if you provided the error message or result of your command.
That being said, DSE 4.8.7 has Cassandra 2.1.14 at its core. As of Cassandra 2.1, cqlsh connects using the native binary protocol on port 9042. So forcing it to 9160 (as you are) will definitely not work.
$ cqlsh -e "SELECT release_version FROM system.local" 192.168.6.5 9042
-u cassdba -p superSecret
release_version
-----------------
2.1.13
(1 rows)
And since 9042 is the default port used by cqlsh now, you don't need to specify it at all.
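Applying that to the original command (keeping the asker's host, keyspace, and credentials; only dropping the port and fixing the quote escaping), the invocation would look something like:
/opt/cassandra/dse-4.8.7/bin/cqlsh -e "SELECT * FROM \"IasService\".\"Table\" LIMIT 10" remote.stress.py1.s.com -u test -p test2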

Spark UI on AWS EMR

I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is current and processing data, but I am trying to find which port has been assigned to the web UI. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:
ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS
1) How do I find out what the Spark WebUI's assigned port is?
2) How do I verify the Spark WebUI is running?
Spark on EMR is configured for YARN, so the Spark UI is reached via the application URL provided by the YARN Resource Manager (http://spark.apache.org/docs/latest/monitoring.html). The easiest way to get to it is to set up your browser with SOCKS using a port opened by SSH, then from the EMR console open the Resource Manager and click the Application Master URL shown to the right of the running application. The Spark History Server is available at the default port 18080.
Example of socks with EMR at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
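For illustration, a hedged sketch of the SSH dynamic port forwarding those docs describe (the key path and master DNS are placeholders; 8157 is just a conventional local port):
ssh -i ~/KEY.pem -N -D 8157 hadoop@EMR_MASTER_PUBLIC_DNS   # opens a local SOCKS proxy on port 8157
Then point the browser's SOCKS proxy settings (e.g. via FoxyProxy) at localhost:8157.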
Here is an alternative if you don't want to deal with the browser setup with SOCKS as suggested in the EMR docs.
Open an SSH tunnel to the master node with port forwarding to the machine running the Spark UI:
ssh -i path/to/aws.pem -L 4040:SPARK_UI_NODE_URL:4040 hadoop@MASTER_URL
MASTER_URL (EMR_DNS in the question) is the URL of the master node, which you can get from the EMR Management Console page for the cluster.
SPARK_UI_NODE_URL can be seen near the top of the stderr log. The log line will look something like:
16/04/28 21:24:46 INFO SparkUI: Started SparkUI at http://10.2.5.197:4040
Point your browser to localhost:4040
Tried this on EMR 4.6 running Spark 1.6.1.
Glad to announce that this feature is finally available on AWS. You won't need to run any special commands (or configure an SSH tunnel).
By clicking the link to the Spark History Server UI, you'll be able to see old application logs or access the running Spark job's UI.
For more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html
I hope it helps!
Just run the following command:
ssh -i /your-path/aws.pem -N -L 20888:ip-172-31-42-70.your-region.compute.internal:20888 hadoop@ec2-xxx.compute.amazonaws.com.cn
There are 3 places you need to change:
your .pem file
your internal master node IP
your public DNS domain.
Finally, on the YARN UI you can click your Spark application's Tracking URL, then just replace the URL:
"http://your-internal-ip:20888/proxy/application_1558059200084_0002/"
->
"http://localhost:20888/proxy/application_1558059200084_0002/"
It worked for EMR 5.x
Simply use an SSH tunnel.
On your local machine do:
ssh -i /path/to/pem -L 3000:ec2-xxxxcompute-1.amazonaws.com:8088 hadoop@ec2-xxxxcompute-1.amazonaws.com
In your local machine's browser, hit:
localhost:3000

DataStax OpsCenter not starting on CentOS DSE cluster

I am trying to set up a Cassandra cluster with 5 nodes. I have installed DSE on all nodes and started DSE on all the nodes with the command below.
sudo service dse start
DSE is running fine on all nodes.
Now I am trying to configure OpsCenter following http://www.datastax.com/documentation/opscenter/3.2/webhelp/index.html#opsc/install/../../opsc/install/opscInstallRHEL_t.html
When I execute "sudo service opscenterd start", it starts without any problem, and even the log doesn't show any problem.
But when I do "netstat -a | grep 8888", it doesn't show any listener.
Can anybody please help me identify the issue?
Thanks,
Jenish
I would first figure out if the service is indeed starting. When you say you checked the log, was that /var/log/messages or the OpsCenter logs? I would check both.
Next I would see if it stays running. You can check whether the process is running with:
ps -eaf | grep opscenterd
If everything is running but not listening on the right port, you should check your opscenterd.conf file for the proper port and interface:
[webserver]
port = 8888
interface = 127.0.0.1
Note that your interface definition may be different - for example, it may be 0.0.0.0 which signifies binding to all interfaces (rather than just localhost as above), but you should validate that it is correct for your environment.
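As a quick verification sketch after correcting the config (the service name comes from the question; the netstat flags are the standard ones for listing listening TCP sockets):
sudo service opscenterd restart
netstat -lnt | grep 8888   # should now show a LISTEN entry for port 8888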

Shut down Cassandra server and then restart it in Windows 7

I installed a single-node cluster on my local dev box, which runs Windows 7, and it was working fine. For some reason I needed to restart my desktop, and after that, whenever I run the following at the command prompt, it always gives me the exception below:
S:\Apache Cassandra\apache-cassandra-1.2.3\bin>cassandra -f
Starting Cassandra Server
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 7199; nested exception is:
java.net.BindException: Address already in use: JVM_Bind
This means the port is being used by something. I have made some changes in the cassandra.yaml file, so I need to shut down the Cassandra server and then restart it.
Can anybody help me with this?
Thanks for the help.
On Windows 7, with Apache Cassandra, a pid.txt file gets created in the root folder of Cassandra. Use the following instruction to stop the server:
d:/cassandra/bin> stop-server -p ../pid.txt -f
(Note: the -f flag runs Cassandra in the foreground; if it was instead started as a background service, you can stop it through the Task Manager.)
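If the pid.txt route doesn't work, a hedged alternative is to find whichever process is holding port 7199 (the port in the exception) and kill it, using standard Windows commands:
netstat -ano | findstr :7199
taskkill /PID <pid-from-above> /F   REM <pid-from-above> is a placeholder for the PID netstat reports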
It sounds like your Cassandra server starts on its own as a service in the background when your machine boots. You can configure Windows startup services. To run Cassandra in the foreground on Windows, simply use:
> cassandra.bat
If you are using Cassandra bundled with DataStax Community Edition and running as a service on startup of your machine, then you can execute the following commands to start and stop the Cassandra server.
Start a command prompt with admin rights
Run the following commands:
net start DataStax_Cassandra_Community_Server
net stop DataStax_Cassandra_Community_Server
