pyspark installation on windows 10 fails - apache-spark

I installed Spark according to all the tutorials I found on the internet and set up all the environment variables, yet I am still not able to launch it. Please see the attached report.

Make sure your environment variables are set up properly for SPARK_HOME and PATH, for example:
SPARK_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7
PATH += D:\Spark\spark-2.3.0-bin-hadoop2.7\bin
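If the variables are set, a quick way to confirm them from Python before launching is a check like the one below. This uses the optional findspark package (pip install findspark), which is not part of the original answer, and assumes the paths above:
import os
print(os.environ.get("SPARK_HOME"))   # should print D:\Spark\spark-2.3.0-bin-hadoop2.7

import findspark
findspark.init()                      # reads SPARK_HOME and puts pyspark on sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)                  # 2.3.0 if the install is picked up correctly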

Related

Activate venv in vs code

I have been trying to activate a virtual environment created with Python's built-in venv module from VS Code, but it didn't work properly and I didn't receive any error message. It also doesn't work if I use the venv\Scripts\activate.bat command in the terminal.
Are you correctly setting up the venv?
python3 -m venv env
Then, in the bottom section of the VS Code taskbar (the status bar), you will find the interpreter indicator.
Select your interpreter (env) there to use it.
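On Windows, the equivalent setup and the manual activation mentioned in the question would be (assuming the environment is named env, as above):
python -m venv env
env\Scripts\activate.bat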
There are three things that I would check:
Does your vs-code have a default virtual environment defined as a user setting? Is that the one your current workspace is using?
Have you moved the folder you are working in since creating a virtual environment? If so, you should edit your venv/bin/activate script so that it has the correct value for the VIRTUAL_ENV variable.
In your project, do you have a .vscode/settings.json file that is referring to the wrong location or a location which doesn't exist? Specifically thinking of the "python.defaultInterpreterPath" setting.
These are things that I came across today when I had a similar problem. Hopefully that helps someone else!
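For that last check, a hypothetical .vscode/settings.json pointing at a local venv named env on Windows could look like this (the path is an example, not a required layout):
{
    "python.defaultInterpreterPath": "${workspaceFolder}/env/Scripts/python.exe"
}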
Ctrl+Shift+` (backtick)
This will open a new terminal and automatically activate your virtual environment; I found this in the VS Code documentation for Flask virtual environments.
I also tried venv\Scripts\activate.bat and it wasn't having it; however, I can't remember the issue I was having.
Hope this saves someone a lot of time.

Can I have more than one connection in databricks-connect?

I have set up a miniconda Python environment on my PC where I have installed the databricks-connect package and configured the tool with databricks-connect configure to connect to a Databricks instance I want to use when developing code in the US.
I need to connect to a different Databricks instance for developing code in the EU, and I thought I could do this by setting up a different miniconda environment, installing databricks-connect in that environment, and setting the configuration in that environment to point to the new Databricks instance.
Alas, this did not work. When I look at databricks-connect configure in either miniconda environment, I see the same configuration in both, which is whichever configuration I set last.
My question therefore is: Is there a way to have multiple databricks-connect connections at the same time and toggle between the two without having to reconfigure each time?
Thank you for your time.
Right now, databricks-connect relies on the central configuration file, and this causes problems. There are two approaches to workaround that:
Use environment variables as described in the documentation, but they need to be set somehow, and you still need different Python environments for different versions of databricks-connect
Specify the parameters as Spark configuration (see the same documentation)
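A minimal sketch of the second approach, with the connection details passed as Spark configuration instead of read from ~/.databricks-connect (the property names are the ones documented for databricks-connect; the workspace URL, token, and cluster ID below are placeholders):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.databricks.service.address", "https://adb-1234567890123456.7.azuredatabricks.net")  # EU workspace (placeholder)
    .config("spark.databricks.service.token", "dapiXXXXXXXXXXXXXXXX")                                  # personal access token (placeholder)
    .config("spark.databricks.service.clusterId", "0000-000000-abcde123")                              # target cluster (placeholder)
    .getOrCreate()
)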
For each Databricks cluster, do the following:
create a separate Python environment with name <name> & activate it
install databricks-connect into it
configure databricks-connect
move ~/.databricks-connect into ~/.databricks-connect-<name>
write a wrapper script that will activate the Python environment & symlink ~/.databricks-connect-<name> to ~/.databricks-connect (I have such a script for Zsh; it would be too long to paste here, but a rough sketch follows).
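A rough Python equivalent of the symlink part of that wrapper (the original script is Zsh and not shown; activating the environment is left to the shell, and on Windows creating symlinks may require administrator rights or developer mode):
import sys
from pathlib import Path

# hypothetical helper: point ~/.databricks-connect at the per-cluster config file
name = sys.argv[1]                                   # e.g. "us" or "eu"
link = Path.home() / ".databricks-connect"
target = Path.home() / f".databricks-connect-{name}"

if link.is_symlink() or link.exists():
    link.unlink()                                    # drop the old link/file
link.symlink_to(target)                              # e.g. ~/.databricks-connect -> ~/.databricks-connect-eu
print(f"databricks-connect now uses {target}")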

Setting Up Databricks Connect

After running databricks-connect configure, when I run databricks-connect test I get "The system cannot find the path specified." and then nothing happens - no error, nothing. Please help me resolve this. Since there is no error message, I am also hard pressed on what to google.
Update: I resolved this by matching the Java versions. The Databricks Runtime on the cluster is 6.5, and the documentation said Java 1.8.0_252, so I had to look for a version close to that, and it is working now (both JDK and JRE are working).
There is still a caveat though. For tables that belong to a data lake I am still unable to make it work with
sparklyr::spark_read_parquet(sc = sc, path = "/.../parquet_table", header = TRUE, memory = FALSE)
It does work for the tables that belong to the "default" database in databricks. Not sure if this is just in my case but I am tired of all the tweaking I have been doing for the past week lol. Please comment if anyone has been able to get this working!
One of the hints is that you have JDK 15 installed; databricks-connect needs Java 8, which matches the fix in the update above.
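A quick way to check which Java a databricks-connect session will pick up (this check is an addition, not part of the answer):
import os
import subprocess

print(os.environ.get("JAVA_HOME"))        # should point at a Java 8 (1.8.x) installation
subprocess.run(["java", "-version"])      # prints the version of the java on PATH (to stderr)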

Change Python path in CDH for pyspark

I need to change the Python that is being used with my CDH 5.5.1 cluster. My research pointed me to setting PYSPARK_PYTHON in spark-env.sh. I tried that manually without success. I then used Cloudera Manager to set the variable in both the 'Spark Service Environment Advanced Configuration Snippet' and the 'Spark Service Advanced Configuration Snippet', and about everywhere else that referenced spark-env.sh. This hasn't worked and I'm at a loss where to go next.
You need to add the PYSPARK_PYTHON variable to the YARN configuration:
YARN (MR2 Included) Service Environment Advanced Configuration Snippet (Safety Valve)
Do that, restart the cluster and you are good to go.
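For example, the entry in that safety valve could look like the line below (the interpreter path is a placeholder; it must exist at the same location on every node of the cluster):
PYSPARK_PYTHON=/opt/anaconda/bin/python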

How to set Database credentials for different environment in TALEND Open Studio

I have a requirement to set database connection credentials for my Talend jobs for different environments as runtime values. That is, if I want to run my jobs in the Development environment, they should pick DB credentials from a development csv/excel/text file, and if I am running them in Production, they should pick credentials from a Prod text file. Can someone please tell me whether this is possible and, if yes, guide me on how to do it? I read this link, but it does not show how to configure the values in a text/csv file.
http://blog.iadvise.eu/2014/05/27/use-of-contexts-within-talend/
Use context variables. This tutorial will help you set up context variables that have different values for each environment, so you can have different configurations for different environments with the same job.
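One common pattern, sketched here with made-up file and variable names, is to keep one delimited file per environment and load it at the start of the job with a tFileInputDelimited (two columns: key and value) feeding a tContextLoad; which file gets read can itself be driven by a context variable or a command-line parameter:
# dev_credentials.txt - one key;value pair per line
db_host;dev-db.example.com
db_port;1521
db_user;dev_user
db_password;dev_secret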
