How can I install flashtext on every executor? - apache-spark

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to the Spark Configuration docs, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that it was not only flashtext causing issues, but also every relative import from other scripts that I intended to use in a UDF. To fix it, I followed this article. I also took the source code of flashtext and imported it into the main file without installing the actual library.
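For illustration, the py-files route mentioned above would look roughly like this. A minimal sketch, assuming the flashtext package directory has been zipped into an archive named flashtext.zip (a hypothetical name); sc.addPyFile is the runtime counterpart of spark.submit.pyFiles and also puts the archive on the driver's import path:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('flashtext-udfs').getOrCreate()

# Ship the zipped module source to every executor; the zip is added to
# sys.path on the workers, so the import resolves inside UDFs.
spark.sparkContext.addPyFile('flashtext.zip')

from flashtext import KeywordProcessor  # now importable on driver and executors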

I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
spark = SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]: Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use the slightly different setting spark.yarn.dist.archives.
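For example, a hedged sketch of the same idea using the YARN-specific key (archive name and alias as above):
spark = SparkSession.builder \
    .config('spark.yarn.dist.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()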
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Related

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

This was already the subject of discussion in a previous post; however, I'm not convinced by the answers, as the Google docs specify that it is possible to create a cluster setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, sometimes it's more convenient to set it from the command line.
So I wanted to know why the following option, when passed to my cluster creation command, does not work: --properties core:fs.defaultFS=gs://my-bucket? Please note I haven't included all parameters, as I ran the command without this flag and it succeeded in creating the cluster. However, when passing it, I get: "failed: Cannot start master: Insufficient number of DataNodes reporting."
If anyone has managed to create a Dataproc cluster by setting fs.defaultFS, that'd be great. Thanks.
It's true there are still known issues due to certain dependencies on actual HDFS; the docs were not intended to imply that setting fs.defaultFS to a GCS path at cluster-creation time would work, but simply to provide a convenient example of a property that appears in core-site.xml. In theory it would work to set fs.defaultFS to a different preexisting HDFS cluster, for example. I've filed a ticket to change the example in the documentation to avoid confusion.
Two options:
Just override fs.defaultFS at job-submission time using per-job properties
Work around some of the known issues by setting fs.defaultFS explicitly using an initialization action instead of cluster properties.
Option 1 is the better-understood approach because cluster-level HDFS dependencies won't change (a per-job sketch follows the bdconfig example below). Option 2 works because most of the incompatibilities occur only during initial startup, and initialization actions run after the relevant daemons have already started up. To override the setting in an init action, you'd use bdconfig:
bdconfig set_property \
--name 'fs.defaultFS' \
--value 'gs://my-bucket' \
--configuration_file /etc/hadoop/conf/core-site.xml \
--clobber
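For Option 1, here is a minimal sketch of the per-job override as seen from inside a PySpark job: Spark copies spark.hadoop.* settings into the job's Hadoop Configuration, so prefixing the property name has the same effect as passing a per-job property (the bucket name is from the question; the read path is hypothetical):
from pyspark.sql import SparkSession

# Override fs.defaultFS for this job only; spark.hadoop.* entries are
# copied into the Hadoop Configuration the job uses.
spark = SparkSession.builder \
    .appName('gcs-default-fs') \
    .config('spark.hadoop.fs.defaultFS', 'gs://my-bucket') \
    .getOrCreate()

# Paths without an explicit scheme/authority now resolve against gs://my-bucket.
df = spark.read.text('/events/input.txt')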

Pyspark has different versions in driver (python3.5) and worker (python2.7)

I am using both HDFS and normal user mode. The default Python version locally is 3.5, and on HDFS it is 2.7. This error popped up when I was trying to load files from HDFS and display them in Jupyter.
I tried to edit the spark-env.sh file, but when I looked for it there were multiple spark-env.sh files, and I edited all of them in vain. I found similar questions on Stack Overflow, but nothing seems to work for my particular problem.
If you require information on anything, please let me know in the comments, as I don't know what kind of information is required here.
You have to make sure that the following environment variables in your spark-env.sh point to python binary executables with the same(!) version on all(!) your nodes:
PYSPARK_DRIVER_PYTHON
PYSPARK_PYTHON
If PYSPARK_PYTHON is currently not set, please set it. PYSPARK_PYTHON defines the executable for executor and driver. When you only set PYSPARK_DRIVER_PYTHON to python3.5, the executor will still use the default python executable, which is python2.7, and this raises the exception you see.
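A minimal sketch of one way to do that, assuming python3.5 is installed at the same path on every node (the interpreter path is an assumption; both variables must be set before the SparkContext starts, e.g. in spark-env.sh or at the top of your script):
import os

# Both must point to the same interpreter version on all nodes.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.5'         # executors
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3.5'  # driver

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('version-check').getOrCreate()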

Hdfs file access in spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about what the proper hdfs file path format should be. Reading an hdfs file from the spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit the jar (which contains the same set of code) to YARN, it gives the error saying
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add name node ip as hdfs://namenodeserver/datastore/events.txt everything works.
I am a bit confused about the behaviour and need some guidance.
Note: I am using aws emr set up and all the configurations are default.
If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path; in your example that would be "nn1home:8020/.."
If you want to make it simple, then just use sc.textFile("hdfs:/input/war-and-peace.txt")
That's only one /
I think it will work.
Problem solved. As I debugged further, the fs.defaultFS property was not used from core-site.xml when I just passed the path as hdfs:///path/to/file, although all the hadoop config properties were loaded (as I logged the sparkContext.hadoopConfiguration object).
As a workaround, I manually read the property as sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know if it is the correct way of doing it.
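For what it's worth, the same workaround from PySpark would look roughly like this (the question uses the Scala shell; _jsc is PySpark's internal handle to the JavaSparkContext, so treat this as a sketch rather than a stable API):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read fs.defaultFS from the loaded Hadoop configuration and prepend it,
# mirroring the manual workaround described above.
fs_default = spark.sparkContext._jsc.hadoopConfiguration().get('fs.defaultFS')
rdd = spark.sparkContext.textFile(fs_default + '/datastore/events.txt')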

Error While inserting rows into Kudu using Spark Shell

I am new to Apache Kudu. I installed it on my Ubuntu system and later created a table in it using the Apache Spark shell. Now I am trying to insert data into that table using insertRows(); for that I am using the below given command:
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
Where customersDF is a DataFrame and spark_kudu_tbl is a table in the Kudu database. I am getting the below error:
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options, but none of them worked for me. Can anyone suggest a solution?
From the error message it appears as though you are using the wrong kudu-spark artifact; you should use kudu-spark2_2.11. Please start your spark-shell as below (replace the last bit with your Kudu version):
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0
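If you are on PySpark rather than the Scala shell, the equivalent insert goes through the DataFrame writer instead of KuduContext; a sketch, with a placeholder master address:
# Append (insert) a DataFrame into an existing Kudu table from PySpark.
customersDF.write \
    .format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', 'kudu-master:7051') \
    .option('kudu.table', 'spark_kudu_tbl') \
    .mode('append') \
    .save()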

Which config file to use for each GG example

Which spring-????-config.xml should I use to start GG nodes so that the .NET example GridClientApiExample works?
Each GridGain example provides a short description of how to run remote nodes in the example documentation.
Usually there are two ways to run remote nodes for the example. The first and, probably, the most convenient one is to run corresponding *NodeStartup class from IDE in the examples project. The name of startup class is specified in example documentation. The second way is to start a node with ggstart.{sh|bat} script with a configuration file specified in the documentation (if available).
GridClientApiExample works only with node started from IDE with ClientExampleNodeStartup, and there is a reason for it. The example expects a specific task class (org.gridgain.examples.misc.client.api.ClientExampleTask) to be in the node's classpath. Since this is an example-only class, it is not present in node classpath when running ggstart.{sh|bat}.
If for some reason you want to run a node with command line script for this example, you should build examples jar file and drop it to $GRIDGAIN_HOME/libs/ext (startup script will automatically pick up all additional libraries placed in this folder). Then you can use the same config which ClientExampleNodeStartup uses, namely examples/config/example-compute.xml
You can use ClientExampleNodeStartup or start a node with ggstart.sh examples/config/example-compute.xml
