Getting started with Spark (Datastax Enterprise) - cassandra

I'm trying to set up and run my first Spark query, following the official example.
On our local machines we have already set up the latest version of the DataStax Enterprise package (for now it is 4.7).
I did everything exactly according to the documentation and added the latest version of dse.jar to my project, but errors come up right from the beginning.
Here is the snippet from their example:
SparkConf conf = DseSparkConfHelper.enrichSparkConf(new SparkConf())
    .setAppName("My application");
DseSparkContext sc = new DseSparkContext(conf);
Now it appears that the DseSparkContext class has only a default empty constructor.
Right after these lines comes the following:
JavaRDD<String> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "my_table", mapColumnTo(String.class))
    .select("my_column");
And here comes the main problem: the CassandraJavaUtil.javaFunctions(sc) method accepts only a SparkContext as input, not a DseSparkContext (SparkContext and DseSparkContext are completely different classes, and one does not inherit from the other).
I assume the documentation is not up to date with the release version. If anyone has met this problem before, please share your experience.
Thank you!

It looks like there is a bug in the docs. That should be
DseSparkContext.apply(conf)
DseSparkContext is a Scala object that uses the apply method to create new SparkContexts. In Scala you can just write DseSparkContext(conf), but in Java you must actually call the method. I know you don't have access to this code, so I'll make sure this gets fixed in the documentation and see if we can get better API docs up.


Is it OK to replace commons-text-1.6.jar by commons-text-1.10.jar (related to security alert CVE-2022-42889 / QID 377639 Text4Shell)?

Is it OK to replace commons-text-1.6.jar by commons-text-1.10.jar (related to security alert CVE-2022-42889 / QID 377639 Text4Shell)?
Would it introduce compatibility issues for users' PySpark code?
The reason for this question is that in many settings, folks don't have rich regression test suites to test for PySpark/Spark changes.
Here is the background info:
On 2022-10-13, the Apache Commons Text team disclosed CVE-2022-42889 (also tracked as QID 377639, and named Text4Shell): prior to v1.10, using StringSubstitutor could trigger unwanted network access or code execution.
The PySpark package includes commons-text-1.6.jar in its jars directory. The presence of such a jar could trigger a security finding and require security remediation in an enterprise setting.
Going through the Spark source code (master branch, 3.2+), StringSubstitutor is used only in ErrorClassesJSONReader.scala. PySpark does not seem to use StringSubstitutor directly, but it is not clear whether PySpark code uses this ErrorClassesJSONReader or not. (Grepping the PySpark 3.1.2 source code yields no results; grepping for json yields several files in the sql and ml directories.)
I assembled a conda env with PySpark and then replaced commons-text-1.6.jar with commons-text-1.10.jar. The several test cases I tried worked OK.
So the question is: does anyone know of any compatibility issue in replacing commons-text-1.6.jar with commons-text-1.10.jar? (Will it break user PySpark/Spark code?)
Thanks,
There appears to be a similar item under the Spark issue https://issues.apache.org/jira/browse/SPARK-40801, and it has completed PRs that changed the commons-text version to 1.10.0.
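Not a substitute for a real regression suite, but as a minimal PySpark smoke test after the jar swap, something like the sketch below could be run against the patched installation. The error-triggering query at the end is only an assumption about where StringSubstitutor might be exercised (via ErrorClassesJSONReader) on newer Spark versions:
from pyspark.sql import SparkSession

# Minimal smoke test after replacing commons-text-1.6.jar with commons-text-1.10.jar
# in the PySpark installation's jars directory.
spark = SparkSession.builder.appName("commons-text-smoke-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("t")
spark.sql("SELECT id, upper(value) AS value FROM t").show()

# Deliberately trigger an analysis error; on newer Spark versions error-message
# rendering goes through ErrorClassesJSONReader, which is where StringSubstitutor
# is used (an assumption about coverage, not a guarantee).
try:
    spark.sql("SELECT no_such_column FROM t").show()
except Exception as e:
    print("Got expected error:", type(e).__name__)

spark.stop()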

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration docs, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. To fix it, I followed this article. I also took the flashtext source code and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
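Putting the pieces together, a minimal sketch might look like the following. It assumes the archive name and the #myUDFs alias from above, that the archive unpacks with the flashtext package at its top level, and a made-up keyword-extraction UDF:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Assumes flashtext-2.7-pyh9f0a1d_0.tar.gz is readable by the driver and unpacks
# into a directory that directly contains the flashtext package.
spark = (
    SparkSession.builder
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs')
    .config('spark.executorEnv.PYTHONPATH', './myUDFs')
    .getOrCreate()
)

@udf(returnType=ArrayType(StringType()))
def extract_keywords(text):
    # Import inside the UDF so it is resolved on the executor, not on the driver.
    from flashtext import KeywordProcessor
    kp = KeywordProcessor()
    kp.add_keyword('spark')
    return kp.extract_keywords(text or '')

df = spark.createDataFrame([('I like apache spark',)], ['text'])
df.select(extract_keywords('text').alias('keywords')).show(truncate=False)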

Failed to execute 'table' on org.apache.spark.sql.SparkSession

I have a Spark + Hive application.
It works fine. But at some point I had to create another Hive environment.
So I ran show create table ... and recreated the same view (with its underlying tables), and added some data.
I can query the data from the Hive CLI, etc.,
but whenever I run my application it fails with
ERROR Failed to execute 'table' on 'org.apache.spark.sql.SparkSession' with args=([Type=java.lang.String, Value: <view name>])
I believe it refers to the line of code where I call sparkSession.table(<view-name>).
What steps can I take to troubleshoot such an issue?
UPD
Session declaration (I also definitely tried to create a session without this configuration):
.Config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
.Config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "some.file")
.Config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.Config("spark.sql.debug.maxToStringFields", int64 2048)
.Config("spark.debug.maxToStringFields", int64 2048)
Maybe a bit trivial, but when it comes to troubleshooting this kind of issue, really try to get to the root of the problem with a minimal setup:
I generally start off by starting the spark-shell.
Check whether it is possible to run spark.sql("SHOW DATABASES").show(20, false). If this fails, it's probably something with your Hive configuration, indeed.
Try and see whether you can run spark.table("your_table"). If not, it'll probably give you a clearer error (such as Table or view not found: ...).
If all of the above works, try to strip your application such that it only does that spark.table, which did work in your spark-shell at that point in time. If that suddenly doesn't work, it might have to do with how the SparkSession is created in your application.
If that works, try and uncomment the code piece by piece, until you're back to your original code to better pinpoint where it fails.
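For reference, here is a minimal sketch of the first two checks, written for a pyspark shell (the spark-shell equivalents are the same calls in Scala); your_view_name is a placeholder:
from pyspark.sql import SparkSession

# Plain session with no extra configs, so a failure here points at the
# Hive/metastore setup itself rather than at the application.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Step 1: can we see the metastore at all?
spark.sql("SHOW DATABASES").show(20, False)

# Step 2: can we resolve the view itself? A clearer error
# (e.g. "Table or view not found") usually surfaces here.
spark.table("your_view_name").printSchema()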

Running spark code locally on eclipse with spark installed on remote server

I have configured Eclipse for Scala, created a Maven project, and written a simple word-count Spark job on Windows. Now Spark + Hadoop are installed on a Linux server. How can I launch my Spark code from Eclipse onto the Spark cluster (which is on Linux)?
Any suggestions?
Actually the answer is not as simple as you might expect.
I will make several assumptions: first, that you use sbt; second, that you are working on a Linux-based computer; third, that you have two classes in your project, let's say RunMe and Globals; and last, that you want to set up the settings inside the program. Thus, somewhere in your runnable code you must have something like this:
object RunMe {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("mesos://master:5050") // If you use Mesos, and if your network resolves the hostname master to its IP.
      .setAppName("my-app")
      .set("spark.executor.memory", "10g")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc) // SQLContext needs the SparkContext
    // your code comes here
  }
}
The steps you must follow are:
Compile the project from its root by using:
$ sbt assembly
Send the job to the master node; this is the interesting part (assuming your project has the structure target/scala/, and inside it a .jar file, which corresponds to the compiled project):
$ spark-submit --class RunMe target/scala/app.jar
Notice that, because I assumed the project has two or more classes, you have to identify which class you want to run. Furthermore, I would bet that the approaches for YARN and Mesos are very similar.
If you are developing a project on Windows and you want to deploy it in a Linux environment, you would want to create an executable JAR file, export it to the home directory of your Linux machine, and specify it in your Spark submit script (on your terminal). This is all possible because of the beauty of the Java Virtual Machine. Let me know if you need more help.
To achieve what you want, you would need:
First: Build the jar (if you use gradle -> fatJar or shadowJar)
Second: In your code, when you create the SparkConf, you need to specify the master address, spark.driver.host, and the location of your jar, something like:
SparkConf conf = new SparkConf()
    .setMaster("spark://SPARK-MASTER-ADDRESS:7077")
    .set("spark.driver.host", "IP address of your local machine")
    .setJars(new String[]{"path\\to\\your\\jar file.jar"})
    .setAppName("APP-NAME");
And third: just right-click and run it from your IDE. That's it!
What you are looking for is the master where the SparkContext should be created.
You need to set your master to be the cluster you want to use.
I invite you to read the Spark Programming Guide or follow an introductory course to understand these basic concepts. Spark is not a tool you can begin working with overnight; it takes some time.
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark

Labelling Neo4j database using Neo4django

This question is related to the GitHub issue of neo4django. I want to create multiple graphs using the Neo4j graph DB from the Django web framework. I'm using Django 1.4.5, Neo4j 1.9.2, and neo4django 0.1.8.
As of now neo4django doesn't support labeling, but that is my core purpose: I want to be able to create labels from neo4django. So I went into the source code and tried to tweak it a little to see if I could make this addition. In my understanding, the file db/models/properties.py has the class BoundProperty(AttrRouter), which calls a Gremlin script through the function save(instance, node, node_is_new). The script is as follows:
script = '''
node=g.v(nodeId);
results = Neo4Django.updateNodeProperties(node, propMap);
'''
The script calls the update function from library.groovy, and the function looks intuitive and nice. I'm trying to extend this function to support labeling, but I have no experience with Groovy. Does anyone have any suggestions on how to proceed? Any help would be appreciated. If it works, it would be a big addition to neo4django :)
Thank you
A little background:
The Groovy code you've highlighted is executed using the Neo4j Gremlin plugin. First, it supports the Gremlin graph DSL (e.g. node=g.v(nodeId)), which is implemented atop the Groovy language. Groovy itself is a dynamic superset of Java, so most valid Java code will work in scripts sent via connection.gremlin(...). Each script sent should define a results variable that will be returned to neo4django, even if it's just null.
Anyway, accessing Neo4j this way is handy (though it will be deprecated, I've heard :( ) because you can use the full Neo4j embedded Java API. Try something like this to add a label to a node:
from neo4django.db import connection
connection.gremlin("""
node = g.v(nodeId)
label = DynamicLabel.label('Label_Name')
node.rawVertex.addLabel(label)
""", nodeId=node_id)
You might also need to add an import for DynamicLabel; I haven't run this code, so I'm not sure. Debugging code written this way is a little tough, so make liberal use of the Gremlin tab in the Neo4j admin.
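For completeness, here is a variant of the same sketch with the DynamicLabel import spelled out inside the Gremlin script and an explicit results variable, per the note above. This is untested; 'Label_Name' and node_id are placeholders.
from neo4django.db import connection

# Same idea as the snippet above, with the import made explicit inside the
# Groovy/Gremlin script and results defined so neo4django gets a return value.
connection.gremlin("""
    import org.neo4j.graphdb.DynamicLabel

    node = g.v(nodeId)
    label = DynamicLabel.label('Label_Name')
    node.rawVertex.addLabel(label)
    results = null
""", nodeId=node_id)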
If you come up with a working solution, I'd love to see it (or an explanatory blog post!); I'm sure it could be helpful to other users.
HTH!
NB: Labels will be properly supported shortly after Neo4j 2.0's release; they'll replace the current in-graph type structure.
