Spark 1.4 image for Google Cloud? - apache-spark

With bdutil, the latest tarball version I can find is Spark 1.3.1:
gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz
There are a few new DataFrame features in Spark 1.4 that I want to use. Any chance the Spark 1.4 image will be made available for bdutil, or is there any workaround?
UPDATE:
Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz, and the deployment went well; however, I ran into an error when calling SqlContext.parquetFile(). I cannot explain why this exception is possible, since GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem. I will continue investigating this.
Caused by: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:177)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:504)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:356)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:171)
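For context, the failing call was of roughly this form (shown here as a minimal PySpark sketch with a hypothetical gs:// path; sqlContext.parquetFile() was later superseded by sqlContext.read.parquet()):
# Minimal PySpark sketch; the bucket and path are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-on-gcs")
sqlContext = SQLContext(sc)

# The call below is what raised the ClassCastException shown above.
df = sqlContext.parquetFile("gs://your-bucket/path/to/data.parquet")
df.printSchema()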
I asked a separate question about the exception here.
UPDATE:
The error turned out to be a Spark defect; a resolution/workaround is provided in the question linked above.
Thanks!
Haiying

If a local workaround is acceptable, you can copy spark-1.4.1-bin-hadoop2.6.tgz from an Apache mirror into a bucket that you control. You can then edit extensions/spark/spark-env.sh and change SPARK_HADOOP2_TARBALL_URI='<your copy of spark 1.4.1>' (make certain that the service account running your VMs has permission to read the tarball).
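As a rough sketch (gs://your-bucket below is a hypothetical bucket you control, and the service account on your VMs needs read access to it), the copy step would be something like:
gsutil cp spark-1.4.1-bin-hadoop2.6.tgz gs://your-bucket/spark-1.4.1-bin-hadoop2.6.tgz
and the edited line in extensions/spark/spark-env.sh would then read:
SPARK_HADOOP2_TARBALL_URI='gs://your-bucket/spark-1.4.1-bin-hadoop2.6.tgz'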
Note that I haven't done any testing to see if Spark 1.4.1 works out of the box right now, but I'd be interested in hearing your experience if you decide to give it a go.

Related

Is it OK to replace commons-text-1.6.jar by commons-text-1.10.jar (related to security alert CVE-2022-42889 / QID 377639 Text4Shell)?
Would it introduce compatibility issues for users' PySpark code?
The reason for this question is that in many settings, folks don't have rich regression test suites to test for PySpark/Spark changes.
Here is the background info:
On 2022-10-13, the Apache Commons Text team disclosed CVE-2022-42889 (also tracked as QID 377639, and named Text4Shell): prior to v1.10, using StringSubstitutor could trigger unwanted network access or code execution.
PySpark packages include commons-text-1.6.jar in the lib/jars directory. The presence of such a jar could trigger a security finding and require remediation in an enterprise setting.
Going through the Spark source code (master branch, 3.2+), StringSubstitutor is used only in ErrorClassesJSONReader.scala. PySpark does not seem to use StringSubstitutor directly, but it is not clear whether PySpark code uses ErrorClassesJSONReader or not. (Grepping the PySpark 3.1.2 source code does not yield any results; grepping for json yields several files in the sql and ml directories.)
I have assembled a conda env with PySpark and then replaced commons-text-1.6.jar with commons-text-1.10.jar. The several test cases I tried worked OK.
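To locate the bundled jar in a pip/conda-installed PySpark, something like this can help (the exact jars location under the pyspark package is an assumption to verify for your layout):
# Locate the commons-text jar bundled with an installed pyspark package.
# Assumes a pip/conda install whose jars live under <pyspark package dir>/jars.
import glob
import os
import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(glob.glob(os.path.join(jars_dir, "commons-text-*.jar")))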
So the question is: does anyone know of any compatibility issue in replacing commons-text-1.6.jar with commons-text-1.10.jar? (Will it break user PySpark/Spark code?)
Thanks,
There appears to be a similar item under the Spark issue https://issues.apache.org/jira/browse/SPARK-40801, and it has completed PRs that changed the commons-text version to 1.10.0.

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration docs, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. To fix it, I followed this article. I also took the source code from flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
spark = SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName] ... Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
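Putting the two settings together, a minimal end-to-end sketch might look like this (the archive name is taken from the question; it is an assumption that the unpacked tarball exposes the flashtext package at its top level so ./myUDFs works as PYTHONPATH):
# Hedged sketch: archive name from the question; verify how the archive unpacks.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = (SparkSession.builder
         .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs')
         .config('spark.executorEnv.PYTHONPATH', './myUDFs')
         .getOrCreate())

@udf(returnType=ArrayType(StringType()))
def extract_keywords(text):
    # Imported inside the UDF so the module is resolved on the executor,
    # where the archive has been unpacked into ./myUDFs.
    from flashtext import KeywordProcessor
    kp = KeywordProcessor()
    kp.add_keyword('spark')
    return kp.extract_keywords(text or '')

df = spark.createDataFrame([('apache spark on yarn',)], ['text'])
df.select(extract_keywords('text')).show()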
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use the slightly different setting spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Error While inserting rows into Kudu using Spark Shell

I am new to Apache Kudu. I installed it on my Ubuntu system and later created a table in it using the Apache Spark shell. Now I am trying to insert data into that table using insertRows(), for which I am using the command given below:
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
where customersDF is a DataFrame and spark_kudu_tbl is a table in the Kudu database. I am getting the error below:
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options, but none of them is working for me. Can anyone suggest a solution?
From the error message it appears as though you are using the wrong kudu-spark artifact; you should use kudu-spark2_2.11. Please start your spark-shell as below (replace the last bit with your Kudu version):
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0

What happens - NoSuchMethodError: com.datastax.driver.core.ResultSet.fetchMoreResults

cassandra-connector-assembly-2.0.0 built from the GitHub project, with Scala 2.11.8 and cassandra-driver-core-3.1.0:
sc.cassandraTable("mykeyspace", "mytable").select("something").where("key=?", key).mapPartitions(par => {
par.map({ row => (row.getString("something"), 1 ) })
})
.reduceByKey(_ + _).collect().foreach(println)
The same job works fine when reading a smaller amount of data; with a larger amount I get:
java.lang.NoSuchMethodError: com.datastax.driver.core.ResultSet.fetchMoreResults()Lshade/com/datastax/spark/connector/google/common/util/concurrent/ListenableFuture;
at com.datastax.spark.connector.rdd.reader.PrefetchingResultSetIterator.maybePrefetch(PrefetchingResultSetIterator.scala:26)
at com.datastax.spark.connector.rdd.reader.PrefetchingResultSetIterator.next(PrefetchingResultSetIterator.scala:39)
at com.datastax.spark.connector.rdd.reader.PrefetchingResultSetIterator.next(PrefetchingResultSetIterator.scala:17)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Can anyone point out the issue and suggest a possible solution?
I had the same problem.
There were two dependencies in the project which both had cassandra-driver-core as a dependency:
spark-cassandra-connector_2.11-2.0.0-M3 and
job-server-api_2.10-0.8.0-SNAPSHOT
spark-cassandra-connector expected ResultSet.fetchMoreResults to have a different return type, due to its shading of Guava:
expected: shade.com.datastax.spark.connector.google.common.util.concurrent.ListenableFuture
found: com.google.common.util.concurrent.ListenableFuture
Switching to an unshaded version of the Cassandra connector corrected the issue.
It is a conflict with the Cassandra driver-core that
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-M3"
brings in.
If you go into the ~/.ivy2/cache/com.datastax.spark/spark-cassandra-connector_2.11 you will find a file called ivy-2.0.0-M3.xml
In that file the dependency is:
<dependency org="com.datastax.cassandra" name="cassandra-driver-core" rev="3.0.2" force="true"/>
Note that it is the 3.0.2 version of the Cassandra driver core which gets overridden by the more recent one.
It just so happens that the latest source on GitHub does not show an implementation for fetchMoreResults, which is inherited from the interface PagingIterable.
If you roll back to the 3.0.x version on GitHub, you'll find:
public ListenableFuture<ResultSet> fetchMoreResults();
So it looks like the newest Cassandra core drivers were rushed out the door incomplete. Or I might be missing something. Hope this helps.
tl;dr: Remove the latest driver and use the one embedded in the Spark Cassandra connector.
The problem is resolved by removing cassandra-driver-core-3.1.0-shaded.jar from spark/jars/.
This is a typical Java duplicate-classes conflict problem.
You need to confirm all the jars included and check whether any duplicate jars are involved; the solution mentioned above is only one such case.
For all these problems, run the command below and check whether any overlapping dependencies exist:
mvn dependency:tree

TaskMemoryManager is disabled

I am trying to execute the TaskTracker on Cygwin, but the following error occurs:
mapred.TaskTracker: Process Tree implementation is missing on this system. TaskMemoryManager is disabled.
Everything else (i.e. NameNode, SecondaryNameNode, JobTracker, and DataNode) works properly through Cygwin, but the issue is with the TaskTracker. My Hadoop version is hadoop-19.0.1.
So, how do I get rid of this? If anybody knows, please help!
Your help will be appreciated!
I haven't encountered this specific problem, but:
Make sure that you are using the same Hadoop version that is in use on the cluster.
Update Hadoop to a more recent version if possible.
The following patches may (or may not) address your problem:
https://issues.apache.org/jira/browse/HADOOP-6230
https://issues.apache.org/jira/browse/MAPREDUCE-834
