Is it OK to replace commons-text-1.6.jar with commons-text-1.10.jar (related to security alert CVE-2022-42889 / QID 377639, Text4Shell)? - apache-spark

Is it OK to replace commons-text-1.6.jar with commons-text-1.10.jar (related to security alert CVE-2022-42889 / QID 377639, Text4Shell)?
Would it introduce compatibility issues for users' PySpark code?
The reason for this question is that in many settings, folks don't have rich regression test suites to test PySpark/Spark changes.
Here is the background:
On 2022-10-13, the Apache Commons Text team disclosed CVE-2022-42889 (also tracked as QID 377639 and named Text4Shell): prior to v1.10, using StringSubstitutor could trigger unwanted network access or code execution.
The PySpark package includes commons-text-1.6.jar in the lib/jars directory. The presence of such a jar can trigger a security finding and require remediation in an enterprise setting.
Going through the Spark source code (master branch, 3.2+), StringSubstitutor is used only in ErrorClassesJSONReader.scala. PySpark does not seem to use StringSubstitutor directly, but it is not clear whether PySpark code uses ErrorClassesJSONReader or not. (A grep of the PySpark 3.1.2 source code yields no results; a grep for json yields several files in the sql and ml directories.)
I assembled a conda env with PySpark and then replaced commons-text-1.6.jar with commons-text-1.10.jar. The several test cases I tried worked OK.
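For reference, locating the bundled jar in a pip/conda installation can be done with something along these lines (a sketch only; the exact directory layout may differ between installs):
# Sketch: list the commons-text jar(s) bundled with an installed PySpark.
# Assumes the pip/conda layout where jars live under the pyspark package.
import glob
import os
import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(glob.glob(os.path.join(jars_dir, "commons-text-*.jar")))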
So the question is: does anyone know of any compatibility issue in replacing commons-text-1.6.jar with commons-text-1.10.jar? (Will it break user PySpark/Spark code?)
Thanks,

There appears to be a similar item in the Spark issue tracker, https://issues.apache.org/jira/browse/SPARK-40801, which has completed PRs that bumped the commons-text version to 1.10.0.

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration docs, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way how I can install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
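As a quick sanity check (a hypothetical example, not part of the original answer), you can force the import to happen on the executors rather than on the driver:
def probe(_):
    # Importing inside the task means the import runs on the executor.
    from flashtext import KeywordProcessor
    return KeywordProcessor.__name__

print(spark.sparkContext.parallelize(range(2), 2).map(probe).collect())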
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]: Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that an archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Azure ML release bug AZUREML_COMPUTE_USE_COMMON_RUNTIME

On 2021-10-13, our application on the Azure ML platform started hitting a new error that causes failures in pipeline steps: Python module import failures, with a warning stack that leads to a pipeline runtime error.
We needed to set AZUREML_COMPUTE_USE_COMMON_RUNTIME to false. Why is it failing? What exactly are the (long-term) consequences of opting out? Also, Azure ML users: do you think it was rolled out appropriately?
Try adding a new variable to your environment, like this:
environment.environment_variables = {"AZUREML_COMPUTE_USE_COMMON_RUNTIME":"false"}
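For context, a minimal sketch of how that Environment might be wired into a run with the v1 Python SDK (the environment, script, and compute names here are placeholders, not from the original answer):
from azureml.core import Environment, ScriptRunConfig

environment = Environment.from_conda_specification("my-env", "conda.yml")
environment.environment_variables = {"AZUREML_COMPUTE_USE_COMMON_RUNTIME": "false"}

src = ScriptRunConfig(source_directory=".", script="train.py",
                      environment=environment, compute_target="my-cluster")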
Long term (throughout 2022), AzureML will be fully migrating to the new Common Runtime on AmlCompute. Short term, this change is a large undertaking, and we're on the lookout for tricky functionality of the old Compute Runtime we're not yet handling correctly.
One small note on disabling the Common Runtime: it can be more efficient (it avoids an Environment rebuild) to add the environment variable directly to the RunConfig:
run_config.environment_variables["AZUREML_COMPUTE_USE_COMMON_RUNTIME"] = "false"
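A sketch of how that RunConfig route might look end to end (v1 SDK assumed; the experiment, script, and cluster names are placeholders):
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()
run_config = RunConfiguration()
run_config.target = "my-cluster"
run_config.environment_variables["AZUREML_COMPUTE_USE_COMMON_RUNTIME"] = "false"

src = ScriptRunConfig(source_directory=".", script="train.py", run_config=run_config)
run = Experiment(ws, "common-runtime-opt-out").submit(src)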
We'd like to get more details about the import failures, so we can fix the regression. Are you setting the PYTHONPATH environment variable to make your custom scripts importable? If so, this is something we're aware isn't working as expected and are looking to fix it within the next two weeks.
We identified the issue and have rolled out a hotfix on our end addressing the issue. There are two problems that could've caused the import issue. One is that we are overwriting the PYTHONPATH environment variable. The second is that we are not adding the python script's containing directory to python's module search path if the containing directory is not the current working directory.
It would be great if you could try again without setting the AZUREML_COMPUTE_USE_COMMON_RUNTIME environment variable and see if the problem is still there. If it is, please reply to either Lucas's thread or mine with a minimal repro or a description of where the module you are trying to import is located relative to the script being run and the root of the snapshot (which is the current working directory).

Warnings trying to read Spark 1.6.X Parquet into Spark 2.X

When attempting to load a Spark 1.6.x Parquet file into Spark 2.x, I am seeing many WARN-level statements.
16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
at [rest of stacktrace omitted]
I am running the 2.1.0 release and there are multitudes of these warnings. Is there any way, short of changing the logging level to ERROR, to suppress these?
It seems these are the result of a fix that was made, but the warnings have not yet been removed. Here are some details from that JIRA:
https://issues.apache.org/jira/browse/SPARK-17993
I have built the code from the PR and it indeed succeeds in reading the data. I have tried doing df.count() and now I'm swarmed with warnings like this (they just keep getting printed endlessly in the terminal):
Setting the logging level to ERROR is a last-ditch approach: it swallows messages we rely upon for standard monitoring. Has anyone found a workaround for this?
For the time being, i.e. until/unless this Spark/Parquet bug is fixed, I will be adding the following to log4j.properties:
log4j.logger.org.apache.parquet=ERROR
The location is:
when running against an external Spark server: $SPARK_HOME/conf/log4j.properties
when running locally inside IntelliJ (or another IDE): src/main/resources/log4j.properties
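If editing log4j.properties is inconvenient (e.g. in an interactive PySpark session), the same per-logger override can presumably be applied at runtime through the JVM gateway; a sketch, assuming the log4j 1.x API that Spark 2.x ships:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Quiet only the Parquet reader's logger, leaving other WARNs intact.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.apache.parquet").setLevel(log4j.Level.ERROR)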

Spark 1.4 image for Google Cloud?

With bdutil, the latest version of the tarball I can find is for Spark 1.3.1:
gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz
There are a few new DataFrame features in Spark 1.4 that I want to use. Any chance the Spark 1.4 image will be made available for bdutil, or is there any workaround?
UPDATE:
Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz, and the deployment went well; however, I ran into an error when calling SqlContext.parquetFile(). I cannot explain why this exception is possible, since GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem. I will continue investigating this.
Caused by: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:177)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:504)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:356)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:171)
Asked a separate question about the exception here
UPDATE:
The error turned out to be a Spark defect; a resolution/workaround is provided in the question above.
Thanks!
Haiying
If a local workaround is acceptable, you can copy spark-1.4.1-bin-hadoop2.6.tgz from an Apache mirror into a bucket that you control. You can then edit extensions/spark/spark-env.sh and change SPARK_HADOOP2_TARBALL_URI='<your copy of spark 1.4.1>' (make certain that the service account running your VMs has permission to read the tarball).
Note that I haven't done any testing to see if Spark 1.4.1 works out of the box right now, but I'd be interested in hearing your experience if you decide to give it a go.

Getting started with Spark (Datastax Enterprise)

I'm trying to set up and run my first Spark query, following the official example.
On our local machines we have already set up the latest version of the DataStax Enterprise package (currently 4.7).
I did everything exactly according to the documentation and appended the latest version of dse.jar to my project, but errors come up right from the beginning:
Here is the snippet from their example:
SparkConf conf = DseSparkConfHelper.enrichSparkConf(new SparkConf())
.setAppName( "My application");
DseSparkContext sc = new DseSparkContext(conf);
Now it appears that the DseSparkContext class has only a default empty constructor.
Right after these lines comes the following:
JavaRDD<String> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
.cassandraTable("my_keyspace", "my_table", .mapColumnTo(String.class))
.select("my_column");
And here comes the main problem: the CassandraJavaUtil.javaFunctions(sc) method accepts only a SparkContext as input, not a DseSparkContext (SparkContext and DseSparkContext are completely different classes, and neither inherits from the other).
I assume the documentation is not up to date with the release version. If anyone has run into this problem before, please share your experience.
Thank you!
There appears to be a bug in the docs. That should be
DseSparkContext.apply(conf)
since DseSparkContext is a Scala object that uses the apply method to create new SparkContexts. In Scala you can just write DseSparkContext(conf), but in Java you must actually call the method. I know you don't have access to this code, so I'll make sure this gets fixed in the documentation and see if we can get better API docs up.
