Warnings trying to read Spark 1.6.X Parquet into Spark 2.X - apache-spark

When attempting to load a Spark 1.6.X Parquet file into Spark 2.X, I am seeing many WARN-level statements.
16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
at [rest of stacktrace omitted]
I am running the 2.1.0 release and there are multitudes of these warnings. Is there any way - short of changing the logging level to ERROR - to suppress these?
It seems these are the result of a fix that was made, but the warnings have not yet been removed. Here are some details from that JIRA:
https://issues.apache.org/jira/browse/SPARK-17993
I have built the code from the PR and it indeed succeeds reading the data. I have tried doing df.count() and now I'm swarmed with warnings like this (they just keep getting printed endlessly in the terminal):
Setting the logging level to ERROR is a last-ditch approach: it swallows messages we rely on for standard monitoring. Has anyone found a workaround for this?

For the time being - i.e. until/unless this Spark/Parquet bug is fixed - I will be adding the following to log4j.properties:
log4j.logger.org.apache.parquet=ERROR
The location is:
when running against an external Spark server: $SPARK_HOME/conf/log4j.properties
when running locally inside IntelliJ (or another IDE): src/main/resources/log4j.properties
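Alternatively, the same suppression can be done programmatically at the start of the job; a minimal sketch from PySpark, assuming Spark 2.x (which bundles log4j 1.x on the JVM side) and a SparkSession named spark:
from pyspark.sql import SparkSession

# Minimal sketch: raise the log level for org.apache.parquet so the WARNs are dropped.
spark = SparkSession.builder.getOrCreate()
log4j = spark._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.apache.parquet").setLevel(log4j.Level.ERROR)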

Related

Is it OK to replace commons-text-1.6.jar by commons-text-1.10.jar (related to security alert CVE-2022-42889 / QID 377639 Text4Shell)?

Would it introduce compatibility issues for users' PySpark code?
The reason for this question is that in many settings, folks don't have rich regression test suites to test for PySpark/Spark changes.
Here is the background info:
On 2022-10-13, the Apache Commons Text team disclosed CVE-2022-42889 (also tracked as QID 377639, and named Text4Shell): prior to v1.10, using StringSubstitutor could trigger unwanted network access or code execution.
PySpark packages include commons-text-1.6.jar in the lib/jars directory. The presence of such a jar could trigger a security finding and require remediation in an enterprise setting.
In going through the source code of both Spark (master branch, 3.2+) and PySpark, StringSubstitutor is used only in Spark's ErrorClassesJSONReader.scala. PySpark does not seem to use StringSubstitutor directly, but it is not clear whether PySpark code uses this ErrorClassesJSONReader or not. (Grepping the PySpark 3.1.2 source code does not yield any result; grepping for json yields several files in the sql and ml directories.)
I have assembled a conda env with PySpark, and then replaced commons-text-1.6.jar with commons-text-1.10.jar. The several test cases I tried worked OK.
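For reference, the bundled jars directory of a pip/conda PySpark install can be located like this (a sketch; paths vary by environment):
import os
import pyspark

# The jars shipped with the installed pyspark package live next to the package itself.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(jars_dir)  # commons-text-1.6.jar sits here; drop in commons-text-1.10.0.jar instead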
So the question is: does anyone know of any compatibility issue in replacing commons-text-1.6.jar with commons-text-1.10.jar? (Will it break user PySpark/Spark code?)
Thanks,
There appears to be a similar item under the Spark issue https://issues.apache.org/jira/browse/SPARK-40801, and it has completed PRs that changed the commons-text version to 1.10.0.

AWS RDS Postgres PostGIS upgrade problems

I have an RDS instance running Postgres 11.16. I'm trying to upgrade it to 12.11 but it's giving me errors on PostGIS. If I try a "modify" I get the following error in the precheck log file:
Upgrade could not be run on Sun Sep 18 06:05:13 2022
The instance could not be upgraded from 11.16.R1 to 12.11.R1 because of following reasons. Please take appropriate action on databases that have usages incompatible with requested major engine version upgrade and try again.
Following usages in database 'XXXXX' need to be corrected before upgrade:
-- The instance could not be upgraded because there are one or more databases with an older version of PostGIS extension or its dependent extensions (address_standardizer, address_standardizer_data_us, postgis_tiger_geocoder, postgis_topology, postgis_raster) installed. Please upgrade all installations of PostGIS and drop its dependent extensions and try again.
----------------------- END OF LOG ----------------------
First, I tried just removing PostGIS before the upgrade, planning to add it back afterwards, using drop extension postgis cascade;. However, this generated the same error.
Second, I tried running SELECT postgis_extensions_upgrade();. However, it gave me the following error:
NOTICE: Updating extension postgis_raster from unpackaged to 3.1.5
ERROR: function st_convexhull(raster) does not exist
CONTEXT: SQL statement "ALTER EXTENSION postgis_raster UPDATE TO "3.1.5";"
PL/pgSQL function postgis_extensions_upgrade() line 82 at EXECUTE
SQL state: 42883
Third, I tried to do a manual snapshot and upgrade the snapshot. Same results.
One additional piece of information, I ran SELECT PostGIS_Full_Version(); and this is what it returns:
"POSTGIS=""3.1.5 c60e4e3"" [EXTENSION] PGSQL=""110"" GEOS=""3.7.3-CAPI-1.11.3 b50468f"" PROJ=""Rel. 5.2.0, September 15th, 2018"" GDAL=""GDAL 2.3.1, released 2018/06/22"" LIBXML=""2.9.1"" LIBJSON=""0.12.1"" LIBPROTOBUF=""1.3.0"" WAGYU=""0.5.0 (Internal)"" TOPOLOGY RASTER (raster lib from ""2.4.5 r16765"" need upgrade) (raster procs from ""2.4.4 r16526"" need upgrade)"
As you'll notice, the raster lib is old but I can't really figure out how to upgrade it. I think this is what is causing me problems but I don't know how to overcome it.
I appreciate any thoughts.
After many failed attempts, I finally gave up on upgrading in place. I ended up solving this by:
Spinning up a new instance on the desired postgres version
Using pg_dump on the old version (schema and data)
Using pg_restore on the new version
I'm not sure if I did something wrong with the above, but I found my sequences were out of sync on a number of tables, so I wrote some scripts to reset the sequence values after the restore. I had to use something like this to re-sync each sequence:
SELECT setval('the_sequence', (SELECT MAX(the_primary_key) FROM the_table)+1);
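To cover every serial-backed column at once, a script along these lines does the job (a rough sketch, not my exact script; it assumes psycopg2 and default nextval()-based sequences, and the connection string is a placeholder):
import psycopg2

# Rough sketch: set every serial-backed sequence to MAX(column) + 1 after a restore.
conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT table_schema, table_name, column_name
        FROM information_schema.columns
        WHERE column_default LIKE 'nextval(%'
    """)
    for schema, table, column in cur.fetchall():
        qualified = f'"{schema}"."{table}"'
        cur.execute(
            f"SELECT setval(pg_get_serial_sequence('{qualified}', '{column}'), "
            f'COALESCE((SELECT MAX("{column}") FROM {qualified}), 0) + 1, false)'
        )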
I wasted enough time and this got me past the issue. Hopefully the next upgrade doesn't give me this much trouble.

Syntax error for time travel of delta sql

I ran the example in the Delta documentation:
SELECT * FROM delta.`/delta/events` VERSION AS OF 1
But got the following error:
mismatched input 'AS' expecting {<EOF>, ';'}(line 3, pos 44)
Does anyone know what the correct syntax is?
Spark version: 3.1.2
Delta version: 1.0.0
Spark is configured as follows:
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
This syntax is not supported in the open-source version right now, as it requires changes in Spark (the required changes have already been committed). Specifically, this is a bug in the documentation, which was copied from the Databricks Delta documentation. The documentation issue has already been reported and will be fixed in the next major release.
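Until the SQL syntax lands, time travel still works through the DataFrame reader options; for example, reading version 1 of the same table path:
# Read a specific version of the Delta table via the DataFrame API instead of SQL.
df = (spark.read.format("delta")
      .option("versionAsOf", 1)
      .load("/delta/events"))
df.show()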

Unable to start geomesa-accumulo

hduser@Neha-PC:/usr/local/geomesa-tutorials$ java -cp geomesa-tutorials-accumulo/geomesa-tutorials-accumulo-quickstart/target/geomesa-tutorials-accumulo-quickstart-2.3.0-SNAPSHOT.jar org.geomesa.example.accumulo.AccumuloQuickStart --accumulo.instance.id accumulo --accumulo.zookeepers localhost:2184 --accumulo.user root --accumulo.password PASS1234 --accumulo.catalog table1
Picked up JAVA_TOOL_OPTIONS: -Dgeomesa.hbase.coprocessor.path=hdfs://localhost:8020/hbase/lib/geomesa-hbase-distributed-runtime_2.11-2.2.0.jar
Loading datastore
java.lang.IncompatibleClassChangeError: Method org.locationtech.geomesa.security.AuthorizationsProvider.apply(Ljava/util/Map;Ljava/util/List;)Lorg/locationtech/geomesa/security/AuthorizationsProvider; must be InterfaceMethodref constant
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory$.buildAuthsProvider(AccumuloDataStoreFactory.scala:234)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory$.buildConfig(AccumuloDataStoreFactory.scala:162)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:48)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:36)
at org.geotools.data.DataAccessFinder.getDataStore(DataAccessFinder.java:121)
at org.geotools.data.DataStoreFinder.getDataStore(DataStoreFinder.java:71)
at org.geomesa.example.quickstart.GeoMesaQuickStart.createDataStore(GeoMesaQuickStart.java:103)
at org.geomesa.example.quickstart.GeoMesaQuickStart.run(GeoMesaQuickStart.java:77)
at org.geomesa.example.accumulo.AccumuloQuickStart.main(AccumuloQuickStart.java:25)
You need to ensure that all versions of GeoMesa on the classpath are the same. Just from your command, it seems you are at least mixing 2.3.0-SNAPSHOT with 2.2.0. Try checking out the git tag for the tutorial project that corresponds to the GeoMesa version you want, as described here. If you want to use a SNAPSHOT version, you need to make sure that you have pulled the latest changes for each project.
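A quick way to spot mixed versions is to scan whatever directory feeds your classpath for GeoMesa jars and compare the embedded version strings; a small sketch (the path is just an example):
import re
from pathlib import Path

# Collect the version suffix of every GeoMesa jar under a lib/classpath directory.
versions = set()
for jar in Path("/usr/local/geomesa-tutorials").rglob("geomesa-*.jar"):
    match = re.search(r"-(\d+\.\d+\.\d+(?:-SNAPSHOT)?)\.jar$", jar.name)
    if match:
        versions.add(match.group(1))
        print(jar.name, "->", match.group(1))
print("Mixed GeoMesa versions detected:" if len(versions) > 1 else "Consistent version:", versions)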

Spark 1.4 image for Google Cloud?

With bdutil, the latest tarball version I can find is for Spark 1.3.1:
gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz
There are a few new DataFrame features in Spark 1.4 that I want to use. Any chance the Spark 1.4 image will be available for bdutil, or is there any workaround?
UPDATE:
Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz. The deployment went well; however, I ran into an error when calling SqlContext.parquetFile(). I cannot explain why this exception is possible, since GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem. I will continue investigating this.
Caused by: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:177)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:504)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:356)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:171)
I asked a separate question about the exception here.
UPDATE:
The error turned out to be a Spark defect; the resolution/workaround is provided in the question linked above.
Thanks!
Haiying
If a local workaround is acceptable, you can copy spark-1.4.1-bin-hadoop2.6.tgz from an Apache mirror into a bucket that you control. You can then edit extensions/spark/spark-env.sh and change SPARK_HADOOP2_TARBALL_URI='<your copy of spark 1.4.1>' (make certain that the service account running your VMs has permission to read the tarball).
Note that I haven't done any testing to see if Spark 1.4.1 works out of the box right now, but I'd be interested in hearing your experience if you decide to give it a go.