Error while reading or writing Parquet format data - apache-spark

I have created an external table pointing to Azure ADLS with Parquet storage, and while inserting data into that table I get the error below. I am using Databricks for the execution.
org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
This was working fine yesterday, and I started getting this error today.
I couldn't find any answer on the internet explaining why this is happening.

This issue has been fixed. The cause of the error was that we had installed the Spark SQL DB connector provided by Azure as an uber jar, which also bundles dependencies related to the Parquet file format.

If you want a workaround without cleaning up dependencies, here is how you choose one of the sources (exemplified with "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"):
Replace:
spark.read.parquet("<path_to_parquet_file>")
With:
spark.read.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").load("<path_to_parquet_file>")

You may have more than one version of the same jar in the spark/jars/ directory, for example
spark-sql_2.12-2.4.4 and spark-sql_2.12-3.0.3, which can lead to this multiple-sources error.

I had a similar issue, caused by a jar dependency conflict. I use Maven with maven-shade-plugin to package my Spark jar, and I configured the plugin to exclude the conflicting jars. That works for me.
This is the relevant part of pom.xml:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.0</version>
    <configuration>
        <artifactSet>
            <excludes>
                <exclude>org.scala-lang:*:*</exclude>
                <exclude>org.apache.spark:*:*</exclude>
            </excludes>
        </artifactSet>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>spark_anticheat_shaded</finalName>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Related

How to run Cucumber JUnit tests in parallel without sharing data between invoked threads

I'm running Cucumber tests in parallel using the Maven configuration below:
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-failsafe-plugin</artifactId>
        <version>3.0.0-M5</version>
        <executions>
            <execution>
                <goals>
                    <goal>integration-test</goal>
                    <goal>verify</goal>
                </goals>
            </execution>
        </executions>
        <configuration>
            <includes>
                <include>TestRunner.java</include>
            </includes>
            <testFailureIgnore>true</testFailureIgnore>
            <parallel>methods</parallel>
            <threadCount>${parallelCount}</threadCount>
            <forkCount>${parallelCount}</forkCount>
            <reuseForks>false</reuseForks>
            <perCoreThreadCount>false</perCoreThreadCount>
        </configuration>
    </plugin>
</plugins>
Versions:
<serenity.version>3.2.0</serenity.version>
<cucumber.version>7.2.3</cucumber.version>
<junit.version>4.13.2</junit.version>
The issue is that the code runs fine and the tests run in parallel, but static variables are shared among threads even with reuseForks set to false.
I tried various combinations of the failsafe settings parallel, perCoreThreadCount,
useUnlimitedThreads and reuseForks, but no luck.
Any idea what changes are needed so that static data is not shared between threads? Thanks!
Fundamentally, it is a property of static fields that there is only one copy per class. This means that you cannot have a static field that is not shared by all threads.
Instead, you may want to look at using dependency injection. This lets you avoid static fields by injecting data into your step definition classes. That data is scoped to a scenario and does not leak out (unless you use static fields, of course).
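As a rough sketch of one way to get dependency injection in Cucumber-JVM, you could add the cucumber-picocontainer module to the test classpath. This assumes PicoContainer is an acceptable DI container for the project and reuses the cucumber.version property from the question; other DI modules would work along the same lines.

```xml
<!-- Assumption: PicoContainer is chosen as the DI container -->
<dependency>
    <groupId>io.cucumber</groupId>
    <artifactId>cucumber-picocontainer</artifactId>
    <!-- Reuses the cucumber.version property shown in the question -->
    <version>${cucumber.version}</version>
    <scope>test</scope>
</dependency>
```

With this module present, shared state can live in a plain class that step definition classes receive through their constructors. Cucumber creates fresh instances for each scenario, so the state is not shared across scenarios running on different threads.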

Conditional VDM Generation odata-generator-maven-plugin parameters

I am using the following Maven plugin to generate the VDMs for OData consumption.
<plugin>
    <groupId>com.sap.cloud.sdk.datamodel</groupId>
    <artifactId>odata-generator-maven-plugin</artifactId>
    <version>3.13.0</version>
    <executions>
        <execution>
            <id>generate-consumption</id>
            <phase>process-resources</phase>
            <goals>
                <goal>generate</goal>
            </goals>
            <configuration>
                <overwriteFiles>true</overwriteFiles>
                <inputDirectory>/src/main/resources/connectedsystem/edmx</inputDirectory>
                <outputDirectory>${project.basedir}/src/gen/java</outputDirectory>
                <deleteOutputDirectory>false</deleteOutputDirectory>
                <packageName>com.sap.requisitioning.vdm</packageName>
            </configuration>
        </execution>
    </executions>
</plugin>
However, I do not want the VDMs to be generated in every Maven build.
I would like to achieve the following behaviour:
VDMs are not generated by mvn clean install by default.
VDM classes are generated when we pass some explicit parameter, as in mvn clean install -D<>.
Could you please suggest how this can be achieved?
Regards
atanu
You can use Maven profiles to achieve this. Declare the plugin under a specific profile that is only active when a specific parameter is given, like in this example.
Additionally, you should take care that the generated sources are not deleted when running clean. This could happen if you generate them into the build output directory (typically target).
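The profile approach could look roughly like the sketch below. The profile id generate-vdm and the property name generateVdm are made-up names for illustration; the plugin block is the one from the question, moved out of the main build section into the profile.

```xml
<profiles>
    <profile>
        <!-- Hypothetical profile id -->
        <id>generate-vdm</id>
        <activation>
            <!-- Active only when -DgenerateVdm is passed (property name is an assumption) -->
            <property>
                <name>generateVdm</name>
            </property>
        </activation>
        <build>
            <plugins>
                <plugin>
                    <groupId>com.sap.cloud.sdk.datamodel</groupId>
                    <artifactId>odata-generator-maven-plugin</artifactId>
                    <version>3.13.0</version>
                    <executions>
                        <execution>
                            <id>generate-consumption</id>
                            <phase>process-resources</phase>
                            <goals>
                                <goal>generate</goal>
                            </goals>
                            <configuration>
                                <overwriteFiles>true</overwriteFiles>
                                <inputDirectory>/src/main/resources/connectedsystem/edmx</inputDirectory>
                                <outputDirectory>${project.basedir}/src/gen/java</outputDirectory>
                                <deleteOutputDirectory>false</deleteOutputDirectory>
                                <packageName>com.sap.requisitioning.vdm</packageName>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </profile>
</profiles>
```

With this in place, mvn clean install skips generation, while mvn clean install -DgenerateVdm (or mvn clean install -Pgenerate-vdm) runs it. The output directory src/gen/java from the question lies outside target, so a default clean should leave the generated sources alone.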

How to execute integration tests for our own OData service in SAP Cloud SDK

We currently provide our own OData service in our Spring Boot application using the SAP Cloud Platform Provisioning SDK, which is part of the SAP Cloud SDK. We are creating integration tests in the respective Maven module, but when executing them via Maven they fail with the following stack trace:
[http-nio-auto-1-exec-1] ERROR com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate - Error initializing the service <service-name>
java.lang.IllegalArgumentException: URI is not hierarchical
at java.io.File.<init>(File.java:418)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.getFilefromFileName(CDXRuntimeDelegate.java:410)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.getFileForService(CDXRuntimeDelegate.java:387)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.initialize(CDXRuntimeDelegate.java:252)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.getModelProvider(CDXRuntimeDelegate.java:204)
at com.sap.gateway.core.api.provider.delegate.ProviderFactory.createModelProvider(ProviderFactory.java:202)
at com.sap.gateway.core.api.provider.delegate.ProviderFactory.getEdmModelProvider(ProviderFactory.java:128)
at com.sap.gateway.core.odata4sap.ServiceFactory.createService(ServiceFactory.java:135)
Looking at the code this seems to be related to the following post:
Why is my URI not hierarchical?
In the SDK, the OData EDMX file is read as a file. During Maven execution, however, it sits inside a separate JAR file (that of the application module), so it cannot be accessed that way. Instead, it would need to be read as a stream, which in turn seems to require some refactoring.
As a workaround, I copied the EDMX file to src/test/resources/edmx of the integration-tests module.
I'm now wondering whether I am missing something here, or whether executing the integration tests the way it is usually done with the SAP Cloud SDK is not compatible with the provisioning framework.
Although I'm not too familiar with the use case you explained, I would recommend checking out the Maven documentation on additional resource folders. You can probably point your integration-tests module to the respective /resources folder of application modules, in addition to its own /resources folder. I think relative paths should be possible.
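A minimal sketch of that suggestion, placed in the pom.xml of the integration-tests module, might look like this. The relative path ../application/src/main/resources is a placeholder; the actual name of the application module is not given in the question.

```xml
<build>
    <testResources>
        <!-- Keep the module's own test resources -->
        <testResource>
            <directory>src/test/resources</directory>
        </testResource>
        <!-- Placeholder path: adjust to the module that holds the EDMX files -->
        <testResource>
            <directory>../application/src/main/resources</directory>
        </testResource>
    </testResources>
</build>
```

Both directories are then copied into target/test-classes during process-test-resources, so the EDMX files should be available as plain files on the test classpath rather than only inside the application JAR.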
As an alternative to what Alexander already posted, you could also automate the copying of the files via Maven, like in this snippet:
<plugin>
    <artifactId>maven-resources-plugin</artifactId>
    <version>2.6</version>
    <executions>
        <!-- Copying the edmx files to the integration-tests project -->
        <execution>
            <id>copy-resources</id>
            <phase>validate</phase>
            <goals>
                <goal>copy-resources</goal>
            </goals>
            <configuration>
                <outputDirectory>${basedir}/src/test/resources/edmx</outputDirectory>
                <resources>
                    <resource>
                        <directory>${project.parent.basedir}/srv/src/main/resources/edmx</directory>
                        <filtering>true</filtering>
                    </resource>
                </resources>
            </configuration>
        </execution>
        <execution>
            <id>default-testResources</id>
            <phase>process-test-resources</phase>
            <goals>
                <goal>testResources</goal>
            </goals>
        </execution>
        <execution>
            <id>default-resources</id>
            <phase>process-resources</phase>
            <goals>
                <goal>resources</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Spark application with Jackson version 2.8 is incompatible with Apache Spark 1.6

I have a Spark application that uses the Jackson 2.8 API, with Spark 1.6 as a provided-scope dependency in the application's pom.xml. When I deploy the Spark application in cluster mode, the older Jackson version bundled with the Spark 1.6 build is picked up, causing the application to fail.
I tried supplying the Jackson 2.8 jar through the "--jars" option, building an uber application jar with the latest Jackson dependency included, and setting the userClasspathFirst option on the executor/driver. None of these options helped.
I placed the latest Jackson jar on all Spark worker nodes in the same location and added the path to the executor classpath option. Only with this approach is the latest Jackson version picked up. The disadvantage is that every time I add a new worker node, I have to place the latest Jackson jar on it. If someone has a better solution, please let me know.
You can try to shade Jackson. For example, in Maven you would do something like this:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <shadedArtifactAttached>true</shadedArtifactAttached>
                        <relocations>
                            <relocation>
                                <!-- Jackson 2.x classes live under com.fasterxml.jackson -->
                                <pattern>com.fasterxml.jackson</pattern>
                                <shadedPattern>com.mycompany.shaded.com.fasterxml.jackson</shadedPattern>
                            </relocation>
                        </relocations>
                        <finalName>FatJarName</finalName>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The idea is that shading renames the Jackson packages inside your jar and rewrites your own code's references to point at the renamed copies. Then submit the new fat jar.
Note: This does not always work (specifically, if you use reflection to access Jackson, it can end up pointing to the wrong version).

Spark: Avoiding Namespace Conflict when building modified spark

I am building a custom Spark into a jar file, and I want to use it alongside the default Spark build.
How do I change the namespace from org.apache.spark.allOfSpark into org.another.spark.allOfSpark without going through all files?
I want to do this in order to avoid conflict when importing modules. Thanks in advance.
Depending on the build tool you are using, you could use Maven's relocation feature to move your custom spark into a new package at build-time. There are similar features in sbt and other build tools.
If you specify what you are using to build your project, I can further help on your issue.
-- UPDATE
Here is some sample code for your pom.xml that should help you get started:
<project>
    <!-- Your project definition here, with the groupId, artifactId, and its dependencies -->
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <relocations>
                                <relocation>
                                    <pattern>org.apache.spark</pattern>
                                    <shadedPattern>shaded.org.apache.spark</shadedPattern>
                                </relocation>
                            </relocations>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
This will effectively move all of Spark into a new package called shaded.org.apache.spark when you package your application (when you ask Maven to produce a jar).
If you need to exclude certain packages, you can use the <exclude> tag as shown in Maven's relocation documentation.
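As an illustration, a relocation with an exclusion could be sketched as follows. The excluded pattern org.apache.spark.sql.* is an arbitrary choice for the example, not something taken from the question.

```xml
<relocation>
    <pattern>org.apache.spark</pattern>
    <shadedPattern>shaded.org.apache.spark</shadedPattern>
    <excludes>
        <!-- Hypothetical exclusion: classes matching this pattern keep their original package -->
        <exclude>org.apache.spark.sql.*</exclude>
    </excludes>
</relocation>
```

This block would replace the plain relocation entry inside the relocations element of the shade plugin configuration.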
If what you are trying to achieve is simply to customize some parts of Spark, I would advise you either to fork Spark's code, directly rewrite parts of MLlib, and then build it just for yourself (or contribute it to the community if it could be useful),
or to simply pull Spark as a dependency from Maven and override only the classes you are modifying. Maven should then use your own class instead of the one in the original Spark package.
