Spark: Avoiding Namespace Conflict when building modified spark - apache-spark

I am building a customized Spark into a jar file, and I want to use it alongside the default Spark build.
How do I change the namespace from org.apache.spark.allOfSpark to org.another.spark.allOfSpark without going through all the files?
I want to do this to avoid conflicts when importing modules. Thanks in advance.

Depending on the build tool you are using, you could use Maven's relocation feature to move your custom Spark into a new package at build time. sbt and other build tools have similar features.
If you specify what you are using to build your project, I can help further with your issue.
-- UPDATE
Here is a sample for your pom.xml that should help you get started:
<project>
<!-- Your project definition here, with the groupId, artifactId, and its dependencies -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>org.apache.spark</pattern>
<shadedPattern>shaded.org.apache.spark</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
This effectively moves all of Spark into a new package called shaded.org.apache.spark when you package your application (that is, when you ask Maven to produce a jar).
If you need to exclude certain packages, you can use the <exclude> tag, as described in the Maven Shade Plugin documentation on relocation.
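As a minimal sketch of such an exclusion: the relocation below keeps one subpackage under its original name. The excluded package (org.apache.spark.launcher) is only an illustrative assumption and does not come from the question; substitute whichever packages must not be moved.

```xml
<relocation>
  <pattern>org.apache.spark</pattern>
  <shadedPattern>shaded.org.apache.spark</shadedPattern>
  <excludes>
    <!-- Hypothetical choice: classes matching this pattern keep their original package -->
    <exclude>org.apache.spark.launcher.*</exclude>
  </excludes>
</relocation>
```

This element would replace the plain <relocation> entry inside <relocations> in the plugin configuration above.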
If what you are trying to achieve is simply to customize some parts of Spark, I would advise you to either fork Spark's code, rewrite the relevant parts of MLlib directly, and build it just for yourself (or contribute it to the community if it could be useful),
or pull Spark in as a dependency from Maven and override only the classes you are modifying. Your own class should then be used instead of the one in the original Spark package, provided it comes first on the classpath.

Related

Quarkus best way to provide beans from library

I want to provide quarkus beans from a separate codebase, brought in as a dependency. What is the best way to do this?
My first thought was to find the artifact that provides annotations such as @ApplicationScoped and make it a dependency of my library, but after some searching the correct dependency isn't obvious.
I have also seen extensions, but making an extension feels fairly heavy; I don't need to change how Quarkus runs, just define some beans in a library.
I wish I could provide more in this question, but I am unsure where to go from here in terms of best practice.
Besides using a producer method, as suggested by @Turing75, you can enable bean discovery by generating a Jandex index for your library:
A dependency with a Jandex index is automatically scanned for beans.
To generate the index just add the following to your pom.xml:
<build>
<plugins>
<plugin>
<groupId>org.jboss.jandex</groupId>
<artifactId>jandex-maven-plugin</artifactId>
<version>1.2.2</version>
<executions>
<execution>
<id>make-index</id>
<goals>
<goal>jandex</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>

Error while read or write Parquet format data

I have created an external table pointing to Azure ADLS with Parquet storage, and while inserting data into that table I get the error below. I am using Databricks for the execution.
org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
This was working perfectly fine yesterday, and I started getting this error today.
I couldn't find any answer on the internet as to why this is happening.
This issue has been fixed. The cause of the error was that we had installed the Spark SQL DB connector provided by Azure as an uber jar, which also bundled dependencies for the Parquet file format.
If you want a workaround without cleaning up dependencies, here is how you choose one of the sources (exemplified with "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"):
Replace:
spark.read.parquet("<path_to_parquet_file>")
With
spark.read.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").load("<path_to_parquet_file>")
You may have more than one jar file of the same module in the spark/jars/ directory, for example spark-sql_2.12-2.4.4 and spark-sql_2.12-3.0.3, which can lead to this multiple-sources issue.
I had a similar issue, which was caused by a jar dependency conflict. I use Maven to package my Spark jar with the maven-shade-plugin, and I configured the plugin to exclude the conflicting jars. This works for me.
This is the relevant part of the pom.xml:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.4.0</version>
<configuration>
<artifactSet>
<excludes>
<exclude>org.scala-lang:*:*</exclude>
<exclude>org.apache.spark:*:*</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<finalName>spark_anticheat_shaded</finalName>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
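A related sketch, not taken from the answer above: declaring Spark with provided scope also keeps it out of the shaded jar, because the shade goal bundles only runtime-scope dependencies. The artifact name and the spark.version property are assumptions for illustration; use the coordinates that match your cluster.

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <!-- Assumed artifact; pick the Spark module and Scala suffix you actually use -->
  <artifactId>spark-sql_2.12</artifactId>
  <!-- Hypothetical property; define it in <properties> or inline the version -->
  <version>${spark.version}</version>
  <!-- provided: available at compile time, supplied by the cluster at run time -->
  <scope>provided</scope>
</dependency>
```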

Conditional VDM Generation odata-generator-maven-plugin parameters

I am using the following Maven plugin to generate the VDMs for OData consumption.
<plugin>
<groupId>com.sap.cloud.sdk.datamodel</groupId>
<artifactId>odata-generator-maven-plugin</artifactId>
<version>3.13.0</version>
<executions>
<execution>
<id>generate-consumption</id>
<phase>process-resources</phase>
<goals>
<goal>generate</goal>
</goals>
<configuration>
<overwriteFiles>true</overwriteFiles>
<inputDirectory>/src/main/resources/connectedsystem/edmx</inputDirectory>
<outputDirectory>${project.basedir}/src/gen/java</outputDirectory>
<deleteOutputDirectory>false</deleteOutputDirectory>
<packageName>com.sap.requisitioning.vdm</packageName>
</configuration>
</execution>
</executions>
</plugin>
However, I do not want the VDMs to be generated in every Maven build.
I would like to achieve the following behaviour:
VDMs are not generated by mvn clean install by default
VDM classes are generated when we pass some explicit parameter: mvn clean install -D<>
Could you please suggest how this can be achieved?
Regards
atanu
You can use Maven profiles to achieve this. Declare the plugin under a specific profile that is only active when a specific parameter is passed.
Additionally, take care that the generated sources are not deleted when running clean. This could happen if you generate them into the build output directory (typically target).
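A minimal sketch of such a profile, assuming a property named generateVdm (the profile id and property name are hypothetical; the plugin configuration is copied from the question). The plugin would be moved out of the main <build> section into this profile:

```xml
<profiles>
  <profile>
    <!-- Hypothetical profile id -->
    <id>generate-vdm</id>
    <activation>
      <!-- Active whenever the (hypothetical) property generateVdm is defined -->
      <property>
        <name>generateVdm</name>
      </property>
    </activation>
    <build>
      <plugins>
        <plugin>
          <groupId>com.sap.cloud.sdk.datamodel</groupId>
          <artifactId>odata-generator-maven-plugin</artifactId>
          <version>3.13.0</version>
          <executions>
            <execution>
              <id>generate-consumption</id>
              <phase>process-resources</phase>
              <goals>
                <goal>generate</goal>
              </goals>
              <configuration>
                <overwriteFiles>true</overwriteFiles>
                <inputDirectory>/src/main/resources/connectedsystem/edmx</inputDirectory>
                <outputDirectory>${project.basedir}/src/gen/java</outputDirectory>
                <deleteOutputDirectory>false</deleteOutputDirectory>
                <packageName>com.sap.requisitioning.vdm</packageName>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>
  </profile>
</profiles>
```

With this in place, mvn clean install would skip the generation, while mvn clean install -DgenerateVdm (or -Pgenerate-vdm) would run it.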

How to execute integrationtests for own OData service in SAP Cloud SDK

We currently provide our own OData service in our Spring Boot application using the SAP Cloud Platform Provisioning SDK, which is part of the SAP Cloud SDK. We are creating integration tests in the respective Maven module, but when executing them via Maven, they fail with the following stack trace:
[http-nio-auto-1-exec-1] ERROR com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate - Error initializing the service <service-name>
java.lang.IllegalArgumentException: URI is not hierarchical
at java.io.File.<init>(File.java:418)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.getFilefromFileName(CDXRuntimeDelegate.java:410)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.getFileForService(CDXRuntimeDelegate.java:387)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.initialize(CDXRuntimeDelegate.java:252)
at com.sap.cloud.sdk.service.prov.v2.rt.cdx.CDXRuntimeDelegate.getModelProvider(CDXRuntimeDelegate.java:204)
at com.sap.gateway.core.api.provider.delegate.ProviderFactory.createModelProvider(ProviderFactory.java:202)
at com.sap.gateway.core.api.provider.delegate.ProviderFactory.getEdmModelProvider(ProviderFactory.java:128)
at com.sap.gateway.core.odata4sap.ServiceFactory.createService(ServiceFactory.java:135)
Looking at the code, this seems to be related to the following post:
Why is my URI not hierarchical?
In the SDK, the OData EDMX file is read as a file. However, during Maven execution it sits inside a separate JAR file (that of the application module), so it cannot be accessed that way. Instead, it would need to be read as a stream, which in turn seems to require some refactoring.
As a workaround, I copied the EDMX file to src/test/resources/edmx in the integration-tests module.
I'm now wondering whether I am missing something here, or whether running the integration tests the way it is usually done with the SAP Cloud SDK is incompatible with the provisioning framework.
Although I'm not too familiar with the use case you explained, I would recommend checking out the Maven documentation on additional resource folders. You can probably point your integration-tests module to the respective /resources folder of application modules, in addition to its own /resources folder. I think relative paths should be possible.
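A rough sketch of that idea for the integration-tests module's pom.xml. The relative location of the application module (srv under the parent project) is an assumption about the project layout. Note that declaring <testResources> replaces the default, so the module's own folder has to be listed as well:

```xml
<build>
  <testResources>
    <!-- The module's own test resources (the default, restated explicitly) -->
    <testResource>
      <directory>src/test/resources</directory>
    </testResource>
    <!-- Assumed path to the application module's resources, including its edmx folder -->
    <testResource>
      <directory>${project.parent.basedir}/srv/src/main/resources</directory>
    </testResource>
  </testResources>
</build>
```

Both folders are then copied to target/test-classes, so the EDMX file is available as a plain file on disk during the test run instead of inside a JAR.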
As an alternative to what Alexander already posted, you could also automate the copying of the files via maven, like in this snippet:
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>2.6</version>
<executions>
<!-- Copying the edmx files to the integration-tests project -->
<execution>
<id>copy-resources</id>
<phase>validate</phase>
<goals>
<goal>copy-resources</goal>
</goals>
<configuration>
<outputDirectory>${basedir}/src/test/resources/edmx</outputDirectory>
<resources>
<resource>
<directory>${project.parent.basedir}/srv/src/main/resources/edmx</directory>
<filtering>true</filtering>
</resource>
</resources>
</configuration>
</execution>
<execution>
<id>default-testResources</id>
<phase>process-test-resources</phase>
<goals>
<goal>testResources</goal>
</goals>
</execution>
<execution>
<id>default-resources</id>
<phase>process-resources</phase>
<goals>
<goal>resources</goal>
</goals>
</execution>
</executions>
</plugin>

Spark application with Jackson version 2.8 is incompatible with Apache Spark 1.6

I have a Spark application that uses the Jackson 2.8 API, with Spark 1.6 as a provided-scope dependency in the application's pom.xml. When I deploy the Spark application in cluster mode, the older Jackson version from the Spark 1.6 build is picked up, causing the application to fail.
I tried supplying the Jackson 2.8 jar through the "--jars" option, building an uber application jar with the latest Jackson dependency included, and setting the userClasspathFirst option on the executor/driver. None of these options helped.
I placed the latest Jackson jar on all Spark worker nodes in the same location and added the path to the executor classpath option. Only with this approach is the latest Jackson version picked up. The disadvantage is that every time I add a new worker node, I have to place the latest Jackson jar on it. If someone has a better solution, please let me know.
You can try to shade Jackson. For example, in Maven you would do something like this:
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<shadedArtifactAttached>true</shadedArtifactAttached>
<relocations>
<relocation>
<pattern>com.fasterxml.jackson</pattern>
<shadedPattern>com.mycompany.shaded.com.fasterxml.jackson</shadedPattern>
</relocation>
</relocations>
<finalName>FatJarName</finalName>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
The idea is that the plugin renames the Jackson packages bundled in your jar and rewrites your own code's references to use the renamed packages. Then submit the new fat jar.
Note: This does not always work (specifically, if you use reflection to access Jackson, it can point to the wrong version).

Resources