How to set Unmodifiable collection serializer of Kryo in Spark code - apache-spark

I am using Kryo serialization in Spark (v1.6.1) with Java, and while serializing a class that has a collection field, it throws the following error:
Caused by: java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.add(Collections.java:1055)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:102)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 27 more
I found out that this is because Kryo's default CollectionSerializer cannot deserialize the collection, since it is unmodifiable, and that UnmodifiableCollectionsSerializer should be used instead.
How do I tell Spark specifically to use UnmodifiableCollectionsSerializer for Kryo?
My current configuration is:
SparkConf conf = new SparkConf().setAppName("ABC");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class<?>[] {*list of classes I want to register*});

In case anybody else faces this issue, here is the solution: I got it working by using the javakaffee kryo-serializers library.
Add the following Maven dependency:
<dependency>
    <groupId>de.javakaffee</groupId>
    <artifactId>kryo-serializers</artifactId>
    <version>0.42</version>
</dependency>
Write a custom Kryo registrator to register UnmodifiableCollectionsSerializer:
import com.esotericsoftware.kryo.Kryo;
import de.javakaffee.kryoserializers.UnmodifiableCollectionsSerializer;
import org.apache.spark.serializer.KryoRegistrator;

public class CustomKryoRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        UnmodifiableCollectionsSerializer.registerSerializers(kryo);
    }
}
Set spark.kryo.registrator to the custom registrator's fully-qualified name
conf.set("spark.kryo.registrator", "com.abc.CustomKryoRegistrator");
References -
https://github.com/magro/kryo-serializers
Spark Kryo: Register a custom serializer

Related

Quarkus unable to load the cassandra custom retry policy class

I am working on a task to migrate Quarkus from 1.x to 2.x, and the Quarkus integration with embedded Cassandra fails in unit testing with this error:
Caused by: java.lang.IllegalArgumentException: Can't find class com.mind.common.connectors.cassandra.CassandraCustomRetryPolicy
(specified by advanced.retry-policy.class)
Custom retry policy:
public class CassandraCustomRetryPolicy implements RetryPolicy {
    public CassandraCustomRetryPolicy(DriverContext context, String profileName) {
    }
    // override methods
}
The Quarkus test looks like this:
@QuarkusTest
@QuarkusTestResource(CassandraTestResource.class)
class Test {}
The CassandraTestResource class starts the embedded Cassandra:
public class CassandraTestResource implements QuarkusTestResourceLifecycleManager {
    private Cassandra cassandra;

    @Override
    public Map<String, String> start() {
        cassandra = new CassandraBuilder().version("3.11.9")
                .addEnvironmentVariable("JAVA_HOME", getJavaHome())
                .addJvmOptions("-Xms512M -Xmx512m")
                .build();
        cassandra.start();
        return Collections.emptyMap(); // assumption: no extra config properties are returned to the test
    }
}
I have overridden the default Cassandra driver retry policy in application.conf inside the resources folder:
datastax-java-driver {
    basic.request {
        timeout = ****
        consistency = ***
        serial-consistency = ***
    }
    advanced.retry-policy {
        class = com.mind.common.connectors.cassandra.CassandraCustomRetryPolicy
    }
}
I have observed that my custom retry policy class is treated as a banned resource in QuarkusClassLoader.java:
String resourceName = sanitizeName(name).replace('.', '/') + ".class";
boolean parentFirst = parentFirst(resourceName, state);
if (state.bannedResources.contains(resourceName)) {
    throw new ClassNotFoundException(name);
}
I have captured the following logs:
java.lang.ClassNotFoundException: com.mind.common.connectors.cassandra.CassandraCustomRetryPolicy
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:438)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:414)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:315)
at com.datastax.oss.driver.internal.core.util.Reflection.loadClass(Reflection.java:57)
at com.datastax.oss.driver.internal.core.util.Reflection.resolveClass(Reflection.java:288)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:235)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfigProfiles(Reflection.java:194)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildRetryPolicies(DefaultDriverContext.java:359)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getRetryPolicies(DefaultDriverContext.java:761)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.init(DefaultSession.java:339)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.access$1100(DefaultSession.java:300)
at com.datastax.oss.driver.internal.core.session.DefaultSession.lambda$init$0(DefaultSession.java:146)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at io.netty.channel.DefaultEventLoop.run(DefaultEventLoop.java:54)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
I am using Quarkus version 2.7.2.Final with Cassandra driver version 4.14.0.
It's not a complete answer but I wanted to leave some notes here in case anybody else can get this over the finish line before I get back to it.
The underlying problem here is that in the Quarkus test case described above, the Java driver code is loaded by the QuarkusClassLoader, which (a) is more restrictive about where it loads code from and (b) doesn't appear to readily delegate to its parent when necessary. So in this case, executing the following in the test will fail with a ClassNotFoundException:
CqlSession.class.getClassLoader().loadClass(customRetryPolicyClassName)
while the following works without issue:
CqlSession.class.getClassLoader().getParent().loadClass(customRetryPolicyClassName)
The class loader used to load CqlSession is the QuarkusClassLoader instance, while its parent is a stock JVM class loader.
The Java driver uses Class.forName() to load the class specified for this policy. But since the Quarkus class loader was used to load the driver code itself, that's also the loader used for these reflection operations... and as mentioned above, that loader has some specific characteristics that make loading external code harder.
It worked after I initialized the CQL session like this:
CqlSession.builder()
    .addContactPoint(new InetSocketAddress(settings.getAddress(), settings.getPort()))
    .withLocalDatacenter("***")
    .withClassLoader(Thread.currentThread().getContextClassLoader())
    .build();

Marshalling with JaxbDataFormat in Apache Camel

In all the examples I see on the internet, I need to call "setContext", but this method doesn't exist in JaxbDataFormat. Any idea or alternative way to convert from XML to JSON?
// XML Data Format
JaxbDataFormat xmlDataFormat = new JaxbDataFormat();
JAXBContext con = JAXBContext.newInstance(Employee.class);
xmlDataFormat.setContext(con); // <-- this call is the one that cannot be resolved

// JSON Data Format
JacksonDataFormat jsonDataFormat = new JacksonDataFormat(Employee.class);

from("file:C:/inputFolder")
    .doTry()
        .unmarshal(xmlDataFormat)
        .process(new MyProcessor())
        .marshal(jsonDataFormat)
        .to("jms:queue:javainuse")
    .doCatch(Exception.class)
        .process(new Processor() {
            public void process(Exchange exchange) throws Exception {
                Exception cause = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
                System.out.println(cause);
            }
        });
Thanks!
You can't find this method because either you aren't importing this class:
import org.apache.camel.converter.jaxb.JaxbDataFormat;
or because you aren't adding this dependency in your pom.xml:
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-jaxb</artifactId>
    <version>${camel-version}</version>
</dependency>
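With camel-jaxb on the classpath, the setContext call from the question compiles. A minimal, hedged sketch of the route (EmployeeRoute is a made-up class name, Employee is assumed to be a JAXB-annotated class as in the question, and camel-jackson and camel-jms are assumed to be on the classpath as well):

import javax.xml.bind.JAXBContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.jackson.JacksonDataFormat;
import org.apache.camel.converter.jaxb.JaxbDataFormat;

public class EmployeeRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // JAXB (XML) data format -- setContext is declared on the camel-jaxb JaxbDataFormat
        JaxbDataFormat xmlDataFormat = new JaxbDataFormat();
        xmlDataFormat.setContext(JAXBContext.newInstance(Employee.class));

        // Jackson (JSON) data format
        JacksonDataFormat jsonDataFormat = new JacksonDataFormat(Employee.class);

        // Read XML files, convert them to JSON, and send them to the JMS queue from the question
        from("file:C:/inputFolder")
            .unmarshal(xmlDataFormat)
            .marshal(jsonDataFormat)
            .to("jms:queue:javainuse");
    }
}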
You can check my reference archive here https://metiago.github.io/dev/2017/02/12/java-apache-camel-xml.html?query=camel for a better example.

Cassandra List<UDT> not getting deserialized using datastax java driver 4.0.0

I am currently working on a new project and chose Cassandra as our data store.
I have a use case where I store prices for materials, and to accomplish it I created a list of User-Defined Types (UDTs). Unfortunately, deserialization with the DataStax driver doesn't work as expected: after querying the required data, I found that the list field is null even though there is a value for it in the database. Is this a current limitation of the Cassandra Java driver, or am I missing something?
This is what my simplified entity (table) looks like:
@PrimaryKeyColumn(name = "tenant_id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
private long tenantId;

@PrimaryKeyColumn(name = "item_id", ordinal = 1, type = PrimaryKeyType.CLUSTERED)
private String itemId;

@CassandraType(type = DataType.Name.LIST, userTypeName = "volume_scale_1")
private List<VolumeScale> volumeScale1;
}
So I am getting volumeScale1 as null after the database select query.
And this is what my UDT looks like:
In the Cassandra database:
CREATE TYPE pricingservice.volume_scale (
    from_scale int,
    to_scale int,
    value frozen<price_value>
);
As a UDT in Java:
@UserDefinedType("volume_scale")
public class VolumeScale {
    @CassandraType(type = DataType.Name.TEXT, userTypeName = "from_scale")
    @Column("from_scale")
    private String fromScale;

    @CassandraType(type = DataType.Name.TEXT, userTypeName = "to_scale")
    @Column("to_scale")
    private String toScale;

    @CassandraType(type = DataType.Name.UDT, userTypeName = "value")
    private PriceValue value;

    // getter and setter
}
I also tried using the Object Mapper from the Java driver itself, as per @Alex's suggestion, but got stuck at the point where creating an object via ItemPriceByMaterialMapperBuilder throws a compilation error. Is anything additional required for annotation processing, or am I missing something? Do you have any idea how to use the Mapper annotation? I also tried Google AutoService to run the annotation processing externally, but it didn't work.
@Mapper
//@AutoService(Processor.class)
public interface ItemPriceByMaterialMapper
        // extends Processor
{
    static MapperBuilder<ItemPriceByMaterialMapper> builder(CqlSession session) {
        return new ItemPriceByMaterialMapperBuilder(session);
    }

    @DaoFactory
    ItemPriceByMaterialDao itemPriceByMaterialDao();

    // @DaoFactory
    // ItemPriceByMaterialDao itemPriceByMaterialDao(@DaoKeyspace CqlIdentifier keyspace);
}
Version used:
Java Version: 1.8
DataStax OSS java-driver-mapper-processor: 4.5.1
DataStax OSS java-driver-mapper-runtime: 4.5.1
Cassandra: 3.11.4
Spring Boot Framework: 2.2.4.RELEASE
From what I understand, you have multiple problems: if you're using Spring Data Cassandra, then you'll get an older driver (3.7.2 for Spring 2.2.6-RELEASE), and it may clash with the driver 4.0.0 that you're trying to use (it's too old anyway; don't use it). Driver 4.x isn't binary compatible with previous drivers, and support for it in Spring Data Cassandra may only come in the next major release of Spring.
Instead of Spring Data you can use the Object Mapper from the Java driver itself; it could be better optimized than the Spring version.
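As a side note on the compilation error around ItemPriceByMaterialMapperBuilder: that class is generated by the driver's annotation processor, so it only exists once the processor actually runs at compile time. A hedged sketch of the usual Maven wiring (version taken from the question; adjust to your own build):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <annotationProcessorPaths>
            <path>
                <groupId>com.datastax.oss</groupId>
                <artifactId>java-driver-mapper-processor</artifactId>
                <version>4.5.1</version>
            </path>
        </annotationProcessorPaths>
    </configuration>
</plugin>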
I decided not to use the Object Mapper and to stay with Spring Data Cassandra on Spring 2.2.6-RELEASE. Thanks!

How to restrict a column of my POJO from being part of Ignite tables

I have a POJO that I am using to create Ignite caches. Now I want to add one more column (XXX) to that POJO, but I don't want that column (XXX) to be part of the Ignite cache creation. Adding the field currently leads to the following error, because the generated CQL references a column that does not exist in the backing Cassandra table:
Caused by: class org.apache.ignite.IgniteException: Failed to prepare Cassandra CQL statement: select "customer_ref", "tenant_id", "event_discount_id", "period_num", "domain_id", "event_source", "prod_group_id", "event_seq", "product_seq", "online_version_num", "total_authorised_mny", "version_num", "bonus_count", "customer_category", "recovery_status", "total_discounted_usage", "external_balance_liid", "total_online_discounted_mny", "anti_event_disc_mny", "total_partials_mny", "counter_usage", "total_partials_usage", "online_event_count", "event_count", "last_rated_dtm", "account_num", "dyn_alloc_charge_data", "anti_event_disc_usage", "total_usage", "anti_event_count", "last_online_event_dtm", "fast_cache_seq", "total_discounted_mny", "latest_event_dtm", "total_online_discounted_usage", "carried_over_boo", "total_authorised_usage", "total_otc_mny", "online_batch_info", "counter_resets", "total_bonus_award" from "smart"."custprodinvoicediscusage" where "customer_ref"=? and "tenant_id"=? and "event_discount_id"=? and "period_num"=? and "domain_id"=? and "event_source"=? and "prod_group_id"=? and "event_seq"=? and "product_seq"=?;
at org.apache.ignite.cache.store.cassandra.session.CassandraSessionImpl.prepareStatement(CassandraSessionImpl.java:603)
at org.apache.ignite.cache.store.cassandra.session.CassandraSessionImpl.execute(CassandraSessionImpl.java:201)
... 12 more
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Undefined column name recovery_status
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:104)
at org.apache.ignite.cache.store.cassandra.session.CassandraSessionImpl.prepareStatement(CassandraSessionImpl.java:585)
... 13 more
Ignite uses the getter and setter methods to read and write fields. I changed the accessor signatures in that POJO so they no longer follow the getXXX()/setXXX() convention:
public void putRecovery_status(Integer RECOVERY_STATUS) {
    this.RECOVERY_STATUS = RECOVERY_STATUS;
}

public Integer fetchRecovery_status() {
    return RECOVERY_STATUS;
}
If you don't want a field to end up in the Ignite cache, you can probably mark it transient.
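A minimal sketch of that idea (class and field names here are placeholders, not taken from the original cache configuration, and whether the Cassandra store skips the field also depends on your POJO persistence settings):

import java.io.Serializable;

public class CacheValue implements Serializable {
    // Regular field: exposed through a getter/setter pair, so it is mapped to a column.
    private Integer recoveryStatus;

    // transient: skipped by Java serialization; no getter/setter is provided either,
    // so the accessor-based persistence strategy should not pick it up.
    private transient Integer xxx;

    public Integer getRecoveryStatus() {
        return recoveryStatus;
    }

    public void setRecoveryStatus(Integer recoveryStatus) {
        this.recoveryStatus = recoveryStatus;
    }
}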

GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

The original question was about deploying Spark 1.4 on Google Cloud. After downloading the tarball and setting
SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'
deployment with bdutil was fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception:
java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
And what confused me is that GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem, and I even verified this in the same spark-shell instance:
scala> var gfs = new com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem()
gfs: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c

scala> gfs.isInstanceOf[org.apache.hadoop.fs.FileSystem]
res3: Boolean = true

scala> gfs.asInstanceOf[org.apache.hadoop.fs.FileSystem]
res4: org.apache.hadoop.fs.FileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c
Did I miss anything, any workaround? Thanks in advance!
UPDATE: this is my bdutil (version 1.3.1) setting for deployment:
import_env hadoop2_env.sh
import_env extensions/spark/spark_env.sh
CONFIGBUCKET="my_conf_bucket"
PROJECT="my_proj"
GCE_IMAGE='debian-7-backports'
GCE_MACHINE_TYPE='n1-highmem-4'
GCE_ZONE='us-central1-f'
GCE_NETWORK='my-network'
GCE_MASTER_MACHINE_TYPE='n1-standard-2'
PREEMPTIBLE_FRACTION=1.0
PREFIX='my-hadoop'
NUM_WORKERS=8
USE_ATTACHED_PDS=true
WORKER_ATTACHED_PDS_SIZE_GB=200
MASTER_ATTACHED_PD_SIZE_GB=200
HADOOP_TARBALL_URI="gs://hadoop-dist/hadoop-2.6.0.tar.gz"
SPARK_MODE="yarn-client"
SPARK_HADOOP2_TARBALL_URI='gs://my_conf_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'
Short Answer
Indeed it was related to IsolatedClientLoader, and we've tracked down the root cause and verified a fix. I filed https://issues.apache.org/jira/browse/SPARK-9206 to track this issue, and successfully built a clean Spark tarball from my fork with a simple fix: https://github.com/apache/spark/pull/7549
There are a few short-term options:
Use Spark 1.3.1 for now.
In your bdutil deployment, use HDFS as the default filesystem (--default_fs=hdfs); you'll still be able to directly specify gs:// paths in your jobs, just that HDFS will be used for intermediate data and staging files. There are some minor incompatibilities using raw Hive in this mode, though.
Use the raw val sqlContext = new org.apache.spark.sql.SQLContext(sc) instead of a HiveContext if you don't need HiveContext features.
git clone https://github.com/dennishuo/spark and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball you can specify in your bdutil's spark_env.sh.
Long Answer
We've verified that it only manifests when fs.default.name and fs.defaultFS are set to a gs:// path, regardless of whether you're trying to load a path with parquetFile("gs://...") or parquetFile("hdfs://..."); when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and GCS works fine. This is also specific to Spark 1.4+ currently, and is not present in Spark 1.3.1 or older.
The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493, which actually fixes a prior related classloading issue, SPARK-8368. While that fix itself is correct for normal cases, there's a method, IsolatedClientLoader.isSharedClass, used to determine which classloader to use, and it interacts with the aforementioned commit to break GoogleHadoopFileSystem classloading.
The following lines in that file include everything under com.google.* as a "shared class" because of Guava and possibly protobuf dependencies, which are indeed loaded as shared libraries; unfortunately, GoogleHadoopFileSystem needs to be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem. We just happen, unluckily, to share the com.google.* package namespace.
protected def isSharedClass(name: String): Boolean =
  name.contains("slf4j") ||
    name.contains("log4j") ||
    name.startsWith("org.apache.spark.") ||
    name.startsWith("scala.") ||
    name.startsWith("com.google") ||
    name.startsWith("java.lang.") ||
    name.startsWith("java.net") ||
    sharedPrefixes.exists(name.startsWith)

...

/** The classloader that is used to load an isolated version of Hive. */
protected val classLoader: ClassLoader = new URLClassLoader(allJars, rootClassLoader) {
  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    val loaded = findLoadedClass(name)
    if (loaded == null) doLoadClass(name, resolve) else loaded
  }

  def doLoadClass(name: String, resolve: Boolean): Class[_] = {
    ...
    } else if (!isSharedClass(name)) {
      logDebug(s"hive class: $name - ${getResource(classToPath(name))}")
      super.loadClass(name, resolve)
    } else {
      // For shared classes, we delegate to baseClassLoader.
      logDebug(s"shared class: $name")
      baseClassLoader.loadClass(name)
    }
  }
}
This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:
log4j.logger.org.apache.spark.sql.hive.client=DEBUG
And the output shows:
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
