Unable to consume Kafka Avro records using NiFi and Schema Registry - apache-spark

I'm trying to consume Avro records from Kafka using NiFi. I have 3 topics filled by an Amazon Lambda and 2 Spark Streaming jobs, all of which use the Hortonworks Schema Registry to get the Avro schema.
I tried ConsumeKafkaRecord_0_10 and ConsumeKafkaRecord_2_0 and got the same error:
I tried an AvroReader with the schema pasted in as plain text, to be sure of the one being used, and got the same error.
When I use an AvroReader with the Hortonworks Schema Registry parameter, I get this error:
That could make sense, because it reads the first byte of the record as a version parameter for the schema, and the first byte is 3. But it doesn't explain why I get an ArrayIndexOutOfBoundsException when putting the schema in plain text.
Finally, I can consume those topics just fine using Spark Streaming and the Schema Registry. Has anyone already encountered such an issue between NiFi and AvroReader when consuming from Kafka?
Stack: Hortonworks HDP 3.4.1 // NiFi 1.9.0 // Spark 2.3 // Schema Registry 0.7

The issue is related to how NiFi interprets the first bytes of your Avro message. Those bytes contain information regarding:
Protocol Id - 1 byte
Schema Metadata Id - 8 bytes
Schema Version - 4 bytes
Going through the code of the Hortonworks Schema Registry, we can find that different protocol IDs can be used to serialize your message with the AvroSerDe.
public static final byte CONFLUENT_VERSION_PROTOCOL = 0x0;
public static final byte METADATA_ID_VERSION_PROTOCOL = 0x1;
public static final byte VERSION_ID_AS_LONG_PROTOCOL = 0x2;
public static final byte VERSION_ID_AS_INT_PROTOCOL = 0x3;
public static final byte CURRENT_PROTOCOL = VERSION_ID_AS_INT_PROTOCOL;
Source
The default one used is VERSION_ID_AS_INT_PROTOCOL, which means the first byte of the Avro messages is going to be 0x03.
Going through the NiFi code, we see that it actually supports METADATA_ID_VERSION_PROTOCOL only, expecting a 0x01, and does not take anything else into account.
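To check which protocol a producer actually used, you can peek at the raw bytes of a record before any Avro deserialization. Below is a minimal Java sketch, assuming the records were produced with the Hortonworks AvroSnapshotSerializer described above; the broker address, group id and topic name are placeholders:
import java.nio.ByteBuffer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ProtocolIdProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");                     // placeholder
        props.put("group.id", "protocol-probe");                           // placeholder
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));     // placeholder topic
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                ByteBuffer buf = ByteBuffer.wrap(record.value());
                byte protocolId = buf.get();                               // first byte of the payload
                System.out.printf("protocol id = 0x%02x%n", protocolId);
                if (protocolId == 0x03) {
                    // VERSION_ID_AS_INT_PROTOCOL: a 4-byte schema version id follows
                    System.out.println("schema version id = " + buf.getInt());
                } else if (protocolId == 0x01) {
                    // METADATA_ID_VERSION_PROTOCOL: 8-byte schema metadata id + 4-byte version follow
                    System.out.println("metadata id = " + buf.getLong() + ", version = " + buf.getInt());
                }
                break; // one record is enough to see the header layout
            }
        }
    }
}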
You have to force Spark to use METADATA_ID_VERSION_PROTOCOL when creating your SchemaRegistryConfig.
val config = Map[String, Object](
  "schema.registry.url" -> ConfigManager.config.getProperty("schemaregistry.default.url"),
  AbstractAvroSnapshotSerializer.SERDES_PROTOCOL_VERSION ->
    SerDesProtocolHandlerRegistry.METADATA_ID_VERSION_PROTOCOL.asInstanceOf[Object]
)
implicit val srConfig: SchemaRegistryConfig = SchemaRegistryConfig(config)

Related

Java code executed by Databricks/Spark job - settings for one executor

In our current system there is a Java program that reads one file and generates many JSON documents for the full day (24h); all JSON docs are written to CosmosDB. When I execute it from the console everything is OK. I tried to schedule a Databricks job using the uber-jar file and it failed with the following error:
"Resource with specified id or name already exists."
That seems plausible to me, because the default settings of the existing cluster include many executors, so each executor tries to write the same set of JSON docs to CosmosDB.
So I changed the main method as below:
public static void main(String[] args) {
    SparkConf conf01 = new SparkConf().set("spark.executor.instances", "1").set("spark.executor.cores", "1");
    SparkSession spark = SparkSession.builder().config(conf01).getOrCreate();
    ...
}
But I received the same error "Resource with specified id or name already exists" from CosmosDB.
I would like to have only one executor for this specific Java code. How can I use only one Spark executor?
Any help (link/doc/url/code) will be appreciated.
Thank you !
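A minimal sketch of how one might verify at runtime how many executors the job actually received; on an existing Databricks cluster the SparkSession already exists when the job starts, so executor settings applied inside main() are typically ignored and the count comes from the cluster configuration. The class name below is illustrative:
import org.apache.spark.SparkExecutorInfo;
import org.apache.spark.sql.SparkSession;

public class ExecutorCountCheck {
    public static void main(String[] args) {
        // getOrCreate() returns the session already running on the cluster.
        SparkSession spark = SparkSession.builder().getOrCreate();

        // One entry per executor, plus the driver.
        SparkExecutorInfo[] executors = spark.sparkContext().statusTracker().getExecutorInfos();
        System.out.println("Executors reported (including driver): " + executors.length);
    }
}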

MemSQL Spark Job

I am trying to read a CSV file in a Spark job using a MemSQL Extractor, do some enrichment using a Transformer, and load it into a MemSQL database using Java.
I see there is a memsql-spark interface jar, but I am not finding any useful Java API documentation or examples.
I have started writing an extractor to read from CSV but I don't know how to move further.
public Option<RDD<byte[]>> nextRDD(SparkContext sparkContext, UserExtractConfig config, long batchInterval, PhaseLogger logger) {
    RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
    RDD<byte[]> bytes = inputFile.map(ByteUtils.utf8StringToBytes(filePath), String.class); //compilation error
    return bytes; //compilation error
}
Would appreciate it if someone could point me in some direction to get started...
thanks...
First, configure the Spark connector in Java using the following code:
SparkConf conf = new SparkConf();
conf.set("spark.datasource.singlestore.clientEndpoint", "singlestore-host");
conf.set("spark.datasource.singlestore.user", "admin");
conf.set("spark.datasource.singlestore.password", "s3cur3-pa$$word");
After running the above code, Spark is connected to the database. You can then read the CSV into a Spark DataFrame, transform and manipulate the data according to your requirements, and write the DataFrame to a database table.
Also attaching a link for your reference.
spark-singlestore.
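For illustration, a minimal end-to-end sketch with the SingleStore (formerly MemSQL) Spark connector referenced above; the connection options, file path, column name and table name are placeholders, and this assumes the connector jar is on the classpath:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToSingleStore {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-singlestore")
                // Connection options can also be set here instead of on a SparkConf.
                .config("spark.datasource.singlestore.clientEndpoint", "singlestore-host")
                .config("spark.datasource.singlestore.user", "admin")
                .config("spark.datasource.singlestore.password", "s3cur3-pa$$word")
                .getOrCreate();

        // Read the CSV into a DataFrame (path and header option are assumptions).
        Dataset<Row> csv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/path/to/input.csv");

        // Example enrichment step: keep only rows with a non-null id column (hypothetical column).
        Dataset<Row> enriched = csv.filter("id IS NOT NULL");

        // Write the DataFrame to a SingleStore table ("mydb.my_table" is a placeholder).
        enriched.write()
                .format("singlestore")
                .mode(SaveMode.Append)
                .save("mydb.my_table");

        spark.stop();
    }
}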

Not loading data from Titan graph with Cassandra backend using Gremlin

I have added data in Titan (Cassandra backend) using the Blueprints API with Java. I used the following configuration in Java for inserting data:
TitanGraph getTitanGraph()
{
    conf2 = new BaseConfiguration();
    conf2.setProperty("storage.backend", "cassandra");
    conf2.setProperty("storage.directory", "/some/directory");
    conf2.setProperty("storage.read-only", "false");
    conf2.setProperty("attributes.allow-all", true);
    return TitanFactory.open(conf2);
}
Now I am trying to query that database using Gremlin. I used the following command to load it:
g = TitanFactory.open("bin/cassandra.local");
The following is my cassandra.local file:
conf = new BaseConfiguration();
conf.setProperty("storage.backend","cassandra");
conf.setProperty("storage.hostname","127.0.0.1");
conf.setProperty("storage.read-only", "false");
conf.setProperty("attributes.allow-all", true)
But when I run "g.V", I get an empty graph. Please help.
thanks
Make sure that you commit the changes to your TitanGraph after making graph mutations in your Java program. If you're using Titan 0.5.x, the call is graph.commit(). If you're using Titan 0.9.x, the call is graph.tx().commit().
Note that storage.directory isn't valid for a Cassandra backend, however the default value for storage.hostname is 127.0.0.1 so those should be the same between your Java program and cassandra.local. It might be easier to use a properties file to store your connection properties.
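For illustration, a minimal sketch of the mutation-plus-commit step in the Java program, assuming Titan 1.0 / TinkerPop 3 package names (the property key and value are placeholders):
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class TitanCommitExample {
    public static void main(String[] args) throws Exception {
        BaseConfiguration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "cassandra");
        conf.setProperty("storage.hostname", "127.0.0.1");

        TitanGraph graph = TitanFactory.open(conf);

        // Add a vertex with a placeholder property.
        Vertex v = graph.addVertex();
        v.property("name", "jsmith");

        // Without this commit the mutation stays in the open transaction
        // and is not visible from the Gremlin console.
        graph.tx().commit();   // Titan 0.9.x+ / TinkerPop 3; use graph.commit() on Titan 0.5.x

        graph.close();
    }
}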

Storm Cassandra Integration

I am a newbie in both Storm and Cassandra. I want to use a Bolt to write the strings emitted by a Spout into a column family in Cassandra. I have read the example here, which seems a little bit complex to me, as it uses different classes for writing into the Cassandra DB. Furthermore, I want to know how many times the strings are written into the Cassandra DB. In the example, it is not clear to me how we can control the number of strings entered into the Cassandra DB.
Simply, I need a Bolt to write the strings emitted by a Spout to a Cassandra column family, e.g., 200 records.
Thanks in advance!
You can either use the DataStax Cassandra driver or you can use the storm-cassandra library you posted earlier.
Your requirement is unclear. Do you only want to store 200 tuples?
Anyway, run the topology with sample data and, after the stream is finished, query Cassandra and see what is there.
Apache Storm and Apache Cassandra are quite deep and extensive projects. There is no way around learning them and doing sample projects in order to learn.
Hope this will help.
/* Main class */
TopologyBuilder builder = new TopologyBuilder();

Config conf = new Config();
conf.put("cassandra.keyspace", "Storm_Output");                  // keyspace name
conf.put("cassandra.nodes", "ip-address-of-cassandra-machine");
conf.put("cassandra.port", 9042);                                // port on which Cassandra is running (default: 9042)

builder.setSpout("generator", new RandomSentenceSpout(), 1);
builder.setBolt("counter", new CassandraInsertionBolt(), 1).shuffleGrouping("generator");
builder.setBolt("CassandraBolt", new CassandraWriterBolt(
        async(
            simpleQuery("INSERT INTO Storm_Output.table_name (field1, field2) VALUES (?, ?);")
                .with(
                    fields("field1", "field2")
                )
        )
), 1).globalGrouping("counter");

conf.setDebug(true);
conf.setNumWorkers(1);
StormSubmitter.submitTopologyWithProgressBar("Cassandra-Insertion", conf, builder.createTopology());

/* CassandraInsertionBolt - bolt sending data for insertion into Cassandra */
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
    Random rand = new Random();
    basicOutputCollector.emit(new Values(rand.nextInt(20), rand.nextInt(20)));
}

public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declare(new Fields("field1", "field2"));
}

Cassandra Client - examining columns as Strings

I am new to Cassandra and I am using the Hector Java client for writing to and reading from it.
I have the following code to insert values:
Mutator<String> mutator = HFactory.createMutator(keyspaceOperator, StringSerializer.get());
mutator.insert("jsmith", stdColumnDefinition, HFactory.createStringColumn("first", "John"));
Now, when I get the values back via the Hector client, it works cleanly:
columnQuery.setColumnFamily(stdColumnDefinition).setKey("jsmith").setName("first");
QueryResult<HColumn<String, String>> result = columnQuery.execute();
However, when I try to get the values from the command-line Cassandra client, I get the data as bytes rather than in a human-readable String format. Is there a way to fix this so that I can get the Cassandra client to print Strings?
Here is the sample output:
[default@keyspaceUno] list StandardUno ;
Using default limit of 100
RowKey: 6a736d697468
=> (column=6669727374, value=4a6f686e, timestamp=1317183324576000)
=> (column=6c617374, value=536d697468, timestamp=1317183324606000)
1 Row Returned.
Thanks.
You can either modify the schema so that Cassandra interprets the column as a string by default, or you can tell the CLI how to interpret the data, either on a one-shot basis, or for the rest of the session, using the "assume" command.
See http://www.datastax.com/docs/0.8/dml/using_cli#reading-rows-and-columns for examples.
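For example, using the CLI's assume command for the current session (StandardUno is the column family from the question; utf8 matches the string keys and values inserted above):
assume StandardUno keys as utf8;
assume StandardUno comparator as utf8;
assume StandardUno validator as utf8;
list StandardUno;
After these assumptions, list StandardUno should show jsmith / first / John instead of the hex-encoded bytes.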
