Meta-Question:
We're pulling data from EventHub, running some logic, and saving it to Cosmos. Currently, Cosmos inserts are our bottleneck. How do we maximize our throughput?
Details
We're trying to optimize our Cosmos throughput and there seems to be some contention in the SDK that makes parallel inserts only marginally faster than serial inserts.
We're logically doing:
for (int i = 0; i < insertCount; i++)
{
taskList.Add(InsertCosmos(sdkContainerClient));
}
var parallelTimes = await Task.WhenAll(taskList);
Here are the results comparing serial inserts, parallel inserts, and "faking" an insert (with Task.Delay):
Serial took: 461ms for 20
- Individual times 28,8,117,19,14,11,10,12,5,8,9,11,18,15,79,23,14,16,14,13
Cosmos Parallel
Parallel took: 231ms for 20
- Individual times 17,15,23,39,45,52,72,74,80,91,96,98,108,117,123,128,139,146,147,145
Just Parallel (no cosmos)
Parallel took: 27ms for 20
- Individual times 27,26,26,26,26,26,26,25,25,25,25,25,25,24,24,24,23,23,23,23
Serial is obvious (just add up each value).
No Cosmos (the last timing) is also obvious (the total is roughly the max individual time, since all the delays overlap).
But parallel Cosmos doesn't parallelize nearly as well, which suggests there is some contention.
We're running this on a VM in Azure (same datacenter as Cosmos), have enough RUs so aren't getting 429s, and using Microsoft.Azure.Cosmos 3.2.0.
Full Code Sample
class Program
{
public static void Main(string[] args)
{
CosmosWriteTest().Wait();
}
public static async Task CosmosWriteTest()
{
var cosmosClient = new CosmosClient("todo", new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct });
var database = cosmosClient.GetDatabase("<ourcontainer>");
var sdkContainerClient = database.GetContainer("<ourcontainer>");
int insertCount = 25;
//Warmup
await sdkContainerClient.CreateItemAsync(new TestObject());
//---Serially inserts into Cosmos---
List<long> serialTimes = new List<long>();
var serialTimer = Stopwatch.StartNew();
Console.WriteLine("Cosmos Serial");
for (int i = 0; i < insertCount; i++)
{
serialTimes.Add(await InsertCosmos(sdkContainerClient));
}
serialTimer.Stop();
Console.WriteLine($"Serial took: {serialTimer.ElapsedMilliseconds}ms for {insertCount}");
Console.WriteLine($" - Individual times {string.Join(",", serialTimes)}");
//---Parallel inserts into Cosmos---
Console.WriteLine(Environment.NewLine + "Cosmos Parallel");
var parallelTimer = Stopwatch.StartNew();
var taskList = new List<Task<long>>();
for (int i = 0; i < insertCount; i++)
{
taskList.Add(InsertCosmos(sdkContainerClient));
}
var parallelTimes = await Task.WhenAll(taskList);
parallelTimer.Stop();
Console.WriteLine($"Parallel took: {parallelTimer.ElapsedMilliseconds}ms for {insertCount}");
Console.WriteLine($" - Individual times {string.Join(",", parallelTimes)}");
//---Testing parallelism minus cosmos---
Console.WriteLine(Environment.NewLine + "Just Parallel (no cosmos)");
var justParallelTimer = Stopwatch.StartNew();
var noCosmosTaskList = new List<Task<long>>();
for (int i = 0; i < insertCount; i++)
{
noCosmosTaskList.Add(InsertCosmos(sdkContainerClient, true));
}
var justParallelTimes = await Task.WhenAll(noCosmosTaskList);
justParallelTimer.Stop();
Console.WriteLine($"Parallel took: {justParallelTimer.ElapsedMilliseconds}ms for {insertCount}");
Console.WriteLine($" - Individual times {string.Join(",", justParallelTimes)}");
}
//inserts
private static async Task<long> InsertCosmos(Container sdkContainerClient, bool justDelay = false)
{
var timer = Stopwatch.StartNew();
if (!justDelay)
await sdkContainerClient.CreateItemAsync(new TestObject());
else
await Task.Delay(20);
timer.Stop();
return timer.ElapsedMilliseconds;
}
//Test object to save to Cosmos
public class TestObject
{
public string id { get; set; } = Guid.NewGuid().ToString();
public string pKey { get; set; } = Guid.NewGuid().ToString();
public string Field1 { get; set; } = "Testing this field";
public double Number { get; set; } = 12345;
}
}
This is the scenario for which Bulk is being introduced. Bulk mode is in preview at this moment and available in the 3.2.0-preview2 package.
What you need to do to take advantage of Bulk is turn the AllowBulkExecution flag on:
new CosmosClient(endpoint, authKey, new CosmosClientOptions() { AllowBulkExecution = true } );
This mode was designed for the scenario you describe: a list of concurrent operations that need throughput.
We have a sample project here: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/BulkSupport
We are still working on the official documentation, but the idea is this: when concurrent operations are issued, instead of executing them as individual requests as you are seeing right now, the SDK groups them based on partition affinity and executes them as grouped (batch) operations. This reduces the backend service calls and can potentially increase throughput by 50% to 100%, depending on the volume of operations. This mode consumes more RU/s because it pushes a higher volume of operations per second than issuing the operations individually (so if you hit 429s, it means the bottleneck is now the provisioned RU/s).
var cosmosClient = new CosmosClient("todo", new CosmosClientOptions { AllowBulkExecution = true });
var database = cosmosClient.GetDatabase("<ourcontainer>");
var sdkContainerClient = database.GetContainer("<ourcontainer>");
//The more operations the better, just 25 might not yield a great difference vs non bulk
int insertCount = 10000;
//Don't do any warmup
List<Task> operations = new List<Task>();
var timer = Stopwatch.StartNew();
for (int i = 0; i < insertCount; i++)
{
operations.Add(sdkContainerClient.CreateItemAsync(new TestObject()));
}
await Task.WhenAll(operations);
timer.Stop();
Important: This feature is still in preview. Since this mode is optimized for throughput (not latency), any single operation on its own won't have great latency.
If you want to optimize even further, and your data source lets you access Streams (avoid serialization), you can use the CreateItemStream SDK methods for even better throughput.
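To make the grouping idea concrete, here is a minimal sketch in Java. It is a toy model, not the Cosmos SDK: the `Op` class, the integer partition id, and the `maxBatchSize` limit are all hypothetical names chosen for illustration, and the real SDK decides partition affinity and batch sizes internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT the Cosmos SDK, just the grouping idea.
class BulkGroupingSketch {
    // Hypothetical stand-in for a pending write aimed at one partition.
    static final class Op {
        final int partition;   // assumed partition id, for illustration
        final String payload;

        Op(int partition, String payload) {
            this.partition = partition;
            this.payload = payload;
        }
    }

    // Group pending operations by partition, then cut each group into
    // batches of at most maxBatchSize. Each batch models one backend call.
    static List<List<Op>> toBatches(List<Op> pending, int maxBatchSize) {
        Map<Integer, List<Op>> byPartition = new LinkedHashMap<>();
        for (Op op : pending) {
            byPartition.computeIfAbsent(op.partition, k -> new ArrayList<>()).add(op);
        }
        List<List<Op>> batches = new ArrayList<>();
        for (List<Op> group : byPartition.values()) {
            for (int i = 0; i < group.size(); i += maxBatchSize) {
                int end = Math.min(i + maxBatchSize, group.size());
                batches.add(new ArrayList<>(group.subList(i, end)));
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Op> pending = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            pending.add(new Op(i % 2, "doc-" + i)); // two partitions, five ops each
        }
        List<List<Op>> batches = toBatches(pending, 3);
        System.out.println(pending.size() + " operations -> " + batches.size() + " batched calls");
    }
}
```

In this model, ten individual requests collapse into four batched calls. That reduction in round trips is where the throughput gain would come from, and it also shows why a single operation may wait longer while its batch fills.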
I have created a multithreaded Kafka consumer in which one thread is assigned to each partition (I have 100 partitions in total). I followed the example at https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example.
Below is the init method of my consumer.
consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
System.out.println("Kafka Consumer initialized.");
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topicName, 100);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topicName);
executor = Executors.newFixedThreadPool(100);
In the init method above, I get the list of Kafka streams (100 in total), each of which should be connected to one partition (this is happening as expected).
Then I submit each stream to a different thread using the snippet below.
public Object call() {
for (final KafkaStream stream : streams) {
executor.execute(new StreamWiseConsumer(stream));
}
return true;
}
Below is the StreamWiseConsumer class.
public class StreamWiseConsumer extends Thread {
ConsumerIterator<byte[], byte[]> consumerIterator;
private KafkaStream m_stream;
public StreamWiseConsumer(ConsumerIterator<byte[], byte[]> consumerIterator) {
this.consumerIterator = consumerIterator;
}
public StreamWiseConsumer(KafkaStream kafkaStream) {
this.m_stream = kafkaStream;
}
@Override
public void run() {
ConsumerIterator<byte[], byte[]> consumerIterator = m_stream.iterator();
while(!Thread.currentThread().isInterrupted() && !interrupted) {
try {
if (consumerIterator.hasNext()) {
String reqId = UUID.randomUUID().toString();
System.out.println(reqId+ " : Event received by threadId : "+Thread.currentThread().getId());
MessageAndMetadata<byte[], byte[]> messageAndMetaData = consumerIterator.next();
byte[] keyBytes = messageAndMetaData.key();
String key = null;
if (keyBytes != null) {
key = new String(keyBytes);
}
byte[] eventBytes = messageAndMetaData.message();
if (eventBytes == null){
System.out.println("Topic: No event fetched for transaction Id:" + key);
continue;
}
String event = new String(eventBytes).trim();
// Some Processing code
System.out.println(reqId+" : Processing completed for threadId = "+Thread.currentThread().getId());
consumer.commitOffsets();
}
} catch (Exception ex) {
// Note: exceptions are silently swallowed here, which hides failures.
}
}
}
}
Ideally, it should start processing from all 100 partitions in parallel. Instead, one thread picks up some random number of events and processes them, then another thread starts processing from another partition. It looks like sequential processing, just spread across different threads. I was expecting processing to happen on all 100 threads at once. Am I missing something here?
Please find the log links below.
https://drive.google.com/file/d/14b7gqPmwUrzUWewsdhnW8q01T_cQ30ES/view?usp=sharing
https://drive.google.com/file/d/1PO_IEsOJFQuerW0y-M9wRUB-1YJuewhF/view?usp=sharing
I doubt that this is the right approach for vertically scaling Kafka Streams.
Kafka Streams inherently supports multi-threaded consumption.
Increase the number of threads used for processing with the num.stream.threads configuration.
If you want 100 threads to process the 100 partitions, set num.stream.threads to 100.
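As a rough sketch of that setting: `num.stream.threads` is a Kafka Streams configuration key, so this assumes the application is built on the Kafka Streams API rather than the older high-level consumer connector shown in the question. The application id and broker address below are placeholders, and only `java.util.Properties` is used so the snippet runs without the Kafka libraries.

```java
import java.util.Properties;

class StreamThreadsConfigSketch {
    // Build a configuration that could be handed to a Kafka Streams application.
    static Properties buildConfig(int streamThreads) {
        Properties props = new Properties();
        props.put("application.id", "example-app");        // placeholder value
        props.put("bootstrap.servers", "localhost:9092");  // placeholder value
        // Number of processing threads inside this application instance.
        props.put("num.stream.threads", Integer.toString(streamThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(100);
        System.out.println("num.stream.threads = " + props.getProperty("num.stream.threads"));
    }
}
```

With 100 partitions, 100 threads gives at most one partition per thread. Going above the partition count would normally leave the extra threads idle.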
I iterated over the entire table and received fewer partitions than expected.
Initially, I thought something must be wrong on my end. But after checking the existence of every row with a simple WHERE query (I have a list of the billions of keys I used), and also verifying the expected number with the Spark connector, I concluded that it can't be anything other than the driver.
I have billions of data rows, yet I am receiving half a billion fewer.
Has anyone else encountered this issue and been able to resolve it?
Adding a code snippet:
The table is a simple counter table:
CREATE TABLE counter_data (
id text,
name text,
count_val counter,
PRIMARY KEY (id, name)
) ;
public class CountTable {
private Session session;
private Statement countQuery;
public void initSession(String table) {
QueryOptions queryOptions = new QueryOptions();
queryOptions.setConsistencyLevel(ConsistencyLevel.ONE);
queryOptions.setFetchSize(100);
QueryLogger queryLogger = QueryLogger.builder().build();
Cluster cluster = Cluster.builder().addContactPoints("ip").withPort(9042)
.build();
cluster.register(queryLogger);
this.session = cluster.connect("ks");
this.countQuery = QueryBuilder.select("id").from(table);
}
public void performCount(){
ResultSet results = session.execute(countQuery);
int count = 0;
String lastKey = "";
results.iterator();
for (Row row : results) {
String key = row.getString(0);
if (!key.equals(lastKey)) {
lastKey = key;
count++;
}
}
session.close();
System.out.println("count is "+count);
}
public static void main(String[] args) {
CountTable countTable = new CountTable();
countTable.initSession("counter_data");
countTable.performCount();
}
}
Looking at your code, the requested consistency level is ONE, which is comparable to a dirty read in the RDBMS world.
queryOptions.setConsistencyLevel(ConsistencyLevel.ONE);
For stronger consistency, that is, to get back all records, use LOCAL_QUORUM. Update your code as follows:
queryOptions.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
LOCAL_QUORUM guarantees that a majority of the replicas (in your case, 2 out of 3) respond to the read request. This gives stronger consistency and therefore an accurate number of rows. See the documentation on consistency for reference.
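The "2 out of 3" figure follows from the usual quorum formula, floor(RF / 2) + 1. The sketch below works through that arithmetic, assuming a replication factor of 3 as the answer implies. The method names are illustrative and not part of the driver API.

```java
class QuorumSketch {
    // Quorum size for a given replication factor: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // A read is guaranteed to overlap the latest write when the number of
    // replicas read plus the number of replicas written exceeds rf.
    static boolean readOverlapsWrite(int rf, int readReplicas, int writeReplicas) {
        return readReplicas + writeReplicas > rf;
    }

    public static void main(String[] args) {
        int rf = 3; // assumed replication factor
        int q = quorum(rf);
        System.out.println("quorum for RF=" + rf + " is " + q);
        System.out.println("ONE read + ONE write overlap: " + readOverlapsWrite(rf, 1, 1));
        System.out.println("QUORUM read + QUORUM write overlap: " + readOverlapsWrite(rf, q, q));
    }
}
```

With RF=3, a read at ONE paired with a write at ONE touches only two replica slots out of three, so the read can land on a replica that has not yet seen the write. Quorum reads and writes together always share at least one replica.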
Here is what I am trying to achieve:
On the Service Bus I have a topic that contains 5005 messages.
I need to peek all the messages without completing them and add them to a list (List<BrokeredMessage>).
Here is what I am trying:
IEnumerable<BrokeredMessage> dlIE = null;
List<BrokeredMessage> bmList = new List<BrokeredMessage>();
long i = 0;
while (i < count) //count is the total messages in the subscription
{
dlIE = deadLetterClient.ReceiveBatch(100);
bmList.AddRange(dlIE);
i = i + dlIE.Count();
}
In the code above I can only fetch 100 messages at a time, since there is a batch size limit on retrieving messages.
I have also tried doing this asynchronously, but it always returns 0 messages in the list. This is the code for that:
static List<BrokeredMessage> messageList = new List<BrokeredMessage>();
long i = 0;
while (i < count)
{
var task = ReceiveMessagesBatchForSubscription(deadLetterClient);
i = i + 100;
}
Task.WaitAny();
public async static Task ReceiveMessagesBatchForSubscription(SubscriptionClient deadLetterClient)
{
while (true)
{
var receivedMessage = await deadLetterClient.ReceiveBatchAsync(100);
messageList.AddRange(receivedMessage);
}
}
Can anyone please suggest a better way to do this?
I have the following code which is used to listen to various ringbuffers. Some are high frequency price data and some are low frequency trade data:
public static void main(String[] args)
{
HazelcastInstance client = Hazelcast.newHazelcastInstance();
Ringbuffer<String> databuffer = client.getRingbuffer("data");
long sequence = databuffer.headSequence();
while(true)
{
String d = null;
try
{
d = databuffer.readOne(sequence);
System.out.println(d);
}
catch (InterruptedException e)
{
StringWriter errors = new StringWriter();
e.printStackTrace(new PrintWriter(errors));
System.out.println(errors.toString());
}
sequence ++;
}
}
The problem is this: the same code used for the low frequency trade data works fine. It autodiscovers the Hazelcast cluster, and when data is published into the ringbuffer it is read and acted on. However, for the high frequency data, where lots of data is published to the ringbuffer, the reader above starts up and autodiscovers the Hazelcast cluster, but then does not read any data at all. On one occasion it did manage to work.
I have also tried with
long sequence = databuffer.tailSequence() + 1;
Any ideas about what might be going wrong?
This was my own problem because I was not actually publishing the data I wanted to listen to. Aaargh!
It works well.