I have the following code to generate unique eventId. This is not a pure GUID generator but a way to generate unique id across different machines/processes.
import java.util.concurrent.atomic.AtomicInteger
object EventIdGenerator extends Serializable {
val totalBits = 64
private val epochBits = 42
private val partitionIdBits = 10
private val sequenceBits = 12
private val sequenceInt = new AtomicInteger(0)
private val maxPartitionId = (Math.pow(2, partitionIdBits) - 1).toInt
private val maxSequenceId = (Math.pow(2, sequenceBits) - 1).toInt
def nextEventId(partitionId: Int): Long = {
assert(partitionId <= maxPartitionId)
val nextSequence = sequenceInt.incrementAndGet() % maxSequenceId
val timestamp = if (nextSequence == 0) {
Thread.sleep(1)
System.currentTimeMillis()
} else System.currentTimeMillis()
timestamp << (totalBits - epochBits) |
partitionId << (totalBits - epochBits - partitionIdBits) |
nextSequence
}
}
This is supposed to be called from a distributed program running over several JVMs in different machines. Here partition id is going to be unique across different JVMs and machine.
When I run this program in a single thread, it works fine. But when I run it in multiple threads where each thread calls nextEventId with a unique partition id, sometimes I get duplicate event Ids. I am not able to figure out what the problem is with the code below.
You have only 1024 different values for nextSequence. So if you have request rate more than 1024 per millisecond (for 1 partition) you have ~100% chance of collision. In fact I believe that collision will appear with even much smaller rates because of clock precision.
I see that you tried to sidestep it with sleep(1) if you detected an overflow of nextSequence. But it doesn't work in a multithreaded environment. A particular partition may even never see nextSequence equal to 0.
Related
I am trying to insert data into Cassandra local cluster using async execution and version 4 of the driver (as same as my Cassandra instance)
I have instantiated the cql session in this way:
CqlSession cqlSession = CqlSession.builder()
.addContactEndPoint(new DefaultEndPoint(
InetSocketAddress.createUnresolved("localhost",9042))).build();
Then I create a statement in an async way:
return session.prepareAsync(
"insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
.thenComposeAsync(
(ps) -> {
CompletableFuture<AsyncResultSet>[] result = data.stream().map(
(d) -> session.executeAsync(
ps.bind(d.p1,d.p2,d.p3,d.p4)
)
).toCompletableFuture()
).toArray(CompletableFuture[]::new);
return CompletableFuture.allOf(result);
}
);
data is a dynamic list filled with user data.
When I exec the code I get the following exception:
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
at com.datastax.oss.driver.api.core.AllNodesFailedException.fromErrors(AllNodesFailedException.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.sendRequest(CqlPrepareHandler.java:210)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.onThrottleReady(CqlPrepareHandler.java:167)
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.<init>(CqlPrepareHandler.java:153)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:66)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:33)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:210)
at com.datastax.oss.driver.api.core.cql.AsyncCqlSession.prepareAsync(AsyncCqlSession.java:90)
The node is active and some data are inserted before the exception rise. I have also tried to set up a data center name on the session builder without any result.
Why this exception rise if the node is up and running? Actually I have only one local node and that could be a problem?
The biggest thing that I don't see, is a way to limit the current number of active async threads.
Basically, if that (mapped) data stream gets hit hard enough, it'll basically create all of these new threads that it's awaiting. If the number of writes coming in from those threads creates enough back-pressure that node can't keep up or catch up to, the node will become overwhelmed and not accept requests.
Take a look at this post by Ryan Svihla of DataStax:
Cassandra: Batch Loading Without the Batch — The Nuanced Edition
Its code is from the 3.x version of the driver, but the concepts are the same. Basically, provide some way to throttle-down the writes, or limit the number of "in flight threads" running at any given time, and that should help greatly.
Finally, I have found a solution using BatchStatement and a little custom code to create a chucked list.
int chunks = 0;
if (data.size() % 100 == 0) {
chunks = data.size() / 100;
} else {
chunks = (data.size() / 100) + 1;
}
final int finalChunks = chunks;
return session.prepareAsync(
"insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
.thenComposeAsync(
(ps) -> {
AtomicInteger counter = new AtomicInteger();
final List<CompletionStage<AsyncResultSet>> batchInsert = data.stream()
.map(
(d) -> ps.bind(d.p1,d.p2,d.p3,d.p4)
)
.collect(Collectors.groupingBy(it -> counter.getAndIncrement() / finalChunks))
.values().stream().map(
boundedStatements -> BatchStatement.newInstance(BatchType.LOGGED, boundedStatements.toArray(new BatchableStatement[0]))
).map(
session::executeAsync
).collect(Collectors.toList());
return CompletableFutures.allSuccessful(batchInsert);
}
);
In my Java application accessing Cassandra, it can insert 500 rows per second, but only update 50 rows per second(actually the updated rows didn't exist).
Updating one hundred fields is as fast as updating one field.
I just use CQL statements in the Java application.
Is this situation normal? How can I improve my application?
public void InsertSome(List<Data> data) {
String insertQuery = "INSERT INTO Data (E,D,A,S,C,......) values(?,?,?,?,?,.............); ";
if (prepared == null)
prepared = getSession().prepare(insertQuery);
count += data.size();
for (int i = 0; i < data.size(); i++) {
List<Object> objs = getFiledValues(data.get(i));
BoundStatement bs = prepared.bind(objs.toArray());
getSession().execute(bs);
}
}
public void UpdateOneField(Data data) {
String updateQuery = "UPDATE Data set C=? where E=? and D=? and A=? and S=?; ";
if (prepared == null)
prepared = getSession().prepare(updateQuery);
BoundStatement bs = prepared.bind(data.getC(), data.getE(),
data.getD(), data.getA(), data.getS());
getSession().execute(bs);
}
public void UpdateOne(Data data) {
String updateQuery = "UPDATE Data set C=?,U=?,F........where E=? and D=? and A=? and S=? and D=?; ";
if (prepared == null)
prepared = getSession().prepare(updateQuery);
......
BoundStatement bs = prepared.bind(objs2.toArray());
getSession().execute(bs);
}
Schema:
Create Table Data (
E,
D,
A,
S,
D,
C,
U,
S,
...
PRIMARY KEY ((E
D),
A,
S)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
AND compaction = { 'class' : 'LeveledCompactionStrategy' };
Another scenario:
I used the same application to access another cassandra cluster. The result was different. UPDATE was as fast as INSERT. But it only INSERT/UPDATE 5 rows per second. This cassandra cluster is the DataStax Enterprise running on GCE(I used the default DataStax Enterprise on Google Cloud Launcher)
So I think it's probably that some configurations are the reasons. But I don't know what they are.
Conceptually UPDATE and INSERT are the same so I would expect similar performance. UPDATE doesn't check to see if the data already exists (unless you are doing a lightweight transaction with IF EXISTS).
I noticed that each of your methods prepare a statement if it is not null. Is it possible the statement is being reprepared each time? That would add for a roundtrip for every method invocation. I also noticed that InsertSome does multiple inserts per invocation, where UpdateOne / UpdateOneField execute one statement. So if the statement were prepared every time, thats an invocation per update, where it's only done once per insert for a list.
Cassandra uses log-structured merge trees for an on-disk format, meaning all writes are done sequentially (the database is the append-only log). That implies a lower write latency.
At the cluster level, Cassandra is also able to achieve greater write scalability by partitioning the key space such that each machine is only responsible for a portion of the keys. That implies a higher write throughput, as more writes can be done in parallel.
What is the best way to partition the data by a field into predefined partition count?
I am currently partitioning the data by specifying the partionCount=600. The count 600 is found to give best query performance for my dataset/cluster setup.
val rawJson = sqlContext.read.json(filename).coalesce(600)
rawJson.write.parquet(filenameParquet)
Now I want to partition this data by the column 'eventName' but still keep the count 600. The data currently has around 2000 unique eventNames, plus the number of rows in each eventName is not uniform. Around 10 eventNames have more than 50% of the data causing data skew. Hence if I do the partitioning like below, its not very performant. The write is taking 5x more time than without.
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("eventName").parquet(filenameParquet)
What is a good way to partition the data for these scenarios? Is there a way to partition by eventName but spread this into 600 partitions?
My schema looks like this:
{
"eventName": "name1",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "detailed1",
"id": "1234",
...
...
}
}
}
Thanks!
This is a common problem with skewed data and there are several approaches you can take.
List bucketing works if the skew remains stable over time, which may or may not be the case, especially if new values of the partitioning variable are introduced. I have not researched how easy it is to adjust list bucketing over time and, as your comment states, you can't use that anyway because it is a Spark 2.0 feature.
If you are on 1.6.x, the key observation is that you can create your own function that maps each event name into one of 600 unique values. You can do this as a UDF or as a case expression. Then, you simply create a column using that function and then partition by that column using repartition(600, 'myPartitionCol) as opposed to coalesce(600).
Because we deal with very skewed data at Swoop, I've found the following workhorse data structure to be quite useful for building partitioning-related tools.
/** Given a key, returns a random number in the range [x, y) where
* x and y are the numbers in the tuple associated with a key.
*/
class RandomRangeMap[A](private val m: Map[A, (Int, Int)]) extends Serializable {
private val r = new java.util.Random() // Scala Random is not serializable in 2.10
def apply(key: A): Int = {
val (start, end) = m(key)
start + r.nextInt(end - start)
}
override def toString = s"RandomRangeMap($r, $m)"
}
For example, here is how we build a partitioner for a slightly different case: one where the data is skewed and the number of keys is small so we have to increase the number of partitions for the skewed keys while sticking with 1 as the minimum number of partitions per key:
/** Partitions data such that each unique key ends in P(key) partitions.
* Must be instantiated with a sequence of unique keys and their Ps.
* Partition sizes can be highly-skewed by the data, which is where the
* multiples come in.
*
* #param keyMap maps key values to their partition multiples
*/
class ByKeyPartitionerWithMultiples(val keyMap: Map[Any, Int]) extends Partitioner {
private val rrm = new RandomRangeMap(
keyMap.keys
.zip(
keyMap.values
.scanLeft(0)(_+_)
.zip(keyMap.values)
.map {
case (start, count) => (start, start + count)
}
)
.toMap
)
override val numPartitions =
keyMap.values.sum
override def getPartition(key: Any): Int =
rrm(key)
}
object ByKeyPartitionerWithMultiples {
/** Builds a UDF with a ByKeyPartitionerWithMultiples in a closure.
*
* #param keyMap maps key values to their partition multiples
*/
def udf(keyMap: Map[String, Int]) = {
val partitioner = new ByKeyPartitionerWithMultiples(keyMap.asInstanceOf[Map[Any, Int]])
(key:String) => partitioner.getPartition(key)
}
}
In your case, you have to merge several event names into a single partition, which would require changes but I hope the code above gives you an idea how to approach the problem.
One final observation is that if the distribution of event names values a lot in your data over time, you can perform a statistics gathering pass over some part of the data to compute a mapping table. You don't have to do this all the time, just when it is needed. To determine that, you can look at the number of rows and/or size of output files in each partition. In other words, the entire process can be automated as part of your Spark jobs.
I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that needs to be updated as new stream items arrive. For example, I want to change learning rate as the new predictions are made. However, I doubt that there is a way to broadcast variables after the initial broadcast. So what happens if I need to broadcast a variable every time I update it. If there is a way to do it or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.
Thanks in advance.
I got this working by creating a wrapper class over the broadcast variable. The updateAndGet method of wrapper class returns the refreshed broadcast variable. I am calling this function inside dStream.transform -> as per the Spark Documentation
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
Transform Operation states:
"the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."
BroadcastWrapper class will look like :
public class BroadcastWrapper {
private Broadcast<ReferenceData> broadcastVar;
private Date lastUpdatedAt = Calendar.getInstance().getTime();
private static BroadcastWrapper obj = new BroadcastWrapper();
private BroadcastWrapper(){}
public static BroadcastWrapper getInstance() {
return obj;
}
public JavaSparkContext getSparkContext(SparkContext sc) {
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
return jsc;
}
public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext){
Date currentDate = Calendar.getInstance().getTime();
long diff = currentDate.getTime()-lastUpdatedAt.getTime();
if (var == null || diff > 60000) { //Lets say we want to refresh every 1 min = 60000 ms
if (var != null)
var.unpersist();
lastUpdatedAt = new Date(System.currentTimeMillis());
//Your logic to refresh
ReferenceData data = getRefData();
var = getSparkContext(sparkContext).broadcast(data);
}
return var;
}
}
You can use this broadcast variable updateAndGet function in stream.transform method that allows RDD-RDD transformations
objectStream.transform(stream -> {
Broadcast<Object> var = BroadcastWrapper.getInstance().updateAndGet(stream.context());
/**Your code to manipulate stream **/
});
Refer to my full answer from this pos :https://stackoverflow.com/a/41259333/3166245
Hope it helps
My understanding is once a broadcast variable is initially sent out, it is 'read only'. I believe you can update the broadcast variable on the local nodes, but not on remote nodes.
May be you need to consider doing this 'outside Spark'. How about using a noSQL store (Cassandra ..etc) or even Memcache? You can then update the variable from one task and periodically check this store from other tasks?
I got an ugly play, but it worked!
We can find how to get a broadcast value from a broadcast object. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L114
just by broadcast id.
so i periodically rebroadcast through the same broadcast id.
val broadcastFactory = new TorrentBroadcastFactory()
broadcastFactory.unbroadcast(BroadcastId, true, true)
// append some ids to initIds
val broadcastcontent = broadcastFactory.newBroadcast[.Set[String]](initIds, false, BroadcastId)
and i can get BroadcastId from the first broadcast value.
val ids = ssc.sparkContext.broadcast(initIds)
// broadcast id
val BroadcastId = broadcastIds.id
then worker use ids as a Broadcast Type as normal.
def func(record: Array[Byte], bc: Broadcast[Set[String]]) = ???
bkc.unpersist(true)
bkc.destroy()
bkc = sc.broadcast(tableResultMap)
bkv = bkc.value
You may try this,I not guarantee whether effective
It is best that you collect the data to the driver and then broadcast them to all nodes.
Use Dstream # foreachRDD to collect the computed RDDs at the driver and once you know when you need to change learning rate, then use SparkContext#broadcast(value) to send the new value to all nodes.
I would expect the code to look something like the following:
dStreamContainingBroadcastValue.foreachRDD{ rdd =>
val valueToBroadcast = rdd.collect()
sc.broadcast(valueToBroadcast)
}
You may also find this thread useful, from the spark user mailing list. Let me know if that works.
I have a dataset which looks like this, where each user and product ID is a string:
userA, productX
userA, productX
userB, productY
with ~2.8 million products and 300 million users; about 2.1 billion user-product associations.
My end goal is to run Spark collaborative filtering (ALS) on this dataset. Since it takes int keys for users and products, my first step is to assign a unique int to each user and product, and transform the dataset above so that users and products are represented by ints.
Here's what I've tried so far:
val rawInputData = sc.textFile(params.inputPath)
.filter { line => !(line contains "\\N") }
.map { line =>
val parts = line.split("\t")
(parts(0), parts(1)) // user, product
}
// find all unique users and assign them IDs
val idx1map = rawInputData.map(_._1).distinct().zipWithUniqueId().cache()
// find all unique products and assign IDs
val idx2map = rawInputData.map(_._2).distinct().zipWithUniqueId().cache()
idx1map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx1Out)
idx2map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx2Out)
// join with user ID map:
// convert from (userStr, productStr) to (productStr, userIntId)
val rev = rawInputData.cogroup(idx1map).flatMap{
case (id1, (id2s, idx1s)) =>
val idx1 = idx1s.head
id2s.map { (_, idx1)
}
}
// join with product ID map:
// convert from (productStr, userIntId) to (userIntId, productIntId)
val converted = rev.cogroup(idx2map).flatMap{
case (id2, (idx1s, idx2s)) =>
val idx2 = idx2s.head
idx1s.map{ (_, idx2)
}
}
// save output
val convertedInts = converted.map{
case (a,b) => a.toInt.toString + "\t" + b.toInt.toString
}
convertedInts.saveAsTextFile(params.outputPath)
When I try to run this on my cluster (40 executors with 5 GB RAM each), it's able to produce the idx1map and idx2map files fine, but it fails with out of memory errors and fetch failures at the first flatMap after cogroup. I haven't done much with Spark before so I'm wondering if there is a better way to accomplish this; I don't have a good idea of what steps in this job would be expensive. Certainly cogroup would require shuffling the whole data set across the network; but what does something like this mean?
FetchFailed(BlockManagerId(25, ip-***.ec2.internal, 48690), shuffleId=2, mapId=87, reduceId=25)
The reason I'm not just using a hashing function is that I'd eventually like to run this on a much larger dataset (on the order of 1 billion products, 1 billion users, 35 billion associations), and number of Int key collisions would become quite large. Is running ALS on a dataset of that scale even close to feasible?
I looks like you are essentially collecting all lists of users, just to split them up again. Try just using join instead of cogroup, which seems to me to do more like what you want. For example:
import org.apache.spark.SparkContext._
// Create some fake data
val data = sc.parallelize(Seq(("userA", "productA"),("userA", "productB"),("userB", "productB")))
val userId = sc.parallelize(Seq(("userA",1),("userB",2)))
val productId = sc.parallelize(Seq(("productA",1),("productB",2)))
// Replace userName with ID's
val userReplaced = data.join(userId).map{case (_,(prod,user)) => (prod,user)}
// Replace product names with ID's
val bothReplaced = userReplaced.join(productId).map{case (_,(user,prod)) => (user,prod)}
// Check results:
bothReplaced.collect()) // Array((1,1), (1,2), (2,2))
Please drop a comments on how well it performs.
(I have no idea what FetchFailed(...) means)
My platform version : CDH :5.7, Spark :1.6.0/StandAlone;
My Test Data Size:31815167 all data; 31562704 distinct user strings, 4140276 distinct product strings .
First idea:
My first idea is to use collectAsMap action and then use the map idea to change the user/product string to int . With driver memory up to 12G , i got OOM or GC overhead exception (the exception is limited by driver memory).
But this idea can only use on a small data size, with bigger data size , you need a bigger driver memory .
Second idea :
Second idea is to use join method, as Tobber proposaled. Here is some test result:
Job setup:
driver: 2G , 2 cpu;
executor : (8G , 4 cpu) * 7;
I follow the steps:
1) find unique user strings and zipWithIndexes;
2) join the original data;
3) save the encoded data;
The job take about 10 minutes to finish.