I am using Cassandra's CQLSSTableWriter to import a large amount of data into Cassandra. When I use CQLSSTableWriter to write to a table with a compound primary key, memory consumption keeps growing and the JVM garbage collector cannot reclaim any of it. When writing to tables without a compound primary key, GC works fine.
My Cassandra version is 2.0.5. The OS is Ubuntu 14.04 x86-64. The JVM parameters are -Xms1g -Xmx2g, which is sufficient for all the other, non-compound primary key cases.
The problem can be reproduced by the following test case:
import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import org.apache.cassandra.exceptions.InvalidRequestException;
import java.io.IOException;
import java.util.UUID;

class SS {
    public static void main(String[] args) {
        String schema = "create table test.t (x uuid, y uuid, primary key (x, y))";
        String insert = "insert into test.t (x, y) values (?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/t")
                .forTable(schema)
                .withBufferSizeInMB(32)
                .using(insert)
                .build();
        UUID id = UUID.randomUUID(); // single partition key x, reused for every row
        try {
            for (int i = 0; i < 50000000; i++) {
                UUID id2 = UUID.randomUUID(); // a unique clustering key y per row
                writer.addRow(id, id2);
            }
            writer.close();
        } catch (Exception e) {
            System.err.println("hell");
        }
    }
}
I figured it out myself: the partition should not be too wide. 50,000,000 rows under a single partition key is too many for one partition.
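For anyone hitting the same wall, here is a minimal sketch of the workaround I mean: rotate the partition key every so often so that no single partition grows unbounded. The 100000-row interval is an arbitrary example value, not a tuned recommendation.
// Same writer setup as above; only the loop changes.
UUID partitionKey = UUID.randomUUID();
for (int i = 0; i < 50000000; i++) {
    if (i % 100000 == 0) {
        partitionKey = UUID.randomUUID(); // start a new, bounded partition
    }
    writer.addRow(partitionKey, UUID.randomUUID());
}
writer.close();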
Related
I'm trying to change all generated keys in a Hybris system. I'm using an Oracle database with Hybris 6.3+.
I used the Groovy script below to generate a new current value for each PK number series:
import de.hybris.platform.core.Registry;
import de.hybris.platform.core.PK.PKCounterGenerator;
import de.hybris.platform.persistence.numberseries.SerialNumberGenerator;
import de.hybris.platform.jalo.numberseries.NumberSeriesManager;
// Look up the number series manager and recreate each PK series with a new current value
numberseriesmanager = spring.getBean("default.core.numberSeriesManager");
Collection<String> keys = numberseriesmanager.getAllNumberSeriesKeys();
for (String ke : keys) {
    if (ke.contains('pk_')) {
        String[] keyWrd = ke.split("_");
        int keyint = Integer.parseInt(keyWrd[1]); // typecode part of the series key
        int current = new de.hybris.platform.core.DefaultPKCounterGenerator().fetchNextCounter(keyint);
        SerialNumberGenerator generator = Registry.getCurrentTenant().getSerialNumberGenerator();
        generator.removeSeries("pk_" + ke);
        generator.createSeries("pk_" + ke, 1, current * 3);
    }
}
I iterated over the entire table and received fewer partitions than expected.
Initially I thought something must be wrong on my end, but after checking the existence of every row with a simple WHERE query (I have the list of billions of keys I used), and also verifying the expected number with the Spark connector, I concluded that it can't be anything other than the driver.
I have billions of data rows, yet I receive half a billion fewer.
Has anyone else encountered this issue and been able to resolve it?
Adding a code snippet.
The table is a simple counter table:
CREATE TABLE counter_data (
id text,
name text,
count_val counter,
PRIMARY KEY (id, name)
) ;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryLogger;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.querybuilder.QueryBuilder;

public class CountTable {
    private Session session;
    private Statement countQuery;

    public void initSession(String table) {
        QueryOptions queryOptions = new QueryOptions();
        queryOptions.setConsistencyLevel(ConsistencyLevel.ONE);
        queryOptions.setFetchSize(100);
        QueryLogger queryLogger = QueryLogger.builder().build();
        Cluster cluster = Cluster.builder().addContactPoints("ip").withPort(9042)
                .withQueryOptions(queryOptions) // register the options, otherwise they have no effect
                .build();
        cluster.register(queryLogger);
        this.session = cluster.connect("ks");
        this.countQuery = QueryBuilder.select("id").from(table);
    }
    public void performCount() {
        ResultSet results = session.execute(countQuery);
        int count = 0;
        String lastKey = "";
        // Rows of a full scan come back partition by partition, so counting
        // changes of the partition key yields the number of partitions.
        for (Row row : results) {
            String key = row.getString(0);
            if (!key.equals(lastKey)) {
                lastKey = key;
                count++;
            }
        }
        session.close();
        System.out.println("count is " + count);
    }
public static void main(String[] args) {
CountTable countTable = new CountTable();
countTable.initSession("counter_data");
countTable.performCount();
}
}
Upon checking your code, the consistency level requested is ONE, which is comparable to a dirty read in the RDBMS world.
queryOptions.setConsistencyLevel(ConsistencyLevel.ONE);
For stronger consistency, that is, to get back all the records, use LOCAL_QUORUM. Update your code as follows:
queryOptions.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
LOCAL_QUORUM guarantees that a majority of the replica nodes (in your case 2 out of 3) respond to the read request, hence stronger consistency and an accurate number of rows. Here is the documentation reference on consistency.
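For completeness, a minimal sketch of how the option is wired into the same setup as your snippet (the contact point "ip" and port are the placeholders from your code, not real values):
QueryOptions queryOptions = new QueryOptions()
        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)
        .setFetchSize(100);
Cluster cluster = Cluster.builder()
        .addContactPoints("ip")
        .withPort(9042)
        .withQueryOptions(queryOptions) // the options only apply if registered on the builder
        .build();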
I use Spark v1.6. I have the DataFrame below.
Primary_key | Dim_id
PK1 | 1
PK2 | 2
PK3 | 3
I would like to create a new DataFrame with new sequence numbers whenever new records come in. Let's say I get 2 new records from the source with values PK4 and PK5; I would like to create new dim_ids with the values 4 and 5. So my new DataFrame should look like the one below.
Primary_key | Dim_id
PK1 | 1
PK2 | 2
PK3 | 3
PK4 | 4
PK5 | 5
How can I generate a running sequence number for the new records in a Spark 1.6 DataFrame?
If you have a database somewhere, you can create a sequence in it and use it with a user-defined function (like you, I stumbled upon this problem...).
Reserve a bucket of sequence numbers and use it (the incrementBy parameter must be the same as the one used to create the sequence). As it's an object, SequenceID will be a singleton on each worker node, and you can iterate over the bucket of sequence numbers using the AtomicLong.
It's far from perfect (possible connection leaks, reliance on a DB, locks on the static class), comments welcome.
import java.sql.Connection
import java.sql.DriverManager
import java.util.concurrent.locks.ReentrantLock
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.sql.functions.udf
object SequenceID {
var current: AtomicLong = new AtomicLong
var max: Long = 0
var connection: Connection = null
var connectionLock = new ReentrantLock
var seqLock = new ReentrantLock
def getConnection(): Connection = {
if (connection != null) {
return connection
}
connectionLock.lock()
if (connection == null) {
// create your jdbc connection here
}
connectionLock.unlock()
connection
}
def next(sequence: String, incrementBy: Long): Long = {
if (current.get == max) {
// sequence bucket exhausted, get a new one
seqLock.lock()
if (current.get == max) {
val rs = getConnection().createStatement().executeQuery(s"SELECT NEXT VALUE FOR ${sequence} FROM sysibm.sysdummy1")
rs.next()
current.set(rs.getLong(1))
max = current.get + incrementBy
}
seqLock.unlock()
}
return current.getAndIncrement
}
}
class SequenceID() extends Serializable {
def next(sequence: String, incrementBy: Long): Long = {
return SequenceID.next(sequence, incrementBy)
}
}
val sequenceGenerator = new SequenceID() // the class above takes no constructor arguments
def sequenceUDF(seq: SequenceID) = udf[Long](() => {
seq.next("PK_SEQUENCE", 500L)
})
val seq = sequenceUDF(sequenceGenerator)
myDataframe.select(myDataframe("foo"), seq())
Context: Spring Data Cassandra official 1.0.2.RELEASE from the Maven Central repo, CQL3, Cassandra 2.0, DataStax driver 2.0.4
Background: The cassandra blob data type is mapped to a Java ByteBuffer.
The sample code below demonstrates that you won't retrieve the correct bytes with a select that mirrors the equivalent insert. The data actually retrieved is prefixed by numerous garbage bytes that look like a serialization of the entire row.
This older post relating to Cassandra 1.2 suggested that we may have to start at ByteBuffer.arrayOffset() for a length of ByteBuffer.remaining(), but the arrayOffset value is actually 0.
I discovered a spring-data-cassandra 2.0.0.SNAPSHOT, but its CassandraOperations API is much different, as is its package name: org.springdata... versus org.springframework...
Help in fixing this would be very welcome.
In the meantime, it looks like I have to Base64 encode/decode my binary data to/from a text column.
--- here is the simple table CQL meta data I use -------------
CREATE TABLE person (
id text,
age int,
name text,
pict blob,
PRIMARY KEY (id)
) ;
--- follows the simple data object mapped to a CQL table ---
package org.spring.cassandra.example;
import java.nio.ByteBuffer;
import org.springframework.data.cassandra.mapping.PrimaryKey;
import org.springframework.data.cassandra.mapping.Table;
@Table
public class Person {
@PrimaryKey
private String id;
private int age;
private String name;
private ByteBuffer pict;
public Person(String id, int age, String name, ByteBuffer pict) {
this.id = id; this.name = name; this.age = age; this.pict = pict;
}
public String getId() { return id; }
public String getName() { return name; }
public int getAge() { return age; }
public ByteBuffer getPict() { return pict; }
}
--- and the plain java application code that simply inserts and retrieves a person object --
package org.spring.cassandra.example;
import java.nio.ByteBuffer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.data.cassandra.core.CassandraOperations;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import com.datastax.driver.core.querybuilder.Select;
public class CassandraApp {
private static final Logger logger = LoggerFactory
.getLogger(CassandraApp.class);
public static String hexDump(ByteBuffer bb) {
char[] hexArray = "0123456789ABCDEF".toCharArray();
bb.rewind();
char[] hexChars = new char[bb.limit() * 2];
for ( int j = 0; j < bb.limit(); j++ ) {
int v = bb.get() & 0xFF;
hexChars[j * 2] = hexArray[v >>> 4];
hexChars[j * 2 + 1] = hexArray[v & 0x0F];
}
bb.rewind();
return new String(hexChars);
}
public static void main(String[] args) {
ApplicationContext applicationContext = new ClassPathXmlApplicationContext("app-context.xml");
try {
CassandraOperations cassandraOps = applicationContext.getBean(
"cassandraTemplate", CassandraOperations.class);
cassandraOps.truncate("person");
// prepare data
byte[] ba = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x11, 0x22, 0x33, 0x44, 0x55, (byte) 0xAA, (byte) 0xCC, (byte) 0xFF };
ByteBuffer myPict = ByteBuffer.wrap(ba);
String myId = "1234567890";
String myName = "mickey";
int myAge = 50;
logger.info("We try id=" + myId + ", name=" + myName + ", age=" + myAge +", pict=" + hexDump(myPict));
cassandraOps.insert(new Person(myId, myAge, myName, myPict ));
Select s = QueryBuilder.select("id","name","age","pict").from("person");
s.where(QueryBuilder.eq("id", myId));
ResultSet rs = cassandraOps.query(s);
Row r = rs.one();
logger.info("We got id=" + r.getString(0) + ", name=" + r.getString(1) + ", age=" + r.getInt(2) +", pict=" + hexDump(r.getBytes(3)));
} catch (Exception e) {
e.printStackTrace();
}
}
}
--- assuming you have configured a simple Spring project for cassandra
as explained at http://projects.spring.io/spring-data-cassandra/
The actual execution yields:
[main] INFO org.spring.cassandra.example.CassandraApp - We try id=1234567890, name=mickey, age=50, pict= 0001020304051122334455AACCFF
[main] INFO org.spring.cassandra.example.CassandraApp - We got id=1234567890, name=mickey, age=50, pict=8200000800000073000000020000000100000004000A6D796B657973706163650006706572736F6E00026964000D00046E616D65000D000361676500090004706963740003000000010000000A31323334353637383930000000066D69636B657900000004000000320000000E 0001020304051122334455AACCFF
although the insert looks correct in the database itself, as seen from the cqlsh command line:
cqlsh:mykeyspace> select * from person;
id | age | name | pict
------------+-----+--------+--------------------------------
1234567890 | 50 | mickey | 0x0001020304051122334455aaccff
(1 rows)
I had exactly the same problem but have fortunately found a solution.
The problem is that ByteBuffer use is confusing. Try doing something like:
ByteBuffer bb = resultSet.one().getBytes("column_name");
byte[] data = new byte[bb.remaining()];
bb.get(data);
Thanks to Sylvain for this suggestion here: http://grokbase.com/t/cassandra/user/134brvqzd3/blobs-in-cql
Take a look at the Bytes class of the DataStax Java Driver; it provides what you need to encode/decode your data.
In this post I wrote a usage example.
HTH,
Carlo
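For reference, a minimal sketch of the Bytes utility Carlo mentions (com.datastax.driver.core.utils.Bytes in the 2.x driver); the pict column and the r Row variable mirror the question's code and are not part of the utility itself:
ByteBuffer pictBuffer = r.getBytes("pict");     // or r.getBytes(3) as in the question
byte[] pictBytes = Bytes.getArray(pictBuffer);  // copies exactly the remaining bytes, no row prefix
String hex = Bytes.toHexString(pictBuffer);     // "0x..." string, handy for comparing with cqlsh output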
I found some exceptions from Cassandra when I do batch mutations; they say "already has modifications in this mutation", but the info given shows two different operations.
I use super columns with counters in this case; it looks like:
Key: MD5 of URLs, UTF-8
SuperColumnName: date, UTF-8
ColumnName: counter name, a random number from 1 to 200
ColumnValue: 1L
// mutator, count, ser, ColumnFamilyName, BUF_MAX_NUM and keyspace are class fields (not shown)
public void SuperCounterMutation(ArrayList<String> urlList) {
for(String line : urlList) {
String[] ele = StringUtils.split(StringUtils.strip(line), ':');
String key = ele[0];
String SuperColumnName = ele[1];
LinkedList<HCounterColumn<String>> ColumnList = new LinkedList<HCounterColumn<String>>();
for(int i = 2; i < ele.length; ++i) {
ColumnList.add(HFactory.createCounterColumn(ele[i], 1L, ser));
}
mutator.addCounter(key, ColumnFamilyName, HFactory.createCounterSuperColumn(SuperColumnName, ColumnList, ser, ser));
++count;
if(count >= BUF_MAX_NUM) {
try {
mutator.execute();
} catch(Exception e) {
e.printStackTrace();
}
mutator = HFactory.createMutator(keyspace, ser);
count = 0;
}
}
return;
}
The error info from the Cassandra log showed that the duplicated operations share only the same key; the SuperColumnNames are not the same, and the counter name sets sometimes intersect and sometimes don't.
I'm using Cassandra 0.8.1 with Hector 0.8.0-rc2.
Can anyone tell me the reason for this problem? Thanks in advance!
Error info from cassandra log showed that the duplicated operations have the same key
Bingo. You'll need to combine operations from the same key into a single mutation.
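In case it helps, here is a rough sketch of that merge, reusing the fields from your snippet (mutator, keyspace, ser, ColumnFamilyName); it collapses the input so each row key contributes one set of operations per batch. It is only an illustration of the idea, not code tested against 0.8.1:
// Needs java.util.Map, java.util.HashMap, java.util.LinkedList and java.util.ArrayList
// in addition to the imports already used in the question.
public void mergedSuperCounterMutation(ArrayList<String> urlList) {
    // row key -> super column name -> merged counter columns
    Map<String, Map<String, LinkedList<HCounterColumn<String>>>> merged =
            new HashMap<String, Map<String, LinkedList<HCounterColumn<String>>>>();
    for (String line : urlList) {
        String[] ele = StringUtils.split(StringUtils.strip(line), ':');
        Map<String, LinkedList<HCounterColumn<String>>> perKey = merged.get(ele[0]);
        if (perKey == null) {
            perKey = new HashMap<String, LinkedList<HCounterColumn<String>>>();
            merged.put(ele[0], perKey);
        }
        LinkedList<HCounterColumn<String>> columns = perKey.get(ele[1]);
        if (columns == null) {
            columns = new LinkedList<HCounterColumn<String>>();
            perKey.put(ele[1], columns);
        }
        for (int i = 2; i < ele.length; ++i) {
            columns.add(HFactory.createCounterColumn(ele[i], 1L, ser));
        }
    }
    // One batch per row key: all super columns for that key go into the same mutation.
    for (Map.Entry<String, Map<String, LinkedList<HCounterColumn<String>>>> keyEntry : merged.entrySet()) {
        for (Map.Entry<String, LinkedList<HCounterColumn<String>>> scEntry : keyEntry.getValue().entrySet()) {
            mutator.addCounter(keyEntry.getKey(), ColumnFamilyName,
                    HFactory.createCounterSuperColumn(scEntry.getKey(), scEntry.getValue(), ser, ser));
        }
        mutator.execute();
        mutator = HFactory.createMutator(keyspace, ser);
    }
}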