I'm running a bunch of queries one after the other, but it seems like some of them have no effect, even though no errors are thrown, unless I restart the session after each query. I'm using the DataStax Cassandra driver for this.
Here are the queries, which I'm storing in a file separated by ####.
DROP KEYSPACE if exists test_space;
####
CREATE KEYSPACE test_space WITH replication = {'class': 'NetworkTopologyStrategy','0':'2'};
####
CREATE TABLE test_space.fr_core (
    frid text PRIMARY KEY,
    attributes text,
    pk1 text,
    pk2 text,
    pk3 text,
    pk4 text,
    pk5 text,
    pk6 text
);
####
Here's the code for executing the above statements:
public class CassandraKeyspaceDelete {

    public static void main(String[] args) {
        try {
            new CassandraKeyspaceDelete().run();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void run() {
        // Get file from resources folder
        ClassLoader classloader = Thread.currentThread().getContextClassLoader();
        InputStream is = classloader.getResourceAsStream("create_keyspace.txt");
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder out = new StringBuilder();
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                out.append(line);
            }
            // read from input stream
            reader.close();
        } catch (Exception e) {
            System.out.println("Error reading keyspace creation script.");
            return;
        }
        // System.out.println();
        com.datastax.driver.core.Session readSession = CassandraManager.connect("12.10.1.122", "", "READ");
        String selectStmnts[] = out.toString().split("####"); // { };
        for (String selectStmnt : selectStmnts) {
            System.out.println("" + selectStmnt.trim());
            if (selectStmnt.trim().length() > 0) {
                ResultSet res = readSession.execute(selectStmnt.trim());
            }
            // readSession.close();
            if (readSession.isClosed()) {
                readSession = CassandraManager.connect("12.10.1.122", "", "READ");
            }
        }
        System.out.println("Done");
        return;
    }
}
Here's the CassandraManager class:
public class CassandraManager {
    static Cluster cluster;
    public static Session session;
    static PreparedStatement statement;
    static BoundStatement boundStatement;
    public static HashMap<String, Session> sessionStore = new HashMap<String, Session>();

    public static Session connect(String ip, String keySpace, String type) {
        PoolingOptions poolingOpts = new PoolingOptions();
        poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
        poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 400);
        poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
        poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
        cluster = Cluster
                .builder()
                .withPoolingOptions(poolingOpts)
                .addContactPoint(ip)
                .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                .withReconnectionPolicy(new ConstantReconnectionPolicy(100L)).build();
        Session s = cluster.connect();
        return s;
    }
}
When I run this, the first two CQL queries run without errors. When the third one runs, I get an error saying Keyspace test_space doesn't exist.
If I uncomment readSession.close(), all the queries execute, though the session is closed and reopened for each one, resulting in slow execution.
Why aren't the queries working unless the session is restarted after each query?
I created a new project and tried your code in my Cassandra sandbox. It worked with four changes:
My datacenter is defined as "DC1", so the replication factor I used for the test_space keyspace was {'class': 'NetworkTopologyStrategy','DC1':'1'};
My sandbox instance is secured, so I had to use .withCredentials in the Cluster.builder
I couldn't get getResourceAsStream to work, so I replaced that with a FileInputStream instead.
I moved readSession.close(); outside of the for loop.
Since it worked on mine, I can't speak to the behaviour that you are seeing, so I will offer a few observations:
Is your datacenter really named 0? Your keyspace replication factor {'class': 'NetworkTopologyStrategy','0':'2'} is telling Cassandra to put two replicas in the 0 datacenter. If that really is the case, you should make your datacenter name something a little more intuitive.
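If you are not sure what your datacenter is actually called, the driver can tell you. Here is a minimal sketch using the driver's metadata API (cluster here is the Cluster instance your CassandraManager builds):
import com.datastax.driver.core.Host;

// List each node's datacenter as the driver sees it; the names used in
// NetworkTopologyStrategy must match these exactly.
for (Host host : cluster.getMetadata().getAllHosts()) {
    System.out.println(host.getAddress() + " -> datacenter: " + host.getDatacenter());
}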
None of the statements in your text file return a result set. So doing this ResultSet res = readSession.execute(selectStmnt.trim()); really doesn't get you anything.
Given the name of your keyspace, I can only assume that you are testing some things out. So how do you know that you need all of these options on your cluster builder? My advice to you, is to start simple. Don't add the other options unless you know that you need them, and more importantly, what they do.
cluster = Cluster.builder()
        .addContactPoint(ip)
        .build();
Session s = cluster.connect();
Make sure that your readSession.close(); is outside of your for loop.
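In other words, the loop should look roughly like this (a sketch of the intended structure, not your exact code):
for (String selectStmnt : out.toString().split("####")) {
    String cql = selectStmnt.trim();
    if (cql.length() > 0) {
        readSession.execute(cql);   // reuse the same session for every statement
    }
}
readSession.close();                // close once, after all statements have run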
Something else that might help you is to read through Things You Should Be Doing When Using Cassandra Drivers by DataStax's Rebecca Mills.
I was following an example at https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/cassandra.html to connect to Cassandra as a sink in Flink.
My code is shown below:
public class writeToCassandra {

    private static final String CREATE_KEYSPACE_QUERY = "CREATE KEYSPACE test WITH replication= {'class':'SimpleStrategy', 'replication_factor':1};";
    private static final String createTable = "CREATE TABLE test.cassandraData(id varchar, heart_rate varchar, PRIMARY KEY(id));";

    private final static Collection<String> collection = new ArrayList<>(50);

    static {
        for (int i = 1; i <= 50; ++i) {
            collection.add("element " + i);
        }
    }

    public static void main(String[] args) throws Exception {
        // setting the env variable to local
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);

        DataStream<Tuple2<String, String>> dataStream = environment
                .fromCollection(collection)
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    final String mapped = " mapped ";
                    String[] splitted;

                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        splitted = s.split("\\s+");
                        return Tuple2.of(
                                UUID.randomUUID().toString(),
                                splitted[0] + mapped + splitted[1]
                        );
                    }
                });

        CassandraSink.addSink(dataStream)
                .setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
                .setHost("127.0.0.1")
                .build();

        environment.execute();
    } // main
} // writeToCassandra
I am getting the following error
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
Not sure if this is always required, but the way that I set up my CassandraSink is like this:
CassandraSink
    .addSink(dataStream)
    .setClusterBuilder(new ClusterBuilder() {
        @Override
        protected Cluster buildCluster(Cluster.Builder builder) {
            return Cluster.builder()
                    .addContactPoints(myListOfCassandraUrlsString.split(","))
                    .withPort(portNumber)
                    .build();
        }
    })
    .build();
I have annotated POJOs that are returned by the dataStream so I don't need the query, but you would just include ".setQuery(...)" after the ".addSink(...)" line.
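For reference, such an annotated POJO looks roughly like this; the class and field names here are made up to match the test.cassandraData table from the question, so adjust them to your own schema:
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Table;

// The sink's object mapper reads these annotations instead of an explicit INSERT query.
@Table(keyspace = "test", name = "cassandraData")
public class HeartRateReading {                      // hypothetical class name

    @Column(name = "id")
    private String id;

    @Column(name = "heart_rate")
    private String heartRate;

    public HeartRateReading() {}                     // the mapper needs a no-arg constructor

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getHeartRate() { return heartRate; }
    public void setHeartRate(String heartRate) { this.heartRate = heartRate; }
}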
The exception simply indicates that the example program cannot reach the C* database.
The flink-cassandra-connector offers a streaming API to connect to a designated C* database, so you need to have a C* instance running.
Each streaming job is pushed/serialized to the node that the Task Manager runs on. In your example, you assume C* is running on the same node as the TM. An alternative is to change the C* address from 127.0.0.1 to a publicly reachable address.
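A quick way to confirm the address is reachable is to connect with the plain driver outside of Flink. A minimal sketch (adjust the contact point and port to wherever your node actually listens; see rpc_address and native_transport_port in cassandra.yaml):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// If this also throws NoHostAvailableException, the problem is the C* setup, not the Flink job.
try (Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")   // or the node's public address
        .withPort(9042)
        .build();
     Session session = cluster.connect()) {
    System.out.println("Connected to cluster: " + cluster.getMetadata().getClusterName());
}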
I am evaluating Spark with the MarkLogic database. I have read a CSV file, and now I have a JavaRDD object which I have to write into the MarkLogic database.
SparkConf conf = new SparkConf().setAppName("org.sparkexample.Dataload").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("/root/ml/workArea/data.csv");
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<Record> rdd_records = data.map(
    new Function<String, Record>() {
        public Record call(String line) throws Exception {
            String[] fields = line.split(",");
            Record sd = new Record(fields[0], fields[1], fields[2], fields[3], fields[4]);
            return sd;
        }
    });
I want to write this JavaRDD object to the MarkLogic database.
Is there any Spark API available for faster writing to the MarkLogic database?
Let's say we could not write the JavaRDD directly to MarkLogic; what is the correct approach to achieve this?
Here is the code which I am using to write the JavaRDD data to the MarkLogic database; let me know if it is the wrong way to do that.
final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8070, "MLTest");
final XMLDocumentManager docMgr = client.newXMLDocumentManager();
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
    public void call(Iterator<Record> partitionOfRecords) throws Exception {
        while (partitionOfRecords.hasNext()) {
            Record record = partitionOfRecords.next();
            System.out.println("partitionOfRecords - " + record.toString());
            String docId = "/example/" + record.getID() + ".xml";
            JAXBContext context = JAXBContext.newInstance(Record.class);
            JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
            handle.set(record);
            docMgr.writeAs(docId, handle);
        }
    }
});
client.release();
I have used the Java Client API to write the data, but I am getting the below exception even though the POJO class Record implements the Serializable interface. Please let me know what could be the reason and how to solve it.
org.apache.spark.SparkException: Task not serializable
The easiest way to get data into MarkLogic is via HTTP and the client REST API - specifically the /v1/documents endpoints - http://docs.marklogic.com/REST/client/management.
There are a variety of ways to optimize this, such as via a write set, but based on your question, I think the first thing to decide is - what kind of document do you want to write for each Record? Your example shows 5 columns in the CSV - typically, you'll write either a JSON or XML document with 5 fields/elements, each named based on the column index. So you'd need to write a little code to generate that JSON/XML, and then use whatever HTTP client you prefer (and one option is the MarkLogic Java Client API) to write that document to MarkLogic.
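As a rough illustration of that idea with the MarkLogic Java Client API, assuming the same Record type and client as in your code, you could build a small JSON string per record and write it with a StringHandle. The field names here are made up:
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

JSONDocumentManager jsonMgr = client.newJSONDocumentManager();

// Build a tiny JSON document from the five CSV columns; in real code use a
// JSON library (e.g. Jackson) instead of string concatenation.
String json = "{\"field1\":\"" + record.getID() + "\","
            + "\"field2\":\"" + "..." + "\"}";        // remaining columns elided

String docId = "/example/" + record.getID() + ".json";
jsonMgr.write(docId, new StringHandle(json).withFormat(Format.JSON));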
That addresses your question of how to write a JavaRDD to MarkLogic - but if your goal is to get data from a CSV into MarkLogic as fast as possible, then skip Spark and use mlcp - https://docs.marklogic.com/guide/mlcp/import#id_70366 - which involves zero coding.
Here is a modified example from the Spark Streaming guide; you will have to implement connection and writing logic specific to your database.
public void send(JavaRDD<String> rdd) {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
        @Override
        public void call(Iterator<String> partitionOfRecords) {
            // ConnectionPool is a static, lazily initialized pool of connections
            Connection connection = ConnectionPool.getConnection();
            while (partitionOfRecords.hasNext()) {
                connection.send(partitionOfRecords.next());
            }
            ConnectionPool.returnConnection(connection); // return to the pool for future reuse
        }
    });
}
I'm wondering if you just need to make sure everything you access inside your VoidFunction that was instantiated outside it is serializable (see this page). DatabaseClient and XMLDocumentManager are of course not serializable, as they're connected resources. You're right, however, to not instantiate DatabaseClient inside your VoidFunction as that would be less efficient (though it would work). I don't know if the following idea would work with spark. But I'm guessing you could create a class that keeps hold of a singleton DatabaseClient instance:
public static class MLClient {
    private static DatabaseClient singleton;

    private MLClient() {}

    public static DatabaseClient get(DatabaseClientFactory.Bean connectionInfo) {
        if ( connectionInfo == null ) {
            throw new IllegalArgumentException("connectionInfo cannot be null");
        }
        if ( singleton == null ) {
            singleton = connectionInfo.newClient();
        }
        return singleton;
    }
}
Then you just create a serializable DatabaseClientFactory.Bean outside your VoidFunction so your auth info is still centralized:
DatabaseClientFactory.Bean connectionInfo = new DatabaseClientFactory.Bean();
connectionInfo.setHost("localhost");
connectionInfo.setPort(8000);
connectionInfo.setUser("admin");
connectionInfo.setPassword("admin");
connectionInfo.setAuthenticationValue("digest");
Then inside your VoidFunction you could get that singleton DatabaseClient and a new XMLDocumentManager like so:
DatabaseClient client = MLClient.get(connectionInfo);
XMLDocumentManager docMgr = client.newXMLDocumentManager();
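Putting it together inside the foreachPartition, it would look roughly like this (same Record/JAXB handling as in your code, so treat it as a sketch rather than tested code):
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
    public void call(Iterator<Record> partitionOfRecords) throws Exception {
        // Only the serializable connectionInfo is captured by the closure;
        // the DatabaseClient itself is created (once) on the worker.
        DatabaseClient client = MLClient.get(connectionInfo);
        XMLDocumentManager docMgr = client.newXMLDocumentManager();
        JAXBContext context = JAXBContext.newInstance(Record.class);
        while (partitionOfRecords.hasNext()) {
            Record record = partitionOfRecords.next();
            JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
            handle.set(record);
            docMgr.writeAs("/example/" + record.getID() + ".xml", handle);
        }
    }
});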
I am new to Spark and I want to save the output of recommendProductsForUsers to an HBase table. I found an example (https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/) showing how to use JavaPairRDD and saveAsNewAPIHadoopDataset to save.
How can I convert JavaRDD<Tuple2<Object, Rating[]>> to JavaPairRDD<ImmutableBytesWritable, Put> so that I can use saveAsNewAPIHadoopDataset?
//Loads the data from hdfs
MatrixFactorizationModel sameModel = MatrixFactorizationModel.load(jsc.sc(), trainedDataPath);
//Get recommendations for all users
JavaRDD<Tuple2<Object, Rating[]>> ratings3 = sameModel.recommendProductsForUsers(noOfProductsToReturn).toJavaRDD();
By using mapToPair. Here is the example from the same source you provided (I changed the types by hand):
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = javaRDD.mapToPair(
    new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> row) throws Exception {
            Put put = new Put(Bytes.toBytes(row.getString(0)));
            put.add(Bytes.toBytes("columFamily"), Bytes.toBytes("columnQualifier1"), Bytes.toBytes(row.getString(1)));
            put.add(Bytes.toBytes("columFamily"), Bytes.toBytes("columnQualifier2"), Bytes.toBytes(row.getString(2)));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
It goes like this: you create a new instance of Put, supplying it with the row key in the constructor, and then for each column you call add. Then you return the Put you created.
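Adjusted to the Tuple2<Object, Rating[]> element type from your RDD, a sketch might look like this (the column family and qualifier names are just placeholders):
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = ratings3.mapToPair(
    new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> row) throws Exception {
            Put put = new Put(Bytes.toBytes(row._1.toString()));      // user id as the row key
            for (Rating rating : row._2) {
                put.addColumn(Bytes.toBytes("recommendation"),        // placeholder column family
                        Bytes.toBytes(String.valueOf(rating.product())),
                        Bytes.toBytes(String.valueOf(rating.rating())));
            }
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });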
This is how I solved the above problem; I hope this will be helpful to someone.
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts1 = ratings3
        .mapToPair(new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
            @Override
            public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> arg0)
                    throws Exception {
                Rating[] userAndProducts = arg0._2;
                System.out.println("***********" + userAndProducts.length + "**************");
                List<Item> items = new ArrayList<Item>();
                Put put = new Put(Bytes.toBytes(arg0._1.toString())); // user id as the row key
                String recommendedProduct = "";
                for (Rating r : userAndProducts) {
                    // Some logic here to convert Ratings into appropriate put command
                    // recommendedProduct = r.product;
                }
                put.addColumn(Bytes.toBytes("recommendation"), Bytes.toBytes("product"), Bytes.toBytes(recommendedProduct));
                return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
            }
        });
System.out.println("*********** Number of records in JavaPairRdd: "+ hbasePuts1.count() +"**************");
hbasePuts1.saveAsNewAPIHadoopDataset(newApiJobConfig.getConfiguration());
jsc.stop();
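The newApiJobConfig used above is not shown; a typical setup for HBase's TableOutputFormat looks roughly like this (the ZooKeeper quorum and table name are assumptions to replace with your own):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set("hbase.zookeeper.quorum", "localhost");                                      // assumed ZooKeeper address

Job newApiJobConfig = Job.getInstance(hbaseConf);
newApiJobConfig.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "recommendations"); // assumed table name
newApiJobConfig.setOutputFormatClass(TableOutputFormat.class);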
We just open-sourced Splice Machine, and we have examples integrating MLlib with querying and storage in Splice Machine. I do not know if this will help, but I thought I would let you know.
http://community.splicemachine.com/use-spark-libraries-splice-machine/
Thanks for the post, very cool.
Has anyone retrieved the auto-generated keys for a database insert while using Groovy SQL's withBatch method? I have the following code
def Sql target = ... // database connection
target.withBatch { ps ->
    insertableStuff.each { ps.addBatch(it) }
    ps.executeBatch()
    def results = ps.getGeneratedKeys() // what do I do with this?
}
We're using DB2, and I've successfully tested the getGeneratedKeys method with a single statement/result set, but once I wrap the process in a batch, I'm not sure what objects I'm dealing with anymore.
According to IBM, it is possible to get the results back, but their example is using standard JDBC objects, not the groovy ones. Any ideas?
I took the Groovy SQL stuff out of the picture to see if I could get something working; I wanted to make sure that DB2 for z/OS actually supported the function, and I was able to get the generated values. I was using IBM's example, but I had to add some extra code to handle the casting that the IBM example relies on.
def Sql target = ... // get database connection
def preparedStatement = target.connection.prepareStatement(statement, ['ISN'] as String[])
ResultSet[] resultSets = ((DB2PreparedStatement) (preparedStatement.getDelegate().getDelegate())).getDBGeneratedKeys()
resultSets.each { ResultSet results ->
    while (results.next()) {
        println results.getInt(1)
    }
}
So... that's a little clunky, but it's functional. Unfortunately, by controlling the statement myself, I lost all of the parameter mapping that Groovy normally does for me.
I was looking through the groovy Sql source code and can see where they are explicitly telling the database connection not to handle parameters, so I'm thinking I'll add a new method to Sql.metaClass that can pass in a list of the auto-generated column names or something to make this more palatable.
I also want to see if there's a way to get the getGeneratedKeys method working so that I don't have to do all of that casting. At the very least, a utility method to safely handle the casting for me.
try {
    withinBatch = true;
    PreparedStatement statement = (PreparedStatement) getAbstractStatement(new CreatePreparedStatementCommand(0), connection, sql);
    configure(statement);
    psWrapper = new BatchingPreparedStatementWrapper(statement, indexPropList, batchSize, LOG, this);
    closure.call(psWrapper);
    return psWrapper.executeBatch();
} catch (SQLException e) {
The CreatePreparedStatementCommand(0) prevents the creation of a statement that could return the auto-generated keys.
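For comparison, plain JDBC only hands back generated keys when the statement is created with Statement.RETURN_GENERATED_KEYS or an explicit list of key columns, which is exactly what that code path never requests. A minimal sketch (the table and column names are made up):
PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO MY_TABLE (NAME) VALUES (?)",            // hypothetical statement
        new String[] { "ISN" });                             // or Statement.RETURN_GENERATED_KEYS
ps.setString(1, "some value");
ps.executeUpdate();
ResultSet keys = ps.getGeneratedKeys();
while (keys.next()) {
    System.out.println(keys.getLong(1));
}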
Just to make sure I wasn't crazy, I re-tried the getGeneratedKeys method again with a statement that I know works, and I got no results (see below). I had to recursively spin through the delegates to find the IBM class. So... it's not my favorite code, it's pretty brittle, but it's functional. Now I just need to see if I can still use the withBatch method somehow; I'll obviously need to override some things.
println 'print using getGeneratedKeys'
def results = preparedStatement.getGeneratedKeys()
while (results.next()) {
    println SqlGroovyMethods.toRowResult(results)
}
println 'print using delegate processing'
println getGeneratedKeys(preparedStatement)

private List getGeneratedKeys(PreparedStatement statement) {
    switch (statement) {
        case DelegatingStatement:
            return getGeneratedKeys(DelegatingStatement.cast(statement).getDelegate())
        case DB2PreparedStatement:
            ResultSet[] resultSets = DB2PreparedStatement.cast(statement).getDBGeneratedKeys()
            List keys = []
            resultSets.each { ResultSet results ->
                while (results.next()) {
                    keys << SqlGroovyMethods.toRowResult(results)
                }
            }
            return keys
        default:
            return [SqlGroovyMethods.toRowResult(statement.getGeneratedKeys())]
    }
}
---- Console Output ----
print using getGeneratedKeys
print using delegate processing
[[KEY:7391], [KEY:7392]]
Okay, got it working. I had to hack my way into the Groovy SQL class, and there are some things that I just couldn't do because the methods in the Groovy class were private, so this implementation doesn't support cachedStatements, the isWithinBatch method won't operate correctly in the closure, and there's no access to the number of rows that were updated.
It'd be nice to see some variation of this in the base Groovy code, perhaps with an extension point where you put in your own handler (since you wouldn't want the IBM-specific stuff in the base Groovy code), but at least I have a workable solution now.
public class SqlWithGeneratedKeys extends Sql {

    public SqlWithGeneratedKeys(Sql parent) {
        super(parent);
    }

    public List<GroovyRowResult> withBatch(String pSql, String[] keys, Closure closure) throws SQLException {
        return this.withBatch(0, pSql, keys, closure);
    }

    public List<GroovyRowResult> withBatch(int batchSize, String pSql, String[] keys, Closure closure) throws SQLException {
        final Connection connection = this.createConnection();
        List<Tuple> indexPropList = null;
        final SqlWithParams preCheck = this.buildSqlWithIndexedProps(pSql);
        BatchingPreparedStatementWrapper psWrapper = null;
        String sql = pSql;
        if (preCheck != null) {
            indexPropList = new ArrayList<Tuple>();
            for (final Object next : preCheck.getParams()) {
                indexPropList.add((Tuple) next);
            }
            sql = preCheck.getSql();
        }
        PreparedStatement statement = null;
        try {
            statement = connection.prepareStatement(sql, keys);
            this.configure(statement);
            psWrapper = new BatchingPreparedStatementWrapper(statement, indexPropList, batchSize, LOG, this);
            closure.call(psWrapper);
            psWrapper.executeBatch();
            return this.getGeneratedKeys(statement);
        } catch (final SQLException e) {
            LOG.warning("Error during batch execution of '" + sql + "' with message: " + e.getMessage());
            throw e;
        } finally {
            BaseDBServices.closeDBElements(connection, statement, null);
        }
    }

    protected List<GroovyRowResult> getGeneratedKeys(Statement statement) throws SQLException {
        if (statement instanceof DelegatingStatement) {
            return this.getGeneratedKeys(DelegatingStatement.class.cast(statement).getDelegate());
        } else if (statement instanceof DB2PreparedStatement) {
            final ResultSet[] resultSets = DB2PreparedStatement.class.cast(statement).getDBGeneratedKeys();
            final List<GroovyRowResult> keys = new ArrayList<GroovyRowResult>();
            for (final ResultSet results : resultSets) {
                while (results.next()) {
                    keys.add(SqlGroovyMethods.toRowResult(results));
                }
            }
            return keys;
        }
        return Arrays.asList(SqlGroovyMethods.toRowResult(statement.getGeneratedKeys()));
    }
}
Calling it is nice and clean.
println new SqlWithGeneratedKeys(target).withBatch(statement, ['ISN'] as String[]) { ps ->
    rows.each {
        ps.addBatch(it)
    }
}
I'm using nested asynchronous query execution with Cassandra. Data is continuously streamed in, and for each incoming record the below block of Cassandra operations is executed. It works fine for a while but then starts throwing a lot of NoHostAvailableExceptions.
Please help me out here.
Cassandra Session Connection code:
I use separate sessions for read and write. Each of these sessions connects to a different seed node, as I was told this would improve performance.
final com.datastax.driver.core.Session readSession = CassandraManager.connect("10.22.1.144", "fr_repo", "READ");
final com.datastax.driver.core.Session writeSession = CassandraManager.connect("10.1.12.236", "fr_repo", "WRITE");
The CassandraManager.connect method is below:
public static Session connect(String ip, String keySpace, String type) {
    PoolingOptions poolingOpts = new PoolingOptions();
    poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
    poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 400);
    poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
    poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
    cluster = Cluster
            .builder()
            .withPoolingOptions(poolingOpts)
            .addContactPoint(ip)
            .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
            .withReconnectionPolicy(new ConstantReconnectionPolicy(100L)).build();
    Session s = cluster.connect(keySpace);
    return s;
}
Database operation code:
ResultSetFuture resultSetFuture = readSession.executeAsync(selectBound.bind(fr.getHashcode()));
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
    public void onSuccess(com.datastax.driver.core.ResultSet resultSet) {
        try {
            Iterator<Row> rows = resultSet.iterator();
            if (!rows.hasNext()) {
                ResultSetFuture resultSetFuture = readSession.executeAsync(selectPrimaryBound
                        .bind(fr.getPrimaryKeyHashcode()));
                Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
                    public void onFailure(Throwable arg0) {
                    }

                    public void onSuccess(ResultSet arg0) {
                        Iterator<Row> rows = arg0.iterator();
                        if (!rows.hasNext()) {
                            writeSession.executeAsync(insertBound.bind(fr.getHashcode(), fr,
                                    System.currentTimeMillis()));
                            writeSession.executeAsync(insertPrimaryBound.bind(
                                    fr.getHashcode(),
                                    fr.getCombinedPrimaryKeys(), System.currentTimeMillis()));
                            produceintoQueue(new Gson().toJson(frCompleteMap));
                        } else {
                            writeSession.executeAsync(updateBound.bind(fr,
                                    System.currentTimeMillis(), fr.getHashcode()));
                            produceintoQueue(new Gson().toJson(frCompleteMap));
                        }
                    }
                });
            } else {
                writeSession.executeAsync(updateLastSeenBound.bind(System.currentTimeMillis(),
                        fr.getHashcode()));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void onFailure(Throwable t) {
    }
});
It sounds like you're sending more requests than your pool/cluster can handle. This is pretty easy to do when you're never actually waiting for a result, as is the case in your code. You're essentially just throwing as many requests as you can into the pipeline with no blocking, and there's no natural back pressure to slow down your app if the pool or cluster get backed up. So if your request volume is too high, eventually all the hosts will be busy with the backed up work queue. You can use nodetool tpstats to see what your request queues look like on each node.
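One common way to add that back pressure without going fully synchronous is to cap the number of in-flight requests, for example with a semaphore. A rough sketch (the limit of 256 is an arbitrary number to tune for your cluster):
import java.util.concurrent.Semaphore;

final Semaphore inFlight = new Semaphore(256);        // max concurrent async requests

// Acquire a permit before each executeAsync and release it when the query finishes,
// so the producer blocks instead of flooding the connection pool.
inFlight.acquireUninterruptibly();
ResultSetFuture future = readSession.executeAsync(selectBound.bind(fr.getHashcode()));
Futures.addCallback(future, new FutureCallback<ResultSet>() {
    public void onSuccess(ResultSet rs) {
        inFlight.release();
        // ... existing callback logic ...
    }

    public void onFailure(Throwable t) {
        inFlight.release();
        t.printStackTrace();
    }
});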