How to write JavaRDD to marklogic database - apache-spark

I am evaluating Spark with the MarkLogic database. I have read a CSV file, and now I have a JavaRDD object that I have to dump into the MarkLogic database.
SparkConf conf = new SparkConf().setAppName("org.sparkexample.Dataload").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("/root/ml/workArea/data.csv");
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<Record> rdd_records = data.map(
    new Function<String, Record>() {
        public Record call(String line) throws Exception {
            String[] fields = line.split(",");
            Record sd = new Record(fields[0], fields[1], fields[2], fields[3], fields[4]);
            return sd;
        }
    });
I want to write this JavaRDD object to the MarkLogic database.
Is there any Spark API available for faster writing to the MarkLogic database?
Let's say we cannot write a JavaRDD directly to MarkLogic; what is the correct approach to achieve this?
Here is the code I am using to write the JavaRDD data to the MarkLogic database; let me know if this is the wrong way to do it:
final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8070, "MLTest");
final XMLDocumentManager docMgr = client.newXMLDocumentManager();
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
    public void call(Iterator<Record> partitionOfRecords) {
        while (partitionOfRecords.hasNext()) {
            Record record = partitionOfRecords.next();
            System.out.println("partitionOfRecords - " + record.toString());
            String docId = "/example/" + record.getID() + ".xml";
            JAXBContext context = JAXBContext.newInstance(Record.class);
            JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
            handle.set(record);
            docMgr.writeAs(docId, handle);
        }
    }
});
client.release();
I have used the Java Client API to write the data, but I am getting the exception below even though the POJO class Record implements the Serializable interface. Please let me know what the reason could be and how to solve it.
org.apache.spark.SparkException: Task not serializable

The easiest way to get data into MarkLogic is via HTTP and the client REST API - specifically the /v1/documents endpoints - http://docs.marklogic.com/REST/client/management .
There are a variety of ways to optimize this, such as via a write set, but based on your question, I think the first thing to decide is - what kind of document do you want to write for each Record? Your example shows 5 columns in the CSV - typically, you'll write either a JSON or XML document with 5 fields/elements, each named based on the column index. So you'd need to write a little code to generate that JSON/XML, and then use whatever HTTP client you prefer (and one option is the MarkLogic Java Client API) to write that document to MarkLogic.
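For illustration, here is a minimal sketch of writing one JSON document per Record with the Java Client API; it is not from the original post, and it assumes an existing DatabaseClient named client plus a hypothetical toJson(Record) helper that serializes the five CSV fields into a JSON string.
// Sketch only: "client" is an existing DatabaseClient, "record" is one Record from the RDD,
// and toJson() is a hypothetical helper. Uses com.marklogic.client.document.JSONDocumentManager
// and com.marklogic.client.io.StringHandle / com.marklogic.client.io.Format.
JSONDocumentManager docMgr = client.newJSONDocumentManager();
String json = toJson(record); // e.g. {"field0":"...", ..., "field4":"..."}
docMgr.write("/example/" + record.getID() + ".json",
        new StringHandle(json).withFormat(Format.JSON));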
That addresses your question of how to write a JavaRDD to MarkLogic - but if your goal is to get data from a CSV into MarkLogic as fast as possible, then skip Spark and use mlcp - https://docs.marklogic.com/guide/mlcp/import#id_70366 - which involves zero coding.

Here is a modified example from the Spark Streaming guide; you will have to implement the connection and writing logic specific to your database.
public void send(JavaRDD<String> rdd) {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
        @Override
        public void call(Iterator<String> partitionOfRecords) {
            // ConnectionPool is a static, lazily initialized pool of connections
            Connection connection = ConnectionPool.getConnection();
            while (partitionOfRecords.hasNext()) {
                connection.send(partitionOfRecords.next());
            }
            ConnectionPool.returnConnection(connection); // return to the pool for future reuse
        }
    });
}

I'm wondering if you just need to make sure everything you access inside your VoidFunction that was instantiated outside it is serializable (see this page). DatabaseClient and XMLDocumentManager are of course not serializable, as they're connected resources. You're right, however, not to instantiate DatabaseClient inside your VoidFunction, as that would be less efficient (though it would work). I don't know if the following idea would work with Spark, but I'm guessing you could create a class that holds a singleton DatabaseClient instance:
public static class MLClient {
    private static DatabaseClient singleton;
    private MLClient() {}
    public static DatabaseClient get(DatabaseClientFactory.Bean connectionInfo) {
        if ( connectionInfo == null ) {
            throw new IllegalArgumentException("connectionInfo cannot be null");
        }
        if ( singleton == null ) {
            singleton = connectionInfo.newClient();
        }
        return singleton;
    }
}
Then you just create a serializable DatabaseClientFactory.Bean outside your VoidFunction so your auth info is still centralized:
DatabaseClientFactory.Bean connectionInfo = new DatabaseClientFactory.Bean();
connectionInfo.setHost("localhost");
connectionInfo.setPort(8000);
connectionInfo.setUser("admin");
connectionInfo.setPassword("admin");
connectionInfo.setAuthenticationValue("digest");
Then inside your VoidFunction you could get that singleton DatabaseClient and a new XMLDocumentManager like so:
DatabaseClient client = MLClient.get(connectionInfo);
XMLDocumentManager docMgr = client.newXMLDocumentManager();
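Putting those pieces together, the question's foreachPartition might then look something like this sketch (imports and error handling omitted; connectionInfo is the Bean created above):
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
    public void call(Iterator<Record> partitionOfRecords) throws Exception {
        // connectionInfo is the serializable Bean captured from the driver
        DatabaseClient client = MLClient.get(connectionInfo);
        XMLDocumentManager docMgr = client.newXMLDocumentManager();
        JAXBContext context = JAXBContext.newInstance(Record.class);
        while (partitionOfRecords.hasNext()) {
            Record record = partitionOfRecords.next();
            JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
            handle.set(record);
            docMgr.write("/example/" + record.getID() + ".xml", handle);
        }
    }
});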

Related

How to read messages in MQs using Spark Streaming, i.e. ZeroMQ, RabbitMQ?

As the Spark docs say, Spark Streaming supports Kafka as a data streaming source, but I use ZeroMQ, and there is no ZeroMQUtils. So how can I use it? And, more generally, what about other MQs? I am totally new to Spark and Spark Streaming, so I am sorry if the question is stupid. Could anyone give me a solution? Thanks.
BTW, I use Python.
Update: I finally did it in Java with a custom Receiver. Below is my solution:
public class ZeroMQReceiver<T> extends Receiver<T> {
    private static final ObjectMapper mapper = new ObjectMapper();

    public ZeroMQReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        // Start the thread that receives data over a connection
        new Thread(this::receive).start();
    }

    @Override
    public void onStop() {
        // There is nothing much to do as the thread calling receive()
        // is designed to stop by itself if isStopped() returns false
    }

    /** Create a socket connection and receive data until receiver is stopped */
    private void receive() {
        String message = null;
        try {
            ZMQ.Context context = ZMQ.context(1);
            ZMQ.Socket subscriber = context.socket(ZMQ.SUB);
            subscriber.connect("tcp://ip:port");
            subscriber.subscribe("".getBytes());

            // Until stopped or connection broken continue reading
            while (!isStopped() && (message = subscriber.recvStr()) != null) {
                List<T> results = mapper.readValue(message,
                        new TypeReference<List<T>>(){} );
                for (T item : results) {
                    store(item);
                }
            }

            // Restart in an attempt to connect again when server is active again
            restart("Trying to connect again");
        } catch (Throwable t) {
            // restart if there is any other error
            restart("Error receiving data", t);
        }
    }
}
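For completeness, here is a sketch of how such a receiver is typically plugged into a streaming job; the element type MyEvent, the app name, and the batch interval are placeholders, not part of the original solution.
public static void main(String[] args) throws Exception {
    // Sketch only: MyEvent, the app name, and the batch interval are made up for illustration.
    SparkConf conf = new SparkConf().setAppName("zmq-consumer").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    JavaReceiverInputDStream<MyEvent> events =
            jssc.receiverStream(new ZeroMQReceiver<MyEvent>());
    events.print();

    jssc.start();
    jssc.awaitTermination();
}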
I assume you are talking about Structured Streaming.
I am not familiar with ZeroMQ, but an important point in Spark Structured Streaming sources is replayability (in order to ensure fault tolerance), which, if I understand correctly, ZeroMQ doesn't deliver out-of-the-box.
A practical approach would be buffering the data either in Kafka (and using the KafkaSource) or as files in a directory (local FS/NFS, HDFS, S3) and using the FileSource for reading; cf. the Spark docs. If you use the FileSource, make sure not to append to existing files in the FileSource's input directory, but instead move files into the directory atomically.
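For illustration, a minimal Java sketch of the FileSource approach; the input path, schema, and checkpoint location are assumptions made up for the example.
public static void main(String[] args) throws Exception {
    // Sketch only: the input path, schema, and checkpoint location are placeholders.
    SparkSession spark = SparkSession.builder().appName("zmq-buffer-reader").getOrCreate();

    StructType schema = new StructType()
            .add("id", DataTypes.StringType)
            .add("payload", DataTypes.StringType);

    Dataset<Row> messages = spark.readStream()
            .schema(schema)                  // the FileSource needs an explicit schema
            .json("/data/zeromq-buffer");    // files must be moved into this directory atomically

    StreamingQuery query = messages.writeStream()
            .format("console")
            .option("checkpointLocation", "/tmp/checkpoints/zmq")
            .start();
    query.awaitTermination();
}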

how to pass cassandra cluster connection from one bolt to another bolt

My Storm topology reads data from Kafka and writes into Cassandra tables.
In Storm I am creating the Cassandra cluster connection and session in the prepare method:
cassandraCluster = Cluster.builder().withoutJMXReporting().withoutMetrics()
        .addContactPoints(nodes)
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ExponentialReconnectionPolicy(100L,
                TimeUnit.MINUTES.toMillis(5)))
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        .build();
session = cassandraCluster.connect(keyspace);
In the execute method I can process the tuple and save it in a Cassandra table.
Suppose I want to write data from a single tuple into multiple tables. Writing a separate bolt for each table would be a good choice, but then I would have to create a cluster connection and session per table in each bolt, whereas according to this link a single connection per cluster is a good idea for performance:
http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra
Does anyone have an idea how to create the cluster connection in one bolt and use that connection in another bolt?
It depends on how Storm allocates the bolts and spouts to the workers. You can't assume that you can share connections between bolts, because they might be running in different workers (read: JVMs) or on different nodes entirely.
See my answer here: Mongo connection pooling for Storm topology
Might look something like this pseudocode:
public class CassandraBolt extends BaseRichBolt {
    private static final long serialVersionUID = 1L;
    private static Logger LOG = LoggerFactory.getLogger(CassandraBolt.class);

    OutputCollector _collector;
    // the cluster and session have to be transient because they are not serializable
    protected transient Cluster cassandraCluster;
    protected transient Session _session;

    @SuppressWarnings("rawtypes")
    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
        // maybe get properties (contact points, keyspace) from stormConf instead of hard coding them
        cassandraCluster = Cluster.builder().withoutJMXReporting().withoutMetrics()
                .addContactPoints(nodes)
                .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                .withReconnectionPolicy(new ExponentialReconnectionPolicy(100L,
                        TimeUnit.MINUTES.toMillis(5)))
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .build();
        _session = cassandraCluster.connect(keyspace);
    }

    @Override
    public void execute(Tuple input) {
        try {
            // use _session to talk to cassandra
        } catch (Exception e) {
            LOG.error("CassandraBolt error", e);
            _collector.reportError(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // TODO Auto-generated method stub
    }
}
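For what it's worth, here is a sketch of wiring such a bolt into a topology; each executor then gets its own prepare() call, and therefore its own Cluster and Session, so no connection object needs to be passed between bolts. The spout instance, component names, and parallelism hints are placeholders.
// Sketch only: "kafkaSpout", component names, and parallelism hints are made up for illustration.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout, 1);
builder.setBolt("cassandra-bolt", new CassandraBolt(), 2)
        .shuffleGrouping("kafka-spout");
// then submit builder.createTopology() via StormSubmitter or LocalCluster as usual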

dynamic template generation and formatting using freemarker

My goal is to format a collection of Java maps to a string (basically a CSV) using FreeMarker or anything else that would do this smartly. I want to generate the template using configuration data stored in a database and managed from an admin application.
The configuration will tell me at what position a given piece of data (a key in the hash map) needs to go and also whether any script needs to run on that data before applying it at a given position. Several positions may be blank if the data is not in the map.
I am thinking of using FreeMarker to build this generic tool and would appreciate it if you could share how I should go about this.
I would also like to know if there is any built-in support in Spring Integration for building such a process, as the application is an SI application.
I am no FreeMarker expert, but a quick look at their Quick Start docs led me here...
public class FreemarkerTransformerPojo {

    private final Configuration configuration;
    private final Template template;

    public FreemarkerTransformerPojo(String ftl) throws Exception {
        this.configuration = new Configuration(Configuration.VERSION_2_3_23);
        this.configuration.setDirectoryForTemplateLoading(new File("/"));
        this.configuration.setDefaultEncoding("UTF-8");
        this.template = this.configuration.getTemplate(ftl);
    }

    public String transform(Map<?, ?> map) throws Exception {
        StringWriter writer = new StringWriter();
        this.template.process(map, writer);
        return writer.toString();
    }
}
and
public class FreemarkerTransformerPojoTests {

    @Test
    public void test() throws Exception {
        String template = System.getProperty("user.home") + "/Development/tmp/test.ftl";
        OutputStream os = new FileOutputStream(new File(template));
        os.write("foo=${foo}, bar=${bar}".getBytes());
        os.close();
        FreemarkerTransformerPojo transformer = new FreemarkerTransformerPojo(template);
        Map<String, String> map = new HashMap<String, String>();
        map.put("foo", "baz");
        map.put("bar", "qux");
        String result = transformer.transform(map);
        assertEquals("foo=baz, bar=qux", result);
    }
}
From a Spring Integration flow, send a message with a Map payload to
<int:transformer ... ref="fmTransformer" method="transform" />
Or you could do it with a groovy script (or other supported scripting language) using Spring Integration's existing scripting support without writing any code (except the script).
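For the CSV use case in the original question, here is a hedged usage sketch of the FreemarkerTransformerPojo above; the template path, template text, and map keys are made up, and the FreeMarker default operator (!) is one way to leave a position blank when a key is missing from the map.
// Sketch only: the template path, template text, and map keys are illustrative.
// The DB-driven configuration might generate a row template such as:
//   ${name!},${age!},${city!}
// where the "!" default operator outputs an empty string when the key is absent.
public String formatRow(Map<String, Object> row) throws Exception {
    FreemarkerTransformerPojo transformer =
            new FreemarkerTransformerPojo("/path/to/generated/row.ftl");
    return transformer.transform(row); // e.g. {"name":"Alice","age":30} -> "Alice,30,"
}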

Global Static Dictionary Initialization from Database in Webapi

I want to initialize a global Dictionary from the database in my Web API. Do I need to inject my DbContext in Global.asax or the OWIN Startup? Any example would be much appreciated.
Any kind of initialization can be done in your custom OWIN Startup class, like this:
using Microsoft.Owin;
using Microsoft.Owin.Security.OAuth;
using Owin;
using System;
using System.Collections.Generic;

[assembly: OwinStartup(typeof(YourNamespace.Startup))]

namespace YourNamespace
{
    public class Startup
    {
        public Dictionary<string, string> Table { get; private set; }

        public void Configuration(IAppBuilder app)
        {
            // token generation
            app.UseOAuthAuthorizationServer(new OAuthAuthorizationServerOptions
            {
                AllowInsecureHttp = false,
                TokenEndpointPath = new PathString("/token"),
                AccessTokenExpireTimeSpan = TimeSpan.FromHours(8),
                Provider = new SimpleAuthorizationServerProvider()
            });

            // token consumption
            app.UseOAuthBearerAuthentication(new OAuthBearerAuthenticationOptions());

            app.UseWebApi(WebApiConfig.Register());

            Table = ... Connect from DB and fill your table logic ...
        }
    }
}
After that you can use your Startup.Table property from your application.
In general, it is bad practice to access objects through static fields in ASP.NET applications, because this may lead to bugs that are hard to detect and reproduce; this is especially true for mutable/non-thread-safe objects like Dictionary.
I assume you want to cache some DB data in memory to avoid excessive SQL queries. It is a good idea to use the standard ASP.NET cache for this purpose:
public IDictionary GetDict() {
    var dict = HttpRuntime.Cache.Get("uniqueCacheKey") as IDictionary;
    if (dict == null) {
        dict = doLoadDictionaryFromDB(); // your code that loads data from DB
        HttpRuntime.Cache.Add("uniqueCacheKey", dict,
            null, Cache.NoAbsoluteExpiration,
            new TimeSpan(0, 5, 0), // cache at least for 5 minutes after last access
            CacheItemPriority.Normal, null);
    }
    return dict;
}
This approach lets you select an appropriate expiration policy without reinventing the wheel with a static dictionary.
If you still want to use a static dictionary, you can populate it on application start (Global.asax):
void Application_Start(object sender, EventArgs e)
{
// your code that initializes dictionary with data from DB
}

Multithreading and ORMLite

I have a database manager class that manages access to the database. It contains the connection pool and two DAOs, each for a different table. It looks something like this:
public class ActivitiesDatabase {

    private final ConnectionSource connectionSource;
    private final Dao<JsonActivity, String> jsonActivityDao;
    private final Dao<AtomActivity, String> atomActivityDao;

    private ActivitiesDatabase() {
        try {
            connectionSource = new JdbcPooledConnectionSource(Consts.JDBC);

            TableUtils.createTableIfNotExists(connectionSource, JsonActivity.class);
            jsonActivityDao = DaoManager.createDao(connectionSource, JsonActivity.class);

            TableUtils.createTableIfNotExists(connectionSource, AtomActivity.class);
            atomActivityDao = DaoManager.createDao(connectionSource, AtomActivity.class);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    public long insertAtom(String id, String content) throws SQLException {
        long additionTime = System.currentTimeMillis();
        atomActivityDao.createIfNotExists(new Activity(id, content, additionTime));
        return additionTime;
    }

    public long insertJson(String id, String content) throws SQLException {
        long additionTime = System.currentTimeMillis();
        jsonActivityDao.createIfNotExists(new Activity(id, content, additionTime));
        return additionTime;
    }

    public AtomResult getAtomEntriesBetween(long from, long to) throws SQLException {
        long updated = System.currentTimeMillis();
        PreparedQuery<Activity> query = atomActivityDao.queryBuilder().limit(500L)
                .orderBy(Activity.UPDATED_FIELD, true)
                .where().between(Activity.UPDATED_FIELD, from, to).prepare();
        return new Result(atomActivityDao.query(query), updated);
    }

    public JsonResult getJsonEntriesBetween(long from, long to) throws SQLException {
        long updated = System.currentTimeMillis();
        PreparedQuery<Activity> query = jsonActivityDao.queryBuilder().limit(500L)
                .orderBy(Activity.UPDATED_FIELD, true)
                .where().between(Activity.UPDATED_FIELD, from, to).prepare();
        return new Result(jsonActivityDao.query(query), updated);
    }
}
In addition, I have two threads running that use the same database manager. Each thread writes to a different table. There are also threads that read from the database; a reading thread can read from any table.
I noticed in the ConnectionSource documentation that it is not thread safe.
My question is: should I synchronize the functions that write to the database?
Would the answer to my question be different if both write threads were to write to the same table?
I noticed in the ConnectionSource documentation that it is not thread safe.
Right but you are using the JdbcPooledConnectionSource which is thread-safe.
Should I synchronize the functions that write to the database?
You shouldn't have a problem with ORMLite doing this. However, you need to make sure that your database supports multiple concurrent database updates. For example, you won't have a problem if you are using MySQL, Postgres, or Oracle. You'll need to read up on H2 multithreading to see what options you will need to use to get that to work.
Would the answer to my question be different if both write threads were to write to the same table?
That would increase the concurrency so (uh) maybe? Again it depends on the database type.
You may use a connection pool for multithreaded work with ORMLite; see the JdbcPooledConnectionSource javadoc. A minimal sketch follows.
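For illustration, this sketch configures ORMLite's JdbcPooledConnectionSource; the JDBC URL and the tuning values are placeholders, not from the original post.
static ConnectionSource createPooledSource() throws Exception {
    // Sketch only: the JDBC URL and tuning values are placeholders.
    JdbcPooledConnectionSource pooled =
            new JdbcPooledConnectionSource("jdbc:h2:mem:activities");
    pooled.setMaxConnectionsFree(5);               // keep up to 5 idle connections open
    pooled.setCheckConnectionsEveryMillis(30000);  // test idle connections every 30 seconds
    pooled.setTestBeforeGet(true);                 // validate a connection before handing it out
    return pooled;
}
DAOs created from this single shared source (for example via DaoManager.createDao(pooled, JsonActivity.class)) can then be used from the reader and writer threads, and the source is closed once on shutdown.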
