How to save Spark's MatrixFactorizationModel recommendProductsForUsers to HBase - apache-spark

I am new to Spark and I want to save the output of recommendProductsForUsers to an HBase table. I found an example (https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/) showing how to use JavaPairRDD and saveAsNewAPIHadoopDataset to save.
How can I convert JavaRDD<Tuple2<Object, Rating[]>> to JavaPairRDD<ImmutableBytesWritable, Put> so that I can use saveAsNewAPIHadoopDataset?
// Load the trained model from HDFS
MatrixFactorizationModel sameModel = MatrixFactorizationModel.load(jsc.sc(), trainedDataPath);
// Get recommendations for all users
JavaRDD<Tuple2<Object, Rating[]>> ratings3 = sameModel.recommendProductsForUsers(noOfProductsToReturn).toJavaRDD();

By using mapToPair. Adapting the example from the same source you provided (I changed the types by hand):
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = javaRDD.mapToPair(
    new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> row) throws Exception {
            // Row key: the user id from the tuple.
            Put put = new Put(Bytes.toBytes(String.valueOf(row._1())));
            // Example: store the first recommended product and its score as two columns.
            Rating[] ratings = row._2();
            put.addColumn(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQualifier1"), Bytes.toBytes(String.valueOf(ratings[0].product())));
            put.addColumn(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQualifier2"), Bytes.toBytes(String.valueOf(ratings[0].rating())));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
It goes like this: you create a new instance of Put, supplying it with the row key in the constructor, then for each column you call addColumn, and finally you return the Put you created.

This is how I solved the above problem; I hope it will be helpful to someone.
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts1 = ratings3
    .mapToPair(new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> arg0)
                throws Exception {
            Rating[] userAndProducts = arg0._2;
            System.out.println("***********" + userAndProducts.length + "**************");
            // Row key: the user id from the tuple.
            Put put = new Put(Bytes.toBytes(String.valueOf(arg0._1)));
            String recommendedProduct = "";
            for (Rating r : userAndProducts) {
                // Some logic here to convert Ratings into the appropriate column value,
                // e.g. recommendedProduct = String.valueOf(r.product());
            }
            put.addColumn(Bytes.toBytes("recommendation"), Bytes.toBytes("product"), Bytes.toBytes(recommendedProduct));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
System.out.println("*********** Number of records in JavaPairRdd: "+ hbasePuts1.count() +"**************");
hbasePuts1.saveAsNewAPIHadoopDataset(newApiJobConfig.getConfiguration());
jsc.stop();
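For reference, the newApiJobConfig used above is the Hadoop Job that tells saveAsNewAPIHadoopDataset which HBase table to write to. Here is a minimal sketch of that setup, following the TableOutputFormat approach from the linked example; the table name "user_recommendations" is a placeholder of mine, not from the question:
// Assumed imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.mapreduce.TableOutputFormat, org.apache.hadoop.mapreduce.Job
Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "user_recommendations"); // placeholder table name
Job newApiJobConfig = Job.getInstance(hbaseConf);
newApiJobConfig.setOutputFormatClass(TableOutputFormat.class);
newApiJobConfig.setOutputKeyClass(ImmutableBytesWritable.class);
newApiJobConfig.setOutputValueClass(Put.class);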

We just open-sourced Splice Machine, and we have examples integrating MLlib with querying and storage into Splice Machine. I do not know if this will help, but I thought I would let you know.
http://community.splicemachine.com/use-spark-libraries-splice-machine/
Thanks for the post, very cool.

Related

Why am I getting blank records when using PXDatabase.GetSlot to retrieve data from a table?

I set up this IPrefetchable class because I needed to store a set of data in memory for quick access and wanted it returned as a dictionary. I'm using PXDatabase.GetSlot to save data from INItemLotSerial into a PXDatabase slot; however, when dbRecords selects the records from it, the field values (LotSerialNbr, InventoryID, etc.) are null, although it does return the correct count of records. This is my first time using PXDatabase.GetSlot, so perhaps I'm missing something fairly simple.
Any suggestions would be greatly appreciated. Thank you.
class INItemLotSerialFetcher : IPrefetchable
{
    private Dictionary<string, int?> _availlist = new Dictionary<string, int?>();

    public void Prefetch()
    {
        // read database
        var dbRecords = PXDatabase.Select<INItemLotSerial>();
        // add results to Dictionary
        foreach (var rec in dbRecords)
        {
            if (rec.LotSerialNbr != null)
            {
                _availlist.Add(rec.LotSerialNbr, rec.InventoryID);
            }
        }
    }

    public static Dictionary<string, int?> GetINList()
    {
        var def = GetSlot();
        return def._availlist;
    }

    private static INItemLotSerialFetcher GetSlot()
    {
        return PXDatabase.GetSlot<INItemLotSerialFetcher>("INserialFetcherSlot", typeof(INItemLotSerial));
    }
}
Try this change, based on an example from Sergey Marenich:
public static Dictionary<string, int?> GetINList()
{
    return PXDatabase.GetSlot<INItemLotSerialFetcher>("INserialFetcherSlot", typeof(INItemLotSerial))._availlist;
}

How to output the content of JavaMapWithStateDStream to a text file?

Hi all. I have two questions about a Spark Streaming application.
The first one is how to output the content of JavaMapWithStateDStream to a text file. I went through the API documentation and found that it implements the JavaDStreamLike interface, so I used the following code to try to output the content:
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
    new Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> call(String word, Optional<Integer> one,
                State<Integer> state) {
            int sum = one.or(0) + (state.exists() ? state.get() : 0);
            Tuple2<String, Integer> output = new Tuple2<>(word, sum);
            state.update(sum);
            return output;
        }
    };
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    adCounts.mapWithState(StateSpec.function(mappingFunc));
stateDstream.print();
stateDstream.foreachRDD(new Function<JavaRDD<Tuple2<String, Integer>>, Void>() {
    @Override
    public Void call(JavaRDD<Tuple2<String, Integer>> rdd) throws Exception {
        rdd.saveAsTextFile("/path/to/hdfs");
        return null;
    }
});
However, nothing is output to the HDFS path, but I can see the print results in the console.
Please tell me what's wrong. How can I output the content of JavaMapWithStateDStream?
Second question:
I want to update the real-time result every batch duration, even if no new data is flowing in. How can I implement that?
Thanks.
I found out the reason why JavaMapWithStateDStream can print something but not save it to a text file: since it is updated/initialized every batch duration, newly arrived data is overwritten by the next initialization, so nothing ends up saved to the text file.
The workaround is to declare a new variable that copies the value of stateDstream.
I use DStream here; I think JavaPairDStream should also be fine.
DStream<Tuple2<String, Integer>> fin_Counts = stateDstream.dstream();
fin_Counts.print();
fin_Counts can be updated and saved.
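For the second question, one option worth trying (this is a sketch of my own, not from the original post, assuming the Spark 1.6 mapWithState API) is stateSnapshots(), which emits the full key/state table on every batch interval even when no new records arrive, so it can be written out each duration:
JavaPairDStream<String, Integer> snapshots = stateDstream.stateSnapshots();
snapshots.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd) throws Exception {
        // Write each batch to its own directory; reusing a fixed path would fail once it exists.
        rdd.saveAsTextFile("/path/to/hdfs/state-" + System.currentTimeMillis());
        return null;
    }
});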

Cassandra queries not having any effect

I'm running a bunch of queries one after the other, but it seems like some queries have no effect, even though no errors are thrown, unless I restart the session after each query. I'm using the DataStax Cassandra driver for this.
Here are the queries, which I'm storing in a file separated by ####.
DROP KEYSPACE if exists test_space;
####
CREATE KEYSPACE test_space WITH replication = {'class': 'NetworkTopologyStrategy','0':'2'};
####
CREATE TABLE test_space.fr_core (
frid text PRIMARY KEY,
attributes text,
pk1 text,
pk2 text,
pk3 text,
pk4 text,
pk5 text,
pk6 text
);
####
Here's the code for executing the above statements :
public class CassandraKeyspaceDelete {
    public static void main(String[] args) {
        try {
            new CassandraKeyspaceDelete().run();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public void run() {
        // Get file from resources folder
        ClassLoader classloader = Thread.currentThread().getContextClassLoader();
        InputStream is = classloader.getResourceAsStream("create_keyspace.txt");
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder out = new StringBuilder();
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                out.append(line);
            }
            // read from input stream
            reader.close();
        } catch (Exception e) {
            System.out.println("Error reading keyspace creation script.");
            return;
        }
        com.datastax.driver.core.Session readSession = CassandraManager.connect("12.10.1.122", "", "READ");
        String[] selectStmnts = out.toString().split("####");
        for (String selectStmnt : selectStmnts) {
            System.out.println("" + selectStmnt.trim());
            if (selectStmnt.trim().length() > 0) {
                ResultSet res = readSession.execute(selectStmnt.trim());
            }
            // readSession.close();
            if (readSession.isClosed()) {
                readSession = CassandraManager.connect("12.10.1.122", "", "READ");
            }
        }
        System.out.println("Done");
        return;
    }
}
Here's the CassandraManager class :
public class CassandraManager {
    static Cluster cluster;
    public static Session session;
    static PreparedStatement statement;
    static BoundStatement boundStatement;
    public static HashMap<String, Session> sessionStore = new HashMap<String, Session>();
    public static Session connect(String ip, String keySpace, String type) {
        PoolingOptions poolingOpts = new PoolingOptions();
        poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
        poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 400);
        poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
        poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
        cluster = Cluster
                .builder()
                .withPoolingOptions(poolingOpts)
                .addContactPoint(ip)
                .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                .withReconnectionPolicy(new ConstantReconnectionPolicy(100L)).build();
        Session s = cluster.connect();
        return s;
    }
}
When I run this, the first two CQL queries run without errors. When the third one runs, I get an error saying Keyspace test_space doesn't exist.
If I uncomment readSession.close(), all the queries execute, though each time the session is closed and then reopened, resulting in slow execution.
Why aren't the queries working unless the session is restarted after each query?
I created a new project and tried your code in my Cassandra sandbox. It worked with four changes:
My datacenter is defined as "DC1", so the replication factor I used for the test_space keyspace was {'class': 'NetworkTopologyStrategy','DC1':'1'};
My sandbox instance is secured, so I had to use .withCredentials in the Cluster.builder
I couldn't get getResourceAsStream to work, so I replaced that with a FileInputStream instead.
I moved readSession.close(); outside of the for loop.
Based on the fact that it worked on mine, I can't speak to the behaviour that you are seeing, so I will offer a few observations:
Is your datacenter really named 0? Your keyspace replication factor {'class': 'NetworkTopologyStrategy','0':'2'} is telling Cassandra to put two replicas in the 0 datacenter. If that really is the case, you should make your datacenter name something a little more intuitive.
None of the statements in your text file return a result set. So doing this ResultSet res = readSession.execute(selectStmnt.trim()); really doesn't get you anything.
Given the name of your keyspace, I can only assume that you are testing some things out. So how do you know that you need all of these options on your cluster builder? My advice to you, is to start simple. Don't add the other options unless you know that you need them, and more importantly, what they do.
cluster = Cluster.builder()
.addContactPoint(ip)
.build();
Session s = cluster.connect();
Make sure that your readSession.close(); is outside of your for loop.
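To make that concrete, here is a sketch of the execution loop from the question with the session created once and closed only after all the statements have run:
Session readSession = CassandraManager.connect("12.10.1.122", "", "READ");
for (String selectStmnt : out.toString().split("####")) {
    if (selectStmnt.trim().length() > 0) {
        readSession.execute(selectStmnt.trim());
    }
}
readSession.close();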
Something else that might help you, is to read through Things You Should Be Doing When Using Cassandra Drivers by DataStax's Rebecca Mills.

Add Geolocation filter to twitter stream through spark streaming in Java

I want only the tweets related to a particular geolocation. After googling around, I found that this can be achieved by adding extra methods/functionality to the TwitterUtils and TwitterInputDStream classes, but I am unable to do so as these are final classes.
How can we achieve this?
Thanks in advance.
This is the best answer I could find to do this when creating the stream. However, with this filter, we are filtering the tweets after receiving them. I think there is a way to filter before receiving them by supplying the filter to the Twitter API (see the sketch after this example). This code has been verified to work and comes from:
http://www.michael-goettsche.de/?p=19#return-note-19-4
JavaDStream<Status> tweetsWithLocation = twitterStream.filter(
    new Function<Status, Boolean>() {
        public Boolean call(Status status) {
            if (status.getGeoLocation() != null) {
                return true;
            } else {
                return false;
            }
        }
    }
);
JavaDStream<String> statuses = tweetsWithLocation.map(
    new Function<Status, String>() {
        public String call(Status status) {
            return status.getGeoLocation().toString() + ": " + status.getText();
        }
    }
);
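As for filtering before receiving (mentioned above), here is a sketch of the idea using twitter4j's FilterQuery; the bounding-box coordinates are placeholders, and the custom receiver that would pass the query to the streaming API is not shown:
FilterQuery locationFilter = new FilterQuery();
// South-west and north-east corners of the bounding box (placeholder values).
locationFilter.locations(new double[][] { { -122.75, 36.8 }, { -121.75, 37.8 } });
// A custom receiver (in place of TwitterUtils.createStream) would call
// twitterStream.filter(locationFilter) so only geo-tagged tweets inside the box are received.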

Storing a reference to an object to be

I don't know if this is possible at all so this is a shot in the dark.
Anyhow...
Consider having the following model:
class Model
{
    public List<string> TheList = null;
}
The List is set to null on purpose.
var model = new Model();
command.RegisterInData( model => model.TheList ); // TheList is null at this point
model.TheList = new List<string>();
model.TheList.Add("A value here");
command.Execute(); // <-- Here I want to access the new list somehow
As said, I don't know if anything like this is possible but I would like a push in the right direction.
The desired functionality: I would like to tell the command where to put the result before I have a concrete object.
Thanks in advance
This seems quite doable. Here is a variation with an even simpler accessor:
class Command
{
    private Func<List<string>> listAccessor;
    public void RegisterInData(Func<List<string>> listAccessor)
    {
        this.listAccessor = listAccessor;
    }
    public void Execute()
    {
        var list = this.listAccessor();
        foreach (string s in list)
        {
            Console.WriteLine(s);
        }
    }
}
// Elsewhere
var model = new Model();
command.RegisterInData(() => model.TheList);
model.TheList = new List<string>();
model.TheList.Add("A value here");
command.Execute();
You'll probably want error handling for the case where RegisterInData is not called before Execute, but you get the idea.
You simply have to delay calling the delegate passed to RegisterInData and invoke it (I guess) in Execute.
Could Lazy<T> be of use here? http://msdn.microsoft.com/en-us/library/dd642331.aspx
