I'm developing a classifier for text categorisation using the Weka Java libraries. I've extracted a number of features using Stanford's CoreNLP package, including a dependency parse of the text, which returns triplets of the form "(rel, head, mod)".
I want to use the dependency triplets returned from this as features for classification, but I cannot figure out how to properly represent them in the ARFF file. Basically, I'm stumped: each instance has an arbitrary number of dependency triplets, so I can't define them explicitly in the attributes, for example:
@attribute entityCount numeric
@attribute depTriple_1 string
@attribute depTriple_2 string
.
.
@attribute depTriple_n string
Is there a particular way to go about this? I've spent the better part of the day searching and have not found anything yet.
Thanks a lot for reading.
Extracted from the Weka Wiki (note that this example uses the older Weka API; in Weka 3.7+ FastVector is deprecated in favour of java.util.ArrayList and instances are created as DenseInstance):
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
/**
* Generates a little ARFF file with different attribute types.
*
* @author FracPete
*/
public class SO_Test {
public static void main(String[] args) throws Exception {
FastVector atts;
FastVector attsRel;
FastVector attVals;
FastVector attValsRel;
Instances data;
Instances dataRel;
double[] vals;
double[] valsRel;
int i;
// 1. set up attributes
atts = new FastVector();
// - numeric
atts.addElement(new Attribute("att1"));
// - nominal
attVals = new FastVector();
for (i = 0; i < 5; i++)
attVals.addElement("val" + (i+1));
atts.addElement(new Attribute("att2", attVals));
// - string
atts.addElement(new Attribute("att3", (FastVector) null));
// - date
atts.addElement(new Attribute("att4", "yyyy-MM-dd"));
// - relational
attsRel = new FastVector();
// -- numeric
attsRel.addElement(new Attribute("att5.1"));
// -- nominal
attValsRel = new FastVector();
for (i = 0; i < 5; i++)
attValsRel.addElement("val5." + (i+1));
attsRel.addElement(new Attribute("att5.2", attValsRel));
dataRel = new Instances("att5", attsRel, 0);
atts.addElement(new Attribute("att5", dataRel, 0));
// 2. create Instances object
data = new Instances("MyRelation", atts, 0);
// 3. fill with data
// first instance
vals = new double[data.numAttributes()];
// - numeric
vals[0] = Math.PI;
// - nominal
vals[1] = attVals.indexOf("val3");
// - string
vals[2] = data.attribute(2).addStringValue("This is a string!");
// - date
vals[3] = data.attribute(3).parseDate("2001-11-09");
// - relational
dataRel = new Instances(data.attribute(4).relation(), 0);
// -- first instance
valsRel = new double[2];
valsRel[0] = Math.PI + 1;
valsRel[1] = attValsRel.indexOf("val5.3");
dataRel.add(new Instance(1.0, valsRel));
// -- second instance
valsRel = new double[2];
valsRel[0] = Math.PI + 2;
valsRel[1] = attValsRel.indexOf("val5.2");
dataRel.add(new Instance(1.0, valsRel));
vals[4] = data.attribute(4).addRelation(dataRel);
// add
data.add(new Instance(1.0, vals));
// second instance
vals = new double[data.numAttributes()]; // important: needs NEW array!
// - numeric
vals[0] = Math.E;
// - nominal
vals[1] = attVals.indexOf("val1");
// - string
vals[2] = data.attribute(2).addStringValue("And another one!");
// - date
vals[3] = data.attribute(3).parseDate("2000-12-01");
// - relational
dataRel = new Instances(data.attribute(4).relation(), 0);
// -- first instance
valsRel = new double[2];
valsRel[0] = Math.E + 1;
valsRel[1] = attValsRel.indexOf("val5.4");
dataRel.add(new Instance(1.0, valsRel));
// -- second instance
valsRel = new double[2];
valsRel[0] = Math.E + 2;
valsRel[1] = attValsRel.indexOf("val5.1");
dataRel.add(new Instance(1.0, valsRel));
vals[4] = data.attribute(4).addRelation(dataRel);
// add
data.add(new Instance(1.0, vals));
// 4. output data
System.out.println(data);
}
}
Your problem specifically involves a "relational" attribute; the code above shows how such an attribute is set up and filled.
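For reference, the ARFF this program prints looks roughly like the following (abbreviated). Note how the relational attribute nests its own @attribute declarations between "@attribute att5 relational" and "@end att5", and how each bag in the data section is a single quoted value whose inner instances are separated by \n:
@relation MyRelation
@attribute att1 numeric
@attribute att2 {val1,val2,val3,val4,val5}
@attribute att3 string
@attribute att4 date yyyy-MM-dd
@attribute att5 relational
@attribute att5.1 numeric
@attribute att5.2 {val5.1,val5.2,val5.3,val5.4,val5.5}
@end att5
@data
3.141593,val3,'This is a string!',2001-11-09,'4.141593,val5.3\n5.141593,val5.2'
2.718282,val1,'And another one!',2000-12-01,'3.718282,val5.4\n4.718282,val5.1'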
Alright, I did it! Just posting this as an answer in case anyone else has a similar problem. Previously I was following the guide found on the Weka Wiki (as posted above by Rushdi), but I had a lot of trouble following it, as the guide creates static instances of the relational attribute, whereas I required dynamic declarations of an arbitrary number of them. So I re-evaluated how I was generating the attributes, and I managed to get it to work with slight changes to the above guide:
//1. Set up attributes
FastVector atts;
FastVector relAtts;
Instances relData;
atts = new FastVector();
//Entity Count - numeric
atts.addElement(new Attribute("entityCount"));
//Dependencies - Relational (Multi-Instance)
relAtts = new FastVector();
relAtts.addElement(new Attribute("depTriplet", (FastVector) null));
relData = new Instances("depTriples", relAtts, 0);
atts.addElement(new Attribute("depTriples", relData, 0));
atts.addElement(new Attribute("postTxt", (FastVector) null));
//2. Create Instances Object
Instances trainSet = new Instances("MyName", atts, 0);
/* 3. Fill with data:
Loop through text docs to extract features
and generate instance for train set */
//Holds the relational attribute instances
Instances relAttData;
for(Object doc: docList) {
List<String> depTripleList = getDepTriples(doc);
int entCount = getEntityCount(doc);
String pt = getText(doc);
//Create instance to be added to training set
Instance tInst = new Instance(trainSet.numAttributes());
//Entity count
tInst.setValue( (Attribute) atts.elementAt(0), entCount);
//Generate Instances for relational attribute
relAttData = new Instances(trainSet.attribute(1).relation(), 0);
//For each deplist entry, create an instance and add it to dataset
for(String depTriple: depTripleList) {
Instance relAttInst = new Instance(1);
relAttInst.setDataset(relAttData);
relAttInst.setValue(0, depTriple);
relAttData.add(relAttInst);
}
//Add relational attribute (now filled with its instances) to the main instance
tInst.setValue( (Attribute) atts.elementAt(1), trainSet.attribute(1).addRelation(relAttData));
//Post text - string (fills the postTxt attribute declared above)
tInst.setValue( (Attribute) atts.elementAt(2), pt);
//Finally, add the instance to the training set
trainSet.add(tInst);
}
//4. Output data
System.out.println(trainSet);
I realise this could probably be done differently, but this works well with my situation. Please keep in mind this is not my actual code, but an excerpt of multiple parts stitched together to demonstrate the process used to fix the problem.
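As a side note, if you would rather write the generated set to disk than print it, Weka's ArffSaver can do that; a minimal sketch (the file name here is arbitrary):
import java.io.File;
import weka.core.converters.ArffSaver;
// Write the Instances built above to an ARFF file
ArffSaver saver = new ArffSaver();
saver.setInstances(trainSet);
saver.setFile(new File("train.arff"));
saver.writeBatch();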
Related
I am writing a tool in Node.js and want to define some POJOs in it. I don't have much experience with Node.js; I come from a Java background, where classes are used to define entities. This is one way in which I define entities now:
function Person(name) {
this.name = name;
this.values = [];
this.characteristics = {};
}
But this is defined in one JS file, and to make it available in other JS files I have to export this function. Is this the best way to define entities, or is there another way to define something in a class-like format?
That is just fine for creating objects. If you start to use a DB like Mongo, you might be better off creating objects with Mongoose, but that's personal preference as well. As for your example:
1) Export Person
module.exports = Person;
2) Import Person from another file
const Person = require('../path/to/Person');
3) Create Person with the new keyword to call the constructor (very important)
const mitch = new Person('Mitch');
You should also read up on JavaScript's prototype mechanism. Every object holds a reference to a prototype (ultimately Object.prototype). You can create objects with Object.create(obj), which assigns the object you pass in as the new object's prototype.
Here's an example from MDN
// Shape - superclass
function Shape() {
this.x = 0;
this.y = 0;
}
// superclass method
Shape.prototype.move = function(x, y) {
this.x += x;
this.y += y;
console.info('Shape moved.');
};
// Rectangle - subclass
function Rectangle() {
Shape.call(this); // call super constructor.
}
// subclass extends superclass
Rectangle.prototype = Object.create(Shape.prototype);
Rectangle.prototype.constructor = Rectangle;
var rect = new Rectangle();
console.log('Is rect an instance of Rectangle?',
rect instanceof Rectangle); // true
console.log('Is rect an instance of Shape?',
rect instanceof Shape); // true
rect.move(1, 1); // Outputs, 'Shape moved.'
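On recent Node.js versions you also have ES6 class syntax, which gives you the class-like format you asked about directly; it is essentially sugar over the same prototype mechanism shown above:
// Person.js - same entity as in the question, as an ES6 class
class Person {
  constructor(name) {
    this.name = name;
    this.values = [];
    this.characteristics = {};
  }
}
module.exports = Person;
Importing and constructing work exactly as before: const mitch = new Person('Mitch');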
Here is my code
public static void save(IgniteContext igniteContext, String cacheName, Dataset<Row> dataSet) {
CacheConfiguration<BinaryObject, BinaryObject> cacheConfiguration = new CacheConfiguration<BinaryObject, BinaryObject>(cacheName)
.setAtomicityMode(CacheAtomicityMode.ATOMIC)
.setBackups(0)
.setAffinity(new RendezvousAffinityFunction(false, 2))
.setIndexedTypes(BinaryObject.class, BinaryObject.class);
IgniteCache<BinaryObject, BinaryObject> rddCache = igniteContext.ignite()
.getOrCreateCache(cacheConfiguration)
.withKeepBinary();
rddCache.clear();
IgniteRDD<BinaryObject, BinaryObject> igniteRDD = igniteContext.fromCache(cacheName);
StructField[] fields = dataSet.schema().fields();
RDD<BinaryObject> binaryObjectJavaRDD = dataSet.toJavaRDD().map(row -> {
BinaryObjectBuilder valueBuilder = igniteContext.ignite().binary().builder(BinaryObject.class.getCanonicalName());
for (int i = 0; i < fields.length; i++) {
valueBuilder.setField(fields[i].name(), convertValue(String.valueOf(row.get(i)), fields[i].dataType())); //convertValue converts value to specific datatype
}
return valueBuilder.build();
}).rdd();
igniteRDD.saveValues(binaryObjectJavaRDD);
}
I have a problem with the above code: even after successful completion of this method, the cache remains empty. The Dataset has 20 rows, so that is not the problem.
The other problem is that if I use the savePairs method from IgniteRDD, then I have to generate the key myself (here the key is a BinaryObject), so how do I do that?
Update:
void saveDFInPairs(IgniteContext igniteContext, Dataset<Row> dataSet, IgniteRDD<BinaryObject, BinaryObject> igniteRDD) {
StructField[] fields = dataSet.schema().fields();
JavaRDD<Tuple2<BinaryObject, BinaryObject>> rdd = dataSet.toJavaRDD().map(row -> {
BinaryObjectBuilder keyBuilder = igniteContext.ignite()
.binary().builder("TypeName");
keyBuilder.setField("id", row.mkString().hashCode());
BinaryObject key = keyBuilder.build();
BinaryObjectBuilder valueBuilder = igniteContext.ignite()
.binary().builder("TypeName");
for (int i = 0; i < fields.length; i++) {
valueBuilder.setField(fields[i].name(), convert(row, i, fields[i].dataType()));
}
BinaryObject value = valueBuilder.build();
return new Tuple2<>(key, value);
});
igniteRDD.savePairs(rdd.rdd(), true);
}
A couple of considerations:
The type name (the one passed to the builder() method) should be a meaningful name representing the data type. Do not use the BinaryObject class name for this.
setIndexedTypes(BinaryObject.class, BinaryObject.class) is incorrect. It should specify classes to be processed for query annotations. If you don't have classes, you can use QueryEntity to configure queries (see the sketch after this list). See this page for further details: https://apacheignite.readme.io/docs/sql-queries
Other than that, the code looks correct. I would recommend trying with default settings first and checking whether it works that way. Also, it's not very clear how you are checking that the data is in the cache.
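A rough sketch of the QueryEntity approach (the type and field names here are hypothetical; adjust them to your actual schema):
// Hypothetical type/field names; replace with your own.
CacheConfiguration<BinaryObject, BinaryObject> cfg =
        new CacheConfiguration<>(cacheName);
QueryEntity queryEntity = new QueryEntity("MyKeyType", "MyValueType");
LinkedHashMap<String, String> fields = new LinkedHashMap<>();
fields.put("id", Integer.class.getName());
fields.put("name", String.class.getName());
queryEntity.setFields(fields);
cfg.setQueryEntities(java.util.Collections.singletonList(queryEntity));
The types named here are the same strings you would pass to ignite.binary().builder(...) when building the keys and values.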
In the application I'm developing, I need to store data for Customers, Products and their Prices.
In order to persist that data I use RMS, but since RMS doesn't support serializing objects directly, and since the data I read already comes in JSON format, I store every JSONObject as its string version, like this:
rs = RecordStore.openRecordStore(mRecordStoreName, true);
JSONArray jsArray = new JSONArray(data);
for (int i = 0; i < jsArray.length(); i++) {
JSONObject jsObj = jsArray.getJSONObject(i);
String stringJSON = jsObj.toString();
addRecord(stringJSON, rs);
}
The addRecord method:
public int addRecord(String stringJSON, RecordStore rs) throws JSONException,RecordStoreException {
int id = -1;
byte[] raw = stringJSON.getBytes();
id= rs.addRecord(raw, 0, raw.length);
return id;
}
So I have three RecordStores (Customers, Products and their Prices), and for each of them I save the corresponding data as shown above.
I know this might be a possible solution, but I'm sure there's got to be a better implementation, especially considering that over those three "tables" I'm going to perform searching, sorting, etc.
In those cases, having to deserialize before searching or sorting doesn't seem like a very good idea.
That's why I want to ask you: in your experience, how do you store custom objects in RMS in a way that is easy to work with later?
I really appreciate all your comments and suggestions.
EDIT
It seems that it's easier to work with records when you define a fixed max length for each field. So here's what I tried:
1) First of all, this is the class I use to retrieve the values from the record store:
public class Customer {
public int idCust;
public String name;
public String IDNumber;
public String address;
}
2) This is the code I use to save every jsonObject to the record store:
RecordStore rs = null;
try {
rs = RecordStore.openRecordStore(mRecordStoreName, true);
JSONArray js = new JSONArray(data);
for (int i = 0; i < js.length(); i++) {
JSONObject jsObj = js.getJSONObject(i);
byte[] record = packRecord(jsObj);
rs.addRecord(record, 0, record.length);
}
} finally {
if (rs != null) {
rs.closeRecordStore();
}
}
The packRecord method:
private byte[] packRecord(JSONObject jsonObj) throws IOException, JSONException {
ByteArrayOutputStream raw = new ByteArrayOutputStream();
DataOutputStream out = new DataOutputStream(raw);
out.writeInt(jsonObj.getInt("idCust"));
out.writeUTF(jsonObj.getString("name"));
out.writeUTF(jsonObj.getString("IDNumber"));
out.writeUTF(jsonObj.getString("address"));
return raw.toByteArray();
}
3) This is how I pull all the records from the record store:
RecordStore rs = null;
RecordEnumeration re = null;
try {
rs = RecordStore.openRecordStore(mRecordStoreName, true);
re = rs.enumerateRecords(null, null, false);
while (re.hasNextElement()) {
Customer c;
int idRecord = re.nextRecordId();
byte[] record = rs.getRecord(idRecord);
c = parseRecord(record);
//Do something with the parsed object (Customer)
}
} finally {
if (re != null) {
re.destroy();
}
if (rs != null) {
rs.closeRecordStore();
}
}
The parseRecord method:
private Customer parseRecord(byte[] record) throws IOException {
Customer cust = new Customer();
ByteArrayInputStream raw = new ByteArrayInputStream(record);
DataInputStream in = new DataInputStream(raw);
cust.idCust = in.readInt();
cust.name = in.readUTF();
cust.IDNumber = in.readUTF();
cust.address = in.readUTF();
return cust;
}
This is how I implemented what Mister Smith suggested (I hope it's what he had in mind). However, I'm still not very sure how to implement the searches.
I almost forgot to mention: before I made these changes to my code, the size of my RecordStore was 229048 bytes; now it is only 158872 bytes :)
RMS is nothing like a database. You have to think of it as a record set, where each record is a byte array.
Because of this, it is easier to work with when you define a fixed max length for each field in the record. For instance, a record could hold some info about a player in a game (max level reached, score, player name, etc.). You could define the level field as 4 bytes long (an int), then a score field of 8 bytes (a long), then the name as a 100-byte field (a string). This is tricky because strings are usually of variable length, but you would probably like a fixed max length for this field, and if some string is shorter than that, you'd use a string terminator char to delimit it. (This example is actually bad because the string is the last field, so it would have been easier to keep it variable length. Just imagine you have several consecutive fields of type string.)
To help you with serialization/deserialization, you can use DataOutputStream and DataInputStream. With these classes you can read/write strings in UTF and they will insert the string delimiters for you. But this means that when you need a field, since you don't know exactly where it is located, you'll have to read the array up to that position first.
The advantage of fixed lengths is that you can later use a RecordFilter, and if you want to retrieve records of players that have reached a score greater than 10000, you can look at the score field at exactly the same position in every record (an offset of 4 bytes from the start of the byte array).
So it's a tradeoff. Fixed lengths means faster access to fields (faster searches), but potential waste of space. Variable lengths means minimum storage space but slower searches. What is best for your case will depend on the number of records and the kind of searches you need.
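As a minimal sketch of that kind of search (assuming the player layout described above: a 4-byte int level followed by an 8-byte long score), a RecordFilter could look like this:
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import javax.microedition.rms.RecordFilter;
public class HighScoreFilter implements RecordFilter {
    public boolean matches(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            in.readInt();                 // skip the 4-byte level field
            return in.readLong() > 10000; // score sits at a fixed offset of 4
        } catch (IOException e) {
            return false;
        }
    }
}
You would then pass it to the enumeration, e.g. rs.enumerateRecords(new HighScoreFilter(), null, false), in place of the null filter used in the code above.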
You can find a good collection of tutorials on the net. Just to name a few:
http://developer.samsung.com/java/technical-docs/Java-ME-Record-Management-System
http://developer.nokia.com/community/wiki/Persistent_Data_in_Java_ME
What I want is:
In the search method I will add an extra parameter, say a relevance parameter of type float, to set the relevance cutoff. So let's say the cutoff is 60%: I want only items that have higher than 60% relevance.
Here is the current code of the search.
Say the search text is a,
and in the Lucene file system I have the following descriptions:
1) abcdef
2) abc
3) abcd
For now it will fetch all three of the above documents; I want to fetch only those that have higher than 60% relevance.
//For now I am not using relevanceparam anywhere in the method:
public static string[] Search(string searchText,float relevanceparam)
{
//List of ID
List<string> searchResultID = new List<string>();
IndexSearcher searcher = new IndexSearcher(reader);
Term searchTerm = new Term("Text", searchText);
Query query = new TermQuery(searchTerm);
Hits hits = searcher.Search(query);
for (int i = 0; i < hits.Length(); i++)
{
float r = hits.Score(i);
Document doc = hits.Doc(i);
searchResultID.Add(doc.Get("ID"));
}
return searchResultID.ToArray();
}
Edit:
What if I set a boost on my query,
say query.SetBoost(1.6); is this equivalent to 60 percent?
You can easily do this by ignoring those hits that score less than TopDocs.MaxScore * minRelativeRelevance, where minRelativeRelevance should be a value between 0 and 1.
I've modified your code to match the 3.0.3 release of Lucene.Net, and added a FieldSelector to your call to IndexSearcher.Doc to avoid loading non-required fields.
Calling Query.SetBoost(1.6) would only mean that the score calculated by that query is boosted by 60% (multiplied by 1.6). It may change the ordering of the results if other queries are involved (in a BooleanQuery, for example), but it won't change which results are returned.
public static String[] Search(IndexReader reader, String searchText,
Single minRelativeRelevance) {
var resultIds = new List<String>();
var searcher = new IndexSearcher(reader);
var searchTerm = new Term("Text", searchText);
var query = new TermQuery(searchTerm);
var hits = searcher.Search(query, 100);
var minScore = hits.MaxScore * minRelativeRelevance;
var fieldSelector = new MapFieldSelector("ID");
foreach (var hit in hits.ScoreDocs) {
if (hit.Score >= minScore) {
var document = searcher.Doc(hit.Doc, fieldSelector);
var hitId = document.Get("ID");
resultIds.Add(hitId);
}
}
return resultIds.ToArray();
}
For some reason I need to save some big strings into user profiles. Because a property of type string has a limit of 400 characters, I decided to try the binary type (PropertyDataType.Binary), which allows a length of 7500. My idea is to convert the string that I have into binary and save it to the property.
I create the property using this code:
context = ServerContext.GetContext(elevatedSite);
profileManager = new UserProfileManager(context);
profile = profileManager.GetUserProfile(userLoginName);
Property newProperty = profileManager.Properties.Create(false);
newProperty.Name = "aaa";
newProperty.DisplayName = "aaa";
newProperty.Type = PropertyDataType.Binary;
newProperty.Length = 7500;
newProperty.PrivacyPolicy = PrivacyPolicy.OptIn;
newProperty.DefaultPrivacy = Privacy.Organization;
profileManager.Properties.Add(newProperty);
myProperty = profile["aaa"];
profile.Commit();
The problem is that when I try to assign a byte[] value to the property, I receive the error "Unable to cast object of type 'System.Byte' to type 'System.String'.". If I try to assign a string value, I receive "Invalid Binary Value: Input must match binary byte[] data type."
So my question is: how do I use this binary type?
The code that I have:
SPUser user = elevatedWeb.CurrentUser;
ServerContext context = ServerContext.GetContext(HttpContext.Current);
UserProfileManager profileManager = new UserProfileManager(context);
UserProfile profile = GetUserProfile(elevatedSite, currentUserLoginName);
UserProfileValueCollection myProperty= profile[PropertyName];
myProperty.Value = StringToBinary(GenerateBigString());
and the test functions:
private static string GenerateBigString()
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 750; i++) sb.Append("0123456789");
return sb.ToString();
}
private static byte[] StringToBinary(string theSource)
{
byte[] thebytes = new byte[7500];
thebytes = System.Text.Encoding.ASCII.GetBytes(theSource);
return thebytes;
}
Have you tried with smaller strings? Going max on the first test might hide other behaviors. When you inspect the generated string in the debugger, does it fit the requirements (a 7500-byte byte[])?
For those who are looking for an answer: you must use the Add method instead:
var context = ServerContext.GetContext(elevatedSite);
var profileManager = new UserProfileManager(context);
var profile = profileManager.GetUserProfile(userLoginName);
profile["MyPropertyName"].Add(StringToBinary("your cool string"));
profile.Commit();
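To read the value back later, the property value comes out as a raw byte[], which you can decode with the same encoding you used to save it (a sketch, assuming the ASCII encoding from the StringToBinary helper above):
// Retrieve the stored bytes and decode them back into a string
byte[] raw = (byte[])profile["MyPropertyName"].Value;
string original = System.Text.Encoding.ASCII.GetString(raw);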