I have been working on a clustering algorithm. I decided to use a HashMap to store the points, thinking that I can use the key as the cluster ID and the value as the point. I do a DFS-style search to identify the nearest points; my calculation-related work and all the looping over the data take place outside of the method in which I identify the clusters.
The intention of this clustering is that if a point belongs to the same cluster, its ID remains the same. What I want to find out is: once I enter a value into the HashMap, how can I add the next value under the same key without overwriting the previous one, and without using a loop?
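What I am picturing is one cluster ID holding a whole list of points, something like the sketch below (assuming Java 8 for computeIfAbsent; addToCluster is just a made-up helper name):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// one cluster ID -> every point that belongs to that cluster
private final Map<Integer, List<Double>> mapOfCluster = new HashMap<>();

private void addToCluster(int clusterId, double point) {
    // creates the list the first time a cluster ID is seen, then appends;
    // nothing is overwritten and no explicit loop is needed
    mapOfCluster.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(point);
}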
Here is what my method looks like; I took some of the algorithm's content out, since it is not really relevant to the question.
public void dfsNearest(double point) {
    double aPointInCluster = point;
    if (!cluster.contains(aPointInCluster)) {
        ...
        this.setNumOfClusters(this.getNumOfClusters() + 1);
        mapOfCluster.put(this.getNumOfClusters(), aPointInCluster);
        // after this I want to increase the index so no override happens
    }
    ...
    if (newNeighbor != 0.0) {
        cluster.add(newNeighbor);
        mapOfCluster.put(this.getNumOfClusters(), newNeighbor);
        // want to increase the index....
        ...
        if (!visitedMap.containsKey(newNeighbor)) {
            dfsNearest(newNeighbor);
        }
    }
    ...
}
Thanks for any suggestions; also, please let me know if the rest of the code is necessary to make a good decision. I just wanted to keep it simple.
With StormCrawler 2.3-SNAPSHOT, setting "maxDepth": 0 in urlfilters.json prevents the seeds from being injected into the ES index. Is that the expected behaviour? Or should it inject the seeds and do a closed crawl on the injected seeds only, with no links followed at all (which is what I was expecting)?
The launch looks fine, but the ES status index is empty.
See MaxDepthFilter: with a value of 0, everything gets filtered. Setting the filter to a value of 1 should do the trick; the seeds will be injected, but their links won't be followed.
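For reference, the MaxDepthFilter entry in urlfilters.json would look something like this (a sketch; check the class name and the surrounding entries against your own config):
{
  "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
  "name": "MaxDepthFilter",
  "params": {
    "maxDepth": 1
  }
}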
In MaxDepthFilter,
private String filter(final int depth, final int max, final String url) {
    // deactivate the outlink no matter what the depth is
    if (max == 0) {
        return null;
    }
    if (depth >= max) {
        LOG.debug("filtered out {} - depth {} >= {}", url, depth, max);
        return null;
    }
    return url;
}
It turns out that URLs need to have a depth of at most max-1 to be kept; to put it differently, the actual maximum depth is max-1. For example, with the filter set to 1, the seeds (depth 0) are kept but all of their outlinks are filtered.
I agree that this feels wrong and is slightly confusing.
I think this is due to the sequence in which the outlinks get filtered; typically this is done in the StatusEmitterBolt.
At the moment they first get filtered and then inherit their metadata from the parent's metadata. It is during that later step that their depth value gets incremented, and I suspect this is why we are doing the max-1 trick.
There probably was a reason why the filtering was done first and the metadata inheritance second, but it has been a while and I can't remember it. I would be happy to change the order (get the metadata, then filter) and make the depth filtering more intuitive. Could you please open an issue on GitHub so that we can discuss it there?
Thanks!
I have my data as follows:
{
  "key": "adasd",
  "col1": 23,
  "col2": 3
}
I want to see the results sorted in descending order of the ratio of col1/sum(col2)
where sum(col2) refers to the sum of col2 over all documents. I am a bit new to Cloudant, so I don't know the best way to approach this. I can think of a few options:
1. Create a new column for sum(col2) and keep updating it with each new value of col2.
2. For each record, also create a new column for col1/sum(col2). Then I can sort on this column.
3. Use views to calculate the ratio and the sum on the fly. This way I don't have to store new columns, plus I don't have to perform costly calculations on each update.
I tried to create a view, and the map function is easy enough:
function (doc) {
  emit(doc._id, {"col1_value": doc.col1, "col2_value": doc.col2});
}
but I am confused by the reduce template
function (keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    return values.length;
  }
}
I have no idea how to access the values of the two columns and aggregate them here. Is this even possible? Is there any other way to achieve the result I need?
Two comments:
Ordering by X/sum(Y) is the same as ordering by X (or by -X if sum(Y) is negative), because sum(Y) is the same constant for every document. So for ordering purposes, just order by X and save yourself a bunch of hassle.
Assuming you actually want to know the value of X/sum(Y), and not just order by it, there's no one-step way to accomplish this in CouchDB. The best I can think of is to create a map/reduce view that gives you the global sum(Y). Then you can fetch that sum with a simple query and do the math in your application when fetching your documents.
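A minimal sketch of such a view, reusing the field names from the question (_sum is a built-in reduce function in CouchDB/Cloudant):
// map function: emit every col2 value under a single constant key
function (doc) {
  if (doc.col2) {
    emit(null, doc.col2);
  }
}

// reduce function: just the built-in
_sum
Querying that view with ?reduce=true returns the global sum(col2) as a single row; the application can then divide each document's col1 by that number.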
I have the following data (example):
1 - "Value1A"
1 - "Value1B"
1 - "Value1C"
2 - "Value2A"
2 - "Value2B"
I'm using a multimap for the above data, such that key 1 has 3 values (Value1A, Value1B, Value1C) and key 2 has 2 values (Value2A, Value2B).
When I try to retrieve all the values for a given key using the get function, it works. But I want to get the key given the value; i.e., if I have "Value1C", I want to use it to look up its key, 1, from the multimap. Is this possible? If so, how? And if not, what can I use instead of a multimap to achieve this?
Thanks for the help
https://www.npmjs.com/package/multimap
It is not possible to do this with a single operation. You will need to choose between using some extra memory and consuming CPU resources.
Use more memory
In this case you store the data in a reverse mapping as well, so you will have another map that stores "Value1C" -> 1. This solution can cause consistency issues, since every operation will need to update both maps: the original one and the reverse one.
A basic example:
// insert: write to both maps so they stay in sync
map.set(1, "Value1C");
reverseMap.set("Value1C", 1);
// search: the reverse map gives the key, the original map gives all its values
console.log(map.get(reverseMap.get("Value1C")));
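One way to reduce the risk of the two maps drifting apart is to funnel every update through a small pair of helpers, as in this sketch (put and keyOf are made-up names; Multimap is the npm package linked above):
var Multimap = require('multimap');

var map = new Multimap();
var reverseMap = new Map();

function put(key, value) {
  // a single write path keeps both structures consistent
  map.set(key, value);
  reverseMap.set(value, key); // assumes each value belongs to exactly one key
}

function keyOf(value) {
  return reverseMap.get(value);
}

put(1, 'Value1C');
console.log(keyOf('Value1C')); // 1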
Use more CPU
In this case you will need to search through all the values, which is O(n). That is not good if your list is big, and it is even worse in a single-threaded environment like Node.js.
Check the code example below:
function findValueInMultiMap(map, value, callback) {
  map.forEachEntry(function (entry, key) {
    // entry is the array of values stored under this key
    for (var e in entry) {
      if (entry[e] === value) {
        callback(map.get(key));
      }
    }
  });
}

findValueInMultiMap(map, 'Value1C', function (values) {
  console.log(values);
});
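If you only ever need the first key that contains the value, a variant that returns the key instead of invoking a callback keeps things simpler (a sketch against the same npm multimap API; forEachEntry has no early exit, so it still visits every entry, but only the first match is kept):
function findKeyForValue(map, value) {
  var found = null;
  map.forEachEntry(function (entry, key) {
    if (found === null && entry.indexOf(value) !== -1) {
      found = key;
    }
  });
  return found; // null when the value is not present
}

console.log(findKeyForValue(map, 'Value1C')); // 1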
I am using Solr to search and index products from a database. Products have two interesting fields: a name and a description. Product names are normally unique, but sometimes contain common words that serve as a pre-description of the product; one example would be "UltraScrew - a motor-powered screwdriver". Names are generally much shorter than descriptions.
The problem is that when someone searches for a common term, documents that contain it in the name get an unwanted boost over those that contain it only in the description. This is because names are shorter, and even with length normalization applied, the effect is quite visible.
I was wondering if it is possible to filter terms out of the name, not with a dictionary of stop words, but based on the relative document frequency of the term. That is, if a term appears in more than 10% of the available documents, it should be ignored when the name field is queried. The description field should be left untouched.
Is this generally possible?
Maybe you could use your own similarity:
import org.apache.lucene.search.Similarity;

public class MySimilarity extends Similarity {

    @Override
    public float idf(int docFreq, int numDocs) {
        // ignore terms that appear in at least 10% of the documents
        float freq = ((float) docFreq) / ((float) numDocs);
        if (freq >= 0.1) return 0;
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    ...
}
and use that one instead of the default one.
You can set the similarity for an IndexSearcher at the Lucene level; see this other answer for how to do that.
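If you are running Solr on top of Lucene, the custom class can also be declared in schema.xml, along these lines (a sketch; com.example.MySimilarity is a placeholder for wherever you package the class):
<!-- schema.xml: replace the default similarity with the custom one -->
<similarity class="com.example.MySimilarity"/>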
I am not sure if I understood the question correctly, but you could run two separate queries. Pseudo code:
SearchResults nameSearchResults = search("name:X");
if (nameSearchResults.size() * 10 >= corpusSize) { // name-based search useless?
    return search("description:X"); // use description-based search
} else {
    return search("name:X description:X"); // search both fields
}
I have the following C# code:
private XElement BuildXmlBlob(string id, Part part, out int counter)
{
    // return some unique XML particular to the parameters passed
    // remember to increment the counter before returning
}
Which is called by:
var counter = 0;
result.AddRange(from rec in listOfRecordings
                from par in rec.Parts
                let id = GetId("mods", rec.CKey + par.UniqueId)
                select BuildXmlBlob(id, par, counter));
The above code samples are symbolic of what I am trying to achieve.
According to Eric Lippert, the out keyword and LINQ do not mix. OK, fair enough, but can someone help me refactor the above so it does work? A colleague at work mentioned accumulator and aggregate functions, but I am a novice at LINQ and my Google searches weren't bearing any real fruit, so I thought I would ask here :).
To Clarify:
I am counting the number of parts, and there could be any number of them each time the code is called. So every time the BuildXmlBlob() method is called, the resulting XML will have a unique element in it denoting the 'partNumber'.
So if the counter is currently on 7, that means we are processing the 7th part so far. The XML returned from BuildXmlBlob() will have that counter value embedded in it somewhere. That's why I need it to be passed in and incremented every time BuildXmlBlob() is called on each run-through.
If you want to keep this purely in LINQ and you need to maintain a running count for use within your queries, the cleanest way is to use the Select() overload that passes the current element's index into the projection.
In this case, it is cleaner to write a query that collects the inputs first, then use that overload to do the projection:
var inputs =
    from recording in listOfRecordings
    from part in recording.Parts
    select new
    {
        Id = GetId("mods", recording.CKey + part.UniqueId),
        Part = part,
    };
result.AddRange(inputs.Select((x, i) => BuildXmlBlob(x.Id, x.Part, i)));
Then you wouldn't need to use the out/ref parameter.
XElement BuildXmlBlob(string id, Part part, int counter)
{
    // implementation
}
Below is what I managed to figure out on my own:
result.AddRange(listOfRecordings.SelectMany(rec => rec.Parts, (rec, par) => new { rec, par })
    .Select(@t => new
    {
        @t,
        Id = GetStructMapItemId("mods", @t.rec.CKey + @t.par.UniqueId)
    })
    .Select((@t, i) => BuildPartsDmdSec(@t.Id, @t.@t.par, i)));
I used ReSharper to convert it into a method chain, which constructed the basics of what I needed, and then I simply tacked the Select statement on at the end.