StormCrawler: setting "maxDepth": 0 prevents ES seed injection

With StormCrawler 2.3-SNAPSHOT, setting "maxDepth": 0 in urlfilters.json prevents the seeds from being injected into the ES index. Is that the expected behaviour? Or should it inject the seeds and do a closed crawl on the injected seeds only, with no redirection at all (which is what I was expecting)?
The launch looks fine, but the ES status index is empty.

See MaxDepthFilter: with a value of 0, everything gets filtered. Setting the filter to a value of 1 should do the trick; the seeds will be injected but their links won't be followed.
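For reference, the corresponding entry in urlfilters.json would look something like this (a sketch; the class path below assumes the 2.x package layout and may differ across versions):

{
  "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
  "name": "MaxDepthFilter",
  "params": {
    "maxDepth": 1
  }
}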
In MaxDepthFilter,
private String filter(final int depth, final int max, final String url) {
    // deactivate the outlink no matter what the depth is
    if (max == 0) {
        return null;
    }
    if (depth >= max) {
        LOG.debug("filtered out {} - depth {} >= {}", url, depth, max);
        return null;
    }
    return url;
}
it turns out that a URL is kept only if its depth is at most max-1; to put it differently, the actual maximum depth is max-1 (see the toy driver below).
This feels not quite right and slightly confusing, I agree.
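As a quick illustration, here is a toy driver around the same comparison (hypothetical and for illustration only; it inlines the filter logic above rather than calling the real class):

public class MaxDepthDemo {
    static String filter(int depth, int max, String url) {
        if (max == 0) return null;          // 0 disables all outlinks
        return depth >= max ? null : url;   // kept only while depth <= max - 1
    }

    public static void main(String[] args) {
        System.out.println(filter(0, 1, "http://example.com/seed"));     // kept: seeds are injected
        System.out.println(filter(1, 1, "http://example.com/outlink"));  // null: their outlinks are dropped
        System.out.println(filter(1, 2, "http://example.com/outlink"));  // kept: max = 2 allows depth 1
    }
}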
I think this is due to the sequence in which the outlinks get filtered; often this is done in the StatusEmitterBolt.
At the moment they first get filtered and then inherit their metadata from the parent metadata. It is during that later step that their depth value gets incremented, and I suspect this is why we are doing the max-1 trick.
There probably was a reason why the filtering was done first and the metadata inheritance second, but it has been a while and I can't remember it. I would be happy to change the order (get the metadata, then filter) and change the depth filtering so that it is more intuitive. Could you please open an issue on GitHub so that we can discuss it there?
Thanks!

Related

What code instrument should be added to register each HTTP event in MeterRegistry with a specific tag & minute value? Event requests are in the millions

I need to analyse one HTTP event value, which should not be greater than 30 minutes, and 95% of events should fall into this bucket. If that fails, an alert should be sent.
My first concern is to get the right metrics into /actuator/prometheus.
Steps I took:
On every HTTP request event, I get one integer value called eventMinute.
Using Micrometer's MeterRegistry, I tried the code below:
// MeterRegistry meterRegistry ...
meterRegistry.summary("MINUTES_ANALYSIS", tags);
where the tag is EVENT_MINUTE, which receives some integer value on each HTTP event.
But this way it floods the metrics, because there are millions of events.
Please guide me; I am a beginner at this. Thanks!!
The simplest solution (which I would recommend you start with) would be to just create 2 counters:
int theThing = getTheThing(); // however you obtain the value to check
if (theThing > 30) {
    meterRegistry.counter("my.request.counter.abovethreshold").increment();
}
meterRegistry.counter("my.request.counter.total").increment();
You would increment the counter that matches your threshold and another that tracks all requests (or reuse another meter that does that for you).
Then it is simple to set up a chart or alarm:
(my_request_counter_total - my_request_counter_abovethreshold) / my_request_counter_total < .95
(I didn't test the code. It might need a tiny bit of tweaking)
You'll be able to do a similar thing with DistributionSummary by setting various SLOs (I'm not familiar enough with them to offer one), but start with something simple first; if it is sufficient, you won't need the extra complexity.
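For reference, an SLO-based summary might look roughly like this (an untested sketch; event.minutes and eventMinute are illustrative names, not from the thread):

import io.micrometer.core.instrument.DistributionSummary;

// A single SLO at 30 produces an le="30.0" histogram bucket, which can be
// divided by the +Inf bucket in Prometheus to check the 95% goal.
DistributionSummary eventMinutes = DistributionSummary.builder("event.minutes")
        .serviceLevelObjectives(30.0)
        .register(meterRegistry);
eventMinutes.record(eventMinute);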
There are a couple of ways to solve this problem.
1. Here is a function which receives a metric name, tags, and a value:
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Tag;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public void createOrUpdateHistogram(String metricName, Map<String, String> stringTags, double numericValue) {
    // convert the plain string map into Micrometer tags
    List<Tag> tags = stringTags.entrySet().stream()
            .map(e -> Tag.of(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
    DistributionSummary.builder(metricName)
            .tags(tags)
            // can enforce an SLO here if required
            .publishPercentileHistogram()
            .minimumExpectedValue(1.0D) // pick based on how you want your distribution bucketed
            .maximumExpectedValue(30.0D)
            .register(this.meterRegistry)
            .record(numericValue);
}
It then produces metrics like:
delta_bucket{mode="CURRENT",le="30.0",} 11.0
delta_bucket{mode="CURRENT",le="+Inf",} 11.0
Since the +Inf bucket also contains the values at or below the threshold, subtract the le="30.0" bucket from the le="+Inf" bucket to get the count above 30.
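For example (an untested Prometheus expression, using the bucket names above):
delta_bucket{mode="CURRENT",le="+Inf"} - delta_bucket{mode="CURRENT",le="30.0"}
gives the number of events that took longer than 30 minutes.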
Another way could be:
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public void createOrUpdateHistogram(String metricName, Map<String, String> stringTags, double numericValue) {
    List<Tag> tags = stringTags.entrySet().stream()
            .map(e -> Tag.of(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
    Timer.builder(metricName)
            .tags(tags)
            .publishPercentiles(0.5D, 0.95D)
            .publishPercentileHistogram()
            .serviceLevelObjectives(Duration.ofMinutes(30L))
            .minimumExpectedValue(Duration.ofMinutes(30L))
            .maximumExpectedValue(Duration.ofMinutes(30L))
            .register(this.meterRegistry)
            .record((long) numericValue, TimeUnit.MINUTES);
}
This will only have two le buckets, the given time and +Inf.
It can be changed based on your requirements, and it also gives you quantiles.

TypeScript Multi-Dimensional Array Values Not Updating (to null)

What I am Doing
I am trying to create a Sudoku solver and generator in Vue. Right now I have the solving algorithm set up and just need to generate new problems. I generate problems by creating a completed Sudoku grid (this part works, with no bugs), then removing nodes so that there is still only one solution to the problem.
The Problem
When I try to access a node in the multi-dimensional array that represents the board and change it to null (which I use to display a blank node), the board does not update that value. I am changing it with the following code: newGrid[pos[0]][pos[1]] = null; (where pos[0] is the row, pos[1] is the column, and newGrid is the grid we want to mutate). Note that the array is an array containing 9 arrays, each of which holds 9 numbers (or null) representing the values for that position in the grid. To elaborate on the bug: if I put a console.log(newGrid) there, it shows normal-looking values and no null.
What I Know and Have Tried
I know it has to do with this specific line, and with the fact that I am setting the value to null, because changing null to another value (e.g. newGrid[pos[0]][pos[1]] = 0;) works and changes the array. The reasons I don't just use a value other than null are: null renders as nothing while other values such as 0 render as something (null nodes should be blank), null is simple to understand in this situation (the node has null, the node has nothing, the node is blank), and null is already used throughout my codebase.
Additionally, if I use console.log(newGrid[pos[0]][pos[1]]), null (the correct output) is printed, even though console.log(newGrid) shows a number there, not null. Also, oddly enough, this works for one specific node: in row 1 (indexing starts at 0), column 8, null is set, and even though the input (completed) grid is always different, this node is always set to null. Edit: that oddity was because the input grid already had null there, so in fact no nulls get set at all.
To summarize: I expect an array with null in the few positions I update, but I get a number instead. Also, there are no errors when the TypeScript compiles to JavaScript or at runtime.
Code
Given that I am not exactly sure where the problem may be (maybe I create the array incorrectly, for instance), I am including the minimal code here, with a pastebin link to the whole file. To restate, the goal of this function is to remove nodes from the grid (by replacing them with null) in order to create a Sudoku puzzle with one solution. The code on Stack Overflow only includes part of the file; the pastebin link includes the rest.
//global.d.ts
type Nullable<T> = T | null;
type Grid = Array<Array<number | null>>;

import { Solver } from './Solve';

// Inside the function that does the main work
const rowLen: number = grid.length;
const colLen: number = grid[0].length;
let newGrid: Grid = grid; // grid is an argument of this function
let fullNodes = GetFirstFull(grid, colLen, rowLen);
let fullNodesLen: number = fullNodes.length;
// Some stuff that figures out how many solutions there are (we only want 1) is excluded
if (solutions != 1) {
    fullNodesLen++;
    rounds--;
} else {
    newGrid[pos[0]][pos[1]] = null;
}
Note that if anything seems confusing, check out the pastebin or ask. Thank you so much for taking the time to look at my problem!
Also, it isn't just 0 that works; undefined is also set correctly. So this problem seems to be something about the null keyword specifically...
EDIT:
Given that no one has responded yet, I assume my problem is a bit hard, there isn't enough information, my post isn't good quality, or not enough people have seen it. To address the lack of information, I would like to include the function that calls this one (just in case it is related):
generate(context: ActionContext<State, any>) {
    let emptyArray = new Array(9);
    for (let i = 0; i < 9; ++i)
        emptyArray[i] = [null, null, null, null, null, null, null, null, null];
    const fullGrid = Solver(emptyArray);
    const puzzle = fullGrid ? Remover(fullGrid, 6) : state.gridLayout;
    context.commit('resetBoard', puzzle);
},
Note: if you aren't familiar with Vuex, context.commit changes the state (a global store state rather than a component's state). Given that this function isn't refactored or particularly easy to read in the first place, if you have any questions, please ask.
To address other potential concerns: I have been working on this for a while, and I have tried a lot of console.log()ing, changing the reference (newGrid) to a deep copy, moving code out of the if statements, verifying code execution, and changing the way the point on newGrid is set (e.g. using newGrid.map() with logic to return that point as null). If you have any questions or I can help at all, please ask.

CRM PlugIn Pass Variable Flag to New Execution Pipeline

I have records that have an index attribute to maintain their position in relation to each other.
I have a plugin that performs a renumbering operation on these records when an index is changed or a new record is created. There are specific rules that apply to items at the first and last positions in the list.
If a new (or existing, changed) item is inserted into the middle of the list (not technically the middle, just somewhere between start and end), a renumbering kicks off to make room for the record.
This renumbering process fires in a new execution pipeline: we are updating record D, and when I tell record E to change (to make room for D), that of course fires the plugin on the Update message.
This renumbering is fine until we reach the end of the list, where the plugin gets into a loop with the first business rule, which maintains the first and last records differently.
So I am trying to think of ways to pass a flag to the execution context spawned by the renumbering process, so that the recursion skips the boundary business rules if IsRenumbering == true.
My thoughts / ideas:
I have thought of using a Depth > 1 check, but that isn't a reliable value since I can't explicitly turn it on or off. It may happen to work, but that is not engineering a solid solution; it is hoping nothing goes bump. Further, a colleague far more knowledgeable than I said that when a workflow calls a plugin, the depth value is off and can't be trusted.
All my variables are scoped at the Execute level so as to avoid variable pollution at the class level. However, if I had a dictionary, tuple, or something similar at the class level, with one value being the thread ID and the other the flag value, then perhaps my subsequent execution context could check whether the same owning thread ID had any values entered.
Any thoughts or other ideas on how to pass context information to a new pipeline would be greatly appreciated.
Per Nicknow's suggestion I tried SharedVariables, but they seem to be going out of scope:
First time firing, post-op:
if (base.Stage == EXrmPluginStepStage.PostOperation)
{
    ...snip...
    foreach (var item in RenumberSet)
    {
        Context.ParentContext.SharedVariables[recordrenumbering] = "googly";
        Entity renumrec = new Entity("abcd") { Id = item.Id };
        #region We either add or subtract indexes based upon sortdir
        ...snip...
        renumrec["abc_indexfield"] = TmpIdx + 1;
        break;
        .....snip.....
        #endregion
        OrganizationService.Update(renumrec);
    }
}
Now we come into the pre-op stage of the recursion kicked off by the post-op OrganizationService.Update(renumrec) above, and based upon this check the shared variable didn't carry over...???
if (!Context.SharedVariables.Contains(recordrenumbering))
{
    //Trace.Trace("Null Set");
    //Context.SharedVariables[recordrenumbering] = IsRenumbering;
    Context.SharedVariables[recordrenumbering] = "Null Set";
}
Throwing an InvalidPluginExecutionException to dump diagnostics reveals:
Sanity Checks:
Depth : 2
Entity: ...
Message: Update
Stage: PreOperation [20]
User: 065507fe-86df-e311-95fe-00155d050605
Initiating User: 065507fe-86df-e311-95fe-00155d050605
ContextEntityName: ....
ContextParentEntityName: ....
....
IsRenumbering: Null Set
What you are looking for is IExecutionContext.SharedVariables. Whatever you add there is available throughout the entire transaction. Since you'll have child pipelines, you'll want to look at the ParentContext for the value. This can all get a little tricky, so be sure to do a lot of testing - I've run into many issues with SharedVariables and looping operations in Dynamics CRM.
Here is some sample (very untested) code to get you started.
public static bool GetIsRenumbering(IPluginExecutionContext pluginContext)
{
    var keyName = "IsRenumbering";
    var ctx = pluginContext;
    // walk up the chain of parent contexts until the flag is found
    while (ctx != null)
    {
        if (ctx.SharedVariables.Contains(keyName))
        {
            return (bool)ctx.SharedVariables[keyName];
        }
        ctx = ctx.ParentContext;
    }
    return false;
}

public static void SetIsRenumbering(IPluginExecutionContext pluginContext)
{
    var keyName = "IsRenumbering";
    pluginContext.SharedVariables.Add(keyName, true);
}
A very simple solution: add a bit field to the entity called DisableIndexRecalculation. When your first plugin runs, set that field to true on all of your updates. In the same plugin, check whether DisableIndexRecalculation is true: if so, set it to null (by removing it from the Target entity entirely) and stop executing the plugin; if it is null, do your index recalculation.
Because you immediately remove the field from the Target entity when it is true, the value will never be persisted to the database, so there is no performance penalty.

Setting a df threshold beyond which query terms should be ignored

I am using Solr to search and index products from a database. Products have two interesting fields: a name and a description. Product names are normally unique, but they sometimes contain common words which serve as a pre-description of the product; one example would be "UltraScrew - a motor-powered screwdriver". Names are generally much shorter than descriptions.
The problem is that when one searches for a common term, documents that contain it in the name get an unwanted boost over those that contain it only in the description. This is because names are shorter, and even with length normalization applied, the effect is quite visible.
I was wondering whether it is possible to filter terms out of the name, not with a dictionary of stop words, but based on the relative document frequency of the term. That is, if a term appears in more than 10% of the available documents, it should be ignored when the name field is queried. The description field should be left untouched.
Is this generally possible?
Maybe you could use your own Similarity:
import org.apache.lucene.search.Similarity;

public class MySimilarity extends Similarity {
    @Override
    public float idf(int docFreq, int numDocs) {
        // give zero weight to terms that appear in 10% or more of the documents
        float freq = ((float) docFreq) / ((float) numDocs);
        if (freq >= 0.1) return 0;
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
    ...
}
and use that one instead of the default one.
You can set the similarity for an IndexSearcher at the Lucene level; see this other answer for details.
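Wiring it up might look roughly like this (a sketch against the older Lucene API that the snippet above assumes; directory stands for your index Directory):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MySimilarity()); // the df-aware idf() above replaces the default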
I am not sure if I understood the question correctly, but you could run two separate queries. Pseudocode:
SearchResults nameSearchResults = search("name:X");
if (nameSearchResults.size() * 10 >= corpusSize) { // name-based search useless?
    return search("description:X"); // use description-based search
} else {
    return search("name:X description:X"); // search both fields
}

Increase HashMap Index Without Looping

I am working on a clustering algorithm. I decided to use a HashMap to store the points, thinking that I could use the key as the cluster ID and the value as the point. I do a DFS-style search to identify the nearest point, and my calculation-related work and all the looping over the data take place outside of the method that identifies the clusters.
Also, the intention of this clustering is that if a point belongs to the same cluster, its ID remains the same. What I want to find out is: once I enter a value in the HashMap, how can I increase the index for the next value (the key would be the same) without using a loop? (One possible reading is sketched at the end of this post.)
Here is how my method looks; I took some of the algorithm's content out since it is not relevant to the question.
public void dfsNearest(double point) {
    double aPointInCluster = point;
    if (!cluster.contains(aPointInCluster)) {
        ...
        this.setNumOfClusters(this.getNumOfClusters() + 1);
        mapOfCluster.put(this.getNumOfClusters(), aPointInCluster);
        // after this i want to increase the index so no override happens
    }
    ...
    if (newNeighbor != 0.0) {
        cluster.add(newNeighbor);
        mapOfCluster.put(this.getNumOfClusters(), newNeighbor);
        // want to increase the index....
        ...
        if (!visitedMap.containsKey(newNeighbor)) {
            dfsNearest(newNeighbor);
        }
    }
    ...
}
Thanks for any suggestions; please also let me know if the rest of the code is necessary to make a good decision. I just wanted to keep it simple.
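One way to read this requirement (an illustrative sketch, not from the original post): map each cluster ID to a list of points, so nothing needs a per-key index and new points never overwrite old ones.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterMapSketch {
    // cluster ID -> all points assigned to that cluster
    private final Map<Integer, List<Double>> mapOfCluster = new HashMap<>();

    // Adds a point under the same key without overwriting earlier points.
    public void addToCluster(int clusterId, double point) {
        mapOfCluster.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(point);
    }
}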
