Proper way to configure Deletion Bolt for StormCrawler

So I'm trying to turn on the Deletion Bolt on my StormCrawler instances so they can clean up the indexes as the URLs for our sites change and pages go away.
For reference, I am on 1.13 (our systems people have not upgraded us to ELK v7 yet).
Having never attempted to modify the es-crawler.flux, I'm looking for some help to let me know if I am doing this correctly.
I added a bolt:
- id: "deleter"
className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.DeletionBolt"
parallelism: 1
and then added the stream:
- from: "status"
to: "deleter"
grouping:
type: FIELDS
args: ["url"]
streamId: "deletion"
Is that the correct way to do this? I don't want to accidentally delete everything in my index by putting in the wrong info. 🤣

Yes, to answer my own question, adding the two above items to their respective places in the es-crawler.flux DOES in fact cause the crawler to delete docs.
In order to test this, I created a directory on one of our servers with a few files in it: index.html, test1.html, test2.html, and test3.html. index.html had links to the three test HTML files. I crawled them, having first limited the crawler to ONLY that specific directory. I also modified the fetch settings to re-crawl already-crawled docs after 3 min and re-crawl fetch-error docs after 5 min.
All four docs showed up in the status index as FETCHED, and their content appeared in the content index.
I then renamed test3.html to test4.html and changed the link in index.html. The crawler picked up the change, set the status of test3.html to FETCH_ERROR, and added test4.html to the indexes.
After 5 min it crawled it again, keeping the FETCH_ERROR status.
After another 5 min it crawled it again, changing the status to ERROR and deleting the test3.html doc from the content index.
So that worked great. In our production indexes we have a bunch of docs that have gone from FETCH_ERROR to ERROR status, but because deletions were not enabled, the actual content was not deleted and is still showing up in searches. Here's how I worked through that on my test pages:
I disabled deletions (removing the two items above from es-crawler.flux) and renamed test2.html to test5.html, modifying the link in index.html. The crawler went through the three crawls with FETCH_ERROR and set the status to ERROR, but did not delete the doc from the content index.
I re-enabled deletions and let the crawler run for a while, but soon realized that when the crawler set the status to ERROR, it also set the nextFetchDate to 12/31/2099.
So I went into the Elasticsearch index and ran the following query to reset the status and set the nextFetchDate to just ahead of the current date/time:
POST /www-test-status/_update_by_query
{
  "script": {
    "source": """
      if (ctx._source?.status != null) {
        ctx._source.remove('metadata.error%2Ecause');
        ctx._source.remove('status');
        ctx._source.put('status', 'FETCH_ERROR');
        ctx._source.remove('nextFetchDate');
        ctx._source.put('nextFetchDate', '2019-10-09T15:01:33.000Z');
      }
    """,
    "lang": "painless"
  },
  "query": {
    "match": {
      "status": "ERROR"
    }
  }
}
The crawler then picked up the docs the next time it came around and deleted the docs out of the content index when they went back to ERROR status.
Not sure if that's the completely proper way to do it, but it has worked for me.

Related

ExpressJS: How to cache on demand

I'm trying to build a REST API with Express, Sequelize (PostgreSQL dialect) and Node.
Essentially I have two endpoints:
Method | Endpoint | Desc.
GET | /api/players | To get players info, including assets
POST | /api/assets | To create an asset
And there is a mechanism that updates a property of assets (say price) on a 30-second cycle.
Goal
I want to cache the results of GET /api/players, but I want some control over it: whenever a user creates an asset (using POST /api/assets), the very next request to GET /api/players should return the updated data (i.e. including the property which updates every 30 seconds) and cache it until it gets updated in the next cycle.
Expected
The following should demonstrate it:
GET /api/players
JSON Response:
[
  {
    "name": "John Doe",
    "assets": [
      {
        "id": 1,
        "price": 10
      }
    ]
  }
]
POST /api/assets
JSON Request:
{
  "id": 2
}
GET /api/players
JSON Response:
{
  "name": "John Doe",
  "assets": [
    {
      "id": 1,
      "price": 10
    },
    {
      "id": 2,
      "price": 7.99
    }
  ]
}
What I have managed to do so far
I have made the routes, but GET /api/players has no cache mechanism and basically queries the database every time it is requested.
Some solutions I have found, but none seem to meet my scenario
apicache (https://www.youtube.com/watch?v=ZGymN8aFsv4&t=1360s): But I don't have a specific duration, because a user can create an asset anytime.
Example implementation
I have seen a (kind of) similar implementation to what I want in a GitHub Actions workflow for caching, where you define a key and, unless the key has changed, it reuses the same packages instead of installing them every time (example: https://github.com/python-discord/quackstack/blob/6792fd5868f28573bb8f9565977df84e7ba50f42/.github/workflows/quackstack.yml#L39-L52)
Is there any package to do that? So that while processing POST /api/assets I can change the key in its handler, and thus GET /api/players gives me the updated result (I can also change the key in that 30-second cycle), and after that it gives me the cached result (until it is updated in the next cycle).
Note: If you have a solution, please try to stick with npm packages rather than something like Redis, unless it's the only/best solution.
Thanks in advance!
(P.S. I'm a beginner and this is my first question in SO)
Typically caching is done with the help of Redis. Redis is an in-memory key-value store. You could handle the cache in the following manner.
In your handler for the POST operation, update/reset the cached entry for players.
In your handler for the GET operation, if Redis has the entry in the cache, return it; otherwise query the data, add the entry to the cache, and return it.
Alternatively, you could use Memcached.
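A minimal sketch of that Redis cache-aside flow, assuming the node-redis (v4) client and hypothetical getPlayersFromDb()/createAsset() helpers standing in for the Sequelize queries:

const express = require('express');
const { createClient } = require('redis');

const app = express();
app.use(express.json());
const redis = createClient(); // assumes a Redis instance on localhost:6379

app.get('/api/players', async (req, res) => {
  const cached = await redis.get('players');
  if (cached) return res.json(JSON.parse(cached)); // cache hit
  const players = await getPlayersFromDb();        // hypothetical Sequelize query
  await redis.set('players', JSON.stringify(players));
  res.json(players);
});

app.post('/api/assets', async (req, res) => {
  const asset = await createAsset(req.body); // hypothetical Sequelize insert
  await redis.del('players');                // invalidate so the next GET rebuilds the cache
  res.status(201).json(asset);
});

redis.connect().then(() => app.listen(3000));

The 30-second price cycle should also call redis.del('players') after each update, so the next GET reflects the new prices.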
A bit late to this answer but I was looking for a similar solution. I found that the apicache library not only allows for caching for specified durations, but the cache can also be manually cleared.
apicache.clear([target]) - clears cache target (key or group), or entire cache if no value passed, returns new index.
Here is an example for your implementation:
// POST /api/assets
app.post('/api/assets', function(req, res, next) {
  // update assets then clear cache
  apicache.clear()
  // or only clear the specific players cache by using a parameter
  // apicache.clear('players')
  res.send(response)
})
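For apicache.clear('players') to have something to clear, the GET route has to go through the apicache middleware and tag its responses with that group. A sketch, where the '30 seconds' duration is just a stand-in for the refresh cycle and getPlayersFromDb() is a hypothetical helper:

const apicache = require('apicache');
const cache = apicache.middleware;

// GET /api/players is cached; the duration only acts as a fallback because
// apicache.clear('players') evicts the entry whenever assets change
app.get('/api/players', cache('30 seconds'), async (req, res) => {
  req.apicacheGroup = 'players';            // tags this response so clear('players') can target it
  const players = await getPlayersFromDb(); // hypothetical Sequelize query
  res.json(players);
});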

Why can't Azure Search import JSON blobs?

When importing data using the configuration found below, Azure Cognitive Search returns the following error:
Error detecting index schema from data source: ""
Is this configured incorrectly? The files are stored in the container "example1" and in the blob folder "json". When creating the same index with the same data in the past there were no errors, so I am not sure why it is different now.
Import data:
Data Source: Azure Blob Storage
Name: test-example
Data to extract: Content and metadata
Parsing mode: JSON
Connection string:
DefaultEndpointsProtocol=https;AccountName=EXAMPLESTORAGEACCOUNT;AccountKey=EXAMPLEACCOUNTKEY;
Container name: example1
Blob folder: json
.json file structure:
{
  "string1": "value1",
  "string2": "value2",
  "string3": "value3",
  "string4": "value4",
  "string5": "value5",
  "string6": "value6",
  "string7": "value7",
  "string8": "value8",
  "list1": [
    {
      "nested1": "value1",
      "nested2": "value2",
      "nested3": "value3",
      "nested4": "value4"
    }
  ],
  "FileLocation": null
}
Here is an image of the screen with the error when clicking "Next: Add cognitive skills (Optional)" button:
To clarify there are two problems:
1) There is a bug in the portal where the actual error message is not showing up for errors, hence we are observing the unhelpful empty string "" as an error message. A fix is on the way and should be rolled out early next week.
2) There is an error when the portal attempts to detect the index schema from your data source. It's hard to say what the problem is when the error message is just "". I've tried your sample data and importing works fine.
I'll update the post once the fix for displaying the error message is out. In the meantime (again we're flying blind here without the specific error string) here are a few things to check:
1) Make sure your firewall rules allow the portal to read from your blob storage
2) Make sure there are no extra characters inside your JSON files. Check that the whitespace characters are actually whitespace (you should be able to open the file in VS Code and check).
Update: The portal fix for the missing error messages has been deployed. You should be able to see a more specific error message should an error occur during import.
Seems to me that it is a problem related to the list1 data type. Make sure you're selecting "Collection(Edm.String)" for it during the index creation.
For more info, please check step 5 of the following link: https://learn.microsoft.com/en-us/azure/search/search-howto-index-json-blobs
I have been in contact with Microsoft, and this is a bug in the Azure Portal. The issue is that the connection string wizard does not append the endpoint suffix correctly. They have recommended manually pasting the connection string, but this still does not work for me. So this is the answer suggested by Microsoft, but I don't believe it is completely correct, because the portal outputs the same error message:
Error detecting index schema from data source: ""
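For reference, a manually assembled connection string would normally carry the endpoint suffix explicitly, along these lines (account name and key are placeholders):

DefaultEndpointsProtocol=https;AccountName=EXAMPLESTORAGEACCOUNT;AccountKey=EXAMPLEACCOUNTKEY;EndpointSuffix=core.windows.net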

How do I create a composite index for my Firestore query?

I'm trying to perform a Firestore query on a collection, which fails because an index needs to be created for the query I'm attempting. The error contains a link that is supposed to auto-create the missing index for me. However, when I follow the link and attempt to create the index that has been prepared for me, I encounter an error stating "name only indexes are not supported". I would also point out I have been using the npm functions-framework to test my cloud function that contains the relevant query.
I have tried creating the composite index myself manually, but none of the indexes I have made seem to satisfy my attempted query.
Sample docs in my Items Collection:
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "en-us" <string>
}
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "en-us" <string>
}
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "fr" <string>
}
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "en-us" <string>
}
These are all queries I have tried which fail:
let queryRef = itemsRef.where('descriptionLastModified','<=', oneDayAgoTimestamp).orderBy("descriptionLastModified","desc").where("detectedLanguage", '==', "en-us").get()
let queryRef = itemsRef.where('descriptionLastModified','<=', oneDayAgoTimestamp).where("detectedLanguage", '==', "en-us").get()
let queryRef = itemsRef.where("detectedLanguage", '==', "en-us").where('descriptionLastModified','<=', oneDayAgoTimestamp).get()
I have made the following composite indexes at the collection level to no avail:
CollectionId:items Fields: descriptionLastModified:DESC detectedLangauge: ASC
CollectionId:items Fields: descriptionLastModified:ASC detectedLangauge: ASC
CollectionId:items Fields: detectedLangauge: ASC descriptionLastModified:DESC
My expectation is I should be able to filter my items by their descriptionLastModified timestamp field and additionally by the value of their detected Language string field.
In case anyone finds this in the future: it's 2021 and I still find that manually created composite indexes, despite being incredibly simple (or so you'd think, and I fully understand why the OP thought his indexes would work), often just don't work. Doubtless there is some subtlety that reading some guides would make clear, but I haven't found the trick yet, and I have been using Firestore intensively at work for over 18 months.
The trick is supposed to be to use the link the error creates, but this often fails: you get a dialog box telling you an index will be created, but no details you could use to create it manually, and the friendly blue 'Create' button does nothing; it neither creates the index nor dismisses the window.
For a while I had it working in Firefox, but it stopped. A colleague across a couple of desks who has to create them a lot tells me that Edge is the most reliable, and that you have to be very careful not to have multiple Google accounts signed in. If Edge (or Chrome) takes you to the wrong login when you follow the link, it will assume your default login rather than, say, the one currently selected in your only Google Cloud console window; you have to switch user back, and even then it works maybe 1 time in 3. He tells me that in Edge it works about 60% of the time.
I used to get about 30% in Firefox by just hitting refresh a few times, but I can't get it working in anything other than Edge now. In practice, unless there is a client with little cash who will notice, I just go for inefficient and costly queries that return a superset of the results and do some filtering on the results in code, as sketched below. Mostly this runs in Node.js and it's nippy enough for my purposes. It's a real shame to ramp up the read counts and the consequential bills, but there just doesn't seem to be a fix.
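A minimal sketch of that superset-then-filter workaround, assuming the Firebase Admin SDK inside an async function and reusing the OP's itemsRef and oneDayAgoTimestamp names:

// Range filter and orderBy on a single field: Firestore's automatic
// single-field index covers this, so no composite index is needed
const snapshot = await itemsRef
  .where('descriptionLastModified', '<=', oneDayAgoTimestamp)
  .orderBy('descriptionLastModified', 'desc')
  .get();

// Apply the language condition in memory; every document read above is still billed
const enUsItems = snapshot.docs
  .map(doc => ({ id: doc.id, ...doc.data() }))
  .filter(item => item.detectedLanguage === 'en-us');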

Is there a way to resolve this error: "CloudKit integration requires does not support ordered relationships."

I'm trying to use Apple's CoreDataCloudkitDemo app. I've only changed the app settings per their README document. On running the demo, I'm getting the error: "CloudKit integration requires does not support ordered relationships."
(The weird grammar in the title is included in the app)
The console log shows:
Fatal error: ###persistentContainer: Failed to load persistent stores:Error Domain=NSCocoaErrorDomain Code=134060 "A Core Data error occurred." UserInfo={NSLocalizedFailureReason=CloudKit integration requires does not support ordered relationships. The following relationships are marked ordered:
Post: attachments
There is the same error for the "Tags" entity.
I'm using Xcode 11.0 beta 4 (11M374r).
I've only changed the bundle identifier, and set my Team Id.
I removed the original entitlements file - no errors in resulting build.
I've not changed any code from the original.
Does anyone have a workaround, or preferably, a fix? Or did I do something wrong?
Thanks
Firstly, select CoreDataCloudKitDemo.xcdatamodeld -> Post -> Relationships, select the attachments relationship, deselect Ordered in the inspector panel, then do the same thing for the tags relationship.
Secondly, there will now be some errors in the code: because we unchecked the Ordered option, the attachments and tags properties in the generated NSManagedObject subclasses change from NSOrderedSet? to NSSet?. So we can change the offending lines of code like below:
Original:
guard let tag = post?.tags?.object(at: indexPath.row) as? Tag else { return cell }
Changed:
guard let tag = post?.tags?.allObjects[indexPath.row] as? Tag else { return cell }
Finally, you can run the code now. ;-)
Furthermore, the demo in WWDC19 Session 202 shows both the attachments and tags relationships set as unordered, so I think there's something wrong with the given demo project.

Listing active replications in CouchDB 1.1.0

I am bootstrapping replications in CouchDB by POSTing to localhost:5984/_replicate. This URL only accepts POST requests.
There is also a second URL, localhost:5984/_replicator, which accepts PUT, GET and DELETE requests.
When I configure a replication by POSTing to _replicate, it gets started, but I cannot get information about it. It is also not listed in _replicator.
How can I get the list of active replications?
How can I cancel an active replication?
Edit: how to trigger replications with the _replicator method.
Thanks to comments by JasonSmith, I got to the following solution: PUTting to _replicator requires using the full URL (including authentication credentials) for the target database. This is not the case when using the _replicate URL, which is happy getting just the name of the target database (I am talking here about pull replications). The reason, as far as I can tell, is explained here (see section 8, "The user_ctx property and delegations").
The original API was a special URL, /_replicate where you tell Couch what to do and it tells you the result. However, the newer system is a regular database, called /_replicator and you create documents inside it telling Couch what to do. The document format is the same as the older _replicate format, however CouchDB will update the document as the replication proceeds. (For example, it will add a field "state":"triggered" or "state":"complete", etc.)
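For example (a sketch only, run from Node, with placeholder credentials and database names), a pull replication could be created by PUTting a document into _replicator, giving the target as a full URL with credentials as described in the edit above:

const auth = 'Basic ' + Buffer.from('admin:secret').toString('base64'); // server admin (placeholder)

fetch('http://localhost:5984/_replicator/my-pull-replication', {        // the document id is arbitrary
  method: 'PUT',
  headers: { 'Content-Type': 'application/json', 'Authorization': auth },
  body: JSON.stringify({
    source: 'http://remote.example.com:5984/sourcedb',     // remote database being pulled from
    target: 'http://admin:secret@localhost:5984/targetdb', // full URL with credentials, per the edit above
    continuous: true
  })
}).then(r => r.json()).then(console.log);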
To get a list of active replications, GET /_active_tasks as the server admin. For example (formatted):
curl http://admin:secret@localhost:5984/_active_tasks
[
  {
    "type": "Replication",
    "task": "`1bea06f0596c0fe6a1371af473a95aea+create_target`: `http://jhs.iriscouch.com/iris/` -> `iris`",
    "started_on": 1315877897,
    "updated_on": 1315877898,
    "status": "Processed 83 / 119 changes",
    "pid": "<0.224.0>"
  },
  {
    "type": "Replication"
    // ... etc ...
  }
]
The wiki has instructions to cancel CouchDB replication. Basically, you want to specify the same source and target and also add "cancel":true.
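A sketch of that cancellation from Node, assuming the replication was originally triggered through _replicate (the credentials and database names are placeholders and must match the original request):

const auth = 'Basic ' + Buffer.from('admin:secret').toString('base64'); // server admin (placeholder)

fetch('http://localhost:5984/_replicate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'Authorization': auth },
  body: JSON.stringify({
    source: 'http://remote.example.com:5984/sourcedb', // same source as the running replication
    target: 'targetdb',                                // same target
    cancel: true                                       // tells CouchDB to stop that replication
  })
}).then(r => r.json()).then(console.log);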

Resources