Pass metadata along with seed URLs using the Nutch 1.X REST API


I'm currently trying to include the seed URL in the data indexed for each URL in my search backend (currently ElasticSearch).
I've seen in this previous question that metadata can be passed with each seed, which could suit my needs. However, I'm using the REST API to create my seed list, and it seems that metadata isn't allowed in the seedUrls parameter.
Has anybody tried to do this with the REST API?
Is there another way to achieve this?
I thought I could write a custom IndexingFilter to add the seed URL to the NutchDocument being indexed, but from what I've seen, the seed URL is not available at that point.
Thanks in advance!

At the moment the REST API doesn't seem to support associated metadata. I believe adding it wouldn't require much effort: basically we just need to handle the JSON payload, customize the corresponding SeedUrl entity to hold the metadata, and of course customize the writeToSeedFile method.
However, your approach of writing an IndexingFilter wouldn't work. The seed URLs are injected at the very beginning of the crawl life cycle, and IndexingFilters are only responsible for choosing what gets indexed into your storage.
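To make the target of that change concrete, here is a minimal sketch of the seed file lines a metadata-aware writeToSeedFile would need to produce. It assumes the usual Nutch 1.x seed list layout, where each line is a URL followed by tab-separated name=value pairs. It is written in Ruby only to illustrate the file format; seed_line, write_seed_file and the "seed" metadata key are hypothetical names, not part of Nutch or its REST API.

```ruby
# Hypothetical helper: formats one seed list line as
# "<url>\t<name>=<value>\t<name>=<value>...".
def seed_line(url, metadata = {})
  fields = [url] + metadata.map { |key, value| "#{key}=#{value}" }
  fields.join("\t")
end

# Hypothetical helper: writes one line per seed. Each seed is a hash with
# a :url entry and an optional :metadata hash.
def write_seed_file(path, seeds)
  File.open(path, "w") do |file|
    seeds.each do |seed|
      file.puts(seed_line(seed[:url], seed.fetch(:metadata, {})))
    end
  end
end

# Example: carry the seed URL itself as metadata under an assumed "seed" key.
puts seed_line("http://example.com/", { "seed" => "http://example.com/" })
```

Whether that metadata then reaches the index depends on how the crawl and indexing plugins are configured, which this sketch does not cover.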

Related

How to disable pagination for a single request (request all items under resource)?

I have an Eve instance running and pagination enabled. In some cases I want to request all items under a resource. This is done together with a projection to get a full list of ids.
This question is very similar to another question, but this question concerns external requests rather than internal calls.
I have tried setting max_results to 0 and -1 but both yield a single result. Is there a way to request all items without disabling pagination globally?
Edit: My current solution to circumvent this is a custom Flask endpoint which just accesses the database directly. The issue with this approach is that I would like to add various projects and make use of Eve's database optimizations, all of which I would need to reimplement manually.

Can Azure API Management cache based on request payload?

Is it possible to use cache based on a key in the request payload?
E.g., let's say we have a JSON or XML request payload where one of the elements is CustomerId.
Would it then be possible to cache based on CustomerId?
Thanks

I hope I understood your question properly and am not too late. I think you want to cache only when 'CustomerId' is present in the input or when it contains a certain value.
You can refer to the samples given at the following link:
https://azure.microsoft.com/en-us/blog/policy-expressions-in-azure-api-management/
It will help you write policy expressions that check the presence or value of a particular field. You can then cache or skip caching based on that.
On a side note, custom caching (caching by key) is also worth checking out:
https://learn.microsoft.com/en-us/azure/api-management/api-management-sample-cache-by-key
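The idea behind caching by key can be sketched independently of API Management. The Ruby below is not policy syntax and does not use any Azure API; it only illustrates deriving a cache key from a CustomerId field in a JSON payload and bypassing the cache when the field is absent. PayloadKeyedCache and the "customer-" key prefix are hypothetical.

```ruby
require "json"

# Hypothetical in-memory cache keyed on the CustomerId found in a JSON payload.
class PayloadKeyedCache
  def initialize
    @store = {}
  end

  # Returns the cached response for the payload's CustomerId, or runs the
  # block to produce (and remember) it. Payloads without a CustomerId are
  # never cached.
  def fetch(payload_json)
    customer_id = JSON.parse(payload_json)["CustomerId"]
    return yield if customer_id.nil?

    key = "customer-#{customer_id}"
    @store.fetch(key) { @store[key] = yield }
  end
end

cache = PayloadKeyedCache.new
backend_calls = 0
2.times do
  cache.fetch('{"CustomerId": 42}') { backend_calls += 1; "response for 42" }
end
puts backend_calls
```

In API Management the same decision would be expressed with policy expressions and the cache policies described at the links above.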

Get Request URL Capability

I recently began working with JavaScript and am looking at the various GET and POST requests one can send to a server.
For GET, as far as I know, all of the information in the query is contained in the URL that the user triggers. On the server side this has to be parsed to retrieve the necessary parameters.
I was wondering how larger and more detailed requests are handled with GET. For instance, what if I had millions and millions of parameters making up my request? Would they all be jumbled into the URL? Is there a limit on the number of unique URLs one can have? I read this post:
How do URL shorteners guarantee unique URLs when they don't expire?
I would really like some more input.
Thank You!

Sails.js - how to structure JSON-based live data output with existing static data in the model

In my Angular app, I want to display a table which contains the following
a) URL
b) Social share counts divided by different social networks
Using Sails.js, I have already created the API for the URL, so when the results show up I can display the URL. Now I'm confused about how to get the appropriate social counts showing right beside it.
Here's the API I'm using: https://docs.sharedcount.com/
By itself, I can see the JSON it produces.
But here are my questions:
Should I create a new API (model/controller) for the social count data, or include it in the model where I have the 'url' action defined?
If I create a new API or include social_counts as an action in the current one, what would my JSON query look like? To retrieve the URLs, I'm using the default blueprint API that Sails provides, so:
http://www.example.com/url/find?where={"title":{"contains":"mark"}}
I'm struggling a bit with the thought process, so it would be great to get input on this.
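As a side note on the query shown above, the where criteria are JSON and need to be URL-encoded when the request is built programmatically. A minimal sketch using only the Ruby standard library follows; blueprint_find_url is a hypothetical helper, and the base URL is the one from the example.

```ruby
require "json"
require "uri"

# Hypothetical helper: builds a blueprint "find" URL with JSON criteria
# URL-encoded into the "where" query parameter.
def blueprint_find_url(base, criteria)
  query = URI.encode_www_form(where: JSON.generate(criteria))
  "#{base}/find?#{query}"
end

puts blueprint_find_url("http://www.example.com/url",
                        { "title" => { "contains" => "mark" } })
```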

It depends on your app. Will your app store that data or just consume it? If it needs to store the data, for example to modify or aggregate it, then of course you need the API.
No, you can't do that. That shortcut method only works if the data is in your database and you let the Sails Waterline ORM and blueprint API serve it.
If you only need to consume the data from the SharedCount API, you may not need Sails as a backend in this context at all. Just use Angular as a client of that API. The exception is if you need to modify the data first and store it in your own database, in which case Sails will help with its Waterline ORM and blueprint API.

Grab Instagram photos based on hashtags

I am new to Instagram and I have been tasked with writing an application that grabs Instagram photo uploads based on a certain hashtag. That is, if the application is running and searching for the hashtag "#awesomeevent", any photo uploaded with that hashtag should automatically be stored in our database.
The application should work similarly to http://statigr.am/tag/ but instead of displaying the photos it should store them in the database.
What is the process for doing this? Are there any tutorials that cover this from start to end, including how to create an Instagram app from scratch? Any help would be greatly appreciated.
Thanks

Something we developers often overlook is the API Terms and Conditions. I've been there myself.
API TERMS OF USE
Before you start using the API, we have a few guidelines that we'd like to tell you about. Please make sure to read the full API Terms of Use. Here's what you'll read about:
Instagram users own their images. It's your responsibility to make sure that you respect that right.
You cannot use the Instagram name in your application.
You cannot use the Instagram API to crawl or store users' images without their express consent.
You cannot replicate the core user experience of Instagram.com
Do not abuse the API. Too many requests too quickly will get your access turned off
However, a part of the terms also states that:
You shall not cache or store any Instagram user photos other than for reasonable periods in order to provide the service you are providing to Instagram users.
Hope that's a start before you actually get coding and storing images.
API Terms of Use: http://instagram.com/about/legal/terms/api/
API: http://instagram.com/developer/

For starters, you should consult the Instagram API documentation.
The specific endpoint you will need is:
/tags/tag-name/media/recent
For example, if you want to look for images from tag #awesomeevent, you will do an api query to:
https://api.instagram.com/v1/tags/awesomeevent/media/recent?access_token=ACCESS-TOKEN
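A minimal sketch of building that query URL in Ruby, using only the standard library. recent_media_url is a hypothetical helper, and the endpoint and access_token parameter are taken from the example above, so this assumes that version of the API is the one being called.

```ruby
require "uri"

# Hypothetical helper: builds the recent-media URL for a tag, using the
# endpoint shown above. A leading "#" on the tag is dropped.
def recent_media_url(tag, access_token)
  path_tag = URI.encode_www_form_component(tag.delete_prefix("#"))
  query = URI.encode_www_form(access_token: access_token)
  "https://api.instagram.com/v1/tags/#{path_tag}/media/recent?#{query}"
end

puts recent_media_url("#awesomeevent", "ACCESS-TOKEN")
```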

I would have a look at the two libraries Instagram provides. The Ruby library is at https://github.com/Instagram/instagram-ruby-gem and the Python library is at https://github.com/Instagram/python-instagram
They both seem to have examples to get you started if you're programming with either library.
As far as the storage issue goes, could you store the URLs of the images instead of the actual images themselves? The API returns JSON that includes the image URLs.
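As a rough sketch of that approach, the Ruby below pulls image URLs out of a response body. It assumes the JSON has a top-level "data" array whose items expose images.standard_resolution.url, mirroring the field path used by the Ruby gem elsewhere in this thread. image_urls and the sample payload are hypothetical, not real API output.

```ruby
require "json"

# Hypothetical helper: returns the standard-resolution image URL of every
# item in the response, skipping items that lack one.
def image_urls(response_body)
  JSON.parse(response_body).fetch("data", []).filter_map do |item|
    item.dig("images", "standard_resolution", "url")
  end
end

# Made-up sample payload, for illustration only.
sample = <<~JSON
  {"data": [
    {"images": {"standard_resolution": {"url": "http://example.com/a.jpg"}}},
    {"images": {}}
  ]}
JSON

puts image_urls(sample)
```

The resulting URLs can then be stored in whatever database the application uses.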
Hope that helps.

You can use the Ruby script below to retrieve the images and save them to files. You can then either reference the files from the database or replace the last block with code for your particular database implementation. Without knowing your database type and schema, no one can tell you how to add something to it.
require "instagram"
require "restclient"

# INSTAGRAM_CLIENT_ID, INSTAGRAM_CLIENT_SECRET and INSTAGRAM_ACCESS_TOKEN
# must be defined with your own credentials before this runs.
Instagram.configure do |config|
  config.client_id = INSTAGRAM_CLIENT_ID
  config.client_secret = INSTAGRAM_CLIENT_SECRET
end
instagram_client = Instagram.client(:access_token => INSTAGRAM_ACCESS_TOKEN)

# Search for matching tags and use the first result.
tags = instagram_client.tag_search('cat')

# Collect the standard-resolution URL of each recent media item for that tag.
urls = instagram_client.tag_recent_media(tags[0].name).map do |media_item|
  media_item.images.standard_resolution.url
end

# Download each image and save it as <index>.jpg in the current directory.
urls.each_with_index do |url, idx|
  image = RestClient.get(url)
  path = File.join(Dir.pwd, "#{idx}.jpg")
  # Open in binary mode so the image bytes are written unchanged.
  File.open(path, 'wb') { |f| f.write(image.body) }
end
