I have come across a text file listing about 100 BitTorrent trackers. My question is: how are people able to generate a text file with so many trackers like that? Does anybody have a script that generates these tracker URLs?
An example of a BitTorrent tracker is: http://3dfreedom.ru:6969/announce
There is no central directory of trackers, but one can build a list by harvesting torrents, e.g. from big indexing sites, and then extracting the tracker lists from them.
To extract tracker URLs from a torrent file you need a library that supports bencoding, a serialization format used in the BitTorrent ecosystem.
Most torrents will have either a single announce URL, as described in BEP-3, or multiple, as in BEP-12.
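As a minimal sketch (Python, no external dependencies assumed), the following decodes just enough bencoding to pull the announce (BEP-3) and announce-list (BEP-12) keys out of .torrent files; running it over a folder of harvested torrents and deduplicating the output yields exactly the kind of tracker list you describe:

```python
# Minimal bencode decoder, just enough to read .torrent metadata.
# Bencoding: integers i<n>e, byte strings <len>:<bytes>, lists l...e, dicts d...e.
def bdecode(data, i=0):
    c = data[i:i+1]
    if c == b"i":                         # integer
        end = data.index(b"e", i)
        return int(data[i+1:end]), end + 1
    if c == b"l":                         # list
        i += 1
        items = []
        while data[i:i+1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                         # dictionary (keys are byte strings)
        i += 1
        d = {}
        while data[i:i+1] != b"e":
            key, i = bdecode(data, i)
            d[key], i = bdecode(data, i)
        return d, i + 1
    colon = data.index(b":", i)           # byte string
    length = int(data[i:colon])
    return data[colon+1:colon+1+length], colon + 1 + length

def trackers(path):
    with open(path, "rb") as f:
        meta, _ = bdecode(f.read())
    urls = set()
    if b"announce" in meta:                       # BEP-3: single announce URL
        urls.add(meta[b"announce"].decode())
    for tier in meta.get(b"announce-list", []):   # BEP-12: list of tiers
        urls.update(u.decode() for u in tier)
    return urls

if __name__ == "__main__":
    import sys
    for url in sorted(set().union(*(trackers(p) for p in sys.argv[1:]))):
        print(url)
```

Invoked as `python trackers.py *.torrent`, this prints one tracker URL per line, which is the same shape as the text file you found.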
I am reading the Google Photos API documentation. I can't find out what mediaItemId is; see for example here:
https://developers.google.com/photos/library/guides/access-media-items#get-media-item
There are some other questions that might be related, but they have no answers:
How to get mediaItemId of a Google photo using its shared URL?
I've not used the API but I'm familiar with other Google services and am a Photos user.
If you consider your experience with photos.google.com, you browse a somewhat unstructured list of all your photos. The Photos (phone|browser) apps do categorize photos by date, but you have to search or filter by other metadata to find the specific photo(s) that you're seeking. Or you happy-scroll through years of photos of your cat.
This contrasts with another common metaphor for arranging files, in which a hierarchy of folders is used to categorize the content, e.g. /photos/cats/2022, but this mechanism is limited because you can only really navigate through one dimension (the folders).
Considerable metadata (type, width|height, creation date, etc.) is associated with each photo, and it is customary in schemas like this to construct a unique ID for each object. The unique ID is sometimes exposed to the end-user, but not necessarily; identifiers are generally for the system's own purposes.
With Photos, there are public, unique identifiers in the form of URLs for each photo, but evidently the id and the URL, although probably related (perhaps via a hash), aren't obviously derivable from one another.
So, since it's not always possible to specify a photo uniquely by e.g. "the one of my dog where he's wearing sunglasses because of the eclipse", and given the absence of folders, a really powerful alternative (which you'll need to employ) is to search for some subset of the photos and then iterate over the results.
It appears that the Photos service has such a search, to which you provide Filters, and each of the items in the results will be a MediaItem (uniquely identified by id).
Unlike the file system example above, because Photos does not use a fixed hierarchy, we can view our Photos by filtering them using an extensive set of metadata: photos of cats, taken in 2022, using my phone.
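To make that concrete, here is a hedged sketch (Python, using the requests package) of calling the Library API's mediaItems:search endpoint with a date filter and iterating the paged results; obtaining the OAuth 2.0 access token is assumed to have happened already and is out of scope here:

```python
# Sketch: list mediaItem ids via the Library API's mediaItems:search.
# ACCESS_TOKEN is a placeholder; you need a real token with a photoslibrary scope.
import requests

ACCESS_TOKEN = "ya29...."

def iter_media_item_ids():
    url = "https://photoslibrary.googleapis.com/v1/mediaItems:search"
    headers = {"Authorization": "Bearer " + ACCESS_TOKEN}
    body = {
        "pageSize": 100,
        # Example filter: only items from 2022 ("photos of cats, taken in 2022"
        # would add a content filter on top of this).
        "filters": {"dateFilter": {"ranges": [{
            "startDate": {"year": 2022, "month": 1, "day": 1},
            "endDate": {"year": 2022, "month": 12, "day": 31},
        }]}},
    }
    while True:
        resp = requests.post(url, headers=headers, json=body)
        resp.raise_for_status()
        data = resp.json()
        for item in data.get("mediaItems", []):
            yield item["id"]          # this id is the mediaItemId
        if "nextPageToken" not in data:
            return
        body["pageToken"] = data["nextPageToken"]

for media_item_id in iter_media_item_ids():
    print(media_item_id)
```

Each returned MediaItem also carries a baseUrl and productUrl, which is where the id-versus-URL distinction from above shows up in practice.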
The Wikipedia page of BitTorrent says regarding Multitracker torrents, "One disadvantage to this is that it becomes possible to have multiple unconnected swarms for a single torrent where some users can connect to one specific tracker while being unable to connect to another. This can create a disjoint set which can impede the efficiency of a torrent to transfer the files it describes."
Can someone please give me an example of this?
Thanks.
The information on the Wikipedia page is old and no longer relevant.
Split swarms were only a problem in the window between the introduction of the multi-tracker extension (BEP-12) around 2004 and the introduction of peer exchange (PEX, BEP-11) and the DHT distributed tracker (BEP-5) around 2005.
These three extensions working together create a single unified swarm.
I am basically working on NLP, collecting interest-based data from web pages.
I came across http://schema.org/ as being helpful for NLP stuff.
I have gone through the documentation, from which I can see it adds additional tag properties to identify HTML tag content.
It may help a search engine to get specific data as per a user query.
It says: Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!
But I don't understand how it can help me as an NLP guy. Generally I parse web page content to process and extract data from it. Schema.org may help there, but I don't know how to utilize it.
Any example or guidance would be appreciated.
Schema.org uses the microdata format for representation. People use microdata for text analytics and for extracting curated content. There can be numerous applications.
Suppose you want to create a news summarization system. You can use the hNews microformat to extract the most relevant content and perform summarization on it.
Suppose you have a review-based search engine where you want to list products with the most positive reviews. You can use the hReview microformat to extract the reviews, then perform sentiment analysis on them to identify whether a product has negative or positive reviews (see the sketch after these examples).
If you want to create a skill-based resume classifier, extract content with the hResume microformat, which can give you various details such as contact (using the hCard microformat), experience, achievements, education, skills/qualifications, affiliations, publications, and performances. You can run a classifier on it to classify CVs with particular skill sets.
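As a hedged illustration of the review example (Python with BeautifulSoup; note that schema.org itself uses itemscope/itemprop microdata rather than the class-based hReview microformat, so that is what this parses), the following pulls each annotated item into a plain dict you could feed into a sentiment-analysis step:

```python
# Minimal sketch: extract schema.org microdata (itemscope/itemprop) from HTML
# with BeautifulSoup (pip install beautifulsoup4). Handles only flat, one-level
# items; nested scopes would need a real microdata parser.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="http://schema.org/Review">
  <span itemprop="itemReviewed">Acme Phone</span>
  <span itemprop="reviewBody">Great battery, mediocre camera.</span>
  <span itemprop="reviewRating">4</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for scope in soup.find_all(attrs={"itemscope": True}):
    item = {"@type": scope.get("itemtype")}
    for prop in scope.find_all(attrs={"itemprop": True}):
        item[prop["itemprop"]] = prop.get_text(strip=True)
    print(item)
    # item["reviewBody"] is what you would pass to sentiment analysis
```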
Though schema.org does not help NLP people directly, it provides a platform to perform text processing in a better way.
Check out http://en.wikipedia.org/wiki/Microformat#Specific_microformats to see the various microformats; the same page will give you more details.
Schema.org is something like a vocabulary or ontology to annotate data, here specifically Web pages.
It's a good idea to extract microdata from Web pages, but is it really used by Web developers? I don't think so; I think the majority of microdata is consumed by companies such as Google or Yahoo.
Finally, you can find data, but not a lot, and mainly on a specific type of website.
What do you want to extract, and for what type of application? Because you can probably use another source of data, such as DBpedia or Freebase, for example.
GoodRelations also supports schema.org. You can annotate your content on the fly from the front-end based on the various domain contexts defined, so schema.org is very useful for NLP extraction. One can even use it for HATEOAS services for hypermedia link relations.
Metadata (data about data) for any context is good for content and data in general. Alternatives include microformats, RDFa, RDFa Lite, etc. The more context you have the better, as it will turn your data into smart content and help crawler bots understand the data. It also leads further into the web of data and helps with global queries over resource domains.
In the long run such approaches will help towards domain adaptation of agents for transfer learning on the web, pretty much making the web of pages an externalized unit of a massive commonsense knowledge base. They also help advertising agencies understand publisher sites and better contextualize ad retargeting.
I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites in a specific business category and extracts content and URLs of products and services. Other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.
Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler, from time to time, wakes up and downloads the pages from the websites. Usual polite behavior, like checking robots exclusion rules, will be followed, of course. Meanwhile, the content extractor can update the database after it reads the pages. How do I synchronize the crawler and extractor? How tightly should they be integrated?
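One minimal answer to the synchronization question, sketched in Python for illustration (the same producer/consumer pattern works with Ruby's Thread and Queue): keep the crawler and extractor as separate threads coupled only by a bounded work queue, so neither needs to know the other's schedule. extract_products is a stand-in for your real attribute extraction:

```python
# Sketch: crawler and extractor coupled only through a bounded queue.
# The crawler blocks when the extractor falls behind, giving you
# backpressure instead of tight integration.
import queue
import threading
import urllib.request

page_queue = queue.Queue(maxsize=100)
STOP = object()  # sentinel telling the extractor to shut down

def fetch(url):
    # Real code should also honor robots.txt and rate-limit per host.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_products(html):
    # Placeholder for your 10-30 attribute extraction; yields (name, value).
    if "<title>" in html:
        yield ("title", html.split("<title>", 1)[1].split("</title>", 1)[0])

def crawler(urls):
    for url in urls:
        try:
            page_queue.put((url, fetch(url)))
        except OSError:
            pass  # skip unreachable pages in this sketch
    page_queue.put(STOP)

def extractor():
    while True:
        item = page_queue.get()
        if item is STOP:
            break
        url, html = item
        for name, value in extract_products(html):
            print(url, name, value)  # replace with your database update

if __name__ == "__main__":
    seeds = ["http://example.com/"]
    t = threading.Thread(target=crawler, args=(seeds,))
    e = threading.Thread(target=extractor)
    t.start(); e.start()
    t.join(); e.join()
```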
Nutch builds on Lucene and already implements a crawler and several document parsers.
You can also hook it to Hadoop for scalability.
In the enterprise-search context that I am used to working in,
crawlers,
content extractors,
search engine indexes (and the loading of your content into these indexes),
being able to query that data efficiently and with a wide range of search operators,
programmatic interfaces to all of these layers,
optionally, user-facing GUIs
are all separate topics.
(For example, while extracting useful information from an HTML page vs. PDF vs. MS Word files is conceptually similar, the actual programming for these tasks is still very much a work-in-progress for any general solution.)
You might want to look at the Lucene suite of open-source tools, understand how those fit together, and possibly decide that it would be better to learn how to use those tools (or others like them) than to reinvent the very big, complicated wheel.
I believe in books, so thanks to your query, I have discovered this book and have just ordered it. It looks like a good take on one possible solution to the search-tool conundrum.
http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar
Good luck and let us know what you find out and the approach you decide to take.
I found a new search engine that speeds up finding pirated files on Rapidshare. How could I automate a tool that finds our product using this engine and outputs the list of Rapidshare URLs, which will then be sent to abuse#rapidshare.com?
search engine:
http://rapidlibrary.com/
(note, the captcha image appears just once there)
Below is a nice script that could perhaps do this pretty easily:
http://www.nasser.me/ubiquity/rapidsharecom-link-checker/
I thought about this in the past, and being a "tv show pirate" myself, it kinda annoys me that free torrent sites like The Pirate Bay and Mininova are being taken down while other, not-so-free sites like Rapidshare, Megaupload and so on host the files and continue to make millions out of piracy.
The marketing model of those sites is viral, meaning the more a user spreads his link, the more points he will receive and the less he will have to pay for his "subscription" in the future, so it is just obvious to suppose that those same links would be well spread over the Internet.
I would just search and scrape all the major warez forums out there for a week or two, and after that a search on the web should find all the remaining blogs/sites that still point to the pirated file.
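If you go the scraping route, here is a minimal sketch of the link-extraction step (Python; the query URL is left as a placeholder, since rapidlibrary.com sits behind a CAPTCHA and has its own form parameters, so treat this as a template rather than a working integration):

```python
# Sketch: pull rapidshare.com file links out of fetched pages, deduplicated,
# ready to paste into an abuse report.
import re
import urllib.request

RAPIDSHARE_LINK = re.compile(r"https?://rapidshare\.com/files/\d+/[^\s\"'<>]+")

def find_links(page_url):
    req = urllib.request.Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return set(RAPIDSHARE_LINK.findall(html))

# Usage (placeholder URL; substitute the result pages you actually scraped):
# for url in sorted(find_links("http://example.com/search-results.html")):
#     print(url)
```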