Are there methods to trace text files that were web scraped, finding the user involved, even after the text has been machine translated? - security

Information
I'm a bit new and want to create a site with these capabilities but don't know where to start, so please point out if I've violated any rules or should write this differently.
I'll be a bit specific here: there is a web novel site where content is hidden behind a subscription.
So you would need to be logged into an account.
The novels are viewed through the site's viewer, which cannot be selected/highlighted and then copied.
Question
If you web scrape and download the chapters as .txt files and then machine translate them using something like Google Translate, is there a way to track the uploader of the MTL, or to trace the MTL file when it is shared?
Of a similar nature, there are aggregator sites that host non-MTL'd novels, but the translation teams have hidden lines which tell readers to go to their official site. The lines aren't visible on the official site though, only once the text has been copied. How is that possible?
I've read a little about JWTs, and I'm assuming they can identify the user while they're on the website, but what about in the text itself?
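From the little reading I've done, my guess is that an identifier could be hidden in the plain text itself, for example with invisible zero-width characters. Here's a rough sketch of what I imagine (purely my own guess, not anything from the actual site, and I'd assume machine translation or re-typing strips these characters anyway):

```python
# Guess at how a per-user fingerprint could hide inside plain text: encode the
# account ID as bits and sprinkle zero-width characters between words.
# Invisible when displayed, but still present if the raw text is copied and
# shared. (Machine translation or re-typing would most likely strip it.)
ZERO = "\u200b"  # zero-width space      -> bit 0
ONE = "\u200c"   # zero-width non-joiner -> bit 1

def embed(text: str, user_id: int) -> str:
    bits = format(user_id, "032b")                  # 32-bit fingerprint
    marks = [ONE if b == "1" else ZERO for b in bits]
    words = text.split(" ")
    # append one invisible character after each of the first 32 words
    return " ".join(
        w + (marks[i] if i < len(marks) else "") for i, w in enumerate(words)
    )

def extract(text: str) -> int:
    bits = "".join("1" if c == ONE else "0" for c in text if c in (ZERO, ONE))
    return int(bits[:32], 2) if len(bits) >= 32 else 0

stamped = embed("the quick brown fox jumps over the lazy dog " * 4, 123456)
print(extract(stamped))  # -> 123456
```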
Additional Question (if the above is possible; you don't have to answer this, just curious)
If it is possible to embed some kind of identifying token, is there a way to break it, perhaps by converting the file into an encrypted EPUB?

Related

Should I be using SharePoint sub-sites or metadata?

We're starting to use SharePoint 2013 to manage our department's process documentation and I have some questions about best practices for site structure. I'm a little surprised I can't find the answers via web search, since this seems like a basic question every new SharePoint user must deal with.
Moving from a file share environment, I'm trying hard to get out of that mindset and I understand the many benefits of SharePoint over file shares. I also understand why creating folders in SharePoint forces arbitrary divisions on files whereas one large set of documents with metadata lets you filter and group the files based on different needs.
What's confusing me is that I also read that it's better to have too many sub-sites than not enough. It seems like sub-sites can easily become pseudo-folders and I'm not sure where that line is crossed.
Here's an example.
We have a SharePoint site devoted to our department. We've created a sub-site dedicated to an application we developed to load data into our business systems. It mainly holds technical documentation about the application. This application supports many different data sources, each with its own set of user instructions for loading, its own schedule (calendar), contact lists, supporting files, etc. There's no compelling reason to separate them to restrict access. However, there doesn't seem to be a lot of value in having them all in the same sub-site, either, since someone working on a job will only want to see the docs and supporting files for that data source. I just can't foresee someone wanting to view supporting files across all data sources, although I could see someone wanting to see the schedule for all data sources combined.
My question is, should I create separate sub-sites under the application for each data source or do I just store everything in the application sub-site and use metadata and views to group things by data source? Putting all the items for a specific data source into its own sub-site seems to be much simpler to manage and present than having to specify metadata for every new file and creating a lot of views. However, I can't shake the feeling that I'm still using file share thinking. Or maybe I'm just missing some basic concept of SharePoint.
Any advice or links to good discussions of this topic would be greatly appreciated. Thanks.
I would recommend that you use metadata and views to separate the data within one repository/site.
My reasons are as follows:
In SharePoint, it is recommended to use metadata rather than "evil" folders (or subsites, in your case). Keep in mind that maintaining multiple subsites requires a big administrative effort in the long term; for example, some sites will inherit permissions while others will have unique permissions.
As time passes and people rotate, it becomes unclear where the data was stored and where new data should go, especially with a large number of subsites.
Since confidentiality is not a concern in your case, keeping data centralized and open to people working in related fields increases sharing and collaboration. In contrast, using subsites increases the possibility of data silos.
People are lazy :). Take me as an example: I don't want to remember all those xyz URLs; I want to go to one place and be able to fetch everything that I need.
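To make the metadata-and-views idea a bit more concrete, here is a rough sketch (not production code) of pulling the documents for one data source out of a single library by filtering on a hypothetical "DataSource" column through the SharePoint 2013 REST API. The site URL, column name and credentials are placeholders, and on-prem SharePoint 2013 typically wants NTLM authentication:

```python
# Query one "Documents" library and filter by a metadata column instead of
# navigating per-data-source subsites. Site URL, column name ("DataSource")
# and credentials are placeholders.
import requests
from requests_ntlm import HttpNtlmAuth  # pip install requests requests_ntlm

site = "https://sharepoint.example.com/sites/ourdept/dataloader"
url = (
    site + "/_api/web/lists/getbytitle('Documents')/items"
    "?$select=FileLeafRef,DataSource"
    "&$filter=DataSource eq 'Source A'"
)

resp = requests.get(
    url,
    auth=HttpNtlmAuth("DOMAIN\\someuser", "password"),
    headers={"Accept": "application/json;odata=verbose"},
    timeout=10,
)
resp.raise_for_status()

# With odata=verbose the items come back under d/results.
for item in resp.json()["d"]["results"]:
    print(item["FileLeafRef"], "-", item["DataSource"])
```

Whether you go through REST, CSOM, or simply filtered list views in the browser, the point is the same: one library, one set of metadata columns, many views.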

Search engine components

I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: search.py file that accepts search query from the web interface and returns the search results)
Web interface for querying and showing results
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's web crawlers immediately search through every single webpage existing on the WWW? Or do they:
First download all the existing webpages on the WWW, save them on their huge servers, and then search through these saved pages?
If the latter is the case, then wouldn't the results appearing on the Google search page be outdated, since I suppose crawling and searching through all the webpages on the WWW would take a tremendous amount of time?
PS. One more question: how exactly does a web crawler retrieve all the web pages existing on the WWW? For example, does it go through all the possible web addresses, like www.a.com, www.b.com, www.c.com, and so on? (Although I know this can't be true.)
Or is there some way to get access to all the existing webpages on the world wide web? (Sorry for asking such a silly question.)
Thanks!!
The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results search engines return can easily be outdated. And a couple of years ago they really were quite outdated. Only relatively recently did Google and others start to do more real-time searching, by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, but they took real-time search offline again in July 2011. Otherwise they take note of, for example, how often a web page changes, so they know which ones to crawl more often than others. And they have special systems for it, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, index it for full text search
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
Page rank computation (a rough sketch of this follows the list below)
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
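As a small illustration of the page rank step mentioned above, the core idea can be approximated with a few lines of power iteration over the extracted link graph. This is only a toy sketch with a made-up graph, not how Google actually implements it:

```python
# Toy PageRank by power iteration over a tiny, made-up link graph.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}
damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks roughly stabilise
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

# c.com collects the most inbound links and ends up with the highest rank.
print(sorted(rank.items(), key=lambda kv: kv[1], reverse=True))
```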
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, then following their links, and so on. In addition to that, they have other ways of learning about new web sites: for example, if people use their public DNS server, they will learn about the pages those people visit, or about links shared on G+, Twitter, etc.
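To make the "retrieve pages, index the text, extract and follow links" steps above concrete, here is a toy Python sketch. A real crawler is massively distributed, respects robots.txt, throttles itself, deduplicates pages, and so on; the seed URL here is just a placeholder:

```python
# Minimal crawl-save-index loop: fetch a page, store it, index its words,
# queue its links. The seed URL is a placeholder.
from collections import defaultdict, deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

seeds = ["https://example.com/"]
seen, queue = set(seeds), deque(seeds)
saved_pages = {}                        # url -> raw HTML ("their huge server")
inverted_index = defaultdict(set)       # word -> urls containing it

while queue and len(saved_pages) < 20:  # small cap for the toy example
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue
    saved_pages[url] = html                           # retrieve and store

    soup = BeautifulSoup(html, "html.parser")
    for word in soup.get_text().lower().split():      # tokenize and index
        inverted_index[word].add(url)
    for a in soup.find_all("a", href=True):           # extract links to follow
        link = urljoin(url, a["href"])
        if link not in seen:
            seen.add(link)
            queue.append(link)

# "Searching" is then a lookup in the saved index, not a live trawl of the web.
print(inverted_index.get("example", set()))
```

The indexing happens on separate backend machines in real systems, which is exactly why the results you see can lag behind the live web.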
There is no way of knowing what all the existing web pages are. There may be some that are not linked from anywhere, that no one publicly shares a link to (and whose visitors don't use their DNS, etc.), so they have no way of knowing about these pages. Then there's the problem of the Deep Web. Hope this helps.
Crawling is not an easy task (for example, Yahoo is now outsourcing crawling to Microsoft's Bing). You can read more about it in Page and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
You can find more details about storage, architecture, etc. on the High Scalability website, for example: http://highscalability.com/google-architecture

Securing PDF and embedded video

My company delivers programming instructions for the products we sell as both streaming video (hosted on CloudFront) and PDFs (hosted on Amazon S3). We don't want our customers to be able to take the content out of these PDFs, save the PDFs, or share the links. At the same time, we don't want people to be able to steal the videos (though we're less concerned about the videos).
I've been racking my brain trying to figure out the best options on securing this. What are the limitations with PDF security, at the end of the day, can you stop them? Or at least make it really hard?
Unless you create and deliver your data in a custom format, with your own viewer that has built-in content protection mechanisms, you are out of luck. Everything you deliver to the client can be captured, copied, and distributed. With PDFs and video streams this is trivial.
If you can suffer the PDF generation overhead, you could individualize the PDFs by putting the customer's name on each page. Turn off editing as well, and that'll discourage people. It'll still be quite possible to get around these, of course.
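If you do go down that route, the per-customer stamping itself is easy to script. A rough sketch using the pypdf and reportlab Python libraries (file names and the customer name are placeholders, and as noted, none of this stops a determined copier):

```python
# Stamp the customer's name onto every page of an existing PDF and write a
# per-customer copy. File names and the customer name are placeholders.
from io import BytesIO

from pypdf import PdfReader, PdfWriter      # pip install pypdf reportlab
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def personalize(src_path, dst_path, customer):
    # Build a one-page overlay containing the customer's name.
    buf = BytesIO()
    overlay = canvas.Canvas(buf, pagesize=letter)
    overlay.setFont("Helvetica", 9)
    overlay.drawString(40, 20, f"Licensed to {customer}")
    overlay.save()
    buf.seek(0)
    overlay_page = PdfReader(buf).pages[0]

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(overlay_page)       # draw the name onto each page
        writer.add_page(page)

    # Owner password with no user password: anyone can read it, but editing
    # is flagged as restricted (trivial to strip, as noted above).
    writer.encrypt(user_password="", owner_password="change-me")
    with open(dst_path, "wb") as f:
        writer.write(f)

personalize("instructions.pdf", "instructions-jane-doe.pdf", "Jane Doe")
```

This deters casual sharing at most; anyone determined can still remove the name and the restrictions.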

Is it possible to create an app for a site without an API?

I would like to create an app for a myBB forum, so that the forum will look nicer and much cleaner on an iPhone or Android.
Is it possible without an API? It isn't my site either.
Everything is possible; it's just a matter of resources...
Technically, you can write an app for anything on the web, but:
An API tells you how you can do things with the site, without having to reverse engineer all pages/posts/... and the format of every output resulting from POST/GET operations. Reverse engineering may take a long time, and you will surely not come across all possible results (error pages, bad authentication, ...);
An API is quite stable and is always updated with great care by the developers so as not to break existing applications. Without an API, there is no guarantee that your app will not break with the next release of the forum when it is upgraded;
A web API generally defines an output format which is easily parseable: many APIs output XML or JSON, which can be processed with standard libraries. Without an API, the output format is plain HTML, which may be difficult to reorganize in order to show the results in a different format.
So, yes, you can definitely write an app for a myBB forum, but it may require a fair amount of work.
You can; it's called screen scraping, and it is what was done before XML, the semantic web, SOAP, web services, and then JSON APIs tried to solve the problem better.
In screen scraping, you grab the site's HTML, parse it, get the data you want out of it, then do what you need with that data. It's more work, and breaks each time the site's layout changes, hence the history of improvements to it.
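As a rough illustration of what that looks like in practice, here is a minimal Python sketch that grabs a hypothetical myBB thread page and pulls out the post bodies. The URL and the CSS class are assumptions, since the markup depends on the forum's version and theme:

```python
# Minimal screen-scraping sketch for a (hypothetical) myBB thread. The URL and
# the CSS class are assumptions; inspect the real HTML and adjust them.
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

url = "https://example-forum.com/showthread.php?tid=123"  # placeholder thread
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for post in soup.select("div.post_body"):  # class name depends on the theme
    print(post.get_text(strip=True))
    print("-" * 40)
```

Expect to revisit the selectors whenever the forum's templates change, as noted above.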
You mention the site in question is not yours. Many sites do not regard screen scraping as fair use, so check with the site's terms and conditions that you can legally create an app from the data posted there.
You can consider using HTML5... do you think that is doable for your app?

Document Management [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
We are looking for a simple, open source, web-based document management system for Linux. By document management I mean the ability to store a set of files (minimally doc, xls and pdf) as a document; associate metadata with the document, like owner and version; update and delete documents; index and search content; and authentication, with the ability to authorize at least read, and possibly write, access. If possible I would like to avoid implementations in Java or PHP, and as we already use MySQL, that would work especially well for metadata storage.
We have used Google Applications in the past, but the lack of support for PDF makes it a poor fit. Other downsides include their service losing some of our spreadsheets, no concept of a company owning information as opposed to individual accounts, and the fact that some of our information is sensitive and we prefer keeping it in-house (passwords, contracts, etc.).
MediaWiki was not a good fit either, as our documents are really a set of files as opposed to structured content (i.e. we are not looking for a content management system), and at least the version we had installed did not deal well with attachments.
Based on review of past questions I plan on looking into KnowledgeTree. Any other projects that we should consider?
I've been using KnowledgeTree now for a few months developing an ASP.Net application and I only have good things to say about it. Our product uses it for PDF storage/retrieval and it really couldn't be easier to deal with. The basic install gives you a simple environment with almost endless amounts of configuration for meta-data, document groups, and various security options. Also, the KnowledgeTree staff have been very helpful and have provided us with sample code when we have run into 'how are we going to do that?' moments.
I'll second the recommendation for KnowledgeTree. Have been using it for a couple years and have roughly 1K documents indexed. Sometime last year, I wrote a short script which monitors KT's transaction table (in MySQL) and notifies users of new or updated documents via Twitter, Identica, and/or Jabber. The Twitter/Identica feeds can then be monitored with a RSS reader.
Look for something that will index all your document formats and keep them searchable.
I solved this in my office using ColdFusion. It has the Verity search engine built in, which indexes files on your network (doc/xls/pdf, etc.) to make the text in them searchable (like Google).
An instant search engine for all my files (up to 150,000 or so) is built in for free with ColdFusion, so it suits my purpose. Something like this would allow you to save your files on a network however/wherever you like, and you'd be able to extract the rest of the information, such as owners and modification dates, through libraries available in Java/.NET.
I am sure you could replicate this with another language, but probably with a bit more effort. I am presently wishing I could use the Google Docs API as a WYSIWYG editor in my own in-house wiki; that would solve most of my problems, because everything would be intranet based.
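For what it's worth, a bare-bones version of that "index a folder and search it" idea in another language could look like the Python sketch below. It only handles plain-text files; real doc/xls/pdf content would need a text-extraction step (or a proper engine like Lucene/Solr) in front of it, and the folder path is a placeholder:

```python
# Bare-bones "index a folder of files, then search the index" sketch.
# Only plain-text files are handled; the folder path is a placeholder.
from collections import defaultdict
from pathlib import Path

def build_index(folder):
    index = defaultdict(set)            # word -> files containing it
    for path in Path(folder).rglob("*.txt"):
        for word in path.read_text(errors="ignore").lower().split():
            index[word].add(path)
    return index

index = build_index("/srv/documents")
print(index.get("contract", set()))     # files mentioning "contract"
```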
Try https://www.mayan-edms.com; it's written in Django and is database agnostic.
You can consider GroupDocs, as they have storage, conversion, and a few more features.
