Which copy protection techniques are available for digital material? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Suppose a website offers the following resources for premium users:
PDF Files
Video Files
Presentations (e.g. .ppt files)
Which protection techniques are available to prevent (or at least slow down) users from copying and redistributing these resources?

PDF - password protection
Video files - DRM (unable to play without license file)
Presentations? No idea.
Most of these techniques are also sure ways to repel normal users from your site.

One good way to protect your material is to make your web site the easiest way to get/view/access your stuff. Note that Apple makes millions of dollars selling MP3s on iTunes that are wholly unprotected, because it is easier for most people to grab them on iTunes than to find them on torrent sites.
Ultimately, you will not be able to prevent a determined user from copying and redistributing your material. The most you can do is try to slow them down. Whatever encryption method you end up using will require a key, and that key will need to end up on your user's computer. Therefore, a determined user will have everything they need to grab the content from you. What you can do is annoy average users enough that they decide it is not worth the trouble. However, there is a fine line to walk between annoying users enough that they pay, and annoying them so much that they leave your site entirely.

Nothing will prevent the user from redistributing anything that can be downloaded to the local device. Very few techniques will actually slow this down, either. Almost all of them will inconvenience legitimate users.
Create compelling content and offer it for a compelling price. Those who see the value will buy it; those who don't see the value would never have bought it to begin with, so you aren't really losing anything.

For images, you can put in a watermark (translucent text over the image, not very noticeable, saying something like "© 2010 Me inc.").
The same goes for video files, but in video you could move the watermark around to make removing it (which is already extremely hard) even harder.
For presentations, I have no clue either, but you could always try having "© 2010 Me inc." at the bottom of all the slides, or on the background picture.
In truth, there is no way to fully protect your files, but these solutions will do the most to slow down, and possibly stop, the user from redistributing your work.

how can I protect scraping of certain data on my web pages? [closed]

Closed 11 years ago.
I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in place of those numbers, but if the image is not warped as with a CAPTCHA, won't scripts be able to decipher the number anyway? Also, how much of a performance hit would loading images be vs. text?

The only way to make sure bad-guys don't get your data is not to share it with anyone. Any other solution is essentially entering an arms race with the screen-scrapers. At one point or another, one of you will find the arms-race too costly to continue. If the data you are sharing has any perceptible value, then probably the screen-scrapers will be very determined.

It's not possible.
You use javascript and encrypt the page, using document.write() calls after decrypting. I either scrape from the browser's display or feed the page through a JS engine to get the output.
You use Flash. I can poke into the flash file and get the values. You encrypt them in the flash and I can just run it then grab the output from the interpreter's display as a sequence of images.
You use images and I can just feed them through an OCR.
You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authority source. It's also handy to change your output formats regularly to keep up, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.
Your image solution won't help much, and images are far less efficient. A number is usually only a few bytes long in HTML encoding. Images start at a few hundred bytes and grow to 1 KB or more depending on how large you want them. Images also will not render in the font the user has selected for their browser window, and are useless to people who use assistive technology (visually impaired people).

Apart from the images, you could display the numbers using JavaScript or Flash.
You could also use CSS to position individual digits using various combinations of absolute or relative positions.
You could also use JavaScript to help you create these DIVs.
The point is just to obfuscate enough that it becomes really hard.
One more solution is to use images of segments or single dots and re-construct the images of the digits using CSS, a bit like a dot-matrix display.
You could litter the source of the page with these absolutely positioned DIVs and again make it more difficult to reconstruct by creating them dynamically.
At any rate, you can't stop a determined scraper from getting to the data: it doesn't take a lot to automate a web browser and take screenshots that can be fed to an OCR.
There is also nothing stopping anyone from paying someone pennies to get the data manually anyway.
The point is: how determined are your opponents (users?).
It's a bit like the software protection business: making things hard enough that you would deter casual 'pirates' is not too hard, and it's a fairly good approach in general.
However, if there is much value in the data you present, there is nothing you can really do to protect it.
All you can do is make it hard enough that casual 'thieves' will prefer to continue paying for your services rather than circumvent the protection.

Javascript would probably be the easiest to implement, but you could get really creative and have large blocks of numbers with certain ones being viewable by placing layers on top of the invalid numbers, blending the wrong numbers into the background, or making them invisible via css and semi-randomly generated class names.

I can't believe I'm promoting a common malware scripting tactic, but...
You could encode the numbers as encoded Javascript that gets rendered at runtime.

Generate an image containing those numbers and display the image. :-)

I think you guys are being too reactive with these solutions. JavaScript, CAPTCHA, even litigation and the DMCA process don't address the complex adaptive nature of web scraping and data theft. Don't you think the "ideal" solution to prevent malicious bots and website scraping would be something working as a real-time proactive mitigation strategy? Very similar to a Content Protection Network. Just sayin'.
Examples:
IBM - IBM ISS Data Security Services
DISTIL - www.distil.it

Can you provide a little more detail on what it is you're doing? Certainly there's a performance hit to create an image instead of dumping out the text of a number, but how often would you be doing this per day?
Using JavaScript is the same as using text. It's trivial to reverse engineer.

Use animated numbers using Flash. It may not be foolproof, but it would make it harder to crack.

What about posting a lot of dummy numbers and showing the right ones with external CSS? That works just as long as the scraper doesn't start to parse the external CSS.

Don't output the numbers, i.e. prefix
echo $secretNumber;
with //.

For all those who recommend using JavaScript or CSS to obfuscate the numbers: well, there's probably a way around it. Firefox has a plugin called Abduction. Basically, it saves the page to a file as an image. You could probably modify this plugin to save the image and then analyze the image to find the secret number you are trying to hide.
Basically, if there's enough incentive behind scraping these numbers from the page, then it will be done. Otherwise, just post a regular number, and make it easier on your users so they won't have to worry about not being able to copy and paste the number, or other such problems that result from this trickery.

Just do something unexpected and weird (different every time) with the CSS box model. Force them to actually use a browser-backed screen scraper.

I don't think this is possible. You can make their job harder (use images, as some suggested here), but that is all you can do; you can't stop a determined person from getting the data. If you don't want them to scrape your data, don't publish it. As simple as that...

Assuming these numbers are updated often (if they aren't, then protecting them is completely moot, as a human can just transcribe them by hand), you can limit automated scraping via throttling. An automated script would have to hit your site often to check for updates; if you can limit these checks, you win without resorting to obfuscation.
For pointers on throttling see this question.
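As a rough illustration of the throttling idea, here is a minimal per-client sketch in Python. It assumes a token-bucket scheme, which the answer above does not name; the class name, the rate and capacity values, and the injectable clock are all illustrative choices rather than part of any particular framework.

```python
import time


class TokenBucket:
    """Per-client token bucket: allows `capacity` requests in a burst,
    then refills at `rate` tokens per second (illustrative sketch)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the behaviour can be tested
        self.buckets = {}           # client id -> (tokens left, last timestamp)

    def allow(self, client_id):
        now = self.clock()
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill according to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False
```

A request handler would call `allow()` with something like the client's IP address or account ID and answer with an error (for example HTTP 429) when it returns `False`. Keying on IP alone is easy to evade with proxies, so this slows scrapers down rather than stopping them.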

Web Usability - Background Music

I personally loathe background music on a website. My client has opposite feelings on the subject. I added music because the customer is always right, though I'd like to revisit the subject with them.
Almost everyone would agree that it is annoying and wastes precious bandwidth, but are there any usability studies, or recommendations from someone esteemed in the profession, that can provide a valid argument against background music?

Usability is not the only concern. Consider the following scenarios:
1 - Someone browses to the site while at work in a shared office, and now all of their co-workers think "Gee, he's wasting time".
2 - Someone browses to the site while in a room with a sleeping baby, and now they have to spend an hour getting him/her back to sleep.
3 - Someone browses to the site while they are listening to their own music, and now they hear a cacophony of shrieks until one source is muted.
Also, consider that any benefit gained from the music on your website will be totally lost on anyone who has their speakers muted. So your audience can be divided between:
A - People who cannot hear the music
B - People who can hear it, but do not like it
C - People who can hear it, and do like it
I would not care to estimate the percentages associated with each of these groups, but keep in mind that category "B" is actively offended by your website. To take a line from the Hippocratic oath, one rule of web design should be "do no harm".

Metrics. You'll never be able to convince a business person with an emotional answer.
If you investigate the situation empirically you'll be able to give them something irrefutable.
I would try an experiment (get Google Analytics):
have one site with the music as-is, measure the bounce rate, etc.
have an identical site without music, measure the bounce rate, etc.
Have the server randomly serve up the different pages for a couple of weeks (until you have a significant amount of data) and see what happens.
Maybe we're wrong (I hate music too). I hope your customer is wrong, but who knows.
You could also add a survey link and try to get people to answer that as well (but without an incentive that might not work)
Stats can be your friend here :)
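A minimal sketch of the experiment in Python, under stated assumptions: visitors are split by hashing a visitor ID (instead of a pure coin flip, so a returning visitor keeps seeing the same version), and a "bounce" is taken to mean a visit with exactly one page view. The function names and the visitor ID are hypothetical; an analytics tool would normally report the bounce rate for you.

```python
import hashlib


def assign_variant(visitor_id):
    """Deterministically put a visitor in the 'music' or 'no_music' group."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    return "music" if digest[0] % 2 == 0 else "no_music"


def bounce_rate(page_views_per_visit):
    """Fraction of visits that viewed exactly one page (assumed definition)."""
    if not page_views_per_visit:
        return 0.0
    bounces = sum(1 for pages in page_views_per_visit if pages == 1)
    return bounces / len(page_views_per_visit)
```

Comparing `bounce_rate()` for the two groups only means something once each group has enough visits; a small difference on a few dozen visits is noise.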

I would also:
Calculate (size of the audio file(s) in GB × number of hits per month × number of months) × bandwidth cost per GB.
Then tell them how much money they are wasting.
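The arithmetic can be sketched in Python as transferred volume multiplied by the price per GB. It assumes every hit downloads the full audio file (no browser caching) and that bandwidth is billed per GB; the function names and the numbers in the usage note are made up for illustration.

```python
def monthly_audio_cost(file_size_mb, hits_per_month, price_per_gb):
    """Estimated monthly bandwidth cost of serving the audio on every hit."""
    gb_transferred = file_size_mb * hits_per_month / 1024  # MB -> GB
    return gb_transferred * price_per_gb


def total_audio_cost(file_size_mb, hits_per_month, price_per_gb, months):
    """Same estimate summed over several months."""
    return monthly_audio_cost(file_size_mb, hits_per_month, price_per_gb) * months
```

For example, a 4 MB file served on 25,600 hits a month is 100 GB of transfer, so at a hypothetical 0.10 per GB that is about 10 a month in whatever currency the host bills.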

Basically, it boils down to this:
Audio on websites is a bad idea. No one likes it.
Try to educate your client that it is a bad idea. (It's annoying, different levels of sound can cause problems, yadda yadda) Mention that most users don't take sites seriously if they use sound. It's a very '99 thing to do.
If your client does not budge, (politely) remind him/her that they are paying you for your expertise as an internet professional. You are the expert on the web, and they have hired you for that expertise.
If they still won't budge, keep the sound and make sure they are happy. The bottom line is keeping the client happy.

Music also interferes with screen reader users. I'm a blind computer user, and nothing annoys me more than having music start playing and drown out my speech program that's trying to read the site. Nothing will make me close a website quicker than unwanted audio.

It took a bit but I found a site that talks about usability on web sites.
They have a video on the right hand side of this page:
http://www.ciaromano.com/evaluating/testing.php
It shows why audio ads are not a good idea on websites.
Hope this helps.
G-Man

Just make sure that there is a way to turn it off. It really depends on the type of Website, because multimedia-heavy sites (i.e. sites for Movies or Games) can benefit from it, but if I'm listening to some of my own music, I definitely want a way to turn it off.
Oh and please, no crappy MIDI-Files that people already hated in 1993 when they were novel.

This is a tough one -- and what's amazing is that at the moment, I have a client who's demanding the exact same thing.
Personally I don't know of any usability studies addressing this topic specifically, but there's plenty of anecdotal evidence out there from users complaining about the intrusiveness or outright corniness of unrequested background music. * That said, clients still ask for it. Best you can do is try to explain the situation to them, try to gather a few good examples of people complaining about it from the Web at large, build a case, and hope the client goes for it.
In my case, she completely agrees that it's potentially annoying, understands it cuts against the grain of user expectations and politeness, but wants it anyway. So I'm building it. Whaddyagonnado.
* Indeed, you could probably use this thread as evidence! Good luck.

Consider taking a different path with the client.
Ask them what the purpose for the music is...
If it is to instill a particular feeling or mood in the visitor of the site, consider taking them through all the points mentioned in the answers here and discuss how that may undermine the intended purpose of the music.
Then you will be able to talk to the client about different ways to instill the same "ambience" to the website without resorting to music. This is really a design issue and not usability.
If the background music/sound was to convey some information, then it is a usability issue as people who for technological or biological reasons cannot hear the sound at the correct volume will miss out on that. Therefore the site is not as usable as it should be.

Unfortunately, as a service provider of sorts, all we can do is cringe and give the customer what they want - after documenting your disapproval both commented in the code and in writing to the client, of course.

Pardon me, but I have a different opinion about playing music on a website, with all due respect to the other answerers in this thread.
I see visits to e-commerce websites as being like a trip to a shopping complex, where you have a cart, a variety of products, checkout counters and background music to make your stay as comfortable and interesting as possible.
There's a whole psychological explanation of what certain slow-paced music can do to certain parts of the brain. Some studies even suggest that certain music plays a role in motivating customers to purchase more items. Check this site
This can definitely be a plus point for a website. Of course it depends on what kind of website it is. However, slow, non-vocal music shouldn't necessarily disrupt one's attention; rather, it might have the opposite effect.
My justification is that when a potential customer visits a site, he is only using one of his senses while browsing through the pages: his eyes! I'm saying, why not allow him (if he wants) to use his sense of hearing as well, so that the site captures his attention not only through fancy text, design and animations that look nice to the eyes, but also through music (allowing him to be more in touch with the site)?
It's obviously not possible to trigger his senses of smell and taste. But why limit it to only the eyes? Why not use the ears too?
Whether you choose to put music into your site or not, MichaelStum's point about having an option to turn off the music is essential.
Of course, in the end it's all about the amount of traffic that comes to your website. For this matter, Cbrulak's idea of using Google Analytics would be a realistic approach.

Pivotal Suboptimal Decisions in the History of Software [closed]

Closed 10 years ago.
Throughout the history of software development, it has sometimes happened that some person (usually unknown, probably unwittingly) made what, at the time, seemed a trivial, short-term decision that changed the world of programming. What events of this nature come to mind, and what has been our industry's response to mitigate the pain?
Illustration (the biggest one I can think of): When IBM designed the original PC, and decided to save a couple dollars in manufacturing costs by choosing the half-brain-dead 8088 with 8-bit-addressable memory, instead of one of the 16-bit options (8086, 680n, etc.), dooming us to 20 years of address offset calculations.
(In response, a lot of careers in unix platform development were begun.)
Somewhere toward the other end of the scale lies the decision someone made to have a monster Shift Lock key at the left end of the keyboard, instead of a Ctrl key.

Paul Allen deciding to use the / character for command line options in MS DOS.

Allocating only 2 digits for the year field.
And the mitigation was to spend huge amounts of money and time just before the fields overflowed to extend them and fix the code.

Ending Alan Turing's career when he was only 42.

Microsoft deciding to use the backslash rather than the forward slash as the path delimiter. And failing to virtualize the drive letter.

Actually, the 8088 and 8086 have the same memory model and the same number of address bits (20). The only difference is the width of the external data bus, which is 8 bits for the 8088 and 16 bits for the 8086.
I would say that the use of inconsistent line endings by different operating systems (\n - UNIX, \r\n - DOS, \r - Mac) was a bad decision. Eventually Apple relented by making \n the default for OS X, but Microsoft is stubbornly sticking to \r\n. Even in Vista, Notepad cannot properly display a text file using \n as the line ending.
The best example of this problem is the ASCII mode of FTP, which just adds \r to each \n in a file transferred from a UNIX server to a Windows client, even though the file originally contained \r\n.
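To make the double-conversion complaint concrete, here is a small Python sketch. It is not how any particular FTP client is implemented; the function names are hypothetical and it only contrasts a blind "add a carriage return before every line feed" conversion with one that normalizes first.

```python
def normalize_newlines(data: bytes) -> bytes:
    # Convert CR LF (DOS) and lone CR (classic Mac) to LF (Unix).
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def naive_unix_to_dos(data: bytes) -> bytes:
    # Blindly put a CR in front of every LF, even if one is already there.
    return data.replace(b"\n", b"\r\n")


def safe_to_dos(data: bytes) -> bytes:
    # Normalize first, so existing CR LF pairs are not doubled up.
    return normalize_newlines(data).replace(b"\n", b"\r\n")
```

With input that already contains a DOS line ending, `naive_unix_to_dos` produces a stray extra carriage return, while `safe_to_dos` leaves every line ending as a single CR LF pair.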

There were a lot of suboptimal decisions in the design of C (operator precedence, the silly case statement, etc.), that are embedded in a whole lot of software in many languages (C, C++, Java, Objective-C, maybe C# - not familiar with that one).
I believe Dennis Ritchie remarked that he rethought precedence fairly soon, but wasn't going to change it. Not with a whole three installations and hundreds of thousands of lines of source code in the world.

Deciding that HTML should be used for anything other than marking up hypertext documents.

Microsoft's decision to use "C:\Program Files" as the standard folder name where programs should be installed in Windows. Suddenly working from a command prompt became much more complicated because of that wordy location with an embedded space. You couldn't just type:
cd \program files\MyCompany\MyProgram
Anytime you have a space in a directory name, you have to encase the entire thing in quotes, like this:
cd "\program files\MyCompany\MyProgram"
Why couldn't they have just called it c:\programs or something like that?

Apple ousting Steve Jobs (the first time) to be led by a succession of sugar-water salesmen and uninspired and uninspiring bean counters.

Gary Kildall not making a deal with IBM to license CP/M 86 to them, so they wouldn't use MS-DOS.

HTML as a browser display language.
HTML was originally designed as a content markup language, whose goal was to describe the contents of a document without making too many judgments about how that document should be displayed. Which was great, except that appearance is very important for most web pages and especially important for web applications.
So, we've been patching HTML ever since with CSS, XHTML, Javascript, Flash, Silverlight and Ajax all in order to provide consistent cross-browser display rendering, dynamic content and the client-side intelligence that web applications demand.
How many times have you wished that browser control languages had been done right in the first place?

Microsoft's decision not to add *NIX-like execute/noexecute file permissions and security in MS-DOS. I'd say that ninety percent of the windows viruses (and spyware) that we have today would be eliminated if every executable file needed to be marked as executable before it can even execute (and much less wreak havoc) on a system.
That one decision alone gave rise to the birth of the Antivirus industry.

Using 4 bytes for time_t and in the internet protocols' timestamps.
This has not bitten us yet - give it a bit more time.
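For a sense of when the time runs out, here is a short Python sketch. It assumes the 4-byte value is a signed count of seconds since 1970-01-01 UTC, which is the common case but not the only possible one; the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def as_int32(value):
    """Wrap an integer the way a signed 32-bit field would."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds):
    """Interpret a seconds-since-epoch count as a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_good = to_utc(INT32_MAX)                  # 2038-01-19 03:14:07 UTC
after_wrap = to_utc(as_int32(INT32_MAX + 1))   # 1901-12-13 20:45:52 UTC
```

One second after the last representable moment, a wrapped counter jumps back to December 1901.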

Important web sites like banks still using "security questions" as secondary security for people who forget their passwords. Ask Sarah Palin how well that works when everybody can look up your mother's maiden name on Wikipedia. Or better yet, find the blog post that Bruce Schneier wrote about it.

EBCDIC, the IBM "standard" character set for mainframes. The collation sequence was "insane" (the letters of the alphabet are not contiguous).
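The gap in the alphabet is easy to see from Python, which ships a codec for one EBCDIC code page, `cp037`. Which EBCDIC variant the answer has in mind is an assumption here; the helper name is illustrative.

```python
import string


def is_contiguous(codes):
    """True if each code is exactly one more than the previous one."""
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))


ascii_codes = list(string.ascii_lowercase.encode("ascii"))
ebcdic_codes = list(string.ascii_lowercase.encode("cp037"))

# In cp037, 'i' is 0x89 but 'j' is 0x91: there is a hole between them.
gap_after_i = "j".encode("cp037")[0] - "i".encode("cp037")[0]
```

This is why a range test such as "is this byte between 'a' and 'z'?" identifies letters in ASCII but also accepts non-letter codes in EBCDIC.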

Lisp's use of the names "CAR" and "CDR" instead of something reasonable for those basic functions.

Null References - a billion dollar mistake.

Netscape's decision to rewrite their browser from scratch. This is arguably one of the factors that contributed to Internet Explorer running away with browser market share between Netscape 4.0 and Netscape 6.0.

DOS's 8Dot3 file names, and Windows' adoption of using the file extension to determine what application to launch.

Using the qwerty keyboard on computers instead of dvorak.

Thinking that a password would be a neat way to control access.

Every language designer who has made their syntax different when the only reason was "just to be different". I'm thinking of S and R, where comments start with #, and _ is an assignment operator.

Microsoft copying the shortcut keys from the original Mac but using Ctrl instead of a Command key for Undo, Cut, Copy, Paste, etc. (Z, X, C, V, etc.), and adding a near worthless Windows key in the thumb position that does almost nothing compared to the pinky's numerous Ctrl key duties. (Modern Macs get a useful Ctrl key (for terminal commands), and a Command key in the thumb position (for program or system shortcuts) and an Alt (Option) key for typing weird characters.)
(See this article.)

Null-terminated strings

7-bits for text. And then "fixing" this with code pages. Encoding issues will kill me some day.

Deciding that "network order" for multi-byte numbers in the Internet Protocol would be high order byte first.
(At the time, the heterogeneous nature of the net meant this was a coin-toss decision. Thirty years later, Intel-derived processors so completely dominate the marketplace that low-order-byte-first seems like it would have been the better choice.)
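The difference between the two byte orders can be shown with Python's standard `struct` module, where `!` means network (big-endian) order and `<` means little-endian. The sample value is arbitrary.

```python
import struct

value = 0x0A0B0C0D

# Network order: high-order byte first.
network_bytes = struct.pack("!I", value)

# Little-endian, as used natively by Intel-derived processors.
little_bytes = struct.pack("<I", value)

# Reading network-order bytes back into an integer.
decoded = struct.unpack("!I", network_bytes)[0]
```

On a little-endian machine, every multi-byte header field has to be swapped like this on the way in and on the way out.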

Netscape's decision to support Java in their browser.

Microsoft's decision to base Windows NT on DEC VMS instead of Unix.

The term Translation Lookaside Buffer (which should be called something along the lines of Page Cache or Address Cache).

Having a key for Caps Lock instead of for Shift Lock. In effect it's a Caps Reverse key, but with Shift Lock it could have been controllable.

Card Wall + online card wall = duplication? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
I'm not a great fan of duplicating effort. I do find, however, that there are benefits to tracking agile iteration progress on both a physical card wall and an online "calculator" (Excel, some scrum tools) or an online card wall (e.g. Mingle).
I find that a physical card wall in the team space provides a visceral kind of connection to the status of the cards... and that moving a card physically when you finish something provides a level of satisfaction that can't be duplicated online. I can feel the card... and people can see me walk up to the wall to move something.
Online tools provide great capabilities to share remotely and to calculate progress (e.g. in Mingle, you can use the built-in tools to automatically calculate burn-ups or burn-downs from the real data, saving lots of administrative time in doing those things manually).
I'm curious if agile practitioners maintain two tracking media like I do, and how do you present the benefits of the physical wall to those who say "I can do it online... why would I want to do it on a card wall instead?".

I feel the same. There is something very psychologically satisfying in moving a physical card around on a wall. Thinking managerially, we like stats and we like them to be automated as much as possible.
Perhaps you can keep both? Use the physical wall as the main daily source of information your team work from. Then, assign one person (e.g. the scrum master) to take down the live status and put it into Mingle/Excel at the end of each day.
As long as there is good benefit for the users to have both, then you should find both keep happening alongside each other nicely. Find out what the motivators are for each tool. For example:
Physical wall:
Instant reaction
Quick visual
Physical satisfaction
Online records:
Really really useful statistics
People can be rewarded against the stats in there (e.g. points completed)
Hope this helps.

My team has struggled with this as well. Electronic data makes analysis and reporting very easy and enables associations of checkins with a backlog item, but it's a lot easier to manage cards during the standup. Plus, it's a lot easier to get a "5000 foot view" of the project from looking at a large wall than at a small monitor.
No matter what you do, you're either going to have some duplicate effort, or you're going to have a process with some pain points. The goal is to find the balance between the amount of duplicate effort and the value that it affords.
We're still working on finding that balance :) Here's what we do:
During planning, we throw everything into OneNote. Formatting is a bit of a pain, but we're getting better.
After planning, our ScrumMaster enters the data from OneNote into an Excel document for generating our burndown. He then exports this data into TFS, for associating checkins, and does a mail-merge to print each task on a label which is then affixed to a post-it and added to the wall.
During the standup we move the post-its around on the wall.
After the standup, the ScrumMaster updates the Excel doc, generates the burndown update, and sends it around to the team.
As a team member this is pretty low-friction, but it's pretty wasteful of the ScrumMaster's time.

I greatly prefer Cards on the wall for a few simple reasons:
Everyone knows how to use them. No software training required.
Not subject to problems with the network, someone's computer needing maintenance, etc. Even in a blackout, people can still update their cards. This may sound like a joke, but it can be nice to have something to do when, for whatever reason, you cannot use your PC.
Programmers can freely update the cards while they are booting up/compiling.
Easy to see them all at a glance.
Ideal for meetings if you're in a scrum environment and having a mini meeting around a desk.
I like jotting a note on the card when it's moved, with the time and the mover... for tracking bugs/features.

Cross link your online and card wall.
Set up two way replication. Method is left as an exercise for the student.
Also handy to catch whiteboard content from discussions.

We use both, and I can't imagine doing it any other way. Part of it may be that we find our "online card wall" a little too clunky to easily maneuver, but we use the physical cards for getting a quick idea of what developers are working on, letting QA know which cards are ready for testing, and for QA to post what is ready for our weekly demos. The dev area, QA area, and ready to demo areas are three physically distinct places, with the ready to demo being most easily accessed. We also use the physical cards for final scoring.
Could we do all of this online? Yes. Would it be quicker and easier? No way!

We've abandoned using cards after the sprint planning session (they get added to Rally) because it doesn't make sense for us to track in multiple places. Our scrum master is accountable for making sure people enter their tasks appropriately and move them (that's what the daily standup is for). The 5000 foot view is much better in an online tool than a bunch of cards on a wall that can only be categorized two-dimensionally (or maybe three if you stack enough on top of each other).

We use both a card wall and ProjectCards. It's painful for me because I sync the two of them, but it's worth it to have the feedback for the team that is local.
We've bandied about the idea of getting a large touch screen, but I still would rather have physical cards. The other idea I've been toying around with is having a printer which will automatically print out an index card whenever a story is added to ProjectCards.

I was just wondering. How about a giant projector based touch wall. ;)
Best of both worlds. This might give some pointers.
http://johnnylee.net/projects/wii/

There's something very good about a big wall everyone can always see. I think we need a way to print onto regular thick index cards, but I've had no luck, so it is duplicated effort at the moment.

Electronic Card Wall Using RFID: this allows you to use a physical wall, with the data mastered in the software of your choice. As you move cards around, the software updates accordingly.

If you use JIRA, http://wallsync.net will keep your cards in sync for you...

How do I search content, within audio files/streams? [closed]

Closed 10 years ago.
I have always wondered how many different search techniques existed, for searching text, for searching images and even for videos.
However, I have never come across a solution that searched for content within audio files.
For example: Let us assume that I have about 200 podcasts downloaded to my PC in the form of mp3, wav and ogg files. They are all named generically, say podcast1.mp3, podcast2.mp3, etc. So it is not possible to know what the content is without actually listening to them. Let's say that I am interested in finding out which of the podcasts talk about 'game programming'. I want the results to be shown as:
Podcast1.mp3 - 3 result(s) at time index(es) - 0:16:21, 0:43:45, 1:12:31
Podcast21.ogg - 1 result(s) at time index(es) - 0:12:01
So my questions:
How could one approach this problem?
Are there any suitable algorithms developed to do something like this?
One idea that cropped up in my mind was that one could use 'speech-to-text' software to get transcripts along with time indexes for each of the audio files, then parse the transcripts to get the output.
I was considering this as one of my hobby projects.
Thanks!

If you want to search for text (i.e. what is being said) inside an audio stream you would have to process it with some kind of speech recognition algorithm and store the text as meta data associated with the files. For video you could also do text recognition for text inside the video. Evernote already does this for text inside image files, but has no support for audio as far as I know.
Something similar is possible when using audio to search for audio. I don't know the details of these algorithms, but I'm guessing they involve some kind of frequency analysis. Shazam is using this kind of technology to identify songs based on audio clips.
Here are some Wikipedia articles that may be useful:
Speech recognition
Fast Fourier transform
Frequency analysis (frequency spectrum)
Optical character recognition (OCR)
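Once a speech recognition step has produced timestamped text, the search itself is simple. The sketch below assumes each transcript is already available as a list of `(start_seconds, text)` segments; that data shape and the function names are hypothetical, and the speech-to-text step is left out entirely.

```python
def format_time(seconds):
    """Format a second count as H:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def search_transcripts(transcripts, phrase):
    """transcripts: {filename: [(start_seconds, text), ...]}.
    Returns {filename: [time index, ...]} for segments containing the phrase."""
    phrase = phrase.lower()
    results = {}
    for filename, segments in transcripts.items():
        hits = [format_time(start) for start, text in segments
                if phrase in text.lower()]
        if hits:
            results[filename] = hits
    return results


def report(results):
    """Render results in the format asked for in the question."""
    return [f"{name} - {len(hits)} result(s) at time index(es) - {', '.join(hits)}"
            for name, hits in results.items()]
```

A plain substring match like this misses phrases that straddle two segments and any words the recognizer got wrong, so a real version would probably want overlapping segments or fuzzy matching.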
