I was playing with my phone and there is this app on it that, when you use it to record 10 seconds of a song, tells you the title and artist of that song. Now, as a software engineer, I can't help but wonder: how does this work?
Well, Shazam has actually published a paper explaining the inner workings of the algorithm; you can find it at this address (pdf).
Basically, they have a huge database of all the songs that the algorithm can recognize, and they create a kind of "fingerprint" (a hash) of each track from its spectrogram. Then, when you record a part of a song and send it to them, they pass it through the same algorithm and try to match it against the fingerprints stored in the database.
Of course it's a lot more complicated than that, since they have to deal with recording noise and so on, but that's the basic idea.
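To make the fingerprinting idea concrete, here is a minimal Python sketch of the spectrogram-peak approach. This is not Shazam's actual implementation; the peak picking and hash format are simplified assumptions.

import numpy as np
from scipy.signal import spectrogram

def fingerprint(samples, rate, fan_out=5):
    # Turn raw audio into a set of (freq1, freq2, time-delta) landmark hashes.
    freqs, times, spec = spectrogram(samples, fs=rate, nperseg=1024)
    spec = np.log(spec + 1e-10)
    # Crude peak picking: keep the strongest frequency bin per time slice.
    peak_bins = spec.argmax(axis=0)
    peaks = [(times[t], freqs[b]) for t, b in enumerate(peak_bins)]
    # Pair each peak with the next few peaks to form landmark hashes.
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.add((round(f1), round(f2), round(t2 - t1, 2)))
    return hashes

def best_match(query_hashes, database):
    # database maps song title -> precomputed hash set; most overlap wins.
    return max(database, key=lambda song: len(database[song] & query_hashes))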
Find the answer here:
http://laplacian.wordpress.com/2009/01/10/how-shazam-works
Related
If you say "Alexa, sing for me", she will choose one of several songs that have been created with her voice. The voice(s) for each of these songs must have been created somehow.
At first, I thought that SSML would provide the tools necessary to do this, especially the <prosody> tag which has parameters for pitch and rate (duration).
I thought perhaps each syllable of singing could have its pronunciation specified with <phoneme> and its pitch and duration specified with <prosody>, with <break> tags in between:
<speak>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%" pitch="+50%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
  </prosody>
</speak>
However, when executed, Alexa applies her built-in inflection (to sound like a real human), and so the tone is not flat. These "ooh" sounds (above), for example, each have a falling tone. (They also have a noticeable break between phonemes, even though "no break" was explicitly specified.)
So then, how did the Alexa voice which is heard singing all of those songs get programmed? Was it via tools currently only available to Amazon developers?
It's also perplexing to me that I am apparently the only person on the internet even asking this question (based on zero results on Stack Overflow, Google, etc.), especially this late in the game. Aren't there loads of musicians out there who would love to be able to make Alexa sing whatever they want?
Edit: Guys, I thought it was common knowledge, but there is no human voice actor behind Alexa. Her voice is completely computer-generated.
Alexa's voice is completely computer generated, and so are the songs. Research into singing-synthesis models is ongoing (#1 and #2).
Here's a video by Popgun Labs about how they make their AI sing. Although I am unable to find out how Amazon and Google do this, my guess is that it's something similar.
EDIT: My earlier answer was based on an extension page and drew incorrect conclusions.
My prediction would be either something really fancy like natural language processing or AI/ML along those lines, or they just had a voice actor sing particular tones and cut them together. I don't own an Alexa, but I do have a HomePod mini and an iPhone, and from the way it pronounces our local singers' names like "sidhu moosewala" or "amrit maan" (off topic, but still related), I believe they just cut and put words together in a "clean" and "flowing" way.
Perhaps her voice is simply autotuned.
Certainly, pitch-shifting tools can force any desired pitch from any audio source, and I presume such tools can force duration changes as well.
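For what it's worth, that kind of forced pitch and duration change is easy to experiment with offline. Here is a rough sketch using librosa, chosen purely for illustration (not anything Amazon is known to use); the file names are hypothetical.

import librosa
import soundfile as sf

# Load a spoken clip (hypothetical file), shift it up 7 semitones,
# then stretch it to twice its length. This is not how Alexa does it.
y, sr = librosa.load("spoken_oo.wav", sr=None)
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=7)
stretched = librosa.effects.time_stretch(shifted, rate=0.5)  # rate < 1 slows down
sf.write("sung_oo.wav", stretched, sr)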
I'm looking for a way to match against a known data set, let's say a list of MP3s or WAV files, each of which is a sample of someone speaking. At this point I know file ABC is of Person X speaking.
I would then like to take another sample and do some voice matching to show whose voice it most likely is, given the known data set.
Also, I don't necessarily care what the person has said, as long as I can find a match; i.e. I don't need any transcription or anything like that.
I'm aware CMU Sphinx doesn't do speaker recognition and is primarily used for speech-to-text, but I have seen other systems, e.g. the LIUM Speaker Diarization toolkit (http://cmusphinx.sourceforge.net/wiki/speakerdiarization) or the VoiceID project (https://code.google.com/p/voiceid/), which use CMU Sphinx as a base for this type of work.
If I am to use CMU, how can I do voice matching?
Also, if CMU Sphinx isn't the best framework, is there an open-source alternative?
This is a subject complex enough for a PhD thesis. There are no good, reliable systems as of right now.
The task you're up for is a very complex one. How you should approach it depends on your situation.
Do you have a limited number of people? How many?
How much data do you have for each person?
If you have very few people to recognize, you may attempt something as simple as obtaining formants of those people and comparing them to a sample.
Otherwise, you will have to contact academics who work on the subject or jury-rig a solution of your own. Either way, as I said, it is a difficult problem.
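If you do take the simple route for a handful of speakers, one crude baseline (averaged MFCC vectors rather than formants, but in the same spirit) could look like the sketch below. The file names are hypothetical and there is no accuracy guarantee; this is a toy approach, not a production system.

import numpy as np
import librosa

def voiceprint(path):
    # Return a single averaged MFCC vector for one recording.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

# Known data set: file -> speaker (file names are hypothetical).
known = {
    "abc.wav": "Person X",
    "def.wav": "Person Y",
}
prints = {speaker: voiceprint(path) for path, speaker in known.items()}

def identify(sample_path):
    sample = voiceprint(sample_path)
    # Smallest Euclidean distance wins; crude, but a starting point.
    return min(prints, key=lambda spk: np.linalg.norm(prints[spk] - sample))

print(identify("unknown.wav"))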
I'm struggling to choose between a vast number of audio programming languages and APIs. I'm very (totally) new to audio programming so please bear with me.
Software
I need to be able to:
Alter volume of different sounds before outputting them to anything (these sounds can have a variety of different origins, for example mp3s and microphone input)
Phase shift sounds
Superimpose sounds that I have tweaked (as per items 1 and 2)
Control the output to each of 8 channels independently of one another
Make this all happen on Windows 7
These capabilities need to be abstracted by a graphical frontend that I will probably make myself. What I want to be able to do is create 'sound sources' and move them around a 3D environment, along pre-defined trajectories and/or in relation to the movement of whoever is inside the rig. The reason I want to do pitch bending is so I can mess with red-shift (Doppler-style) effects.
I don't want to have to construct full tracks beforehand and just play them. I want the sound that is played to depend on external input from sensors as well as on what I am doing in the frontend.
As far as I know, this means I can't use any existing full-blown audio production app.
The Question
I've been looking around for the API or language I should use, and I haven't drawn a blank; quite the opposite, actually. I'm struggling to narrow down my search. A lot of my problem stems from the fact that I have no experience in audio programming.
So, does anyone know off-hand of an API or language that meets my criteria?
Hardware stuff and goals
(I left this until last because I'm not sure how relevant it is)
My goal is to make three rings of speakers at different heights and to have enough control over them to be able to simulate any number of 'sound sources' within the array. The idea is to have someone stand in the middle of the rig and be able to make it sound like there are lots of things moving around them. To get this working I'm planning on doing a little trig and using 8 channels of audio from my PC. The maths is pretty straightforward; it's just the rest that I need to worry about.
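The trig mentioned here could start as simple as the per-speaker gain sketch below; the layout, distance law, and normalisation are all illustrative assumptions, not a finished spatialisation algorithm.

import math

NUM_SPEAKERS = 8
RING_RADIUS = 2.0  # metres, hypothetical

# Speaker positions evenly spaced on a circle around the listener at (0, 0).
speakers = [
    (RING_RADIUS * math.cos(2 * math.pi * i / NUM_SPEAKERS),
     RING_RADIUS * math.sin(2 * math.pi * i / NUM_SPEAKERS))
    for i in range(NUM_SPEAKERS)
]

def channel_gains(source_x, source_y):
    # Louder on speakers closer to the virtual sound source (inverse distance).
    gains = [1.0 / (1.0 + math.dist((source_x, source_y), spk)) for spk in speakers]
    total = sum(gains)
    return [g / total for g in gains]  # normalise so overall level stays constant

# A source just outside speaker 0 should weight channel 0 most heavily.
print(channel_gains(2.5, 0.0))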
What I want to do next is attach a bunch of cameras to the thing and do some simple image-recognition stuff to be able to 'attach sound sources' to different objects. E.g. if someone is standing in the right place, it can be made to seem as though all red balls quack like ducks and all orange ones moan hauntingly.
This is not to detract from Richard Small's answer, but to comment on some of the other options out there:
If you are looking for something higher-level with which you can prototype and develop this faster, you want Max/MSP or its open source competitor Pure Data. These are designed for musicians who are technically minded, but not so much for programmers. As a result, you can build this sort of thing quickly and efficiently.
You also have some lower-level options: PortAudio can handle your audio I/O, but you would have to do the sound generation, effects, and so on on your own or with other libraries. Cinder and openFrameworks both provide interfaces for audio, cameras, and other stuff for "creative programming". I'm afraid I don't know if they meet your full requirements, but they are powerful and popular for this sort of thing, so I encourage you to look at them.
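To make the PortAudio suggestion concrete, here is a rough multichannel-output sketch using the Python sounddevice bindings to PortAudio (a stand-in for the C API; it assumes the default device actually exposes 8 output channels).

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100
DURATION = 2.0

# A mono test tone standing in for any decoded source (mp3, mic input, ...).
t = np.linspace(0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)
mono = 0.2 * np.sin(2 * np.pi * 440 * t)

# Per-channel gains (these would come from the trig/panning maths in the question).
gains = np.array([1.0, 0.7, 0.4, 0.1, 0.0, 0.1, 0.4, 0.7])

# Superimpose: each output channel is the mono source scaled by its gain.
multichannel = mono[:, np.newaxis] * gains[np.newaxis, :]

sd.play(multichannel.astype(np.float32), SAMPLE_RATE)
sd.wait()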
The two major ones these days tend to be
Wwise
Wwise Download Link
FMOD
FMOD Download Link
These two engines may even in fact be overkill for what you need, but I can almost guarantee that they will be capable of anything you require.
I'm trying to develop an online application where the user writes some text and the software sings it back to the user.
I can currently generate the audio file with the words spoken by the computer using espeak, but I have no idea how to make it sound like a song, how to add rhythm to it.
I'm able to change the pitch and tempo using rubberband, but that's as far as I've gotten.
Does anyone have a clue how to make this happen?
If you want to use rubberband to change duration and pitch, then I think the hard part is going to be mapping from phonemes/syllables in the text to the corresponding audio ranges in the speech synthesis output, for which I have no simple suggestion. (Ideally you'd get inside the speech synthesiser so that it would provide you with the mapping from phonemes to audio locations.)
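As a sketch of what the rubberband route might look like once you do have syllable boundaries, here is a rough Python example using pyrubberband; the file names, segment times, and note values are made up for illustration.

import numpy as np
import soundfile as sf
import pyrubberband as pyrb

y, sr = sf.read("espeak_output.wav")  # hypothetical espeak output

# (start_sec, end_sec, semitone shift, target duration in seconds) per syllable;
# these numbers are invented, and finding them is the hard mapping step above.
score = [
    (0.00, 0.35, +2, 0.5),
    (0.35, 0.70, +4, 0.5),
    (0.70, 1.10, -1, 1.0),
]

notes = []
for start, end, semitones, target_dur in score:
    chunk = y[int(start * sr):int(end * sr)]
    chunk = pyrb.pitch_shift(chunk, sr, n_steps=semitones)
    # rate > 1 shortens, rate < 1 lengthens; stretch each syllable to its note length.
    chunk = pyrb.time_stretch(chunk, sr, rate=len(chunk) / (target_dur * sr))
    notes.append(chunk)

sf.write("sung.wav", np.concatenate(notes), sr)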
A simpler alternative might be to try Speech Synthesis Markup Language (SSML). Its <prosody> element has "pitch" and "duration" attributes that can specify pitch in Hz and duration in seconds absolutely. You can also specify volume, for controlling dynamics.
Given this, you could try to convert the text into an SSML document and mark up words/syllables/phonemes with pitch, duration, and volume attributes.
I've ended up using Festival's singing mode. It sounds reasonably good, except for the fact that it only works with English voices.
I am in the process of developing a game, and after two months of work (not full time, mind you), I have come to realise that our specs for the game are lacking a lot of details. I am not a professional game developer; this is only a hobby.
What I would like help or advice on is this: what are the major components found in games, which either have to be developed or already exist as libraries? The objective of this question is for me to be able to specify more game aspects.
So far, we have specified pretty much only how we would work on the visuals, completely forgetting everything about game logic (AI, entity interactions, quest logic (how we decide whether or not a quest is completed)).
So far, I have found those points:
Physics (collision detection, actual forces, etc.)
AI (pathfinding, objectives, etc.)
Model management
Animation management
Scene management
Combat management
Inventory management
Camera (make sure not to render everything that is in the scene)
Heightmaps
Entities communication (Player with NPC, enemy, other players, etc)
Game state
Game state save system
In order to reduce the scope of this question, I'd like it if you could specifically discuss aspects related to developing an RPG type of game. I will also point out that I am using XNA to develop this game, but I have almost no grasp of all the available classes yet (I'm pretty much only using the Game component and some related classes such as GameTime, SpriteBatch, and GraphicsDeviceManager, but not much more).
You have a decent list, but you are missing storage (save/load), text (text is important in RPGs: Unicode, font rendering), probably a macro system for text (something that replaces tokens like {player} with the player character's name), and, most important of all, content creation tools (map editor, character editor, dialog editor), because RPGs need content (or auto-generation tools if you need them). By the way, do you have links to your work?
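The macro system for text can start out as little more than a dictionary lookup; here is a throwaway Python sketch (the token names and values are made up, and a real game would pull them from its game state).

import re

def expand(text, context):
    # Replace {tokens} with values from context; leave unknown tokens untouched.
    return re.sub(r"\{(\w+)\}",
                  lambda m: str(context.get(m.group(1), m.group(0))),
                  text)

print(expand("Welcome back, {player}! The {town} guard wants a word.",
             {"player": "Ayla", "town": "Rivermoor"}))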
I do this exact stuff for a living so if you need more pointers perhaps I can help.
I don't know if this is any help, but I have been reading articles from http://www.gamasutra.com/ for many years.
I don't have a perfect set of tools to give you from the beginning, but your list covers most of the usual trouble for RUNNING the game. Have you found out what each one of the items actually involves, and how much have you made already? "Inventory management" sounds very heavy, but some games just need a simple "array" of objects; that takes an hour to program plus some graphical integration (if you have your GUI management done already).
How to start planning
When I develop games in my spare time, I usually get an idea because another game lacks some function/option. Then I start up whatever development tool I am currently using and try to see if I can make a prototype showing this idea. It's not always about fancy graphics; most often it's about finding out how to solve a certain problem. Green and red boxes will get you most of the way, but otherwise, use Google Images and do a quick search for prototype graphics. Remember that those images are probably copyrighted, so only use them for internal test purposes and to explain to your graphic artists what type of game/graphics you want to make.
Secondly, you'll find that you need to find/build tools to create the maps/missions/quests too. Today many developers create their own "object script" so they can easily add new content/paths to a game.
Many of the ideas we (my friends and I) have been testing started with a certain prototype of the interface, to see if it's possible to generate that sort of screen output first. Then we build a quick'n'dirty map/level editor that can supply us with test maps.
No game logic at this point; we're still figuring out whether the game engine in general is running.
My first game-algorithm problem
Back when I was in my teens I had a Commodore 64, and I was wondering: how do they sort 10 numbers in order for a high-score list? It took me quite a while to find a "scalable" way of doing this, but I learned a lot about programming too.
The second problem I found
How do I make a tank/cannon fire a bullet in the correct direction when I fly my helicopter around the screen?
I sat down and drew quick sketches of the actual problem, looked at the bullet lines, tried some theories of my own, and found something that seemed to work (by dividing and multiplying positions, etc.). Later on, in school, I discovered this to be more or less Pythagoras. LOL!
Years and many game attempts later
I played "Dune" and the later C&C + the new game Warcraft (v1/v2) - I remember it started to annoyed me how the lame AI worked. The path finding algorithms were frustrating for the player, I thought. They moved in direction of target position and then found a wall, but if the way was to complex, the object just stopped. Argh!
So I first sat with large amounts of paper, then tried to draw certain scenarios where an "object" (tank/orc/soldier) would go from A to B, and then suddenly there was a "structure" (building/other object) in the pathway: what then?
I learned about A* pathfinding (after first solving it on my own in a similar way, and later reading about why it works). It is a very CPU-heavy way of finding a path, but I learned a lot from the process of cracking this nut. Those thoughts have helped me a lot when developing other game algorithms over time.
So what I am saying is: I think you'll have to think more of:
How is the game to be played?
What does the user experience look like?
Why would the user want to come back to the game?
What requirements are needed? Broadband? 19" monitor with 1280x1024?
An RPG, yes - but will it be multi-user or single?
Do we need a fast network/server setup or do we need to develop a strong AI for the NPCs?
And much more...
I am not sure this is what you asked for, but I hope you can use it somehow.
There are hundreds of components needed to make a game, from time management to audio. You'll probably need to roll your own GUI, as native OS controls are very non-gamey. You will probably also need all kinds of tools to generate your worlds, exporters to convert models and textures into something suitable for your game etc.
I would strongly recommend that you start with one of the many free or cheap game engines that are out there. Loads of them come with the source code, so you can learn how they have been put together as you go.
When you think you are ready, you can start to replace parts of the engine you are using to better suit your needs.
I agree with Robert Gould's post, especially about tools, and I'd also add:
Scripting
Memory Management
Network - especially replication of game object states and match-making
Oh, and don't forget localisation - particularly for text strings
Effects and effect timers (could be magical effects, could just be stuff like being stunned.)
Character professions, skills, spells (if that kind of game).
World creation tools, to make it easy for non-programmer builders.
Think about whether or not you want PvP. If so, you need to really think about how you're going to do your combat system and any limits you want on who can attack whom.
Equipment, "treasure", values of things and how you want to do the economy.
This is an older question, but IMHO now there is a better answer: use Unity (or something akin to it). It gives you 90% of what you need to make a game up front, so you can jump in and focus directly on the part you care about, which is the gameplay. When you run aground because there's something it doesn't do out of the box, you can usually find a resource in the Asset Store for free or cheap that will save you a lot of work.
I would also add that if you're not working on your game full-time, be mindful of the complexity and the time frame of the task. If you try to integrate many different frameworks into your RPG, you can easily end up with several years' worth of work; maybe it would be more advisable to start small and develop only the "core" of your game first, and not bother about physics, for example. You can still add it in the second version.