How Shazam Works

How Shazam Works
    Watch the video

    click to begin

    Youtube

    This episode of Real Engineering is brought to you by Brilliant, a problem solving website
    that teaches you to think like an engineer.
    Introduction: Opening, scene in a pub listening to a song and opening the shazam app.
    Maybe be tricky to film.
    What you just witnessed was the Shazam app recognising a song in a noisy environment,
    and proceeding to find a match for it among the millions of songs in its servers database.
    For most this probably seems like a trivial task.
    Our brains can identify songs incredibly quickly from a young age, but the pathways in your
    brain that allow you to identify a song quickly are incredibly complex.
    Often times you simply need to hear just a few chords to know exactly what song is about
    to play, that jolt of excitement when you can hear a DJ fading in the baseline of your
    favourite song.
    A simple combination of tones in a specific order allow you to identify a song from the
    thousands of other songs you have heard in your life in an instant, but coding a computer
    do the same thing is an incredible challenge.
    A computer does not have an intuitive understanding of music.
    A computer can only compare songs to other songs in its database, looking for a match
    by comparison.
    It is a problem akin to finding a needle in a haystack, where you can only find the needle
    by looking at a picture of a needle and comparing it to each individual straw, comparing it's
    length and colour until you finally find the needle.
    To create a software capable of doing this task quickly poses a very interesting coding
    challenge and the solution the engineers at Shazam came up with gives us some interesting
    insight into how our own brains work.
    A study by the Manchester Museum of Science and Industry tested 12,000 people's ability
    to recognise a song.
    They created an interactive game to search for the most recognisable songs, where they
    would play the hook of 1000 best selling songs and recorded the time required to identify
    them.[1]
    Can you identify this song with just 2.3 seconds of the hook?
    That was the Spice Girls song Wannabe, which ranks highest with a recognition time averaging
    just 2.3 seconds, and that's including the reaction time required to hit the button.
    Our brains are hardwired for this kind of pattern recognition.
    In a world where recognising the sound of an approaching threat meant life or death,
    we have evolved incredibly efficient ways of categorizing and accessing historical data
    like this.
    Our brain does not take the sound and compare it to every sound we have ever heard like
    a computer, the specific combination of chords in progression simply activates specific neurons
    that unlock that historical data.
    What if the chords were played by a different instrument?
    Would we recognise the song as quickly?
    Those same 2.3 seconds played on a guitar sounds like this.
    The notes are exactly the same, but they don't sound the exactly the same.
    We even know intuitively what instrument is playing.
    Why is that?
    This is called the timbre of a note and different instruments have different timbres.
    Pianos and guitars are examples of harmonic instruments and when they produce a note,
    they aren't just producing a pure note of a single frequency.
    Each note is a combination of multiple frequencies all related to the base note, the fundamental
    frequency.
    These are called overtones, and they are simply multiples of the base frequency.
    Each instrument has a unique combination and evolution of these overtones that give it
    that unique sound.
    Again, it's quite easy for our brains to distinguish between a piano and a guitar,
    but we need a way to quantify these characteristics for a computer to recognise, and this is where
    the spectrogram comes in.
    A spectrogram is a visual representation of sound.
    It's a 3D graph with time on the x-axis, frequency on the y-axis, and the amplitude
    of the frequency, or in other words the loudness, on the z-axis, which is often represented
    by a colour.
    This 3D graph is something a computer can absolutely recognise and store as data, but
    there is huge amount of data within a spectrogram like this, and the more data there is the
    more computation time is required to find a match.
    So the first step in reducing computation time is reducing the data required to classify
    a song.
    Shazam uses something they call a fingerprint, where they transform these spectrograms into
    something that looks like star map.
    [2] Here each star represents the strongest frequencies at particular times.
    Doing this, we have not only reduced our graph from 3 dimensions down to 2, but have drastically
    reduced the amount of data points on the graph.
    This is the first vital part of Shazam's technology.
    Every single song in Shazams database is stored in a fingerprint like this.
    When you open your phone and hit that Shazam button, the app accesses your microphone and
    begins to create its own fingerprint of the sound waves it receives.
    This ingenious method also helps the shazam app to filter out noise because it only creates
    data points for stand-out frequencies.
    Once the app has created a fingerprint of your audio, it then sends it to the shazam
    servers where the recognition part of the process begins.
    This is where things get difficult.
    Let's look at a simplified song fingerprint, and a recorded fingerprint to see why.
    The recorded fingerprint is only a short recording of the song, in our example we have just 3
    possible frequencies, and each recorded fingerprint will have just 3 time points.
    If we want to check the first 3 time points in the song for a match we first check the
    3 frequencies, then we move onto the next time point and check the 3 possible frequencies
    again, and do the same for the final time point.
    If we find a match, that is 9 operations required to find a match, but obviously that isn't
    likely.
    We then need to do those nine operations for every time point in the song, or perhaps every
    time point in Shazams massive music archive, this obviously is going to take a lot of computation
    time.
    This is not how Shazam looks for a match.
    First Shazam categorises fingerprints in a clever way.
    We don't search to see if a note exists in a song, we search to see if several notes
    exist separated by a particular time, just as brain does.
    This becomes our searchable address for a hash table.
    Hashes and hash functions are an incredibly useful technique that appear everywhere in
    computer science.
    Hash functions can be found in search algorithms used by Google, to make sure files are downloaded
    correctly, and are the backbone of crypto currencies like bitcoin.
    [3]
    A hash function takes a varying length of input and produces a fixed length output,
    called a hash.
    In practice, the input can be anything from a short piece of text, like a password, to
    a long data stream like an entire movie.
    Consider a library of books.
    We want to store each book on a shelf so we can find it later, and we know we'll have
    the title of the book when we're searching for it.
    We can use a hash function to decide which shelf to put a book on, using the title of
    the book as the input and producing a shelf number as an output.
    The first goal of a hash function is to produce outputs that are uniformly distributed...In
    our library, we want the books to be spread evenly across the shelves, so no shelf in
    particular will end up full of books, leaving others almost empty.
    The second goal of a hash function is that it should reduce collisions.
    A collision is when two different inputs produce the same output hash.
    In our case, a collision results in two or more books on the same shelf.
    If our library only has two shelves, collisions will be really common, no matter what hash
    function you use.
    If our library had a billion shelves, a good hash function will mean collisions will be
    rare.
    Another goal of a hash function is that it should be quick to calculate.
    If our library has millions of books, we don't want to take too long figuring out which shelf
    each one needs to be on.
    A simple hash function might be to take the title of a book, and group them on shelves
    alphabetically.
    This would be really quick to calculate, but it would result in a lot of collisions, with
    many books on the same shelf, and wouldn't be very well distributed.
    Think about how many book titles begin with the word "THE", compared to how many book
    titles start with the letter "Z".
    An alternative might be to take the position of each letter in the alphabet and sum up
    the letters in the book title.
    We could then divide that number by the number of shelves we have and take the remainder
    as the shelf number to store the book on.
    This would still be fairly fast to calculate, and would prevent all the books with titles
    starting with the word "THE" being stored on the same shelf.
    Now imagine, instead of book titles, our hash function takes data from our two frequencies
    separated by particular time as an input, t, and produces a number between 1 and...say
    1 billion.
    First, we go through our database of songs and calculate the hash number for each anchor
    point.
    Songs will contain multiple anchor points, which will allow us to categorise short snippets
    of songs by the frequency of the anchor point, the frequency the following point and the
    time between them.
    And just like the library, we store each anchor point in order by the hash.
    These addresses' are also categorised with song IDs and time stamps within the song in
    a secondary hash table, allowing us to search for matching songs.
    This makes it much faster to locate our matches, and to find our song we will require multiple
    matching anchor points.
    This ingenious method of song recognition allowed Shazam to be sold for 400 million
    dollars to Apple, and help you figure out just what that catchy song is.
    This is a very simplified view of how the programming of Shazam works, but I have linked
    my research materials below if you would like to read more into the process.
    Or if you would like to begin learning more about programming and build a solid foundation
    of understanding, then you could take this course on Computer Science Fundamentals on
    Brilliant which will start with an intro to algorithms and gradually build you up to more
    complex ideas like data types and structures, by the end of this course you will have discovered
    algorithms that can be used to extract answers from data, and when you are finished you can
    take the follow up course in computer science algorithms.
    This is just one of many courses on Brilliant, with more courses due to released soon on
    things like automotive engineering and Python Coding.
    If I have inspired you and you want to educate yourself, then go to brilliant.org/RealEngineering
    and sign up for free.And the first 73 people that go to that link will get 20% off the
    annual Premium subscription.
    As always thanks for watching and thank you to all my Patreon supporters.
    If you would like to see more from me the links to my instagram, twitter, subreddit
    and discord server are below.
    Designing the Perfect Airport Runway Can We Terraform the Sahara to Stop Climate Change? GoPro - How a Hero is Losing Millions - A Case Study For Entrepreneurs 10 Billionaires That Are Cheaper Than You $60,000 for our stolen photo: We made a copyright thief PAY! How to Zinc Electroplate - 1950's British Railway Bolt Restoration How Peter McKinnon gained 1 million subscribers in under 1 year Interview with an Ex Flat Earther WHAT IT TAKES To Edit Big TV Shows A Defense of 24 FPS and Why It's Here to Stay for Cinema