I've spent the last few months evaluating different AI sound generation systems. My original question was whether I could detect AI speech. (Yup.) However, this exploration took me from AI voices to AI videos and eventually to AI music. The music aspect strikes me as particularly interesting because of the number of distinct components that must be combined. These include:
- Score: The written music that conveys the tune, melody, harmony, instruments, etc. This is the documentation so that other musicians can try to reproduce the same music. But even with unwritten music, something still has to come up with the melody, harmony, composition, etc.
- Instrumentation: The score doesn't contain everything. For example, if the music says it is written for a piano, there are still lots of different types of pianos and they all have different sounds. The instrumentation is the selection of instruments and when they play.
- Lyrics: The written words that are sung.
- Vocals: The actual singing.
This is far from everything. I'm not a musician; real musicians have their own terminologies and additional breakdowns of these components.
Each of these components has its own gray levels between AI and human. For example, the vocals can be:
- AI: Completely generated using AI.
- Real: Completely human provided.
- Augmented: A human voice that is adjusted, such as with autotune or other synthetic modulators. (Like Cher singing "Believe", or almost anything from Daft Punk.)
- Synthetic: From a synthesizer -- artificially created, but not AI. This could be a drum machine, a full synthesizer like a Moog or Roland, or even a Yamaha hybrid piano with a built-in background rhythm player like Rowlf the Dog uses. (Eurythmics are a good example of a music group whose earlier works were heavily dependent on synthesizers.)
- Human edited: Regardless of the source, a human may edit the vocals during post-production. For example, Imogen Heap's "Hide and Seek" loops the human's voice at different pitches to create a layered vocal harmony. And Billy Joel's "The Longest Time" features multiple singing voices that are all Billy Joel. He recorded himself singing each part, then combined them for the full song.
- AI edited: Regardless of the source, an AI system may edit the vocals during post-production. Tools that can do this include Audimee and SoundID VoiceAI. (Not an endorsement of either product.) Both can generate harmonies from single voice recordings.
The same goes for the score, arrangement, and even the lyrics. There isn't a clear line between "human" and artificial creations. Unless it's a completely live performance (acoustic and unplugged), most music is a combination.
Detecting AI (Just Beat It!)
Detecting whether something is AI-generated, synthetic, or human -- and the degree of each combination -- can be really difficult. Currently, each AI system has its own detectable 'tells' and quirks. However, each generation changes the detectable artifacts, and there may be a new generation every few months. In effect, detection is a moving target.
Because this is a rapidly changing field, I'm not too concerned about giving away anything by disclosing detection methods. Any artifacts that I can detect today are likely to change during the next iteration.
Having said that, it seems relatively easy to differentiate between AI, synthetic, and human these days. Just consider the music. An easy heuristic relies on the "beat":
- No matter how good a human musician is, there are always micro-variations in the beat. The overall song may be at 140 bpm (beats per minute), but the tempo at any given moment may be +/- 5 bpm (or more).
For example, I graphed the beats over time for the song "Istanbul (Not Constantinople)" by They Might Be Giants. This comes from their live album, Severe Tire Damage:
The red lines at the bottom identify each detected beat, while the blue line shows a ten-second running average to determine the beats per minute. This version of the song starts with a trumpet solo and then they bring in the drums. The trumpet appears as a very unsteady rhythm, while the drums are steadier but still show a continual waver due to the human element.
As another example, here's Tears for Fears singing "Everybody Wants to Rule the World":
Even though they use a lot of synthetic music elements, Tears for Fears combines the synthetic components with humans playing instruments. The human contribution causes the beats per minute to vary.
- Synthetic music, or music on a loop, is incredibly consistent. The only time it changes is when a human changes up the loop or melody. For example, this is Rayelle singing "Good Thing Going":
Even though the vocals are from a human, the music appears synthetic due to the consistency. There's a slight change-up around 2 minutes into the song, where the music switches to a different loop.
Another incredible example is "Too High" by Neon Vines:
The music is completely synthetic. You can see her creating the song in the YouTube video. Initially, she manually taps the beat, but then it loops for a consistent 109 bpm. At different points in the song, she manually adds to the rhythm (like at the 1 minute mark), adding in a human variation to the tempo.
Many professional musicians record to an electronic "click track" or digital metronome that helps lock in the beat, which also makes the music's rhythm extremely consistent. The song "Bad Things" by Jace Everett is a perfect example:
If you listen to the music, you can clearly hear the electronic percussion keeping time at 132 bpm. There may have also been some post-production alignment for the actual drums. (No offense to the drummer.)
- AI music systems, like Suno or Udio, have a slow variation to the beat. It varies too much to be synthetic, but not enough to be human. For example, "MAYBE?!" by Around Once uses Suno's AI-generated music:
The beat has a slow variation that doesn't change with the different parts of the song (verse, bridge, chorus, etc.). This is typical for AI-generated music (not just Suno).
The distinction between "synthetic" and "AI" music is not always clear. Modern synthesizers often use embedded AI models for sound design or generative sequencing, which blurs the boundaries. Moreover, post-production beat alignment against a click track can make a real performance appear synthetic or AI-generated.
By evaluating the beat over time, the music can be initially classified into human, synthetic, or AI-generated. (And now that I've pointed that out, I expect the next version of these AI systems to make it more difficult to detect.)
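The beat heuristic described above can be sketched in code. This is my own illustrative sketch, not a tool mentioned in this post: it assumes the beat timestamps have already been extracted by some onset detector (that step isn't shown), and the spread thresholds are uncalibrated guesses, not measured values.

```python
import statistics

def bpm_series(beat_times, window=10.0):
    """Ten-second running average of beats per minute, mirroring
    the blue line in the graphs above. beat_times is a sorted list
    of beat timestamps in seconds."""
    series = []
    for t in beat_times:
        in_window = [b for b in beat_times if t - window <= b <= t]
        if len(in_window) >= 2:
            span = in_window[-1] - in_window[0]
            if span > 0:
                series.append(60.0 * (len(in_window) - 1) / span)
    return series

def classify_rhythm(beat_times, tight=0.5, loose=3.0):
    """Rough three-way call based on how much the windowed BPM wanders.
    The tight/loose thresholds are illustrative, not calibrated."""
    series = bpm_series(beat_times)
    if len(series) < 2:
        return "unknown"
    spread = statistics.stdev(series)
    if spread < tight:
        return "synthetic"   # metronomic: a loop or a click track
    if spread < loose:
        return "ai"          # slow wobble: too steady for a human
    return "human"           # micro-variations throughout the song
```

In practice, the hard part is the beat extraction itself (a real pipeline would use an onset/beat tracker, such as the one in librosa); once you have the timestamps, the classification really is this simple.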
Lyrics and AI (Word Up!)
Lyrics are equally interesting because of how the different approaches combine words. For example:
- ChatGPT can compose lyrics. However, the model I tested usually drops the "g" from "ing" words (e.g., "drivin'" instead of "driving") and uses lots of em-dashes for pauses. (Real writers use em-dashes sparingly.) When mentioning women, it regularly uses "baby" or "honey". (Because those are the only two terms of endearment in the English language, right?) It also seems incapable of repeating a verse without changing words. As far as intricacy goes, ChatGPT is great at including subtle elements such as emotion or innuendo. (Note: I tested against ChatGPT 4. Version 5 came out a few days ago.)
- Suno has two lyric modes: the user can provide their own lyrics, or it can generate lyrics from a prompt. When using the prompt-based system, the songs are usually short (six structures, like verses, bridge, and chorus). As for linguistics, it seems to prefer night/evening language over day/morning and uses the specific word "chaos" (or "chaotic") far too often.
Unlike ChatGPT, Suno is great at repeating verses, but it sometimes ends the song with partial repetition of previous verses, weird sounds, or completely different musical compositions. The AI-generated score, instrumentation, and vocal components don't seem to know when to end the song, and may not follow the written lyrics to the letter. The free Suno version 3.5 does this often; the paid version 4.5 does it much less often, but still does it.
- Microsoft Copilot? Just don't. It writes really primitive short songs with bad wording and inconsistent meter.
- Gemini is a slight improvement over Copilot. The lyrics are often longer and have better rhyming structure, but lack any subtlety or nuance.
In every case, the AI-generated lyrics often include poor word choices. This could be because the AI doesn't understand the full meaning of the word in context, or because it chose the word's sound over the meaning for a stronger rhyme. (If you're going to use AI to write a song, then I strongly suggest having a human act as a copy editor and fix up the lyrics.)
If you need a fast song, then try using ChatGPT to write the initial draft. However, be sure to have a human edit the wording and maybe replace the chorus. Then use Suno to put it to music and generate the AI vocals.
During my testing, I also evaluated songs from many popular artists. From what I can tell, some recent pop songs appear to do just that: they sound like ChatGPT followed by human edits. A few songs also seem to use some AI music to create the first draft of the song, then used humans and/or synthesizers to recreate the AI music so it becomes "human made". Basically, there are some composition choices that AI systems like to make that differ from human-scored music. (I'm not going to name names because the music industry is litigious.)
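The lyric tells listed above lend themselves to a crude automated check. This is a toy sketch of my own, not a tool used in this post: the patterns come from the ChatGPT quirks I described (dropped "g" endings, em-dashes, "baby"/"honey"), and the flag threshold is an arbitrary guess.

```python
import re

# Marker patterns drawn from the ChatGPT tells described above.
# The list and the score threshold are illustrative, not calibrated.
TELLS = {
    "dropped_g": re.compile(r"\b\w+in'"),                    # drivin', flyin'
    "em_dash": re.compile(r"\u2014|--"),                     # em-dash pauses
    "endearment": re.compile(r"\b(baby|honey)\b", re.IGNORECASE),
}

def tell_counts(lyrics):
    """Count how often each marker appears in the lyric text."""
    return {name: len(rx.findall(lyrics)) for name, rx in TELLS.items()}

def looks_generated(lyrics, threshold=3):
    """Crude flag: many markers packed into a lyric suggests an AI draft."""
    return sum(tell_counts(lyrics).values()) >= threshold
```

A real detector would need far more signals (verse-repetition drift, vocabulary skew like "chaos", meter analysis), but even this kind of surface count separates obvious AI drafts from edited lyrics surprisingly often.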
Ethics and AI (Should I Stay or Should I Go?)
I'm not sure how I feel about AI-generated music. On one hand, a lot of AI-generated music is really just bad. (Not 'bad' like 'I don't like that genre' or song, but 'bad' as in 'Turn it off', 'My ears are bleeding', and 'The lyrics make no sense'.) Even with AI assistance, it's hard to make good music.
On the other hand, not everyone is a musician or can afford a recording studio. Even if you have a good song in your head, you may not have the means to turn it into a recording. AI-generated music offers the ability for less-talented people (like me) to make songs.
The use of AI to make creative arts is very controversial, and strong concerns are found in the AI-generated artwork field. However, copying an artist's style (e.g., "in the artistic style of Larry Elmore") is not the same as saying "in the musical genre of 80s Hair Metal". One impersonates a known artist and infringes on their artistic rights, while the other is a generalized artistic style. (In copyright and trademark law, there have been "sound-alike" disputes with style emulation. So it's not as simple as saying that it's fine to stick to a genre. A possible defense is to show that everyone copies everyone and then find a similar riff in a much older recording that is out of copyright.)
AI-generated music is also different from AI-books. I'm not seeing a flood of AI-created crap albums hitting online sellers. In contrast, some book sellers are drowning in AI-written junk novels. The difference is that you often can't tell the quality of a book's contents without reading it, while online music sellers often include a preview option, making it relatively easy to spot the bad music before you buy it.
Unlike artwork or books, music often has cover performances, where one band plays a song that was originally written by someone else. For example:
- They Might Be Giants' version of "Istanbul (Not Constantinople)" (1990) was a cover of a song written in 1953 by Jimmy Kennedy, but the tune dates back to 1928's Paul Whiteman and His Orchestra.
- Elvis Presley's "Hound Dog" (1956) is based on a 1952 version by Big Mama Thornton. (I prefer Big Mama's version.)
- And let's not forget Hayseed Dixie's A Hillbilly Tribute to AC/DC. They recreated some of AC/DC's rock classics as country music -- same lyrics, similar melody, but different pace and instrumentation. (No offense to AC/DC, but I like the hillbilly version of "You Shook Me All Night Long" much more than the original.)
I don't see AI-music as "creating more competition for real musicians". Between cover bands and the wide range of current alterations permitted in music (autotune, beat machines, synthesizers, mashups, sampling, etc.), having completely AI-generated music and vocals seems like a logical next step. Rather, I view AI-music as a way for amateurs, and the less musically talented, to create something that could pass for real music.
Bad Poetry and AI (Where Did My Rhythm Go?)
I was told by a friend who is a musician that songs often start with the melody; the lyrics are then written to fit the music. However, many AI-music systems flip that around. They start with the lyrics and then fit the melody to the tempo of the lyrics. It's an interesting approach and sometimes works really well. This means: if you can write, then you can write a song. (Not necessarily a good song, but you can write a song.)
I'm definitely not a musician. I joke that I play a mean kazoo. (A kazoo doesn't have to be on key or play well with others.) As for lyrics, well, I tell people that I'm an award-winning professional poet. While that's all true, it's a little misleading:
- I've entered two limerick contests and won both of them. (Thus, "award winning.")
- One of my winning limericks was published in the Saturday Evening Post. (Thus, published.)
- And the published poem paid me $25! (Paid for my poem? That makes me a professional!)
With these AI systems, I quickly learned that a formal poetic structure really doesn't sound good when put to music. The uniformity of syllables per line makes for a boring song. For a good song, you really need to introduce rhythm variations and half-rhymes. (In my opinion, a good song is really just a bad poem put to music.)
During my testing, I listened to lots of real music (from professional musicians) as well as AI-generated music (from amateurs). I also created a ton of test examples, using a variety of lyric generation methods and genres. (Yes, the beat detector holds up regardless of the genre, singing voice, etc.) Let me emphasize: most of my test songs were crap. But among them were a few lucky gems. (Well, at least I enjoyed them.) These include songs with some of my longer poems put to music. (I wrote the poems while creating controlled test samples for human vs AI lyric detection.)
Having said that, I put some of my favorite test songs together into two virtual albums: Thinning the Herd and Memory Leaks. One of my friends suggested that I needed a band name ("Don't do this as Hacker Factor"), so "Brain Dead Frogs" was born.
I have two albums on the web site. They span a variety of genres and techniques. All of them use AI-generated music and AI voices (because I can't play music and I can't sing on key); only the lyrics vary: some are completely human written, some are based on AI creations but human edited, a few use AI-editing of a human's writings, and one is completely written by AI. If you listen to them, can you tell which is which?
Hint: When editing, even if I changed all of the lyrics, I tried to keep the original AI artifacts. I found it was easiest to use AI to create the first draft of the song -- because that creates the pacing and tempo. Then I'd rewrite the lyrics while retaining the same syllables per line, which kept the pacing and tempo.
Outro (Thank You For The Music)
As a virtual musician and amateur songwriter, I'd like to thank everyone who helped make these songs possible. That includes Suno, ChatGPT, Gemini, Audacity, the online Merriam-Webster dictionary (for help with rhyming and synonyms), and Madcat for temporarily suspending her devotion to Taylor Swift long enough to be my first fan and website developer.
I definitely must thank my critical reviewers, including Bob ("I had low expectations, so this was better than I thought."), Richard ("Have you thought about therapy?"), The Boss ("At least it's not another C2PA blog"), and Dave for his insights about music and pizza. I'd also like to thank my muses, including Todd (nerdcore genre) and Zach Weinersmith for his cartoon about the fourth little pig (genre: traditional ska; a calypso and jazz precursor to reggae).
Finally, I'd like to issue a preemptive apology. For my attempt at K-pop ("Jelly Button"), I thought it needed some Korean in the bridge, because you can't have K-pop without some Korean lyrics. I don't speak Korean, and I suspect that my translator (ChatGPT) and singer (Suno) also don't. So I'll apologize in advance if the words are wrong. (But if the words are funny or appropriate, I'll unapologize and claim it was intentional.)