Do LLMs Have Good Music Taste?

August 17, 2025

Taste has become a bit of a buzzword, at least among VC types. Taste is a philosophy; it must run deep in the core of your business, they say. I mostly disagree, as taste really only matters if you are comparing two things (which of these two companies is more tasteful?). If a customer is making a decision based on whether you are more tasteful than your competitor, you're probably in the wrong business. Nevertheless, taste is an interesting concept to think about. As we become increasingly reliant on LLMs, the question arises: do models have good taste?

This past weekend, I tried to think of ideas for some kind of 'taste' benchmark. The first strategy I thought of was a simple test where I would give the model two examples of art (could be a poem, a painting, etc.) and it would have to pick whichever was 'best.' This seemed like a solid idea until I realized it assumes I already know the correct answer, and since I hardly have the world's premier taste in every form of art, this strategy would have to wait.

I figured it would be better if the model instead produced something that could then be inspected for taste. Music seemed like the most accessible place to start. At some level this makes no sense, as, at least in my understanding, frontier models are typically not trained on music/audio files. On the other hand, models do a pretty good job at music recommendation. In the end, I landed on having models rank their favorite artists. You'll be able to make your own judgement about 'model taste' by scrolling down.

The Setup

I felt that asking models for a list of their top 10 favorite artists or something would be too heavily influenced by various charts and lists online, so I chose a bracket-style approach instead: the model picks between two artists, and the winner advances to the next round.

I used the ListenBrainz dataset to collect the top 5000 most played artists and randomly shuffled the starting matchups. Running 5000 artists through 13 rounds required a lot of requests, but since the prompt was short, the token count stayed low enough that costs weren't prohibitive. Here's the prompt I used:


Pick your favorite music artist between {artist_1} and {artist_2}. You have to pick one. Respond with just their name.

I also raised the models' temperature and ran each matchup as a best-of-three to better capture each model's persona. The result is not a true ranking, but instead a series of eliminations, with a finalist, a runner-up, two semifinalists, etc. They are still shown as a ranking to make it look better, but #3 and #4 really made it to the same round, as did #5-#8, and so on.
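If it helps to see the mechanics, here is a minimal sketch of the tournament loop under those assumptions. The ask_model function is a hypothetical stand-in for whatever API client you use (with temperature raised), and the winner parsing is deliberately naive; real runs need fuzzier matching of artist names.

```python
import random

# Hypothetical stand-in for the API call; swap in whatever client/model you're testing.
def ask_model(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("call your LLM provider here")

PROMPT = (
    "Pick your favorite music artist between {artist_1} and {artist_2}. "
    "You have to pick one. Respond with just their name."
)

def play_match(a: str, b: str, best_of: int = 3) -> str:
    """Best-of-three: the first artist the model names twice advances."""
    wins = {a: 0, b: 0}
    for _ in range(best_of):
        reply = ask_model(PROMPT.format(artist_1=a, artist_2=b), temperature=1.0).strip()
        # Deliberately naive parsing of the reply.
        winner = a if a.lower() in reply.lower() else b
        wins[winner] += 1
        if wins[winner] > best_of // 2:
            return winner
    return max(wins, key=wins.get)

def run_bracket(artists: list[str], seed: int = 0) -> list[list[str]]:
    """Single-elimination bracket; returns the surviving field after each round."""
    field = artists[:]
    random.Random(seed).shuffle(field)
    rounds = [field]
    while len(field) > 1:
        nxt = [play_match(field[i], field[i + 1]) for i in range(0, len(field) - 1, 2)]
        if len(field) % 2 == 1:  # odd-sized field: the last artist gets a bye
            nxt.append(field[-1])
        field = nxt
        rounds.append(field)
    return rounds
```

With ~5000 artists the bracket runs for 13 rounds (ceil(log2(5000))), and any odd-sized round just gives the last artist a bye.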

The Results

I've included the top 20 artists for each model below so that you can try to ascertain an overall vibe of the model. I can also upload the full bracket results if people want.

It's a bit hard to summarize the results. If I had to pick who won, I would maybe say Claude 4 Sonnet, but I don't really know. There are a bunch of weird things going on with some of the lists too.

For starters, something odd is clearly happening with some of the reasoning models (see: o3, gpt-5, grok-4, deepseek-r1). As funny as it would be for Grok-4 to love $uicideboy$ and 100 gecs, basically all of the artists on these lists start with numbers or dollar signs, which points to the idea that the labs went a little too hard on the RL. This seems like a fairly major flaw? Perhaps it's reasonable to say that RL is making the models much spikier, and spikier in ways we might not want. It's also interesting that this happens not just with OpenAI, but with Grok and DeepSeek too...

Besides that, Mistral's list is odd. There are a lot of foreign artists; maybe that's related to their emphasis on different languages? Kimi-VL also stands out: there we see a clear preference for artists with longer names. I'm not sure how to rationalize that.


Claude

Pretty solid lists here. Going in, I expected Claude to have the best list, and I think that was generally correct. Lots of jazz, classics, and ~softer music.

Claude 3.5 Sonnet

Claude 3.7 Sonnet

Claude 4.1 Opus

Claude 4 Sonnet


OpenAI

Results here are kinda surprising. GPT-3.5-Turbo is a bit more upbeat than Claude, and from there the lists get a little crazy. The reasoning models, as stated before, are especially weird. Side note: I am SUPER curious about what GPT-4.5 would've done, but sadly they took away our precious API access.

GPT 3.5 Turbo

GPT 4o

GPT 4.1

o3

GPT 5


Gemini

I think what is particularly interesting about the Gemini models is that the lists seem to get better (or at least less obscure) with newer versions? This doesn't seem to be the case with any other models.

Gemini 2.0 Flash

Gemini 2.5 Flash


Grok

Definitely an interesting progression here. As with the OpenAI models, it seems like reasoning models really prefer artists with numbers in their names.

Grok 3

Grok 4


Other

This is a random assortment of other models I thought I should try. I could've done ALL of the Llama versions, the Mistral versions, and so on, but I think you can generally get the main idea from a single model.

Llama 3 70b

DeepSeek R1

Mistral 3.1 Medium

Kimi vl

Kimi K2

Qwen 2.5 72b

Takeaways

This was a lot of fun, and wasn't really meant to be very scientific. I'm sure there's a better way to generate these lists (maybe some kind of Elo thing?), and maybe in the future I'll try something else. It also doesn't objectively state whether the models have good taste or not (that's for you to decide). It does do a fairly decent job of summarizing a model's vibe though, which I think is valuable and needs to be done more. This post was somewhat inspired by Henry's post, which is about another cool LLM benchmark. I'll definitely be doing more experiments like this in the future.
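For the curious, an Elo-style variant would look roughly like the sketch below: sample random pairs instead of running a bracket, reuse the best-of-three matchup from the earlier sketch, and nudge ratings after each result. The match count and K-factor here are arbitrary placeholders, not anything I've tuned.

```python
import random

def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update; k=32 is just the conventional chess default."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))  # P(winner beats loser)
    return r_winner + k * (1.0 - expected), r_loser - k * (1.0 - expected)

def elo_rank(artists: list[str], n_matches: int = 20_000, seed: int = 0) -> list[str]:
    """Sample random pairings and sort by final rating."""
    rng = random.Random(seed)
    ratings = {a: 1500.0 for a in artists}
    for _ in range(n_matches):
        a, b = rng.sample(artists, 2)
        winner = play_match(a, b)  # reuse the best-of-three matchup from the setup sketch
        loser = b if winner == a else a
        ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
    return sorted(artists, key=ratings.get, reverse=True)
```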

Let me know what you think!