Music is the Language of Us All
So, a while ago I discovered last.fm but being a tryhard perfectionist, I wasn’t happy with how the songs were being tagged, the discrepancies in song names, confusion if there were bands with the same name (which started when Yen told me about Delaware) and so forth.
Looking through the help pages of last.fm, I found that they will one day take all their data (which should be much cleaner) from MusicBrainz and could all users please tag their songs properly using the MusicBrainz Picard software, kthnxbye.
What is this MusicBrainz, I wonder to myself. Wandering over to their website I find a huge ass music database. They’re trying to collect every recording made and catalogue them properly. It’s also open source. Nice, thinks I. I open an account, download their tagging program and start to clean up my collection. None of my mp3s have very clean tags. There’s tracks in my collection known only as “Track01”. I was looking forward to having the correct information and accompanying picture turn up on the iPods: Kyo and nano-kun.
The process ground to a halt quickly. I have a lot of mp3s, some dating back to when I first started using the internet in 1998. Some are quite new - so new in fact, that they haven’t even been entered into the database yet. More still are Japanese songs so they get generally ignored by the database, unless they’re anime songs, but those database entries are such a mess that they make me shudder.
My gods, I spent 8 days trying to clean up the Rurouni Kenshin CD entries. There’s no such artist as Rurouni Kenshin, people! It’s the name of the show! It’s a soundtrack!
So, cleaning up my mp3s quickly turned into cleaning up the MusicBrainz database. Anime, Japanese and Australian music, because a lot of it gets ignored by Americans, who make up most of the tagging population of MusicBrainz. The Japanese music is particularly difficult, because sometimes the data can’t be twisted to fit the choices. Does a song entitled U-ボット have Latin script or katakana? Because it can’t be listed as both.
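To show why a title like U-ボット doesn’t fit a single script choice, here is a small Python sketch that lists which scripts the letters in a title come from. It only uses the standard `unicodedata` module; the function name `scripts_in` is my own invention and has nothing to do with how MusicBrainz actually decides the script of a release.

```python
import unicodedata


def scripts_in(title):
    """Return a sorted list of the scripts used by the letters in a title."""
    found = set()
    for ch in title:
        if not ch.isalpha():
            continue  # skip hyphens, digits, spaces and punctuation
        # Unicode character names start with the script,
        # e.g. "KATAKANA LETTER BO" or "LATIN CAPITAL LETTER U".
        found.add(unicodedata.name(ch).split()[0])
    return sorted(found)


print(scripts_in("U-ボット"))        # ['KATAKANA', 'LATIN']
print(scripts_in("アンダスタンド"))  # ['KATAKANA']
```

The first title really is both Latin and katakana, which is exactly the case a pick-one field can’t describe.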
Also, how do you write out the titles? I think it should appear the same way that it does on the CD, because that’s “artist’s intent” and appeases that perfectionist part of me. But some people like to read a transliterated version of the title. They want Asian Kung Fu Generation’s song アンダスタンド to read “Understand” - which is what it means. Okay, fair enough, not everyone is a Japanese student like me. But what I dislike about the system is how there are two entries for the one album - one with the original Japanese and one with someone’s transliteration.
Why not have one album with several aliases or something all linked together? When people tag their mp3s, they can choose which alias they want for their tags. It could be expanded safely to use all languages. Volunteers just add their own language transliterations. That song could be listed as “Verstehe” by a German fan. So instead of potentially an entry of the same album in every language, just have one album listed, then a drop-down list saying “View this information in [German]”. Have a default setting available in the user’s control panel, even!
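As a rough sketch of what I mean, here is the alias idea in Python. This is not the MusicBrainz schema, just an illustration: the `Track` class, its `aliases` dictionary and the `title_for` method are all names I made up. There is one entry holding the title as printed on the CD, plus one alias per language, and anyone whose language has no alias yet simply gets the original.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    original_title: str  # the title as printed on the CD
    aliases: dict = field(default_factory=dict)  # language code -> alias

    def title_for(self, language=None):
        """Return the alias for a language, falling back to the original title."""
        return self.aliases.get(language, self.original_title)


track = Track("アンダスタンド", {"en": "Understand", "de": "Verstehe"})

print(track.title_for("en"))  # Understand
print(track.title_for("de"))  # Verstehe
print(track.title_for("fr"))  # no French alias yet, so アンダスタンド
```

The language passed to `title_for` would be the default setting from the user’s control panel, so there is still only one album in the database no matter how many languages get added.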
This would require database schema rewriting (which they do quite a bit anyway, as the project develops) and rewriting of the programming for their standalone tagging program, Picard. Do-able as well, but slower. It’s something that I’ll be writing an official New Feature Suggestion about. Because I’ve become that sucked in.
I’ve suspected for a while that I was the kind of sad individual that got kicks out of cleaning up data. MusicBrainz has proved it. I’m going to spend the rest of my life wallowing knee-deep in databases, I just know it. I mean, look at me. I’m spending weeks cleaning up some free database on the internet.
While I was cleaning the MusicBrainz entries, I noticed these PUID thingies and wondered about how I add them. Previously, MusicBrainz identified tracks by a TRM, which is basically an older audio fingerprint calculated from the sound of a track. But with different encodings, re-releases, overseas releases and so on, there could be potentially infinite TRMs for the one song. On top of that, there is a chance that different songs could end up with the same TRM. So, PUIDs were developed by a third (non-open source, bugger) party, and they promise to fingerprint songs much more accurately, looking closely at the actual data within. Pitch, melodies, beat, bass, the actual music. An extra benefit to fingerprinting is that, looking at the fingerprints (PUIDs) of the songs in your collection, songs with similar fingerprints can be recommended to you.
Serious! If the program reads from the PUIDs that you like big beat, then it will find more songs that you may not have heard with big beat and recommend them to you. Like last.fm’s recommendations, but based on computerised calculations rather than human connections.
Either way, it helps improve data integrity, which is what my shriveled database-adoring soul really wants. So I tried to find out what I needed to do to add PUIDs of my collection to the MusicBrainz database. That was not easy. The whole project is open source and dependent on volunteers, who can spend a lot of time bickering over how live tracks should be labelled. Eventually I found out what I have to do, so I’m writing out the steps here, partly to help me remember, and partly to help any poor soul who wants to clean up and contribute the details of their collection to MusicBrainz. This will work well as a ‘from scratch’ set of instructions.
1. Download the latest version of Picard with PUID fingerprinting functionality.
2. Install and run, go through all your mp3s (or wma, or Ogg Vorbis, whatever applies), and tag them correctly based on info from the MusicBrainz database. Here are more detailed instructions on how to do that.
3. If your mp3s aren’t listed in the database, enter them in. But correctly, please. Follow the style guidelines.
4. After your mp3s are all properly tagged, install MusicIP’s (the third party that doesn’t open source) MusicIP Mixer program. Point it in the direction of your music folder and let it go. It will automatically start analysing your mp3s to create the PUIDs. THIS TAKES A LONG TIME. To analyse one song takes approximately 80% of the song’s run time. This means that it could literally take days, weeks, months to analyse the mp3s in your collection, depending on how big it is.
The PUID data gets sent back to the MusicIP servers, although we’re assured that all privacy is kept intact.
5. Wait at least 24 hours (although waits of 7 days have been reported) for the data to be parsed through the MusicIP servers. After that, MusicBrainz can access it.
6. Run MusicBrainz Picard again and tag the songs again. This time it should be much quicker. It should also find the PUIDs that were created in that loooooong analysis and prompt you to add them to the MusicBrainz database.
Hurrah! You’re done. Your mp3 collection is clean and properly documented, and you helped a great open source project grow a little bit.
The crap bit is that you’ll have to repeat this again every time you buy a new CD or download music.
Ugh, time consuming >.<
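How time consuming, exactly? Going by the 80% figure from step 4, here is a back-of-the-envelope calculation in Python. The 5000 tracks and the 4 minute average song length are numbers I picked for illustration, not measurements, and the function name is made up too.

```python
ANALYSIS_RATIO = 0.8  # analysis takes roughly 80% of a song's run time (step 4)


def estimated_analysis_hours(track_count, average_minutes=4.0, ratio=ANALYSIS_RATIO):
    """Estimate the total analysis time in hours for a whole collection."""
    return track_count * average_minutes * ratio / 60


hours = estimated_analysis_hours(5000)
print(f"{hours:.0f} hours, or about {hours / 24:.1f} days of non-stop analysis")
# 267 hours, or about 11.1 days of non-stop analysis
```

Plug in your own track count and average song length to see what you’re in for before you start.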
