Thursday, October 09, 2014

The Worst Data Grooming Case Ever?

Last week I collected a Western Digital MyBook Live from a client who has been experiencing major problems with their music. Let me paint the picture - a busy family, husband, wife & children, with a "family" Mac and a wired home featuring Sonos music streaming throughout. Sensibly they have a NAS drive installed to store their music and, most importantly, make it available through Sonos when the family computer is switched off or otherwise engaged. In my opinion, it's a sensible setup.

Yet it's gone wrong. The whole thing has become a mess, which is why the NAS drive is here with us while we try to unpick the problems and put it all right.

We have been through the NAS, along with three large folders of music copied off their Mac. We've put a first pass of the entire collection into iTunes on one of our Mac Minis and this is what we've found. The total music library is just over 1.2 TB in size and iTunes has found 3,115 albums. That's a pretty big music collection.

There are 58,619 tracks. Sadly, a large number of them either carry no labelling beyond some sort of track name, or are tagged with an album name such as Unknown or Untitled, or with something that simply isn't right. There's a fair sprinkling of missing artist names, and one family member has tried to tackle the problem of classical music composer names by replacing the proper artist / performer field with the composer's name.

Our software has made a first pass to tackle the issue of duplicated tracks. As soon as you look at the library in iTunes you can see duplicates, and the first count is 16,493 duplicate tracks. Around 28% duplicated. No wonder they've been complaining that "Sonos has gone wrong, it keeps playing the same track over and over again". Actually Sonos is fine; there are so many duplicated tracks in their library that it just sounds like it's playing the same one over and over again. You can see the same track listed four or more times.

So the next step is to remove the duplicated tracks. Our software deletes the tracks from the main music library and from the hard drive at the same time, rather than leaving us to empty the trash after the operation. Although this sounds simple, we're erasing some 400 GB of data and that's going to take a long, long time. So what do we do tomorrow?

Once the music library has been shrunk down we'll go through the album names. We have to do that more or less manually and already we can see yet more duplication. Someone has added a chunk of CDs and helpfully added the word "NEW" before the album name, despite there being many of the same albums in iTunes already. You can see many more examples of duplicated albums being added by someone not checking to see if the music was there already. Actually iTunes does check if an album is in your library when you insert the CD but it can't if the tracks have been ripped already and added as files with a different album name.

Once we've brought some further consistency to the album names we'll take a look at performer names, another area of rich potential for duplication. Tidy up album and performer names then again look for duplicates and that will probably weed out another 500 - 600 tracks.

I've also made a note to look at Genres. This library has 98 entries under the genre heading. That's not in itself a problem but there are some typos - Hip Hip for Hip Hop, Opera and Operae - so we'll fix those. In terms of time, not a big task. What will take longer is fixing the classical composer names and examining the consistency of applying genres to classical music. If you're going to split off some tracks from Classical into Chamber or Opera you've got to be consistent for the whole of the work in question. No point in having the first album of an opera in "Opera" if you leave the rest in Classical.
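For anyone tackling a similar genre tidy-up themselves, here is a rough sketch in Python of the kind of fix involved. It is only an illustration and not the tool we use: the names GENRE_FIXES, normalise_genre and genre_counts are made up for the example, and it assumes that "Hip Hop" and "Opera" are the spellings you want to keep.

```python
from collections import Counter

# Corrections for the typos mentioned above. Which spelling counts as
# "right" is an assumption made for this sketch.
GENRE_FIXES = {"Hip Hip": "Hip Hop", "Operae": "Opera"}


def normalise_genre(genre):
    """Collapse stray whitespace, then apply any known correction."""
    cleaned = " ".join(genre.split())
    return GENRE_FIXES.get(cleaned, cleaned)


def genre_counts(genres):
    """Count tracks per genre after normalisation."""
    return Counter(normalise_genre(g) for g in genres)
```

Counting the genres before and after the fix is a quick way to confirm that the misspelt entries really have folded into the correct ones.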

And yes, when we've done all that we run the de-duplication again. Why? We operate a pretty tight definition of a duplicate track - we remove only those tracks that have the same name, the same track number and the same genre, and that appear on the same album. As we tidy up, more tracks fall into the clutches of the de-dupe sweep.
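To show why the tidying matters, here is a minimal Python sketch of that definition of a duplicate. It is not our actual software: the Track record, its field names and the find_duplicates function are hypothetical, and the sketch assumes tags are compared with whitespace trimmed and case ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    # Hypothetical record; the field names are assumptions for this sketch.
    name: str
    track_number: int
    genre: str
    album: str
    path: str


def duplicate_key(track):
    # Same name, same track number, same genre, same album.
    # Assumption: text is compared case-insensitively, whitespace trimmed.
    return (
        track.name.strip().lower(),
        track.track_number,
        track.genre.strip().lower(),
        track.album.strip().lower(),
    )


def find_duplicates(tracks):
    """Return tracks whose key has been seen before; the first copy is kept."""
    seen = set()
    duplicates = []
    for track in tracks:
        key = duplicate_key(track)
        if key in seen:
            duplicates.append(track)
        else:
            seen.add(key)
    return duplicates
```

With a definition like this, a track on an album called "NEW Album A" is not a duplicate of the same track on "Album A". It only becomes one once the album name has been corrected, which is why the sweep has to be repeated after each round of tidying.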

Then we're on to checking for album art (on a library this size, filling gaps could be another afternoon's work) and trying to fill in the gaps in artist and album names using some of our "secret sauce". When this is all done we'll have a simplified and slimmed-down library, primarily one without duplicated tracks. The final step will be to delete the originals from the MyBook drive and copy it all back. Sounds simple, but that in itself can be a 48 - 72 hour chore. At least next week's Tube strike has been called off so we'll be able to take the drive back on Monday or Tuesday.