This analysis attempts to classify songs by musical genre using their audio features. It is inspired by Kaylin Pavlik's (@kaylinquest) 2019 blog post, "Understanding + classifying genres using Spotify audio features".

knitr::opts_chunk$set(autodep = TRUE)
spotify <- read.csv("data/spotify.csv", stringsAsFactors = FALSE)
dim(spotify)
[1] 32833    15
head(spotify)
  genre danceability energy key loudness mode speechiness acousticness
1   pop        0.748  0.916   6   -2.634    1      0.0583       0.1020
2   pop        0.726  0.815  11   -4.969    1      0.0373       0.0724
3   pop        0.675  0.931   1   -3.432    0      0.0742       0.0794
4   pop        0.718  0.930   7   -3.778    1      0.1020       0.0287
5   pop        0.650  0.833   1   -4.672    1      0.0359       0.0803
6   pop        0.675  0.919   8   -5.385    1      0.1270       0.0799
  instrumentalness liveness valence   tempo duration_ms           artist
1         0.00e+00   0.0653   0.518 122.036      194754       Ed Sheeran
2         4.21e-03   0.3570   0.693  99.972      162600         Maroon 5
3         2.33e-05   0.1100   0.613 124.008      176616     Zara Larsson
4         9.43e-06   0.2040   0.277 121.956      169093 The Chainsmokers
5         0.00e+00   0.0833   0.725 123.976      189052    Lewis Capaldi
6         0.00e+00   0.1430   0.585 124.982      163049       Ed Sheeran
1 I Don't Care (with Justin Bieber)
2                          Memories
3                      All the Time
4                     Call You Mine
5                 Someone You Loved
6   Beautiful People (feat. Khalid)
table(spotify[, 1])

  edm latin   pop   r&b   rap  rock 
 6043  5155  5507  5431  5746  4951 
spotify <- spotify[, 1:13]

Split the data into training and testing sets. The training set should have 3/4 of the samples.

# floor() makes the sample size a whole number (nrow * 3/4 is fractional here)
numTrainingSamples <- floor(nrow(spotify) * 3/4)
trainingSet <- sample(seq_len(nrow(spotify)), size = numTrainingSamples)
spotifyTraining <- spotify[trainingSet, ]
spotifyTesting <- spotify[-trainingSet, ]
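The split above depends on the state of the random number generator, so the accuracies reported later will differ slightly from run to run. A minimal sketch of a repeatable variant follows, assuming a fixed seed (the value 1 is arbitrary) and a small stand-in data frame `toy` that is hypothetical and not part of the original analysis:

```r
# Hypothetical stand-in for the spotify data frame (not the real data)
toy <- data.frame(genre = rep(c("pop", "rock"), each = 10),
                  energy = seq(0.05, 1, length.out = 20))

set.seed(1)  # arbitrary seed, chosen only so the split can be repeated

# floor() makes the sample size an explicit whole number
nTrain <- floor(nrow(toy) * 3 / 4)
trainIdx <- sample(seq_len(nrow(toy)), size = nTrain)

toyTraining <- toy[trainIdx, ]
toyTesting <- toy[-trainIdx, ]

# The two sets are disjoint and together cover every row
nrow(toyTraining)
nrow(toyTesting)
```

Negative indexing with `-trainIdx` is what guarantees that no row appears in both sets.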

Build a classification model using a decision tree from the rpart package.

library(rpart)
model <- rpart(genre ~ ., data = spotifyTraining)
plot(model, margin = 0.05)

Figure: decision tree fitted to the training set (figure version 35c8864, John Blischak, 2024-05-22).
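On its own, `plot()` for an rpart tree draws only the branches. As a rough sketch of how the splits can be made readable, the example below adds labels with `text()` and prints the tree as text. It uses the built-in `iris` data as a stand-in, so `toyModel` is a hypothetical name and the splits shown are not those of the genre model:

```r
library(rpart)  # recommended package distributed with R

# iris is a built-in dataset, used here only as a stand-in
toyModel <- rpart(Species ~ ., data = iris)

plot(toyModel, margin = 0.05)  # draws the branches only
text(toyModel, use.n = TRUE)   # adds split rules and per-leaf class counts

print(toyModel)  # text listing of the same splits
```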

Calculate prediction accuracy of the model on the training and testing sets.

predictTraining <- predict(model, type = "class")
(accuracyTraining <- mean(spotifyTraining[, 1] == predictTraining))
[1] 0.38763
predictTesting <- predict(model, newdata = spotifyTesting[, -1], type = "class")
(accuracyTesting <- mean(spotifyTesting[, 1] == predictTesting))
[1] 0.3821416

Evaluate prediction performance using a confusion matrix.

table(predicted = predictTesting, observed = spotifyTesting[, 1])
         observed
predicted  edm latin  pop  r&b  rap rock
    edm    624   164  226   72   58   79
    latin  162   494  404  261  155  133
    pop      0     0    0    0    0    0
    r&b     38    84   74  242   57   87
    rap    337   470  316  540 1008  177
    rock   360   105  377  233  103  769
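The confusion matrix can be summarized per genre. The sketch below re-enters the counts printed above by hand (in the analysis itself the table object could be used directly) and computes recall, precision, and overall accuracy. The names `confusion`, `recall`, `precision`, and `accuracy` are introduced here for illustration:

```r
genres <- c("edm", "latin", "pop", "r&b", "rap", "rock")

# Counts copied from the confusion matrix above
# (rows = predicted, columns = observed)
confusion <- matrix(c(624, 164, 226,  72,   58,  79,
                      162, 494, 404, 261,  155, 133,
                        0,   0,   0,   0,    0,   0,
                       38,  84,  74, 242,   57,  87,
                      337, 470, 316, 540, 1008, 177,
                      360, 105, 377, 233,  103, 769),
                    nrow = 6, byrow = TRUE,
                    dimnames = list(predicted = genres, observed = genres))

# Recall: share of each observed genre that was predicted correctly
recall <- diag(confusion) / colSums(confusion)

# Precision: share of each predicted genre that was correct
# (NaN for pop, because the tree never predicts it)
precision <- diag(confusion) / rowSums(confusion)

# Overall accuracy, which should agree with accuracyTesting
accuracy <- sum(diag(confusion)) / sum(confusion)

round(recall, 3)
round(precision, 3)
accuracy
```

The empty pop row shows that no leaf of the tree is assigned to pop, so its recall is 0 and its precision is undefined.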

How does the model compare to random guessing?

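# Take labels and weights from the same table so they stay aligned:
# unique() returns genres in order of appearance, table() sorts them
genreCounts <- table(spotifyTesting[, 1])
predictRandom <- sample(names(genreCounts),
                        size = nrow(spotifyTesting),
                        replace = TRUE,
                        prob = genreCounts)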
(accuracyRandom <- mean(spotifyTesting[, 1] == predictRandom))
[1] 0.1525155
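A single simulated run is noisy, so it can help to compute the expected value of this baseline directly. If genre i is guessed with probability p_i and also makes up a share p_i of the testing set, the expected accuracy is the sum of the squared proportions. The sketch below uses the observed genre counts of the testing set (the column sums of the confusion matrix) and, as an additional assumption not in the original analysis, also reports a majority-class baseline:

```r
# Observed genre counts in the testing set
# (column sums of the confusion matrix)
observed <- c(edm = 1521, latin = 1317, pop = 1397,
              `r&b` = 1348, rap = 1381, rock = 1245)
p <- observed / sum(observed)

# Guessing genre i with probability p_i is correct with probability p_i,
# so the expected accuracy is the sum of squared proportions
expectedRandom <- sum(p^2)

# Always guessing the most common genre
majorityBaseline <- max(p)

expectedRandom
majorityBaseline
```

Both baselines come out well below the testing accuracy of the decision tree, roughly 0.17 and 0.19 against 0.38.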

