Musical Structure and Segmentation Demo

Music Information Retrieval - MUSS1112
November 14, 2022
Martin Hartmann - martin.hartmann@jyu.fi

1 Introduction

In this demo we will have a look at two segmentation functions that are available in MIDI toolbox, and we will test them using a large collection of songs. We will also use an optimization algorithm to try to find an optimal parameter for applying a threshold to the results obtained to one of the segmentation functions.

2 What is MIDI Toolbox?

MIDI Toolbox provides a set of Matlab functions, which together have all the necessary machinery to analyze and visualize MIDI data. The development of the Toolbox has been part of ongoing research involved in topics relating to musical data mining, modelling music perception and decomposing the data for and from perceptual experiments.

3 Downloading MIDI Toolbox

The latest version of MIDI toolbox can be downloaded from this website: https://github.com/miditoolbox/

4 Notematrix (NMAT) representation

MIDI files read into MIDI Toolbox are represented as matrices that always contain 7 columns and a number of rows that equals the amount of note events in the MIDI file. The column labels are as follows:

ONSET (BEATS)	DURATION (BEATS)	MIDI channel	MIDI PITCH	VELOCITY	ONSET (SEC)	DURATION (SEC)
…	…	…	…	…	…	…
…	…	…	…	…	…	…

The first column indicates the onset of the notes in beats (based on pulses per quarter note) and the second column the duration of the notes in these same beat-values. The third column denotes the MIDI channel (1-16), and the fourth the MIDI pitch, where middle C (C4) is 60. The fifth column is the velocity describing how fast the key of the note is pressed, in other words, how loud the note is played (0-127). The last two columns correspond to the first two (onset in beats, duration in beats) except that seconds are used instead of beats.

Often one wants to refer only to pitch or duration values in the notematrix. For clarity and convenience, these columns may be called by few basic selector functions that refer to each specific column only. These are onset (either 'beat' or 'sec', the former is the default), dur (either 'beat' or 'sec'), channel, pitch, and velocity.

5 Starting up

We will use MIDI toolbox in this demo, so please install it and add it to the MATLAB path if it is not there yet, e.g.:

addpath(genpath('~/Downloads/miditoolbox'))

If miditoolbox is in the path, you should be able to run this command and have a look at the functions in MIDI toolbox:

help miditoolbox

A better to ensure that things are working is to read in a midi file that is included with MIDI toolbox. We can create a waveform using FM synthesis and listen to the result:

a = readmidi('laksin.mid')
playsound(a)

We can also use shepard tones instead of FM synthesis:

playsound(a,'shepard')

6 The Essen collection

The Essen folksong collection (http://essen.themefinder.org) consists of 9,354 folk songs (Schaffrath, 1995) which have been annotated based on perceived musical grouping structures. We will work with a smaller subset in NMAT format (6236 songs), which is available here. Please download essen.mat and then load the .mat file:

load('~/Downloads/essen.mat')

7 Having a look at the Essen collection .mat file

This workspace contains several variables. nm is a cell array containing the note matrices of each song. Because it is a cell, we access its elements using key brackets {}. For instance, let’s have a look at the notematrix for the 7th song:

>> nm{7}

ans =

         0    2.0000    1.0000   65.0000  100.0000         0    1.0000
    2.0000    1.0000    1.0000   60.0000  100.0000    1.0000    0.5000
    3.0000    1.0000    1.0000   65.0000  100.0000    1.5000    0.5000
    4.0000    1.0000    1.0000   65.0000  100.0000    2.0000    0.5000
    5.0000    1.0000    1.0000   65.0000  100.0000    2.5000    0.5000
    6.0000    1.0000    1.0000   67.0000  100.0000    3.0000    0.5000
    7.0000    1.0000    1.0000   69.0000  100.0000    3.5000    0.5000
    8.0000    2.0000    1.0000   65.0000  100.0000    4.0000    1.0000
   10.0000    2.0000    1.0000   69.0000  100.0000    5.0000    1.0000
   12.0000    1.0000    1.0000   70.0000  100.0000    6.0000    0.5000
   13.0000    0.5000    1.0000   69.0000  100.0000    6.5000    0.2500
   13.5000    0.5000    1.0000   67.0000  100.0000    6.7500    0.2500
   14.0000    1.0000    1.0000   67.0000  100.0000    7.0000    0.5000
   15.0000    1.0000    1.0000   69.0000  100.0000    7.5000    0.5000
   16.0000    1.0000    1.0000   65.0000  100.0000    8.0000    0.5000
   17.0000    1.0000    1.0000   69.0000  100.0000    8.5000    0.5000
   18.0000    2.0000    1.0000   67.0000  100.0000    9.0000    1.0000
   21.0000    1.0000    1.0000   65.0000  100.0000   10.5000    0.5000
   22.0000    1.0000    1.0000   65.0000  100.0000   11.0000    0.5000
   23.0000    1.0000    1.0000   69.0000  100.0000   11.5000    0.5000
   24.0000    2.0000    1.0000   72.0000  100.0000   12.0000    1.0000
   26.0000    1.0000    1.0000   65.0000  100.0000   13.0000    0.5000
   27.0000    1.0000    1.0000   67.0000  100.0000   13.5000    0.5000
   28.0000    1.0000    1.0000   69.0000  100.0000   14.0000    0.5000
   29.0000    1.0000    1.0000   72.0000  100.0000   14.5000    0.5000
   30.0000    0.5000    1.0000   69.0000  100.0000   15.0000    0.2500
   30.5000    1.5000    1.0000   65.0000  100.0000   15.2500    0.7500
   33.0000    1.0000    1.0000   60.0000  100.0000   16.5000    0.5000
   34.0000    1.0000    1.0000   65.0000  100.0000   17.0000    0.5000
   35.0000    1.0000    1.0000   69.0000  100.0000   17.5000    0.5000
   36.0000    2.0000    1.0000   72.0000  100.0000   18.0000    1.0000
   38.0000    2.0000    1.0000   69.0000  100.0000   19.0000    1.0000
   40.0000    2.0000    1.0000   67.0000  100.0000   20.0000    1.0000
   42.0000    2.0000    1.0000   65.0000  100.0000   21.0000    1.0000

What is the duration of the last note of this song?

There are also variables called name and keyword, which contains the filename and name of each song, respectively.

What is the filename of song 3925?

keyword is a character (char) array, so it is like a 2D matrix with characters instead of numbers. For instance, to access the information of song 2000, we need to look for all columns in row 2000:

keyword(2000,:)

version is also a character array containing information about the .mat file in Finnish.

What type of information is contained in the variable met and how is it represented?

For the following, we will make use of information contained in phrase, subcollection, genre, and nm. phrase tells us, for each piece, what note numbers correspond with annotated boundaries. subcollection and genre can be used to identify the genre of each piece.

8 Visualization

Let’s have a look at the piano roll representation of the first song in the collection:

pianoroll(nm{1})

What is the pitch class of song 5689?

We can use this function based on pianoroll to visualize the annotated boundaries. Please paste the following code into a new file and save it as visualboundaries.m. It can be saved in your current folder or in the miditoolbox folder (or in any other folder, as long as it belongs to the MATLAB path).

function fig = visualboundaries(nmat,b,gt)
% VISUALBOUNDARIES - Visualize a pianoroll with predicted and actual boundaries 
% FIG = VISUALBOUNDARIES(NMAT,B,GT)
%
% Input arguments:
%    NMAT = a note matrix
%    B = an index vector or a logical vector corresponding to predicted boundaries (such as obtained from a segmentation function such as segmentgestalt.m) or actual boundaries (i.e., manual annotations). B can also be a boundary profile, but in this case it will be interpolated to fit the number of beats of the notematrix.
%    GT = an index vector or a logical vector for a second point process, such as a ground truth that can be compared against the boundaries in B. GT can be a boundary profile, but it is interpolated to fit the number of beats of the notematrix. (OPTIONAL ARGUMENT)
% 
% Outputs:
%    A pianoroll with predicted boundary indications, which are plotted as dotted black lines, and actual boundary indications, plotted as magenta dashed lines.
% 
% NOTE: the number of points of B and GT should be equal to the number of rows of the NMAT.

    if nargin == 3, ground = 1; else, ground = 0; end;
    if islogical(b)
        b = find(b == 1);
    end
    on=onset(nmat);
    minimi = min(pitch(nmat));
    maksimi = max(pitch(nmat));

    pianoroll(nmat) % using only 'sec' as option
    set(gca,'Xgrid','off');

    hold on
    if ~mod(b,1) % if integer
        for k=1:length(b)
            plot([on(b(k)) on(b(k))],[minimi-2 maksimi+2],'k:','LineWidth',2);
        end
    else
        ylim = get(gca,'YLim');
        xlim = get(gca,'XLim');
        if ~isrow(b)
            b = b';
        end
        bmap = mapminmax(b,ylim(1),ylim(2));
        x = 1:numel(bmap);
        xp = linspace(x(1), x(end), xlim(end)-xlim(1));
        bimap = interp1(x,bmap,xp);
        plot(bimap,'k:','LineWidth',2)
    end
    if ground
        if islogical(gt)
            gt = find(gt == 1);
        end
        if ~mod(gt,1)
            for k=1:length(gt)
                plot([on(gt(k)) on(gt(k))],[minimi-2 maksimi+2],'m--','LineWidth',2);
            end
        else
            ylim = get(gca,'YLim');
            xlim = get(gca,'XLim');
            if ~isrow(gt)
                gt = gt';
            end
            gtmap = mapminmax(gt,ylim(1),ylim(2));
            x = 1:numel(gtmap);
            xp = linspace(x(1), x(end), xlim(end)-xlim(1));
            gtimap = interp1(x,gtmap,xp);
            plot(gtimap,'m--','LineWidth',2)
        end
    end
    hold off

Let’s now visualize the annotated boundaries for song 600 on top of its pianoroll, for instance.

visualboundaries(nm{600},phrase{600})

Listen to song 600. Would you agree with these boundary indications?

For a better exploration of the data, it is easier to write things like this:

k = 600;
visualboundaries(nm{k},phrase{k})

Now you can just change the value of k to have a look at different songs. Or we can use a for loop to visualize songs 10 to 15. We need to add figure to create a new figure at each iteration of the loop.

for k = 10:15
figure
visualboundaries(nm{k},phrase{k})
end

You can add a title to each figure that is generated in your loop. For instance, a title can be used to inform us what song number is each figure about. You can add the title after the call to visualboundaries. Add title(sprintf('Song %d', k)) to your for loop and run it again.

If you want to close all your figures at the same time, you can type close all.

9 Boundary prediction

MIDI toolbox includes three algorithms for boundary prediction, segmentgestalt, segmentprob, and boundary. You can read more about them in the MIDI toolbox documentation (pp. 32-33), which should be located in ~/Downloads/miditoolbox/documentation/. We will focus on segmentgestalt and boundary.

Let’s have a look at how does segmentgestalt segment song 600. Let’s focus on its first output, which are the locations of each clang. We can store the result in sg{k} so we know that we refer to song 600:

k = 600;
sg{k} = segmentgestalt(nm{k})

Let’s note that sg{k} is a binary vector. Each element of the vector is a note, and notes that correspond to boundaries are indicated with ones.

sg{k}

We can visually compare this prediction with the ground truth:

visualboundaries(nm{k},sg{k},phrase{k})

In this figure, predicted boundary indications are plotted as dotted black lines, whereas actual boundary indications are plotted as magenta dashed lines.

The boundaries look correct, but in the binary vector they appear one note event late. So they need to be circularly shifted down by 1.

sg{k} = circshift(sg{k},-1)

So we can fix this function. Please edit your segmentgestalt.m file by typing:

edit segmentgestalt.m

Now, add the following to the end of the function to shift the binary vectors of clangs and segments down by 1:

c = circshift(c,-1);
s = circshift(s,-1);

10 Performance statistics

We can use this function to estimate the performance of different segmentation algorithms. Please paste this code into a new file and save it as perfstats.m.

function [precision recall f_measure] = perfstats(ppi,gt)
% PERFSTATS - Obtain perfomance statistics (precision, recall, and f-measure) from point process data given a ground truth
% [PRECISION RECALL F_MEASURE] = PERFSTATS(PPI,GT)
%
% Input arguments: 
%    PPI = an index vector corresponding to the predicted point process
%    GT = an index vector for the actual point process or ground truth
%
% Output: 
%    PRECISION = hits divided by total number of estimated elements
%    RECALL = hits divided by total number of annotations
%    F_MEASURE = harmonic mean between precision and recall
% 
% NOTE: Any NaNs included in the ground truth data will be interpreted as false negatives in the evaluation

pp_bin = zeros(max([max(gt) max(ppi)]),1);
gt_bin = pp_bin;
pp_bin(ppi) = 1;
gt_bin(gt(~isnan(gt))) = 1;


% add zeros to predicted boundaries and ones to ground truth if ground truth contained NaNs
    if sum(isnan(gt)) > 0, pp_bin = [pp_bin; zeros(sum(isnan(gt)),1)]; gt_bin = [gt_bin; ones(sum(isnan(gt)),1)]; end

    hits = numel(nonzeros(pp_bin & gt_bin));
    num_estimated = numel(nonzeros(pp_bin));
    num_annotations = numel(nonzeros(gt_bin));

    precision = hits / num_estimated;
    recall = hits / num_annotations;
    f_measure =  2 * (precision * recall / (precision + recall));

Let’s first create a variable with genre labels, which will use later. We will also remove some songs that have incomplete data.

genrelabels = {
 'ALTDEU'
 'BALLADE'
 'BOEHME'
 'DVA'
 'ERK1'
 'ERK2'
 'ERK3'
 'ERK5'
 'FINK'
 'KINDER'
 'VARIANT'
 'ZUCCAL'
 'CHINA'
 'ZHEN'
 'HUANG'
 'SENGANSU'
 'PELOG1'
 'MEXICO'
 'OTHER' };

emptyCells = cellfun(@isempty,phrase);
phrase(emptyCells) = [];
nm(emptyCells) = [];
genre(emptyCells) = [];

phrase(find(genre == 0)) = [];
nm(find(genre == 0)) = [];
genre(find(genre == 0)) = [];

Now, let’s try running perfstats with our prediction and the ground truth. The function will give us three values (precision, recall, and f-measure). Each value goes to a different output. Let’s call these outputs p(k), r(k), and f(k), once again, to indicate that we are obtaining results for song 600.

[p(k) r(k) f(k)] = perfstats(sg{k}, phrase{k})

However, we get an error, because perfstats only accepts data that is formatted like phrase{k} (note onset numbers); it does not understand the binary format of sg{k}. So we need to find the vector indices of sg{k} that correspond to boundaries (that is, that are ones):

sg_indices{k} = find(sg{k} == 1)
[p(k) r(k) f(k)] = perfstats(sg_indices{k}, phrase{k})

We could add an if statement to the perfstats function: If the content of any of the inputs is just ones and zeros, then let’s find the indices of those elements that are equal to one. Please modify the perfstats function accordingly to avoid errors when using binary input(s). This workaround has been implemented in the visualboundaries function, so it will be helpful to have a look at it.

We can go a step further and try to obtain results for the whole collection using a for loop. We need to tell matlab how many notematrices we have (that is, how many elements we have in the nm variable). We can use the built-in numel function for this.

for k = 1:numel(nm)
sg{k} = segmentgestalt(nm{k});
sg_indices{k} = find(sg{k} == 1);
[p(k) r(k) f(k)] = perfstats(sg_indices{k}, phrase{k});
end

Then we can average the results that are not NaN:

mean_precision_across_songs = mean(p(~isnan(p)))
mean_recall_across_songs = mean(r(~isnan(r)))
mean_f_measure_across_songs = mean(f(~isnan(f)))

And get the performance measures for each class:

pg = accumarray(genre,p,[],@(x) mean(x(~isnan(x))));
rg = accumarray(genre,r,[],@(x) mean(x(~isnan(x))));
fg = accumarray(genre,f,[],@(x) mean(x(~isnan(x))));

musical_genre = genrelabels(unique(genre));
precision = pg(unique(genre));
recall = rg(unique(genre));
f_measure = fg(unique(genre));
number_of_songs = histc(genre,unique(genre));

disp(table(mean_precision_across_songs,mean_recall_across_songs, mean_f_measure_across_songs))

disp(table(musical_genre,precision,recall,f_measure,number_of_songs))

The other algorithm, boundary, gives us a different kind of information regarding boundary locations. Instead of a vector of zeros and ones, we get a boundary profile:

clear sg
k = 1;
sg{k} = boundary(nm{k},'fig');
sg{k}

In order to evaluate this model, we can pick peaks over a given threshold. So instead of choosing indices for values equal to 1, we will chose indices for values greater than a given threshold thres, such as .1:

thres = .1;
for k = 1:numel(nm)
sg{k} = boundary(nm{k});
sg_indices{k} = find(sg{k} > thres);
[p(k) r(k) f(k)] = perfstats(sg_indices{k}, phrase{k});
end

The results depend on the threshold that we use. But how can we find an optimal threshold? Moving on to the next section of this demo…

11 Optimization

To find a threshold value that gives us the best results, we can either try a few thresholds (say .25, .5, and .75) or use optimization to (hopefully) find the threshold that maximizes the performance of the algorithm. We will try using optimization, for which we need to make an objective function (also called cost function). The following objective function, optimize, receives a threshold value as an input and outputs the negative of the F-measure (by convention, optimization algorithms minimize an objective function, so that’s why we are interested in the negative of the performance) for the boundary algorithm. So we have one design variable (threshold value) and one objective (to minimize the negative of the F-measure).

function res = optimize(input)

thres = input;

load ('~/Downloads/essen.mat');

genrelabels = {
 'ALTDEU'
 'BALLADE'
 'BOEHME'
 'DVA'
 'ERK1'
 'ERK2'
 'ERK3'
 'ERK5'
 'FINK'
 'KINDER'
 'VARIANT'
 'ZUCCAL'
 'CHINA'
 'ZHEN'
 'HUANG'
 'SENGANSU'
 'PELOG1'
 'MEXICO'
 'OTHER' };

emptyCells = cellfun(@isempty,phrase);
phrase(emptyCells) = [];
nm(emptyCells) = [];
genre(emptyCells) = [];

phrase(find(genre == 0)) = [];
nm(find(genre == 0)) = [];
genre(find(genre == 0)) = [];




% segment

for k = 1:numel(nm)

b{k} = boundary(nm{k});

b_thres{k} = find(b{k} > thres);

[p(k) r(k) f(k)] = perfstats(b_thres{k},phrase{k});

end

neg_mean_precision_across_songs = -mean(p(~isnan(p)));
neg_mean_recall_across_songs = -mean(r(~isnan(r)));
neg_mean_f_measure_across_songs = -mean(f(~isnan(f)));

res = neg_mean_f_measure_across_songs;

disp(sprintf('Threshold = %0.5g, F-measure = %0.5g', thres ,-res))

Please paste this code into a new file and save it as optimize.m.

Now we can use the following script to run the optimization. Note that the threshold to be used should be higher than 0 and lower than 1, so we need to use an optimization algorithm that allows us to set lower and upper bounds. This is possible to do with constrained optimization algorithms such as fminbnd. This function may fail to give us a global solution but we can run it many times to see if it always converges with the same solution.

tic
nVars = 1;
lb = .1; % threshold should not exceed .1
ub = 1; % threshold should not exceed 1

fun = @optimize;

[x fval]= fminbnd(fun,lb,ub);

time = toc

What seems to be the optimal threshold value for boundary?

It is possible to create a scatter plot to visualize what is the threshold value and its performance at each point of the optimization. For that, we need a plotting function.

Please paste this code into a new file and save it as OPTIMPLOTXFVAL.m.

function stop = optimplotxfval(x,optimValues,state,varargin)
% OPTIMPLOTXFVAL Plotting function based on the built-in Matlab function OPTIMPLOTFVAL. 
% Plot current point and negative of the value of the objective function at each iteration.

stop = false;
switch state
  case 'iter'
                plotscalar(x,optimValues.fval,optimValues.iteration);
end

function plotscalar(x,fval,iteration)
% PLOTSCALAR initializes or updates a plot of the function value
% at each iteration.

if iteration == 0
    plotfval = scatter(x,-fval,'kd','MarkerFaceColor',[1 0 1]);
    title(getString(message('MATLAB:optimfun:funfun:optimplots:TitleCurrentFunctionValue',sprintf('%g',fval))),'interp','none');
    xlabel('Threshold');
    set(plotfval,'Tag','optimplotfval');
    ylabel('Performance');
else
    plotfval = findobj(get(gca,'Children'),'Tag','optimplotfval');
    newX = [get(plotfval,'Xdata') x];
    newY = [get(plotfval,'Ydata') -fval];
    set(plotfval,'Xdata',newX, 'Ydata',newY);
    set(get(gca,'Title'),'String',getString(message('MATLAB:optimfun:funfun:optimplots:TitleCurrentFunctionValue',sprintf('%g',-fval))));
end

Now we can add the plot function to our optimization script:

tic
nVars = 1;
lb = .1; % threshold should not exceed .1
ub = 1; % threshold should not exceed 1

fun = @optimize;

options = optimset('Display','iter','PlotFcns',@optimplotxfval);

[x fval]= fminbnd(fun,lb,ub,options);

time = toc

Which of these two segmentation algorithms in MIDI toolbox (segmentgestalt, boundary) seems to yield optimal results for the Essen collection?