Low-Rank Sparse Representation with Pre-Learned Dictionaries and Side Information for Singing Voice Separation
1. Introduction
Separating the singing voice from a music recording is useful in many applications, such as music information retrieval, singer identification, and lyrics recognition and alignment [1] . Although the human auditory system can easily distinguish the vocal and instrumental parts of a music recording, the task is extremely difficult for computer systems. Researchers are therefore increasingly concerned with mining music information, and many algorithms have been proposed to separate the singing voice from music recordings.
Robust Principal Component Analysis (RPCA) is a matrix factorization algorithm that recovers underlying low-rank and sparse matrices [2] . Suppose we are given a large data matrix M and know that it may be decomposed as
M = A + E, where A is a low-rank matrix and E is a sparse matrix. Based on RPCA, Huang et al. [3] separated the singing voice from the music accompaniment. They assumed that the repetitive music accompaniment lies in a low-rank subspace, while the singing voice can be regarded as sparse within songs. The main drawback of this approach is that it is completely unsupervised: the decomposition is guided only by the particular properties of each individual component. Later, Yu et al. [4] exploited universal voice and music dictionaries pre-learned from isolated singing voice and background music training data, and proposed Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) for singing voice separation. Chan et al. [5] proposed a modified RPCA algorithm, one of the first attempts to incorporate vocal activity information into RPCA; vocal activity detection has also been widely studied [6] [7] . Chan et al. [8] proposed to separate the singing voice by group-sparse representation using pitch annotations.
In this paper, we present a model named Low-rank, Sparse representation with pre-learned dictionaries and side information (LSRi), solved under the alternating direction method of multipliers (ADMM) framework. First, we pre-learn voice and music dictionaries from isolated singing voice and background music training data, respectively. Then, we use a sparse spectrogram and a low-rank spectrogram to model the singing voice and the background music, respectively. In addition, a residual term is added to capture the components that are not well modeled by either the sparse or the low-rank term. Finally, we incorporate the voice spectrogram reconstructed from the vocal annotation. Evaluations on the iKala dataset [9] show that LSRi outperforms the compared methods.
The rest of this paper is organized as follows. Section 2 presents the proposed model. Section 3 describes the experimental setup and reports the results. Section 4 concludes this work.
2. The Proposed Method
Before introducing our method, we review the Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) method [4] ,
(1)
where X is the input spectrogram,
is a pre-learned dictionary of the music accompaniment,
is a pre-learned dictionary of the singing voice,
is the separated instrumentals,
is the separated voice. E denotes the residual part.
are two weighting parameters for balancing the different regularization terms in this model.
Compared with the unsupervised RPCA algorithm, the LSPD algorithm adds pre-learned dictionary information and improves the separation quality. To further improve the separation quality of the singing voice and the music accompaniment, we propose Low-rank, Sparse Representation with pre-learned dictionaries and side Information (LSRi).
Our model incorporates additional prior information, namely the voice spectrogram reconstructed from the annotation. The model is as follows,
(2)
Here all parameters in model (2) are the same as in model (1), and
denotes the reconstructed voice spectrogram from the annotation.
denotes the Frobenius norm. In the following, we also use the ADMM algorithm [10] to solve the optimization problem, by introducing two auxiliary variables
and
as well as three equality constraints,
(3)
The unconstrained augmented Lagrangian
is given by
(4)
where
are the Lagrange multipliers. We then iteratively update the solutions for
and
.
1) Update
:
(5)
where
2) Update
:
(6)
setting
, we have
(7)
3) Update
:
(8)
which can be solved by the soft-thresholding operator
(9)
since the spectrogram is non-negative
(10)
where 0 is an all-zero matrix of the same size as
.
4) Update
:
(11)
setting
, we have
(12)
5) Update E:
(13)
Similar to
,
(14)
Finally, we update the Lagrange multipliers as in [11] .
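The thresholding steps in the updates above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names and the threshold argument `tau` are hypothetical, and the singular value thresholding function is an assumption about how a nuclear-norm (low-rank) update is usually solved in ADMM-style solvers, since the text does not spell out that step.

```python
import numpy as np

def soft_threshold(a, tau):
    """Elementwise soft-thresholding: sign(a) * max(|a| - tau, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def nonneg_soft_threshold(a, tau):
    """Soft-threshold, then take the elementwise maximum with an all-zero matrix."""
    return np.maximum(soft_threshold(a, tau), 0.0)

def singular_value_threshold(a, tau):
    """Assumed low-rank step: soft-threshold the singular values of `a`."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

# Toy inputs (illustrative values only).
a = np.array([[3.0, -0.5], [-2.0, 0.2]])
shrunk = nonneg_soft_threshold(a, 1.0)
low_rank = singular_value_threshold(np.diag([3.0, 0.5]), 1.0)
```

In the toy example, only the entry 3.0 survives the non-negative shrinkage, and the diagonal matrix loses its smaller singular value, leaving a rank-one result.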
3. Experiment
3.1. Dataset
Our experiment was conducted on the iKala dataset [9] . The iKala dataset contains 252 30-second clips of Chinese popular songs in CD quality. In the following experiments, we randomly select 44 songs for training (i.e., learning the dictionaries D1 and D2), leaving 208 songs for testing the separation performance. To reduce the computational cost and the memory footprint of the proposed algorithm, we downsample all the audio recordings from 44,100 Hz to 22,050 Hz. We then compute the STFT of each recording by sliding a Hamming window of 1411 samples with 75% overlap to obtain the spectrogram.
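The spectrogram computation described above can be sketched with plain NumPy. This is an illustrative sketch under stated assumptions: the input is taken to be already at 22,050 Hz, the hop size is the window length times 0.25 rounded to the nearest sample (353 samples), no padding is applied at the edges, and the function name `magnitude_spectrogram` is hypothetical.

```python
import numpy as np

def magnitude_spectrogram(x, win_len=1411, overlap=0.75):
    """Magnitude STFT with a Hamming window; rows are frequency bins, columns are frames."""
    hop = int(round(win_len * (1.0 - overlap)))  # 353 samples for the defaults
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)],
        axis=1,
    )
    # One-sided FFT of each windowed frame (win_len // 2 + 1 bins).
    return np.abs(np.fft.rfft(frames, axis=0))

sr = 22050
t = np.arange(sr) / sr                      # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)        # synthetic 440 Hz test tone
spec = magnitude_spectrogram(tone)
```

With a 1411-sample window the one-sided spectrum has 706 bins, each about 15.6 Hz wide, so the 440 Hz test tone peaks in bin 28.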
3.2. Dictionary and E0
Our implementation of Online Dictionary Learning for Sparse Coding (ODL) [12] is based on the SPAMS toolbox. Given N signals ($x_i$, $i = 1, \dots, N$), ODL learns a dictionary D by solving the following joint optimization problem,
$$\min_{D,\,\alpha_1,\dots,\alpha_N} \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \right) \quad (15)$$
where $\|\cdot\|_2$ denotes the Euclidean norm and λ is a regularization parameter. The input frames are extracted from the training set after the short-time Fourier transform (STFT). Following [8] , we set the dictionary size to 100 atoms.
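To make the dictionary-learning objective concrete, the sketch below evaluates it for a given dictionary and set of sparse codes. It assumes the usual ODL form (half the squared Euclidean reconstruction error plus λ times the ℓ1 norm of the code, averaged over the N signals). The function name `odl_objective` and the toy matrices are hypothetical, and the sketch only scores a solution; it does not perform the online learning that SPAMS carries out.

```python
import numpy as np

def odl_objective(X, D, A, lam):
    """Average over columns i of 0.5 * ||x_i - D a_i||_2^2 + lam * ||a_i||_1."""
    residual = X - D @ A          # reconstruction error for every signal
    n_signals = X.shape[1]
    return (0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(A))) / n_signals

# Toy problem: 2 signals, 2 atoms (illustrative values only).
D = np.eye(2)
A = np.array([[1.0, 0.0], [0.0, 2.0]])
X = np.array([[1.0, 0.0], [0.0, 3.0]])
value = odl_objective(X, D, A, lam=0.1)
```

Here the reconstruction term contributes 0.5, the sparsity term contributes 0.3, and averaging over two signals gives 0.4.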
To get the reconstructed voice spectrogram from the annotation (E0), we first transform the human-labeled vocal pitch contours into a time-frequency binary mask. The authors in [13] have proposed a harmonic mask similar to that of [14] , which only passes integral multiples of the vocal fundamental frequencies [15] [16] ,
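$$M(f, t) = \begin{cases} 1, & \text{if } |f - n F_t| < w/2 \text{ for some } n \in \mathbb{N}^{+}, \\ 0, & \text{otherwise.} \end{cases} \quad (16)$$
Here $F_t$ is the vocal fundamental frequency at time t, n is the order of the harmonic, and w is the width of the mask. Then we simply define the vocal annotations as E0 = X ∘ M, where ∘ denotes the Hadamard product.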
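The construction of E0 from a pitch contour can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes a bin passes when it lies within half the mask width of some harmonic, that unvoiced frames are marked by a non-positive pitch value, and that bin center frequencies are supplied in `freqs`. The name `harmonic_mask` and the toy values are hypothetical.

```python
import numpy as np

def harmonic_mask(f0, freqs, width):
    """Binary mask (freq x time) passing bins within width/2 of a harmonic of f0[t]."""
    mask = np.zeros((len(freqs), len(f0)), dtype=bool)
    for t, pitch in enumerate(f0):
        if pitch <= 0:                       # assumed convention: unvoiced frame
            continue
        max_order = int(freqs[-1] // pitch) + 1
        harmonics = pitch * np.arange(1, max_order + 1)
        # Distance from every bin to every harmonic of this frame.
        dist = np.abs(freqs[:, None] - harmonics[None, :])
        mask[:, t] = (dist < width / 2).any(axis=1)
    return mask

freqs = np.arange(0.0, 1001.0, 50.0)         # toy bin centers: 0, 50, ..., 1000 Hz
f0 = np.array([0.0, 200.0])                  # frame 0 unvoiced, frame 1 at 200 Hz
mask = harmonic_mask(f0, freqs, width=40.0)
X = np.ones((len(freqs), len(f0)))           # toy mixture spectrogram
E0 = X * mask                                # Hadamard product with the mask
```

In the toy example, the voiced frame keeps only the bins at 200, 400, 600, 800 and 1000 Hz, and the unvoiced frame is zeroed out entirely.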
3.3. Evaluation
Separation performance is measured with the BSS EVAL toolbox version 3.0 (http://bass-db.gforge.inria.fr/), which provides the source-to-interference ratio (SIR), source-to-artifacts ratio (SAR) and source-to-distortion ratio (SDR). Denoting the separated singing voice by $\hat{v}$ and the original clean singing voice by v, the source-to-distortion ratio (SDR) [17] is computed as follows,
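$$\mathrm{SDR}(\hat{v}, v) = 10 \log_{10} \frac{\langle \hat{v}, v \rangle^{2}}{\|\hat{v}\|^{2}\,\|v\|^{2} - \langle \hat{v}, v \rangle^{2}} \quad (17)$$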
Normalized SDR (NSDR) is the improvement in SDR from the original mixture x to the separated singing voice $\hat{v}$ [18] [19] , and is commonly used to measure the separation performance for each mixture,
$$\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v) \quad (18)$$
For overall performance evaluation, the global NSDR (GNSDR) is calculated as,
$$\mathrm{GNSDR} = \frac{\sum_{i=1}^{N} w_i \, \mathrm{NSDR}(\hat{v}_i, v_i, x_i)}{\sum_{i=1}^{N} w_i} \quad (19)$$
where N is the total number of songs and $w_i$ is the length of the i-th song. Higher values of SIR, SAR, SDR, GSIR, GSAR, GSDR and GNSDR indicate better separation quality.
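The evaluation measures can be sketched directly on time-domain vectors. This is a minimal illustration, not the BSS EVAL toolbox: it assumes the inner-product form of SDR that is common in this line of work, and the function names and toy signals are hypothetical.

```python
import numpy as np

def sdr(est, ref):
    """SDR in dB between an estimate and a reference (inner-product form)."""
    num = np.dot(est, ref) ** 2
    den = np.dot(est, est) * np.dot(ref, ref) - num
    return 10.0 * np.log10(num / den)

def nsdr(est, ref, mix):
    """SDR improvement of the estimate over the unprocessed mixture."""
    return sdr(est, ref) - sdr(mix, ref)

def gnsdr(nsdrs, lengths):
    """Length-weighted average of per-song NSDR values."""
    w = np.asarray(lengths, dtype=float)
    return float(np.sum(w * np.asarray(nsdrs, dtype=float)) / np.sum(w))

# Toy signals (illustrative values only).
ref = np.array([1.0, 0.0])       # clean voice
est = np.array([1.0, 0.1])       # separated voice with a small error
mix = np.array([1.0, 1.0])       # original mixture
improvement = nsdr(est, ref, mix)
overall = gnsdr([20.0, 10.0], [30.0, 10.0])
```

In the toy case the estimate reaches about 20 dB SDR while the mixture sits at 0 dB, so the NSDR is about 20 dB. A perfect estimate makes the denominator zero, so a practical implementation would need to guard that case.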
3.4. Parameter Selection
During parameter selection, we use the indicator of global normalized source-to-distortion ratio (GNSDR) as the evaluation index. The higher the value is, the better the separation quality is. In our algorithms, we set
for each
similar to [9] . Here we adjust only γ.
Figure 1 presents the GNSDR for the separated singing voice and background music using LSRi. For the vocal part, the GNSDR first increases with γ and then gradually decreases. When
, LSRi achieves the highest overall GNSDR. For the accompaniment part, the GNSDR first increases and then levels off after
. Therefore, we set the parameter
.
3.5. Comparison Results
We compare three low-rank and sparse algorithms on the iKala dataset:
・ RPCA: the unsupervised method proposed by Huang et al. [3] , with default parameter values
.
・ LSPD: the supervised method proposed by Yu et al. [4] , with default parameter values
.
・ LSRi: the proposed method, with low-rank representation and the voice spectrogram reconstructed from the annotation,
and
.
Figure 1. Separation performance measured by GNSDR for the singing voice (left) and background music (right), using our proposed method LSRi.
Table 1. Separation quality for the singing voice and music for the iKala dataset of RPCA, LSPD and LSRi.
As shown in Table 1, for both the singing voice and the accompaniment, our method attains a higher global normalized source-to-distortion ratio (GNSDR), which suggests that LSRi performs well in overall separation and that introducing prior knowledge improves the separation performance. For the vocal part, our algorithm achieves a higher GSIR than RPCA and LSPD, which shows that LSRi is better at removing the instrumental sounds. For the background music part, our algorithm also achieves a higher GSIR, which suggests that LSRi is better at removing the singing voice and at limiting artifacts during the separation process. However, the GSAR values did not improve significantly, which indicates that the algorithm's handling of interference still needs improvement.
4. Conclusion
In this paper, we have presented a time-frequency based source separation algorithm for music signals. LSRi models the vocal and instrumental spectrograms as a sparse matrix and a low-rank matrix, respectively, and assigns the components that neither captures to a residual term. The dictionaries for the singing voice and the background music are pre-learned from isolated singing voice and background music training data, respectively. Furthermore, LSRi incorporates vocal annotation information, through which prior knowledge of the voice and the background music is introduced into the source separation process. In this way, our approach exploits the relevant side information. Evaluations on the iKala dataset show that the proposed method performs better for both the separated singing voice and the music accompaniment. In future studies, we will consider applying LSRi to the separation of complete songs.