Research Article Volume 12 Issue 4

Electrical Engineering, Tel Aviv University, Israel

**Correspondence:** Asaf Zorea, Electrical Engineering, Tel Aviv University, Israel

Received: July 27, 2023 | Published: August 8, 2023

**Citation:** Zorea A, Furst M. Contribution of coincidence detection to speech segregation in noisy environments. *Biom Biostat Int J*. 2023;12(4):114-119. DOI: 10.15406/bbij.2023.12.00394

This study introduces a biologically-inspired model designed to examine the role of coincidence detection cells in speech segregation tasks. The model consists of three stages: a time-domain cochlear model that generates instantaneous rates of auditory nerve fibers, coincidence detection cells that amplify neural activity synchronously with speech presence, and an optimal spectro-temporal speech presence estimator. A comparative analysis between speech estimation based on the firing rates of auditory nerve fibers and those of coincidence detection cells indicates that the neural representation of coincidence cells significantly reduces noise components, resulting in a more distinguishable representation of speech in noise. The proposed framework demonstrates the potential of brainstem nuclei processing in enhancing auditory skills. Moreover, this approach can be further tested in other sensory systems in general and within the auditory system in particular.

**Keywords:** coincidence detection, speech segregation, speech-in-noise, computational model, auditory pathway

In our daily lives, following a conversation often involves listening to speech accompanied by some background noise. The auditory system adeptly processes and discriminates complex acoustic information, allowing us to extract relevant speech cues from the surrounding sound. Previous studies have demonstrated that speech segregation, the process of separating speech from noise, significantly contributes to speech perception and comprehension.^{1,2}

Bregman^{3} ascribes auditory segregation to auditory scene analysis and outlines two stages involved in the segregation process: segmentation and grouping. During segmentation, the input is divided into segments. In the grouping stage, the segments that are estimated to originate from the same source are clustered together. Numerous studies have adopted the auditory scene analysis approach to achieve comprehensive speech segregation. A common technique involves employing a time-frequency (T-F) representation based on the speech spectrogram, utilizing a logarithmic scale of the frequency domain. Estimating the speech presence probability (SPP) relies on analyzing the statistical characteristics of both the speech and the background noise.^{4,5} Moreover, thresholding is often utilized to generate the ideal binary mask of the speech.^{6–8}

The cochlea decomposes sounds into narrow-band signals with specific characteristic frequencies. Auditory information then propagates via the auditory nerve through multiple auditory nuclei, including the cochlear nucleus and inferior colliculus. These centers extract and process complex acoustic features from the neural input. In the inferior colliculus, one of the common cell types is the coincidence detection (CD) cell.^{9} This neuron encodes information by detecting the occurrence of temporally close but spatially distributed input signals. Krips and Furst^{10} have shown that if the inputs act as a non-homogeneous Poisson process (NHPP), then the CD output also behaves as an NHPP. The extracted information is transmitted to the auditory cortex, where it is further processed and integrated over time to contribute to the comprehension and perception of spoken language.

This study aims to investigate the potential involvement of CD neurons in speech segregation using biologically motivated computational modeling. The model presented in this study includes three key stages: In the first stage, an initial T-F representation is obtained by a cochlear model, which generates instantaneous rates (IRs) of auditory nerve fibers (ANFs).^{11–14} In the second stage, a network of CD cells is integrated to enhance the neural representation of the auditory input. Finally, an optimal speech presence estimator is employed, enabling us to assess the effectiveness of the CD processing. The structure of this paper is organized as follows. The material and methodology are presented in Section 2. The study results are presented in Section 3. Finally, the discussion and conclusions are summarized in Section 4 and Section 5.

A schematic illustration of the model is depicted in Figure 1. The diagram is divided into three blocks, each representing a component of the model. The first block represents the auditory periphery, which is responsible for the initial processing of auditory stimuli. The second block illustrates the network of CD cells designed with excitatory inputs. The third block signifies the speech estimator, which integrates input from multiple tonotopic channels to estimate the probability of speech presence. Notably, this estimator can receive input from either CD cells or ANFs responses.

**Cochlear model**

The cochlear model utilized in this study employs a time-domain solution of cochlear mechanics. It calculates the basilar membrane motion as a response to an acoustic stimulus while integrating the electro-mechanical non-linear motion of the outer hair cells.^{11-13,15} Practically, the model was simulated with an adaptive time step and 256 cochlear partitions. The derivation of the ANFs’ IRs at each cochlear partition was obtained by a phenomenological model.^{14,16}

**Coincidence cells architecture**

Each neural input is represented by a set of spikes that occur at instances $\left\{{t}_{n},n\in \mathcal{N}\right\}$. This series of spike events can be described as a random point process with IR $\lambda \left(t\right)$ and refractory period ${\tau}_{r}$. A general excitatory-excitatory (EE) cell, $E{E}_{M}^{N}$, has $N$ independent excitatory inputs $\Psi =\left\{{E}_{1},\mathrm{..},{E}_{N}\right\}$ with corresponding IRs ${\Psi}_{\lambda}=\left\{{\lambda}_{{E}_{1}},\mathrm{..},{\lambda}_{{E}_{N}}\right\}$, and generates a spike when at least $M$ of its inputs spike during an interval ${\Delta}_{c}$. To maintain simplicity, it was assumed that $M=N$; such a cell is denoted $E{E}_{M}$. It generates spikes at instances $\left\{{t}_{{n}_{f}},{n}_{f}\in \mathcal{N}\right\}$,

${t}_{{n}_{f}}=\max\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}\quad \text{if}\quad \max\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}-\min\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}\le {\Delta}_{c}$ 1

where $\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}$ denote the discrete firing times of the $M$ excitatory inputs respectively.
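The coincidence rule of Eq. (1) can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation; the function name and the greedy spike-matching strategy are assumptions made here:

```python
import numpy as np

def ee_coincidence_times(spike_trains, delta_c):
    """Output spike times of an EE_M cell: it fires at the latest input
    spike time whenever all M inputs fire within a window delta_c (Eq. 1).

    spike_trains: list of M sorted arrays of spike times (seconds).
    delta_c: coincidence window (seconds).
    """
    out = []
    idx = [0] * len(spike_trains)  # next unconsumed spike on each input
    while all(i < len(t) for i, t in zip(idx, spike_trains)):
        times = [t[i] for i, t in zip(idx, spike_trains)]
        if max(times) - min(times) <= delta_c:
            out.append(max(times))        # Eq. (1): spike at the last arrival
            idx = [i + 1 for i in idx]    # consume all M input spikes
        else:
            # advance the earliest input; it cannot join a later coincidence
            j = int(np.argmin(times))
            idx[j] += 1
    return np.array(out)
```

With two inputs spiking at 1 ms and 2 ms and $\Delta_c = 3$ ms, the cell emits a single spike at 2 ms, the latest of the coincident arrivals.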

According to Krips and Furst,^{10} CD cells exhibit NHPP behavior when their inputs are NHPP point processes themselves. As a result, their output can be computed analytically. The expression for the $E{E}_{M}$ cell’s IR was obtained using this approach:

${\lambda}_{E{E}_{M}}\left(t|{\Psi}_{\lambda}\right)={\displaystyle \sum}_{m=1}^{M}\left[{\lambda}_{{E}_{m}}\left(t\right)\cdot {\displaystyle \prod}_{\tilde{m}=1,\tilde{m}\ne m}^{M}\underset{t-{\Delta}_{c}}{\overset{t}{{\displaystyle \int}}}{\lambda}_{{E}_{\tilde{m}}}\left(\tau \right)d\tau \right]$ 2

Despite the diversity of the $E{E}_{M}$ cell’s inputs, it is reasonable to presume that the firing rates of the $M$ neurons in response to a given stimulus would be similar on average, therefore:

${\lambda}_{{E}_{m}}(t)\text{}\stackrel{\Delta}{\text{=}}{\lambda}_{E}(t),\forall m\in \left\{1,\mathrm{..},M\right\}$ 3

where $m$ denotes the input cell index.

The $E{E}_{M}$ cell’s output, ${\lambda}_{E{E}_{M}}$, may be described as follows:

${\lambda}_{E{E}_{M}}\left(t|{\Psi}_{\lambda}\right)=M\cdot {\lambda}_{E}\left(t\right)\cdot {\underset{{I}_{c}\left(t\right)}{\underbrace{\left(\underset{t-{\Delta}_{c}}{\overset{t}{{\displaystyle \int}}}{\lambda}_{E}\left(\tau \right)d\tau \right)}}}^{M-1}$ 4

where ${I}_{c}$ represents the coincidence integral.
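Substituting the identical-rate assumption of Eq. (3) into Eq. (2) makes every summand equal, which gives Eq. (4) directly:

$\lambda_{EE_M}\left(t\mid\Psi_\lambda\right)=\sum_{m=1}^{M}\lambda_{E}\left(t\right)\prod_{\tilde m=1,\tilde m\ne m}^{M}\int_{t-\Delta_c}^{t}\lambda_{E}\left(\tau\right)d\tau = M\cdot\lambda_{E}\left(t\right)\cdot\left(\int_{t-\Delta_c}^{t}\lambda_{E}\left(\tau\right)d\tau\right)^{M-1}$

since each of the $M$ summands contains the same factor $\lambda_E(t)$ multiplied by $M-1$ identical integrals.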

A discrete $E{E}_{M}$ cell’s output, ${\lambda}_{E{E}_{M}}\left[n\right]$, can be obtained using a discrete approximation of the coincidence integral ${I}_{c}$, with the time domain discretized into ${N}_{c}$ equal panels, each of size ${\delta}_{s}$. Applying the trapezoidal rule, an approximation for ${I}_{c}$ is given by:

$\underset{t-{\Delta}_{c}}{\overset{t}{{\displaystyle \int}}}{\lambda}_{E}\left(\tau \right)d\tau \simeq \left(\frac{1}{2}{\lambda}_{E}\left({\tau}_{1}\right)+{\lambda}_{E}\left({\tau}_{2}\right)+\mathrm{..}+{\lambda}_{E}\left({\tau}_{{N}_{c}-1}\right)+\frac{1}{2}{\lambda}_{E}\left({\tau}_{{N}_{c}}\right)\right)\cdot {\delta}_{s}$ 5

where ${N}_{c}={\Delta}_{c}\cdot {f}_{s}$ is the discrete integration window length, ${\tau}_{i}=t\cdot {f}_{s}+i$ is the discrete time index, ${\delta}_{s}=\frac{1}{{f}_{s}}$ is the sample period, and ${f}_{s}$ is the sample rate.

As a consequence, in the discrete-time domain, the coincidence integral can be computed by convolving $\lambda \left[n\right]$ with the following finite impulse response (FIR) filter ${h}_{fir}\left[n\right]$ :

$\begin{array}{l}{h}_{fir}\left[n\right]={\left[\frac{1}{2},1,\mathrm{..},1,\frac{1}{2}\right]}_{{N}_{c}}\cdot {\delta}_{s}\hfill \\ {I}_{c}\left[n\right]={\lambda}_{E}\left[n\right]\text{*}{h}_{fir}\left[n\right]\hfill \end{array}\}$ 6

Finally, the discrete $E{E}_{M}$ cell’s IR, ${\lambda}_{E{E}_{M}}\left[n\right]$ , was obtained by:

${\lambda}_{E{E}_{M}}\left[n|{\Psi}_{\lambda}\right]=M\cdot {\lambda}_{E}\left[n\right]\cdot {\left({\lambda}_{E}\left[n\right]\ast {h}_{fir}\left[n\right]\right)}^{M-1}$ 7

The corresponding CD cells’ IRs are generated in this manner from each of the $K$ received vectors of ANFs’ IRs.
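Eqs. (5)–(7) can be combined into a short numerical sketch for a single tonotopic channel. This is illustrative only, assuming NumPy; the function name and the causal alignment of the convolution are choices made here, not specified in the paper:

```python
import numpy as np

def ee_rate_discrete(lam, M, delta_c, fs):
    """Discrete EE_M instantaneous rate, Eq. (7): the input rate lam[n]
    times the (M-1)-th power of the trapezoidal coincidence integral.

    lam: instantaneous rate of one input (spikes/s), sampled at fs.
    """
    n_c = int(round(delta_c * fs))             # window length in samples, N_c
    h_fir = np.ones(n_c)                       # trapezoidal FIR weights, Eq. (6)
    h_fir[0] = h_fir[-1] = 0.5
    h_fir /= fs                                # delta_s = 1/fs
    i_c = np.convolve(lam, h_fir)[: len(lam)]  # causal coincidence integral I_c[n]
    return M * lam * i_c ** (M - 1)
```

For a constant input rate ${\lambda}_{E}$, the integral settles at ${\lambda}_{E}\cdot{\Delta}_{c}$, recovering the rate scaling of Eq. (4).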

**Speech presence estimation**

When an interfering noise coincides in frequency and time with a signal of interest, the two overlap on the basilar membrane, causing the signal and the noise to compete for the same receptors. Let ${\lambda}_{K}\left(n\right)$ be an IR random vector distributed across $K$ cochlear partitions, as a function of time. In the neural activity domain, according to the tonotopic organization of the auditory system, it can be assumed that the neural response is an additive mixture of clean speech ${\lambda}^{Speech}\left(n\right)$ and acoustic noise ${\lambda}^{Noise}\left(n\right)$.

Two hypotheses, ${H}_{1}\left[n\right]$ and ${H}_{2}\left[n\right]$, were suggested, indicating speech absence and speech presence, respectively:

$\begin{array}{l}{H}_{1}\left[n\right]:Y\left(n\right)={\lambda}^{Noise}\left[n\right]\hfill \\ {H}_{2}\left[n\right]:Y\left(n\right)={\lambda}^{Speech}\left[n\right]+{\lambda}^{Noise}\left[n\right]\hfill \end{array}\}$ 8

The process of separating an auditory scene into distinct objects was modeled as an unbiased optimal estimator of the SPP, which is the probability of speech being present in a noisy observation. Motivated by the central limit theorem,^{17} the IR’s distribution, $\lambda$, was assumed to be a superposition of multivariate Gaussians generated by two parent processes:

$p\left(\lambda \right)={\displaystyle \sum}_{i=1}^{2}{\pi}_{i}\mathcal{N}\left(\lambda |{\mu}_{i},{\Sigma}_{i}\right),\quad \text{s.t.}\quad {\displaystyle \sum}_{i=1}^{2}{\pi}_{i}=1$ 9

where $\mathcal{N}$ denotes a multivariate normal distribution function, ${\pi}_{1,2}$ denote the prior probabilities of $\lambda \in {H}_{1,2}$, ${\mu}_{1,2}$ denote the Gaussian means, and ${\Sigma}_{1,2}$ denote the Gaussian covariance matrices. Due to the statistical independence of ANFs across multiple characteristic frequencies, it was reasonable to hypothesize that any two different $\lambda$ components are uncorrelated. The off-diagonal correlations were set to *zero*, resulting in diagonal covariance matrices ${\Sigma}_{1,2}$; therefore $\mathcal{N}\left(\lambda \right)$ yielded:

$\mathcal{N}\left(\lambda |\mu ,\Sigma \right)=\frac{1}{{\left(2\pi \right)}^{K/2}}{\displaystyle \prod}_{k=1}^{K}\frac{1}{{\sigma}_{k}}\text{exp}\left\{-\frac{1}{2}{\left(\frac{{\lambda}_{k}-{\mu}_{k}}{{\sigma}_{k}}\right)}^{2}\right\}$ 10

where $k$ and ${\sigma}_{1,2}$ denote the cochlear position index and the Gaussians’ standard deviations, respectively.
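Eq. (10) amounts to a product of $K$ independent one-dimensional Gaussians across cochlear channels. A minimal sketch (the function name is an assumption made here):

```python
import numpy as np

def diag_gauss_pdf(lam, mu, sigma):
    """Multivariate normal density with diagonal covariance, Eq. (10):
    a product of K independent 1-D Gaussians across cochlear channels.

    lam, mu, sigma: arrays of length K (rate vector, means, std devs).
    """
    z = (lam - mu) / sigma
    norm = (2 * np.pi) ** (len(lam) / 2) * np.prod(sigma)
    return np.exp(-0.5 * np.sum(z ** 2)) / norm
```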

The problem was addressed as an optimization problem, with the objective of estimating a set of parameters that best fit the joint probability of the hypotheses, and was solved using the expectation-maximization (EM) approach.^{18}

Let $Z$ be the latent variable that determines the component from which $\lambda$ originates, s.t.,

$P\left(\lambda |Z=z\right)\sim \mathcal{N}\left({\mu}_{z},{\Sigma}_{z}\right)$ 11

During the expectation step, the weights ${w}_{j}\left[n\right]$ were defined as a ’soft’ assignment of $\lambda \left[n\right]$ to Gaussian $j,$

${w}_{j}\left[n\right]=P\left(z=j|\lambda \left[n\right];\theta \right)$ 12

where $\theta $ indicates the parameters set of the model ($\theta =\left\{\mu ,\sigma ,\pi \right\}$ ).

A new parameter set $\theta$ was estimated throughout the maximization step by maximizing the expected log-likelihood with respect to the weights,

$\underset{\theta}{\arg \max}{\displaystyle \sum}_{n=1}^{N}{\displaystyle \sum}_{j=1}^{2}{w}_{j}\left[n\right]\log\left({\pi}_{j}\mathcal{N}\left(\lambda \left[n\right];{\mu}_{j},{\sigma}_{j}^{2}\right)\right)$ 13

Given an initial estimate, the EM algorithm cycles through Eqs. (12) and (13) repeatedly until the estimates converge.

The entire algorithm for estimating the statistical properties of both the speech and the noise neural activities is illustrated in Algorithm 1.

**Data:** ${\lambda}_{1,\dots ,N}$
**Result:** ${\mathcal{N}}_{j=1,2}\left(\lambda |{\mu}_{j},{\Sigma}_{j}\right)$
**while** ${\theta}_{t+1}\ne {\theta}_{t}$ **do**
  **E-step:** for each $n,j$ **do**
    ${w}_{j}\left[n\right]=\frac{{\pi}_{j}\cdot \mathcal{N}\left(\lambda \left[n\right]|{\mu}_{j},{\sigma}_{j}\right)}{{\sum}_{i=1}^{2}{\pi}_{i}\cdot \mathcal{N}\left(\lambda \left[n\right]|{\mu}_{i},{\sigma}_{i}\right)}$ (14)
  **M-step:** for each $j$ **do**
    ${\mu}_{j}=\frac{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]\cdot \lambda \left[n\right]}{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]}$ (15)
    ${\sigma}_{j}^{2}=\frac{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]\cdot {\left(\lambda \left[n\right]-{\mu}_{j}\right)}^{2}}{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]}$ (16)
    ${\pi}_{j}=\frac{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]}{N}$ (17)
**end**

**Algorithm 1** Estimating the speech presence probability using the EM algorithm with a multivariate normal distribution and diagonal covariance matrices.

After estimating all the parameters, the SPP can be obtained by:

$SPP\left(\lambda |\mu ,\sigma \right)=\frac{{\pi}_{i}\mathcal{N}\left(\lambda |{\mu}_{i},{\sigma}_{i}\right)}{{{\displaystyle \sum}}_{j=1}^{2}{\pi}_{j}\mathcal{N}\left(\lambda |{\mu}_{j},{\sigma}_{j}\right)},i\in {H}_{2}$ 18
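Algorithm 1 and Eq. (18) can be sketched for a single channel (the paper's estimator operates on the $K$-dimensional diagonal model). The percentile-based initialization and fixed iteration count below are assumptions made here for illustration:

```python
import numpy as np

def em_spp(lam, n_iter=200):
    """Minimal 1-D sketch of Algorithm 1: fit a two-component Gaussian
    mixture to rate samples lam[n] with EM, then return the SPP of
    Eq. (18) as the posterior of the higher-mean component."""
    def weighted_pdf(mu, sigma, pi):
        # pi_j * N(lam | mu_j, sigma_j) for j = 1, 2
        return np.stack([pi[j] * np.exp(-0.5 * ((lam - mu[j]) / sigma[j]) ** 2)
                         / (np.sqrt(2 * np.pi) * sigma[j]) for j in range(2)])

    mu = np.array([np.percentile(lam, 25), np.percentile(lam, 75)])
    sigma = np.full(2, lam.std() + 1e-12)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step, Eq. (14): responsibilities w_j[n]
        p = weighted_pdf(mu, sigma, pi)
        w = p / p.sum(axis=0)
        # M-step, Eqs. (15)-(17): update means, variances, priors
        nk = w.sum(axis=1)
        mu = (w * lam).sum(axis=1) / nk
        sigma = np.sqrt((w * (lam - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-12
        pi = nk / len(lam)
    p = weighted_pdf(mu, sigma, pi)
    return p[int(np.argmax(mu))] / p.sum(axis=0)   # Eq. (18)
```

The higher-mean component is taken as "speech present", reflecting the assumption that speech adds neural activity on top of the noise floor.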

**Evaluation method**

An effective method for evaluating the ability of a speech estimator to separate speech from noise is to examine the area under the receiver-operating characteristic curve (AUC), with a higher AUC indicating better performance. Threshold values in the range $\left[0,1\right]$ were applied to the SPP outputs to categorize them as speech present or absent. For each threshold, the true positive rate and false positive rate were determined by calculating the proportion of correctly identified speech-containing segments and incorrectly identified noise segments, respectively. The ground truth used for the evaluation was manually labeled by marking which segments contain speech and which contain only noise.
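The threshold sweep described above can be sketched as follows; the number of thresholds and the trapezoidal integration of the ROC curve are choices made here for illustration:

```python
import numpy as np

def auc_from_spp(spp, labels, n_thresh=101):
    """Sweep thresholds over [0, 1], compute true/false positive rates
    of the binarized SPP against ground-truth labels, and integrate
    the resulting ROC curve to obtain the AUC."""
    tpr, fpr = [], []
    for th in np.linspace(0.0, 1.0, n_thresh):
        pred = spp >= th
        tpr.append(np.mean(pred[labels == 1]))  # hits on speech frames
        fpr.append(np.mean(pred[labels == 0]))  # false alarms on noise frames
    tpr, fpr = np.array(tpr), np.array(fpr)
    order = np.lexsort((tpr, fpr))              # sort by FPR, break ties by TPR
    return float(np.trapz(tpr[order], fpr[order]))
```

A perfectly separating estimator yields an AUC of 1, while chance-level performance yields 0.5.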

For the evaluation, a total of thirty speech utterances were taken from the NOIZEUS database, a noisy speech corpus.^{19} The sentences were degraded with three different types of real-world noise: car, white, and babble. Interfering signals were added at signal-to-noise ratios (SNRs) ranging from -15 to 15 dB, using method B of *ITU-T P.56*.^{20}
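The SNR mixing step can be sketched as below. Note that this simplified version scales the noise by plain RMS power, whereas the paper uses the active speech level of ITU-T P.56 method B:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale a noise signal so the speech-to-noise power ratio equals
    snr_db, then add it to the speech. Simplified sketch: plain RMS
    power stands in for the ITU-T P.56 active speech level."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```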

**Auditory periphery response**

Figure 2 illustrates the relationship between the cochlear response and cochlear position at different frequencies, when a linear chirp stimulus is applied at a sound pressure level (SPL) of 65 dB. The derived ANFs IRs are displayed in a color-coded format, demonstrating how the response varies with changes in input frequency along the cochlear partition.

**Figure 2** ANF IR derivation as a response to a linear chirp. The frequency (in kHz) is plotted along the x-axis, while the corresponding distance from the stapes (in cm) is represented on the y-axis and denoted by ’x’.

**Example outcome**

Figure 3 depicts an example of the model’s outputs in response to the English phrase “*We find joy in*” at a level of 65 dB SPL. The sentence was taken from track number 7 of the *NOIZEUS database*.^{19}

Figure 3 comprises panels that depict various variables and environmental conditions. The left and right columns, denoted Panels A and B respectively, display the model’s inputs and outputs for noisy speech degraded by car noise at SNRs of 0 dB and 15 dB. Panels A1 and B1 show the acoustic waveforms, while Panels A2 and B2 present the ANFs’ IRs as color-coded graphs in spikes/sec, with the x-axis representing post-stimulus time and the y-axis representing distance from the stapes. In Panels A3 and B3, the ANFs’ SPPs are displayed, with gray backgrounds indicating binary flags for speech presence (1) or absence (0). Although the SPP at 15 dB SNR matches the manually labeled speech presence, the SPP at 0 dB SNR does not clearly indicate it, regardless of the speech’s presence. Panels A4 and B4 display the CD cells’ IRs, while Panels A5 and B5 show their SPPs. The results show that the SPPs computed after CD processing follow the speech patterns and match the manual labels more closely, even when the energy of the background noise equals that of the speech signal.

**Figure 3** The acoustic waveforms, ANFs’ IRs, CD cells’ IRs and their corresponding SPPs in response to the English sentence “*We find joy in*” at a level of 65 dB SPL. The sample was obtained from file ’sp07.wav’ of the NOIZEUS database between 0 s and 1.20 s. Panels A1 and B1 respectively display the acoustic waveform for noisy speech stimuli degraded by car noise at SNRs of 0 dB and 15 dB. Panels A2 and B2 illustrate the ANFs’ responses. Panels A3 and B3 show the corresponding ANFs’ SPPs. Panels A4 and B4 display the response of the CD cells’ network (with parameters $M=6$ and ${\Delta}_{c}=3\,ms$). Panels A5 and B5 provide the corresponding SPPs of the CD cells’ response.

**Coincidence detection cell parameters tuning**

To determine the optimal architecture for the CD cell, we systematically varied the number of input cells ($M$) and the coincidence window $({\Delta}_{c})$, as specified in Eq. (5). The results are presented in Figure 4. Based on these results, we selected $M=6$ and ${\Delta}_{c}=3\,ms$ as the parameters to be used in the evaluation. These parameter values correspond to those of actual CD cells found in the inferior colliculus and the ventral cochlear nucleus.^{21–23}

**Figure 4** A color-coded graph of the AUC for speech degraded by car noise at an SNR of 0 dB, for various combinations of input cells $(M)$ and coincidence window lengths $\left({\Delta}_{c}\right)$. The speech was obtained from file ’sp09.wav’ of the NOIZEUS database.

**Speech presence estimators**

Figure 5 presents a comparison between CD-based and ANF-based estimators. Figure 5A shows the noises’ power spectral densities, while the average AUC scores of the 30 sentences, with corresponding standard deviations, are plotted as a function of SNR for three types of noise: babble noise (Figure 5B), white noise (Figure 5C), and car noise (Figure 5D).

**Figure 5** A comparison between ANF-based and CD-based estimators (with parameters $M=6$, ${\Delta}_{c}=3\,ms$) for a healthy cochlea. The power spectral density and the AUC scores for three different real-world noises (babble, white, and car) at SNRs of −15 to 15 dB are shown in panels A, B, C, and D, respectively.

Both ANF-based and CD-based estimators showed an increase in average AUC with increasing SNR. However, CD-based estimators outperformed ANF-based estimators for all tested SNRs and noise types, with the most significant improvement observed at mid-to-low input SNRs. The difference in performance was assessed with ANOVA and was significant for all noise types and SNRs ($P<.001$). For $SNR\ge 10\,dB$, the performance yielded by the ANFs was already high ($AUC\ge 0.9$), so CD processing provided only a minor improvement. However, for $SNR\approx 0\,dB$, the ANFs yielded $AUC\approx 0.7$ for all noise types, and the additional CD processing raised this to $AUC\approx 0.9$. On the other hand, at very low SNRs, for example $SNR=-15\,dB$, where ANF performance was close to chance ($AUC\approx 0.5$), the improvement yielded by CD processing was small.

In this paper, a speech segregation model based on the physiology of the auditory pathway is presented. The proposed excitatory-only coincidence detection (CD) architecture demonstrates its effectiveness in reducing noise components in stationary noise while concurrently improving the accuracy of speech segregation. These findings highlight the potential of CD cells to contribute significantly to enhancing speech perception. To ensure broad applicability and avoid over-fitting, the models and assumptions were simplified. Using an unsupervised optimal estimator further strengthens the study’s findings, as it provides unbiased insights into the neural representation of CD processing.

CD cells are widely distributed across various auditory nuclei, with a significant presence in the trapezoid body nuclei, where they play a central role in binaural perception.^{24–26} Binaural processes have been demonstrated to enhance speech segregation,^{27,28} implying that CD cells may be involved in this aspect of auditory perception. However, speech segregation can also occur monaurally. In natural acoustic signals, amplitude modulation (AM) serves as a critical temporal feature, and its significance has been highlighted in various perceptual tasks, such as envelope detection and segregation.^{29} Notably, CD cells have been linked to AM processing.^{9,30} Furthermore, envelope and temporal fine structure information are known to be important for speech perception.^{31–33} The CD cells presented in this paper function as auto-correlation units, effectively enhancing this information, which is essential for speech segregation. These findings provide valuable insights into the neural mechanisms underlying auditory processing.

While the tonotopic representation used in the estimator was found to be effective, it is important to acknowledge its limitations. The assumption of independence between different characteristic frequencies may not always hold true. Although spike generation in different auditory nerve fibers (ANFs) is statistically independent, the tuning curves of ANFs have a long low-frequency tail, and the tips of the curves broaden and decrease at higher sound pressure levels (SPLs).^{34–36} Consequently, the synaptic drive to different ANFs across the cochlear length is not entirely independent. Future investigations should incorporate more sophisticated models that account for the interactions between frequency channels. Moreover, an alternative architecture incorporating inhibitory inputs may be more effective for other types of noises or conditions. Future work should also consider including inhibitory inputs and evaluating the model’s performance against different noise types.

Two distinct methods for speech estimation were compared: one based on coincidence detection and the other on auditory nerve fibers. CD-based estimators consistently outperformed ANF-based estimators across all tested SNRs and noise types. The improvement was most significant for mid-low input SNRs. These findings suggested that CD information plays a crucial role in speech segregation, contributing significantly to the enhanced performance of the model.

This research was partially supported by the ISRAEL SCIENCE FOUNDATION: grant No. 563/12.

The authors declare that there are no conflicts of interest.

None.

1. Li N, Loizou PC. Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. *J Acoust Soc Am*. 2008;123(3):1673–1682.
2. Wang D, Kjems U, Pedersen MS, et al. Speech intelligibility in background noise with ideal binary time-frequency masking. *J Acoust Soc Am*. 2009;125(4):2336–2347.
3. Bregman AS. *Auditory scene analysis: the perceptual organization of sound*. Cambridge: The MIT Press; 1990.
4. Cohen I, Berdugo B. Noise estimation by minima controlled recursive averaging for robust speech enhancement. *IEEE Signal Processing Letters*. 2002;9:12–15.
5. Paliwal K, Schwerin B, Wójcicki K. Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator. *Speech Communication*. 2012;54:282–305.
6. May T, Dau T. Computational speech segregation based on an auditory-inspired modulation analysis. *J Acoust Soc Am*. 2014;136(6):3350–3359.
7. Han K, Wang D. A classification based approach to speech segregation. *J Acoust Soc Am*. 2012;132(5):3475–3483.
8. Wang D. Speech separation by humans and machines. In: Divenyi P, editor. Boston: Springer US; 2005:181–197.
9. Joris PX, Schreiner CE, Rees A. Neural processing of amplitude-modulated sounds. *Physiological Reviews*. 2004;84(2):541–577.
10. Krips R, Furst M. Stochastic properties of coincidence-detector neural cells. *Neural Computation*. 2009;21(9):2524–2553.
11. Cohen A, Furst M. Integration of outer hair cell activity in a one-dimensional cochlear model. *J Acoust Soc Am*. 2004;115(5 Pt 1):2185–2192.
12. Barzelay O, Furst M. Cochlear model with integrated tectorial membrane and outer hair cells. *AIP Conference Proceedings*. 2011;1403:79–84.
13. Sabo D, Barzelay O, Weiss S, et al. Fast evaluation of a time-domain non-linear cochlear model on GPUs. *Journal of Computational Physics*. 2014;265:97–112.
14. Furst M. Cochlear model for hearing loss. *IntechOpen*; 2015.
15. Faran M, Furst M. Inner-hair-cell induced hearing loss: a biophysical modeling perspective. *J Acoust Soc Am*. 2023;153:1776–1790.
16. Zilany MSA, Bruce IC, Nelson PC, et al. A phenomenological model of the synapse between the inner hair cell and auditory nerve: long-term adaptation with power-law dynamics. *J Acoust Soc Am*. 2009;126:2390–2412.
17. Ephraim Y, Malah D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. *IEEE Transactions on Acoustics, Speech, and Signal Processing*. 1984;32:1109–1121.
18. Moon T. The expectation-maximization algorithm. *IEEE Signal Processing Magazine*. 1996;13:47–60.
19. Hu Y, Loizou PC. Subjective comparison and evaluation of speech enhancement algorithms. *Speech Communication*. 2007;49(7):588–601.
20. ITU-T P.56. Objective measurement of active speech level. ITU; 2011.
21. McGinley MJ, Oertel D. Rate thresholds determine the precision of temporal integration in principal cells of the ventral cochlear nucleus. *Hearing Research*. 2006;216-217:52–63.
22. Wenstrup JJ, Nataraj K, Sanchez JT. Mechanisms of spectral and temporal integration in the mustached bat inferior colliculus. *Front Neural Circuits*. 2012;6:75.
23. Chen Y, Zhang H, Wang H. The role of coincidence detector neurons in the reliability and precision of subthreshold signal detection in noise. *PLoS ONE*. 2013;8:e56822.
24. Yin T, Chan J. Interaural time sensitivity in medial superior olive of cat. *J Neurophysiol*. 1990;64(2):465–488.
25. McAlpine D, Jiang D, Shackleton TM, et al. Convergent input from brainstem coincidence detectors onto delay-sensitive neurons in the inferior colliculus. *J Neurosci*. 1998;18(15):6026–6039.
26. Caspary DM, Ling L, Turner JG, et al. Superior olivary complex: functional neuropharmacology of the principal cell types. *The Journal of Experimental Biology*. 2008;211:1781–1791.
27. Rennies J, Best V, Roverud E, et al. Energetic and informational components of speech-on-speech masking in binaural speech intelligibility and perceived listening effort. *Trends Hear*. 2019;23:2331216519854597.
28. Roman N, Srinivasan S, Wang D. Binaural segregation in multisource reverberant environments. *J Acoust Soc Am*. 2006;120(6):4040–4051.
29. Yost WA. Auditory perception. In: *Fundamentals of Hearing*. Brill; 2006:203–221.
30. Nelson PC, Carney LH. Neural rate and timing cues for detection and discrimination of amplitude-modulated tones in the awake rabbit inferior colliculus. *J Neurophysiol*. 2007;97(1):522–539.
31. Ahissar E, Ahissar M. Processing of the temporal envelope of speech. In: *The Auditory Cortex*. Routledge; 2005:313–332.
32. Rosen S. Temporal information in speech: acoustic, auditory and linguistic aspects. *Philos Trans R Soc Lond B Biol Sci*. 1992;336(1278):367–373.
33. Shannon RV, Zeng FG, Kamath V, et al. Speech recognition with primarily temporal cues. *Science*. 1995;270(5234):303–304.
34. Shera CA, Guinan JJ, Oxenham AJ. Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. *Proc Natl Acad Sci U S A*. 2002;99:3318–3323.
35. Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. *Hearing Research*. 1990;47:103–138.
36. Brownell WE, Bader CR, Bertrand D, et al. Evoked mechanical responses of isolated cochlear outer hair cells. *Science*. 1985;227:194–196.

©2023 Zorea, et al. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.
