Research Article Volume 12 Issue 4

Electrical Engineering, Tel Aviv University, Israel

**Correspondence:** Asaf Zorea, Electrical Engineering, Tel Aviv University, Israel

Received: July 27, 2023 | Published: August 8, 2023

**Citation:** Zorea A, Furst M. Contribution of coincidence detection to speech segregation in noisy environments. *Biom Biostat Int J*. 2023;12(4):114-119. DOI: 10.15406/bbij.2023.12.00394

This study introduces a biologically-inspired model designed to examine the role of coincidence detection cells in speech segregation tasks. The model consists of three stages: a time-domain cochlear model that generates instantaneous rates of auditory nerve fibers, coincidence detection cells that amplify neural activity synchronously with speech presence, and an optimal spectro-temporal speech presence estimator. A comparative analysis between speech estimation based on the firing rates of auditory nerve fibers and those of coincidence detection cells indicates that the neural representation of coincidence cells significantly reduces noise components, resulting in a more distinguishable representation of speech in noise. The proposed framework demonstrates the potential of brainstem nuclei processing in enhancing auditory skills. Moreover, this approach can be further tested in other sensory systems in general and within the auditory system in particular.

**Keywords:** coincidence detection, speech segregation, speech-in-noise, computational model, auditory pathway

In our daily lives, following a conversation often involves listening to speech accompanied by some background noise. The auditory system adeptly processes and discriminates complex acoustic information, allowing us to extract relevant speech cues from the surrounding sound. Previous studies have demonstrated that speech segregation, the process of separating speech from noise, significantly contributes to speech perception and comprehension.^{1,2}

Bregman^{3} ascribes auditory segregation to auditory scene analysis and outlines two stages involved in the segregation process: segmentation and grouping. During segmentation, the input is divided into segments. In the grouping stage, the segments that are estimated to originate from the same source are clustered together. Numerous studies have adopted the auditory scene analysis approach to achieve comprehensive speech segregation. A common technique involves employing a time-frequency (T-F) representation based on the speech spectrogram, utilizing a logarithmic scale of the frequency domain. Estimating the speech presence probability (SPP) relies on analyzing the statistical characteristics of both the speech and the background noise.^{4,5} Moreover, thresholding is often utilized to generate the ideal binary mask of the speech.^{6–8}

The cochlea decomposes sounds into narrow-band signals with specific characteristic frequencies. Auditory information then propagates via the auditory nerve through multiple auditory nuclei, including the cochlear nucleus and inferior colliculus. These centers extract and process complex acoustic features from the neural input. In the inferior colliculus, one of the common cell types is the coincidence detection (CD) cell.^{9} This neuron encodes information by detecting the occurrence of temporally close but spatially distributed input signals. Krips and Furst^{10} have shown that if the inputs act as a non-homogeneous Poisson process (NHPP), then the CD output also behaves as an NHPP. The extracted information is transmitted to the auditory cortex, where it is further processed and integrated over time to contribute to the comprehension and perception of spoken language.

This study aims to investigate the potential involvement of CD neurons in speech segregation using biologically motivated computational modeling. The model presented in this study includes three key stages: In the first stage, an initial T-F representation is obtained by a cochlear model, which generates instantaneous rates (IRs) of auditory nerve fibers (ANFs).^{11–14} In the second stage, a network of CD cells is integrated to enhance the neural representation of the auditory input. Finally, an optimal speech presence estimator is employed, enabling us to assess the effectiveness of the CD processing. The structure of this paper is organized as follows. The material and methodology are presented in Section 2. The study results are presented in Section 3. Finally, the discussion and conclusions are summarized in Section 4 and Section 5.

A schematic illustration of the model is depicted in Figure 1. The diagram is divided into three blocks, each representing a component of the model. The first block represents the auditory periphery, which is responsible for the initial processing of auditory stimuli. The second block illustrates the network of CD cells designed with excitatory inputs. The third block signifies the speech estimator, which integrates input from multiple tonotopic channels to estimate the probability of speech presence. Notably, this estimator can receive input from either CD cells or ANFs responses.

**Cochlear model**

The cochlear model utilized in this study employs a time-domain solution of cochlear mechanics. It calculates the basilar membrane motion as a response to an acoustic stimulus while integrating the electro-mechanical non-linear motion of the outer hair cells.^{11-13,15} Practically, the model was simulated with an adaptive time step and 256 cochlear partitions. The derivation of the ANFs’ IRs at each cochlear partition was obtained by a phenomenological model.^{14,16}

**Coincidence cells architecture**

Each neural input is represented by a set of spikes that occur at instances $\left\{{t}_{n},n\in \mathcal{N}\right\}$. This series of spike events can be described as a random point process with IR $\lambda \left(t\right)$ and refractory period ${\tau}_{r}$. A general excitatory-excitatory (EE) cell, $E{E}_{M}^{N}$, has $N$ independent excitatory inputs $\Psi =\left\{{E}_{1},\mathrm{..},{E}_{N}\right\}$ with corresponding IRs ${\Psi}_{\lambda}=\left\{{\lambda}_{{E}_{1}},\mathrm{..},{\lambda}_{{E}_{N}}\right\}$, and generates a spike when at least $M$ of its inputs spike during an interval ${\Delta}_{c}$. To maintain simplicity, it was assumed that $M=N$; such a cell is denoted $E{E}_{M}$. It generates spikes at instances $\left\{{t}_{{n}_{f}},{n}_{f}\in \mathcal{N}\right\}$,

${t}_{{n}_{f}}=\max\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}\quad \text{if}\quad \max\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}-\min\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}\le {\Delta}_{c}$ 1

where $\left\{{t}^{1}{}_{{n}_{f}},\mathrm{...},{t}^{M}{}_{{n}_{f}}\right\}$ denote the discrete firing times of the $M$ excitatory inputs respectively.
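The coincidence rule of Eq. (1) can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation; the function name and the greedy spike-matching strategy are assumptions made here:

```python
import numpy as np

def ee_coincidence_times(spike_trains, delta_c):
    """Output spike times of an EE_M cell: it fires at the latest input
    spike time whenever all M inputs fire within a window delta_c (Eq. 1).

    spike_trains: list of M sorted arrays of spike times (seconds).
    delta_c: coincidence window (seconds).
    """
    out = []
    idx = [0] * len(spike_trains)  # next unconsumed spike on each input
    while all(i < len(t) for i, t in zip(idx, spike_trains)):
        times = [t[i] for i, t in zip(idx, spike_trains)]
        if max(times) - min(times) <= delta_c:
            out.append(max(times))        # Eq. (1): spike at the last arrival
            idx = [i + 1 for i in idx]    # consume all M input spikes
        else:
            # advance the earliest input; it cannot join a later coincidence
            j = int(np.argmin(times))
            idx[j] += 1
    return np.array(out)
```

With two inputs spiking at 1 ms and 2 ms and $\Delta_c = 3$ ms, the cell emits a single spike at 2 ms, the latest of the coincident arrivals.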

According to Krips and Furst,^{10} CD cells exhibit NHPP behavior when their inputs are NHPP point processes themselves. As a result, their output can be computed analytically. The expression for the $E{E}_{M}$ cell’s IR was obtained using this approach:

${\lambda}_{E{E}_{M}}\left(t|{\Psi}_{\lambda}\right)={\displaystyle \sum}_{m=1}^{M}\left[{\lambda}_{{E}_{m}}\left(t\right)\cdot {\displaystyle \prod}_{\tilde{m}=1,\tilde{m}\ne m}^{M}\underset{t-{\Delta}_{c}}{\overset{t}{{\displaystyle \int}}}{\lambda}_{{E}_{\tilde{m}}}\left(\tau \right)d\tau \right]$ 2

Despite the diversity of the $E{E}_{M}$ cell’s inputs, it is reasonable to presume that the firing rates of the $M$ neurons in response to a given stimulus would be similar on average, therefore:

${\lambda}_{{E}_{m}}(t)\text{}\stackrel{\Delta}{\text{=}}{\lambda}_{E}(t),\forall m\in \left\{1,\mathrm{..},M\right\}$ 3

where $m$ denotes the input cell index.

The $E{E}_{M}$ cell’s output, ${\lambda}_{E{E}_{M}}$, may be described as follows:

${\lambda}_{E{E}_{M}}\left(t|{\Psi}_{\lambda}\right)=M\cdot {\lambda}_{E}\left(t\right)\cdot {\underset{{I}_{c}\left(t\right)}{\underbrace{\left(\underset{t-{\Delta}_{c}}{\overset{t}{{\displaystyle \int}}}{\lambda}_{E}\left(\tau \right)d\tau \right)}}}^{M-1}$ 4

where ${I}_{c}$ represents the coincidence integral.
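Substituting the identical-rate assumption of Eq. (3) into Eq. (2) makes every summand equal, which gives Eq. (4) directly:

$\lambda_{EE_M}\left(t\mid\Psi_\lambda\right)=\sum_{m=1}^{M}\lambda_{E}\left(t\right)\prod_{\tilde m=1,\tilde m\ne m}^{M}\int_{t-\Delta_c}^{t}\lambda_{E}\left(\tau\right)d\tau = M\cdot\lambda_{E}\left(t\right)\cdot\left(\int_{t-\Delta_c}^{t}\lambda_{E}\left(\tau\right)d\tau\right)^{M-1}$

since each of the $M$ summands contains the same factor $\lambda_E(t)$ multiplied by $M-1$ identical integrals.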

A discrete $E{E}_{M}$ cell’s output, ${\lambda}_{E{E}_{M}}\left[n\right]$, can be obtained using a discrete approximation of the coincidence integral ${I}_{c}$, with the time domain discretized into ${N}_{c}$ equal panels, each of size ${\delta}_{s}$. Applying the trapezoidal rule, an approximation for ${I}_{c}$ is given by:

$\underset{t-{\Delta}_{c}}{\overset{t}{{\displaystyle \int}}}{\lambda}_{E}\left(\tau \right)d\tau \simeq \left(\frac{1}{2}{\lambda}_{E}\left({\tau}_{1}\right)+{\lambda}_{E}\left({\tau}_{2}\right)+\mathrm{..}+{\lambda}_{E}\left({\tau}_{{N}_{c}-1}\right)+\frac{1}{2}{\lambda}_{E}\left({\tau}_{{N}_{c}}\right)\right)\cdot {\delta}_{s}$ 5

where ${N}_{c}={\Delta}_{c}\cdot {f}_{s}$ is the discrete integration window length, ${\tau}_{i}=t\cdot {f}_{s}+i$ is the discrete time index, ${\delta}_{s}=\frac{1}{{f}_{s}}$ is the sample period, and ${f}_{s}$ is the sample rate.

As a consequence, in the discrete-time domain, the coincidence integral can be computed by convolving $\lambda \left[n\right]$ with the following finite impulse response (FIR) filter ${h}_{fir}\left[n\right]$ :

$\begin{array}{l}{h}_{fir}\left[n\right]={\left[\frac{1}{2},1,\mathrm{..},1,\frac{1}{2}\right]}_{{N}_{c}}\cdot {\delta}_{s}\hfill \\ {I}_{c}\left[n\right]={\lambda}_{E}\left[n\right]\text{*}{h}_{fir}\left[n\right]\hfill \end{array}\}$ 6

Finally, the discrete $E{E}_{M}$ cell’s IR, ${\lambda}_{E{E}_{M}}\left[n\right]$ , was obtained by:

${\lambda}_{E{E}_{M}}\left[n|{\Psi}_{\lambda}\right]=M\cdot {\lambda}_{E}\left[n\right]\cdot {\left({\lambda}_{E}\left[n\right]\ast {h}_{fir}\left[n\right]\right)}^{M-1}$ 7

The corresponding CD cells’ IRs are generated in this manner from each of the $K$ received vectors of ANFs’ IRs.
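Eqs. (5)–(7) can be combined into a short numerical sketch for a single tonotopic channel. This is illustrative only, assuming NumPy; the function name and the causal alignment of the convolution are choices made here, not specified in the paper:

```python
import numpy as np

def ee_rate_discrete(lam, M, delta_c, fs):
    """Discrete EE_M instantaneous rate, Eq. (7): the input rate lam[n]
    times the (M-1)-th power of the trapezoidal coincidence integral.

    lam: instantaneous rate of one input (spikes/s), sampled at fs.
    """
    n_c = int(round(delta_c * fs))             # window length in samples, N_c
    h_fir = np.ones(n_c)                       # trapezoidal FIR weights, Eq. (6)
    h_fir[0] = h_fir[-1] = 0.5
    h_fir /= fs                                # delta_s = 1/fs
    i_c = np.convolve(lam, h_fir)[: len(lam)]  # causal coincidence integral I_c[n]
    return M * lam * i_c ** (M - 1)
```

For a constant input rate ${\lambda}_{E}$, the integral settles at ${\lambda}_{E}\cdot{\Delta}_{c}$, recovering the rate scaling of Eq. (4).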

**Speech presence estimation**

When an interfering noise coincides in frequency and time with a signal of interest, the two overlap on the basilar membrane, causing the signal and the noise to compete for the same receptors. Let ${\lambda}_{K}\left(n\right)$ be an IR random vector distributed across $K$ cochlear partitions, as a function of time. In the neural activity domain, according to the tonotopic organization of the auditory system, it can be assumed that the neural response is an additive mixture of clean speech ${\lambda}^{Speech}\left(n\right)$ and acoustic noise ${\lambda}^{Noise}\left(n\right)$.

Two hypotheses, ${H}_{1}\left[n\right]$ and ${H}_{2}\left[n\right]$, were suggested, indicating speech absence and speech presence, respectively:

$\begin{array}{l}{H}_{1}\left[n\right]:Y\left(n\right)={\lambda}^{Noise}\left[n\right]\hfill \\ {H}_{2}\left[n\right]:Y\left(n\right)={\lambda}^{Speech}\left[n\right]+{\lambda}^{Noise}\left[n\right]\hfill \end{array}\}$ 8

The process of separating an auditory scene into distinct objects was modeled as an unbiased optimal estimator of the SPP, which is the probability of speech being present in a noisy observation. Motivated by the central limit theorem,^{17} the IR’s distribution, $\lambda$, was assumed to be a superposition of multivariate Gaussians generated by two parent processes:

$p\left(\lambda \right)={\displaystyle \sum}_{i=1}^{2}{\pi}_{i}\mathcal{N}\left(\lambda |{\mu}_{i},{\Sigma}_{i}\right),\quad \text{s.t.}\quad {\displaystyle \sum}_{i=1}^{2}{\pi}_{i}=1$ 9

where $\mathcal{N}$ denotes a multivariate normal distribution function, ${\pi}_{1,2}$ denote the prior probabilities of $\lambda \in {H}_{1,2}$, ${\mu}_{1,2}$ denote the Gaussian means, and ${\Sigma}_{1,2}$ denote the Gaussian covariance matrices. Due to the statistical independence of ANFs across multiple characteristic frequencies, it was reasonable to hypothesize that any two different $\lambda$ components are uncorrelated. The off-diagonal correlations were set to *zero*, resulting in diagonal covariance matrices ${\Sigma}_{1,2}$; therefore $\mathcal{N}\left(\lambda \right)$ yielded:

$\mathcal{N}\left(\lambda |\mu ,\Sigma \right)=\frac{1}{{\left(2\pi \right)}^{K/2}}{\displaystyle \prod}_{k=1}^{K}\frac{1}{{\sigma}_{k}}\text{exp}\left\{-\frac{1}{2}{\left(\frac{{\lambda}_{k}-{\mu}_{k}}{{\sigma}_{k}}\right)}^{2}\right\}$ 10

where $k$ and ${\sigma}_{1,2}$ denote the cochlear position index and the Gaussians’ standard deviations, respectively.
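Eq. (10) amounts to a product of $K$ independent one-dimensional Gaussians across cochlear channels. A minimal sketch (the function name is an assumption made here):

```python
import numpy as np

def diag_gauss_pdf(lam, mu, sigma):
    """Multivariate normal density with diagonal covariance, Eq. (10):
    a product of K independent 1-D Gaussians across cochlear channels.

    lam, mu, sigma: arrays of length K (rate vector, means, std devs).
    """
    z = (lam - mu) / sigma
    norm = (2 * np.pi) ** (len(lam) / 2) * np.prod(sigma)
    return np.exp(-0.5 * np.sum(z ** 2)) / norm
```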

The problem was addressed as an optimization problem, with the objective of estimating a set of parameters that best fit the joint probability of the hypotheses, and was solved using the expectation-maximization (EM) approach.^{18}

Let $Z$ be the latent variable that determines the component from which $\lambda$ originates, s.t.,

$P\left(\lambda |Z=z\right)\sim \mathcal{N}\left({\mu}_{z},{\Sigma}_{z}\right)$ 11

During the expectation step, the weights ${w}_{j}\left[n\right]$ were defined as a ’soft’ assignment of $\lambda \left[n\right]$ to Gaussian $j,$

${w}_{j}\left[n\right]=P\left(z=j|\lambda \left[n\right];\theta \right)$ 12

where $\theta $ indicates the parameters set of the model ($\theta =\left\{\mu ,\sigma ,\pi \right\}$ ).

A new parameter set $\theta$ was estimated throughout the maximization step by maximizing the expected log-likelihood with respect to the weights,

$\underset{\theta}{\arg \max}{\displaystyle \sum}_{n=1}^{N}{\displaystyle \sum}_{j=1}^{2}{w}_{j}\left[n\right]\log\left({\pi}_{j}\mathcal{N}\left(\lambda \left[n\right];{\mu}_{j},{\sigma}_{j}^{2}\right)\right)$ 13

Given an initial estimate, the EM algorithm cycles through Eqs. (12) and (13) repeatedly until the estimates converge.

The entire algorithm for estimating the statistical properties of both the speech and the noise neural activities is illustrated in Algorithm 1.

**Data:** ${\lambda}_{1,\dots ,N}$
**Result:** ${\mathcal{N}}_{j=1,2}\left(\lambda |{\mu}_{j},{\Sigma}_{j}\right)$
**while** ${\theta}_{t+1}\ne {\theta}_{t}$ **do**
  **E-step:** for each $n,j$ **do**
    ${w}_{j}\left[n\right]=\frac{{\pi}_{j}\cdot \mathcal{N}\left(\lambda \left[n\right]|{\mu}_{j},{\sigma}_{j}\right)}{{\sum}_{i=1}^{2}{\pi}_{i}\cdot \mathcal{N}\left(\lambda \left[n\right]|{\mu}_{i},{\sigma}_{i}\right)}$ (14)
  **M-step:** for each $j$ **do**
    ${\mu}_{j}=\frac{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]\cdot \lambda \left[n\right]}{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]}$ (15)
    ${\sigma}_{j}^{2}=\frac{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]\cdot {\left(\lambda \left[n\right]-{\mu}_{j}\right)}^{2}}{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]}$ (16)
    ${\pi}_{j}=\frac{{\sum}_{n=1}^{N}{w}_{j}\left[n\right]}{N}$ (17)
**end**

**Algorithm 1** Estimating the speech presence probability using the EM algorithm with a multivariate normal distribution and diagonal covariance matrices.

After estimating all the parameters, the SPP can be obtained by:

$SPP\left(\lambda |\mu ,\sigma \right)=\frac{{\pi}_{i}\mathcal{N}\left(\lambda |{\mu}_{i},{\sigma}_{i}\right)}{{{\displaystyle \sum}}_{j=1}^{2}{\pi}_{j}\mathcal{N}\left(\lambda |{\mu}_{j},{\sigma}_{j}\right)},i\in {H}_{2}$ 18
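Algorithm 1 and Eq. (18) can be sketched for a single channel (the paper's estimator operates on the $K$-dimensional diagonal model). The percentile-based initialization and fixed iteration count below are assumptions made here for illustration:

```python
import numpy as np

def em_spp(lam, n_iter=200):
    """Minimal 1-D sketch of Algorithm 1: fit a two-component Gaussian
    mixture to rate samples lam[n] with EM, then return the SPP of
    Eq. (18) as the posterior of the higher-mean component."""
    def weighted_pdf(mu, sigma, pi):
        # pi_j * N(lam | mu_j, sigma_j) for j = 1, 2
        return np.stack([pi[j] * np.exp(-0.5 * ((lam - mu[j]) / sigma[j]) ** 2)
                         / (np.sqrt(2 * np.pi) * sigma[j]) for j in range(2)])

    mu = np.array([np.percentile(lam, 25), np.percentile(lam, 75)])
    sigma = np.full(2, lam.std() + 1e-12)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step, Eq. (14): responsibilities w_j[n]
        p = weighted_pdf(mu, sigma, pi)
        w = p / p.sum(axis=0)
        # M-step, Eqs. (15)-(17): update means, variances, priors
        nk = w.sum(axis=1)
        mu = (w * lam).sum(axis=1) / nk
        sigma = np.sqrt((w * (lam - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-12
        pi = nk / len(lam)
    p = weighted_pdf(mu, sigma, pi)
    return p[int(np.argmax(mu))] / p.sum(axis=0)   # Eq. (18)
```

The higher-mean component is taken as "speech present", reflecting the assumption that speech adds neural activity on top of the noise floor.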

**Evaluation method**

An effective method for evaluating the ability of a speech estimator to separate speech from noise is to examine the area under the receiver-operating characteristic curve (AUC), with a higher AUC indicating better performance. Threshold values in the range $\left[0,1\right]$ were applied to the SPP outputs to categorize them as speech present or absent. For each threshold, the true positive rate and false positive rate were determined by calculating the proportion of correctly identified speech-containing segments and incorrectly identified noise segments, respectively. The ground truth used for the evaluation was manually labeled by marking which segments contain speech and which contain only noise.
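The threshold sweep described above can be sketched as follows; the number of thresholds and the trapezoidal integration of the ROC curve are choices made here for illustration:

```python
import numpy as np

def auc_from_spp(spp, labels, n_thresh=101):
    """Sweep thresholds over [0, 1], compute true/false positive rates
    of the binarized SPP against ground-truth labels, and integrate
    the resulting ROC curve to obtain the AUC."""
    tpr, fpr = [], []
    for th in np.linspace(0.0, 1.0, n_thresh):
        pred = spp >= th
        tpr.append(np.mean(pred[labels == 1]))  # hits on speech frames
        fpr.append(np.mean(pred[labels == 0]))  # false alarms on noise frames
    tpr, fpr = np.array(tpr), np.array(fpr)
    order = np.lexsort((tpr, fpr))              # sort by FPR, break ties by TPR
    return float(np.trapz(tpr[order], fpr[order]))
```

A perfectly separating estimator yields an AUC of 1, while chance-level performance yields 0.5.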

For the evaluation, a total of thirty speech utterances were taken from the NOIZEUS database, a noisy speech corpus.^{19} The sentences were degraded with three different types of real-world noise: car, white, and babble. Interfering signals were added at signal-to-noise ratios (SNRs) ranging from -15 to 15 dB, using method B of *ITU-T P.56*.^{20}
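The SNR mixing step can be sketched as below. Note that this simplified version scales the noise by plain RMS power, whereas the paper uses the active speech level of ITU-T P.56 method B:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale a noise signal so the speech-to-noise power ratio equals
    snr_db, then add it to the speech. Simplified sketch: plain RMS
    power stands in for the ITU-T P.56 active speech level."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```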

**Auditory periphery response**

Figure 2 illustrates the relationship between the cochlear response and cochlear position at different frequencies, when a linear chirp stimulus is applied at a sound pressure level (SPL) of 65 dB. The derived ANFs IRs are displayed in a color-coded format, demonstrating how the response varies with changes in input frequency along the cochlear partition.

**Figure 2** ANF IR derivation as a response to a linear chirp. The frequency (in kHz) is plotted along the x-axis, while the corresponding distance from the stapes (in cm) is represented on the y-axis and denoted by ’x’.

**Example outcome**

Figure 3 depicts an example of the model’s outputs in response to the English phrase “*We find joy in*” at a level of 65 dB SPL. The sentence was taken from track number 7 of the *NOIZEUS database*.^{19}

Figure 3 comprises panels that depict various variables and environmental conditions. The left and right columns, denoted Panels A and B respectively, display the model’s inputs and outputs for noisy speech degraded by car noise at SNRs of 0 dB and 15 dB. Panels A1 and B1 show the acoustic waveforms, while Panels A2 and B2 present the ANFs’ IRs as color-coded graphs in spikes/sec, with the x-axis representing post-stimulus time and the y-axis representing distance from the stapes. In Panels A3 and B3, the ANFs’ SPPs are displayed, with gray backgrounds indicating binary flags for speech presence (1) or absence (0). Although the SPP at 15 dB SNR matches the manually labeled speech presence, the SPP at 0 dB SNR does not clearly indicate it, regardless of the speech’s presence. Panels A4 and B4 display the CD cells’ IRs, while Panels A5 and B5 show their SPPs. The results show that the SPPs computed after CD processing follow the speech patterns and match the manual labels more closely, even when the energy of the background noise equals that of the speech signal.

**Figure 3** The acoustic waveforms, ANFs’ IRs, CD cells’ IRs and their corresponding SPPs in response to the English sentence “*We find joy in*” at a level of 65 dB SPL. The sample was obtained from file ’sp07.wav’ of the NOIZEUS database between 0 s and 1.20 s. Panels A1 and B1 respectively display the acoustic waveform for noisy speech stimuli degraded by car noise at SNRs of 0 dB and 15 dB. Panels A2 and B2 illustrate the ANFs’ responses. Panels A3 and B3 show the corresponding ANFs’ SPPs. Panels A4 and B4 display the response of the CD cells’ network (with parameters $M=6$ and ${\Delta}_{c}=3\,ms$). Panels A5 and B5 provide the corresponding SPPs of the CD cells’ response.

**Coincidence detection cell parameters tuning**

To determine the optimal architecture for the CD cell, we systematically varied the number of input cells ($M$) and the coincidence window $({\Delta}_{c})$, as specified in Eq. (5). The results are presented in Figure 4. Based on these results, we selected $M=6$ and ${\Delta}_{c}=3\,ms$ as the parameters to be used in the evaluation. These parameter values correspond to those of actual CD cells found in the inferior colliculus and the ventral cochlear nucleus.^{21–23}

**Figure 4** A color-coded graph of the AUC for speech degraded by car noise at an SNR of 0 dB, for various combinations of input cells $(M)$ and coincidence window lengths $\left({\Delta}_{c}\right)$. The speech was obtained from file ’sp09.wav’ of the NOIZEUS database.

**Speech presence estimators**

Figure 5 presents a comparison between CD-based and ANF-based estimators. Figure 5A shows the noises’ power spectral densities, while the average AUC scores of the 30 sentences, with corresponding standard deviations, are plotted as a function of SNR for three types of noise: babble noise (Figure 5B), white noise (Figure 5C), and car noise (Figure 5D).

**Figure 5** A comparison between ANF-based and CD-based estimators (with parameters $M=6$, ${\Delta}_{c}=3\,ms$) for a healthy cochlea. The power spectral density and the AUC scores for three different real-world noises (babble, white, and car) at SNRs of −15 to 15 dB are shown in panels A, B, C, and D, respectively.

Both ANF-based and CD-based estimators showed an increase in average AUC with increasing SNR. However, CD-based estimators outperformed ANF-based estimators for all tested SNRs and noise types, with the most significant improvement observed at mid-to-low input SNRs. The difference in performance was assessed with ANOVA and was significant for all noise types and SNRs ($P<.001$). For $SNR\ge 10\,dB$, the performance yielded by the ANFs was already high ($AUC\ge 0.9$), so CD processing provided only a minor improvement. However, for $SNR\approx 0\,dB$, the ANFs yielded $AUC\approx 0.7$ for all noise types, and the additional CD processing raised this to $AUC\approx 0.9$. On the other hand, at very low SNRs, for example $SNR=-15\,dB$, where ANF performance was close to chance ($AUC\approx 0.5$), the improvement yielded by CD processing was small.

In this paper, a speech segregation model based on the physiology of the auditory pathway is presented. The proposed excitatory-only coincidence detection (CD) architecture demonstrates its effectiveness in reducing noise components in stationary noise while concurrently improving the accuracy of speech segregation. These findings highlight the potential of CD cells to contribute significantly to enhancing speech perception. To ensure broad applicability and avoid over-fitting, the models and assumptions were simplified. Using an unsupervised optimal estimator further strengthens the study’s findings, as it provides unbiased insights into the neural representation of CD processing.

CD cells are widely distributed across various auditory nuclei, with a significant presence in the trapezoid body nuclei, where they play a central role in binaural perception.^{24–26} Binaural processes have been demonstrated to enhance speech segregation,^{27,28} implying that CD cells may be involved in this aspect of auditory perception. However, speech segregation can also occur monaurally. In natural acoustic signals, amplitude modulation (AM) serves as a critical temporal feature, and its significance has been highlighted in various perceptual tasks, such as envelope detection and segregation.^{29} Notably, CD cells have been linked to AM processing.^{9,30} Furthermore, envelope and temporal fine structure information are known to be important for speech perception.^{31–33} The CD cells presented in this paper function as auto-correlation units, effectively enhancing this information, which is essential for speech segregation. These findings provide valuable insights into the neural mechanisms underlying auditory processing.

While the tonotopic representation used in the estimator was found to be effective, it is important to acknowledge its limitations. The assumption of independence between different characteristic frequencies may not always hold true. Although spike generation in different auditory nerve fibers (ANFs) is statistically independent, the tuning curves of ANFs have a long low-frequency tail, and the tips of the curves broaden and decrease at higher sound pressure levels (SPLs).^{34–36} Consequently, the synaptic drive to different ANFs across the cochlear length is not entirely independent. Future investigations should incorporate more sophisticated models that account for the interactions between frequency channels. Moreover, an alternative architecture incorporating inhibitory inputs may be more effective for other types of noises or conditions. Future work should also consider including inhibitory inputs and evaluating the model’s performance against different noise types.

Two distinct methods for speech estimation were compared: one based on coincidence detection and the other on auditory nerve fibers. CD-based estimators consistently outperformed ANF-based estimators across all tested SNRs and noise types. The improvement was most significant for mid-low input SNRs. These findings suggested that CD information plays a crucial role in speech segregation, contributing significantly to the enhanced performance of the model.

This research was partially supported by the ISRAEL SCIENCE FOUNDATION: grant No. 563/12.

The authors declare that there are no conflicts of interest.

None.

1. Li N, Loizou PC. Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. *J Acoust Soc Am*. 2008;123(3):1673–1682.
2. Wang D, Kjems U, Pedersen MS, et al. Speech intelligibility in background noise with ideal binary time-frequency masking. *J Acoust Soc Am*. 2009;125(4):2336–2347.
3. Bregman AS. *Auditory scene analysis: the perceptual organization of sound*. Cambridge: The MIT Press; 1990.
4. Cohen I, Berdugo B. Noise estimation by minima controlled recursive averaging for robust speech enhancement. *IEEE Signal Processing Letters*. 2002;9:12–15.
5. Paliwal K, Schwerin B, Wójcicki K. Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator. *Speech Communication*. 2012;54:282–305.
6. May T, Dau T. Computational speech segregation based on an auditory-inspired modulation analysis. *J Acoust Soc Am*. 2014;136(6):3350–3359.
7. Han K, Wang D. A classification based approach to speech segregation. *J Acoust Soc Am*. 2012;132(5):3475–3483.
8. Wang D. Speech separation by humans and machines. In: Divenyi P, editor. Boston: Springer US; 2005:181–197.
9. Joris PX, Schreiner CE, Rees A. Neural processing of amplitude-modulated sounds. *Physiological Reviews*. 2004;84(2):541–577.
10. Krips R, Furst M. Stochastic properties of coincidence-detector neural cells. *Neural Computation*. 2009;21(9):2524–2553.
11. Cohen A, Furst M. Integration of outer hair cell activity in a one-dimensional cochlear model. *J Acoust Soc Am*. 2004;115(5 Pt 1):2185–2192.
12. Barzelay O, Furst M. Cochlear model with integrated tectorial membrane and outer hair cells. *AIP Conference Proceedings*. 2011;1403:79–84.
13. Sabo D, Barzelay O, Weiss S, et al. Fast evaluation of a time-domain non-linear cochlear model on GPUs. *Journal of Computational Physics*. 2014;265:97–112.
14. Furst M. Cochlear model for hearing loss. *IntechOpen*; 2015.
15. Faran M, Furst M. Inner-hair-cell induced hearing loss: a biophysical modeling perspective. *J Acoust Soc Am*. 2023;153:1776–1790.
16. Zilany MSA, Bruce IC, Nelson PC, et al. A phenomenological model of the synapse between the inner hair cell and auditory nerve: long-term adaptation with power-law dynamics. *J Acoust Soc Am*. 2009;126:2390–2412.
17. Ephraim Y, Malah D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. *IEEE Transactions on Acoustics, Speech, and Signal Processing*. 1984;32:1109–1121.
18. Moon T. The expectation-maximization algorithm. *IEEE Signal Processing Magazine*. 1996;13:47–60.
19. Hu Y, Loizou PC. Subjective comparison and evaluation of speech enhancement algorithms. *Speech Communication*. 2007;49(7):588–601.
20. ITU-T P.56. Objective measurement of active speech level. ITU; 2011.
21. McGinley MJ, Oertel D. Rate thresholds determine the precision of temporal integration in principal cells of the ventral cochlear nucleus. *Hearing Research*. 2006;216-217:52–63.
22. Wenstrup JJ, Nataraj K, Sanchez JT. Mechanisms of spectral and temporal integration in the mustached bat inferior colliculus. *Front Neural Circuits*. 2012;6:75.
23. Chen Y, Zhang H, Wang H. The role of coincidence detector neurons in the reliability and precision of subthreshold signal detection in noise. *PLoS ONE*. 2013;8:e56822.
24. Yin T, Chan J. Interaural time sensitivity in medial superior olive of cat. *J Neurophysiol*. 1990;64(2):465–488.
25. McAlpine D, Jiang D, Shackleton TM, et al. Convergent input from brainstem coincidence detectors onto delay-sensitive neurons in the inferior colliculus. *J Neurosci*. 1998;18(15):6026–6039.
26. Caspary DM, Ling L, Turner JG, et al. Superior olivary complex: functional neuropharmacology of the principal cell types. *The Journal of Experimental Biology*. 2008;211:1781–1791.
27. Rennies J, Best V, Roverud E, et al. Energetic and informational components of speech-on-speech masking in binaural speech intelligibility and perceived listening effort. *Trends Hear*. 2019;23:2331216519854597.
28. Roman N, Srinivasan S, Wang D. Binaural segregation in multisource reverberant environments. *J Acoust Soc Am*. 2006;120(6):4040–4051.
29. Yost WA. Auditory perception. In: *Fundamentals of Hearing*. Brill; 2006:203–221.
30. Nelson PC, Carney LH. Neural rate and timing cues for detection and discrimination of amplitude-modulated tones in the awake rabbit inferior colliculus. *J Neurophysiol*. 2007;97(1):522–539.
31. Ahissar E, Ahissar M. Processing of the temporal envelope of speech. In: *The Auditory Cortex*. Routledge; 2005:313–332.
32. Rosen S. Temporal information in speech: acoustic, auditory and linguistic aspects. *Philos Trans R Soc Lond B Biol Sci*. 1992;336(1278):367–373.
33. Shannon RV, Zeng FG, Kamath V, et al. Speech recognition with primarily temporal cues. *Science*. 1995;270(5234):303–304.
34. Shera CA, Guinan JJ, Oxenham AJ. Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. *Proc Natl Acad Sci U S A*. 2002;99:3318–3323.
35. Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. *Hearing Research*. 1990;47:103–138.
36. Brownell WE, Bader CR, Bertrand D, et al. Evoked mechanical responses of isolated cochlear outer hair cells. *Science*. 1985;227:194–196.

©2023 Zorea, et al. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.
