Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard University Press.

Baudry, J.-P. (2015). Estimation and model selection for model-based clustering with the conditional classification likelihood. Electronic Journal of Statistics, 9, 1041–1077.

Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719–725.

Bishop, C. M., & Svensen, M. (2002). Bayesian hierarchical mixture of experts. In *Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence*.

Bock, A. S., & Fine, I. (2014). Anatomical and functional plasticity in early blind individuals and the mixture of experts architecture. Frontiers in Human Neuroscience, 8, 971.

Bohning, D., & Lindsay, B. G. (1988). Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40, 641–663.

Boos, D. D., & Stefanski, L. A. (2013). Essential statistical inference: Theory and methods. New York, NY: Springer.

Bradshaw, N. P., Duchateau, A., & Bersini, H. (1997). Global least-squares vs. EM training for the Gaussian mixture of experts. In *Proceedings of the International Conference on Artificial Neural Networks 1997*.

Camplani, M., del Blanco, C. R., Salgado, L., Jaureguizar, F., & Garcia, N. (2014). Multi-sensor background subtraction by fusing multiple region-based probabilistic classifiers. Pattern Recognition Letters, 50, 23–33.

Carvalho, A. X., & Skoulakis, G. (2010). Time series mixtures of generalized t experts: ML estimation and an application to stock return density forecasting. Econometric Reviews, 29, 642–687.

Carvalho, A. X., & Tanner, M. A. (2005a). Mixtures-of-experts of autoregressive time series: Asymptotic normality and model specification. IEEE Transactions on Neural Networks, 16, 39–56.

Carvalho, A. X., & Tanner, M. A. (2005b). Modeling nonlinear time series with local mixtures of generalized linear models. The Canadian Journal of Statistics, 33, 97–113.

Carvalho, A. X., & Tanner, M. A. (2007). Modelling nonlinear count time series with local mixtures of Poisson autoregressions. Computational Statistics and Data Analysis, 51, 5266–5294.

Chamroukhi, F. (2016). Robust mixture of experts modeling using the t distribution. Neural Networks, 79, 20–36.

Chamroukhi, F. (2017). Robust mixture of experts modeling using the skew t distribution. Neurocomputing, 266, 390–408.

Chamroukhi, F., Glotin, H., & Same, A. (2013). Model-based functional mixture discriminant analysis with hidden process regression for curve classification. Neurocomputing, 112, 153–163.

Chamroukhi, F., Same, A., Govaert, G., & Aknin, P. (2009). Time series modeling by a regression approach based on a latent process. Neural Networks, 22, 593–602.

Chamroukhi, F., Same, A., Govaert, G., & Aknin, P. (2010). A hidden process regression model for functional data description. Application to curve discrimination. Neurocomputing, 73, 1210–1221.

Chen, K., Xu, L., & Chi, H. (1999). Improved learning algorithms for mixture of experts in multiclass classification. Neural Networks, 12, 1229–1252.

Cohen, S. X., & Le Pennec, E. (2014). Unsupervised segmentation of spectral images with a spatialized Gaussian mixture model and model selection. Oil and Gas Science and Technology – Revue d'IFP Energies nouvelles, 69, 245–259.

Cotter, N. E. (1990). The Stone-Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks, 1, 290–295.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.

DasGupta, A. (2008). Asymptotic theory of statistics and probability. New York, NY: Springer.

DasGupta, A. (2011). Probability for statistics and machine learning. New York, NY: Springer.

Deleforge, A., Forbes, F., & Horaud, R. (2015). High-dimensional regression with Gaussian mixtures and partially-latent response variables. Statistics and Computing, 25, 893–911.

Eavani, H., Hsieh, M. K., An, Y., Erus, G., Beason-Held, L., Resnick, S., & Davatzikos, C. (2016). Capturing heterogeneous group differences using mixture-of-experts: Application to a study of aging. NeuroImage, 125, 498–514.

Emani, M. K., & O'Boyle, M. (2015). Celebrating diversity: A mixture of experts approach for runtime mapping in dynamic environments. ACM SIGPLAN Notices, 50, 499–508.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Gan, L., & Jiang, J. (1999). A test for global maximum. Journal of the American Statistical Association, 94, 847–854.

Grun, B., & Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics and Data Analysis, 51, 5247–5252.

Grun, B., & Leisch, F. (2008). Flexmix version 2: Finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28, 1–35.

Hayashi, F. (2000). Econometrics. Princeton, NJ: Princeton University Press.

He, H., Boyd-Graber, J., Kwon, K., & Daume III, H. (2016). Opponent modeling in deep reinforcement learning. In *Proceedings of the 33rd International Conference on Machine Learning*.

Henderson, H. V., & Searle, S. R. (1979). Vec and vech operators for matrices with some uses in Jacobians and multivariate statistics. The Canadian Journal of Statistics, 7, 65–81.

Herzig, J., Bickel, A., Eitan, A., & Intrator, N. (2015). Monitoring cardiac stress using features extracted from S1 heart sounds. IEEE Transactions on Biomedical Engineering, 62, 1169–1178.

Huerta, G., Jiang, W., & Tanner, M. A. (2003). Time series modeling via hierarchical mixtures. Statistica Sinica, 13, 1097–1118.

Ingrassia, S., Minotti, S. C., & Vittadini, G. (2012). Local statistical modeling via a cluster-weighted approach with elliptical distributions. Journal of Classification, 29, 363–401.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics, 40, 633–643.

Jiang, W., & Tanner, M. A. (1999a). Hierarchical mixtures-of-experts for exponential family regression models: Approximation and maximum likelihood estimation. Annals of Statistics, 27, 987–1011.

Jiang, W., & Tanner, M. A. (1999b). On the identifiability of mixture-of-experts. Neural Networks, 12, 1253–1258.

Jiang, W., & Tanner, M. A. (2000). On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models. IEEE Transactions on Information Theory, 46, 1005–1013.

Jones, P. N., & McLachlan, G. J. (1992). Fitting finite mixture models in a regression context. Australian Journal of Statistics, 34, 233–240.

Jordan, M. I., & Jacobs, R. A. (1992). Hierarchies of adaptive experts. In *Advances in Neural Information Processing Systems* (pp. 985–992).

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.

Kalliovirta, L., Meitz, M., & Saikkonen, P. (2015). A Gaussian mixture autoregressive model for univariate time series. Journal of Time Series Analysis, 36, 247–266.

Kalliovirta, L., Meitz, M., & Saikkonen, P. (2016). Gaussian mixture vector autoregression. Journal of Econometrics, 192, 485–498.

Karimu, R. Y., & Azadi, S. (2017). Diagnosing the ADHD using a mixture of expert fuzzy models. International Journal of Fuzzy Systems, 1–5.

Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhya A, 62, 49–65.

Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. The Canadian Journal of Statistics, 38, 519–539.

Khalili, A., & Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102, 1025–1038.

Khalili, A., & Lin, S. (2013). Regularization in finite mixture of regression models with diverging number of parameters. Biometrics, 69, 436–446.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Lange, K. (2016). MM optimization algorithms. Philadelphia, PA: SIAM.

Lee, Y.-S., & Cho, S.-B. (2014). Activity recognition with android phone using mixture-of-experts co-trained with labeled and unlabeled data. Neurocomputing, 126, 106–115.

Liem, R. P., Mader, C. A., & Martins, J. R. R. A. (2015). Surrogate models and mixtures of experts in aerodynamic performance prediction for aircraft mission analysis. Aerospace Science and Technology, 43, 126–151.

Lu, Z. (2006). A regularized minimum cross-entropy algorithm on mixture of experts for time series prediction and curve detection. Pattern Recognition Letters, 27, 947–955.

Masoudnia, S., & Ebrahimpour, R. (2014). Mixture of experts: A literature survey. Artificial Intelligence Review, 42, 275–293.

Massart, P. (2007). Concentration inequalities and model selection. New York, NY: Springer.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. Boca Raton, FL: CRC Press.

McLachlan, G. J. (1988). On the choice of starting values for the EM algorithm in fitting mixture models. The Statistician, 37, 417–425.

McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition. New York, NY: Wiley.

McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York, NY: Wiley.

Mendes, E. F., & Jiang, W. (2012). On convergence rates of mixture of polynomial experts. Neural Computation, 24, 3025–3051.

Meyer, R. R. (1976). Sufficient conditions for the convergence of monotonic mathematical programming algorithms. Journal of Computer and System Sciences, 12, 108–121.

Montuelle, L., & Le Pennec, E. (2014). Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach. Electronic Journal of Statistics, 8, 1661–1695.

Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135, 370–384.

Ng, S. K., & McLachlan, G. J. (2007). Extension of mixture-of-experts networks for binary classification of hierarchical data. Artificial Intelligence in Medicine, 41, 57–67.

Ng, S. K., & McLachlan, G. J. (2014). Mixture of random effects models for clustering multilevel growth trajectories. Computational Statistics and Data Analysis, 71, 43–51.

Ng, S.-K., & McLachlan, G. J. (2004). Using the EM algorithm to train neural networks: Misconceptions and a new algorithm for multiclass classification. IEEE Transactions on Neural Networks, 15, 738–749.

Nguyen, H. D. (2017). An introduction to MM algorithms for machine learning and statistical estimation. WIREs Data Mining and Knowledge Discovery, 7, e1198.

Nguyen, H. D., Lloyd-Jones, L. R., & McLachlan, G. J. (2016). A universal approximation theorem for mixture-of-experts models. Neural Computation, 28, 2585–2593.

Nguyen, H. D., & McLachlan, G. J. (2014). Asymptotic inference for hidden process regression models. In *Proceedings of the 2014 IEEE Statistical Signal Processing Workshop*.

Nguyen, H. D., & McLachlan, G. J. (2015). Maximum likelihood estimation of Gaussian mixture models without matrix operations. Advances in Data Analysis and Classification, 9, 371–394.

Nguyen, H. D., & McLachlan, G. J. (2016). Laplace mixture of linear experts. Computational Statistics and Data Analysis, 93, 177–191.

Norets, A., & Pelenis, J. (2014). Posterior consistency in conditional density estimation by covariate dependent mixtures. Econometric Theory, 30, 606–646.

Olteanu, M., & Rynkiewicz, J. (2011). Asymptotic properties of mixture-of-experts models. Neurocomputing, 74, 1444–1449.

Peralta, B., & Soto, A. (2014). Embedded local feature selection within mixture of experts. Information Sciences, 269, 176–187.

Perthame, E., Forbes, F., Olivier, B., & Deleforge, A. (2016). Non-linear robust regression in high dimension. In *The XXVIIIth International Biometric Conference*.

Prado, R., Molina, F., & Huerta, G. (2006). Multivariate time series modeling and classification via hierarchical VAR mixtures. Computational Statistics and Data Analysis, 51, 1445–1462.

R Core Team. (2016). *R: A language and environment for statistical computing*. R Foundation for Statistical Computing.

Razaviyayn, M., Hong, M., & Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23, 1126–1153.

Same, A., Chamroukhi, F., Govaert, G., & Aknin, P. (2011). Model-based clustering and segmentation of time series with change in regime. Advances in Data Analysis and Classification, 5, 301–321.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Schoenmakers, S., Guclu, U., van Gerven, M., & Heskes, T. (2015). Gaussian mixture models and semantic gating improve reconstructions from human brain activity. Frontiers in Computational Neuroscience, 8, 173.

Shohoudi, A., Khalili, A., Wolfson, D. B., & Asgharian, M. (2016). Simultaneous variable selection and de-coarsening in multi-path change-point models. Journal of Multivariate Analysis, 147, 202–217.

Stadler, N., Buhlmann, P., & van de Geer, S. (2010). ℓ1-penalization for mixture regression models. TEST, 19, 209–256.

Theis, L., & Bethge, M. (2015). Generative image modeling using spatial LSTMs. In *Advances in Neural Information Processing Systems*.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.

van den Oord, A., & Schrauwen, B. (2014). Factoring variations in natural images with deep Gaussian mixture models. In *Advances in Neural Information Processing Systems*.

Variani, E., McDermott, E., & Heigold, G. (2015). A Gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In *IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.

Wang, D., & Li, M. (2017). Stochastic configuration networks: Fundamentals and algorithms. IEEE Transactions on Cybernetics, 47, 3466–3479.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.

Xu, L., Jordan, M. I., & Hinton, G. E. (1995). An alternative model for mixtures of experts. In *Advances in Neural Information Processing Systems* (pp. 633–640).

Yuksel, S. E., & Gader, P. D. (2016). Context-based classification via mixture of hidden Markov model experts with applications in landmine detection. IET Computer Vision, 10, 873–883.

Yuksel, S. E., Wilson, J. N., & Gader, P. D. (2012). Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23, 1177–1193.

Zeevi, A., Meir, R., & Adler, R. J. (1999). Non-linear models for time series using mixtures of autoregressive models. *Preprint*.

Zeevi, A. J., Meir, R., & Maiorov, V. (1998). Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44, 1010–1025.

Zhou, H., & Lange, K. (2010). MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics, 19, 645–665.