The following is a list of journal and conference papers stemming from the work carried out to fulfill the project goals. These papers have either already been published or are at the submission/review stage. This page will be updated periodically to reflect new publications and submissions.

[1] (conference paper) Petros Giannakopoulos, Aggelos Pikrakis and Yannis Cotronis, “A Deep Reinforcement Learning Approach to Audio-based Navigation in a Multi-speaker Environment”, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-2021), 6-11 June 2021, Toronto, Ontario, Canada (link).

Abstract: In this work we use deep reinforcement learning to create an autonomous agent that can navigate in a two-dimensional space using only raw auditory sensory information from the environment, a problem that has received very little attention in the reinforcement learning literature. Our experiments show that the agent can successfully identify a particular target speaker among a set of N predefined speakers in a room and move itself towards that speaker, while avoiding collision with other speakers or going outside the room boundaries. The agent is shown to be robust to speaker pitch shifting and it can learn to navigate the environment, even when a limited number of training utterances are available for each speaker.
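The navigation task above can be illustrated with a toy sketch (this is not the paper's code or environment; the inverse-square level model, the greedy policy standing in for the learned agent, and all coordinates below are assumptions made purely for illustration): an agent in a 2D room repeatedly moves to whichever neighboring position receives the target source's sound at the highest level.

```python
import numpy as np

def received_level(src, pos, power=1.0):
    """Inverse-square attenuation of the source's power at the agent's position."""
    return power / (1e-6 + np.sum((src - pos) ** 2))

target = np.array([8.0, 3.0])  # hypothetical target speaker position
pos = np.array([1.0, 1.0])     # agent's starting position
step = 0.5

for _ in range(100):
    # Probe the four candidate moves and greedily pick the one that yields
    # the highest received level (a stand-in for the learned audio policy).
    moves = [np.array(d) for d in ((step, 0), (-step, 0), (0, step), (0, -step))]
    pos = max((pos + m for m in moves), key=lambda p: received_level(target, p))
    if np.linalg.norm(target - pos) < step:
        break
```

In the paper the agent instead learns its policy from raw audio waveforms; the greedy level-following rule here only conveys the flavor of the task.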

[2] (journal paper) Petros Giannakopoulos, Aggelos Pikrakis and Yiannis Cotronis, “Improving Post-Processing of Audio Event Detectors Using Reinforcement Learning,” in IEEE Access, vol. 10, pp. 84398-84404, 2022, doi: 10.1109/ACCESS.2022.3197907 (link).

Abstract: We apply post-processing to the class probability distribution outputs of audio event classification models and employ reinforcement learning to jointly discover the optimal parameters for various stages of a post-processing stack, such as the classification thresholds and the kernel sizes of median filtering algorithms used to smooth out model predictions. To achieve this we define a reinforcement learning environment where: 1) a state is the class probability distribution provided by the model for a given audio sample, 2) an action is the choice of a candidate optimal value for each parameter of the post-processing stack, 3) the reward is based on the classification accuracy metric we aim to optimize, which is the audio event-based macro F1-score in our case. We apply our post-processing to the class probability distribution outputs of two audio event classification models submitted to the DCASE Task4 2020 challenge. We find that by using reinforcement learning to discover the optimal per-class parameters for the post-processing stack that is applied to the outputs of audio event classification models, we can improve the audio event-based macro F1-score (the main metric used in the DCASE challenge to compare audio event classification accuracy) by 4-5% compared to using the same post-processing stack with manually tuned parameters.
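The setup above can be mimicked with a minimal sketch (a hypothetical illustration, not the paper's code: the synthetic data, the single binary class, and the plain grid search standing in for the RL policy are all assumptions). Each "action" is a candidate (threshold, median-filter kernel) pair for the post-processing stack, and the "reward" is the F1-score obtained after applying it:

```python
import numpy as np

def median_filter(x, k):
    """Smooth a 0/1 prediction sequence with a length-k median filter."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([np.median(xp[i:i + k]) for i in range(len(x))])

def f1_score(pred, truth):
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def evaluate(probs, truth, threshold, kernel):
    """Apply the post-processing stack (threshold, then median filter) and score it."""
    pred = median_filter((probs > threshold).astype(float), kernel)
    return f1_score(pred, truth)

# Synthetic frame-level class probabilities and ground-truth event activity.
rng = np.random.default_rng(0)
truth = (np.sin(np.linspace(0, 6 * np.pi, 200)) > 0).astype(int)
probs = np.clip(truth * 0.7 + rng.normal(0.15, 0.2, 200), 0, 1)

# One "action" per candidate parameter pair; the "reward" is the resulting F1.
actions = [(t, k) for t in (0.3, 0.4, 0.5, 0.6) for k in (1, 3, 5, 7)]
rewards = [evaluate(probs, truth, t, k) for t, k in actions]
best = actions[int(np.argmax(rewards))]
```

The paper replaces this exhaustive search with a learned policy, which matters when the per-class parameter space is too large to enumerate.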

[3] (technical report from the DCASE-2022 challenge) Petros Giannakopoulos and Aggelos Pikrakis, “Multi-Task Learning for Sound Event Detection Using Variational Autoencoders,” Technical Report, Detection and Classification of Acoustic Scenes and Events 2022 Challenge, June 2022 (link).

Abstract: This technical report presents a multi-task learning model based on recurrent variational autoencoders (VAEs). The proposed method employs recurrent VAEs with shared parameters to simultaneously learn the tasks of strong labeling, weak labeling and feature sequence reconstruction. During the training stage, the model receives strongly labeled, weakly labeled and unlabeled data as input and simultaneously optimizes frame-based and file-based cross-entropy losses for the strongly labeled and weakly labeled data, respectively, as well as the reconstruction loss for the unlabeled data. Using a shared posterior among all task branches, the model projects the input data for each task into a common latent space. The decoding of latents sampled from this common latent space, in combination with the parameters shared among task branches, acts as a regularizer that prevents the model from overfitting to the individual tasks. The proposed method is evaluated on the DCASE-2022 Task4 dataset, on which it achieves an event-based macro F1 score of 32.5% on the validation set and 31.8% on the public evaluation set.
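As a rough numerical sketch of the joint objective described above (hypothetical shapes and data; this is not the report's model — the recurrent encoder/decoder networks are omitted and random arrays stand in for their outputs), the total loss combines the frame-based and file-based cross-entropies, a reconstruction term, and the KL term of the shared VAE posterior:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    p = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

rng = np.random.default_rng(0)
frames, feat_dim, n_classes = 50, 64, 10

# Stand-ins for the model outputs on one training example of each kind.
strong_pred = rng.uniform(size=(frames, n_classes))    # frame-wise class probabilities
strong_true = rng.integers(0, 2, (frames, n_classes))  # strong (frame-level) labels
weak_pred = rng.uniform(size=n_classes)                # file-wise class probabilities
weak_true = rng.integers(0, 2, n_classes)              # weak (file-level) labels
recon = rng.normal(size=(frames, feat_dim))            # reconstructed feature frames
feats = rng.normal(size=(frames, feat_dim))            # input features (unlabeled data)

# Shared Gaussian posterior parameters and their KL divergence from N(0, I).
mu = rng.normal(scale=0.1, size=16)
logvar = rng.normal(scale=0.1, size=16)
kl = float(-0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar)))

total_loss = (bce(strong_pred, strong_true)    # frame-based loss, strongly labeled data
              + bce(weak_pred, weak_true)      # file-based loss, weakly labeled data
              + np.mean((recon - feats) ** 2)  # reconstruction loss, unlabeled data
              + kl)
```

In training, each term would only be active for the batch items carrying the corresponding kind of supervision; here all terms are simply summed to show the shape of the objective.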

[4] (journal paper, to submit, currently on arXiv) Petros Giannakopoulos, Aggelos Pikrakis and Yannis Cotronis, “A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments”, arXiv preprint arXiv:2110.12778, 25 October 2021 (link).

Abstract: In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within it, in the case where the only available information is the raw sound from the environment, as a simulated human listener placed in the environment would hear it. For this purpose we create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio source localization problem. We also create an autonomous agent based on the PPO online reinforcement learning algorithm and attempt to train it to solve these environments. Our experiments show that our agent achieves adequate performance and generalization ability in both environments, measured by quantitative metrics, even when a limited amount of training data is available or the environment parameters shift in ways not encountered during training. We also show that a degree of agent knowledge transfer is possible between the environments.

[5] (journal paper, to submit, restricted access) Petros Giannakopoulos and Aggelos Pikrakis, “Multi-task Learning with Variational Autoencoders for Semi-supervised Sound Event Detection” (link).