[ICASSP 2018] Google's Diarization System: Speaker Diarization with LSTM

Quan


Published 29 Sep 2024
Science & Technology

COMMENTS • 26

@generichuman_ 1 year ago
14:47 This must be a podcast with Neil deGrasse Tyson
@vinitdhamale7795 5 years ago ⁺³
Quan, impressive research and work. For a newcomer like me, it would be great if you could do a session where you show the code running or the steps to follow for coding. Thanks again for all this help. I've subscribed to your channel for regular updates as well.
@QuanWang 5 years ago
Thanks for your interest. Unfortunately this work is not open sourced and uses an internal codebase, so we cannot show code execution examples :(
@vinitdhamale7795 5 years ago
@@QuanWang Hi Quan, I have one question: how did you get the data at github.com/google/uis-rnn/tree/master/data into this format from WAV? Can you guide me?
@aram69420 7 months ago
Thank you for this. As an undergrad student trying to get into research, I find it really hard to read and understand research papers. Thanks a lot for the video breakdown of your research!
@abdulmajeedmarek 2 years ago ⁺¹
What great work!
@QuanWang 2 years ago
After years of preparation, I'm excited to share that my online course on Speaker Recognition is now open for enrollment on Udemy: www.udemy.com/course/speaker-recognition/?referralCode=1914766AF241CE15D19A
Also this Udemy online course on Speaker Diarization: www.udemy.com/course/diarization/?referralCode=21D7CC0AEABB7FE3680F
Please contact me if you need a coupon. Looking forward to seeing you in the lectures!
@arun_das_k 1 year ago
Does it cover the coding part as well? I find it difficult to recreate the approach you followed here.
@lelluc 3 years ago
Is there a Google Colab demo for this? If so, could someone post a link to the notebook?
@julianasanchez7752 3 years ago
Hi, I am trying to find a speech recognition app or software that would tell me when in a video/audio note someone else spoke. This was the closest to what I am looking for, but I wondered: is this something available for the public to use? I am an MBA student and would like to use it for my studies to identify when students participated during a class lecture. Thanks in advance for the info.
@tuckb8332 3 years ago
Hi, Juliana. Any luck with finding a good diarization system? I'm also looking.
@eniasqurku2007 3 years ago
Hi @Quan. Do you extract MFCC features from the audio first, or do you pass the raw audio directly to the LSTM network to extract d-vectors? Maybe I missed something, but that's not really clear to me. I would really appreciate your help. Thanks in advance.
@QuanWang 3 years ago
We use log mel filter bank energies, not MFCC.
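To make the distinction in the reply above concrete, here is a minimal sketch of log mel filter bank energies for a single frame. MFCCs would add a DCT on top of these values; the sketch stops before that step. The frame length, sample rate, number of filters, and the naive DFT are illustrative assumptions, not the settings used in the paper.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (fine for a short demo frame)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    # Fractional FFT-bin position of every filter edge.
    bins = [mel_to_hz(m) * n_fft / sample_rate for m in mel_points]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel_energies(frame, sample_rate, num_filters=8, floor=1e-10):
    """Log of the mel-filtered power spectrum; the floor avoids log(0)."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    return [math.log(max(sum(w * p for w, p in zip(row, spec)), floor)) for row in bank]

# Toy input (assumption): a 1 kHz tone, 64 samples at 8 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / 8000.0) for t in range(64)]
features = log_mel_energies(frame, sample_rate=8000)
```

For the toy tone, the largest value in `features` falls in the filter that covers 1 kHz.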
@chandusr1 6 years ago
I'm trying to implement this work... I have one question: how did you calculate sigma (the standard deviation) for applying the Gaussian blur, which is discussed at 16:59? Any help will be much appreciated.
@QuanWang 6 years ago
sigma and the threshold p are the two parameters of the spectral clustering algorithm that need to be tuned. We use a dev dataset that is separate from the training and testing sets, and perform a grid search over the two parameters on the dev set. We mostly observed the best performance with sigma=1.0.
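The grid search described in this reply could be sketched as follows. The `evaluate` callback is hypothetical: in a real setup it would run diarization on the dev set with the given (sigma, p) and return an error metric where lower is better. The `toy_dev_error` function is only a stand-in so the sketch runs; it is not measured data.

```python
import itertools

def grid_search(evaluate, sigmas, percentiles):
    """Try every (sigma, p) pair and keep the one with the lowest dev-set error."""
    best = None
    for sigma, p in itertools.product(sigmas, percentiles):
        error = evaluate(sigma, p)
        if best is None or error < best[0]:
            best = (error, sigma, p)
    return best

def toy_dev_error(sigma, p):
    # Placeholder objective (assumption), NOT a real diarization error rate.
    return (sigma - 1.0) ** 2 + (p - 0.95) ** 2

best_error, best_sigma, best_p = grid_search(
    toy_dev_error,
    sigmas=[0.5, 1.0, 2.0],
    percentiles=[0.9, 0.95, 1.0],
)
```

Keeping the dev set separate from the training and test sets, as the reply notes, is what keeps the chosen pair from being tuned on the data it is later evaluated on.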
@chandusr1 6 years ago
Thank you for the help! Could you also please give me a rough value of the threshold p that you used on your dev set?
@QuanWang 6 years ago
Mostly between 0.9 and 1.0 for us. But it really depends on your speaker embeddings (from the speaker recognition network). If you use a different embedding system, you need to retune both sigma and p.
@QuanWang 5 years ago
@@chandusr1 Hello sir, just FYI, I recently re-implemented the spectral clustering algorithm in Python and made it public here: github.com/wq2012/SpectralCluster. You can use it as a reference.
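As a rough illustration of where sigma and p enter, here is a minimal sketch of two refinement steps applied to an affinity matrix before clustering: a Gaussian blur with standard deviation sigma, then a row-wise threshold at percentile p. It assumes p is a fraction in [0, 1] used as a per-row percentile, and that values below the cutoff are shrunk by a hypothetical multiplier of 0.01. The linked repository and the paper remain the reference for the actual steps.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian weights; radius defaults to about 3 sigma."""
    radius = radius if radius is not None else max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(values, kernel):
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)  # clamp at the borders
            acc += w * values[idx]
        out.append(acc)
    return out

def gaussian_blur(matrix, sigma):
    """Separable blur: filter every row, then every column."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in matrix]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def row_threshold(matrix, p, scale=0.01):
    """Shrink entries below each row's p-percentile (scale is an assumption)."""
    out = []
    for row in matrix:
        ordered = sorted(row)
        cutoff = ordered[min(int(p * len(row)), len(row) - 1)]
        out.append([v if v >= cutoff else v * scale for v in row])
    return out

# Toy affinity matrix (assumption): two blocks of two segments each.
affinity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
refined = row_threshold(gaussian_blur(affinity, sigma=1.0), p=0.5)
```

With p close to 1.0, only the strongest entries in each row keep their original value, which is why the best p depends on how sharply the embeddings separate speakers.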
@jamesgenius1673 1 year ago
greaaaaattt.
@douglasferreira3506 5 years ago
What VAD algorithm did you use? I am having serious trouble extracting it from my wav files. My goal is to separate a conversation between two people, generating a wav file for each speaker. Many times my VAD overlaps into another speaker or cuts one person's speech short.
@QuanWang 5 years ago
It's described in detail in the paper. It's very simple, just two full-covariance Gaussians. If your VAD performs poorly, consider training it with better data that are more consistent with your application.
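A two-Gaussian VAD of this kind could be sketched as below: fit one full-covariance Gaussian to speech frames and one to non-speech frames, then label a new frame by whichever gives the higher log-likelihood. The 2-D features and the toy training frames are assumptions chosen so the covariance inverse has a closed form; the paper describes the actual features and training.

```python
import math

def fit_gaussian(frames):
    """Mean and full 2x2 covariance of a list of 2-D feature frames."""
    n = len(frames)
    mx = sum(f[0] for f in frames) / n
    my = sum(f[1] for f in frames) / n
    sxx = sum((f[0] - mx) ** 2 for f in frames) / n
    syy = sum((f[1] - my) ** 2 for f in frames) / n
    sxy = sum((f[0] - mx) * (f[1] - my) for f in frames) / n
    return (mx, my), (sxx, sxy, syy)

def log_likelihood(frame, gaussian):
    (mx, my), (sxx, sxy, syy) = gaussian
    det = sxx * syy - sxy * sxy
    dx, dy = frame[0] - mx, frame[1] - my
    # Mahalanobis distance using the closed-form inverse of a 2x2 covariance.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (maha + math.log(det) + 2.0 * math.log(2.0 * math.pi))

def is_speech(frame, speech, non_speech):
    """True when the speech Gaussian explains the frame better."""
    return log_likelihood(frame, speech) > log_likelihood(frame, non_speech)

# Toy labeled frames (assumption), e.g. two energy-like features per frame.
speech_frames = [(5.0, 1.0), (6.0, 2.5), (5.5, 1.2), (6.5, 2.0)]
noise_frames = [(0.5, 0.1), (1.0, -0.2), (0.2, 0.3), (0.8, 0.4)]
speech_model = fit_gaussian(speech_frames)
noise_model = fit_gaussian(noise_frames)
```

Because both Gaussians are fitted from labeled data, retraining on frames that resemble the target recordings is the natural way to act on the advice in the reply.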
@douglasferreira3506 4 years ago
@Arshdeep Singh yes, I had a very small dataset, but I didn't follow the video's methodology. I used a traditional method:
1. Removed noise using VAD and then sliced the audio.
2. Computed MFCCs on the audio slices.
3. Normalized the MFCCs by calculating the mean over every 15 chunks of 20 ms.
4. Ran K-means to cluster the MFCCs.
I do not have a GitHub implementation because I left the company that asked me for it, but I can help you if you create a repository.
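The final clustering step of that traditional pipeline could be sketched as plain K-means over per-segment feature vectors. The VAD and MFCC steps are not reproduced here; the 2-D toy vectors stand in for normalized MFCC features, and initializing from the first k points is an assumption made to keep the example deterministic.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Toy feature vectors (assumption): two speakers, three segments each.
segments = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1), (4.8, 5.1)]
labels, centroids = kmeans(segments, k=2)
```

Segments that share a label are attributed to the same speaker, so k must be known in advance (two for a two-person conversation).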
@kareemamr5626 4 years ago
Hello Douglas, I'm trying to implement this paper and it would be great if I could ask you some questions. Is that at all possible?
@douglasferreira3506 4 years ago
@@kareemamr5626 Of course. IDK if YouTube has a messaging system; if it doesn't, I can give you my email.
@kareemamr5626 4 years ago
@@douglasferreira3506 I don't think it does. Do you have a Reddit or Twitter account where we could exchange emails?
