Using this open-source pipeline in production? Make the most of it thanks to our consulting services.

Speaker diarization 3.1

This pipeline is the same as pyannote/speaker-diarization-3.0 except it removes the problematic use of onnxruntime. Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference. It requires pyannote.audio version 3.1 or higher. It ingests mono audio sampled at 16kHz and outputs speaker diarization as an Annotation instance.

Requirements

1. Install pyannote.audio 3.1 with pip install pyannote.audio
2. Accept pyannote/segmentation-3.0 user conditions
3. Accept pyannote/speaker-diarization-3.1 user conditions
4. Create an access token at hf.co/settings/tokens.

Usage
# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
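Each line the dump produces is a space-separated SPEAKER record in the standard RTTM layout (record type, file ID, channel, onset, duration, then the speaker label in the eighth field). As a minimal sketch of reading such a file back, assuming that standard field layout (the sample line below is made up for illustration, not actual pipeline output):

```python
# Parse a SPEAKER record from an RTTM file into (onset, duration, speaker).
# Assumed field layout (standard RTTM):
#   type, file ID, channel, onset, duration, <NA>, <NA>, speaker, <NA>, <NA>
def parse_rttm_line(line):
    fields = line.split()
    assert fields[0] == "SPEAKER", "only SPEAKER records are expected here"
    return float(fields[3]), float(fields[4]), fields[7]

# Hypothetical example line in that layout:
line = "SPEAKER audio 1 6.700 2.300 <NA> <NA> SPEAKER_00 <NA> <NA>"
onset, duration, speaker = parse_rttm_line(line)
print(onset, duration, speaker)  # 6.7 2.3 SPEAKER_00
```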
Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))
Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
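ProgressHook is one ready-made hook, but any callable with a compatible signature can be passed instead. The sketch below shows a plain logging hook; the `(step_name, step_artifact, file=None, total=None, completed=None)` calling convention is an assumption modeled on ProgressHook, not documented API, so verify it against your pyannote.audio version:

```python
# A minimal custom hook: builds and prints a one-line status per pipeline step.
# The calling convention is assumed, not verified against pyannote.audio docs.
def logging_hook(step_name, step_artifact, file=None, total=None, completed=None):
    if total is not None and completed is not None:
        msg = f"{step_name}: {completed}/{total}"   # chunked step with progress
    else:
        msg = f"{step_name}: done"                  # one-shot step
    print(msg)
    return msg

# Simulate how a pipeline might invoke the hook on a three-chunk step:
for chunk in range(3):
    logging_hook("segmentation", None, total=3, completed=chunk + 1)
```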
Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using the min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
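Conceptually, fixing num_speakers behaves like setting identical lower and upper bounds, and a bound-constrained run keeps the estimated speaker count inside the given range. The toy sketch below illustrates only that option semantics as I read it, not pyannote's actual clustering code:

```python
# Toy illustration of speaker-count constraints: an unconstrained estimate
# is clamped into [min_speakers, max_speakers]; num_speakers=k acts like
# min_speakers=max_speakers=k. Not pyannote internals, just the semantics.
def constrain_speaker_count(estimated, min_speakers=1,
                            max_speakers=float("inf"), num_speakers=None):
    if num_speakers is not None:
        min_speakers = max_speakers = num_speakers
    return max(min_speakers, min(estimated, max_speakers))

print(constrain_speaker_count(7, min_speakers=2, max_speakers=5))  # 5
print(constrain_speaker_count(7, num_speakers=2))                  # 2
```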
Benchmark

This pipeline has been benchmarked on a large collection of datasets. Processing is fully automatic, with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):
| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| AISHELL-4 | 12.2 | 3.8 | 4.4 | 4.0 | RTTM | eval |
| AliMeeting (channel 1) | 24.4 | 4.4 | 10.0 | 10.0 | RTTM | eval |
| AMI (headset mix, only_words) | 18.8 | 3.6 | 9.5 | 5.7 | RTTM | eval |
| AMI (array1, channel 1, only_words) | 22.4 | 3.8 | 11.2 | 7.5 | RTTM | eval |
| AVA-AVD | 50.0 | 10.8 | 15.7 | 23.4 | RTTM | eval |
| DIHARD 3 (Full) | 21.7 | 6.2 | 8.1 | 7.3 | RTTM | eval |
| MSDWild | 25.3 | 5.8 | 8.0 | 11.5 | RTTM | eval |
| REPERE (phase 2) | 7.8 | 1.8 | 2.6 | 3.5 | RTTM | eval |
| VoxConverse (v0.3) | 11.3 | 4.1 | 3.4 | 3.8 | RTTM | eval |
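As a sanity check on the table, DER decomposes additively into its three error components, so each DER% value should equal FA% + Miss% + Conf% up to rounding. A quick check over a few rows, with the numbers copied from the table above:

```python
# DER = false alarm + missed detection + speaker confusion.
# Verify the additive decomposition on rows copied from the benchmark table.
rows = {
    "AISHELL-4":             (12.2, 3.8, 4.4, 4.0),
    "AliMeeting (channel 1)": (24.4, 4.4, 10.0, 10.0),
    "DIHARD 3 (Full)":       (21.7, 6.2, 8.1, 7.3),
}
for name, (der, fa, miss, conf) in rows.items():
    # allow a small tolerance since each column is rounded independently
    assert abs(der - (fa + miss + conf)) < 0.15, name
    print(f"{name}: {fa} + {miss} + {conf} ≈ {der}")
```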
Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}