Gibberish: Bad pitch marking or do_alignment?
- This topic has 12 replies, 3 voices, and was last updated 10 months, 2 weeks ago by Simon.
-
February 29, 2024 at 18:05 #17561
Hi,
I built my domain voice from a europarl dataset following all the same steps as I did while building my own arctic voice. The arctic voice sounds comprehensible but my domain voice is entirely gibberish (attached .wav file for the utt “tobacco farmers”, which is directly from my script).
I’m trying to figure out what went wrong:
1) Each sentence in my script was quite long (~24 words) because a lot of sentences in the parliament corpus are like that. I did my best to record them without odd pauses in between.
2) I don’t think there are awkwardly long silences at the beginning of each recording; does this voice sound like an issue that could be fixed by endpointing?
3) When I ran this from section 10.1:
festival> (build_utts "utts.data" 'unilex-rpx)
I got quite a lot of “bad pitchmarking” flags for each utt. I think this happened when I ran my arctic A as well, but that voice was not gibberish in the end.
4) I looked at what units were used to synthesise “tobacco farmers” and they were correct:
festival> (set! myutt (SayText "tobacco farmers."))
#<Utterance 0x7f0f3a441cb0>
festival> (utt.relation.print myutt 'Unit)
()
id _29 ; name #_t ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 15 ; source_utt venkat_a022 ; source_ph1 “[Val item]” ; source_end 14.17 ; target_cost 0.145833 ; join_cost 0 ; end 0.169188 ; num_frames 29 ;
id _30 ; name t_@ ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 2 ; source_utt venkat_a023 ; source_ph1 “[Val item]” ; source_end 12.284 ; target_cost 0.479167 ; join_cost 0.162098 ; end 0.201813 ; num_frames 4 ;
id _31 ; name @_b ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt venkat_a037 ; source_ph1 “[Val item]” ; source_end 26.374 ; target_cost 0.145833 ; join_cost 0.170008 ; end 0.308874 ; num_frames 21 ;
id _32 ; name b_a ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt venkat_a001 ; source_ph1 “[Val item]” ; source_end 1.592 ; target_cost 0 ; join_cost 0.231527 ; end 0.430811 ; num_frames 25 ;
id _33 ; name a_k ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 26 ; source_utt venkat_a001 ; source_ph1 “[Val item]” ; source_end 1.824 ; target_cost 0 ; join_cost 0 ; end 0.560561 ; num_frames 29 ;
id _34 ; name k_ou ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt venkat_a027 ; source_ph1 “[Val item]” ; source_end 18.694 ; target_cost 0.25 ; join_cost 0.692359 ; end 0.564624 ; num_frames 1 ;
id _35 ; name ou_f ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt venkat_a021 ; source_ph1 “[Val item]” ; source_end 3.866 ; target_cost 5.27083 ; join_cost 0.34787 ; end 0.641812 ; num_frames 16 ;
id _36 ; name f_ar ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 26 ; source_utt venkat_a046 ; source_ph1 “[Val item]” ; source_end 14.848 ; target_cost 10.3958 ; join_cost 0.189536 ; end 0.772812 ; num_frames 27 ;
id _37 ; name ar_r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt venkat_a046 ; source_ph1 “[Val item]” ; source_end 14.854 ; target_cost 10.3125 ; join_cost 0 ; end 0.777688 ; num_frames 1 ;
id _38 ; name r_m ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt venkat_a046 ; source_ph1 “[Val item]” ; source_end 14.86 ; target_cost 10.4792 ; join_cost 0 ; end 0.782688 ; num_frames 1 ;
id _39 ; name m_@r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt venkat_a001 ; source_ph1 “[Val item]” ; source_end 2.058 ; target_cost 20.3125 ; join_cost 0.767362 ; end 0.794125 ; num_frames 1 ;
id _40 ; name @r_r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt venkat_a045 ; source_ph1 “[Val item]” ; source_end 29.414 ; target_cost 10.25 ; join_cost 0.387135 ; end 1.06937 ; num_frames 24 ;
id _41 ; name r_z ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt venkat_a011 ; source_ph1 “[Val item]” ; source_end 32.212 ; target_cost 5 ; join_cost 0.213809 ; end 1.09006 ; num_frames 2 ;
id _42 ; name z_# ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt venkat_a011 ; source_ph1 “[Val item]” ; source_end 32.218 ; target_cost 0 ; join_cost 0 ; end 1.09787 ; num_frames 1 ;
nil
I looked at this thread: https://speech.zone/forums/topic/bad-pitch-marking/
You wrote “but then an increase in the proportion of units with bad pitchmarks effectively reduces the inventory size, which should lead to worse quality” and “These warnings relate to units without any pitchmarks at all, and this then results in a penalty.”
The fact that it is picking up the right units but producing gibberish makes me think it’s more of an issue with do_alignment than too many “bad pitchmarking”s? Still, I’m confused as to why this domain voice is so radically different from my arctic A voice when I followed all the same steps as when I made my arctic A (I think I was more consistent with my recordings for the domain one as well, since there were fewer utts).
Sorry for the long post!
-
February 29, 2024 at 18:11 #17564
Have you inspected the alignment? Load one of your utterances and the corresponding labels into Wavesurfer or Praat and inspect them. Try that for a few different utterances.
-
March 7, 2024 at 16:12 #17578
It has to do with alignment. I looked at a few of my utts and their .lab files in Wavesurfer, and the labels for some words are missing entirely. For example, the time span where I say “Documents” in one utt has no non-sp/non-sil labels (just ‘sil’; see attached image).
I spoke to two others who had similar issues and what solved it for them was redoing the whole exercise from scratch (from step 1: downloading ss.zip until the final ‘run the voice’ step). I just did that and am still having this issue – I checked my utts.data file to make sure it’s formatted the same way as arctic’s. I’ll try redoing it but before I do, I think it’d help to know why or which step in the ‘building the voice’ process before do_alignment is causing alignment to go badly.
What could cause some words to not be labelled at all in the do_alignment and the next break_mlf alignment/aligned.3.mlf lab steps? (for words like “documents”, which isn’t oov)
What does this line mean (while running do_alignment, because I see it a lot):
WARNING [-8232] ExpandWordNet: Pronunciation 1 of sp is 'tee' word in HVite
Thanks!
-
March 7, 2024 at 16:55 #17580
Looks like your forced alignment is very poor. You will find that all the words are there, but that the labels have become collapsed to the start and end of the file.
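One rough way to spot this kind of collapse without opening every file by hand is to measure how many phone labels are implausibly short. The following is only a sketch: it assumes xwaves-style label files (a `#` line ending the header, then `end_time colour label` on each line), and the 10 ms threshold and the function names are illustrative choices, not part of the voice-building recipe.

```python
def read_labels(text):
    """Parse label text into (start, end, name) tuples.

    Assumed format: optional header ending in a line containing only '#',
    then one 'end_time colour label' entry per line.
    """
    lines = [line.strip() for line in text.splitlines()]
    in_body = "#" not in lines  # no header marker: treat every line as data
    labels, prev_end = [], 0.0
    for line in lines:
        if not in_body:
            in_body = (line == "#")
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        end = float(fields[0])
        labels.append((prev_end, end, fields[2]))
        prev_end = end
    return labels


def collapsed_fraction(labels, min_dur=0.01, silence=("sil", "sp", "#")):
    """Fraction of non-silence labels shorter than min_dur seconds."""
    phones = [(s, e) for s, e, name in labels if name not in silence]
    if not phones:
        return 0.0
    short = sum(1 for s, e in phones if e - s < min_dur)
    return short / len(phones)
```

A value close to 1.0 for many utterances would suggest the phone labels have been squeezed into a few milliseconds, as described above; a few short labels per utterance is not necessarily a problem.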
How much speech data are the models being trained on? If it is only a small amount, you could try adding the ARCTIC A utterances to your utts.data (just during forced alignment), so that the models are trained on more data and are more likely to work.
-
March 7, 2024 at 17:12 #17582
Thanks! Just to clarify, you mean add the utterances from arctic’s utts.data into my domain’s utts.data, and the corresponding arctic .wav files to my ‘wav’ folder as well, before running do_alignment, correct?
-
March 7, 2024 at 17:37 #17583
Yes, that’s correct. You can train the alignment models on different data from the data you eventually include in the unit selection database. (But be careful to report this, if it affects any of your experiments.)
-
March 7, 2024 at 20:27 #17586
Thanks! Mixing my arctic recordings with my domain (parliamentary speech) worked to fix the gibberish output.
1) But why? My domain script has fewer sentences, but because they were long, the total number of words was more or less the same as in Arctic A. How can the same amount of training data (recordings) produce a comprehensible voice in one case (Arctic A) but not in another (my domain)?
If that’s the case, would simply doubling the size of my current domain script (so that would be twice as many words as Arctic A) help in making a voice with that domain script alone (i.e. w/o mixing my arctic A recordings as I’ve just done)? Ideally, I would not like to mix my domain + arctic A data.
I’m inclined to think it’s not the training data size that is causing misalignment here because if that is the case, why are others having misalignment issues on their Arctic A voices themselves…? Assuming they’re using the same Arctic A script-recordings w/593 lines that was used in the default voice (section 5 on the exercise) for this assignment (wrt Noe’s reply below). And that default voice seems to run without misalignment.
Thanks!
-
March 8, 2024 at 09:24 #17589
Figuring out why forced alignment fails, and then solving that, is part of the assignment.
The most common cause is too much mismatch between the labels and the speech. That might be as simple as excessively long leading/trailing silences (solution: endpoint), or something more tricky like the voice talent’s pronunciations being too different to those in the dictionary, or letter-to-sound pronunciations which are a poor match to how the voice talent pronounced certain words.
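To check the simplest of these causes, the amount of leading and trailing silence can be estimated directly from the waveform. This is a minimal sketch, not the endpointing tool from the exercise: the function name, the 10 ms frame size and the -40 dB threshold (relative to the peak sample) are all assumptions chosen for illustration.

```python
import math


def silence_margins(samples, rate, frame_ms=10, threshold_db=-40.0):
    """Return (leading, trailing) silence in seconds.

    A frame counts as speech if its RMS is within threshold_db of the
    peak absolute sample value in the recording.
    """
    total = len(samples) / rate
    frame = max(1, int(rate * frame_ms / 1000))
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return total, total  # empty or digitally silent recording
    limit = peak * 10 ** (threshold_db / 20)
    loud = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        loud.append(rms >= limit)
    if True not in loud:
        return total, total
    first = loud.index(True)
    last = len(loud) - 1 - loud[::-1].index(True)
    leading = first * frame / rate
    trailing = max(0.0, (len(samples) - (last + 1) * frame) / rate)
    return leading, trailing
```

The samples could come from Python's standard `wave` module or any other reader. Recordings with much longer margins than the rest would be the first candidates for endpointing.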
Sometimes, the easiest solution is to use additional data (e.g., your own ARCTIC A recordings) to train the models.
Remember that this is not the same as including all of that data in the unit selection database: you could use all your data to train the alignment models, but only use specific subsets in the unit selection database for the voice you are building.
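As a sketch of how that separation might be kept in practice, one could build a merged list for alignment and a filtered list for the voice itself. This assumes the usual `( utt_id "text" )` line format for utts.data; the function names and the `dom_` / `arctic_` ID prefixes are hypothetical, and the merge deliberately refuses duplicate IDs, since two recordings sharing a name would overwrite each other's files.

```python
import re

# Assumed utts.data line format: ( utt_id "text of the utterance" )
UTT_LINE = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)$')


def parse_utts(text):
    """Return a list of (utt_id, text) pairs from utts.data content."""
    entries = []
    for line in text.splitlines():
        match = UTT_LINE.match(line.strip())
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries


def format_utts(entries):
    """Write (utt_id, text) pairs back out in utts.data format."""
    return "".join('( %s "%s" )\n' % (uid, txt) for uid, txt in entries)


def merge_for_alignment(domain_text, extra_text):
    """Combine two utts.data files, failing on clashing utterance IDs."""
    merged, seen = [], set()
    for uid, txt in parse_utts(domain_text) + parse_utts(extra_text):
        if uid in seen:
            raise ValueError("duplicate utterance ID: " + uid)
        seen.add(uid)
        merged.append((uid, txt))
    return merged


def select_prefix(entries, prefix):
    """Keep only utterances whose ID starts with prefix (e.g. for the voice)."""
    return [(uid, txt) for uid, txt in entries if uid.startswith(prefix)]
```

The merged list would be used while training the alignment models, and the filtered list when building the unit selection database.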
-
March 7, 2024 at 16:55 #17581
For warning 8232, search the forums.
-
March 7, 2024 at 19:32 #17584
Hi Simon/Korin,
I also seem to be having trouble with pitch marking in do_alignment. For context, I’m building a voice with all arctic utterances. I get lines like this printed for every line during this step:
Bad pitchmarking: r 2.026.
Bad pitchmarking: e 2.076.
…
This then seems to lead to problems later on with creating the lpc files…
I’ve also looked at some of the waveforms and have found that for every utterance there are instances of “sp” aligned to clear segments of speech (and verified by listening). Example attached. I’ve also redone the entire pipeline from scratch and the error reproduces, so I’m a bit stuck as to why this is happening.
Thanks,
-Noe
-
March 8, 2024 at 09:12 #17588
There are two different things going on here:
1. a handful of “bad pitch marking” warnings is acceptable, but not for every segment. See this post: https://speech.zone/forums/topic/bad-pitch-marking/#post-9237
2. most sp labels will have zero duration, and when you view them in Wavesurfer they will be drawn on top of a correct phone label, thus making it invisible. You need to manually delete all zero-duration sp labels before loading the file in Wavesurfer, as described in the Find and fix a labelling error step.
-
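For inspecting many files, deleting the zero-duration sp labels by hand gets tedious. The following is a rough sketch of doing it on a copy of a label file's contents; it assumes the xwaves-style layout (header ending in a `#` line, then `end_time colour label` per line), and the function name is made up for illustration. It is meant only for viewing, not as a replacement for the labels used to build the voice.

```python
def drop_zero_duration_sp(text, label="sp", tol=1e-6):
    """Return label-file text with zero-duration `label` entries removed.

    Assumed format: header lines, a line containing only '#', then
    'end_time colour label' entries. A label's start time is taken to be
    the end time of the previous entry.
    """
    out, prev_end, in_body = [], 0.0, False
    for line in text.splitlines():
        if not in_body:
            out.append(line)  # copy the header through unchanged
            in_body = (line.strip() == "#")
            continue
        fields = line.split()
        if len(fields) >= 3:
            end = float(fields[0])
            if fields[2] == label and end - prev_end <= tol:
                continue  # zero-duration sp: skip it
            prev_end = end
        out.append(line)
    return "\n".join(out) + "\n"
```

Pauses labelled sp that have a real duration are kept, so genuine pauses would still show up in Wavesurfer.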
March 8, 2024 at 10:57 #17591
Thanks for pointing me in the right direction.
I looked at section 4.2.3 of Clark, Richmond, & King (2007) as suggested in the first thread, but am still not sure how to proceed. It does seem that my speaking rate might be faster and more prone to elided/deleted segments than average but is that really likely to affect the alignment of every utterance? I don’t see any other glaring issues with the automatic labeling other than a handful of cases like this, where multiple phone labels are squeezed together on to a portion of the waveform that looks more like a single articulation.
(Example attached with no visible /v/ in “of”)
-Noe
-
March 8, 2024 at 15:26 #17593
Connected speech effects, including elision, will of course make forced alignment harder because there is a greater degree of mismatch between the labels and the speech. In your example above, there probably is no good alignment of those labels because there is acoustically little or no [v] in the speech.
This is a fundamental challenge in speech, and not easily solved!
But, if your alignments generally look OK, then you can say that forced alignment has been successful and move on through the subsequent steps of building the voice.
-