Forum Replies Created
Great! Thank you. Can you fill in the last gap in this part of the process for me:
How are the filter coefficients crossfaded at that one-pitch-period crossfade area?
I’ve looked at the code you posted (it’s on GitHub), and while it is well-commented, it’s beyond my capability to glean the answer to this question just from looking at it, although I assume the process is in there somewhere…
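Just so you can correct my mental model: is it something as simple as linearly interpolating the two sets of coefficients, frame by frame, across that one-pitch-period region? Here is a purely hypothetical sketch of what I am imagining (not taken from the actual code; the names are invented):

import numpy as np

def crossfade_coefs(coefs_left, coefs_right, n_frames):
    # coefs_left / coefs_right: the filter coefficient vectors on either side
    # of the join; n_frames: the frames spanning the one-pitch-period region
    faded = []
    for i in range(n_frames):
        w = (i + 1) / (n_frames + 1)   # fade-in weight for the right-hand unit
        faded.append((1.0 - w) * coefs_left + w * coefs_right)
    return np.stack(faded)

Is it something like that, or is the crossfade done on the waveform / residual rather than on the coefficients themselves?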
Thanks. I’m still missing one piece of the puzzle. Is there not then a ‘process’ that we might call ‘running the target cost function’ over the initial list of candidates (per target position) that returns target costs and attaches them to each unit in the list? Or do you not think of that as a distinct process?
Lastly, can you say what pre-search pruning method Festival/Multisyn is using with its ‘ob_pruning’ function? Is it based on target cost?
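For what it’s worth, here is the toy version of the ‘process’ I have in my head (all names invented; this is not the actual Multisyn code):

def attach_target_costs(target, candidates, target_cost_fn, keep_fraction=0.5):
    # score every candidate for this target position and attach the cost to it
    scored = [(target_cost_fn(target, cand), cand) for cand in candidates]
    scored.sort(key=lambda pair: pair[0])
    # hypothetical pre-search pruning: keep only the cheapest fraction of
    # candidates before the Viterbi search ever sees them
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]

Is that roughly what ‘ob_pruning’ is doing, or is the pruning based on something other than target cost?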
Yes, this is the same formula I was using in the lab. However, the same issues apply. First, this calculation assumes that the sample size ‘n’ is made up of unique individuals, each answering one A/B question, so that assumption does not hold in our case.

Second, this leaves out very important and interesting information about the data. For example, it could be that 100% of the users ALWAYS chose A for 75% of the questions, and ALWAYS chose B for 25% of the questions. The pooled data would be identical (and therefore the CI would be identical, since p, z, and n are the same as before), but clearly there’s an interesting story there that’s not being told. In my hypothetical, Voice A is ‘better’ in ‘most kinds of sentences’, but is clearly ‘worse’ in ‘certain kinds of sentences’. Depending on your specific domain/end use, this could be very important indeed.

The question remains: how to report findings that are scientifically robust and defensible, don’t take up too much room in the report, and enlighten rather than confuse the reader. I’m inclined to calculate and report the CI, with caveats regarding assumptions, and to point out any anomalies in the data such as the example I’ve given here. Does that sound reasonable?
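P.S. For concreteness, the calculation I mean is the usual normal-approximation interval for a proportion, p ± z·sqrt(p(1−p)/n). A rough worked example with made-up numbers:

import math

p_hat = 0.75        # pooled proportion choosing Voice A (made-up number)
n = 400             # total pooled A/B responses, NOT unique listeners
z = 1.96            # 95% confidence
half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)
print(ci)           # roughly (0.708, 0.792) for these made-up numbers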
There’s this:
(postlex_unilex_vowel_reduction utt)
“Perform vowel reduction based on unilex specification of what can be reduced.” But it’s not clear how to use it.
OK, I think I remember: it’s a ‘post-lexical rule’, which means… it’s not technically part of the lexical spec, which is why it’s not considered part of the target spec. OK, fine. According to the manual:
“Our vowel reduction model uses a CART decision tree to predict which syllables should be reduced.” Can I edit this? Or is there another approach: go into the original utt file and change the ‘uu’ to an ‘@’? Better yet, find an utt that has uu-dh in it and change the ‘uu’ to ‘@’?
Back to the ‘neither’ question for a minute: reading the Zen paper on DNN synthesis, they report their subjective listening test findings, which include a ‘Neutral’ (neither) option. While they did show a seemingly statistically significant preference for the DNN over the HMM, they also show that a full 50% of listeners chose ‘Neutral’. With numbers that high in the ‘Neutral’ category, do you believe that in any way invalidates, or at least weakens, the conclusions they draw from the test?
Most references to eigenvoices mention that the concept was derived from ‘eigenfaces’, of which the Wikipedia entry says:
“Informally, eigenfaces can be considered a set of “standardized face ingredients”, derived from statistical analysis of many pictures of faces. Any human face can be considered to be a combination of these standard faces. For example, one’s face might be composed of the average face plus 10% from eigenface 1, 55% from eigenface 2, and even -3% from eigenface 3. Remarkably, it does not take many eigenfaces combined together to achieve a fair approximation of most faces.”

That corresponds very much to what you wrote above, regarding mixing the basis vectors with different weights. Can we go further with the analogy and say that these eigenvoices can be thought of as a set of ‘standardized voice ingredients’?
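In code form, the analogy I have in mind is something like this (a toy numpy sketch; the weights and dimensions are invented):

import numpy as np

def reconstruct_voice(mean_voice, eigenvoices, weights):
    # a particular speaker = the average voice plus a weighted mix of the
    # basis vectors, just like "average face + 10% of eigenface 1 + ..."
    return mean_voice + np.asarray(weights) @ eigenvoices

# toy example: 3 eigenvoices over a 4-dimensional parameter space
mean_voice = np.zeros(4)
eigenvoices = np.eye(3, 4)                  # rows are the basis vectors
new_voice = reconstruct_voice(mean_voice, eigenvoices, [0.10, 0.55, -0.03])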
And as with eigenfaces, which we can look at and which look like blurry approximations, or maybe templates, of different kinds of actual faces, could we listen to eigenvoices, and would they sound like fuzzy approximations, or some kind of aural foundation, of different kinds of voices?

From our testing, it appears that du_voice.set_pruning_beam goes from 0 (most pruning – so much that ‘no best candidates found’) to 1 (least pruning – values higher than 1 had no effect). Is that correct? Can you tell us what the default value is for this setting? (Is it the numbers you posted above?) Can we query Festival for this value? We tried, unsuccessfully, to find more documentation/explanation of these functions in the manual.
Can you explain what the ‘observation beam pruning’ is, and how it is different from the regular ‘beam pruning’?
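To give you something concrete to correct, here is roughly how I picture the two kinds of pruning (pure guesswork on my part, and I realise the real beam may be a fraction rather than an additive margin, given the 0–1 behaviour we observed):

def observation_prune(candidates, ob_beam):
    # prune on the local (target/observation) cost alone, before the search:
    # drop any candidate whose target cost is more than ob_beam worse than
    # the best candidate at this target position
    best = min(c.target_cost for c in candidates)
    return [c for c in candidates if c.target_cost <= best + ob_beam]

def beam_prune(partial_paths, beam):
    # prune on the accumulated path cost during the Viterbi search:
    # drop any partial path more than 'beam' worse than the current best path
    best = min(p.total_cost for p in partial_paths)
    return [p for p in partial_paths if p.total_cost <= best + beam]

Is that roughly the distinction?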
So… I tried making the join costs ‘rather large’. I scaled by a factor of 10, then 100, then… 1000. This definitely caused Festival to make some different unit choices, but not as we would have predicted. I am including the Unit relation output here (also including a screenshot, attached). I chose a sentence from the Arctic script, which is in the database in its entirety. We would expect Festival to choose diphones from only that utterance if the join costs for NOT doing so were sufficiently high. It did not behave this way. Please see the Festival output below, and note the very large join costs (in the hundreds). Can you shed some light on this?
id _88 ; name #_n ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 9 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 0.92 ; target_cost 0.0833333 ; join_cost 0 ; end 0.139313 ; num_frames 15 ;
id _89 ; name n_aa ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 7 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.024 ; target_cost 0 ; join_cost 0 ; end 0.269188 ; num_frames 18 ;
id _90 ; name aa_t ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 11 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.174 ; target_cost 0 ; join_cost 0 ; end 0.346563 ; num_frames 12 ;
id _91 ; name t_a ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.186 ; target_cost 0 ; join_cost 0 ; end 0.380876 ; num_frames 5 ;
id _92 ; name a_t ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 4 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.24 ; target_cost 0 ; join_cost 0 ; end 0.494875 ; num_frames 15 ;
id _93 ; name t_dh ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.338 ; target_cost 0 ; join_cost 0 ; end 0.520125 ; num_frames 2 ;
id _94 ; name dh_i ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.374 ; target_cost 0 ; join_cost 0 ; end 0.557625 ; num_frames 4 ;
id _95 ; name i_s ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 2 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.404 ; target_cost 0 ; join_cost 0 ; end 0.654688 ; num_frames 12 ;
id _96 ; name s_p ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 8 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 1.574 ; target_cost 0 ; join_cost 0 ; end 0.758375 ; num_frames 10 ;
id _97 ; name p_@r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt arctic_b0233 ; source_ph1 “[Val item]” ; source_end 2.948 ; target_cost 0.3125 ; join_cost 313.7 ; end 0.822875 ; num_frames 7 ;
id _98 ; name @r_r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 5 ; source_utt arctic_b0233 ; source_ph1 “[Val item]” ; source_end 3.056 ; target_cost 0.291667 ; join_cost 0 ; end 0.892875 ; num_frames 7 ;
id _99 ; name r_t ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 2 ; source_utt arctic_a0299 ; source_ph1 “[Val item]” ; source_end 3.378 ; target_cost 0.291667 ; join_cost 312.956 ; end 1.00556 ; num_frames 11 ;
id _100 ; name t_i ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 8 ; source_utt arctic_a0358 ; source_ph1 “[Val item]” ; source_end 4.28 ; target_cost 0.270833 ; join_cost 175.823 ; end 1.10281 ; num_frames 11 ;
id _101 ; name i_k ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 3 ; source_utt arctic_a0113 ; source_ph1 “[Val item]” ; source_end 1.882 ; target_cost 0.375 ; join_cost 206.539 ; end 1.16381 ; num_frames 9 ;
id _102 ; name k_y ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 6 ; source_utt arctic_a0059 ; source_ph1 “[Val item]” ; source_end 1.416 ; target_cost 0.0833333 ; join_cost 371.751 ; end 1.22725 ; num_frames 7 ;
id _103 ; name y_@ ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt arctic_a0059 ; source_ph1 “[Val item]” ; source_end 1.426 ; target_cost 0 ; join_cost 0 ; end 1.23375 ; num_frames 1 ;
id _104 ; name @_lw ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt arctic_a0059 ; source_ph1 “[Val item]” ; source_end 1.432 ; target_cost 0.0625 ; join_cost 0 ; end 1.27219 ; num_frames 6 ;
id _105 ; name lw_@r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 9 ; source_utt arctic_b0201 ; source_ph1 “[Val item]” ; source_end 3.26 ; target_cost 0 ; join_cost 210.443 ; end 1.33881 ; num_frames 10 ;
id _106 ; name @r_r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt arctic_b0201 ; source_ph1 “[Val item]” ; source_end 3.266 ; target_cost 0.0625 ; join_cost 0 ; end 1.38719 ; num_frames 7 ;
id _107 ; name r_k ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 3 ; source_utt arctic_a0197 ; source_ph1 “[Val item]” ; source_end 2.784 ; target_cost 0.1875 ; join_cost 574.409 ; end 1.48956 ; num_frames 10 ;
id _108 ; name k_ei ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 10 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 2.322 ; target_cost 0 ; join_cost 170.495 ; end 1.612 ; num_frames 14 ;
id _109 ; name ei_s ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 11 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 2.436 ; target_cost 0 ; join_cost 0 ; end 1.77081 ; num_frames 18 ;
id _110 ; name s_# ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 6 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 2.608 ; target_cost 0.0625 ; join_cost 0 ; end 1.86469 ; num_frames 7 ;
id _111 ; name #_B_150 ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt nina_x1_001 ; source_ph1 “[Val item]” ; source_end 1.39 ; target_cost 0.291667 ; join_cost 392.459 ; end 1.94481 ; num_frames 8 ;
id _112 ; name B_150_# ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 7 ; source_utt nina_x1_001 ; source_ph1 “[Val item]” ; source_end 1.54 ; target_cost 0.270833 ; join_cost 0 ; end 2.02488 ; num_frames 8 ;
id _113 ; name #_t ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 0 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 2.612 ; target_cost 0.0833333 ; join_cost 392.495 ; end 2.06181 ; num_frames 3 ;
id _114 ; name t_aa ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 8 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 2.714 ; target_cost 0.3125 ; join_cost 0 ; end 2.249 ; num_frames 20 ;
id _115 ; name aa_m ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 11 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 2.95 ; target_cost 0.3125 ; join_cost 0 ; end 2.41513 ; num_frames 16 ;
id _116 ; name m_# ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 5 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 3.038 ; target_cost 0.375 ; join_cost 0 ; end 2.52394 ; num_frames 12 ;
id _117 ; name #_B_150 ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt nina_x1_001 ; source_ph1 “[Val item]” ; source_end 1.39 ; target_cost 0.291667 ; join_cost 336.479 ; end 2.60406 ; num_frames 8 ;
id _118 ; name B_150_# ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 7 ; source_utt nina_x1_001 ; source_ph1 “[Val item]” ; source_end 1.54 ; target_cost 0.270833 ; join_cost 0 ; end 2.68413 ; num_frames 8 ;
id _119 ; name #_@ ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 8 ; source_utt arctic_a0187 ; source_ph1 “[Val item]” ; source_end 0.912 ; target_cost 0.145833 ; join_cost 336.796 ; end 2.80275 ; num_frames 13 ;
id _120 ; name @_p ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 2 ; source_utt arctic_a0113 ; source_ph1 “[Val item]” ; source_end 1.212 ; target_cost 0.25 ; join_cost 384.506 ; end 2.91031 ; num_frames 12 ;
id _121 ; name p_aa ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 11 ; source_utt arctic_b0057 ; source_ph1 “[Val item]” ; source_end 1.662 ; target_cost 0.375 ; join_cost 217.368 ; end 3.06812 ; num_frames 17 ;
id _122 ; name aa_lw ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 7 ; source_utt arctic_a0203 ; source_ph1 “[Val item]” ; source_end 1.41 ; target_cost 0.145833 ; join_cost 293.85 ; end 3.19712 ; num_frames 18 ;
id _123 ; name lw_@ ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 10 ; source_utt arctic_b0382 ; source_ph1 “[Val item]” ; source_end 3.458 ; target_cost 0.1875 ; join_cost 240.775 ; end 3.26556 ; num_frames 11 ;
id _124 ; name @_jh ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 4 ; source_utt arctic_a0427 ; source_ph1 “[Val item]” ; source_end 2.744 ; target_cost 0.1875 ; join_cost 590.098 ; end 3.40162 ; num_frames 14 ;
id _125 ; name jh_ai ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 6 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 3.702 ; target_cost 0 ; join_cost 265.882 ; end 3.50825 ; num_frames 10 ;
id _126 ; name ai_z ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 10 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 3.858 ; target_cost 0 ; join_cost 0 ; end 3.66731 ; num_frames 14 ;
id _127 ; name z_d ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 3 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 3.944 ; target_cost 0 ; join_cost 0 ; end 3.71181 ; num_frames 4 ;
id _128 ; name d_hw ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 5 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 3.996 ; target_cost 0 ; join_cost 0 ; end 3.82456 ; num_frames 11 ;
id _129 ; name hw_i ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 6 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 4.124 ; target_cost 0 ; join_cost 0 ; end 3.909 ; num_frames 8 ;
id _130 ; name i_t ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 3 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 4.17 ; target_cost 0 ; join_cost 0 ; end 3.98606 ; num_frames 9 ;
id _131 ; name t_m ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 1 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 4.242 ; target_cost 0 ; join_cost 0 ; end 4.0315 ; num_frames 4 ;
id _132 ; name m_or ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 3 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 4.294 ; target_cost 0 ; join_cost 0 ; end 4.1045 ; num_frames 7 ;
id _133 ; name or_r ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 3 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 4.386 ; target_cost 0 ; join_cost 0 ; end 4.22169 ; num_frames 10 ;
id _134 ; name r_# ; ph1 “[Val item]” ; sig “[Val wave]” ; coefs “[Val track]” ; middle_frame 8 ; source_utt arctic_a0002 ; source_ph1 “[Val item]” ; source_end 4.532 ; target_cost 0 ; join_cost 0 ; end 4.31669 ; num_frames 10 ;
Ahhh, OK – so this takes us back to the idea that we discussed in the lab. Festival is currently mildly biased towards naturally-contiguous units, as they always have zero join cost. So we could, in theory, increase this bias by scaling up the actual MFCC vector values in the make_norm_join_cost_coefs script, which should have the effect of increasing the Euclidean distance between all NON-naturally-contiguous joins, which will raise their costs; but because they all go up by the same relative amount, their ‘cost relationship’ stays the same. Thus, no net effect on unit selection EXCEPT that zero-join-cost (naturally-contiguous) diphones are now more likely to ‘win’, as their zero join cost becomes more valuable in offsetting non-zero target costs. Am I getting that right?
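In other words, the property I’m relying on is just that scaling the coefficient vectors scales every Euclidean distance by the same factor, while naturally-contiguous joins stay at exactly zero:

import numpy as np

a = np.array([1.0, 2.0, 3.0])     # join coefficients for one candidate (made-up values)
b = np.array([2.0, 0.0, 4.0])     # join coefficients on the other side of the join
k = 100.0                         # the scale factor applied to the coefficients

d_before = np.linalg.norm(a - b)          # original Euclidean join distance
d_after  = np.linalg.norm(k*a - k*b)      # distance after scaling the vectors
assert np.isclose(d_after, k * d_before)  # every non-zero distance grows by the same factor
assert np.linalg.norm(k*a - k*a) == 0.0   # naturally-contiguous joins are still exactly zero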
So… to your last point, is Festival NOT considering phrase position as a target cost feature? You mention phrase position several times in the lecture slides as a potential linguistic specification, but it doesn’t show up in the chart of ‘Festival’s Target cost components’ on slide 62.
Aha!!! Now that makes sense. Thank you. So maybe this does in fact give us some intuition: numbers larger than 1 indicate that one of the ‘major penalties’, such as bad duration or bad F0, makes up at least part of the cost incurred. This in turn might indicate that the database was so sparse for this particular diphone that the ‘best’ selection was a durational/F0 outlier (still waiting for your response as to what constitutes a ‘bad F0’ value – see other post) – a kind of ‘last resort’ choice, which is likely to sound bad (hence the extreme penalty value, to discourage these diphones from ever being selected). Does that line of reasoning make sense? As Pilar pointed out in her original post, it is very difficult to ‘reverse engineer’ the target costs we are seeing, to determine why a particular unit was chosen over the other options. Any suggestions for how to carry out this detective work?
Oops, I didn’t mean ‘less than zero’, I meant ‘less than 1’. Apologies. I meant: if a sub-cost is 1, and it is then scaled by weights greater than 1, how do we end up with values between 0 and 1? In fact, the majority of target costs appear to be in the range of 0 to 1, but there are occasionally costs much higher, in the 10 to 50 range. How does this ‘weight scaling’ actually work? Is there a decimal point involved somewhere in the math that moves values into the 0–1 range? But then how do we sometimes get these much-larger-than-1 values?
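My current guess at the arithmetic (please correct me): the weighted sub-costs are divided by the sum of the weights, which keeps the ‘ordinary’ part of the cost in the 0–1 range, and the large penalties (bad duration, bad F0, etc.) are then added on top. A made-up worked example:

weights   = [4, 10, 25, 5]     # made-up weights in the 4-25 range
sub_costs = [0, 1, 0, 1]       # binary sub-costs for one candidate

base = sum(w * c for w, c in zip(weights, sub_costs)) / sum(weights)
# base = 15 / 44, roughly 0.34 -> always lands in the 0-1 range

penalty = 25.0                 # hypothetical 'bad duration' / 'bad F0' penalty
total = base + penalty         # >> 1, like the occasional 10-50 costs we see

Is that how it works, or is the normalisation done some other way?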
This is from the current IBM Watson website, where they are offering Watson’s TTS and ASR capabilities as a cloud service:
https://text-to-speech-demo.mybluemix.net/
They have a demo that lets you experiment with ‘Expressive SSML’. For example, this paragraph:
<speak>I have been assigned to handle your order status request.<express-as type="Apology"> I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience.</express-as><express-as type="Uncertainty"> We don't know when those items will become available. Maybe next week but we are not sure at this time.</express-as><express-as type="GoodNews">Because we want you to be a happy customer, management has decided to give you a 50% discount! </express-as></speak>
There is a lot of documentation about SSML on their site, regarding what kinds of tags it does and doesn’t support, etc. But they don’t divulge how it is done, so my question is this: in the example above, where they are switching the voice ‘type’ (and it does sound somewhat convincing), are they doing what I think they are doing, and what I have been experimenting with in the lab – switching to an entirely different voice that has been built from a different recorded database (same speaker, same script) recorded with these different ‘expressive’ qualities? Or is this an example of a parametric or hybrid system, where these different ‘expressive’ voice types are being imposed on the synthesis parametrically?
Also: earlier in this thread you implied that SSML was an older markup language, yet IBM still seems to be using it. Thoughts on why they’ve stuck with it? Is SSML open source? Do they perhaps have another proprietary XML that they keep for their own high-end products, like Watson when he appears on game shows, etc?
Following up on this question. Looking at Lecture 2, slides 62-64:
If sub-costs are either 0 or 1, and are then scaled by the weights, and the weights range from 4 to 25, how do we end up with some target costs that are less than zero? Is there any intuition for what constitutes a ‘high’ or ‘low’ cost, for either targets or joins?