Re: [AH] Synthesis futures

From Andrew Horton
Sent Fri, Jun 22nd 2018, 13:46

I love that if you get slightly OT talking about relevant technical
stuff, the mods will come in and panic-squash the conversation. But
this inane bullshit can go on for weeks, apparently.
On Thu, Jun 21, 2018 at 9:54 PM <xxxxxxx@xxxxxxx.xxx> wrote:
>
> Some may remember the voices of Jane Barbe and Pat Fleet... the old fashioned analog way ;):
>
> https://www.youtube.com/watch?v=0IHzWWMzqmI
>
> On June 21, 2018 at 10:43 AM Royce Lee <xxxxxxxxxx@xxxxx.xxx> wrote:
>
> The voice was nice, and much of that seems directly relevant to the kind of synthesis and sound quality that we like.
> Perhaps even more astounding were the snippets of concert piano music. I couldn't tell from the paper and website whether the sounds were synthesized or merely recordings of the performance... but I believe they were re-synthesized.
> I also thought that the fact that these neural networks operate at the sample level was of interest to us, given our, or my, general feeling that most digital synthesis has a samey, FM feel to it. Don't get me wrong, I love FM, but I love FM mostly for its limitations. Perhaps this approach would finally allow digital synthesis to break out of being a poor stepchild to analogue.
>
> On Thu, Jun 21, 2018 at 8:43 AM, John Emond <xxx.xxx@xxxxxx.xxx> wrote:
>>
>> At Bell Northern Research (BNR) we had as many people as possible recite a script. This included the tri-corporate: BNR, Northern Telecom, and Bell Canada. There was a phone number (of course) to call and recite into. The result was voice dialing and voice menu navigation. As might be expected, recognition of the numeral 4 as spoken by Chinese speakers was problematic.
>>
>> Cheers,
>>
>> John
>>
>> Monde Synthesizer gives you More
>> www.mondesynthesizer.com
>>
>> On Jun 21, 2018, at 2:39 AM, annika morgan <xxxxxx.x.xxxxxx@xxxxx.xxx> wrote:
>>
>> I’m more familiar with machine learning on data patterns in a security-engineering context, where we train on known-good and known-bad data sets over various time intervals so that our systems get better at detecting and correlating anomalous outliers based on behavioral analytics.
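>>
>> For anyone unfamiliar, here is a minimal sketch of that kind of supervised setup. The data and model choice are purely illustrative assumptions on my part (toy synthetic features, scikit-learn), not any particular product's pipeline:
>>
>> import numpy as np
>> from sklearn.ensemble import RandomForestClassifier
>>
>> rng = np.random.default_rng(0)
>> # toy per-event feature vectors: known good centered at 0, known bad shifted
>> good = rng.normal(0.0, 1.0, size=(1000, 8))
>> bad = rng.normal(1.5, 1.0, size=(1000, 8))
>> X = np.vstack([good, bad])
>> y = np.array([0] * 1000 + [1] * 1000)  # 0 = known good, 1 = known bad
>>
>> clf = RandomForestClassifier(n_estimators=100).fit(X, y)  # "training time"
>>
>> # score a new event: probability it resembles the known-bad baseline
>> event = rng.normal(1.2, 1.0, size=(1, 8))
>> print(clf.predict_proba(event)[0, 1])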
>>
>> In the case of “training time” on voice, I’m curious how long it takes and how many voice samples are needed before they can build a representative voice model.
>>
>> Some security tools on the market today that use data science to detect anomalies require many millions of known-good and known-bad sample files to be fed into a machine-learning process in order to build a reliable baseline. In this case I’m curious how many voice samples are required. Could I, for instance, feed 1 million hours of already-captioned YouTube video into this thing and train on voice + text to get a sample set reliable enough to reproduce a voice, or do I have to pay 10,000 people to come in and read 100 pre-prepared scripts?
>>
>> I’m mostly curious what level of effort their training exercise requires. I should have been more specific, apologies.
>>
>> On Wed, Jun 20, 2018 at 10:40 PM Mike Perkowitz <xxxx@xxxxxxxxx.xxx> wrote:
>>>
>>> It's a machine-learning algorithm, so "training" is when the algorithm examines examples of the thing it's going to model. So "at training time..." means the algorithm is given recordings of human speakers, which it analyzes to produce a model that can spit out the kinds of sounds they demonstrate.
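>>>
>>> As a toy illustration of that train-then-sample split (not WaveNet itself, just a simple linear autoregressive model I made up for the example, with a sine wave standing in for real speech):
>>>
>>> import numpy as np
>>>
>>> # stand-in "recording": a plain sine wave instead of a human speaker
>>> wave = np.sin(np.linspace(0, 200 * np.pi, 16000))
>>>
>>> # training time: fit an order-k predictor on windows of real samples
>>> k = 32
>>> X = np.stack([wave[i:i + k] for i in range(len(wave) - k)])
>>> y = wave[k:]
>>> coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
>>>
>>> # after training: sample the model one value at a time to synthesize audio
>>> out = list(wave[:k])
>>> for _ in range(1000):
>>>     out.append(coeffs @ np.array(out[-k:]))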
>>>
>>> I think "training" in the context of a speech recognition tool like Dra=
gon Naturally refers to the time the user has to spend teaching the tool to=
 recognize their voice. totally different :)
>>>
>>> On Wed, Jun 20, 2018 at 9:19 PM, annika morgan <xxxxxx.x.xxxxxx@xxxxx.xxx> wrote:
>>>>
>>>> “At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances.”
>>>>
>>>> I’m curious what “training time” means exactly.
>>>>
>>>> On Wed, Jun 20, 2018 at 7:44 PM Royce Lee <xxxxxxxxxx@xxxxx.xxx> wrote:
>>>>>
>>>>> This is probably old news to some of you, but it seems like this kind of software would make for good music synthesis.
>>>>>
>>>>> https://deepmind.com/blog/wavenet-generative-model-raw-audio/