[Bug-gnubg] Re: Training neural nets: How does size matter?


From: pepster
Subject: [Bug-gnubg] Re: Training neural nets: How does size matter?
Date: Mon, 02 Sep 2002 19:20:17 +1200

(Using web mail - no spelling checker - sorry for the numerous spelling errors)

Douglas Zare writes:
Quoting Øystein O Johansen <address@hidden>:

About the above statement: 1000K parameters? 250K parameters? This
sounds like a lot to me. The networks gnubg is using have 250
input nodes and 128 hidden nodes. That's 32640 weights. Is that
what you call parameters?

Basically. It does mean the weight files I'm using are too large to fit on a floppy disk.
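
For concreteness, the 32640 figure is what a fully connected single-hidden-layer net with gnubg's five outputs gives, counting weights and leaving out bias terms (a quick check, not gnubg code):

    # 250 inputs -> 128 hidden -> 5 outputs (win, gammon, backgammon
    # for each side); bias terms left out of the count
    inputs, hidden, outputs = 250, 128, 5
    weights = inputs * hidden + hidden * outputs
    print(weights)  # 32640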

A curious criterion for this day and age. If you tell me your whole program fits on one floppy, I resign now!
There is not enough data to even hazard a guess.
- Did the two nets differ only in the number of hidden nodes, or also in the number of inputs?
- Were they trained on the same data set?
- Of what size?
- Which training method? My totally baseless hunch would be that the bigger net did not "mature" or did not "mature gracefully". In other words, a big net can match a data set at many arbitrary points without developing the "right concepts". But again, this assumes either that the data set was too small or that there was not enough training, which I understand is unlikely because you have both the computing power and innovations in training.
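
A toy illustration of that hunch, using polynomial degree as a stand-in for net capacity (nothing backgammon-specific, just the general effect):

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(0.0, 1.0, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 10)
    x_test = np.linspace(0.0, 1.0, 200)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (3, 9):   # modest capacity vs. enough to hit every point
        coef = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
        # degree 9 matches the training set almost exactly, yet
        # generalizes worse: "arbitrary points" without the right concept
        print(degree, train_err, test_err)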
Many years ago, I spoke to Fredrik Dahl. He doesn't say much about
the JellyFish development, but the one thing he said was that there
wasn't much point in having too many nodes -- the training process
will just be slower.

It is valuable on multiple levels to be able to evaluate the network rapidly. However, more weights does not necessarily mean slower evaluations.

I wonder what you mean by that? Does it mean you use a sparse net topology? Otherwise it seems to me that more weights must result in a slower net.
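
One way the remark could hold, sketched here as an assumption rather than Douglas's stated method: with a sparse topology, evaluation cost tracks the number of nonzero weights, not the nominal layer sizes.

    import numpy as np
    from scipy.sparse import random as sparse_random

    rng = np.random.default_rng(3)
    dense_W1 = rng.random((128, 250))   # 32000 multiply-adds per evaluation
    sparse_W1 = sparse_random(512, 250, density=0.1, format="csr", random_state=3)
    x = rng.random(250)                 # a stand-in for an encoded position
    h_dense = dense_W1 @ x              # cost ~ 128 * 250
    h_sparse = sparse_W1 @ x            # cost ~ sparse_W1.nnz, about 12800:
                                        # more hidden nodes, fewer multiply-adds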
I much prefer the crispness of Jellyfish's rapid play to Snowie's sluggishness, particularly given that Snowie does not seem to have a big advantage in money play. (I look forward to seeing more data on this.) I think FD mentioned that Jellyfish uses about 20K weights. Computing power has improved quite a bit since then, of course.
I think this is Joseph's experience as well. When he started to work
on the gnubg networks he actually removed some of the input nodes
that he believed didn't contribute to the training. I have also asked
him about adding specific input nodes, but after some training with
these input nodes, he concluded that the new input nodes don't
contribute, or that the weights connected to these inputs don't converge.

Do you mean training from scratch using the new inputs, or adding the new input with initially low weight to an existing trained network?

I have tried both in the past. My concern (which I assume is the same as yours) is that starting with an existing net will discriminate against the new inputs. I now believe this is not a problem for me, given my training method. However, if I believe an input *should* contribute, I will try both. As I noted in the past, I think it is very hard to prove an input does not contribute.
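
A minimal sketch of the second option (plain numpy, an illustration rather than anyone's actual code): graft the new input onto a trained net with near-zero weights, so existing evaluations start out unchanged and training decides whether the input earns its keep.

    import numpy as np

    def add_input(W1, scale=1e-3, seed=2):
        """W1 is the trained input-to-hidden matrix, shape (n_hidden, n_inputs).
        Returns a (n_hidden, n_inputs + 1) matrix with a tiny new column."""
        rng = np.random.default_rng(seed)
        new_col = rng.normal(0.0, scale, (W1.shape[0], 1))  # tiny, to break symmetry
        return np.hstack([W1, new_col])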

Check also the history of eval.c [ref. 2] and look for changes made
by Joseph.
> First, roughly what level of improvement do you expect with mature
> networks of different numbers of hidden nodes?
No idea! I have never seen a ppg vs. hidden node chart either. I think
Tesauro gradually increased the number of hidden nodes: he started at
only 40, increased to 80, and then used 160 hidden nodes in
TD-Gammon 3.1 [ref. 1].

I'm familiar with his descriptions in earlier articles. However, I don't know what the corresponding improvements in playing strength are supposed to have been (the performance in short sessions is inconclusive, of course), and whether he felt that the networks were fully trained. It would probably be worth training networks with only a few hidden nodes and a fixed input set to see how well they perform. It wouldn't take much computing time to train the networks, but I had hoped that you all had already done it.
> The quality of a neural net is hard to quantify abstractly, so one
> could pin it down to, say, correct absolute evaluations in
> non-contact positions for the racing net, or elo, or cubeless ppg
> against a decent standard.
Yes, this is one of the problems!

This is more medicine than science. I think one should pick a few benchmarks and use them, and if they aren't enough, add more. Which benchmarks are you set up to use so far?

This is exactly what I am trying to establish for the crashed net, and whether it works for the contact net as well. There can't be one benchmark, but several in succession. I know you disapprove of my method, but I am not convinced, and definitely not aware of anything better. Chances are I am wrong, but I have to learn it from my own experience.

> I don't think Snowie 3's nets were mature, but if they
> and Snowie 4's nets are, then how much of an improvement should one
> expect to see if Snowie 4 has neural nets with twice as many hidden
> nodes?
Same answer as above: I have no idea! Maybe Joseph has an idea.

I don't believe the assumptions, but my guess is that the answer is more than 0.02 ppg.
> Second, how many fewer nodes can you use for the same quality, if
> you release the net from predicting what is covered in the racing
> database?
You don't train a network to evaluate something it is not supposed to
evaluate in the future, do you?

Of course. You do, too, from what you write below.

Again I am not sure what Doug is saying here. Of course the race net is not trained on bearoff positions, and is not burdened with such knowledge. However, it might be worthwhile to check if adding some "limiting" cases would improve the transition gaps (limiting, i.e. race positions arrived at by one play from a contact position).

I noticed a jump in the performance of the contact network after the
crashed positions were separated out, and the network was trained only
on "contact" positions. It was as if some brain capacity was released,
and this capacity was used to improve the game in the contact
positions.

> Third, Tesauro mentions that a neural network seems to learn a
> linear regression first. Are there other describable qualitative
> phases that one encounters? For example, does a neural network with
> 50 nodes first imitate the linear regression, then a typical mature
> 5 node network, then 10 node?
I have no idea whatsoever!

It's probably worth taking some time to understand these smaller nets. From the time of TD-Gammon onwards, backgammon programs have been better than almost all human players, which sharply limits who can usefully critique their play. However, a network with intermediate play can be analyzed in a helpful fashion by any human expert.
> It might be wishful thinking, but if it is the case, it might be
> possible to retain most of the information by training a smaller
> network to imitate the larger network's evaluations. The smaller
> network might be faster to train, and then one could pass the
> information back.

(Again) it is not clear to me what "passing back the info" means. What I intend to do is add as many possible inputs as we come up with, train, then try to prune the non-contributing ones, and then see what the smallest net is that can model the data. In other words, start big and work backwards.
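
For comparison, a minimal sketch of the quoted imitation idea (plain numpy; the random "teacher" net and the uniform inputs are placeholders, not gnubg's nets or encoding): train a small net to regress onto the large net's evaluations instead of rollout or TD targets.

    import numpy as np

    rng = np.random.default_rng(0)

    def init(n_in, n_hidden, n_out):
        return {"W1": rng.normal(0, 0.1, (n_hidden, n_in)), "b1": np.zeros(n_hidden),
                "W2": rng.normal(0, 0.1, (n_out, n_hidden)), "b2": np.zeros(n_out)}

    def forward(p, x):
        h = np.tanh(p["W1"] @ x + p["b1"])
        return p["W2"] @ h + p["b2"], h

    def sgd_step(p, x, target, lr=0.01):
        y, h = forward(p, x)
        err = y - target                        # gradient of squared error wrt y
        dh = (p["W2"].T @ err) * (1.0 - h * h)  # backprop through tanh
        p["W2"] -= lr * np.outer(err, h)
        p["b2"] -= lr * err
        p["W1"] -= lr * np.outer(dh, x)
        p["b1"] -= lr * dh

    teacher = init(250, 128, 5)   # stands in for the mature large net
    student = init(250, 5, 5)     # the small imitator
    for _ in range(2000):
        x = rng.random(250)                 # placeholder encoded position
        target, _ = forward(teacher, x)
        sgd_step(student, x, target)        # fit the teacher's evaluation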
This is roughly what Joseph is doing in fibs2html and mgnu_zp and other
friends. He has a very small network with only 5 hidden nodes. This
network is not only faster to train, but of course also faster to
evaluate. As I understand it, this net is used to prune candidates for
the real network. Joseph says this is a huge speed improvement.
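
A sketch of that two-stage scheme; tiny_eval and full_eval are hypothetical stand-ins for the 5-hidden-node net and the real one, not gnubg's actual API.

    def best_move(candidates, tiny_eval, full_eval, keep=8):
        # cheap pass: the tiny net orders every legal move
        ranked = sorted(candidates, key=tiny_eval, reverse=True)
        # expensive pass: the full net only sees the few survivors
        return max(ranked[:keep], key=full_eval)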

That's what Jellyfish level 3 is (though not specifically 5 nodes), right? Though its play seems laughable to me now, playing primarily against JF level 3 took me from the novice level (I learned that an opening 6-1 should not be played 13/6 in July or August of 1999) to 1800 on FIBS in a few months. I don't think that I would have learned as quickly from a slower program that played more accurately.
> Are there thresholds for the number of nodes necessary with one
> hidden layer before particular backgammon concepts begin to be
> understood? Again, in the Tesauro article [Ref. 1], he writes:
   "The largest network examined in the raw encoding experiments had
   40 hidden units, and its performance appeared to saturate after
   about 200,000 games. This network achieved a strong intermediate
   level of play approximately equal to Neurogammon."

That doesn't say that the raw encoding (which is already quite clever, including a lot of backgammon understanding) understood the same concepts that Neurogammon understood. Further, I'm more interested in the performance of networks that have more complicated inputs than the raw encoding. In my experience intermediate level play is achievable in a few GHz-minutes.
> In chess, people say that with enough lookahead, strategy becomes
> tactics, but how many nodes do you need before the timing issues
> of a high anchor holding game are understood by static evaluations?
> How many for a deep anchor holding game?
Hard to say. But I must also say, I believe (and I might be wrong)
that this does not necessarily depend that much on the size of the
net, but rather on how it is trained.

I think it clearly does depend on the size of the net, though of course a huge net might not play optimally for its size. By the size of the net, I don't just mean possible increases in size, but also possible decreases. So I mean that if you shrink the net too much, it won't be able to understand, e.g., what a safe contact bearoff structure looks like, or how to build 3 stacked points into a prime on 0-ply. Of course, asymptotically, perfect play is achievable with sufficiently many nodes. It may well be the case that gnu's network is large enough to play much better than it does, and that training is the key to improvement.
Now to my questions:

I'm going to skip most of these. Suffice it to say that I and others working with me have introduced some innovations to most of these, but I don't want to describe them yet.
How do you evaluate races? Do you have different inputs for your
race network? Or do you use a race database? If you use a NN,
how did you train this net? Is TD completely useless here?

I'll answer here. First, from one-sided databases we have constructed some lookup tables that are used as inputs. For example, the lookup table can include the exact chances of winning at DMP against a pure n-roll position for each n. Second, one can apply these lookup tables even in contact positions.
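
One simplified way such a table could be built (an assumption about the method, not gnubg's actual code): in a pure roll-vs-roll race, every roll brings a side one step closer and a double (probability 1/6) counts as two steps, so the cubeless winning chance for the side on roll satisfies a short recursion.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def win_on_roll(a, b):
        """P(side on roll wins) needing a rolls vs. an opponent needing b."""
        if a <= 1:
            return 1.0                  # any roll, double or not, finishes
        def after(remaining):
            # if we are done we have won; otherwise the opponent is on roll
            return 1.0 if remaining <= 0 else 1.0 - win_on_roll(b, remaining)
        return (1.0 / 6.0) * after(a - 2) + (5.0 / 6.0) * after(a - 1)

    print(win_on_roll(2, 2))            # 31/36, the classic two-roll position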
How do you benchmark your nets?

I use a variety of methods. One is through checking evaluations of reference positions. Another is the level of disagreement between plies (see my "Bot Confusion" column). I expect soon to include rollouts of positions with one-sided errors, and variance-reduced play against opponents of fixed strength.
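
A sketch of the ply-disagreement benchmark; evaluate_ply(position, ply) is a hypothetical stand-in for the engine's evaluation call, not a real gnubg function.

    def ply_disagreement(positions, evaluate_ply):
        """Mean absolute equity gap between 0-ply and 2-ply evaluations;
        a smaller gap suggests a more self-consistent net."""
        gaps = [abs(evaluate_ply(p, 0) - evaluate_ply(p, 2)) for p in positions]
        return sum(gaps) / len(gaps)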
Douglas Zare


I would like to stress again my current view about the importance of the data set used to train the net. I think it is rarely considered as a problem, but nowadays I tend to see it as crucial. It takes me longer to generate it than to train the net. -Joseph


