I am currently doing Search Engine Optimization (SEO) on the HTML
pages of my web site (on Win XP Home SP3). For that it is important
to know which 3 words are used most frequently on a page, so I
wrote a cross-referencer (in C) to find them. The 2nd step is to
find the 3 most frequently used word groups consisting of 2 words.
The results of both should be combined.
Now I have several possibilities. It is easy to do this in C as
well. Alternatives are using flex, or the combination of flex and
bison.
To have Flex identify a word is easy:
[-0-9A-Za-z]+
So is the identification of 2 words:
[-0-9A-Za-z]+(" "|\t)[-0-9A-Za-z]+
The easiest way to implement this is to write 2 programs and
combine the results manually.
Now my question is: can both be combined in 1 Flex program, or in 1
Flex and Bison program? Flex prefers the longest match, so the
2-word rule will win and the single words will not be found. Does
this imply that I should introduce some functionality like a
'Moving Average Filter'? Are there better solutions?
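
To make the 'moving' idea concrete, here is a minimal sketch in plain C
(not Flex) of what I have in mind: scan one word at a time, remember the
previous word, and count both the word and the (previous, current) pair.
All names (struct table, bump, lookup, count_text) and the limits
MAX_KEYS and MAX_WORD are made up for illustration, the linear-search
table is only there to keep the sketch short, and picking the top 3 is
left out. I assume two words form a group only when exactly one blank or
tab separates them, as in the pattern above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 1024  /* assumed capacity of each table */
#define MAX_WORD 64    /* assumed maximum word length, incl. '\0' */

struct entry { char key[2 * MAX_WORD + 2]; int count; };
struct table { struct entry items[MAX_KEYS]; int used; };

/* Increment the count for key, adding it first if it is new.
   New keys are silently dropped once the table is full. */
void bump(struct table *t, const char *key)
{
    for (int i = 0; i < t->used; i++) {
        if (strcmp(t->items[i].key, key) == 0) {
            t->items[i].count++;
            return;
        }
    }
    if (t->used < MAX_KEYS) {
        snprintf(t->items[t->used].key, sizeof t->items[0].key, "%s", key);
        t->items[t->used].count = 1;
        t->used++;
    }
}

/* Return the count stored for key, or 0 if it was never seen. */
int lookup(const struct table *t, const char *key)
{
    for (int i = 0; i < t->used; i++)
        if (strcmp(t->items[i].key, key) == 0)
            return t->items[i].count;
    return 0;
}

/* Same character class as the Flex pattern [-0-9A-Za-z] (in the C locale). */
int is_word_char(int c)
{
    return c == '-' || isalnum(c);
}

/* Count single words and 2-word groups in one pass over text. */
void count_text(const char *text, struct table *words, struct table *pairs)
{
    char prev[MAX_WORD] = "";      /* previous word; empty means none yet */
    char cur[MAX_WORD];
    char pair[2 * MAX_WORD + 2];
    const char *p = text;

    while (*p) {
        /* Skip the separator run and note whether it is a single blank/tab. */
        const char *sep = p;
        while (*p && !is_word_char((unsigned char)*p))
            p++;
        size_t seplen = (size_t)(p - sep);
        int adjacent = (seplen == 1 && (*sep == ' ' || *sep == '\t'));
        if (!*p)
            break;

        /* Copy the word; anything beyond MAX_WORD - 1 chars is truncated. */
        size_t n = 0;
        while (is_word_char((unsigned char)*p)) {
            if (n < MAX_WORD - 1)
                cur[n++] = *p;
            p++;
        }
        cur[n] = '\0';

        bump(words, cur);
        if (prev[0] && adjacent) {
            snprintf(pair, sizeof pair, "%s %s", prev, cur);
            bump(pairs, pair);
        }
        strcpy(prev, cur);         /* slide the window by one word */
    }
}
```

Called with two zero-initialised tables, count_text() fills both in a
single pass, so the word counts and the pair counts can be combined
afterwards. Because the window slides by one word, overlapping pairs
("a b" and "b c") are both counted, which a single longest-match rule
for 2 words would not give me.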