
[Freecats-Dev] HTML support & conversion filters (cont.)


From: Henri Chorand
Subject: [Freecats-Dev] HTML support & conversion filters (cont.)
Date: Sun, 09 Feb 2003 14:42:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Thierry Sourbier wrote:

>> - another closely related issue is that JCAT only understands
>> well-formed HTML and XML. Due to this, it won't be able to work
>> on at least 80% of existing HTML files. This is why we prefer a
>> "dumb" approach.

> 1. It is easier to have a "dumb" parser read well-formed HTML than a
> "smart" parser able to read "dumb" HTML.

Well, for me the question was rather how to read malformed HTML in the
first place. ;-)

> Indeed, for malformed HTML it is not only a matter of tags being
> misplaced or missing, but also of knowing what is a tag and what is
> not, e.g.: "<b> This character "<" can mess up everything </b>".

I think we can:
- start from a comprehensive list of HTML tags as defined in
http://www.w3.org/TR/html401/ (the version of 24 December 1999)
- possibly add a number of widely used custom tags, like the "KADOV" stuff in RoboHelp's files, and the known extensions of MS IE and Netscape
- enable the user to add a number of "custom" tags (<something>); a sketch of such a tag table follows
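
To make the tag table concrete, here is a minimal sketch in Python. All names (INTERNAL_TAGS, add_custom_tag, ...) are illustrative assumptions of mine, not Free CATS code, and only a small subset of the HTML 4.01 elements is shown:

# Illustrative sketch: a tag table seeded from the HTML 4.01 element list,
# split by the two processing categories described just below.

# Internal (formatting) tags travel inside segments.
INTERNAL_TAGS = {"b", "i", "u", "em", "strong", "font", "span", "a", "br", "img"}

# External (structure) tags stay outside segments, in the file skeleton.
EXTERNAL_TAGS = {"html", "head", "body", "p", "div", "table", "tr", "td",
                 "ul", "ol", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

# Widely used vendor extensions, e.g. RoboHelp's "kadov" tags.
EXTERNAL_TAGS |= {"kadov"}

def add_custom_tag(name, internal=False):
    """Let the user declare an extra <something> tag before conversion."""
    (INTERNAL_TAGS if internal else EXTERNAL_TAGS).add(name.lower())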

When converting a supposedly HTML file:
- for each of the above tags, apply our "recognize as tag" processing rule according to its category:
    internal (formatting), e.g. <B>
    external (structure), e.g. <P>
- consider anything else as text, replacing any "<" and ">" found in the source file by the corresponding HTML entities, &lt; and &gt;
- convert the text to Unicode according to the character encoding specified, defaulting to Western (a sketch of this pass follows)
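
Here is a minimal sketch of that "dumb" pass, building on the tag table sketched above; the regular expression, the default code page and the function names are my assumptions:

import re

KNOWN_TAGS = INTERNAL_TAGS | EXTERNAL_TAGS  # the tables sketched above

# A deliberately dumb pattern: "<", optional "/", a tag name, then
# anything that is not an angle bracket, up to ">".
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")

def escape(chunk):
    # Stray "<" and ">" become their HTML entities, as described above.
    return chunk.replace("<", "&lt;").replace(">", "&gt;")

def convert(raw_bytes, encoding="cp1252"):  # default to Western
    text = raw_bytes.decode(encoding, errors="replace")  # to Unicode first
    out, pos = [], 0
    for m in TAG_RE.finditer(text):
        if m.group(1).lower() not in KNOWN_TAGS:
            continue  # unrecognized "tag": left for the text branch below
        out.append(escape(text[pos:m.start()]))  # text before the tag
        out.append(m.group(0))                   # the recognized tag, verbatim
        pos = m.end()
    out.append(escape(text[pos:]))               # trailing text
    return "".join(out)

Note that, since pos is not advanced for unrecognized "tags", they simply end up inside the escaped text chunks, i.e. they are kept verbatim for the translator to see and leave untouched.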

That way, we should be able to process most, if not all, files.
Remember that we don't want to alter the file's structure & text contents; we only want to enable the translator to edit the text contents. That way, the translator will still be able to deal manually with any non-recognized weird "tag", simply by leaving it untouched whenever one turns up.

So, with a "garbage" file, we only risk keeping "weird" "tags" in the segments; we won't try to understand them. My localization experience indicates that such custom "tags" are more often found in the document's structure (e.g. "kadov" tags in headers) than really mixed with text.

Internal tags are to be kept within TUs' source & target segments. The translator will freely decide whether to keep them (possibly moving them), to delete them, or to add new ones (only if considered internal) in the target segment. External tags are to be kept outside TUs, then restored as they were. That way, even for very bad formatting, we don't risk producing a worse output.
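
Continuing the sketches above, the internal/external split could look like this; the token representation is a simplifying assumption of mine about what the converter would emit:

def split_segments(tokens):
    """Cut a token stream into TUs at external tags.

    tokens: ("tag", name, raw) or ("text", None, raw) triples.
    Returns (skeleton, segments): the skeleton holds external tags plus
    TU placeholders and is written back unchanged on export.
    """
    skeleton, segments, current = [], [], []

    def flush():
        if current:
            skeleton.append(("TU", len(segments), None))
            segments.append(list(current))
            current.clear()

    for kind, name, raw in tokens:
        if kind == "tag" and name in EXTERNAL_TAGS:
            flush()                             # close the open segment, if any
            skeleton.append((kind, name, raw))  # external tag: skeleton only
        else:
            current.append((kind, name, raw))   # text & internal tags: in the TU
    flush()
    return skeleton, segments

Because the skeleton is restored verbatim, the output structure can never come out worse than the input.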

> 2. Most malformed HTML files can be made compliant with the standard by
> running them through Tidy. See http://tidy.sourceforge.net/. In a web
> l10n product I worked on before, Tidy was part of the workflow.

Well, why not?
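
If we go that route, Tidy could simply be slotted in front of the filter. A possible invocation, assuming the tidy command-line tool is installed (the file name is just an example):

import subprocess

# Pre-process with Tidy: -q suppresses the report, -asxhtml rewrites the
# file as well-formed XHTML, -utf8 sets input/output encoding.
# Tidy exits with 1 on warnings and 2 on errors, so don't use check=True.
result = subprocess.run(
    ["tidy", "-q", "-asxhtml", "-utf8", "broken.html"],
    capture_output=True,
)
cleaned = result.stdout.decode("utf-8")  # Tidy writes the result to stdout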


Anyway, I suggest we begin by writing a simple pair of conversion filters for ANSI text-only files (Notepad text files), so that we can test the server.
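
That first pair could be as small as this; a sketch assuming one segment per line and the Western (cp1252) code page, with names of my own choosing:

def import_txt(path, encoding="cp1252"):
    """Read an ANSI text file and return one source segment per line."""
    with open(path, "r", encoding=encoding) as f:
        return f.read().splitlines()  # Unicode inside the server

def export_txt(path, segments, encoding="cp1252"):
    """Write the translated segments back in the original encoding."""
    with open(path, "w", encoding=encoding, newline="\r\n") as f:
        f.write("\n".join(segments))  # Notepad expects CRLF line ends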

The next useful "exercise" could be to develop a number of still very simple conversion filters for some common resource files (including common online help formats), such as:
.CNT, .HPJ, .HHK, .HHC, .HHP, .RC
Note that nothing exists for most of these in proprietary CAT tools.
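
To illustrate how simple such filters can be, here is a sketch for one of them, a .RC STRINGTABLE reader. Real .RC files have more cases (DIALOG captions, MENU items, escape sequences), so this only shows the general shape:

import re

# Matches a STRINGTABLE entry such as:   IDS_HELLO "Hello, world"
ENTRY_RE = re.compile(r'^\s*(\w+)\s*,?\s*"(.*)"\s*$')

def extract_rc_strings(lines):
    """Yield (identifier, translatable text) pairs from a .RC file."""
    in_table = False
    for line in lines:
        stripped = line.strip().upper()
        if stripped.startswith("STRINGTABLE"):
            in_table = True
        elif stripped == "END":
            in_table = False
        elif in_table:
            m = ENTRY_RE.match(line)
            if m:
                yield m.group(1), m.group(2)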


>> Thierry, what would you think about establishing a contact for us
>> with Yves?

> I've already pointed him to the project home page.

Thanks for this!


Regards,

Henri




