
[Wp-mirror-list] Fwd: [Xmldatadumps-l] Template expansion inconsistency


From: wp mirror
Subject: [Wp-mirror-list] Fwd: [Xmldatadumps-l] Template expansion inconsistency
Date: Sun, 23 Feb 2014 09:31:53 -0500

Dear Nemo,

Thanks for enlightening me regarding <title>.  I did not know that it
was intended to be a compound of the namespace word and the
`page_title' field.

Still, I have some thoughts on this matter.

1) importDump.php

As of WP-MIRROR 0.6, `importDump.php' is no longer used.

The disadvantage of `importDump.php' is that it is slow.  Importing
`enwiki' takes about two months, which is longer than the interval
between XML dumps.

The advantage of `importDump.php' is that it handles any idiosyncrasy
(such as compound <title> entries) in the XML dumps.

2) mwxml2sql

As of WP-MIRROR 0.6, `mwxml2sql' is used to convert the XML dump into
a set of SQL dumps (for the `page', `revision', and `text' tables),
which can then be loaded directly into the underlying database tables.

The advantage of `mwxml2sql' is that it is very fast.  When used in
conjunction with MySQL 5.5 fast index creation, it loads `enwiki' in
80% less time.

The disadvantage is that it faithfully copies the <title> field into
the SQL statement for INSERTing the `page_title' field.  We now know
that this results in pages from the Template and other namespaces
not being found by MediaWiki, which then renders them as red-links.
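
To illustrate the failure mode, here is a minimal sketch in Python.
It assumes that a page is looked up by the pair (`page_namespace',
`page_title'), that `page_title' is expected to hold the title
without its namespace word, and that 10 is the Template namespace.
The dictionary standing in for the `page' table and both helper
functions are hypothetical, made up for this example.

```python
# Toy stand-in for the `page' table, keyed by (page_namespace, page_title).
TEMPLATE_NS = 10  # namespace number for Template:


def insert_verbatim(table, ns, title):
    """Mimic copying <title> unchanged into `page_title'."""
    table[(ns, title)] = True


def page_exists(table, ns, bare_title):
    """Mimic a lookup by namespace number and prefix-free title."""
    return (ns, bare_title) in table


page = {}
insert_verbatim(page, TEMPLATE_NS, "Template:Infobox")

# The lookup uses the bare title, so the row is not found (a red-link).
found_bare = page_exists(page, TEMPLATE_NS, "Infobox")

# The row is only reachable under the compound title it was stored with.
found_compound = page_exists(page, TEMPLATE_NS, "Template:Infobox")
```

The row exists, but under a key that the lookup never asks for.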

3) First Normal Form

One issue in the back of my mind concerns the recent changes in the
XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds
a separate namespace tag''.  To my mind, the presence of the <ns>
field should obviate the need to include a namespace word (e.g.
`Category:', `Template:', etc.) within the <title> field.

The principle is known as first normal form (1NF), which basically
means that the contents of a field should be atomic rather than
compound.

4) Solution

Granted that the objective is to faithfully mirror the WMF database
tables, the issue before us is this:  Where along the tool chain
should the patch be made?

a) My instinct is to correct the issue upstream (the XML dump
generation phase):
    The WMF `page_namespace' field should be copied to the <ns> field.
    The WMF `page_title' field should be copied to the <title> field.
    Adhere to principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML
dump prior to feeding it into `mwxml2sql'.  This I have done.
c) Third best would be to patch `mwxml2sql'.  This I also favor, but
I would like some guidance from its author, Ariel Glenn, before I
start hacking.
d) A last resort would be to write an SQL query to clean up compound
`page_title' entries in the mirror's database.  But I would really
rather not load unnormalized data in the first place.
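
The kind of normalization meant in (b) can be sketched in Python as
follows.  This is an illustration only, not the actual WP-MIRROR
patch.  The names `NS_NAMES', `strip_namespace' and `normalize_pages'
are hypothetical, the namespace map is hard-coded rather than read
from the dump's <siteinfo> header, and the sample input omits the XML
namespace declaration that real dumps carry on their root element
(with it, the tag names would have to be namespace-qualified).

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace map; a real dump lists these in <siteinfo>.
NS_NAMES = {10: "Template", 14: "Category"}


def strip_namespace(ns, title, ns_names=NS_NAMES):
    """Return the title without its namespace word, if one is present."""
    prefix = ns_names.get(ns)
    if prefix and title.startswith(prefix + ":"):
        return title[len(prefix) + 1:]
    return title


def normalize_pages(xml_text):
    """Rewrite each <title> so that it no longer repeats what <ns> says."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        ns = int(page.findtext("ns"))
        title = page.find("title")
        title.text = strip_namespace(ns, title.text)
    return root


SAMPLE = """<mediawiki>
  <page><title>Template:Infobox</title><ns>10</ns></page>
  <page><title>Category:Physics</title><ns>14</ns></page>
  <page><title>Earth</title><ns>0</ns></page>
</mediawiki>"""

titles = [p.findtext("title") for p in normalize_pages(SAMPLE).iter("page")]
# titles is now ["Infobox", "Physics", "Earth"]
```

The prefix is removed only when it matches the name registered for
that page's <ns> value, so a main-namespace title that merely
contains a colon is left untouched.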

Sincerely Yours,
Kent

On 2/22/14, Federico Leva (Nemo) <address@hidden> wrote:
> wp mirror, 22/02/2014 23:40:
>> Still, it would be nice if the dump files could be fixed.
>
> Fixed? <title> is the full page name as it's supposed to be. Either
> you're doing something wrong with the import, or the import
> script/special page has a bug (not uncommon, but needs a bug report with
> steps to reproduce). I see nothing to blame on the export side.
>
> Nemo
>
> _______________________________________________
> Xmldatadumps-l mailing list
> address@hidden
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>


