
[GNUnet-SVN] r23456 - Extractor-docs/WWW


From: gnunet
Subject: [GNUnet-SVN] r23456 - Extractor-docs/WWW
Date: Mon, 27 Aug 2012 20:53:14 +0200

Author: grothoff
Date: 2012-08-27 20:53:14 +0200 (Mon, 27 Aug 2012)
New Revision: 23456

Modified:
   Extractor-docs/WWW/documentation.html
Log:
docu updates

Modified: Extractor-docs/WWW/documentation.html
===================================================================
--- Extractor-docs/WWW/documentation.html       2012-08-27 18:52:56 UTC (rev 23455)
+++ Extractor-docs/WWW/documentation.html       2012-08-27 18:53:14 UTC (rev 23456)
@@ -6,8 +6,8 @@
 <meta name="language" content="en">
 <meta name="description" content="Documentation for libextractor, a simple 
library for meta-data extraction.">
 <meta name="author" content="Vids Samanta and Christian Grothoff">
-<meta name="rights" content="(C) 2002,2003,2004,2005,2006,2007,2008,2009,2010 
by Vids Samanta and Christian Grothoff">
-<meta name="keywords" content="keyword, extraction, mp3, html, pdf, images, 
jpeg, gif, ps, mime, doc, xls, sxw, sdw, ogg, dvi, deb, zip, qt, asf, mpeg, 
elf, real, png, ppt, id3, id3v2">
+<meta name="rights" content="(C) 2002-2012 by Vids Samanta and Christian 
Grothoff">
+<meta name="keywords" content="keyword, extraction, mp3, html, images, jpeg, 
gif, ps, mime, doc, xls, sxw, sdw, ogg, dvi, deb, zip, qt, asf, mpeg, elf, 
real, png, ppt, id3, id3v2">
 <meta name="robots" content="index,follow">
 <meta name="revisit-after" content="28 days">
 <meta name="content-language" content="en">
@@ -34,20 +34,20 @@
 </td>
 
 <td valign="top"><h2>Further documentation</h2>
-This documentation covers the major aspects of libextractor in brief.
+This documentation covers the major aspects of GNU libextractor in brief.
 More details can be found in the GNU libextractor manual (<a 
href="extractor.html">html</a>, <a href="extractor.pdf">pdf</a>).
 The man pages for <a href="man/extract.html">extract</a> and <a 
href="man/libextractor.html">libextractor</a> are also on-line.
 <br>
-An article describing libextractor was published in the <a 
href="http://www.linuxjournal.com/">LinuxJournal</a> and is available <a 
href="http://www.linuxjournal.com/article/7552">here</a>.  That article 
describes the API for versions 0.0.0 to 0.5.23 and not the more recent 0.6.x 
API.
+An article describing GNU libextractor was published in the <a 
href="http://www.linuxjournal.com/">LinuxJournal</a> and is available <a 
href="http://www.linuxjournal.com/article/7552">here</a>.  That article 
describes the API for versions 0.0.0 to 0.5.23 and not the more recent 0.6.x 
API.
 
 <a name="copyright"></a>
 <h2>Copyright and Contributions</h2>
-libextractor is released under the GNU General Public License.
+GNU libextractor is released under the GNU General Public License.
 All contributions must thus be put under the <a 
href="http://www.gnu.org/copyleft/gpl.html">GNU Public License (GPL)</a> or a 
compatible license.
 
 <h3>Mailing lists</h3>
 
-<p>libextractor has a mailing list for discussion of anything related to the 
project:
+<p>GNU libextractor has a mailing list for discussion of anything related to 
the project:
 <a href="mailto:address@hidden">&lt;address@hidden&gt;</a>.</p>
 
 <p>To subscribe to this or any GNU mailing lists, please send an empty
@@ -63,7 +63,7 @@
 <h3>Getting involved</h3>
 
 <p>
-Development of libextractor, and GNU in general, is a volunteer
+Development of GNU libextractor, and GNU in general, is a volunteer
 effort, and you can contribute.  For information, please
 read <a href="/help/">How to help GNU</a>.  If you would like to get
 involved, it is a good idea to join the mailing list (see above).
@@ -78,9 +78,9 @@
 Our bugtracker is at <a 
href="https://gnunet.org/bugs/">https://gnunet.org/bugs/</a>.
 </dd>
 
-<dt>Translating libextractor</dt>
+<dt>Translating GNU libextractor</dt>
 
-<dd>To translate libextractor's messages into other languages, please see the 
<a
+<dd>To translate GNU libextractor's messages into other languages, please see 
the <a
 href="http://translationproject.org/domain/libextractor.html">Translation 
Project page for libextractor</a>.
 If you have a new translation of the message strings,
 or updates to the existing strings, please have the changes made in this
@@ -93,7 +93,7 @@
 <a name="install"></a>
 <h2>Installation</h2>
 <p>
-The simplest way to install libextractor is to use one of the binary
+The simplest way to install GNU libextractor is to use one of the binary
 packages which are available online for many distributions.  Note that
 under Debian, the extract tool is in a separate
 package <tt>extract</tt> and headers required to compile other
@@ -110,18 +110,16 @@
 $ make
 # make install
 </pre>
-Note that you need various dependencies (read <tt>README.debian</tt>
-for an up-to-date list for Debian systems) in order to compile all 
-of the plugins.
+Note that you need various dependencies (read <tt>README</tt>
+for an up-to-date list) in order to compile all of the plugins.
 </p>
 
 
-
 <a name="usage"></a>
 <h2>Using the extract tool</h2>
 
 <p>
-After installing libextractor, the extract tool can be used to obtain
+After installing GNU libextractor, the extract tool can be used to obtain
 meta data from documents.  By default, the extract tool uses the
 canonical set of plugins, which consists of all format-specific
 plugins supported by the current version of libextractor together with
@@ -129,22 +127,7 @@
 of <a 
href="http://www.ecst.csuchico.edu/~jacobsd/bib/formats/bibtex.html";>BibTeX</a>
 the option <tt>-b</tt> is likely to come in handy to automatically
 create bibtex entries from documents that have been properly equipped
-with meta-data:
-
-<pre>
-$ wget -q http://www.copyright.gov/legislation/dmca.pdf
-$ extract -b ~/dmca.pdf
-% BiBTeX file
address@hidden unite2001the_d,
-  title = "The Digital Millennium Copyright Act of 1998",
-  author = "United States Copyright Office - jmf",
-  note = "digital millennium copyright act circumvention...",
-  year = "2001",
-  month = "10",
-  key = "Copyright Office Summary of the DMCA",
-  pages = "18"
-}
-</pre>
+with meta-data (if available).
 </p>
 <p>
 Further options are described in the extract manpage 
(<tt>man&nbsp;1&nbsp;extract</tt>).
@@ -187,10 +170,10 @@
 keywords - The libextractor logo
 </pre>
 
-<h2>Using the libextractor library</h2>
+<h2>Using the GNU libextractor library</h2>
 <p>
 The following listing shows the code of a minimalistic program that
-uses libextractor.  Compiling the fragment requires passing the
+uses GNU libextractor.  Compiling the fragment requires passing the
 option <tt>-lextractor</tt> to gcc.  For details and additional
 functions for loading plugins and manipulating the keyword list, see
 the libextractor manpage (<tt>man&nbsp;3&nbsp;libextractor</tt>).
@@ -202,92 +185,51 @@
 <pre>
 #include <extractor.h>
 
-int main(int argc, char * argv[]) 
+int main (int argc, char * argv[]) 
 {
   struct EXTRACTOR_PluginList *plugins
     = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
   EXTRACTOR_extract (plugins, argv[1],
                      NULL, 0, 
-                     &EXTRACTOR_meta_data_print, stdout);
+                     &amp;EXTRACTOR_meta_data_print, stdout);
   EXTRACTOR_plugin_remove_all (plugins);
   return 0;
 }
 </pre>
 </p>
 
-<a name="plugins"></a>
-<h2>Current Plugins</h2>
-HTML, 
-PDF, 
-PS, 
-OLE2 (DOC, XLS, PPT),
-OpenOffice (sxw),
-StarOffice (sdw),
-DVI,
-MAN,
-FLAC, 
-MP3 (ID3v1 and ID3v2), 
-NSF(E) (NES music),
-SID (C64 music),
-OGG, 
-WAV,
-EXIV2,
-JPEG, 
-GIF,
-PNG, 
-TIFF,
-DEB,
-RPM, 
-TAR(.GZ),
-ZIP, 
-ELF,
-S3M (Scream Tracker 3),
-XM (eXtended Module),
-IT (Impulse Tracker),
-FLV,
-REAL,
-RIFF (AVI),
-MPEG,
-QT 
-and 
-ASF.
-
 <a name="newplugins"></a>
 <h2>Writing new Plugins</h2>
 <p>
-The most complicated thing when writing a new plugin for libextractor is the 
writing of the actual parser for a specific format.
+The most complicated thing when writing a new plugin for GNU libextractor is 
the writing of the actual parser for a specific format.
 Nevertheless, the basic pattern is always the same.
 The plugin library must be called <tt>libextractor_XXX.so</tt> where XXX 
denotes the file format supported by the plugin and
 must be placed in the plugin directory (typically 
<tt>$PREFIX/lib/libextractor/</tt>).
-The library must export a method <tt>EXTRACTOR_XXX_extract</tt> with the 
following signature:
+The library must export a method <tt>EXTRACTOR_XXX_extract_method</tt> with 
the following signature:
 <pre>
-int
-EXTRACTOR_XXX_extract (const char *data,
-                       size_t size,
-                       EXTRACTOR_MetaDataProcessor proc,
-                       void *proc_cls,
-                       const char* options);
+void
+EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
 </pre>
 </p>
 <p>
-<tt>data</tt> is a pointer to the contents of the
-file, and <tt>size</tt> is the number of bytes available in <tt>data</tt>. Most
-plugins starting by verifying that <tt>size</tt> is sufficiently large and
+<tt>ec</tt> provides a callback to invoke with meta data as well as
+functions for reading data from the file that is being processed.
+Most plugins start by reading the first bytes of the file and checking that
 that the header of data matches the specific format.
-The <tt>extract</tt> function is expected to call <tt>proc</tt> with each
-meta data item found.  <tt>proc_cls</tt> must be passed as the first
-argument to <tt>proc</tt>, the other arguments correspond to the meta data 
found.
-Finally, <tt>options</tt> is an arbitrary string of options that the plugin is
-free to interpret. Most plugins ignore <tt>options</tt>.
+The <tt>extract</tt> function is expected to call <tt>ec-&gt;proc</tt> with 
each
+meta data item found.  <tt>ec-&gt;cls</tt> must be passed as the first
+argument to <tt>proc</tt> and other function invoked from within <tt>ec</tt>.
+Finally, <tt>ec-&gt;config</tt> is an arbitrary string of options that the 
plugin is
+free to interpret. Most plugins ignore <tt>config</tt>.
 </p>
 <p>
-If the meta data extracted is a string, it issupposed to be converted
+If the meta data extracted is a string, it is supposed to be converted
 into the UTF-8 character set by the plugin.  However, in cases where
 the character encoding used in the document is unknown, no conversion
 should be done.  Binary meta data can also be extracted.  Plugins
 indicate the format of the meta data using the <tt>format</tt>
 argument to <tt>proc</tt>.  Supported formats are UTF-8 strings, C
-Strings (for strings of unknown encoding) and binary data.  In
+strings (for strings of unknown encoding) and binary data.  In
 addition to this rough categorization, the plugin is also supposed to
 indicate the mime type of the meta data.  For strings, that mime type
 is most often <tt>text/plain</tt>.  Finally, the plugin must specify
@@ -305,10 +247,8 @@
                                            size_t data_len);
 </pre>
 <p>
-If &quot;proc&quot; returns non-zero, the plugin should abort and
-return non-zero itself.  The &quot;extract&quot; function should
-always return zero unless a call to &quot;proc&quot; returned
-non-zero, in which case the plugin must return 1.
+If &quot;proc&quot; returns non-zero, the plugin should abort 
+processing the current file and return.  
 </p>
 </td>
 </tr>
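
The plugin pattern described in the revised text (read the first bytes, check the header, call proc for each item, stop as soon as proc returns non-zero) can be sketched roughly as follows. This is an illustration only, not code from libextractor. Every demo_* name, the "XXX1" magic and the file layout are hypothetical. The context struct is a simplified stand-in for struct EXTRACTOR_ExtractContext: only cls, proc and config are named in the text above, the shape of the read member is an assumption, and the proc signature is shortened.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical processor callback; the real one takes more
   arguments (meta type, format, ...). Non-zero return means "abort". */
typedef int (*demo_processor) (void *cls, const char *plugin_name,
                               const char *mime_type,
                               const char *data, size_t data_len);

/* Hypothetical stand-in for the extract context. */
struct demo_context
{
  void *cls;            /* closure, first argument to proc and read */
  const char *config;   /* plugin options, ignored by this sketch */
  /* ASSUMPTION: sets *data to up to 'size' bytes, returns bytes available */
  long (*read) (void *cls, void **data, size_t size);
  demo_processor proc;
};

/* Plugin entry point for the invented "XXX" format:
   4 bytes of magic "XXX1" followed by a title. */
void
demo_xxx_extract_method (struct demo_context *ec)
{
  static const char mime[] = "application/x-xxx";
  void *buf;
  long got = ec->read (ec->cls, &buf, 64);

  /* Check that enough data is present and that the header matches. */
  if ( (got < 4) || (0 != memcmp (buf, "XXX1", 4)) )
    return;
  /* Report the first item; abort if proc asks us to. */
  if (0 != ec->proc (ec->cls, "xxx", "text/plain", mime, strlen (mime)))
    return;
  /* Report the title, if there is one. */
  if (got > 4)
    ec->proc (ec->cls, "xxx", "text/plain",
              (const char *) buf + 4, (size_t) (got - 4));
}

/* Minimal in-memory harness standing in for the library side. */
struct demo_file
{
  const char *bytes;
  size_t size;
  int items;        /* number of items reported so far */
  int stop_after;   /* proc returns non-zero once this many were seen */
};

static long
demo_read (void *cls, void **data, size_t size)
{
  struct demo_file *f = cls;

  if (size > f->size)
    size = f->size;
  *data = (void *) f->bytes;
  return (long) size;
}

static int
demo_proc (void *cls, const char *plugin_name, const char *mime_type,
           const char *data, size_t data_len)
{
  struct demo_file *f = cls;

  (void) plugin_name; (void) mime_type; (void) data; (void) data_len;
  f->items++;
  return f->items >= f->stop_after;
}

/* Runs the plugin over a buffer and returns how many items it reported. */
int
demo_run (const char *bytes, size_t size, int stop_after)
{
  struct demo_file f = { bytes, size, 0, stop_after };
  struct demo_context ec = { &f, NULL, &demo_read, &demo_proc };

  demo_xxx_extract_method (&ec);
  return f.items;
}
```

Calling demo_run with a buffer that starts with the magic yields one item for the mime type and one for the title; a buffer with a different header yields none, and a small stop_after value shows the early return after proc signals abort.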



