Re: [libextractor] Microsoft Office mimetype (OLE2) is not recognized re


From: Christian Grothoff
Subject: Re: [libextractor] Microsoft Office mimetype (OLE2) is not recognized reliable
Date: Mon, 1 Sep 2008 02:23:44 -0600
User-agent: KMail/1.9.9

On Saturday 30 August 2008 03:44:15 pm Marc wrote:
> Hi,
>
> great work on libextractor. I like learning with it and figuring things
> out, and I am starting to learn Python.
>
> One problem I noticed:
>
> I am trying to distinguish between the different Microsoft Office
> formats using the mimetype information provided by libextractor (I have no
> filename extensions for the files under investigation). The problem is that
> often only generic information, e.g. "application/vnd.ms-office", is
> extracted. The result depends on the specific application that was used for
> the last save of the document/spreadsheet/presentation.
>
> I found out that other programs have similar problems doing this job:
> - In Kubuntu Hardy, the Linux distribution that I use, XLS files without
>   a filename extension appear as DOC in Konqueror
> - Windows XP can't do so either (in the file manager)
> - I also tried NLNZ Metadata Extractor v3.0 without success
> - The file command on the shell gives the wrong application type too

Well, AFAIK the reason is that to a large extent the
document/spreadsheet/presentation formats are pretty much the same, and they
all DO have the same mime-type (so it is not incorrect for LE to sometimes
report the same mime-type).  Internally, LE has one mime-type
(application/vnd.ms-files) which is used if we have no idea what the actual
MS application is.  If LE is able to determine the "generator", then the
mime-type is chosen to be more specific:
 
  if (NULL != generator) {
    const char * mimetype = "application/vnd.ms-files";

    if((0 == strncmp(generator, "Microsoft Word", 14)) ||
       (0 == strncmp(generator, "Microsoft Office Word", 21)))
      mimetype = "application/msword";
    else if((0 == strncmp(generator, "Microsoft Excel", 15)) ||
            (0 == strncmp(generator, "Microsoft Office Excel", 22)))
      mimetype = "application/vnd.ms-excel";
    else if((0 == strncmp(generator, "Microsoft PowerPoint", 20)) ||
            (0 == strncmp(generator, "Microsoft Office PowerPoint", 27)))
      mimetype = "application/vnd.ms-powerpoint";
    else if(0 == strncmp(generator, "Microsoft Project", 17))
      mimetype = "application/vnd.ms-project";
    else if(0 == strncmp(generator, "Microsoft Visio", 15))
      mimetype = "application/vnd.visio";
    else if(0 == strncmp(generator, "Microsoft Office", 16))
      mimetype = "application/vnd.ms-office";

    prev = addKeyword(prev, mimetype, EXTRACTOR_MIMETYPE);
  }

One thing you may look at is the "generator" you get for the files that are
reported as application/vnd.ms-files.  If it is a specific application that
is missing from the above list, we could extend our list.
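
To make it easier to see what extending the list would involve, here is a minimal, self-contained sketch of the same prefix matching written as a lookup table, so that a new generator is one more row. This is not the code in libextractor: the names generator_rule, rules, mimetype_for_generator and mimetype_for_generator_demo are made up for illustration, and the generator strings in the demo are only examples, not values observed in real files.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One rule: if the generator starts with `prefix`, report `mimetype`. */
struct generator_rule {
  const char *prefix;
  const char *mimetype;
};

/* Same prefixes and MIME types as the if/else chain above.  Order matters:
   the bare "Microsoft Office" prefix must come after the longer
   "Microsoft Office ..." prefixes, or it would shadow them. */
static const struct generator_rule rules[] = {
  { "Microsoft Word",              "application/msword" },
  { "Microsoft Office Word",       "application/msword" },
  { "Microsoft Excel",             "application/vnd.ms-excel" },
  { "Microsoft Office Excel",      "application/vnd.ms-excel" },
  { "Microsoft PowerPoint",        "application/vnd.ms-powerpoint" },
  { "Microsoft Office PowerPoint", "application/vnd.ms-powerpoint" },
  { "Microsoft Project",           "application/vnd.ms-project" },
  { "Microsoft Visio",             "application/vnd.visio" },
  { "Microsoft Office",            "application/vnd.ms-office" },
};

/* Hypothetical helper (not part of libextractor): returns the MIME type
   for a generator string, or NULL if no generator is available. */
const char *
mimetype_for_generator (const char *generator)
{
  size_t i;

  if (NULL == generator)
    return NULL;
  for (i = 0; i < sizeof (rules) / sizeof (rules[0]); i++)
    if (0 == strncmp (generator, rules[i].prefix, strlen (rules[i].prefix)))
      return rules[i].mimetype;
  return "application/vnd.ms-files";  /* generator present but unknown */
}

/* Usage example with illustrative generator strings. */
void
mimetype_for_generator_demo (void)
{
  assert (0 == strcmp (mimetype_for_generator ("Microsoft Office Word"),
                       "application/msword"));
  assert (0 == strcmp (mimetype_for_generator ("Microsoft Office Outlook"),
                       "application/vnd.ms-office"));
  assert (0 == strcmp (mimetype_for_generator ("Some Other Tool"),
                       "application/vnd.ms-files"));
}
```

With a table like this, supporting another application would mean adding one row above the generic "Microsoft Office" entry; the fallback for an unrecognized generator stays application/vnd.ms-files, as in the excerpt.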

I'm not aware of any alternative / better way to determine the mimetype for MS 
Office applications.

Christian



