xhtml2odt - XHTML to ODT XML transformation

Copyright (C) 2009-2010 Aurelien Bompard

This script converts an XHTML page, such as a wiki page, to the OpenDocument Text (ODT) format. ODT is standardized as ISO/IEC 26300:2006 and is the native format of office suites such as OpenOffice.org and KOffice.

It fills a template ODT file with the converted content of the XHTML page.

Website: http://xhtml2odt.org

Inspired by the work on docbook2odt, by Roman Fordinal.

Usage

Call the script with the --help option to see all the available options. The main options are:

-i <file>, --input <file>
The HTML file to read from.
-o <file>, --output <file>
The ODT file to export to (will be overwritten if already present).
-t <file>, --template <file>
The ODT file to use as a template (must be readable).
-v, --verbose
Be verbose (enables logging).

The full help message is:

Usage: xhtml2odt.py [options] -i input -o output -t template.odt

Options:
  -h, --help            show this help message and exit
  -i FILE, --input=FILE
                        Read the html from this file
  -o FILE, --output=FILE
                        Location of the output ODT file
  -t FILE, --template=FILE
                        Location of the template ODT file
  -u URL, --url=URL     Use this URL for relative links
  -v, --verbose         Show what's going on
  --html-id=ID          Only export from the element with this ID
  --replace=KEYWORD     Keyword to replace in the ODT template (default is
                        ODT-INSERT)
  --cut-start=KEYWORD   Keyword to start cutting text from the ODT template
                        (default is ODT-CUT-START)
  --cut-stop=KEYWORD    Keyword to stop cutting text from the ODT template
                        (default is ODT-CUT-STOP)
  --top-header-level=LEVEL
                        Level of highest header in the HTML (default is 1)
  --img-default-width=WIDTH
                        Default image width (default is 8cm)
  --img-default-height=HEIGHT
                        Default image height (default is 6cm)
  --dpi=DPI             Screen resolution in Dots Per Inch (default is 96)
  --no-network          Do not download remote images
  --stylesdir=DIR       Override the style templates directory
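
The --dpi, --img-default-width and --img-default-height options suggest that pixel sizes are converted to physical lengths. The sketch below shows that arithmetic only, assuming the usual 2.54 cm per inch. The helper name px_to_cm is hypothetical and not part of the script, and how the script itself applies --dpi is not documented here.

```python
CM_PER_INCH = 2.54


def px_to_cm(pixels, dpi=96):
    """Convert a pixel length to centimetres for a given screen resolution."""
    return pixels / dpi * CM_PER_INCH


# A 96-pixel-wide image at the default 96 DPI is exactly one inch wide.
print("%.2fcm" % px_to_cm(96))
```
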

License

GNU LGPL v2 or later: http://www.gnu.org/licenses/lgpl-2.0.html

This program is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.

Code

class xhtml2odt.HTMLFile(options)

This class contains the HTML document to convert to ODT. The HTML code will be run through Tidy to ensure that it is valid and well-formed XHTML.

Variable options:An OptionParser-result object containing the options for processing.
Variable html:The HTML code.
cleanup()
Run the HTML code from the html instance variable through Tidy.
read()
Read the HTML file from options.input, run it through Tidy, and filter using the selected ID (if applicable).
select_id()
Replace the HTML content with one of its own elements. The element is selected by its HTML ID.
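
Selecting an element by its ID can be sketched with the standard library as follows. This is an illustration only, not the script's implementation: the function name select_by_id is hypothetical, and the sample markup has no XML namespace, which keeps the serialized output free of namespace prefixes.

```python
import xml.etree.ElementTree as ET


def select_by_id(xhtml, html_id):
    """Return the serialized element whose id attribute equals html_id.

    The input is returned unchanged when no element matches.
    """
    root = ET.fromstring(xhtml)
    for element in root.iter():
        if element.get("id") == html_id:
            return ET.tostring(element, encoding="unicode")
    return xhtml


page = ('<html><body><div id="nav">menu</div>'
        '<div id="content"><p>Hello</p></div></body></html>')
print(select_by_id(page, "content"))
```
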
exception xhtml2odt.ODTExportError
Base exception for ODT conversion errors.
class xhtml2odt.ODTFile(options)

Handles the conversion and production of an ODT file.

add_styles()
Scans the ODT XML for used styles that would not be already included in the ODT template, and adds those missing styles.
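
The idea of finding used-but-undefined styles can be sketched as a set difference. The sketch assumes that styles are referenced through text:style-name attributes and defined through style:name attributes. The function name missing_styles is hypothetical, and the real method may look at more attributes than this.

```python
import re


def missing_styles(content_xml, styles_xml):
    """Return the style names used in content_xml but not defined in styles_xml."""
    used = set(re.findall(r'text:style-name="([^"]+)"', content_xml))
    defined = set(re.findall(r'style:name="([^"]+)"', styles_xml))
    return sorted(used - defined)


content = ('<text:p text:style-name="Text_20_body">a</text:p>'
           '<text:h text:style-name="Heading_20_1">b</text:h>')
styles = '<style:style style:name="Text_20_body" style:family="paragraph"/>'
print(missing_styles(content, styles))
```
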
compile()
Writes the in-memory ODT XML content and styles to the disk.
download_img(src)

Downloads the given image to a temporary location.

Parameter:src (str) – the URL to download
handle_images(xhtml)

Handling of image tags in the XHTML. Local and remote images are handled differently: see the handle_local_img() and handle_remote_img() methods for details.

Parameter:xhtml (str) – the XHTML content to import
Returns:XHTML with normalized img tags
Return type:str
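
The local/remote split can be sketched with a re.sub callback that inspects each src attribute. The sketch treats an image as local when it has no host or shares the host of the document URL. The names IMG_RE and classify_images are hypothetical, and the tags are left unchanged here, whereas the real method returns normalized img tags.

```python
import re
from urllib.parse import urlparse

IMG_RE = re.compile(r'<img [^>]*src="([^"]+)"[^>]*>')


def classify_images(xhtml, base_url):
    """Return (src, kind) pairs, where kind is "local" or "remote"."""
    base_host = urlparse(base_url).netloc
    seen = []

    def callback(img_mo):
        src = img_mo.group(1)
        host = urlparse(src).netloc
        # No host, or the same host as the document, counts as local.
        kind = "local" if host in ("", base_host) else "remote"
        seen.append((src, kind))
        return img_mo.group(0)  # leave the tag unchanged in this sketch

    IMG_RE.sub(callback, xhtml)
    return seen


xhtml = ('<p><img src="/logo.png" alt=""/>'
         '<img src="http://other.example/pic.jpg" alt=""/></p>')
print(classify_images(xhtml, "http://wiki.example/page"))
```
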
handle_img(full_tag, src, filename)

Imports an image into the ODT file.

Parameters:
  • full_tag (str) – the full img tag in the original XHTML document
  • src (str) – the src attribute of the img tag
  • filename (str) – the path to the image file on the local disk
Turn relative links into absolute links using the handle_links() method.
handle_local_img(img_mo)

Handling of local images. This method should be called as a callback on each img tag.

Find the real path of the image file and use the handle_img() method to flag it for inclusion in the ODT file.

This implementation downloads the files that come from the same domain as the XHTML document, but server-based export plugins can just retrieve them from the local disk, using either the DOCUMENT_ROOT or any other appropriate method (depending on the web application you’re writing an export plugin for).

Parameter:img_mo – the match object from the re.sub callback
Do the actual conversion of links from relative to absolute. This method is used as a callback by the handle_links() method.
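
A relative-to-absolute link conversion driven by a re.sub callback can be sketched as below, using urljoin from the standard library against a base URL such as the one passed with the -u option. The function name absolutize_links and the regular expression are assumptions made for this sketch.

```python
import re
from urllib.parse import urljoin


def absolutize_links(xhtml, base_url):
    """Rewrite every href of an <a> tag as an absolute URL."""
    def callback(link_mo):
        # group(1) is the tag up to the opening quote, group(2) the href value.
        return link_mo.group(1) + urljoin(base_url, link_mo.group(2)) + '"'

    return re.sub(r'(<a [^>]*href=")([^"]*)"', callback, xhtml)


print(absolutize_links('<a href="../about.html">About</a>',
                       "http://example.org/wiki/page"))
```

Links that are already absolute pass through urljoin unchanged.
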
handle_remote_img(img_mo)

Downloads remote images to a temporary file and flags them for inclusion using the handle_img() method.

Parameter:img_mo – the match object from the re.sub callback
import_xhtml(xhtml)

Main function to run the conversion process:

  • XHTML import
  • conversion to ODT XML
  • insertion into the ODT template
  • addition of the missing styles

The next logical step is to use the save() method.

Parameter:xhtml (str) – the XHTML content to import
insert_content(content)

Insert ODT XML content into the content.xml file, replacing the keywords if needed.

Parameter:content (str) – ODT XML content to insert
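
Keyword replacement can be sketched as follows, assuming the keyword (ODT-INSERT by default) sits alone in a text:p paragraph of the template. This standalone function is a simplification: it ignores the cut-start and cut-stop keywords, and the real method may locate the keyword differently.

```python
import re


def insert_content(template_xml, content, keyword="ODT-INSERT"):
    """Replace the first paragraph that holds only the keyword with content."""
    paragraph_re = re.compile(
        r"<text:p[^>]*>\s*" + re.escape(keyword) + r"\s*</text:p>")
    # A function replacement keeps any backslashes in content literal.
    return paragraph_re.sub(lambda mo: content, template_xml, count=1)


template = ('<office:text><text:p text:style-name="Standard">ODT-INSERT</text:p>'
            '</office:text>')
print(insert_content(template, "<text:p>Hello</text:p>"))
```
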
open()
Uncompress the template ODT file, and read the content.xml and styles.xml files into memory.
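
An ODT file is a ZIP package, so reading content.xml and styles.xml can be sketched with the zipfile module. The function name read_odt_xml is hypothetical. The sketch reads both members straight from the archive rather than unpacking it to disk, and it builds a tiny stand-in package in memory so that it runs without a real template.

```python
import io
import zipfile


def read_odt_xml(odt_file):
    """Return (content_xml, styles_xml) from an ODT package as text."""
    with zipfile.ZipFile(odt_file) as package:
        content = package.read("content.xml").decode("utf-8")
        styles = package.read("styles.xml").decode("utf-8")
    return content, styles


# Stand-in for a template: a ZIP archive with the two XML members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("content.xml", "<office:document-content/>")
    package.writestr("styles.xml", "<office:document-styles/>")
buffer.seek(0)

content, styles = read_odt_xml(buffer)
print(content, styles)
```
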
save(output=None)

General method to save the in-memory content to an ODT file on the disk.

If output is None, the document is returned.

Parameter:output (str or file-like object or None) – where the document should be saved, see the -o option.
Returns:the ODT document if output is None, otherwise None.
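
The three accepted kinds of output (None, a path, a file-like object) can be sketched with a standalone function. Here the package is built from a hypothetical name-to-bytes mapping rather than from the object's state, so this shows the return convention and not the actual method. The mimetype entry is written first and uncompressed, as ODF packaging expects.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def save(files, output=None):
    """Zip files (a name -> bytes mapping) into an ODT package.

    Return the package bytes when output is None. Otherwise write them to
    the given path or file-like object and return None.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # The mimetype member goes first and is stored without compression.
        package.writestr("mimetype", ODT_MIMETYPE,
                         compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            package.writestr(name, data)
    document = buffer.getvalue()
    if output is None:
        return document
    if hasattr(output, "write"):
        output.write(document)
    else:
        with open(output, "wb") as handle:
            handle.write(document)
    return None


document = save({"content.xml": b"<office:document-content/>"})
print(len(document) > 0)
```
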
xhtml_to_odt(xhtml)

Converts the XHTML content into ODT.

Parameter:xhtml (str) – the XHTML content to import
Returns:the ODT XML from the conversion
Return type:str
xhtml2odt.get_options()
Parses the command-line options.
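
A subset of the options from the help message can be declared with optparse as sketched below. Only options.input is named in this documentation; the other dest names (output, template, verbose, replace_keyword, dpi) and the function name build_parser are assumptions made for the sketch.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser for a few of the documented options."""
    parser = OptionParser(
        usage="usage: %prog [options] -i input -o output -t template.odt")
    parser.add_option("-i", "--input", dest="input", metavar="FILE",
                      help="Read the html from this file")
    parser.add_option("-o", "--output", dest="output", metavar="FILE",
                      help="Location of the output ODT file")
    parser.add_option("-t", "--template", dest="template", metavar="FILE",
                      help="Location of the template ODT file")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=False,
                      help="Show what's going on")
    parser.add_option("--replace", dest="replace_keyword", metavar="KEYWORD",
                      default="ODT-INSERT",
                      help="Keyword to replace in the ODT template")
    parser.add_option("--dpi", dest="dpi", type="int", default=96,
                      help="Screen resolution in Dots Per Inch")
    return parser


options, args = build_parser().parse_args(
    ["-i", "page.html", "-o", "page.odt", "-t", "template.odt"])
print(options.input, options.output, options.template, options.dpi)
```
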
xhtml2odt.log(msg, verbose=False)
Simple function that logs a message only in verbose mode (the -v option).
xhtml2odt.main()
Main function, called when the script is invoked on the command line.
