Document

Overview

Document is the main entry point of Aspose.Words FOSS for Python. It provides the public API for loading documents (DOC, DOCX, RTF, TXT, Markdown), saving them to PDF, Markdown, or plain text, and extracting plain text.

Defined in: aspose/words_foss/document.py

Constructor

Signature | Description
Document(filepath=None, *, stream=None, data=None) | Load a document from a file path, binary stream, or raw bytes. At least one source must be provided.

Parameters:

Name | Type | Description
filepath | Optional[Union[str, Path]] | Path to the document file to load.
stream | Optional[BinaryIO] | Binary stream containing document data (DOCX format only).
data | Optional[bytes] | Raw bytes of the document content (DOCX format only).
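
The three sources listed above map onto one keyword argument each. The sketch below shows one way to route an arbitrary input to the right keyword. The helper names document_kwargs and load_document are hypothetical and not part of the library; only the Document(filepath=..., stream=..., data=...) signature comes from this page.

```python
from pathlib import Path
from typing import Any, Dict


def document_kwargs(source: Any) -> Dict[str, Any]:
    """Pick the Document keyword argument that matches the source type.

    Hypothetical helper: it only builds the keyword dictionary and does
    not touch the library itself.
    """
    if isinstance(source, (str, Path)):
        # File paths go to the positional-or-keyword `filepath` parameter.
        return {"filepath": source}
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes go to `data` (documented as DOCX only).
        return {"data": bytes(source)}
    if hasattr(source, "read"):
        # Anything file-like goes to `stream` (documented as DOCX only).
        return {"stream": source}
    raise TypeError(f"Unsupported document source: {type(source).__name__}")


def load_document(source: Any):
    """Load a document from a path, bytes, or binary stream."""
    # Imported lazily so document_kwargs stays usable without the package.
    import aspose.words_foss as aw

    return aw.Document(**document_kwargs(source))
```

With this in place, load_document("input.docx") and load_document(open("input.docx", "rb").read()) would both go through the same call site. Because stream and data are documented as DOCX only, other formats should be loaded by path.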

Methods

Signature | Description
save(output_path, save_format_or_options=None) -> None | Save the document to PDF (SaveFormat.PDF), Markdown (SaveFormat.MARKDOWN), or plain text (SaveFormat.TEXT). Pass a SaveFormat constant for default settings, or a save-options object (PdfSaveOptions, MarkdownSaveOptions) for fine-grained control over output.
get_text() -> str | Extract plain text from the loaded document.

Properties

Name | Type | Access | Description
light_document_model | ldm.Document | Read | Access the underlying light document model for advanced inspection of document structure (paragraphs, tables, styles, sections).

Usage

import aspose.words_foss as aw

# Load a DOCX file
doc = aw.Document("input.docx")

# Extract plain text
text = doc.get_text()

# Save as PDF
doc.save("output.pdf", aw.SaveFormat.PDF)

# Save as Markdown
doc.save("output.md", aw.SaveFormat.MARKDOWN)

Supported Formats

Input Formats

Format | LoadFormat Constant | Description
DOCX | LoadFormat.DOCX | Office Open XML Word document
DOC | LoadFormat.DOC | Legacy Microsoft Word binary format
RTF | LoadFormat.RTF | Rich Text Format
TXT | LoadFormat.TEXT | Plain text
Markdown | LoadFormat.MARKDOWN | CommonMark Markdown
Auto | LoadFormat.AUTO | Detect format from file extension (default)
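
Because the default mode detects the format from the file extension, it can help to check a batch of files before loading them. The sketch below returns the name of the LoadFormat constant that a path would be expected to resolve to. The extension list is an assumption (this page does not say which extensions the library recognises), and guess_load_format is a hypothetical helper rather than a library function.

```python
from pathlib import Path
from typing import Union

# Constant names come from the table above; the extensions are assumed.
INPUT_FORMATS = {
    ".docx": "DOCX",
    ".doc": "DOC",
    ".rtf": "RTF",
    ".txt": "TEXT",
    ".md": "MARKDOWN",
}


def guess_load_format(path: Union[str, Path]) -> str:
    """Return the expected LoadFormat constant name for a file path.

    Falls back to "AUTO" when the extension is not in the assumed table,
    leaving the decision to the library.
    """
    suffix = Path(path).suffix.lower()
    return INPUT_FORMATS.get(suffix, "AUTO")
```

The helper returns a plain string so it can be used without importing the package. If the constant itself is needed, getattr(aw.SaveFormat, ...) has a direct counterpart in getattr(aw.LoadFormat, guess_load_format(path)).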

Output Formats

Format | SaveFormat Constant | Save Options
PDF | SaveFormat.PDF | PdfSaveOptions: configure compliance, image compression, font embedding
Markdown | SaveFormat.MARKDOWN | MarkdownSaveOptions: configure table alignment, image export, list mode
Plain Text | SaveFormat.TEXT | (none)
DOCX | SaveFormat.DOCX | Read-only constant. Document.save() does not support DOCX and raises ValueError.
DOC | SaveFormat.DOC | Read-only constant. Document.save() does not support DOC and raises ValueError.
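
Since only PDF, Markdown, and plain text can be written, a caller may want to choose the SaveFormat constant from the output file name and fail early for anything else. The following is a minimal sketch of that idea. The extension mapping and the helper names save_format_name and save_by_extension are assumptions for illustration; only Document.save() and the SaveFormat constant names come from this page.

```python
from pathlib import Path
from typing import Union

# Output extensions mapped to SaveFormat constant names (mapping is assumed).
SAVE_FORMATS = {
    ".pdf": "PDF",
    ".md": "MARKDOWN",
    ".txt": "TEXT",
}


def save_format_name(output_path: Union[str, Path]) -> str:
    """Return the SaveFormat constant name for an output path.

    Raises ValueError for unsupported extensions such as .docx or .doc,
    mirroring the documented behaviour of Document.save().
    """
    suffix = Path(output_path).suffix.lower()
    if suffix not in SAVE_FORMATS:
        raise ValueError(f"Unsupported output extension: {suffix!r}")
    return SAVE_FORMATS[suffix]


def save_by_extension(doc, output_path: Union[str, Path]) -> None:
    """Save `doc` using the format implied by the output file name."""
    # Imported lazily so save_format_name works without the package.
    import aspose.words_foss as aw

    save_format = getattr(aw.SaveFormat, save_format_name(output_path))
    doc.save(str(output_path), save_format)
```

Checking the extension first means an unsupported target such as output.docx is rejected before any conversion work starts.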

Light Document Model

The light_document_model property provides access to the internal document structure (ldm.Document, defined in light_document_model.py). This pydantic BaseModel exposes parsed paragraphs, tables, styles, sections, headers, footers, and structural queries like find_style() and headings(). Most users do not need to access the LDM directly — the public Document methods above cover standard workflows.

See Also

  • SaveFormat — output format constants used with Document.save()
  • LoadFormat — input format constants for explicit format specification
  • PdfSaveOptions — fine-grained PDF export control (compliance level, compression, font embedding)
  • MarkdownSaveOptions — Markdown export options (table alignment, image handling, list mode)