PdfExtractor

Overview

PdfExtractor is a class in Aspose.PDF FOSS for Java. Implements: Closeable.

Facade for extracting text and images from PDF documents.

The class exposes 41 public signatures (3 constructors and 38 method overloads) under 30 distinct names, including bindPdf, close, extractAttachment, extractImage, extractText, getAttachNames, getAttachment, getEndPage, getExtractImageMode, getExtractTextMode, and getImageCount. All public members are accessible to any Java application once the Aspose.PDF FOSS for Java package is installed. The class has 11 properties: attachNames, bidi, endPage, extractImageMode, extractTextMode, imageCount, and 5 more.
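Because the class implements Closeable, a typical text-extraction call sequence fits naturally into a try-with-resources block. The sketch below illustrates that sequence (bind, set the page range, extract, write out, close) against a hypothetical stand-in interface, `TextExtractor`, whose method names and parameters are copied from the Methods table. The stand-in, the `StubExtractor` class, the `extractRange` helper, and the absence of checked exceptions are all assumptions made so the example runs without the library; the real class's package name and exception behaviour are not stated on this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class TextExtractionSketch {

    // Hypothetical stand-in for the subset of PdfExtractor used below.
    // Names and parameters follow the Methods table; declaring no checked
    // exceptions is an assumption of this sketch.
    interface TextExtractor extends Closeable {
        void bindPdf(String inputFile);
        void setStartPage(int page); // 1-based
        void setEndPage(int page);   // 1-based
        void extractText();
        void getText(OutputStream stream);
        @Override
        void close();
    }

    // Stub that pretends page N contains the text "page N"; it never opens a file.
    static final class StubExtractor implements TextExtractor {
        private int start = 1;
        private int end = 1;
        private String text = "";
        boolean closed = false;

        @Override public void bindPdf(String inputFile) { /* a real extractor would open the PDF here */ }
        @Override public void setStartPage(int page) { start = page; }
        @Override public void setEndPage(int page) { end = page; }

        @Override public void extractText() {
            StringBuilder sb = new StringBuilder();
            for (int p = start; p <= end; p++) {
                sb.append("page ").append(p).append('\n');
            }
            text = sb.toString();
        }

        @Override public void getText(OutputStream stream) {
            try {
                stream.write(text.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override public void close() { closed = true; }
    }

    // Binds the input, extracts the requested page range, and returns the text.
    // The extractor is closed when the try block exits.
    static String extractRange(TextExtractor extractor, String inputFile, int first, int last) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (TextExtractor ex = extractor) {
            ex.bindPdf(inputFile);
            ex.setStartPage(first);
            ex.setEndPage(last);
            ex.extractText();
            ex.getText(buffer);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(extractRange(new StubExtractor(), "input.pdf", 2, 3));
    }
}
```

With the real library, the stub would presumably be replaced by a `PdfExtractor` instance; whether its methods throw checked exceptions should be verified against the actual class.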

Properties

Name	Type	Access	Description
`password`	`String`	Read	Returns the password applied when opening encrypted PDFs in subsequent `bindPdf(String)` or `bindPdf(InputStream)` calls.
`startPage`	`int`	Read	Returns the start page.
`endPage`	`int`	Read	Returns the end page.
`extractTextMode`	`int`	Read	Returns the current text extraction mode.
`resolution`	`Resolution`	Read	Returns the configured extraction resolution.
`extractImageMode`	`ExtractImageMode`	Read	Returns the image extraction mode.
`textAsString`	`String`	Read	Returns the extracted text as a string.
`textSearchOptions`	`TextSearchOptions`	Read	Returns text search options associated with the extractor.
`bidi`	`boolean`	Read	Returns whether the current extraction contains bidi text.
`attachNames`	`List<String>`	Read	Returns the currently prepared attachment names.
`imageCount`	`int`	Read	Returns the number of extracted images.

Methods

Signature	Description
`PdfExtractor()`	Creates a new PdfExtractor instance.
`PdfExtractor(document: Document)`	Creates a new PdfExtractor bound to an existing document.
`PdfExtractor(stream: InputStream)`	Creates a new PdfExtractor bound to a PDF stream.
`bindPdf(inputFile: String)`	Binds a PDF file to this extractor.
`bindPdf(stream: InputStream)`	Binds a PDF from an input stream.
`getPassword()` → `String`	Returns the password applied when opening encrypted PDFs in subsequent `bindPdf(String)` or `bindPdf(InputStream)` calls.
`setPassword(password: String)`	Sets the password used by subsequent `bindPdf` calls to open encrypted PDFs.
`bindPdf(document: Document)`	Binds an existing Document to this extractor.
`setStartPage(page: int)`	Sets the start page for extraction (1-based).
`getStartPage()` → `int`	Returns the start page.
`setEndPage(page: int)`	Sets the end page for extraction (1-based).
`getEndPage()` → `int`	Returns the end page.
`setExtractTextMode(mode: int)`	Sets the text extraction mode.
`getExtractTextMode()` → `int`	Returns the current text extraction mode.
`setResolution(resolution: Resolution)`	Sets the image extraction resolution; provided for API parity.
`getResolution()` → `Resolution`	Returns the configured extraction resolution.
`setExtractImageMode(extractImageMode: ExtractImageMode)`	Sets the image extraction mode.
`getExtractImageMode()` → `ExtractImageMode`	Returns the image extraction mode.
`extractText()`	Extracts text from the page range.
`extractText(encoding: Charset)`	Extracts text from the page range using the requested output encoding.
`getText(outputPath: String)`	Writes extracted text to a file.
`getText(stream: OutputStream)`	Writes extracted text to an output stream.
`getTextAsString()` → `String`	Returns the extracted text as a string.
`getTextSearchOptions()` → `TextSearchOptions`	Returns text search options associated with the extractor.
`setTextSearchOptions(textSearchOptions: TextSearchOptions)`	Sets text search options used by subsequent extraction calls.
`hasNextPageText()` → `boolean`	Returns whether page-by-page extracted text remains available.
`getNextPageText(outputPath: String)`	Writes the next page text to a file.
`getNextPageText(stream: OutputStream)`	Writes the next page text to a stream.
`isBidi()` → `boolean`	Returns whether the current extraction contains bidi text.
`extractAttachment()`	Prepares attachment extraction for all embedded files.
`extractAttachment(name: String)`	Prepares extraction for a specific attachment key or file name.
`getAttachNames()` → `List<String>`	Returns the currently prepared attachment names.
`getAttachment(outputPath: String)`	Writes the prepared attachment(s) to the given file or directory.
`extractImage()`	Extracts images from the page range.
`hasNextImage()` → `boolean`	Returns whether there are more extracted images to retrieve.
`getNextImage(outputPath: String)`	Saves the next extracted image to a file.
`getNextImage(outputPath: String, format: ImageFormat)`	Saves the next extracted image to a file using the requested output format.
`getNextImage(stream: OutputStream)`	Saves the next extracted image to an output stream.
`getNextImage(stream: OutputStream, format: ImageFormat)`	Saves the next extracted image to a stream using the requested output format.
`getImageCount()` → `int`	Returns the number of extracted images.
`close()`	Closes this extractor and releases the bound document.
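The image methods in the table follow an iterator-like pattern: `extractImage()` prepares the images for the page range, `hasNextImage()` reports whether any remain, and each `getNextImage(...)` call writes the next one. A minimal sketch of that loop is shown below. `ImageExtractor` is a hypothetical stand-in interface with signatures copied from the table, and `StubImageExtractor` and `collectImages` are illustrative helpers that serve fixed byte arrays instead of decoded images; none of them belong to the library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ImageLoopSketch {

    // Hypothetical stand-in for the image-related subset of PdfExtractor.
    interface ImageExtractor {
        void extractImage();
        boolean hasNextImage();
        void getNextImage(OutputStream stream);
    }

    // Stub that hands out a fixed list of byte arrays in order.
    static final class StubImageExtractor implements ImageExtractor {
        private final List<byte[]> source;
        private int next = 0;
        private boolean extracted = false;

        StubImageExtractor(List<byte[]> source) { this.source = source; }

        @Override public void extractImage() { extracted = true; next = 0; }

        @Override public boolean hasNextImage() { return extracted && next < source.size(); }

        @Override public void getNextImage(OutputStream stream) {
            try {
                stream.write(source.get(next++));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Drains every prepared image into its own in-memory buffer.
    static List<byte[]> collectImages(ImageExtractor extractor) {
        List<byte[]> images = new ArrayList<>();
        extractor.extractImage();
        while (extractor.hasNextImage()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            extractor.getNextImage(buffer);
            images.add(buffer.toByteArray());
        }
        return images;
    }

    public static void main(String[] args) {
        List<byte[]> fake = Arrays.asList(new byte[] {1, 2}, new byte[] {3});
        System.out.println(collectImages(new StubImageExtractor(fake)).size());
    }
}
```

The same loop shape applies to page-by-page text via `hasNextPageText()` and `getNextPageText(...)`, assuming those methods behave as their descriptions suggest.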
