PdfExtractor

Overview

PdfExtractor is a class in Aspose.Pdf FOSS for Java. It implements Closeable.

Facade for extracting text and images from PDF documents.
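
Because PdfExtractor is Closeable, it can be managed with try-with-resources so that close() runs even when a step fails. The sketch below illustrates that pattern only. StubExtractor is a hypothetical stand-in that mimics bindPdf, extractText, getTextAsString, and close; it is not the library class, its placeholder text is invented for the demo, and the real package name and behaviour are not shown here.

```java
import java.io.Closeable;

// Hypothetical stand-in that mimics a few documented PdfExtractor methods.
// It is NOT the Aspose class; substitute the real one in actual code.
class StubExtractor implements Closeable {
    private String bound;
    private String text = "";
    private boolean closed;

    void bindPdf(String inputFile) { bound = inputFile; }

    void extractText() {
        if (bound == null) throw new IllegalStateException("no PDF bound");
        text = "text of " + bound; // placeholder result, not real extraction
    }

    String getTextAsString() { return text; }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; bound = null; }
}

class LifecycleDemo {
    // Binds, extracts, and returns the text; close() runs even if a step throws.
    static String extractAll(StubExtractor extractor, String inputFile) {
        try (StubExtractor e = extractor) {
            e.bindPdf(inputFile);
            e.extractText();
            return e.getTextAsString();
        }
    }

    public static void main(String[] args) {
        StubExtractor extractor = new StubExtractor();
        System.out.println(extractAll(extractor, "input.pdf"));
        System.out.println("closed: " + extractor.isClosed());
    }
}
```

With the real class, the same shape would apply: construct the extractor in the try header, bind, extract, read the result, and let the block close it.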

Properties

| Name | Type | Access | Description |
| --- | --- | --- | --- |
| `password` | `String` | Read | Returns the password applied when opening encrypted PDFs in subsequent `bindPdf(String)` or `bindPdf(InputStream)` calls. |
| `startPage` | `int` | Read | Returns the start page. |
| `endPage` | `int` | Read | Returns the end page. |
| `extractTextMode` | `int` | Read | Returns the current text extraction mode. |
| `resolution` | `Resolution` | Read | Returns the configured extraction resolution. |
| `extractImageMode` | `ExtractImageMode` | Read | Returns the image extraction mode. |
| `textAsString` | `String` | Read | Returns the extracted text as a string. |
| `textSearchOptions` | `TextSearchOptions` | Read | Returns text search options associated with the extractor. |
| `bidi` | `boolean` | Read | Returns whether the current extraction contains bidi text. |
| `attachNames` | `List<String>` | Read | Returns the currently prepared attachment names. |
| `imageCount` | `int` | Read | Returns the number of extracted images. |

Methods

| Signature | Description |
| --- | --- |
| `PdfExtractor()` | Creates a new PdfExtractor instance. |
| `PdfExtractor(document: Document)` | Creates a new PdfExtractor bound to an existing document. |
| `PdfExtractor(stream: InputStream)` | Creates a new PdfExtractor bound to a PDF stream. |
| `bindPdf(inputFile: String)` | Binds a PDF file to this extractor. |
| `bindPdf(stream: InputStream)` | Binds a PDF from an input stream. |
| `getPassword(): String` | Returns the password applied when opening encrypted PDFs in subsequent `bindPdf(String)` or `bindPdf(InputStream)` calls. |
| `setPassword(password: String)` | Sets the password used by subsequent `bindPdf` calls to open encrypted PDFs. |
| `bindPdf(document: Document)` | Binds an existing Document to this extractor. |
| `setStartPage(page: int)` | Sets the start page for extraction (1-based). |
| `getStartPage(): int` | Returns the start page. |
| `setEndPage(page: int)` | Sets the end page for extraction (1-based). |
| `getEndPage(): int` | Returns the end page. |
| `setExtractTextMode(mode: int)` | Sets the text extraction mode. |
| `getExtractTextMode(): int` | Returns the current text extraction mode. |
| `setResolution(resolution: Resolution)` | Sets image extraction resolution for API parity. |
| `getResolution(): Resolution` | Returns the configured extraction resolution. |
| `setExtractImageMode(extractImageMode: ExtractImageMode)` | Sets the image extraction mode. |
| `getExtractImageMode(): ExtractImageMode` | Returns the image extraction mode. |
| `extractText()` | Extracts text from the page range. |
| `extractText(encoding: Charset)` | Extracts text from the page range using the requested output encoding. |
| `getText(outputPath: String)` | Writes extracted text to a file. |
| `getText(stream: OutputStream)` | Writes extracted text to an output stream. |
| `getTextAsString(): String` | Returns the extracted text as a string. |
| `getTextSearchOptions(): TextSearchOptions` | Returns text search options associated with the extractor. |
| `setTextSearchOptions(textSearchOptions: TextSearchOptions)` | Sets text search options used by subsequent extraction calls. |
| `hasNextPageText(): boolean` | Returns whether page-by-page extracted text remains available. |
| `getNextPageText(outputPath: String)` | Writes the next page text to a file. |
| `getNextPageText(stream: OutputStream)` | Writes the next page text to a stream. |
| `isBidi(): boolean` | Returns whether the current extraction contains bidi text. |
| `extractAttachment()` | Prepares attachment extraction for all embedded files. |
| `extractAttachment(name: String)` | Prepares extraction for a specific attachment key or file name. |
| `getAttachNames(): List<String>` | Returns the currently prepared attachment names. |
| `getAttachment(outputPath: String)` | Writes the prepared attachment(s) to the given file or directory. |
| `extractImage()` | Extracts images from the page range. |
| `hasNextImage(): boolean` | Returns whether there are more extracted images to retrieve. |
| `getNextImage(outputPath: String)` | Saves the next extracted image to a file. |
| `getNextImage(outputPath: String, format: ImageFormat)` | Saves the next extracted image to a file using the requested output format. |
| `getNextImage(stream: OutputStream)` | Saves the next extracted image to an output stream. |
| `getNextImage(stream: OutputStream, format: ImageFormat)` | Saves the next extracted image to a stream using the requested output format. |
| `getImageCount(): int` | Returns the number of extracted images. |
| `close()` | Closes this extractor and releases the bound document. |
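
The hasNextPageText() and getNextPageText(stream) pair suggests a drain loop: keep asking for the next page while text remains. The following is a minimal sketch of that loop, not the library's implementation. PageTextSource and FakePageTextSource are hypothetical names introduced for illustration; they mirror only the two documented signatures and serve canned strings. Whether the real methods throw checked exceptions, and whether extractText() must be called first, are assumptions not confirmed by this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical interface mirroring two documented PdfExtractor methods.
interface PageTextSource {
    boolean hasNextPageText();
    void getNextPageText(OutputStream stream);
}

// In-memory fake standing in for an extractor that already holds page text.
class FakePageTextSource implements PageTextSource {
    private final Iterator<String> pages;

    FakePageTextSource(List<String> pages) { this.pages = pages.iterator(); }

    @Override
    public boolean hasNextPageText() { return pages.hasNext(); }

    @Override
    public void getNextPageText(OutputStream stream) {
        try {
            stream.write(pages.next().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class PageTextDemo {
    // Drains the source one page at a time, collecting each page as a string.
    static List<String> collectPages(PageTextSource source) {
        List<String> result = new ArrayList<>();
        while (source.hasNextPageText()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.getNextPageText(buffer);
            result.add(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        PageTextSource source =
                new FakePageTextSource(Arrays.asList("page one", "page two"));
        for (String page : collectPages(source)) {
            System.out.println(page);
        }
    }
}
```

The hasNextImage() and getNextImage(stream) pair has the same shape, so the same loop structure would likely apply to image extraction.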

See Also