PdfExtractor

Overview

PdfExtractor is a class in Aspose.PDF FOSS for .NET. It implements IDisposable.

Facade for extracting text and images from a PDF document.

The class exposes 29 public constructor and method overloads: the PdfExtractor constructors plus BindPdf, Close, Dispose, ExtractAttachment, ExtractImage, ExtractText, GetAttachNames, GetAttachment, GetAttachmentInfo, GetNextImage, GetNextPageText, GetText, HasNextImage, and HasNextPageText. It has eight properties: StartPage, EndPage, ExtractTextMode, ExtractImageMode, Resolution, IsBidi, Password, and TextSearchOptions. All public members are available to any .NET application once the Aspose.PDF FOSS for .NET package is installed.

Properties

| Name | Type | Access | Description |
| --- | --- | --- | --- |
| StartPage | int | Read/Write | 1-based start page for extraction. |
| EndPage | int | Read/Write | 1-based end page for extraction. |
| ExtractTextMode | int | Read/Write | Text extraction mode. |
| ExtractImageMode | ExtractImageMode | Read/Write | Image extraction strategy. |
| Resolution | int | Read/Write | Rendering resolution for image extraction (DPI). |
| IsBidi | bool | Read | True when the most recently extracted text contains a right-to-left script run (Hebrew, Arabic, Syriac, Thaana, etc.). |
| Password | string? | Read/Write | Password used when binding encrypted PDFs. |
| TextSearchOptions | Aspose.Pdf.Text.TextSearchOptions | Read/Write | Text-search options applied during ExtractText(). |
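
As an illustration of how these properties might be combined, here is a minimal sketch rather than a definitive implementation. It assumes the class lives in an Aspose.Pdf.Facades namespace (the namespace is not stated on this page), that Password must be set before binding, and that the page range and resolution apply to the subsequent ExtractImage call. The file name, password, and output directory are placeholders.

```csharp
using System.IO;
using Aspose.Pdf.Facades; // assumption: namespace is not stated on this page

class ConfigureExtractorExample
{
    static void Main()
    {
        // PdfExtractor implements IDisposable, so a using declaration releases it.
        using var extractor = new PdfExtractor();

        // Assumption: the password is read when the document is bound,
        // so it is set before BindPdf.
        extractor.Password = "secret";          // placeholder password
        extractor.BindPdf("encrypted.pdf");     // placeholder input path

        // Page numbers are 1-based according to the property table.
        extractor.StartPage = 1;
        extractor.EndPage = 3;

        // Rendering resolution for image extraction, in DPI.
        extractor.Resolution = 150;

        // Create the output directory ourselves; whether ExtractImage
        // creates it is not documented here.
        Directory.CreateDirectory("images");
        extractor.ExtractImage("images");
    }
}
```

Setting the range and resolution after binding is a design choice of this sketch only; the page does not say whether order matters.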

Methods

| Signature | Description |
| --- | --- |
| PdfExtractor() | Creates a new PdfExtractor instance. |
| PdfExtractor(document: Document) | Creates a new PdfExtractor instance for the given Document. |
| BindPdf(inputFile: string) | Binds a PDF from the given file path. |
| BindPdf(document: Document) | Binds the given Document. |
| BindPdf(inputStream: Stream) | Binds a PDF from the given stream. |
| ExtractText() | Walks every page in the bound document with TextAbsorber and concatenates the extracted text. |
| ExtractText(outputFile: string) | Extracts text from the bound PDF and saves it to the given file (UTF-16 LE bytes). |
| ExtractText(encoding: Encoding) | Extracts text from the bound PDF using the given encoding. |
| GetText(outputStream: Stream) | Writes the extracted text as bytes into outputStream, using the encoding set by the most recent ExtractText(Encoding) call (defaults to UTF-16 LE). |
| GetText(outputStream: Stream, filterNotAscii: bool) | Writes the extracted text as bytes, optionally dropping non-ASCII code points first. |
| GetText(outputFile: string) | Writes the extracted text to the given file path. |
| HasNextPageText() | True if another page’s text is available via GetNextPageText. |
| GetNextPageText(outputFile: string) | Writes the next page’s text to the given file path (UTF-16 LE bytes). |
| GetNextPageText(outputStream: Stream) | Writes the next page’s text to the given stream (UTF-16 LE bytes). |
| ExtractImage() | Extracts images from the bound PDF. |
| ExtractImage(outputDirectory: string) | Extracts every image to the given directory, naming files image_N.jpg for JPEG sources and image_N.png otherwise. |
| HasNextImage() | True while GetNextImage(string) has a remaining image to write. |
| GetNextImage(outputStream: Stream) | Writes the next image to the given stream. |
| GetNextImage(outputFile: string) | Writes the next image to the given file path. |
| GetNextImage(outputStream: Stream, format: System.Drawing.Imaging.ImageFormat) | Writes the next image to the given stream in the given format. |
| GetNextImage(outputFile: string, format: System.Drawing.Imaging.ImageFormat) | Writes the next image to the given file path in the given format. |
| ExtractAttachment() | Selects every embedded file in the bound document; a subsequent GetAttachment(string) writes all of them. |
| ExtractAttachment(attachmentFileName: string) | Selects a single embedded file by name for extraction. |
| GetAttachNames() | Returns the names of every embedded file in the bound document. |
| GetAttachmentInfo() | Returns FileSpecification entries for every embedded file in the bound document. |
| GetAttachment() | Returns the selected attachments’ content as MemoryStreams (all attachments when none were explicitly selected). |
| GetAttachment(outputPath: string) | Writes the selected attachments (all when none were explicitly selected) into the outputPath directory, one file per attachment named after its file name. |
| Close() | Closes this PdfExtractor instance. |
| Dispose() | Disposes this PdfExtractor instance. |
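
The text-extraction methods can be chained roughly as follows. This is a hedged sketch under stated assumptions, not the library's canonical usage. It assumes an Aspose.Pdf.Facades namespace (not stated on this page), that ExtractText() must run before GetText or the per-page methods, and that the default UTF-16 LE encoding is in effect. The file names are placeholders.

```csharp
using System;
using System.IO;
using System.Text;
using Aspose.Pdf.Facades; // assumption: namespace is not stated on this page

class ExtractTextExample
{
    static void Main()
    {
        using var extractor = new PdfExtractor();
        extractor.BindPdf("input.pdf"); // placeholder input path

        // Assumption: extraction has to happen before any text is read back.
        extractor.ExtractText();

        // GetText(Stream) writes bytes, UTF-16 LE by default per the table,
        // so decode them with Encoding.Unicode (UTF-16 LE in .NET).
        using var buffer = new MemoryStream();
        extractor.GetText(buffer);
        string text = Encoding.Unicode.GetString(buffer.ToArray());

        // Drop a leading byte-order mark if one was written.
        text = text.TrimStart('\uFEFF');
        Console.WriteLine(text);

        // Page-by-page variant: one UTF-16 LE file per page.
        int index = 1; // counter for output files, not necessarily the PDF page number
        while (extractor.HasNextPageText())
        {
            extractor.GetNextPageText($"page_{index}.txt");
            index++;
        }
    }
}
```

The per-page files contain UTF-16 LE bytes, so readers should open them with Encoding.Unicode rather than the platform default.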

See Also