PdfExtractor
Overview
PdfExtractor is a class in Aspose.PDF FOSS for .NET.
Implements: IDisposable.
Facade for extracting text and images from a PDF document.
This class provides 29 constructor and method overloads.
Available methods: BindPdf, Close, Dispose, ExtractAttachment, ExtractImage, ExtractText, GetAttachNames, GetAttachment, GetAttachmentInfo, GetNextImage, GetNextPageText, GetText, HasNextImage, and HasNextPageText.
All public members are accessible to any .NET application after installing the Aspose.PDF FOSS for .NET package.
Properties: EndPage, ExtractImageMode, ExtractTextMode, IsBidi, Password, Resolution, StartPage, and TextSearchOptions.
Properties
| Name | Type | Access | Description |
|---|---|---|---|
| StartPage | int | Read/Write | 1-based start page for extraction. |
| EndPage | int | Read/Write | 1-based end page for extraction. |
| ExtractTextMode | int | Read/Write | Text extraction mode. |
| ExtractImageMode | ExtractImageMode | Read/Write | Image extraction strategy. |
| Resolution | int | Read/Write | Rendering resolution for image extraction (DPI). |
| IsBidi | bool | Read | True when the most recently extracted text contains a right-to-left script run (Hebrew, Arabic, Syriac, Thaana, etc.). |
| Password | string? | Read/Write | Password used when binding encrypted PDFs. |
| TextSearchOptions | Aspose.Pdf.Text.TextSearchOptions | Read/Write | Text-search options applied during ExtractText(). |
Methods
| Signature | Description |
|---|---|
| PdfExtractor() | Initializes a new PdfExtractor instance with no document bound. |
| PdfExtractor(document: Document) | Initializes a new PdfExtractor instance for the given Document. |
| BindPdf(inputFile: string) | Binds the PDF file at the given path as the source document. |
| BindPdf(document: Document) | Binds an existing Document as the source document. |
| BindPdf(inputStream: Stream) | Binds a PDF read from the given stream as the source document. |
| ExtractText() | Walks every page in the bound document with TextAbsorber and concatenates the extracted text. |
| ExtractText(outputFile: string) | Extracts text from the bound PDF and saves it to the given file (UTF-16 LE bytes). |
| ExtractText(encoding: Encoding) | Extracts text from the bound PDF using the given encoding. |
| GetText(outputStream: Stream) | Writes the extracted text as bytes into outputStream, using the encoding set by the most recent ExtractText(Encoding) call (defaults to UTF-16 LE). |
| GetText(outputStream: Stream, filterNotAscii: bool) | Writes the extracted text as bytes, optionally dropping non-ASCII code points first. |
| GetText(outputFile: string) | Writes the extracted text to the given file path. |
| HasNextPageText() | True if there is another page’s text available via GetNextPageText. |
| GetNextPageText(outputFile: string) | Writes the next page’s text to the given file path (UTF-16 LE bytes). |
| GetNextPageText(outputStream: Stream) | Writes the next page’s text to the given stream (UTF-16 LE bytes). |
| ExtractImage() | Extracts the images in the bound document for retrieval with HasNextImage and GetNextImage. |
| ExtractImage(outputDirectory: string) | Extracts every image to the given directory, naming files image_N.jpg for JPEG sources and image_N.png otherwise. |
| HasNextImage() | True while GetNextImage(string) has a remaining image to write. |
| GetNextImage(outputStream: Stream) | Writes the next image to the given stream. |
| GetNextImage(outputFile: string) | Writes the next image to the given file path. |
| GetNextImage(outputStream: Stream, format: System.Drawing.Imaging.ImageFormat) | Writes the next image to the given stream in the given image format. |
| GetNextImage(outputFile: string, format: System.Drawing.Imaging.ImageFormat) | Writes the next image to the given file path in the given image format. |
| ExtractAttachment() | Selects every embedded file in the bound document; a subsequent GetAttachment(string) writes all of them. |
| ExtractAttachment(attachmentFileName: string) | Selects a single embedded file by name for extraction. |
| GetAttachNames() | Returns the names of every embedded file in the bound document. |
| GetAttachmentInfo() | Returns the FileSpecification entries for every embedded file in the bound document. |
| GetAttachment() | Returns the selected attachments’ content as MemoryStreams (all attachments when none were explicitly selected). |
| GetAttachment(outputPath: string) | Writes the selected attachments (all when none were explicitly selected) into the outputPath directory, one file per attachment named after its file name. |
| Close() | Closes this PdfExtractor instance. |
| Dispose() | Releases the resources held by this PdfExtractor instance (IDisposable implementation). |
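
The HasNextPageText / GetNextPageText(Stream) pair in the table above suggests a page-by-page loop in which each page arrives as UTF-16 LE bytes. The following is a minimal sketch of that loop, not the library's own code. To stay compilable without the package, it takes two delegates standing in for the two methods; PageTextReader and ReadPages are hypothetical names introduced here, and the byte-order-mark handling is an assumption, since this page does not say whether a BOM is written.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of Aspose.PDF FOSS for .NET.
public static class PageTextReader
{
    // hasNext and writeNext stand in for HasNextPageText() and
    // GetNextPageText(Stream) on a bound PdfExtractor.
    public static List<string> ReadPages(Func<bool> hasNext, Action<Stream> writeNext)
    {
        var pages = new List<string>();
        while (hasNext())
        {
            using var buffer = new MemoryStream();
            writeNext(buffer);

            // The table documents the bytes as UTF-16 LE, which .NET
            // exposes as Encoding.Unicode.
            string text = Encoding.Unicode.GetString(buffer.ToArray());

            // Assumption: a byte-order mark may or may not be present.
            // GetString does not strip it, so trim it if it is there.
            pages.Add(text.TrimStart('\uFEFF'));
        }
        return pages;
    }
}

// Hypothetical wiring, assuming `extractor` is a PdfExtractor that has
// already been bound with BindPdf and has had ExtractText() called:
//
//   List<string> pages = PageTextReader.ReadPages(
//       () => extractor.HasNextPageText(),
//       stream => extractor.GetNextPageText(stream));
```

Buffering each page in its own MemoryStream keeps the pages separate. If only the concatenated text is needed, GetText(Stream) avoids the loop altogether.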