Extract Tables from a PDF in C#

PDF is one of the most common file formats in office work. Because PDF files are self-contained, secure, and reliable, editing their content is a complex task. But what if you need to get data or tables out of one? This post introduces a way to extract table content from a PDF with C#/VB.NET code. The steps and code are listed below for your reference.

Programming Environment

This example uses Free Spire.PDF for .NET. The Spire.PDF.dll file can be referenced in either of the following ways:

Method 1: Download Free Spire.PDF for .NET, unzip it, and install it. After the installation is complete, find Spire.PDF.dll in the BIN folder under the installation path. Then open "Solution Explorer" in Visual Studio, right-click "References", choose "Add Reference", and add a reference to the dll file in that BIN folder.

Method 2: Install via NuGet, in either of two ways:

(1) Open "Solution Explorer" in Visual Studio, right-click "References", choose "Manage NuGet Packages", search for "Free Spire.PDF", and click "Install". Wait for the installation to complete.

(2) Copy the following command into the Package Manager Console and run it.

Install-Package FreeSpire.PDF -Version 8.6.0

Extract the table from PDF:

The steps to extract tables from a PDF are:

• Create a PdfDocument object and load the document using the PdfDocument.LoadFromFile() method.

• Create a PdfTableExtractor object and extract the tables on a given page using the PdfTableExtractor.ExtractTable(int pageIndex) method.

• Get the text of the cell at a specific row and column using the PdfTable.GetText(int rowIndex, int columnIndex) method.

• Save the extracted text as a TXT file.

【C#】

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;

namespace ExtractTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize an instance of the PdfDocument class
            PdfDocument pdf = new PdfDocument();

            // Load a PDF document
            pdf.LoadFromFile("programming language.pdf");

            // Create a StringBuilder to collect the extracted text
            StringBuilder builder = new StringBuilder();

            // Initialize an instance of PdfTableExtractor
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);

            // Declare an array of PdfTable objects
            PdfTable[] tableLists;

            // Traverse the PDF pages
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex);

                // Continue only if at least one table was found on this page
                if (tableLists != null && tableLists.Length > 0)
                {
                    // Traverse the tables
                    foreach (PdfTable table in tableLists)
                    {
                        // Get the number of rows and columns in the table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();

                        // Traverse the table rows and columns
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get the text of the cell at row i, column j
                                string text = table.GetText(i, j);

                                // Append the cell text to the StringBuilder
                                builder.Append(text + " ");
                            }

                            // End the current table row with a line break
                            builder.Append("\r\n");
                        }
                    }
                }
            }

            // Save the extracted table content as a txt document
            File.WriteAllText("extract table.txt", builder.ToString());
        }
    }
}

【VB.NET】

Imports Spire.Pdf
Imports Spire.Pdf.Utilities
Imports System.IO
Imports System.Text

Namespace ExtractTable
    Class Program
        Private Shared Sub Main(args As String())
            ' Initialize an instance of the PdfDocument class
            Dim pdf As New PdfDocument()

            ' Load a PDF document
            pdf.LoadFromFile("programming language.pdf")

            ' Create a StringBuilder to collect the extracted text
            Dim builder As New StringBuilder()

            ' Initialize an instance of PdfTableExtractor
            Dim extractor As New PdfTableExtractor(pdf)

            ' Declare an array of PdfTable objects
            Dim tableLists As PdfTable()

            ' Traverse the PDF pages
            For pageIndex As Integer = 0 To pdf.Pages.Count - 1
                ' Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex)

                ' Continue only if at least one table was found on this page
                If tableLists IsNot Nothing AndAlso tableLists.Length > 0 Then
                    ' Traverse the tables
                    For Each table As PdfTable In tableLists
                        ' Get the number of rows and columns in the table
                        Dim row As Integer = table.GetRowCount()
                        Dim column As Integer = table.GetColumnCount()

                        ' Traverse the table rows and columns
                        For i As Integer = 0 To row - 1
                            For j As Integer = 0 To column - 1
                                ' Get the text of the cell at row i, column j
                                Dim text As String = table.GetText(i, j)

                                ' Append the cell text to the StringBuilder
                                builder.Append(text & " ")
                            Next

                            ' End the current table row with a line break
                            builder.Append(vbCrLf)
                        Next
                    Next
                End If
            Next

            ' Save the extracted table content as a txt document
            File.WriteAllText("extract table.txt", builder.ToString())
        End Sub
    End Class
End Namespace
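
The listings above join cells with a single space, so a cell whose text itself contains spaces cannot be told apart from two separate cells in the TXT output. If that matters for your data, one option is to write comma-separated lines instead. Below is a minimal C# sketch of that idea. It is my own addition, not part of Spire.PDF: the names CsvSketch, EscapeCell, and ToCsvLine are hypothetical helpers, and the quoting follows the common CSV convention (quote a cell that contains a comma, a double quote, or a line break, and double any embedded quotes).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Spire.PDF: turns the cell texts of one
// table row into a single CSV line.
static class CsvSketch
{
    // Quote a cell only when it contains a comma, a double quote, or a
    // line break; embedded double quotes are doubled.
    public static string EscapeCell(string cell)
    {
        if (cell == null) return "";
        bool needsQuotes = cell.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        string escaped = cell.Replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join the escaped cells of one row with commas.
    public static string ToCsvLine(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(EscapeCell));
    }
}
```

To use it in the C# listing, you could collect the results of table.GetText(i, j) for one row into a List<string> inside the inner loop, then call builder.AppendLine(CsvSketch.ToCsvLine(rowCells)) once per row in place of the two Append calls. For example, the cells "C#" and "Hello, world" become the line C#,"Hello, world".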



Carina
