Extract Tables from PDF in C#/VB.NET

 

PDF is one of the most common file formats in office work. Because PDF files are self-contained, secure, and reliable, editing their content is complex and difficult. But what if you need to pull data or tables out of one? This article shows how to extract table content from a PDF with C#/VB.NET code. The steps and code are given below for reference.

Programming Environment

This example uses Free Spire.PDF for .NET. You can reference Spire.PDF.dll in your project in either of the following ways:

 

Method 1: Download Free Spire.PDF for .NET, unzip it, and install it. After installation, find Spire.PDF.dll in the BIN folder under the installation path. Then, in the Visual Studio "Solution Explorer", right-click "References", choose "Add Reference", and add the dll file from that BIN folder to your project.

 

Method 2: Install via NuGet, in either of the following two ways:

 

(1) In the Visual Studio "Solution Explorer", right-click "References", choose "Manage NuGet Packages", search for "Free Spire.PDF", and click "Install". Wait for the installation to complete.

 

(2) Run the following command in the Package Manager Console:

 

Install-Package FreeSpire.PDF -Version 8.6.0

 

Extract the table from PDF:

The steps to extract tables from a PDF are as follows:

- Create a PdfDocument object and load the document using the PdfDocument.LoadFromFile() method.
- Create a PdfTableExtractor object and extract the tables on a given page using the PdfTableExtractor.ExtractTable(int pageIndex) method.
- Get the text of the cell at a given row and column using the PdfTable.GetText(int rowIndex, int columnIndex) method.
- Write the extracted text to a TXT file.


C#

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;

namespace ExtractTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument instance
            PdfDocument pdf = new PdfDocument();

            // Load the PDF document
            pdf.LoadFromFile("programming language.pdf");

            // Create a StringBuilder to collect the extracted text
            StringBuilder builder = new StringBuilder();

            // Create a PdfTableExtractor for the document
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);

            // Holds the tables found on the current page
            PdfTable[] tableLists;

            // Loop through the pages of the PDF
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex);

                // Skip pages on which no table was found
                if (tableLists != null && tableLists.Length > 0)
                {
                    // Loop through the tables on the page
                    foreach (PdfTable table in tableLists)
                    {
                        // Get the number of rows and columns in the table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();

                        // Loop through the rows and columns of the table
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get the text of the cell at row i, column j
                                string text = table.GetText(i, j);

                                // Append the cell text, followed by a space
                                builder.Append(text + " ");
                            }

                            // End the line after each table row
                            builder.Append("\r\n");
                        }
                    }
                }
            }

            // Save the extracted table content to a txt file
            File.WriteAllText("extract table.txt", builder.ToString());
        }
    }
}
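The listing above joins cells with a single space, so a cell that itself contains spaces can no longer be told apart from its neighbours. If that matters, one option is to write each row as a CSV line instead. The following is only a minimal sketch of that idea, not part of Free Spire.PDF: the CsvRow class, its Escape and Join methods, and the sample row values are all hypothetical names and data chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvRow
{
    // Quote a cell only when it contains a comma, a double quote, or a line break.
    // Embedded double quotes are doubled, following the usual CSV convention.
    public static string Escape(string cell)
    {
        cell = cell ?? string.Empty;
        bool needsQuotes = cell.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + cell.Replace("\"", "\"\"") + "\"" : cell;
    }

    // Join the cells of one table row into a single CSV line.
    public static string Join(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(c => Escape(c)));
    }
}

public static class Demo
{
    public static void Main()
    {
        // Hypothetical values standing in for the results of table.GetText(i, j).
        var row = new[] { "Language", "Object-oriented, functional", "1995" };
        Console.WriteLine(CsvRow.Join(row)); // Language,"Object-oriented, functional",1995
    }
}
```

To use it with the extraction loop, you could collect the cell texts of each row into a list and append CsvRow.Join(cells) to the StringBuilder in place of the space-separated text.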

 

VB.NET

 

Imports Spire.Pdf
Imports Spire.Pdf.Utilities
Imports System.IO
Imports System.Text

Namespace ExtractTable
    Class Program
        Private Shared Sub Main(args As String())
            ' Create a PdfDocument instance
            Dim pdf As New PdfDocument()

            ' Load the PDF document
            pdf.LoadFromFile("programming language.pdf")

            ' Create a StringBuilder to collect the extracted text
            Dim builder As New StringBuilder()

            ' Create a PdfTableExtractor for the document
            Dim extractor As New PdfTableExtractor(pdf)

            ' Holds the tables found on the current page
            Dim tableLists As PdfTable()

            ' Loop through the pages of the PDF
            For pageIndex As Integer = 0 To pdf.Pages.Count - 1
                ' Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex)

                ' Skip pages on which no table was found
                If tableLists IsNot Nothing AndAlso tableLists.Length > 0 Then
                    ' Loop through the tables on the page
                    For Each table As PdfTable In tableLists
                        ' Get the number of rows and columns in the table
                        Dim row As Integer = table.GetRowCount()
                        Dim column As Integer = table.GetColumnCount()

                        ' Loop through the rows and columns of the table
                        For i As Integer = 0 To row - 1
                            For j As Integer = 0 To column - 1
                                ' Get the text of the cell at row i, column j
                                Dim text As String = table.GetText(i, j)

                                ' Append the cell text, followed by a space
                                builder.Append(text & " ")
                            Next

                            ' End the line after each table row
                            builder.Append(vbCrLf)
                        Next
                    Next
                End If
            Next

            ' Save the extracted table content to a txt file
            File.WriteAllText("extract table.txt", builder.ToString())
        End Sub
    End Class
End Namespace


Result: each table row is written to "extract table.txt" as one line, with the cell texts separated by spaces.

 




