Extract the table from PDF in C#
PDF is one of the most common file formats in office work, and its use keeps
growing. Because PDF files are tightly integrated, secure, and reliable,
editing their content is difficult. But what if you need to pull data or
tables out of one? This article shows how to extract table content from a PDF
with C#/VB.NET code. The steps and code are given below for reference.
Programming Environment
This example uses Free Spire.PDF for .NET. Reference Spire.PDF.dll in your
project in one of two ways:
Method 1: Download Free Spire.PDF for .NET, unzip it, and install it. After
installation, Spire.PDF.dll is in the BIN folder under the installation path.
In Visual Studio, open "Solution Explorer", right-click "References", choose
"Add Reference", and add the dll from that BIN folder to the project.
Method 2: Install via NuGet, in either of two ways:
(1) In Visual Studio, open "Solution Explorer", right-click "References",
choose "Manage NuGet Packages", search for "Free Spire.PDF", and click
"Install". Wait for the installation to finish.
(2) Run the following command in the Package Manager Console:
Install-Package FreeSpire.PDF -Version 8.6.0
Extract the table from PDF:
The steps to extract tables from a PDF are:
• Create a PdfDocument object and load the document with the PdfDocument.LoadFromFile() method.
• Extract the tables on a given page with the PdfTableExtractor.ExtractTable(int pageIndex) method.
• Get the text of the cell at a given row and column with the PdfTable.GetText(int rowIndex, int columnIndex) method.
• Write the extracted text to a TXT file.
【C#】
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;

namespace ExtractTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument instance
            PdfDocument pdf = new PdfDocument();
            // Load a PDF document
            pdf.LoadFromFile("programming language.pdf");
            // Create a StringBuilder to collect the extracted text
            StringBuilder builder = new StringBuilder();
            // Create a PdfTableExtractor for the document
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);
            // Declare an array to hold the tables found on a page
            PdfTable[] tableLists;
            // Loop through the PDF pages
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex);
                // Continue only if at least one table was found
                if (tableLists != null && tableLists.Length > 0)
                {
                    // Loop through the tables
                    foreach (PdfTable table in tableLists)
                    {
                        // Get the number of rows and columns in the table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();
                        // Loop through the rows and columns
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get the text of the cell at row i, column j
                                string text = table.GetText(i, j);
                                // Append the cell text, followed by a space
                                builder.Append(text + " ");
                            }
                            // End the row with a line break
                            builder.Append("\r\n");
                        }
                    }
                }
            }
            // Save the extracted table content to a TXT file
            File.WriteAllText("extract table.txt", builder.ToString());
        }
    }
}
【VB.NET】
Imports Spire.Pdf
Imports Spire.Pdf.Utilities
Imports System.IO
Imports System.Text

Namespace ExtractTable
    Class Program
        Private Shared Sub Main(args As String())
            ' Create a PdfDocument instance
            Dim pdf As New PdfDocument()
            ' Load a PDF document
            pdf.LoadFromFile("programming language.pdf")
            ' Create a StringBuilder to collect the extracted text
            Dim builder As New StringBuilder()
            ' Create a PdfTableExtractor for the document
            Dim extractor As New PdfTableExtractor(pdf)
            ' Declare an array to hold the tables found on a page
            Dim tableLists As PdfTable()
            ' Loop through the PDF pages
            For pageIndex As Integer = 0 To pdf.Pages.Count - 1
                ' Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex)
                ' Continue only if at least one table was found
                If tableLists IsNot Nothing AndAlso tableLists.Length > 0 Then
                    ' Loop through the tables
                    For Each table As PdfTable In tableLists
                        ' Get the number of rows and columns in the table
                        Dim row As Integer = table.GetRowCount()
                        Dim column As Integer = table.GetColumnCount()
                        ' Loop through the rows and columns
                        For i As Integer = 0 To row - 1
                            For j As Integer = 0 To column - 1
                                ' Get the text of the cell at row i, column j
                                Dim text As String = table.GetText(i, j)
                                ' Append the cell text, followed by a space
                                builder.Append(text & " ")
                            Next
                            ' End the row with a line break
                            builder.Append(vbCrLf)
                        Next
                    Next
                End If
            Next
            ' Save the extracted table content to a TXT file
            File.WriteAllText("extract table.txt", builder.ToString())
        End Sub
    End Class
End Namespace
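
The programs above join cells with a single space, so a cell whose text itself contains spaces can no longer be told apart from its neighbours in the TXT output. If you need column boundaries preserved, one option is to write each row as a CSV line instead. The following is only a minimal sketch of that idea: CsvRow, Escape, and Join are hypothetical names introduced here, not part of Spire.PDF, and the sample values stand in for what table.GetText(i, j) would return.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of Spire.PDF): formats one table row as a CSV line.
static class CsvRow
{
    // Quote a cell if it contains a comma, a double quote, or a line break,
    // and double any embedded double quotes (the usual CSV convention).
    public static string Escape(string cell)
    {
        cell = cell ?? string.Empty;
        bool needsQuotes = cell.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + cell.Replace("\"", "\"\"") + "\"" : cell;
    }

    // Escape every cell and join the results with commas.
    public static string Join(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(Escape));
    }
}

class Demo
{
    static void Main()
    {
        // Sample values standing in for the results of table.GetText(i, j).
        string[] row = { "Name", "Hello, world", "42" };
        Console.WriteLine(CsvRow.Join(row)); // prints: Name,"Hello, world",42
    }
}
```

To use it in the extraction loop, you would collect the cells of row i into a list and append CsvRow.Join(cells) to the StringBuilder, rather than appending each cell followed by a space.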
[Screenshot: result of the extraction]