Full-Text Search with PDF documents in SQL Server 2014

FullTextSearch smallFull-Text Search is a powerful engine when it comes to in-text searching. Some time ago, I made a quick tutorial of Full-Text Search basic functionalities. It was based on indexing a VARCHAR column which is the most popular case, but there are other. For example, PDF documents may be stored in a database and used for searching. I think the most difficult part of this work is configuring an environment so SQL Server could know how to read and index documents in the PDF format. And this is exactly what I would like to show you this time.

This article is a video presentation which can be found at the bottom of the page. Below you can find basic steps of configuring an environment and scripts used in the video. For step by step details, please go to the video.

 

Configuring Full-Text Search to work with PDF documents

  1. Full-Text Search feature must be installed in the SQL Server instance. If you are not sure if it is installed, use the following query.
    select serverproperty('IsFullTextInstalled')
    If the query returns 1, it means Full-Text Search is installed. If it returns 0, it has to be installed. In that case, see Full-Text Search installation steps.
  2. Install Adobe iFilter. It can be found here - http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542. Sometimes it happens that for different reasons SQL Server cannot used iFilter despite it is installed. To reduce risk of such problems, remember to:
    1. Install it in a short path without spaces.
    2. Grant access rights to the iFilter directory to the SQL Server account.
    3. Add a path of the iFilter bin directory to the system PATH variable, not the user one.
  3. Tell SQL Server to load OS resources and to skip signature verification.
    EXEC sp_fulltext_service @action='load_os_resources', @value=1;
    EXEC sp_fulltext_service @action='verify_signature', @value=0;
  4. Verify if SQL Server knows about iFilter and associated it with PDF files.
    SELECT document_type, path
    FROM sys.fulltext_document_types
    WHERE document_type = '.pdf'
    It should return a row with a correct path to iFilter.

 

Testing Full-Text Search with PDF documents

I followed the below steps to verify correctness of the configuration.

  1. I created a table for documents.
    CREATE TABLE dbo.articles(
       id INT IDENTITY(1, 1) NOT NULL,
       content VARBINARY(MAX),
       extension VARCHAR(5),
       title NVARCHAR(255)
    );
    ALTER TABLE dbo.articles ADD CONSTRAINT PK_articles PRIMARY KEY CLUSTERED (
       id
    );
  2. I prepared a few PDF documents.
  3. I loaded those documents using queries like the one below.
    INSERT INTO dbo.articles(content, extension, title);
    SELECT bulkcolumn, '.pdf', 'Database-independent script in Liquibase'
    FROM OPENROWSET(BULK 'c:\temp\DBA presents - Database-independent script in Liquibase.pdf', SINGLE_BLOB) AS
    If you need to load many files, check this article - Loading all files from directory to table.
  4. A Full-Text Search catalog was created.
    CREATE FULLTEXT CATALOG DbaPresentsArticles; 
  5. A Full-Text Search index was too.
    CREATE FULLTEXT INDEX ON dbo.articles(
       content TYPE COLUMN extension LANGUAGE 1033
    )
    KEY INDEX PK_articles ON DbaPresentsArticles;

 

Using Full-Text Search

I used two queries to show how it works.

SELECT *
FROM dbo.articles
WHERE contains(content,'xml');

SELECT *
FROM dbo.articles
WHERE contains(content,'xml AND NOT liquibase');