Loading all files from directory to table

dirIn one of my previous articles (Full-Text Search with PDF documents) I loaded a few files into a table using OPENROWSET function. It was a good way to achieve what I wanted to do but I understand it might be a bit painful if there are hundreds or thousands of files to load. A today's question is - is there a way to load all PDF files from a specific directory without loading them one by one? 

 

Goal - load all PDF files from a directory

I have PDF documents in c:\temp\ directory which I would like load into a table in a SQL Server database. There might be many purposes of it, for example to index them for Full-Text Search. If there are a few files only, it can be done by listing them one by one and loading them separately:

INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Database-independent script in Liquibase'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Database-independent script in Liquibase.pdf', SINGLE_BLOB ) AS t;

INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Full-Text Search - practical introduction [video]'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Full-Text Search - practical introduction [video].pdf', SINGLE_BLOB ) AS t;

INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Generating XML from SQL - part 1'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Generating XML from SQL - part 1.pdf', SINGLE_BLOB ) AS t;

A problem emerges when there multiple of documents - hundreds, thousands, millions. Loading them one by one using the above statements might be though.

 

Solution - automate scripts generation

Is there a way to tell SQL Server to load all files from a directory to a table? Hmm ... not an easy one.

 

OPENROWSET with MULTIPLE_BLOBS option?

If you take a look at OPENROWSET documentation, you will not find an option like MULTIPLE_BLOBS next to SINGLE_BLOB that would require a directory location instead of a path to a file. We need to keep looking.

 

Iterate through all files using T-SQL?

It seems possible to write a script that uses xp_DirTree function to return all files from a directory and uses this result set to load each file with OPENROWSET. There is one big BUT - xp_DirTree is undocumented and unsupported. Because of it, I do not recommend that way.

 

Generate T-SQL script using Windows CMD

This is a solution I prefer. It works perfectly if you need to load multiple files once. I create a BAT file (genSQL.bat):

@echo off
echo. > loadFiles.sql
for /r %%i in (*.pdf) do (
	echo INSERT INTO dbo.articles(content, extension, title^)
	echo SELECT bulkcolumn, '.pdf', '%%~ni'
	echo FROM OPENROWSET(BULK '%%i', SINGLE_BLOB^) AS t;
) >> loadFiles.sql

When executed in the Windows CMD:

c:\temp>genSQL.bat

c:\temp>

it iterates through all PDF files in the current directory and prints an insert statement to loadFiles.sql for each one. Then, loadFiles.sql contains a separate insert statement for each document:

INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Database-independent script in Liquibase'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Database-independent script in Liquibase.pdf', SINGLE_BLOB ) AS t;

INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Full-Text Search - practical introduction [video]'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Full-Text Search - practical introduction [video].pdf', SINGLE_BLOB ) AS t;

INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Generating XML from SQL - part 1'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Generating XML from SQL - part 1.pdf', SINGLE_BLOB ) AS t;
...

This content can be executed against a SQL Server database to load all PDF files to a table.

This solution follows a two phase approach: generate SQL script and execute it. It is good enough in many real cases.

Do you have any other idea? Share in the comments.