In one of my previous articles (Full-Text Search with PDF documents) I loaded a few files into a table using OPENROWSET function. It was a good way to achieve what I wanted to do but I understand it might be a bit painful if there are hundreds or thousands of files to load. A today's question is - is there a way to load all PDF files from a specific directory without loading them one by one?
Goal - load all PDF files from a directory
I have PDF documents in c:\temp\ directory which I would like load into a table in a SQL Server database. There might be many purposes of it, for example to index them for Full-Text Search. If there are a few files only, it can be done by listing them one by one and loading them separately:
INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Database-independent script in Liquibase'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Database-independent script in Liquibase.pdf', SINGLE_BLOB ) AS t;
INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Full-Text Search - practical introduction [video]'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Full-Text Search - practical introduction [video].pdf', SINGLE_BLOB ) AS t;
INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Generating XML from SQL - part 1'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Generating XML from SQL - part 1.pdf', SINGLE_BLOB ) AS t;
A problem emerges when there multiple of documents - hundreds, thousands, millions. Loading them one by one using the above statements might be though.
Solution - automate scripts generation
Is there a way to tell SQL Server to load all files from a directory to a table? Hmm ... not an easy one.
OPENROWSET with MULTIPLE_BLOBS option?
If you take a look at OPENROWSET documentation, you will not find an option like MULTIPLE_BLOBS next to SINGLE_BLOB that would require a directory location instead of a path to a file. We need to keep looking.
Iterate through all files using T-SQL?
It seems possible to write a script that uses xp_DirTree function to return all files from a directory and uses this result set to load each file with OPENROWSET. There is one big BUT - xp_DirTree is undocumented and unsupported. Because of it, I do not recommend that way.
Generate T-SQL script using Windows CMD
This is a solution I prefer. It works perfectly if you need to load multiple files once. I create a BAT file (genSQL.bat):
@echo off echo. > loadFiles.sql for /r %%i in (*.pdf) do ( echo INSERT INTO dbo.articles(content, extension, title^) echo SELECT bulkcolumn, '.pdf', '%%~ni' echo FROM OPENROWSET(BULK '%%i', SINGLE_BLOB^) AS t; ) >> loadFiles.sql
When executed in the Windows CMD:
c:\temp>genSQL.bat c:\temp>
it iterates through all PDF files in the current directory and prints an insert statement to loadFiles.sql for each one. Then, loadFiles.sql contains a separate insert statement for each document:
INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Database-independent script in Liquibase'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Database-independent script in Liquibase.pdf', SINGLE_BLOB ) AS t;
INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Full-Text Search - practical introduction [video]'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Full-Text Search - practical introduction [video].pdf', SINGLE_BLOB ) AS t;
INSERT INTO dbo.articles (content, extension, title )
SELECT bulkcolumn, '.pdf', 'DBA presents - Generating XML from SQL - part 1'
FROM OPENROWSET (BULK 'c:\temp\DBA presents - Generating XML from SQL - part 1.pdf', SINGLE_BLOB ) AS t;
...
This content can be executed against a SQL Server database to load all PDF files to a table.
This solution follows a two phase approach: generate SQL script and execute it. It is good enough in many real cases.
Do you have any other idea? Share in the comments.