
Monday, March 23, 2009

Coordinating Parallel Processes Using a SQL Control Table

There have been a number of times I've found myself needing to coordinate several processes to control their degree of parallelism.  In most cases, I wanted to ensure certain activities occurred serially (one at a time).  Sometimes, I wanted to enforce a parallelism limit - such as allowing a maximum of three processes to access or manipulate a resource concurrently.
In standard application development, there are quite a few ways of managing this - using built-in language features like .Net's Monitor class to coordinate multiple threads, for example.  However, when it comes to database operations, many different technologies manipulate the data - stored procedures, SQL Agent Jobs, SSIS packages, .Net applications, Web Services, and so on - so it's less clear what tools you have for restricting or controlling whether (and how many) jobs can run at one time.
What follows is a mechanism to do just that - although it is by no means the only way, or even the best way for your particular situation.  But hopefully it gives you a place to start.
The Idea
In general, the idea we're going to work with here is central to any coordination of multiple processes: you need a central resource that "manages things."  Call it a traffic cop for your unruly world of processes - except that your world can't be entirely unruly.  We're going to respectfully require that all participants behave nicely, "register" with the authorities, and follow some rules before operating in the system.
The traffic cop in this system is a simple table in a SQL database.  It has very few required columns - but you can extend it with other information to make monitoring your system more meaningful.  When a process wants to use the "protected" resource, it first registers itself with the table, then waits for its turn.  When it gains permission to proceed, it has to remember to inform the table when it no longer needs access to the resource, to allow other processes to use it.
The Central Control Table
The control table requires two columns: an identity column, and a process identifier column.  The identity column will allow us to queue up the processes in FIFO order, and the process identifier column will allow the processes to be aware of their position in the "line", so they know if it's their turn to go.
Here is a script for creating a very basic control table in SQL Server:
CREATE TABLE ProcessControl
  (queue_position INT IDENTITY,
  requesting_id VARCHAR(255) NOT NULL)
The Request Process
When any process wants to access the protected resource managed by the ProcessControl table, it has to first insert its request into the ProcessControl table.  In essence, it's "taking a number" from the control system much like you would at any DMV - although you hopefully don't have to wait for hours...  A command like this one should be issued:
INSERT INTO ProcessControl (requesting_id)
VALUES ('Process A')
The above command places the process in the "queue."  At this point, the process needs to place itself in a polling state, querying the database every so often to determine whether it's first in the queue (or, if you're allowing a predefined number of processes to execute in parallel, whether it's in the "top X" rows of the queue).  A simple query such as this accomplishes the task:
SELECT COUNT(*)
FROM ProcessControl
WHERE queue_position < (SELECT queue_position FROM ProcessControl WHERE requesting_id = 'Process A')
The above query will return the number of entries in the queue that are "ahead" of the requesting process (Process A).  A .Net program could periodically issue this command in between Sleep calls.  A stored procedure could execute the above statement as the predicate to a WHILE loop that encapsulates a WAITFOR DELAY statement - like this:
WHILE ((SELECT COUNT(*) FROM ProcessControl WHERE queue_position < (SELECT queue_position FROM ProcessControl WHERE requesting_id = 'Process A')) > 0)
BEGIN
  WAITFOR DELAY '00:00:01'
END
The Release Process
Of course, the processes involved have to be well-behaved and release their "lock" on the resource once they're done with it.  This is as simple as issuing a DELETE statement to remove their request from the queue.  Again, super-simple:
DELETE FROM ProcessControl WHERE requesting_id = 'Process A'
Trying It Out
Don't believe this will work?  Try it out with a simple set of commands in SSMS (SQL Server Management Studio).  First, open two (or more) query panes so that you can mimic two processes attempting to execute in parallel.  Paste the following commands into both windows:
DECLARE @RequestingID AS VARCHAR(255)

SET @RequestingID = CAST(@@SPID AS VARCHAR(255))
PRINT @RequestingID + ' requesting control at ' + CAST(GETDATE() AS CHAR(20))
INSERT INTO ProcessControl (requesting_id)
  VALUES (@RequestingID)

PRINT @RequestingID + ' waiting for control'
WHILE ((SELECT COUNT(*) FROM ProcessControl WHERE queue_position < (SELECT queue_position FROM ProcessControl WHERE requesting_id = @RequestingID)) > 0)
BEGIN
  WAITFOR DELAY '00:00:01'
END

PRINT @RequestingID + ' received permission to use the resource at ' + CAST(GETDATE() AS CHAR(20))
-- Simulate some processing that takes 5 seconds
WAITFOR DELAY '00:00:05'
DELETE FROM ProcessControl WHERE requesting_id = @RequestingID
PRINT @RequestingID + ' giving up control at ' + CAST(GETDATE() AS CHAR(20))
Execute them a line at a time in each window (to simulate race conditions), or run either batch in its entirety.  You should see that the first window to insert its "request" into the table finishes first, without interference from the other query.  In the above example, the "requesting ID" used is the SQL SPID - the server process ID of the connection each query window uses.
Extending the System
Of course, the above will work just fine as a generic solution.  But there are opportunities to make it more complicated, and a little more functional.
Recovering From Errors - Or Misbehaving Participants
It's inevitable that at some point, one of the participants in your system will fail while it holds the "lock" on the resource, or will simply misbehave and fail to delete its entry from the control table.  To recover from that kind of problem, add another column to your table - a datetime column with a default of GETDATE().  As part of each process's "polling" cycle, issue a DELETE statement to remove "stale" requests.  For this to work, you have to set a maximum allowed time for any one process to have an outstanding request - and make that limit safely in excess of what any one process would reasonably require.
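As a sketch, assuming a five-minute maximum hold time (the column name request_time is my choice, not from the original table):

```sql
-- Add a timestamp column recording when each request was made;
-- existing rows pick up the default
ALTER TABLE ProcessControl
  ADD request_time DATETIME NOT NULL DEFAULT GETDATE();

-- During each polling cycle, evict requests older than the agreed limit
-- (five minutes here - pick a value safely longer than any legitimate hold)
DELETE FROM ProcessControl
WHERE request_time < DATEADD(MINUTE, -5, GETDATE());
```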
Robust "Requesting IDs"
In the examples above, I used fairly generic "requesting IDs" like "Process A".  That's nowhere near robust enough to ensure that two participants don't use the same ID.  If that happens, your control mechanism will break - badly.
Here are some suggestions for "better" sources for the "requesting ID":
* For SQL Agent Jobs and sprocs: The SPID - retrievable via @@SPID in any script.  Concurrent executions always have distinct SPIDs (though note that SPIDs are reused over time once a connection closes - another reason to clean up stale entries).
* For SSIS Packages: The package's execution GUID - the ExecutionInstanceGUID system variable.  Every execution of the package gets a unique GUID.
* For .Net apps: The computer name plus the process ID - e.g. Environment.MachineName and Process.GetCurrentProcess().Id.
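If none of those sources fits your situation, a GUID generated at request time is unique by construction.  A sketch:

```sql
-- Generate a requesting ID that can't collide with any other participant's
DECLARE @RequestingID AS VARCHAR(255)
SET @RequestingID = CAST(NEWID() AS VARCHAR(255))

INSERT INTO ProcessControl (requesting_id)
  VALUES (@RequestingID)
```

The trade-off is that a GUID tells you nothing about which process made the request, so you may want to store it alongside a descriptive column for monitoring.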
Limited Parallelism
As I've mentioned before, sometimes you don't want to completely serialize your processes (limiting your system to only one process executing at a time) - you just want a cap on how many can run concurrently.  In that case, modify the "Request Process" to allow a participant to proceed if its request is in the "first X" entries in the queue, rather than strictly first.  A participant at position X+1 gets to continue as soon as any one of the first X in the queue issues its DELETE command.
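A sketch of the modified polling loop for a limit of three concurrent processes - the only change from the serial version is the predicate's threshold:

```sql
-- Proceed once fewer than 3 requests are ahead of ours in the queue
WHILE ((SELECT COUNT(*)
        FROM ProcessControl
        WHERE queue_position < (SELECT queue_position
                                FROM ProcessControl
                                WHERE requesting_id = 'Process A')) >= 3)
BEGIN
  WAITFOR DELAY '00:00:01'
END
```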
Multiple "Resources", One Control Table
It's almost inevitable that this won't be the only time you need to manage parallelism - once you've solved a problem this way, it's natural to reach for the same tool again.  But you don't need to create a brand-new control table for each new resource.  Simply add a "resource ID" column to your control table, and have participants identify which resource they want to use.  Just make sure you alter your "am I first in the queue" queries - and your DELETE commands - to filter on the resource as well.
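A sketch of the multi-resource variant, with the column name resource_id and the resource name 'ResourceOne' chosen for illustration:

```sql
-- Extend the existing table; existing rows pick up the default
ALTER TABLE ProcessControl
  ADD resource_id VARCHAR(50) NOT NULL DEFAULT 'Default';

-- Request: identify the resource being queued for
INSERT INTO ProcessControl (requesting_id, resource_id)
  VALUES ('Process A', 'ResourceOne')

-- Poll: count only requests ahead of us for the SAME resource
SELECT COUNT(*)
FROM ProcessControl
WHERE resource_id = 'ResourceOne'
  AND queue_position < (SELECT queue_position FROM ProcessControl
                        WHERE requesting_id = 'Process A'
                          AND resource_id = 'ResourceOne')

-- Release: filter on the resource as well
DELETE FROM ProcessControl
WHERE requesting_id = 'Process A' AND resource_id = 'ResourceOne'
```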
Multiple Readers or One Writer - "Thread Safety"
This is a little more advanced, and less obvious - but it's definitely valuable when you need it!  There are a few situations where this can be used effectively - one of which is when you may be maintaining some kind of a list or queue in your database.  If your participants are iterating over the list, or pushing and popping items onto and off a queue - you have to be concerned with "thread safety" of course.  But you don't have to get overly restrictive and limit readers of the list or queue to one at a time.  When you're dealing with lists and queues, it's perfectly acceptable to have multiple readers using the resource at the same time.  But as soon as you need to modify the queue or list, you need to restrict access severely - otherwise unexpected things could happen as your "writer" modifies the list, or a list entry, while a "reader" tries to read it.  How can you deal with that?
It's obviously a little more complicated - but not too onerous.  First, add an "access type" column to your control table to identify whether a request is a "read" or a "write".  Then modify your "Request Process" slightly so that readers can access the resource regardless of their position in the queue - as long as there is no "writer" ahead of them (i.e., with a lower queue_position).  The corollary is that writers can't access the resource until they're first in the queue.  Releasing the resource is identical - the "standard" DELETE statement.
The net effect is that readers can operate as much in parallel as they like - until a writer comes along.  As soon as a writer posts an access request in the control table, any readers that request access later will have to wait until the writer is done.
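A sketch of the reader/writer polling loops, assuming an access_type column holding 'R' or 'W' (column and values are my naming choices):

```sql
-- Assumed extension: a one-character access type, 'R' (read) or 'W' (write)
ALTER TABLE ProcessControl
  ADD access_type CHAR(1) NOT NULL DEFAULT 'R';

-- A reader may proceed as soon as no WRITER is ahead of it in the queue,
-- no matter how many other readers are ahead of it
WHILE (EXISTS (SELECT 1 FROM ProcessControl
               WHERE access_type = 'W'
                 AND queue_position < (SELECT queue_position FROM ProcessControl
                                       WHERE requesting_id = 'Reader 1')))
BEGIN
  WAITFOR DELAY '00:00:01'
END

-- A writer must wait until it is first in the queue - the original predicate
WHILE ((SELECT COUNT(*) FROM ProcessControl
        WHERE queue_position < (SELECT queue_position FROM ProcessControl
                                WHERE requesting_id = 'Writer 1')) > 0)
BEGIN
  WAITFOR DELAY '00:00:01'
END
```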
Final Thoughts
I've used the exact same framework (with extensions) to control a T-SQL based ETL system, managing three separate resources in one control table, with each resource allowing a different number of concurrent operations.
