Sunday, December 6, 2009

Cap job runtime: Debugging periodic job policy in a Condor pool

Job policy includes a set of periodic expressions; PeriodicHold, PeriodicRelease, and PeriodicRemove. Periodic expressions on a job are evaluated in the context of the job’s ad. They should evaluate to a boolean; if PeriodicRemove is TRUE, then remove the job. They are evaluated by the Schedd and a Shadow. The Schedd evaluates them as frequently as PERIODIC_EXPR_INTERVAL, which defaults to 60 seconds in condor 7.4 and 300 before. The Shadow evaluates them periodically, based on PERIODIC_EXPR_INTERVAL, and every time it receives an update from the Starter running the job. Updates occur periodically, controlled by STARTER_UPDATE_INTERVAL on the Starter.

Say you want to put a job on hold if it runs for more than 60 seconds. To do this you need job policy and two points of reference; the start time of the job, and the current time.

The job policy you want to use is PeriodicHold.

For the two time references you need to look at a job’s ad to see what is available. You can see a job’s ad with condor_q -long. The start time is available as JobCurrentStartDate, seconds since Epoch, and it would appear that ServerTime is the current time, seconds since Epoch.

Here’s a job you might write:

executable = /bin/sleep arguments = 29m notification = never periodic_hold = (ServerTime - JobCurrentStartDate) >= 60 queue 1

If you submit it and wait a minute you will see it is on Hold.

$ condor_q -- Submitter: robin.local : : robin.local ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 matt 11/5 19:03 0+00:00:19 H 0 2.0 sleep 29m 1 jobs; 0 idle, 0 running, 1 held

But, it only ran for 19 seconds. Check why.

$ condor_q -hold -- Submitter: robin.local : : robin.local ID OWNER HELD_SINCE HOLD_REASON 1.0 matt 11/5 19:04 The UNKNOWN (never set) PeriodicHold expres 1 jobs; 0 idle, 0 running, 1 held

The hold reason is truncated, but you can use condor_q -long to see all of it.

$ condor_q -long | grep ^HoldReason HoldReason = "The UNKNOWN (never set) PeriodicHold expression '' evaluated to UNDEFINED" HoldReasonCode = 5 HoldReasonSubCode = 0

This is saying that something went wrong and your PeriodicHold expression evaluated to UNDEFINED. You wanted it to evaluate to a boolean. The way expression evaluation in ClassAds happens, if part of an expression evaluates to UNDEFINED then chances are the entire expression will. For instance, the expression A + 1 will evaluate to the value of the A attribute plus 1. When A is a number like 2, the expression evaluates to 2 + 1 => 3. When A is not defined, the expression evaluates to UNDEFINED + 1 => UNDEFINED.

There is a handy way to debug expression in Condor, use the debug() ClassAd function.

Your job becomes:

executable = /bin/sleep arguments = 29m notification = never periodic_hold = debug((ServerTime - JobCurrentStartDate) >= 60) queue 1

When you submit the job and look in either the SchedLog or ShadowLog you will see “Classad debug” messages.

ShadowLog:

...: Classad debug: ServerTime --> UNDEFINED ...: Classad debug: JobCurrentStartDate --> ERROR ...: Classad debug: JobCurrentStartDate --> 1257476798 ...: Classad debug: debug((ServerTime - JobCurrentStartDate) >= 60) --> UNDEFINED

Very clearly, ServerTime is evaluating to UNDEFINED. That may seem strange because it looked like it was part of the job ad. However, ServerTime is somewhat special. Likely buggy. It is only added to a job ad in response to a query, e.g. condor_q. It is not actually part of the job ad. Annoying.

There is a solution. Another special attribute is CurrentTime. It is available to all expressions, but it is not a visible attribute on a job ad. Also annoying. You have to know it is there. Using it we can rewrite the job as follows.

executable = /bin/sleep arguments = 29m notification = never periodic_hold = debug((CurrentTime - JobCurrentStartDate) >= 60) queue 1

After submitting the job and waiting a minute you can see that it has been put on hold.

$ condor_q -- Submitter: robin.local : : robin.local ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 3.0 matt 11/5 19:09 0+00:01:00 H 0 0.0 sleep 29m 1 jobs; 0 idle, 0 running, 1 held

The details of the hold are also much more what you would expect.

$ condor_q -long | grep ^HoldReason HoldReason = "The job attribute PeriodicHold expression 'debug((CurrentTime - JobCurrentStartDate) >= 60)' evaluated to TRUE" HoldReasonCode = 3 HoldReasonSubCode = 0

In the ShadowLog you can see that CurrentTime is available as expected.

ShadowLog:

...: Classad debug: CurrentTime --> 1257477062 ...: Classad debug: JobCurrentStartDate --> ERROR ...: Classad debug: JobCurrentStartDate --> 1257477002 ...: Classad debug: debug((CurrentTime - JobCurrentStartDate) >= 60) --> 1

Remember to remove debug() from your expressions.

[Via http://spinningmatt.wordpress.com]

No comments:

Post a Comment