Status for Andrew DeFaria: September 2006 Archives

September 29, 2006

Load Balancing Redirection

  • Implemented a load balancing redirection scheme for cqweb

Load Balancing CQ Web Servers based on Number of CQ Web Users

The task at hand was to write a redirector that load balances among a number of CQ Web servers based on the number of CQ Web users currently on each server. Additionally, the redirector should send users to the proper schema based on how they came into the CQ Web server farm.

Determining Load

The old IIS CQ Web server allowed you to query the number of active CQ Web users. The new Apache/Tomcat server only allows admins to do this, and the admin needs to be logged in and thus hold a valid token. IBM/Rational suggests using Apache's server-status URL to determine load. However, that only displays the number of Apache requests in progress, not the number of CQ Web users.

If ExtendedStatus is turned on then Apache lists each connection and the URL they are working on. By filtering "GET /cqweb" we can get a rough estimate of the number of CQ Web Users. There is a problem in that the redirector script cannot query the same web server that it's running on. Additionally this information can only be obtained if ExtendedStatus is turned on.
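The filtering idea can be sketched as follows. This is a Python illustration, not the Perl used by the actual redirector, and it makes one loud assumption: that an ExtendedStatus page can be recognized by a "Request" column header in its connection table. The marker string and the function name are hypothetical.

```python
import re
from typing import Optional

# Assumption: with ExtendedStatus on, the server-status page contains a
# per-connection table with a "Request" column header; without it, the
# table is absent. The exact marker may differ between Apache versions.
EXTENDED_MARKER = "<th>Request</th>"


def count_cq_users(status_html: str, marker: str = EXTENDED_MARKER) -> Optional[int]:
    """Return a rough CQ Web user count, or None if ExtendedStatus looks off."""
    if marker not in status_html:
        return None  # corresponds to $cq_users being undef
    # Every listed request that starts with "GET /cqweb" counts once.
    return len(re.findall(r"GET /cqweb", status_html))
```

A return value of None (status unknown) is kept distinct from 0 (ExtendedStatus on, nobody using CQ Web), which matters for the server selection described next.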

Algorithm for selecting a server

The algorithm for selecting a non busy server is described as:

Pick a lightly loaded server out of the pool. Note that if a server is not running with ExtendedStatus on then $cq_users will be undef. This is different than the case where the server has ExtendedStatus on but there just aren't any CQ Web users (which would be denoted by $cq_users = 0). Thus we may have the condition where:

Server   cq_users  ExtendedStatus
server1  undef     off
server2  20        on
server3  0         on
server4  10        on

In such a case we wish to pick server3 since it has no current CQ web users.

The algorithm used here removes from the pool all servers that are not running with ExtendedStatus on, since we cannot reliably tell how loaded such a server is from the standpoint of CQ Web users. If, however, no servers have ExtendedStatus on (thus all $cq_users come back undef) then we consider $nbr_apache_requests instead. In other words, $nbr_apache_requests is not equivalent to $cq_users, so the two cannot be compared with each other. But if no server is running with ExtendedStatus on we need to pick something!

Note: If $random then a server is simply randomly chosen.

Unfortunately, given this algorithm, if we had the following situation:

Server   cq_users  ExtendedStatus
server1  undef     off
server2  undef     off
server3  undef     off
server4  10        on

Then this algorithm will always return server4.
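A minimal sketch of the selection logic described above, written in Python rather than the script's Perl. The function name, the pool layout, and the Apache request counts in the usage note are illustrative assumptions; None stands in for Perl's undef.

```python
import random
from typing import Dict, Optional, Tuple

# server name -> (cq_users, nbr_apache_requests)
# cq_users is None when the server is not running with ExtendedStatus on.
Pool = Dict[str, Tuple[Optional[int], int]]


def pick_server(pool: Pool, use_random: bool = False) -> str:
    """Pick a lightly loaded server from the pool."""
    if use_random:
        return random.choice(sorted(pool))
    # Keep only servers whose CQ Web user count is known.
    measured = {name: users for name, (users, _) in pool.items() if users is not None}
    if measured:
        return min(measured, key=lambda name: measured[name])
    # No server reports CQ Web users: fall back to Apache request counts.
    return min(pool, key=lambda name: pool[name][1])
```

With the first table above (plus made-up request counts) this picks server3; with a pool where only server4 reports a user count it always picks server4, which is the weakness just noted.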

Important Note: The web server doing the redirection cannot be queried. Attempting to do so hangs! Therefore it cannot participate in the server pool. It is recommended that another web server be set up as the redirector and the DNS name cqweb assigned to it. This web server can, however, participate by being a Clearquest Request Manager.

Random Redirection

The script can also redirect randomly instead of relying on the load of CQ Web users. Currently there are 3 servers in the pool and only one of them has ExtendedStatus turned on. As such, redirecting by load will always resolve to a single server, the one running with ExtendedStatus on. This is not good. So currently the script just picks a server randomly from the pool. This behavior is controlled by the lb parameter (currently defaulted to off, meaning pick a server randomly).

Defining the Server Pool

The server pool is defined by a small file, servers.cfg, which simply lists the servers participating in the pool. Servers can be added or removed dynamically.

Mapping Redirection

In the past users went to http://cqweb/<area>. These were HTML files in the DocumentRoot which redirected to a series of redirection scripts. It was hoped that HTTP_REFERER could be used to determine where to redirect the visitor. Unfortunately HTTP_REFERER is not guaranteed and indeed it's undefined on the web servers!

Instead one must specify the group parameter to the redirector script. The script then maintains a map between <areas> -> Schema/ContextIDs. If the group is not specified or not in the map then the user is redirected to the main login page. This is not viewed as a hardship because we need redirecting <area> files anyway. The new form of redirecting <area> file is:

<html>
<head>
<meta http-equiv="refresh" content="0; url=http://cqweb.itg.ti.com/cgi-bin/redirect.pl?group=<area>">
</head>
</html>

Redirect Map

The redirect map, stored in redirect.map, is a file of key/value pairs. For example:

CMDT:           &schema=CMDT.2003.06.00&contextid=CMDT
CSSD:           &schema=omap.2002.05.00&contextid=OMAPS
DLP-Play:       &schema=DLP.2003.06.00&contextid=Play
DLP:            &schema=DLP.2003.06.00&contextid=DLP
DMD-p:          &schema=DLP.2003.06.00&contextid=DMD-p
DMD:            &schema=DLP.2003.06.00&contextid=DMD
GCM:            &schema=CMDT.2003.06.00&contextid=GCM
HPALP:          &schema=HPA_MKT_LP&contextid=HPALP
LDM:            &schema=CMDT.2003.06.00&contextid=LDM
NV:             &schema=CMDT.2003.06.00&contextid=NV
SDO:            &schema=SDS.2003.06.00&contextid=SDSCM
SDO_TEST:       &schema=SDS_TST_DEV&contextid=SDSCM
WiMax:          &schema=WiMax.SR5&contextid=WiMax
mDTV:           &schema=mDTV.2003.06.00&contextid=MDTV
mDTV_play:      &schema=mDTV.2003.06.00&contextid=PLAY
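Reading such a map and applying it could look roughly like this. It is a Python sketch rather than the script's Perl; the function names are hypothetical, and the login URL passed in is a placeholder, since the real login page URL is not given here.

```python
from typing import Dict, Optional


def parse_redirect_map(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines into a dict, ignoring blank lines."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep:
            mapping[key.strip()] = value.strip()
    return mapping


def redirect_url(login_url: str, mapping: Dict[str, str], group: Optional[str] = None) -> str:
    """Append the schema/contextid suffix for group, if the group is mapped.

    An unknown or missing group leaves the main login page URL unchanged.
    """
    return login_url + mapping.get(group, "")
```

For example, with the map above, `redirect_url(login_url, mapping, "DLP-Play")` appends `&schema=DLP.2003.06.00&contextid=Play` to whatever login URL the selected server uses.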

Parameters for redirect.pl

The following parameters, specified in the URL, are supported by redirect.pl:

group
Specifies the key into the redirect.map for the schema/contextid. If not specified, defaults to the main login page of the selected server.
lb
If set then load balancing is attempted based on ExtendedStatus and CQ Web Users as described above. Default: undefined (off)
debug
If specified, the user is not redirected; instead, debugging information is output.
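How these three parameters might be read from a query string is sketched below in Python (the real script is Perl). The Params type and function name are hypothetical. Two assumptions are made: lb counts as set for any value other than empty or "0", mimicking Perl truthiness, and debug counts as specified merely by being present.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs


class Params(NamedTuple):
    group: Optional[str]  # key into redirect.map, None -> main login page
    lb: bool              # True -> balance by load, False -> pick randomly
    debug: bool           # True -> print debugging info instead of redirecting


def parse_params(query: str) -> Params:
    # keep_blank_values lets a bare "debug" (no "=value") still be seen.
    fields = parse_qs(query, keep_blank_values=True)
    group = fields.get("group", [None])[0] or None
    lb_value = fields.get("lb", [""])[0]
    return Params(group=group, lb=lb_value not in ("", "0"), debug="debug" in fields)
```

For instance, `parse_params("group=DLP&lb=1&debug")` yields group "DLP" with both lb and debug on, while an empty query string yields the defaults.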

September 28, 2006

JVM Stack/Heap Sizes

  • Looked into JVM stack and heap sizes on dfls83-85

As you know there have been service interruptions in CQWeb. I keep looking at the logs for clues. About the only consistent thing is an error similar to this:

2006-09-28 00:10:20 Ajp13Processor[8009][15] process: invoke
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at org.apache.ajp.Ajp13.send(Ajp13.java:525)
        at org.apache.ajp.RequestHandler.finish(RequestHandler.java:495)
        at org.apache.ajp.Ajp13.finish(Ajp13.java:395)
        at org.apache.ajp.tomcat4.Ajp13Response.finishResponse(Ajp13Response.java:196)
        at org.apache.ajp.tomcat4.Ajp13Processor.process(Ajp13Processor.java:464)
        at org.apache.ajp.tomcat4.Ajp13Processor.run(Ajp13Processor.java:551)
        at java.lang.Thread.run(Unknown Source)

Now "Connection reset by peer" could be an error that the process gets because the service has stopped, so this could be more of a symptom than a cause. However, searching for "Ajp13Processor socket write error" points me to this post, which suggests increasing the stack and heap sizes for the JVM. Intermittent problems can be consistent with running out of stack or heap space.

According to the Clearquest Web Administration Guide:


Controlling Java VM Memory Consumption

You can configure the memory consumption of Java processes used by New ClearQuest Web by adjusting the parameters in property files under the various components.

Windows

This section describes the configuration changes for New ClearQuest Web Java VM memory consumption for processes running on Microsoft Windows. To specify the VM memory consumption:

  1. Open the appropriate configuration file for the New ClearQuest Web component whose memory consumption you want to reconfigure. For the ClearQuest Web application:

     Component              Configuration file
     Apache Tomcat Server   C:\Program Files\Rational\Common\rwp\bin\jk_service2.in.properties
     Rational Web Platform  C:\Program Files\Rational\Common\rwp\bin\jk_service2.properties

     For the ClearQuest server:

     Component                   Configuration file
     ClearQuest Request Manager  C:\Program Files\Rational\ClearQuest\cqweb\cqserver\requestmgr_service.properties
     ClearQuest Registry Server  C:\Program Files\Rational\ClearQuest\cqweb\cqregsvr\cqregsvr_service.properties

  2. Modify the section shown below:

    #
    # JVM Options
    #
    # Useful Options:
    # -Xms2m = Initial heap size, modify for desired size
    # -Xmx256m = Maximum heap size, modify for desired size
    # -Xrs = Available in Jdk1.3.1 to avoid JVM termination during logoff
    #
    wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
        

  • I looked at these config files on the three machines (dfls83-85) and they were pretty much set to the default:

    Ltx0062320:for server in 83 84 85; do grep wrapper.jvm.options= \
      //dfls$server/Rational/Common/rwp/bin/jk_service2*properties \
      //dfls$server/Rational/ClearQuest/cqweb/cqserver/requestmgr_service.properties \
      //dfls$server/Rational/ClearQuest/cqweb/cqregsvr/cqregsvr_service.properties;
    done
    //dfls83/Rational/Common/rwp/bin/jk_service2.default.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls83/Rational/Common/rwp/bin/jk_service2.in.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls83/Rational/Common/rwp/bin/jk_service2.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls83/Rational/ClearQuest/cqweb/cqserver/requestmgr_service.properties:wrapper.jvm.options=-Xrs
    //dfls83/Rational/ClearQuest/cqweb/cqregsvr/cqregsvr_service.properties:wrapper.jvm.options=-Xrs
    //dfls84/Rational/Common/rwp/bin/jk_service2.default.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls84/Rational/Common/rwp/bin/jk_service2.in.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls84/Rational/Common/rwp/bin/jk_service2.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls84/Rational/ClearQuest/cqweb/cqserver/requestmgr_service.properties:wrapper.jvm.options=-Xrs
    //dfls84/Rational/ClearQuest/cqweb/cqregsvr/cqregsvr_service.properties:wrapper.jvm.options=-Xrs
    //dfls85/Rational/Common/rwp/bin/jk_service2.default.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls85/Rational/Common/rwp/bin/jk_service2.in.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls85/Rational/Common/rwp/bin/jk_service2.properties:wrapper.jvm.options=-Xrs -Xms2m -Xmx256m
    //dfls85/Rational/ClearQuest/cqweb/cqserver/requestmgr_service.properties:wrapper.jvm.options=-Xrs
    //dfls85/Rational/ClearQuest/cqweb/cqregsvr/cqregsvr_service.properties:wrapper.jvm.options=-Xrs

    All these machines have 2 gig of main memory and largely just serve CQWeb. Indeed the CQWeb service processes are consuming most of the memory:

    [Screenshots of memory usage on dfls83, dfls84 and dfls85]

    I think we should try setting at least the following:

    wrapper.jvm.options=-Xrs -Xms128m -Xmx512m
    

    A restart of all CQ Web services would probably be needed for the changes to take effect. The above settings start the JVM at 128 MB for each of the 4 processes, for a total initial footprint of 512 MB, and limit each process to 512 MB, for a total maximum footprint of 2 GB (when full). We might want to bounce this idea off IBM/Rational support to see if all 4 processes should have the same settings or if we should vary them.
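The arithmetic behind those totals, as a small Python check. It assumes the 4 processes are the four components listed in the guide excerpt above, and it counts Java heap only.

```python
# Assumption: one JVM each for Tomcat, Rational Web Platform,
# Request Manager and Registry Server.
PROCESSES = 4


def footprint_mb(per_process_mb: int, processes: int = PROCESSES) -> int:
    """Total heap across all JVMs, in megabytes."""
    return per_process_mb * processes


initial = footprint_mb(128)  # -Xms128m on every process
maximum = footprint_mb(512)  # -Xmx512m on every process
print(initial, maximum)      # 512 2048
```

Since -Xmx bounds only the Java heap, each process can use somewhat more than its limit, so a 2048 MB worst case on a 2 GB machine leaves no headroom. That may be another reason to ask whether all four processes need the same ceiling.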

    September 27, 2006

    OMAPS.pm bug

    • Tracked down and fixed minor bug in OMAPS

    I found a minor bug in OMAPS.pm that is called from the CSSD ClearQuest Account Creation page (http://dfls85/cgi-bin/create.pl). The error appears in the log files as:

    [Wed Sep 27 10:52:39 2006] [error] [client 128.247.39.85] [Wed Sep 27 10:52:39 2006] 
    create.pl: Useless use of concatenation (.) or string in void context at OMAPS.pm line 318, <CNF> line 139.
    

    Line 318 of OMAPS.pm is:

    debug ("add user $data->{login_name} to team ") . $cgi->param("Team");
    
    But it should read:
    debug ("add user $data->{login_name} to team " . $cgi->param("Team"));
    

    As we are watching the log files carefully for signs of Clearquest web hangs and outages it would be helpful if this superfluous error were eliminated.

    I fixed this by hand on dfls[83-85] but it should be fixed in the original.

    September 26, 2006

    CQ log files

    • Looked into yet another hang up with CQ web servers

    CQ Web logs

    We get a lot of errors in the logs of the form:

    [Tue Sep 26 11:08:51 2006] [error] [client 128.247.39.85] File does not exist: 
    C:/Program Files/Rational/Common/rwp/webapps/cqweb/dct/html/images, referer: 
    http://dfls85.itg.ti.com/cqweb/dct/html/download_en.html
    [Tue Sep 26 11:08:51 2006] [error] [client 128.247.39.85] File does not exist:
    C:/Program Files/Rational/Common/rwp/webapps/cqweb/dct/html/images, referer:
    http://dfls85.itg.ti.com/cqweb/dct/html/download_en.html
    
    These errors are not a big deal except they cloud the log files with meaningless stuff that you need to skip over all the time. I decided to look into this and see where they were coming from. In the file .../rwp/webapps/wre/common/script/common.js there appeared the following code:
    var arrowOff=new Image();
    arrowOff.src="images/shim.gif";
    var arrowOn=new Image();
    arrowOn.src="images/arrow_red.gif";
    

    This appears to be causing the problem so I updated that JavaScript to:

    var arrowOff=new Image();
    arrowOff.src="/wre/common/images/shim.gif";
    var arrowOn=new Image();
    arrowOn.src="/wre/common/images/arrow_red.gif";
    

    I'm not sure if this is a Rational problem or something that TI has done but with the above fix the error seems to go away. Well, at least for me. I suspect others are still generating the error because JavaScript is cached by the browser. Hopefully as people restart their browsers this will go away.
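A likely explanation for why the original paths failed is that a relative image path set from a script resolves against the URL of the page that loaded the script, not against the script's own location. This can be checked with Python's standard urljoin, using the referer from the log entries above as the page URL:

```python
from urllib.parse import urljoin

# Page that pulled in common.js, taken from the referer in the error log.
page = "http://dfls85.itg.ti.com/cqweb/dct/html/download_en.html"

old = urljoin(page, "images/shim.gif")              # relative path
new = urljoin(page, "/wre/common/images/shim.gif")  # root-relative path

print(old)  # http://dfls85.itg.ti.com/cqweb/dct/html/images/shim.gif
print(new)  # http://dfls85.itg.ti.com/wre/common/images/shim.gif
```

The first URL lands under .../cqweb/dct/html/images, the directory the log reports as missing, while the root-relative form points at the same place no matter which page includes the script.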

    Additionally the following error is still appearing in the logs:

    [Mon Sep 25 20:31:06 2006] [error] [client 172.24.80.20] File does not exist:
    C:/Program Files/Rational/Common/rwp/htdocs/favicon.ico
    

    I've put a favicon.ico in the proper area on dfls85. The error seems to have diminished; however, I don't see a favicon in the browser so I'm not sure if this is working.

    The two "fixes" above will need to be replicated to the other servers (dfls83 and 84) at some time.

    Finally another error shows up:

    [Tue Sep 26 18:31:18 2006] [error] [client 128.247.39.85] [Tue Sep 26 18:31:18 2006] create.pl: 
    Useless use of concatenation (.) or string in void context at OMAPS.pm line 318, <CNF> line 139.
    

    This seems to be an error in CSSD's code which is located under .../Rational/cgi-bin. Going to http://dfls85/cgi-bin/create.pl first redirects me to TI's authentication web page but then back to, in my case, here. Going to that page generates this error in the error log every time.

    September 21, 2006

    CQ: DMD Date changes

    • Looked at DMD requests
    • Fixed problem where Needed_Date and Target_Date could not be set to Submit_Date

    September 18, 2006

    enable_ldaptk

    • Started coding a PerlTK version of enable_ldap
    • Solved problem with not being able to write to Samba mounted home drive. Seems one should not use smbntsec in $CYGWIN when the Samba Server is not in the domain

    September 15, 2006

    enable_ldap

    • Added LDAP calls to enable_ldap to check the parms as we go

    Integrating LDAP to enable_ldap

    I decided it would be good if enable_ldap checked the parameters for correctness as it gathers them. It does this by actually making LDAP calls to validate things like the server, port, etc. The goal is to have enable_ldap ensure that the parameters are indeed correct. Unfortunately this makes enable_ldap dependent on the Net::LDAP module, but I think it's worth it to allow enable_ldap to check the parameters and the mapping the user is describing.

    I still need to tighten up the code where it queries LDAP and attempts to prove to the user that the mapping is correct. As I understand it you are basically attempting to map a Clearquest field to an LDAP field so that Clearquest can find the correct record. Once that linkage is established Clearquest can "pull" the password from LDAP and thus authenticate the user's password to the LDAP password.

    What enable_ldap does is effectively this. However, it's not that informative to simply say "The user ID 'foo' was found in the LDAP directory"; rather, I want to say "The user ID 'foo' corresponds with '<fullname>'". But does "fullname" always appear exactly as that in LDAP?

    Additionally, I need to handle the cases where it's not a match or where say multiple entries are returned (not sure how that can happen unless the user specifies an attribute that can have dups or perhaps enters in a wildcard, e.g. "defaria*").
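The three outcomes (no match, exactly one, several) could be handled along these lines. This is a Python sketch over already-fetched entries, not Net::LDAP code; the function name is hypothetical, and using "cn" as the attribute holding the full name is an assumption that may not hold for every directory, which is why it is a parameter.

```python
from typing import Dict, List


def describe_matches(user_id: str, entries: List[Dict[str, str]], name_attr: str = "cn") -> str:
    """Turn an LDAP search result into a message for the user."""
    if not entries:
        return f"No LDAP entry found for user ID '{user_id}'"
    if len(entries) > 1:
        # e.g. a non-unique attribute or a wildcard such as "defaria*"
        return f"User ID '{user_id}' matched {len(entries)} LDAP entries; expected exactly one"
    name = entries[0].get(name_attr)
    if name is None:
        # Full-name attribute missing: fall back to the less informative message.
        return f"The user ID '{user_id}' was found in the LDAP directory"
    return f"The user ID '{user_id}' corresponds with '{name}'"
```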

    September 11, 2006

    Clearquest License Server

    • Investigated Clearquest License server

    Time Spent: 3 Hours

    Dylan Ko wrote:

    I have already turned on the sons-clearcase. As we are busy integrating on several projects now, we can not afford to have sons-clearcase down and thus cripple the ClearQuest and the sync between SC and SH office.

    We’ll have to find some other appropriate time to turn off sons-clearcase and look into these issues further. Preferably that time that both sites are off – between 3AM to 9AM PST.

    OK, here's what I found out so far. Using adefaria as a test machine I first checked to see what FlexLM license server was being used on that machine by selecting Start: All Programs: Rational Software: Rational License Key Administrator. It was using just sons-clearcase. Next I attempted to talk to Clearquest using both the Clearquest GUI and cqc. Then I stopped the FlexLM service on sons-clearcase. I then started the Clearquest GUI and it complained about no license server. Interestingly cqc continued to work. This may be because cqc/cqd opens the Clearquest database in a read only mode.

    Next I added sons-sc-cc as a FlexLM License server and retested. Both the Clearquest GUI and cqc were able to obtain a license from sons-sc-cc with no problems. I even shut down Clearcase on sons-clearcase and I was still able to use the Clearquest GUI and cqc from adefaria with no problems.

    I then restarted both FlexLM and Clearcase on sons-clearcase.

    Dylan, perhaps you want to test this on your workstation. Try adding sons-sc-cc as a FlexLM license server for your desktop. You can toggle off sons-clearcase as a license server and attempt to access Clearquest. You can then stop the FlexLM service on sons-clearcase and test Clearquest access from your machine again. Finally try shutting down Clearcase on sons-clearcase and retest Clearquest access from your machine.

    Adding a FlexLM License Server to your Desktop

    • Select Start: All Programs: Rational Software: Rational License Key Administrator:

    • Select License Keys: License Key Wizard


    • Select Next



    • Select Advanced Server Options:



    • Select Add Server:



    • Click on Values under Settings on the right. In the Server Name column, type sons-sc-cc and press Enter. The server name should change from "New Server" to "sons-sc-cc":



      At this point you can toggle on or off either sons-clearcase or sons-sc-cc as a license server provider.

    • Select OK to close this dialog box. You should see something like:



      Note that there are 3 lines: two serviced by sons-clearcase (Rational ClearCase LT and Rational ClearQuest) and one serviced by sons-sc-cc (Rational ClearQuest). Also note that if you stop FlexLM on sons-clearcase and refresh or rerun the Rational License Key Administrator, then licenses served by sons-clearcase will not be listed (since the server cannot be contacted).

    Similarly, all desktops (or Clearquest GUI clients) will have to adjust their FlexLM license server settings. Also note that instead of adding a server you can simply click on the sons-clearcase server, then click on Values under Settings on the right and, in the Server Name column, replace sons-clearcase with sons-sc-cc.

    If the test is successful then I believe that sons-clearcase can be powered off. Again, I think we should still run about a week this way and if things are OK then I can rmreplica the US replicas leaving only the China (sons-cc) ones and the SantaClara (sons-sc-cc) ones.

    September 10, 2006

    Clearquest Install

    • Looked into Clearquest install area

    Silent Install and Multiple License Servers

    I had hoped to be able to use the /g parm to setup.exe so that the install would be silent and automatic. Clearquest now installs without needing a reboot. But the silent install, while it does install silently, then reboots! Ugh!

    I also wanted to be able to specify multiple license servers, as TI uses 3 of them. I hoped that merely updating sitedefs.dat to list the three of them, comma separated, would work. But it doesn't. Perhaps spaces? I also need to check siteprep to see if the area can be reprepped with 3 license servers.

    September 1, 2006

    Lost Packet

    • Fixed Multisite problems

    Time spent: 2 hours

    I looked on sons-sc-cc first for multisite errors. In /apps/Rational/Clearcase/var/logs there are logs regarding multisite. Some of the log files pointed me to look in the Event Viewer. In the Event Viewer I saw things like:

    Event Type:        Error
    Event Source:      ClearCase
    Event Category:    Shipping_server
    Event ID:          1024
    Date:              7/15/2006
    Time:              10:59:43 AM
    User:              SALIRA\ccadmin
    Computer:          SONS-SC-CC
    Description:
        shipping_server.exe(4448): Error: unable to contact the
        albd on host 'sons-cc': timed out trying  to communicate
        with ClearCase remote server
    Data:
        0000: 60 11 00 00               `...   
    

    I remember problems with sons-cc occasionally having its albd_server go wacky and take up 50% of the CPU. That doesn't seem to be the case this time. Then again, the RDP session you have on adefaria -> sons-cc died. Perhaps they rebooted sons-cc. It's been up for about a day now.

    I RDPed to sons-cc and CC Doctor complained about a version incompatibility between CC and CQ. Looked around on sons-cc - there's no CQ installed there! Why was it removed?

    Hmmm... I RDPed to sons-clearcase. Seems it's only been up 3 hours! Must have been recently rebooted....

    It seems that a packet was lost somewhere. The fix is the complicated-to-do and complicated-to-explain procedure of setting epoch numbers back in time so as to replay the transactions and get everyone in the replica family on the same page...

    I think I got it all straightened out now...