CGI Programming on the World Wide Web

Chapter 10
Gateways to Internet Information Servers

10.6 Archie

Archie is a database/index of the numerous FTP sites (and their contents) throughout the world. You can use an Archie client to search the database for specific files. In this example, we will use Brendan Kehoe's Archie client software (version 1.3) to connect to an Archie server and search for user-specified information. Though we could have easily written a client using the socket library, it would be a waste of time, since an excellent one exists. This Archie gateway is based on ArchiPlex, developed by Martijn Koster.

#!/usr/local/bin/perl
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$archie = "/usr/local/bin/archie";
$error = "CGI Archie Gateway Error";
$default_server = "archie.rutgers.edu";
$timeout_value = 180;

The archie variable contains the full path to the Archie client. Make sure you have an Archie client with this pathname on your local machine; if you do not have a client, you have to telnet to a machine with a client and run this program there.

The default_server variable names the server that the form preselects, so that a query still goes to a valid server if the user does not choose one.

Finally, timeout_value contains the number of seconds after which the gateway returns an error message and terminates, so that the user does not have to wait indefinitely for the search results.

%servers = ( 
  'ANS Net (New York, USA)',         'archie.ans.net',
  'Australia',                       'archie.au',
  'Canada',                          'archie.mcgill.ca',
  'Finland/Mainland Europe',         'archie.funet.fi',
  'Germany',                         'archie.th-darmstadt.de',
  'Great Britain/Ireland',           'archie.doc.ic.ac.uk',
  'Internic Net (New York, USA)',    'ds.internic.net',
  'Israel',                          'archie.ac.il',
  'Japan',                           'archie.wide.ad.jp',
  'Korea',                           'archie.kr',
  'New Zealand',                     'archie.nz',
  'Rutgers University (NJ, USA)',    'archie.rutgers.edu',
  'Spain',                           'archie.rediris.es',
  'Sweden',                          'archie.luth.se',
  'SURANet (Maryland, USA)',         'archie.sura.net',
  'Switzerland',                     'archie.switch.ch',
  'Taiwan',                          'archie.ncu.edu.tw',
  'University of Nebraska (USA)',    'archie.unl.edu' );

Some of the Archie servers and their hostnames are stored in an associative array, keyed by a descriptive label. We will create the form for this gateway dynamically, listing all of the servers located in this array.

$request_method = $ENV{'REQUEST_METHOD'};
if ($request_method eq "GET") {
    &display_form ();

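The form is created and displayed when the program is requested with the GET method, which is what a browser sends when the user follows a link to the gateway or enters its URL directly.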

} elsif ($request_method eq "POST") {
    &parse_form_data (*FORM);
    $command = &parse_archie_fields ();

All of the form data is decoded and stored in the FORM associative array. The parse_archie_fields subroutine uses the form data in constructing a query to be passed to the Archie client.
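The parse_form_data subroutine itself is not listed in this section. As a rough idea of what the decoding step involves, here is a minimal sketch for Perl 5. The decode_form name and the sample query string are assumptions made for illustration, and the sketch takes the encoded string as an argument rather than reading it from standard input as a real POST handler would.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode a URL-encoded string ("name=value&name=value") into a hash.
# decode_form is a hypothetical helper, not the gateway's parse_form_data.
sub decode_form {
    my ($encoded) = @_;
    my %form;
    foreach my $pair (split /[&;]/, $encoded) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        foreach ($name, $value) {
            tr/+/ /;                                  # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;      # %XX encodes one byte
        }
        $form{$name} = $value;
    }
    return %form;
}

# Sample input in the shape the Archie form would submit (made up here).
my %FORM = decode_form(
    'query=emacs&server=Rutgers+University+%28NJ%2C+USA%29&type=ci_sub');
print "$_ = $FORM{$_}\n" for sort keys %FORM;
```

Note that the decoded server value is the descriptive label from the form, not a hostname. The gateway translates it later.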

    $SIG{'ALRM'} = "time_to_exit";
    alarm ($timeout_value);

To understand how the SIG array is used, you have to know that whenever a signal (such as an interrupt or break) arrives for a program, the UNIX kernel asks, "What routine should I call?" The routine that the program wants called is a signal handler. Perl associates handlers with signals through the SIG associative array.

As shown above, the traditional way to implement a time-out is to arrange for an ALRM signal to be delivered after a specified number of seconds. The first line says that when the alarm signal arrives, the time_to_exit subroutine should be executed. The Perl alarm call on the second line schedules the ALRM signal to be sent after the number of seconds stored in the $timeout_value variable.
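The handler-plus-alarm pattern can be tried in isolation. The following is a minimal sketch, separate from the gateway, that assumes Perl 5 on a UNIX-like system. The run_with_timeout name and the use of sleep as a stand-in for a slow search are illustrative only. It uses the common eval/die variant of the technique, so that the time-out is reported to the caller instead of ending the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a code reference, giving up after $seconds seconds.
# Returns 1 if the code finished in time, 0 if the alarm fired first.
sub run_with_timeout {
    my ($seconds, $code) = @_;
    my $finished = eval {
        # This handler runs when the ALRM signal is delivered.
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($seconds);      # schedule the signal
        $code->();
        alarm(0);             # cancel the pending alarm on success
        1;
    };
    alarm(0);                 # make sure no alarm is left pending
    return 1 if $finished;
    die $@ unless $@ eq "timeout\n";   # re-throw unrelated errors
    return 0;
}

my $fast = run_with_timeout(5, sub { 1 });
my $slow = run_with_timeout(1, sub { sleep 3 });
print "fast task: ", ($fast ? "finished" : "timed out"), "\n";
print "slow task: ", ($slow ? "finished" : "timed out"), "\n";
```

Calling alarm with an argument of zero cancels a pending alarm, which matters whenever the program continues to run after the timed operation completes.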

    open (ARCHIE, "$archie $command |") ||
        &return_error (500, $error, "Could not start the archie client.");
    $first_line = <ARCHIE>;

A pipe is opened to the Archie client. The command variable contains a "query" that specifies various command-line options, such as search type and Archie server address, as well as the string to search for. The parse_archie_fields subroutine makes sure that no shell metacharacters are specified, since the command variable is "exposed" to the shell.
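An alternative to filtering out shell metacharacters is to avoid the shell altogether. The sketch below is not the gateway's code. It assumes Perl 5.8 or later on a UNIX-like system, where a piped open accepts a list of arguments. The read_from_command name is hypothetical, and the running Perl interpreter stands in for the Archie client so that the example works on a machine without one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a program with an argument list and return its output lines.
# The list form of a piped open hands each argument to the program
# directly, so no shell ever interprets the user's text.
sub read_from_command {
    my ($program, @args) = @_;
    open(my $pipe, '-|', $program, @args)
        or die "Cannot run $program: $!\n";
    my @lines = <$pipe>;
    close($pipe);
    return @lines;
}

# Stand-in for the Archie client: the Perl interpreter ($^X) echoes its
# arguments, one per line. The "--" ends Perl's own switch processing.
my @output = read_from_command(
    $^X, '-e', 'print "$_\n" for @ARGV', '--',
    '-h', 'archie.rutgers.edu', '-e', 'emacs; ls');
print @output;
```

The last argument contains a semicolon, yet it reaches the child program as a single string because no shell is involved. Validating the input is still worthwhile, since the client itself may give some characters a special meaning.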

    if ($first_line =~ /(failed|Usage|WARNING|Timed)/) {
        &return_error (500, $error,
            "The archie client encountered a bad request.");
    } elsif ($first_line =~ /No [Mm]atches/) {
        &return_error (500, $error,
        "There were no matches for <B>$FORM{'query'}</B>.");
    }

If the first line of output from the Archie client contains either an error or a "No Matches" string, the return_error subroutine is called to return a friendlier (and more verbose) message. If there is no error, the first line is usually blank.

    print "Content-type: text/html", "\n\n";
    print "<HTML>", "\n";
    print "<HEAD><TITLE>", "CGI Archie Gateway", "</TITLE></HEAD>", "\n";
    print "<BODY>", "\n";
    print "<H1>", "Archie search for: ", $FORM{'query'}, "</H1>", "\n";
    print "<HR>", "<PRE>", "\n";

The usual type of header information is output. The following lines of code parse the output from the Archie server, and create hypertext links to the matched files. Here is the typical format for the Archie server output. It lists each host where a desired file (in this case, emacs) is found, followed by a list of all publicly accessible directories containing a file of that name. Files are listed in long format, so you can see how old they are and what their sizes are.

Host amadeus.ireq-robot.hydro.qc.ca
    Location: /pub
      DIRECTORY drwxr-xr-x        512  Dec 18 1990  emacs
Host anubis.ac.hmc.edu
    Location: /pub
      DIRECTORY drwxr-xr-x        512  Dec  6 1994  emacs
    Location: /pub/emacs/packages/ffap
      DIRECTORY drwxr-xr-x        512  Apr  5 02:05  emacs
    Location: /pub/perl/dist
      DIRECTORY drwxr-xr-x        512  Aug 16 1994  emacs
    Location: /pub/perl/scripts/text-processing
           FILE -rwxrwxrwx         16  Feb 25 1994  emacs

We can enhance this output by putting in hypertext links. That way, the user can open a connection to any of the hosts with a click of a button and retrieve the file. Here is the code to parse this output:

    while (<ARCHIE>) {
        if ( ($host) = /^Host (\S+)$/ ) {
            $host_url = join ("", "ftp://", $host);
            s|\Q$host\E|<A HREF="$host_url">$host</A>|;
            <ARCHIE>;

If the line starts with "Host", the hostname is stored. A URL to the host is created with the join function, using the ftp scheme and the hostname--for example, if the hostname were ftp.ora.com, the URL would be ftp://ftp.ora.com. Finally, the blank line that follows the Host line is read and discarded.

        } elsif (/^\s+Location:\s+(\S+)$/) {
            $location = $1;
            s|\Q$location\E|<A HREF="${host_url}${location}">$location</A>|;
        } elsif ( ($type, $file) = /^\s+(DIRECTORY|FILE).*\s+(\S+)/) {
            s|$type|<I>$type</I>|;
            s|\Q$file\E|<A HREF="${host_url}${location}/${file}">$file</A>|;
        } elsif (/^\s*$/) {
            print "<HR>";
        }
        
        print;
    }

One subtle feature of regular expressions is shown here: They are "greedy," eating up as much text as they can. The expression (DIRECTORY|FILE).*\s+ means match DIRECTORY or FILE, then match as many characters as you can up to whitespace. There are chunks of whitespace throughout the line, but the .* takes up everything up to the last whitespace. This leaves just the word "emacs" to match the final parenthesized expression (\S+).
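The effect of greediness is easy to check on one of the sample lines. In this short sketch for Perl 5, the greedy_fields and lazy_fields names are made up for the comparison. The second pattern differs only in using the non-greedy quantifier .*? in place of .* and is shown purely for contrast.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy: .* grabs as much as it can, so (\S+) is the last word on the line.
sub greedy_fields {
    my ($line) = @_;
    return $line =~ /^\s+(DIRECTORY|FILE).*\s+(\S+)/;
}

# Non-greedy: .*? grabs as little as it can, so (\S+) is the very next word.
sub lazy_fields {
    my ($line) = @_;
    return $line =~ /^\s+(DIRECTORY|FILE).*?\s+(\S+)/;
}

my $line = "      DIRECTORY drwxr-xr-x        512  Dec 18 1990  emacs\n";

my ($type, $last)  = greedy_fields($line);
my ($same, $first) = lazy_fields($line);
print "greedy:     $type $last\n";     # greedy:     DIRECTORY emacs
print "non-greedy: $same $first\n";    # non-greedy: DIRECTORY drwxr-xr-x
```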

The rest of the lines are read and parsed in the same manner and displayed (see Figure 10.1). If the line is empty, a horizontal rule is output--to indicate the end of each entry.
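To experiment with the parsing logic without an Archie client, the loop can be recast as a subroutine that takes lines and returns marked-up lines. This is a sketch under a few assumptions: it targets Perl 5, the linkify_archie_output name is hypothetical, and it works on lines that have already had their newlines removed. It departs from the listing in one respect by anchoring the file name substitution at the end of the line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn chomped lines of Archie client output into HTML with FTP links.
sub linkify_archie_output {
    my @lines = @_;
    my ($host_url, $location) = ('', '');

    foreach my $line (@lines) {
        if (my ($host) = $line =~ /^Host (\S+)$/) {
            $host_url = "ftp://$host";
            # \Q...\E quotes regex metacharacters such as "." and "+".
            $line =~ s{\Q$host\E}{<A HREF="$host_url">$host</A>};
        } elsif ($line =~ /^\s+Location:\s+(\S+)$/) {
            $location = $1;
            $line =~ s{\Q$location\E}{<A HREF="$host_url$location">$location</A>};
        } elsif (my ($type, $file) = $line =~ /^\s+(DIRECTORY|FILE).*\s+(\S+)/) {
            $line =~ s{$type}{<I>$type</I>};
            # Anchor at the end of the line so only the file name is linked.
            $line =~ s{\Q$file\E\s*\z}{<A HREF="$host_url$location/$file">$file</A>};
        }
    }
    return @lines;
}

# Three lines taken from the sample output shown earlier.
my @sample = (
    'Host anubis.ac.hmc.edu',
    '    Location: /pub/perl/scripts/text-processing',
    '           FILE -rwxrwxrwx         16  Feb 25 1994  emacs',
);
print "$_\n" for linkify_archie_output(@sample);
```

Keeping the state (the current host URL and location) in variables that outlive each line is what lets a FILE or DIRECTORY line build a complete URL from the Host and Location lines that preceded it.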

Figure 10.1: Archie results

[Graphic: Figure 10-1]

    alarm (0);
    $SIG{'ALRM'} = "DEFAULT";
    close (ARCHIE);
    print "</PRE>";
    print "</BODY></HTML>", "\n";

Finally, the pending alarm is canceled, the ALRM handler is reset to its default, and the file handle is closed.

} else {
    &return_error (500, $error, "Server uses unspecified method");
}
exit (0);

Remember how we set the SIG array so that a signal would cause the time_to_exit subroutine to run? Here it is:

sub time_to_exit
{
    close (ARCHIE);
    &return_error (500, $error,
        "The search was terminated after $timeout_value seconds.");
}

When this subroutine runs, it means that the 180 seconds allowed for the search have passed, and that it is time to terminate the script. Generally, the Archie server returns the matched FTP sites and their files quickly, but there are times when it is backed up with queued requests. In such a case, it is wiser to terminate the script than to let the user wait for a long period of time.

Now, we have to build a command that the Archie client recognizes using the parse_archie_fields subroutine:

sub parse_archie_fields
{
    local ($query, $server, $type, $address, $status, $options);
    $status = 1;
    $query = $FORM{'query'};
    $server = $FORM{'server'};
    $type = $FORM{'type'};
    if ($query !~ /^\w+$/) {
        &return_error (500, $error, 
            "Search query contains invalid characters.");

If the query field is empty or contains anything other than alphanumeric characters and underscores (A-Z, a-z, 0-9, _), an error message is output.

    } else {
        foreach $address (keys %servers) {
            if ($server eq $address) {
                $server = $servers{$address};
                $status = 0;
            }
        }

The foreach loop iterates through the keys of the servers associative array. If the user-specified server matches one of the names in the array, the corresponding hostname is stored in the server variable, and the status is set to zero.

        if ($status) {
            &return_error (500, $error, "Please select a valid archie host.");

A nonzero status indicates that the user specified a server name that is not in the array.

        } else {
            if ($type eq "cs_sub") {
                $type = "-c";
            } elsif ($type eq "ci_sub") {
                $type = "-s";
            } else {
                $type = "-e";
            }

If the user selected "Case Sensitive Substring", the "-c" switch is used. The "-s" switch indicates a "Case Insensitive Substring". For any other value, including the default "exact" choice, the "-e" switch ("Exact Match") is used.

            $options = "-h $server $type $query";
            return ($options);
        }
    }
}

A string containing all of the options is created, and then returned to the main program.
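The same three checks (query, server, search type) can also be written with direct hash lookups instead of a loop and a status flag. The following is a minimal sketch for Perl 5, not a replacement for the subroutine above. The archie_arguments and %switch_for names are hypothetical, the server table is cut down to two entries, and the result is returned as a list of separate arguments instead of a single string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A shortened copy of the server table, for illustration only.
my %servers = (
    'Rutgers University (NJ, USA)' => 'archie.rutgers.edu',
    'Finland/Mainland Europe'      => 'archie.funet.fi',
);

# Form values mapped to client switches; anything else means exact match.
my %switch_for = (cs_sub => '-c', ci_sub => '-s');

# Return the argument list for the client, or an empty list if the
# query or the server name is not acceptable.
sub archie_arguments {
    my ($query, $server_name, $type) = @_;
    return unless defined $query && $query =~ /\A\w+\z/;
    return unless defined $server_name && exists $servers{$server_name};
    my $switch = (defined $type && $switch_for{$type}) || '-e';
    return ('-h', $servers{$server_name}, $switch, $query);
}

my @args = archie_arguments('emacs', 'Rutgers University (NJ, USA)', 'ci_sub');
print "@args\n";    # -h archie.rutgers.edu -s emacs

my @bad = archie_arguments('emacs; ls', 'Rutgers University (NJ, USA)', 'exact');
print "rejected\n" unless @bad;
```

Using \A and \z in place of ^ and $ makes the pattern reject a query that ends with a newline, which $ alone would accept.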

Our last task is a simple one--to create a form that allows the user to enter a query, using the display_form subroutine. The program creates the form dynamically because some information is subject to change (i.e., the list of servers).

sub display_form
{
    local ($archie);
    print <<End_of_Archie_One;
Content-type: text/html

<HTML>
<HEAD><TITLE>Gateway to Internet Information Servers</TITLE></HEAD>
<BODY>
<H1>CGI Archie Gateway</H1>
<HR>
<FORM ACTION="/cgi-bin/archie.pl" METHOD="POST">
Please enter a string to search for: <BR>
<INPUT TYPE="text" NAME="query" SIZE=40>
<P>
What archie server would you like to use (<B>please</B>, be considerate
and use the one that is closest to you): <BR>
<SELECT NAME="server" SIZE=1>
End_of_Archie_One
    foreach $archie (sort keys %servers) {
        if ($servers{$archie} eq $default_server) {
            print "<OPTION SELECTED>", $archie, "\n";
        } else {
            print "<OPTION>", $archie, "\n";
        }
    }        

This loop iterates through the associative array in sorted order and outputs each server name as an option, marking the default server as selected.

    print <<End_of_Archie_Two;
</SELECT>
<P>
Please select a type of search to perform: <BR>
<INPUT TYPE="radio" NAME="type" VALUE="exact" CHECKED>Exact<BR>
<INPUT TYPE="radio" NAME="type" VALUE="ci_sub">Case Insensitive Substring<BR>
<INPUT TYPE="radio" NAME="type" VALUE="cs_sub">Case Sensitive Substring<BR>
<P>
<INPUT TYPE="submit" VALUE="Start Archie Search!">
<INPUT TYPE="reset"  VALUE="Clear the form">
</FORM>
<HR>
</BODY>
</HTML>
End_of_Archie_Two
}

The dynamic form looks like that in Figure 10.2.

Figure 10.2: Archie form

[Graphic: Figure 10-2]
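The part of the form that changes, the list of options, can be produced by a small subroutine that returns a string, which makes it easy to check on its own. This is a sketch for Perl 5. The server_options name and the two-entry server table are assumptions for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the <OPTION> lines for a server menu, sorted by label.
# The option whose hostname equals $default_server is preselected.
sub server_options {
    my ($default_server, %servers) = @_;
    my $html = '';
    foreach my $name (sort keys %servers) {
        my $selected = $servers{$name} eq $default_server ? ' SELECTED' : '';
        $html .= "<OPTION$selected>$name\n";
    }
    return $html;
}

my $options = server_options(
    'archie.rutgers.edu',
    'Rutgers University (NJ, USA)' => 'archie.rutgers.edu',
    'Australia'                    => 'archie.au',
);
print $options;
```

Because the labels are sorted, the menu order stays the same from one request to the next, whatever order the associative array happens to return its keys in.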

This was a rather simple program because we did not have to deal with the Archie server directly, but rather through a pre-existing client. Now, we will look at an example that is a little bit more complicated.

