Report 3

PGS Prototyping at ECS Science & Technology Lab

Name: Parag Ambardekar
PGS Manager
Hughes Applied Information Systems, Inc.
Landover, MD 20785

As part of the prototyping activities at the ECS Science and Technology Lab (STL), we plan to acquire, port and run algorithm code from various instruments on the distributed/parallel testbed and HPCC sites to the extent possible. The purpose of this activity is to assess various processing technologies (DCE, workstation multiprocessors, MPPs, etc.), evaluate and validate hardware architecture for PGS, and share lessons learned with the science community. It is also intended to identify issues for resolution to facilitate algorithm integration and testing. We plan to communicate the progress on these activities and important lessons learned for hardware architecture and algorithm integration and testing from time to time. We have already sent two reports so far, the first in April and the second in May. This is the third set of reports.

The reports are meant for informal exchange of technical information rather than formal ECS programmatic status. Consequently the plans for future activities may change based on schedule and resource constraints. These reports represent a snapshot of PGS activities in ECS STL at a given time. We do not spend time and resources to polish the reports and present the information in a logical manner. The attachment consists of several reports. Some duplication of information is unavoidable. The attached reports are:

* Distributed computing and parallel processing: Available alternatives

* Distributed computing of AVHRR/Land Pathfinder algorithm using OSF/DCE

* ScanSAR processor benchmark software for evaluating PGS hardware

* Plans for AVHRR/PGS toolkit prototyping

* Status of other activities

* Work planned for July-Aug 1994


PGS Prototyping at ECS Science & Technology Lab Progress Report #3

Date: 21 June, 1994

Distributed Computing and Parallel Processing: Available Alternatives

Narayan Prasad
Mac McDonald

1. Background

Technology offers a wide variety of options in using multiple processors to increase the performance of a computer system. There are many base architectures, different programming models and different interconnect schemes to make the CPUs work together as a team. Further, to complicate matters, systems employing multiple processors are able to provide different types of performance for different applications. In the light of recent developments in technology, potential speed-ups from multiprocessing are very attractive. Standard microprocessors are being designed for multiprocessing applications, their price/performance decreasing, and the speed of multiprocessing systems is improving dramatically as improved system designs and programming techniques are developed.

With the availability of powerful workstations, one approach to multiprocessing is by distributed cooperative computing with homogeneous or heterogeneous clustered workstations which offer a very viable and cost effective means of computing, and also provide entry into parallel computing. Distributed computing means, the cooperation of two or more machines communicating over a network as a single computational resource, and thus exploit the aggregate power of many machines to solve important scientific problems. A homogeneous computing environment consists of a number of computers of the same architecture running the same operating system. A heterogeneous computing environment employs computers with a variety of architectures and operating systems. In this context, a cluster (also called a farm) can be defined as a collection of computers on a network that can function as a single computing resource through additional system management software. In a clustered environment, it is typical to have a large number of different computers with widely varying resource and performance considerations to handle I/O operations, perform graphics and visualization, intensive computation, etc. The system management software harnesses the collective computing power of all the workstation platforms that form the cluster. It also efficiently distributes the total power available in the cluster to the users. The conventional vision of the distributed computing environment is characterized by the user or application searching a network, finding available resources (such as idle CPU cycles, etc.) and gathering them together to perform the required application.

Parallel programming on multiprocessor workstations has recently become immensely popular. With the introduction of powerful workstations like the SGI Challenge series, Sequent Symmetry Systems, Convex Exemplar multi-purpose parallel computer, etc. with multiple processors which provide true multiprocessor capability (not array or vector processors) that are tightly coupled (typically up to 24 processors) sharing a single pool of memory to enhance resource sharing and communication among different processes, parallel programming has been made easier and less expensive. All processors are identical, and can execute both user code and kernel (operating system) code. Programs written for single-processor system can usually run on these systems without modification for multiprocessing support.

In contrast to parallel programming on workstations, parallel programming on massively parallel systems like the Cray T3D, MasPar, etc. employ hundreds of processors to solve a problem. The CPUs are very tightly interconnected in which the speed of communications is balanced with the speed of each CPU. This allows hundreds or thousands of CPUs to be teamed together. In this approach, software needs to be modified to take advantage of the numerous processors.

2. Available alternatives

2.1 Distributed computing with DCE

One such environment that can harness the computing resource of several workstations or mainframe supercomputers is the OSF/Distributed Computing Environment (DCE). The DCE provides services and tools that support the creation, use and maintenance of distributed applications in a heterogeneous computing environment. The machines can physically be located anywhere, and are connected over the network. DCE provides interoperability and portability across heterogeneous platforms. DCE is based on three distributed computing models - client/server, remote procedure call and data sharing. The client/server model is a way of organizing a distributed application. The distributed application is divided into two parts, one part residing on each of the two computers that will be communicating during the distributed computation. The Remote Procedure Call (RPC) model is a way of communicating between parts of a distributed application. In this model, the client makes a procedure call, which is translated into network communications by the underlying RPC mechanism. The server receives a request, executes the procedure, returning the results to the client. The data sharing model is a way of handling data in a distributed system. In this model, the data is shared by distributing it throughout the system. In data sharing, a copy of the server's data is sent to the client, and the client accesses the file locally.

2.1.1 Selection criteria

The key to success is to find concurrency and parallelism in an algorithm. Some typical categories are:

* Placing the complete, sequential algorithm on each processor but segment the dataset so that each processor handles one segment of data at a time. For example, each orbit of data can be processed independently on a workstation in a typical client-server mode. The server may contain the user's application. The client creates the necessary distributed processing environment and assigns processing to the individual processors. This kind of processing is coarse-grained. It is usually the easiest to implement. A majority of the ECS algorithms may lend themselves to this kind of processing.

* I/O operations which are typically slower than other computational operations can be performed independently, usually targeting platforms that are more suitable for such an operation.

* Identifying areas within the algorithm where concurrency is exhibited, and spreading the processing across multiple processors in the cluster. The smaller the dependency between segments, lesser is the communication among processors, and greater the performance.

2.1.2 Applicability to processing scenarios

Programmers want to exploit the computing power and inherent parallelism available throughout the distributed environment. The DCE Threads Service provides portable facilities that support concurrent programming, allowing an application to perform many actions simultaneously. While one thread performs initialization of data structures, another thread can process user input. The Threads Service includes operations to create and control multiple threads of execution in a single process and to synchronize access to global data within an application. Because a server process using threads can handle many clients at the same time, the Threads Service is ideally suited to dealing with multiple clients in client/server-based applications.

2.1.3 Performance

DCE threads improve the performance (throughput, computational speed, responsiveness, or some combination of these) of a program. Multiple threads are useful in a multiprocessor system where threads run concurrently on separate processors. In addition, multiple threads also improve program performance on single processors systems by permitting the overlap of input and output or other slower operations with computational operations. Also, by distributing parts of the application, the overall throughput can be increased.

2.1.4 Portability

Portability across the entire distributed environment depends upon how portable the science algorithm is. Besides, portability across the distributed environment also requires file system and I/O transparency. The DCE file system gives users a uniform name space, file location transparency, and high availability. This means, data available in one location can be shared by a processor at another location.

2.1.5 Advantages

* Distributed clustered systems are economical, and the user can decide the configuration depending upon the requirements. Configuration can also be changed with changing requirements.

* Even sequential algorithms can be used in a coarse-grained distributed mode

* Any combination of workstations, supercomputers and even multi-processor systems can be used anywhere on the network

* Allows for targeting machines for I/O and CPU intensive portions of the program

* A cluster provides several advantages over a single system. Expensive storage resources are shared by the nodes in the cluster. It provides cost effective high availability by providing features that increase the time during which an application is available. Layered database products specifically designed for a shared disk environment perform better when run on a cluster.

* The DCE Threads Service provides a simple programming model for building concurrent applications

* Transparent multiprocessing support where applications need not know whether threads are executing on one or several processors

* An advantage of using multiple threads over using separate processors is that the former share a single address space, all open files, and other resources.

* Multiple threads can reduce the complexity of some applications that are inherently suited for threads.

* Provides data conversion for operating in a heterogeneous environment using DCE stubs generated from the interface IDL file

2.1.6 Disadvantages

* The performance depends on networks that cluster the workstations together. Relatively low bandwith of the interconnect, and the high communication overhead severely limit the types of algorithms that can effectively be processed in this mode. Algorithms that work well are those in which there is minimal coupling required between cooperating processors.

* Managing multiple threads can become complex in an application

2.2 Parallel processing The fundamental idea behind parallel computing is to task CPUs as a team to solve single or multiple problems. Parallel processing has historically been hardware- and performance-driven. Parallel systems are generally classified by the method in which they process data and instructions (i.e. in singular or multiple data or instruction streams). There are four combinations of ways of processing instructions and data:

* SISD (Single-Instruction, Single-Data)

* SIMD (Single-Instruction, Multiple-Data)

* MISD (Multiple-Instruction, Single-Data)

* MIMD (Multiple-Instruction, Multiple-Data)

The two most common architectures are SIMD and MIMD. In SIMD, each processor executes exactly the same command at each clock cycle on multiple data. MIMD on the other hand allows multiple instructions on multiple data during each clock cycle. Furthermore, depending on the memory organization for parallel systems, they can either share memory or have distributed memories as discussed later.

2.2.1 Selection criteria Whether a new parallel application is designed or an existing application from a serial system is being ported, the key to success is being able to see the application from a new viewpoint and disregarding older, more traditional solutions. Parallelism should not be forced on the problem. Instead, the point is to find and exploit any potential parallelism inherent in the problem. In some applications, the calculations on one set of data are completely independent of calculations on other sets of data (i.e., there is no communication between the calculations), and the calculations all require approximately the same amount of time. While in other applications, the time required may not be the same.

Depending upon the nature of the problem at hand, the user can create a parallel application on symmetric multiprocessors like (SGI Challenge, Sequent Symmetric Systems, Convex Exemplar Parallel Computer, etc.) or can distribute the application on clustered workstation platforms using parallelization tools that aid in the distribution of portions of the user's program to available processors on the network.

2.2.1.1 Parallel computing on multiprocessor workstations

Symmetric multiprocessing involves using a number of processors (probably not a "massive" number) to increase aggregate throughput of a computer system. All CPUs are identical and any CPU can execute both user code and kernel (operating system) code. The CPUs operate on a peer basis, executing a single copy of the operating system executive or kernel. There is no designated master CPU except during system startup and diagnostic operations. Any process (program) in any state can execute on any CPU. Several options are provided to the application developer. The first option is to simply recompile existing codes and let the compiler detect whatever parallelization possible. Such an option is called shared memory (SM) parallel. SM typically results in fairly fine-grained parallelization (for example, at the loop level). This design is called tightly-coupled, because the processors must coordinate access to the common memory. Each processor has cache memory which allows it to limit its accesses to the common memory and thereby minimize its use of the system bus. Parallel computation can occur at the statement, procedure, or program level, so these systems can accomodate anything from medium- to large-grain parallelism. The second option is the support of programs that explicitly support parallelization through message passing (MP). Parallel Virtual Machine (PVM), a message passing interface from Oak Ridge National Laboratories is one of the message passing libraries that is popular today. The benefits of using the MP programming model through PVM (or any other MP libraries) is that the application will often execute on a variety of platforms. This type of design is described as loosely-coupled, meaning that each processor is almost entirely self-sufficient due to its local memory. Distributed-memory multiprocessors are most useful for problems that can be broken down into totally independent or unrelated parts, each of which requires extensive computation or other activity (like I/O, etc.). The above two options may be combined to provide two levels of parallelism within an application.

2.2.1.2 Distributed/parallel computing on clustered platforms

Distributing parallel applications across clustered platforms fall under the message passing paradigm. The message passing method can run parallel applications across clustered heterogeneous platforms interconnected by a high performance network. In addition to PVM which add parallel support for computers distributed across a network, commercial packages like Linda (Scientific Computing) and Express (Parasoft) also add parallel support for multiple processors. Besides power compilers that are available (e.g. SGI Challenge, Sequent, Convex, etc.), commercial parallelizing compilers (like Applied Parallel Research, Inc.) are available that aid in creating parallel applications that can be distributed across a cluster of heterogeneous workstation platforms. APR's parallelizing compilers like xHPF90 and Forge90 generate a parallelized SPMD (Single Program, Multiple Data) Fortran 77 program for execution on a distributed memory multiprocessor system. Digital's Alpha AXP farms provide comprehensive solutions with heterogeneous workstation clusters through software for compute sharing, file systems support, and parallel computing. Their LSF software provide clustering capability by integrating UNIX systems and low-cost servers to effectively use the computing power available from distributed systems. The components of the cluster are Alpha AXP sytems (or any supported UNIX platform), Ethernet, FDDI, or GIGAswitch, UNIX cluster software LSF, PVM and PolyCenter can together increase throughput (through batch queuing, load leveling, interactive job management), perform parallel processing and utilize unused distributed compute cycles.

2.2.2 Performance

In symmetric systems as the processing load increases, several CPUs are available to service the increasing number of processes. Each process is given more time with a CPU, each process spends less time waiting, and the system spends less time switching from one process to the next. Performance is, thereby, greatly increased. Programs can be automatically or manually tailored to use multiple CPUs simultaneously, and thus obtain a dramatic increase in execution speed. Parallel applications can be structured to adapt themselves to the number of CPUs present in the system. Thus, the number of CPUs is simply a tuning parameter: the same parallel application can be run on systems of different configurations without any major modifications to the existing software.

In a distributed mode with clustered processors, the performance is affected greatly by the speed of interconnect (Ethernet, FDDI, FDDI+GIGAswitch, ATM, etc.).

2.2.3 Portability

The parallel nature of symmetric architectures is transparent to sequential programs, and most programs can be ported from other systems with little or no change to the source code. Both parallel programs and ordinary sequential programs can run simultaneously on these systems. However, for distributed heterogeneous architectures, portability is governed by the individual platforms the application will be run, and also the underlying communication and resource management software.

2.2.4 Advantages

Parallel processing on a cluster of workstations

* Share the advantages discussed in Section 2.1.5

Parallel processing using SMP

* Share the same memory subsystem, let alone the same cabinet

* By sharing common memory and I/O subsystems, the processor may share peripherals, eliminating redundancy and wasted disk space.

* Transparency - programs written for a uniprocessor system can run on a symmetry system without modifications for multiprocessing support.

* Dynamic load balancing - CPU's automatically schedule themselves to ensure that all CPUs are kept busy as long as there are executable processes available.

2.2.5 Disadvantages

Parallel processing on a cluster of workstations

* I/O is not parallelizable

* In a clustered workstation mode, speed of interconnect can affect performance

2.3 Parallel processing using Massively Parallel Processors (MPPs)

Although the term massively parallel processing is overused, it typically refers to processing done on numerous small processors. The paradigms discussed in Section 2.2 apply to massively parallel processing using MPPs. Table 1 lists a few MPPs along with the number of processors.

TABLE: 1

Number of processors

SIMD

Maspar MP-1 16384

TMC CM-2 2048

MIMD

nCUBE2 1024

nCUBE3 65536

iPSC/860 128

Intel Delta 512

Intel Paragon 4096

TMC CM-5 16384

Cray T3D 32-2048

2.3.1 Selection criteria

Unlike compiling serial code on a symmetric parallel processor using power compilers, serial code for a massively parallel processor requires major modification. The problem should lend itself to parallelism at a very fine-grained level to take advantage of the number of processors available. Loops should be redesigned to reduce dependencies. Usually the programmer is required to program the communication between processors, although newer tools are being developed for automatic parallelization of applications. However, parallel applications developed on distributed workstations may be more easily portable to MPPs than porting a serial code. In short, the spectrum of applications suitable for an MPP architecture is more limited.

2.3.2 Performance

MPP systems definitely have the potential of satisfying the need for more performance for less price. Performance depends upon how well the problem is tuned to the hardware. But they are limited by the programming model used, availability of application software development tools and limitations on parallelization.

2.3.3 Portability

* MPP Fortran dialects hide data communication and inter-processor management issues, but the tools they offer for data layout are machine specific, presenting an obstacle to porting applications from one MPP to another. Perhaps a language like High Performance Fortran (HPF) will go a long way in solving portability issues. Commercial MPP fortran implementations are only available on SIMD architectures. Commercial tools for MIMD are still too low-level. There is no widely accepted, architecture-independent language of the C family that has emerged.

* Despite advances in languages and compilers for parallel machines, programming a parallel machine will remain more challenging than programming a sequential machine, especially during debugging and performance tuning. There are more tuning parameters in parallel machines (such as data placement, granularity, etc.) than a sequential machine.

* MPP systems often require abstruse extensions to otherwise standard languages to extract performance from the architecture. In the absence of standards, computer system vendors have developed their own proprietary language extensions. This often leaves the resulting code riddled with non-portable constructs and thus limits portability. Perhaps, high-performance languages like HPF can make parallel applications hardware independent and more portable.

2.3.4 Advantages

* MPPs allow multiple I/O paths into the system. If the data can be partitioned among different I/O channels, then current MPPs can provide an acceptable solution.

2.3.5 Disadvantages

* Data distribution across I/O channels is not very mature than data distribution across CPUs

* Expensive

* Limited spectrum of applications

* Developing (or porting existing) applications to MPP architectures require a lot of programmer effort. Development can usually be accomplished via one of the distributed computing modes discussed earlier.


Distributed computing of AVHRR/land Pathfinder algorithm using OSF/DCE

Narayan Prasad

1. Background

While we are awaiting other ECS algorithms, we are continuing our technology assessment with the AVHRR/Land Pathfinder algorithm. Distributed computing using DCE (Distributed Computing Environment) is one approach to processing AVHRR/Land Pathfinder data. A prototype has been set up at the ECS Science and Technology Lab (STL) distributed/parallel testbed. The following is a brief summarization of DCE and its application to the Land Pathfinder algorithm. Some useful lessons learned using DCE is outlined to give the science algorithm developers a better understanding of the available options for distributing processing to multiple processors.

2. Identify parallelism/concurrency

The very first step in the process of distributing any application is to analyze the algorithm for parallelism.

2.1 Coarse-grained parallelism - distribution of orbits

In a very coarse-grained fashion, the AVHRR/Land Pathfinder algorithm lends itself to distributed computing. Each orbit of data is processed through all four steps from initialization to product generation. Processing each orbit can, therefore, be done independent of other orbits. The DCE thread capability can be used to concurrently process AVHRR data. The client machine creates multiple threads and assigns the processing of one or more orbits to each thread created. The threads are then mapped to available processors. For example, in an ideal case, to process 1 day (14 orbits) of AVHRR data, 14 threads can be created and mapped to 14 different processors on heterogeneous platforms. The 14 orbits of data are processed concurrently and independent of other orbits. The current processing rate (throughput) can be drastically improved with the proposed approach, depending upon the number of processors available. An attempt to distribute the AVHRR algorithm in a coarse-grained fashion (at least the initialization and navigation parts) is made as a part of the prototyping at the ECS Science and Technology Lab (STL).

2.2 Medium-grained parallelism - distribution of chunks

The AVHRR/Land Pathfinder algorithm also exhibits medium-grained parallelism. Each orbit is divided into units or "chunks" of scans. Processing is performed on each chunk as a separate unit. Each chunk is processed completely, from ingest through product generation, until a complete orbit is processed. Each of these chunks can be processed independently.

Processing in parallel introduces additional complexities depending upon the type of algorithm. If they require a common memory which is constantly updated, then a shared memory model is more appropriate than a distributed memory model. It is the nature of the problem that decides on what kind of processing would be the most appropriate to maximize efficiency.

2.3 Background and terminology of Distributed Computing Environment (DCE)

The DCE provides services and tools that support the creation, use and maintenance of distributed applications in a heterogeneous computing environment. "Distributed computing" means, the cooperation of two or more machines communicating over a network. The machines can physically be located anywhere, and are connected over the network. DCE provides interoperability and portability across heterogeneous platforms. DCE is based on three distributed computing models - client/server, remote procedure call and data sharing. The client/server model is a way of organizing a distributed application. The distributed application is divided into two parts, one part residing on each of the two computers that will be communicating during the distributed computation. The Remote Procedure Call (RPC) model is a way of communicating between parts of a distributed application. In this model, the client makes a procedure call, which is translated into network communications by the underlying RPC mechanism. The server receives a request, executes the procedure, returning the results to the client. The data sharing model is a way of handling data in a distributed system. In this model, the data is shared by distributing it throughout the system. In data sharing, a copy of the server's data is sent to the client, and the client accesses the file locally.

2.3.1 DCE Cells

A Cell is a basic unit of configuration and administration in DCE. It is the locality of reference. A cell consists of some number of systems that perform most of their communications with each other. A cell can have any number of machines. Every machine belongs to one cell. There are no geographical restrictions placed on cells. Cells can be created, changed, moved or even eliminated over time. All DCE cells are organized into contiguous namespace. This structure makes it easy for clients in one cell to locate and use services provided in another cell.

2.3.2 Cell Directory Service (CDS)

In order to help clients find servers in a flexible manner, DCE provides a name service to store binding information. A name service is a distributed database service used by applications to store and retrieve information. The CDS is a particular name service supplied with DCE. RPC servers store binding information (which consists of protocol sequence like the Internet Protocol, Network Computing Architecture connection-oriented protocol and the Transmission Control Protocol for transport, server host name, server process number, etc.), in the name service database so that RPC clients can retrieve the binding information and find servers.

2.3.3 DCE threads

A thread is a single, sequential flow of control within a process. In a traditional computer program, there is only one thread of control. Execution of the program proceeds sequentially, and at any given time, there is only one point in the program that is currently executing. The DCE Multithreading Service allows multiple threads, that is, multiple, concurrent flows of control, within a single process. The multiple threads can be mapped to multiple processors when they are available. All threads within a process use a common virtual address space. Threads may progress independent of one another. That is, one or more threads in a process can wait for I/O or events while others continue to run.

2.3.4 DCE RPC

The client-server model for distributed applications has a client program (client) and a server program (server), usually running on different systems of a network. The client makes a request to the server, which is usually a continuously running daemon process, and the server sends a response back to the client. The RPC mechanism is the simplest way to implement client-server applications. Applications that make remote procedure calls (RPC applications) is divided into two parts: an RPC server, which offers one or more sets of remote procedures, and an RPC client, which makes remote procedure calls to RPC servers. The RPC call looks very much like a local procedure call. The underlying details of the network communication are hidden from the user.

2.3.5 Interface between client and server

The first step in coding a DCE application is to define one or more interfaces through which the application's clients and servers will communicate. An interface definition contains a set of procedure declarations and data types. Just as programmers select functions from libraries, client application writers use interface definitions to determine the data type of the remote procedure's return value, and the number, order and the data types of the arguments. It is the interface definition that ties the client and the server together. The server and the client communicate via the interface IDL file. In the AVHRR/Land Pathfinder prototype, the interface includes names of the input and output data files, and metadata information. Interfaces are defined in a declarative C-like Interface Definition Language (IDL) and then compiled by the IDL compiler. The clients and servers communicate through a Universal Unique Identifier (UUID). Generating a UUID for an application's interface is the very first step in the IDL process. The UNIX command "uuidgen -i" produces an UUID. An example of an UUID is given below:

[
uuid(c21b62ec-7e3f-11cd-b2ba-08000918aeab),
version(1.0)
]
interface INTERFACENAME
{

}

The IDL compiler creates stubs for both the client and server. The IDL stubs take care of all the byte conversions from one machine architecture to another. For example, if parts of an application are to be distributed across heterogeneous architectures, the user need not be concerned with the byte-ordering, or whether it is 32/64 bit architecture when passing variables or data. It is this interoperable feature that would be very beneficial to science processing.

2.3.6 Developing a server

2.3.6.1 Server structure

A server consists of two portions of code: the server initialization portion, and the procedural or "manager" portion.

* The server initialization portion

This code initializes the server, enabling it to receive client requests across the network. The server makes the majority of its calls to RPC runtime routines during server initialization. The protocol sequence is selected here, the server is advertised, and then enters a server "listen" mode for remote procedure calls.

* The procedural or "manager" portion

This code implements the remote procedures. The procedures declared in the interface definition should be implemented. In the prototype, Procedure "pathfinder" is called in the manager portion of the server code.

The server exports its interface and binding information with the CDS, and in addition registers its CDS entry as a member of the CDS group. The server uses the UDP transport for efficiency. TCP could also be used. The server processes each request from a client in a thread.

2.3.7 Developing a client

The four principal phases of client development are as follows:

1. Implementing binding management method

The client uses the binding handle to connect to an appropriate server. The explicit method passes the binding handle to RPC runtime as a parameter. This method uses an individual remote procedure call to use a specific server. It is best suited to applications that make calls to many different servers.

The implicit binding offers the application developer basic control over which server is used by the application. This method is best suited to applications that use only one server.

Automatic binding is the least difficult for the client application developer to implement, but it offers the least control over which server provides remote procedures to the application. Automatic binding uses the CDS to manage the binding to the server.

2. Write code to obtain a binding handle. Can use the CDS to obtain the binding handle

3. Write code to make remote procedure calls

Once the client application has a binding handle (or handles) for the server(s) it needs, it can make remote procedure calls to the procedures offered by that client.

4. Write code that anticipates both successful and unsuccessful remote procedure calls

The client application must be prepared for both successful remote procedure calls, and unsuccessful ones - RPCs may fail due to communications failures, server failures, or for other reasons. Code for error handling and recovery should be written to handle failures.

The client looks up all the members of the CDS group in which the remote load servers in this cell have registered. The client creates a vector of binding handles, one for each of them. Then, for each server in the group, the client creates a separate thread to make an RPC to perform the requested server application (processing of the AVHRR data).

2.3.8 Where to embed the application for execution on a server?

The manager function contains the calls to the AVHRR pathfinder program. The pathfinder program is executed by the client creating a thread, making an RPC call to the server, and executing it on the server. For a coarse-grained concurrent application, the entire pathfinder algorithm goes into the manager function. Otherwise, only that portion of the algorithm that is to be executed on a server resides on the server as a part of the manager function.

2.3.9 Start the server application in each remote platform

Prior to starting the server in each platform, dce_login into each server with the user name. This allows registering of the server binding information with the CDS. If the server binding information is not registered for each server on each platform, the client will not automatically look for the appropriate server to run the application. In which case, the client needs to know explicitly the IP addresses of each server platform. The server is started in each platform where the distributed section of the code is to be executed. If continuous processing is required, the server process runs as a daemon. The server can be physically located anywhere on the network.

2.3.10 Start the client from a remote platform

Either the names of the servers are explicitly stated as command-line arguments to the client, or the client is made to choose the servers that are waiting to process the request via the CDS. The UUID number in the IDL file is used to match the client with the appropriate server processors. If all goes well, how many threads the client creates, and where the actual application is processed is transparent to the user.

2.4 Additional issues to be considered in a parallel approach

There are important issues that need to considered in a parallel approach.

2.4.1 Preparing output products

Some issues affect almost all the ECS algorithms, more specifically the AVHRR/Land Pathfinder algorithm. No assumption should be made on when a distributed process or thread completes. If the output products are binned in some order (for example chunk by chunk, or orbit by orbit), output product generation should be done intelligently without losing the performance gains due to parallelism (i.e., the client process should not wait until all the server processes are finished before writing output product files). One way to generate an output product in a parallel environment is through a random access file, which segment data based on record number, etc. In the current version of the Land Pathfinder algorithm, however, output product generation is designed for a sequential algorithm. No attempt is made in the prototype to redesign output product binning.

2.5 Lessons Learned

* In a distributed client-server environment good programming practice is essential. Whenever memory is allocated via pointers (using C functions calloc or malloc), it is important to release memory when not needed. Otherwise when the server application is finished, memory is still not deallocated because the server never terminates (the server is always in a listening mode waiting to process requests from the client). Depending upon the number of client requests, the server application can quickly run out of memory. The DCE call rpc_ss_allocate can instead be used which would automatically release memory when the client-server communication is terminated.

* Using Unix function sleep inside a DCE threads application causes an illegal instruction exception to be generated. HP DCE support recommends use of DCE routines (like wait) that can emulate the sleep function.

* Have plenty of swap space on the system. Unpredictable results occur which are often hard to trace.

* It is important to have sufficient communication time between the server and client. DCE calls in the client give options to manipulate communication parameters. A communication timeout can cause bizarre and unpredictable side effects.

2.6 General comments about DCE and its use in scientific applications

Scenarios where DCE may provide an excellent environment are:

* Applications exhibiting medium- to coarse-grained parallelism

* Applications where part of the algorithm does searching, or any database operation. The part that does this function can be installed on a server just dedicated to providing the result of the database search. The server can process multiple requests simultaneously and independently on a programmer specified number of threads, and providing the result to the client.

* Threads capability can be used for processing very intense application (like I/O) on one thread while proceeding with other less intensive operations on the other, even on a single processor. Thread performance is much better when processing operations with widely varying resources.

* Target platforms for certain intensive operations (like I/O, CPU)

2.7 Reuse of server and client software

If the server and client functions are designed well, they can be reused. The application usually dictates whether the client creates threads and processes all the threads in one processor, or multiple threads created by the client make RPC calls to map to remote servers, etc. All the server and the client software does is establish communication between the client and server. The manager.c program contains the application code in an encapsulated form.

2.8 Additional features in DCE of use for science algorithms

2.8.1 DCE Pipes

A pipe in a DCE application efficiently passes very large or incrementally produced quantities of data in a remote procedure call. The following kinds of data are candidates for pipes:

* large quantities of data

* data of unknown size that cannot be in memory all at once

* data incrementally produced or consumed and not in memory all at once

Pipes put the RPC mechanism in charge of data transfer because it can use the underlying transport protocol more efficiently than an high-level application can.

2.9 Current status

* A functional prototype of the initialization and navigational parts of the AVHRR/Land Pathfinder Algorithm has been developed using OSF/DCE on a distributed set of workstations. Distributed processing is done on three HP's in the same cell and another HP on a DCE cell at GSFC. Both intracell and intercell distributed computing has been prototyped. Interprocessor communication takes place over the internet.

3.0 References

1. OSF/DCE Application Development Guide, Revision 1.0, Open Software Foundation, Prentice Hall, Englewood Cliffs, New Jersey 07632

2. OSF/DCE Application Development Reference, Revision 1.0, Open Software Foundation, Prentice Hall, Englewood Cliffs, New Jersey 07632

3. OSF/DCE Guide to writing DCE Applications, John Shirley, O'Reilly & Associates, Inc., Sebastopol, CA 95472

4. NCSA/MOSAIC offers plenty of documents/programs on DCE and its applications

5. DCE installed on HP machines (/opt/dcelocal/hpexamples) come with excellent working examples that demonstrate the programming of DCE applications. Many of the communication software are designed for reuse.

6. Anonymous ftp site where DCE applications can be obtained:

ftp.uu.net

cd /published/oreilly/dce/applic_guide

7. OSF DCE Guide to Developing Distributed Applications,

Harold Lockhart, Jr., McGraw Hill, Inc., 1994.

(This book comes with a diskette containing many DCE examples)


ScanSAR Processor Benchmark Software for Evaluating PGS Hardware

Narayan Prasad

ScanSAR Processor benchmark software from JPL is specifically designed for the purpose of evaluating hardware computing platforms for implementing the Radarsat ScanSAR Processor. They are fragments of real science software, typical of ECS algorithms. This suite of benchmark software will be used to study various vendor architectures. It can accomplish the following:

* Test CPU performance

* Test I/O performance

* Test concurrent I/O and CPU performance

* Test output tolerance

* Test performance of Fortran/C interface (would be typical of ECS algorithms with use of PGS toolkit)

Overall Plan:

1. Analyze SAR benchmarks

2. For portability across platforms, write wrappers around C function calls made from the Fortran program. The C functions perform the I/O and other time keeping operations.

3. Evaluate serial processors (HP, Sun, SGI, IBM, DEC, etc.).

4. Evaluate symmetric multiprocessors (SGI Challenge, Convex SPP, Sequent SMP, etc.)

5. Evaluate MPPs (Connection Machine, Cray T3D, Intel Paragon, Maspar, etc.)

Current Status

* Analyzed SAR benchmarks

* Writing wrappers around C function calls for portability across platforms

* Evaluating on serial processors (HP, SGI)

Concern

CPU times are too small even on workstations.

Work-around

The processing will be repeated a large number of times by enclosing it in a loop.


Plans for AVHRR/PGS Toolkit prototyping

Larry Klein
Tom Atwater
21 June, 1994

The objective of this project is to include as many different PGS Toolkit calls as possible in the AVHRR/Land Pathfinder code. The main points to test are:

* Accuracy: Are results with Toolkit functions identical to results without Toolkit functions.

* Speed and efficiency: How fast are Toolkit functions compared to the existing functions.

* Completeness: Whether a given class of Toolkit functions

is sufficient to replace the given Pathfinder/AVHRR functionality.

* Ease of use: How easy is it to replace existing calls with Toolkit calls. This is relevant for science software that uses heritage code.

* Engineering studies: Is the structure of the Toolkit as good as it can be?

* PGS emulation utilities: The AVHRR code can be used to test emulation utilities which may be included in the SCF Toolkit.

* Of the existing Toolkit code, the major tool groups appropriate to include in this prototype are the Error/Status (PGS_SMF) and Process Control (PGS_PC) tools. There will not be a comprehensive effort to replace all AVHRR/Pathfinder error handling with the Error/Status tools. File open (PGS_IO_Gen) and temporary file handling (PGS_IO_Gen_Temp) tools can also be exercised. Note that all of the above tools fall in the "mandatory" category, in that science software is required to use them.

* It may also be possible to test such "optional" geolocation tools

as Time/Date (PGS_TD) and Coordinate System Conversion (PGS_CSC) tools. This is of lower priority, and depends on further analysis of the AVHRR algorithm.

* Other tools which may be tested, such as Level 0 Access (PGS_IO_L0), Memory Management (PGS_MEM) and Ancillary/Auxiliary Data access (PGS_AA), are not yet available.

* Prototyping will take place on the HP workstation adriatic, and not on the PGS testbed, for reasons of disk space. Numerical results and timing benchmarks will be comparable to testbed results since the hardware and software is identical except for Toolkit calls; in addition, a virgin copy of the AVHRR code (i.e., without Toolkit calls) will be maintained in parallel with the working version on adriatic for comparison purposes. Thus there will be two points of comparison.


Status of other activities

* Installed ERBE Inversion Subsystem code (from ERBE Data Management team at LaRC) on Sun Sparc10 (Solaris). Output products generated at the STL were verified for correctness. The code runs only for ~5 min CPU time. According to Dan Chrisman, DAAC Liaison at LaRC, this inversion subsystem code is the most time consuming part of their inversion algorithm. It is used repeatedly during processing. However, the relatively small CPU time does not make it an ideal candidate for distributed computing on the testbed.

* Ongoing exchange of information with MODIS and AVHRR/Land Pathfinder teams on distributed computing

* In collaboration with MODIS team at GSFC on AVHRR prototyping. Experiences and lessons learned are being shared.

* Portability issues of AVHRR/Land Pathfinder algorithm to 64-bit architectures complete. Issues were circulated via PGS Newsbites.

* Plan put together for evaluating SAR science software on the distributed/parallel testbed, selected HPCC facilities, and other vendor hardware.

* Exploring with Lyn Oleson (EDC) if Landsat Level 1A algorithm is suitable for PGS prototyping.

* Obtained Pathfinder SSM/I precipitation rate algorithm (Fortran 77) from NASA/MSFC. Pathfinder sea ice is expected shortly. Pathfinder Land and Ocean algorithms are expected in July-August timeframe.

* We are expecting some representative Fortran code from MODIS shortly

* Hardware Evaluation:

IBM SP-2 (second generation MPP)

Sequent SMP (Symmetric Multiprocessor)

Convex SPP (Scalable Parallel Processor)

Pyramid Technology Corporation SMP/MPP

* Test PGS toolkit with AVHRR/Land Pathfinder algorithm for accuracy, speed and efficiency, completeness, ease of use, PGS emulation studies and perform other engineering studies.


Work Planned for July-Aug, 1994

* Contact Dr. Bruce Guenther at NASA/GSFC regarding Level 1B calibration algorithm. It is one of the tall poles.

* Exploring the suitability of Water Leaving Radiances algorithm (Dr. Bob Evans, U. of Miami) for PGS prototyping.

* Contact Dr. Mark Abbot at Oregon State University. They have a big parallel processing facility with a Connection Machine (CM5). Their experiences and lessons learned would be very valuable to the prototyping efforts.

* Evaluate what parallel processors can do for the tall pole algorithms using HPCC resources.

* Decide on a few parallel processors based on ECS requirements and vendor specifications. Work with vendors (IBM, Cray, Sequent, Convex, etc.) to evaluate their architectures and parallelization tools.

* Continue toolkit/AVHRR study

* Continue hardware evaluation using ScanSAR benchmarks

* Evaluate applicability of SSM/I algorithms to symmetric multiprocessing, distributed processing on clusters, etc.