Name: Parag Ambardekar
PGS Manager
Hughes Applied Information Systems, Inc.
Landover, MD 20785
As a part of the prototyping activities, we plan to acquire, port and run algorithm code from various instruments on the distributed/parallel testbed and HPCC sites to the extent possible. The purpose of this activity is to evaluate and validate hardware architecture for PGS. It is also intended to identify issues for resolution to facilitate algorithm integration and testing. We plan to communicate the progress on these activities and important lessons learned for hardware architecture and algorithm integration and testing from time to time. I am attaching the second report prepared by Narayan Prasad.
The reports are meant for informal exchange of technical information rather than formal ECS programmatic status. Consequently the plans for future activities may change based on schedule and resource constraints.
Progress Report #2
Date: 26 April, 1994
Prepared by: Narayan Prasad
Prototyping with AVHRR/Land Pathfinder algorithm
Results
Successfully ported the AVHRR/Land Pathfinder algorithm to all machines (with 32-bit architecture) on the ECS STL distributed/parallel testbed. Performed benchmark runs on various operating systems. Total CPU times to process 14 orbits (1 day) of real data on various workstation platforms are:
HP735: HP-UX operating system, PA-RISC with integrated floating point co-processor, 150 MHz, 1 processor
Total CPU: 5h
HP715/50: HP-UX operating system, PA-RISC with integrated floating point co-processor, 50 MHz, 1 processor
Total CPU: 10h:30m
IBM RS6000: AIX operating system, RISC, 50 MHz, 1 processor
Total CPU: 6h:45m
Sun Sparc10: Solaris operating system, SuperSparc, 40 MHz, 1 processor
Total CPU: 15h:30m
SGI Indigo: IRIX 4.0 operating system, MIPS R4000, 100 MHz, 1 processor
Total CPU: > 15h
Lessons learned
The AVHRR/Land algorithm spends ~70% of its processing time on system related activities (which includes I/O operations). Among the workstation classes tested thus far, the HP735 appears to be the fastest. The SGI Indigo appears to be the slowest.
Porting 32-bit programs to 64-bit architectures
Lessons learned
Based on a preliminary analysis of the AVHRR/Land Pathfinder algorithm, basic portability issues are identified and broadly classified as follows:
Complexity introduced by byte-ordering
a) Little-endian (DEC, MasPar, etc.) - least significant byte stored first, so the high byte is on the right
b) Big-endian (HP, SGI, SUN, IBM, Cray, etc.) - most significant byte stored first, so the high byte is on the left
Implications:
Programs that use ancillary data or other binary data in packed format require modifications when ported across architectures with different byte ordering.
When porting among 32-bit machines, programs that read binary files created on an architecture with the opposite byte order may not require code changes. The binary file itself can be converted from one byte order to the other with a simple byte-swapping utility.
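A byte-swapping utility of the kind mentioned above can be sketched in a few lines of C. This is only an illustration: the function names (swap16, swap32, swap32_buffer) are hypothetical, and the fixed-width types come from the C99 header stdint.h rather than from the compilers discussed in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit value, e.g. 0x1234 becomes 0x3412. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit value, e.g. 0x11223344 becomes 0x44332211. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Convert a whole buffer of 32-bit words in place (hypothetical helper). */
void swap32_buffer(uint32_t *words, size_t count)
{
    size_t i;
    assert(words != NULL || count == 0);
    for (i = 0; i < count; i++)
        words[i] = swap32(words[i]);
}
```

A utility built on these routines still has to know the record layout: only multi-byte integer fields are swapped, while character data and packed bit fields need separate handling.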
Limitation on word size for address arithmetic
- Fortran
32-bit architectures:
single precision: integer - 4 bytes
real - 4 bytes
float - 4 bytes
64-bit architectures:
single precision: integer - 8 bytes
real - 8 bytes
float - 8 bytes
- C
32-bit architectures:
short - 2 bytes
long - 4 bytes
float - 4 bytes
64-bit architectures:
short, long - 8 bytes (Cray)
float - 8 bytes (Cray)
Implications:
The Cray Y-MP and C-90 do not discriminate between short and long integers in C. The minimum word size is 8 bytes (64 bits). However, compiler options allow the representation of 24/32 bits for short and 46/64 bits for int (discussed under compiler options). If data are packed in fields smaller than 8 bytes, the code must read them as unsigned char and perform the necessary bit manipulations.
Depending upon the complexity of the code and extent of pointer arithmetic used in reading binary/ancillary data, this remains the most challenging and time consuming part in code conversion for 64-bit architectures. Perhaps some high-level standard tool can be created to duplicate 32-bit architecture-address-arithmetic on 64-bit architectures.
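One way to avoid depending on the native word size is to address binary records strictly by byte offset and assemble each integer from unsigned char values. The sketch below assumes big-endian fields and uses hypothetical helper names (read_be16, read_be32, record_field_be32); it is not taken from the Pathfinder code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 16-bit big-endian unsigned integer from two bytes. */
uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/* Assemble a 32-bit big-endian unsigned integer from four bytes. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* Read a 32-bit field at a byte offset inside a record of known length.
   The offset is counted in bytes, never in units of short, int or long,
   so the result does not change with the size of those types. */
uint32_t record_field_be32(const unsigned char *record, size_t record_len,
                           size_t offset)
{
    assert(offset + 4 <= record_len);
    return read_be32(record + offset);
}
```

Because the value is built with shifts instead of a pointer cast, the same source gives the same result on big-endian, little-endian and 64-bit-word machines.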
Compiler options available on 64-bit architectures
- DEC Fortran (DEC Alpha)
"-convert_big_endian" switch converts big-endian data to little-endian before the data are processed and converts little-endian data back to big-endian before they are written to the file (and vice versa).
Compiler options are also available to read IBM (TM) System/370 format and Cray format.
Implication:
Portability is easier when porting to/from DEC Alpha from/to other big-endian architectures when input files are flat binary files and the language is Fortran.
- DEC C (DEC Alpha)
No switch available to convert big-endian to little-endian and vice-versa, or to read binary data created on other architectures
Implication:
Either convert data to DEC native (little-endian) format or use Fortran for all I/O operations using Fortran compiler switch to specify data format.
-taso compiler option allows the compiler to use 32-bit addressing mode. This option is useful to allow porting of 32-bit programs to 64-bit DEC Alpha
Implication:
Degradation in performance when running in 32-bit mode on 64-bit DEC Alpha
- Cray Fortran
"-eE" option for Cray C90 and Cray Y-MP systems only.
Compiles the program for use with the Cray Research MPP emulator. The emulator library, libemu.a is automatically linked in at the loader phase.
Implication:
Provides a gateway to Cray Massively Parallel Processor (Cray T3D) for existing Cray programs
Compiler options to read formats from other architectures
- Cray C
Compiler options available to specify 24/32 bits for short, 46/64 bits for int
Implication:
Not useful for portability. It only helps in faster arithmetic for existing Cray programs
1.0 Background
AVHRR/Land Pathfinder algorithm processes Global Area Coverage (GAC) data from NOAA polar orbiting satellites.
2.0 Processing steps
The following sections broadly identify the processing steps taken by the AVHRR/Land Pathfinder algorithm. They discuss the commonality with other ECS algorithms, applicability to distributed/parallel processing and distributed/concurrent processing, and portability to different architectures, and they give insights into scenarios for PGS scheduling and AI&T.
2.1 Preprocess GAC data
Initialization
- allocate memory for uncalibrated GAC data structure, ancillary GAC data structure, chunk (user specified pixels), granule metadata structure
- read, store process control information (user supplied), create log files
- open ancillary files, create binary file and write out layers 1-10
- initialize Goodes projection (earth radius, central meridians and false eastings for each of the 12 regions)
Create data chunk
- create data chunk equal to user specified size, decode header information from GAC data (architecture dependent - byte ordering, 64-bit). The summary header is written by a VAX (little endian) with the bytes of each short integer swapped.
- read documentation header (512 bytes) to pick up satellite id and orbit number
- read scan data - offsets are calculated to scan record (architecture dependent - byte ordering, 64-bit), copy data to chunk for processing. The chunks are kept in memory. One day (14 orbits) of GAC data in packed format is ~850 MB.
Unpack GAC data
- decode scan data (unpack) (architecture dependent - byte ordering, 64-bit), scan record is 3584 bytes. 10-bit masks are used to pull off and save channels of data into bank. There is extensive bit manipulation with each bank equal to 4 bytes of data. Perform decoding for all scans in the chunk. Five channels are used as a linked list of nodes. The unpacked data are stored in 16-bit integer format.
Missing pixels are marked.
Unpack GAC ancillary data
- Loop through all the scans in the chunk and unpack GAC ancillary data, extract and decode navigation tie points, interpolate the tie points to each pixel in the scan. The data are packed as three 10-bit words in 4 bytes. long int of 4 bytes is assumed for decoding.
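The unpacking of three 10-bit words from 4 bytes can be written without assuming a 4-byte long int. The sketch below assumes the 4 bytes form a big-endian 32-bit group whose two most significant bits are unused, with the first sample in the highest remaining bits; that layout and the function name unpack_10bit_triplet are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_MASK 0x3FFu   /* ten low-order bits set */

/* Unpack three 10-bit samples from one 4-byte group.
   Assumed layout: [2 unused bits][sample 0][sample 1][sample 2]. */
void unpack_10bit_triplet(const unsigned char bytes[4], uint16_t out[3])
{
    uint32_t word;

    assert(bytes != NULL && out != NULL);

    /* Build the 32-bit group byte by byte so that neither the host byte
       order nor the size of long affects the result. */
    word = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] <<  8) |  (uint32_t)bytes[3];

    out[0] = (uint16_t)((word >> 20) & SAMPLE_MASK);
    out[1] = (uint16_t)((word >> 10) & SAMPLE_MASK);
    out[2] = (uint16_t)( word        & SAMPLE_MASK);
}
```

For example, the bytes 0x00 0x10 0x08 0x03 unpack to the samples 1, 2 and 3. The results are stored as 16-bit integers, matching the unpacked format described above.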
Implication for PGS
- There may be commonality with other algorithms (both Pathfinder and non-Pathfinder algorithms) in the area of reading ancillary data. These routines are all architecture dependent, and may require code modifications to port to other architectures. This step may not be parallelizable because interpolation and extrapolation along pixels may depend on neighboring pixels.
2.1.1 Navigate chunk
- for each scan line, each position is initialized with lat/lon, zenith, solar azimuth and other variables
propagate ephemeris data
- propagate the ephemeris data, assign values to satellite position, velocity and attitude
determine sensor orientations
- perform a simple calculation of the sensor orientation from the orbit position vector and input values of the attitude offset angles
determine pixel locations
- is done individually pixel by pixel from a series of engineering calculations
Implication for PGS
- Most time consuming part of the AVHRR algorithm. Some loops can be parallelized. Can perform operations on pixels/scans in parallel within a chunk. The ephemeris data (~0.25 MB) is a binary file. Reading such a file is hardware dependent.
2.1.2 Goodes bin number
- Determine Goodes projection region from pixel's latitude and longitude and assign bin number. Pixels are processed individually for each scan until processed for the entire chunk.
Implication for PGS
- Assignment of Goodes bin number can be done in parallel for all the pixels/scans
2.1.3 Extract ancillary data
- extract topography (land/sea) from the topography file for each scan line
- remove ocean chunks for each scan line
Implication for PGS
- The AVHRR topographic information is a binary file of size ~45 MB. Reading binary input is architecture dependent.
2.2 Pixel-based processing
2.2.1 Calibration
- calibrate (visible, thermal): apply calibration coefficients to convert counts in channels 1-5 to radiance
- calculate ICT temperatures: compute 50 point running average
Calibration is done one pixel at a time. No external files are needed to perform calibration.
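The two calibration steps can be illustrated with a short C sketch. It assumes a linear count-to-radiance relation (slope and intercept per channel) and a trailing averaging window; the report does not state either detail, and the names calibrate_count, running_average and ICT_WINDOW are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ICT_WINDOW 50   /* number of points in the running average */

/* Convert a raw count to radiance, assuming a linear calibration. */
double calibrate_count(unsigned count, double slope, double intercept)
{
    return slope * (double)count + intercept;
}

/* Trailing running average: out[i] is the mean of the last `window`
   inputs ending at i (fewer near the start of the series). */
void running_average(const double *in, double *out, size_t n, size_t window)
{
    double sum = 0.0;
    size_t i;

    assert(window > 0);
    for (i = 0; i < n; i++) {
        sum += in[i];
        if (i >= window)
            sum -= in[i - window];   /* drop the sample leaving the window */
        out[i] = sum / (double)(i < window ? i + 1 : window);
    }
}
```

A call such as running_average(ict, smoothed, n, ICT_WINDOW) would smooth a series of ICT temperatures. Each pixel's calibration is independent of the others, which is why pixels or scans within a chunk can be calibrated in parallel.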
Implication for PGS
- Many ECS algorithms perform calibration for each pixel. Pixels/Scans within a chunk can be processed in parallel. Serial run takes 11% of the total CPU time.
2.2.2 Atmospheric correction
- ozone: calculate expected absorption from two passes through ozone layer of the solar flux. These are passed to Rayleigh routine to calculate Rayleigh correction and ground arriving radiance.
- Rayleigh: calculate Rayleigh contribution to the flux observed at the satellite
Atmospheric correction is done one pixel at a time.
Implication for PGS
- Many ECS algorithms perform atmospheric correction. There may be commonality in function. Ten percent of the total CPU time is spent in performing atmospheric correction. Pixels/Scans within a chunk can be processed in parallel. Ozone data are read from CDF file (~55 MB). Reading such a file is hardware independent.
2.2.3 Normalization
- normalize the flux at the sensor (pure arithmetic operation), one pixel at a time
Implication for PGS
- Pixel based processing can be done in parallel. May have commonality with other ECS algorithms.
2.2.4 Cloud screen (CLAVR)
- The CLAVR algorithm produces a cloud map using land, daytime thermal and visible thresholds derived by NOAA.
Implication for PGS
- CLAVR is common to other AVHRR Pathfinder algorithms. Processing is done for each pixel. Potential for parallelization exists.
2.2.5 Generate NDVI
- The daily NDVI values are generated for each pixel.
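For reference, NDVI is the normalized difference of the near-infrared and visible channels, (NIR - VIS) / (NIR + VIS), which for AVHRR are channels 2 and 1. A minimal per-pixel sketch in C follows; the function names and the use of a caller-supplied fill value for undefined pixels are assumptions, not the Pathfinder implementation.

```c
#include <assert.h>
#include <stddef.h>

/* NDVI for one pixel. Returns `fill` when the denominator is not positive. */
double ndvi(double vis, double nir, double fill)
{
    double denom = nir + vis;

    if (denom <= 0.0)
        return fill;
    return (nir - vis) / denom;
}

/* NDVI for every pixel of a scan. No iteration depends on another,
   so the loop could be divided among processors. */
void ndvi_scan(const double *vis, const double *nir, double *out,
               size_t n, double fill)
{
    size_t i;

    assert(vis != NULL && nir != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = ndvi(vis[i], nir[i], fill);
}
```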
Implication for PGS
- Pixels are processed sequentially. Potential for parallelization exists.
2.3 Product generation and binning
- Bin product layers: The daily NDVI and associated layer products are generated by mapping (binning) each pixel from the chunk into the Goodes projection space.
- Create output HDF
Implication for PGS
- Each pixel is binned independently and, therefore, has potential for parallelization. HDF eliminates hardware dependency. An intermediate binary file is created during processing and deleted after the HDF file is successfully written. The intermediate binary and output HDF files are always ~196 MB and ~230 MB, respectively, their sizes independent of how many orbits are processed for one day.
3.0 Algorithm requirements
Memory
- at least 32 MB RAM
Disk space
- Input files
GAC ancillary data (in packed format) for 1 day (14 orbits): ~850 MB
TOMS ozone file (CDF) (for a period): ~55 MB
land and ocean topography (binary file): ~45 MB
- Intermediate files
Binary file: ~196 MB
- Output file
Golden HDF file (1-14 orbits): ~230 MB
Log files, and other input metadata files: < 1 MB
Total CPU times to process 14 orbits (1 day) of real data on various workstation platforms are:
HP735: HP-UX operating system, PA-RISC with integrated floating point co-processor, 150 MHz, 1 processor
Total CPU: 5h
HP715/50: HP-UX operating system, PA-RISC with integrated floating point co-processor, 50 MHz, 1 processor
Total CPU: 10h:30m
IBM RS6000: AIX operating system, RISC, 50 MHz, 1 processor
Total CPU: 6h:45m
Sun Sparc10: Solaris operating system, SuperSparc, 40 MHz, 1 processor
Total CPU: 15h:30m
SGI Indigo: IRIX 4.0 operating system, MIPS R4000, 100 MHz, 1 processor
Total CPU: > 15h
Miscellaneous files used
- HDF3.2r4 or later
- CDF2.2 or later
4.0 Overall implication for PGS
Distributed/concurrent processing
Processing is performed on each chunk as a separate unit. Each chunk is processed completely, from ingest through product generation, until a complete orbit (file) is processed. Therefore, the AVHRR algorithm would be a good candidate for distributed processing in a concurrent fashion. Each orbit can be completely processed on a single processor, then an output file written. The code has to be re-designed as intermediate results are stored in memory and accessed for later use.
Distributed/parallel processing
Each pixel within a scan is processed sequentially. Potential for parallelization exists.
Demand on PGS scheduler
There is no dependency on data products from another algorithm/instrument. It appears that there is no real requirement on the scheduler to prepare the algorithm for processing, other than normal scheduler functions expected minimally by any ECS algorithm.
Parallel architecture: < 20 processors
Massively parallel: > 20 processors
High performance machines at JPL HPCC
- Cray T3D
- Convex SPP
- IBM SP1
The Cray T3D has 256 processing elements.
JPL has been given an early release of the Cray T3D (currently with 128 processing elements). Applications have been ported to Cray T3D from Intel Delta, Cray Y-MP, C-90 and workstations. No MasPar code has been converted to Cray T3D.
Consultants are available to help new users.
I/O is currently processed on the Cray Y-MP (front-end). Cray plans to have four gateways for I/O on the Cray T3D.
Cray T3D follows IEEE format
Cray T3D C compiler has the option of 32-bit or 64-bit arithmetic. This may help in porting 32-bit programs.
The number of processors can be requested as a run-time command. The requested number of processors should be a power of 2 (system constraint).
Currently uses PVM as a message passing library (expected to include others in future releases). Currently no automatic parallelization is done. Message passing must be explicitly programmed for using multi-processors.
Cray Research Adaptive Fortran (CRAFT), a tool to aid parallelization, is expected to be released shortly. CRAFT is an extension of Fortran 77, and includes Fortran 90 features such as array syntax and intrinsics.
"-eE" option available on the C-90 for porting applications to Cray T3D is not very helpful in C. It can, however, be used for debugging applications.
Distributed/parallel applications developed on workstation clusters can be easily ported to the Cray T3D.
Workstation clusters are usually slow because of slow communication between processors. The Cray T3D processing elements are connected by very fast bi-directional 3-D torus system interconnect network with peak interprocessor communication rates of 300 Mbytes per second in every direction through the torus resulting in up to 76.8 Gbytes per second of bisection bandwidth.
IMSL has parallel version of math libraries (but not on a massively parallel architecture)
Some benchmarks on parallel codes for a 2D particle simulation (Robert Ferraro, JPL) were presented.
Loop time: Total time for running the simulation minus the initialization time
Loop time (CPU time)
Cray C-90, 1 processor: 1105.6 sec
Cray Y-MP, 1 processor: 2264.9 sec
IBM RS6000, 1 processor: 7255.4 sec
Sun Sparc1000, 1 processor: 25299.7 sec
Intel IPSC/860, 64 processor: 447.1 sec
Intel IPSC/860, 32 processor: 851.9 sec
Intel IPSC/860, 16 processor: 1650.0 sec
Intel IPSC/860, 8 processor: 3261.7 sec
Intel Paragon XP/S, 64 processor: 348.2 sec
Intel Paragon XP/S, 32 processor: 659.2 sec
Intel Paragon XP/S, 16 processor: 1304.4 sec
Intel Paragon XP/S, 8 processor: 2640.9 sec
IBM SP1, w/MPL, 32 processor: 270.2 sec
IBM SP1, w/MPL, 16 processor: 422.8 sec
Cray T3D, w/pvm, 128 processor: 73.5 sec
Cray T3D, w/pvm, 64 processor: 126.8 sec
Cray T3D, w/pvm, 32 processor: 241.9 sec
Cray T3D, w/pvm, 16 processor: 487.1 sec
TMC CM-5, f77 w/CMMD, 32 processor: 1514.4 sec
RS/6000, M580 cluster, 4 processor: 1578.3 sec
RS/6000, M580 cluster, 2 processor: 3305.3 sec
Caveat: Based on only a single benchmark algorithm.
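The loop times above can be compared through relative speedup and scaling efficiency. The C sketch below uses hypothetical function names. Applied to the Cray T3D figures, going from 16 to 128 processors (487.1 sec to 73.5 sec) gives a speedup of about 6.6 for 8 times the processors, or an efficiency of about 0.83.

```c
#include <assert.h>

/* Ratio of the run time on the smaller configuration to the run time
   on the larger configuration. */
double relative_speedup(double base_seconds, double scaled_seconds)
{
    assert(base_seconds > 0.0 && scaled_seconds > 0.0);
    return base_seconds / scaled_seconds;
}

/* Speedup divided by the increase in processor count.
   A value of 1.0 would mean perfect linear scaling. */
double scaling_efficiency(double base_seconds, int base_procs,
                          double scaled_seconds, int scaled_procs)
{
    assert(base_procs > 0 && scaled_procs > 0);
    return relative_speedup(base_seconds, scaled_seconds) *
           (double)base_procs / (double)scaled_procs;
}
```

These ratios describe only the scaling of this one 2D particle simulation; they say nothing about how an I/O-dominated code such as the AVHRR algorithm would scale.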
Cray T3D is very highly dependable
A solid understanding of the algorithm is necessary to do parallelization.
Prepared by: Narayan Prasad
1. Background
The AVHRR/Land Pathfinder algorithm processing software system can be partitioned into four main components based on function: 1) initialization, 2) preprocessing, 3) pixel-based processing, and 4) product generation. During initialization, memory is allocated, output files are created, error handling routines are initialized and ancillary data files are opened for reading. The AVHRR Level 1B GAC orbit consists of roughly 12,000 scans or 50 MB of raw data in NOAA's packed format. The data processing is performed by dividing each orbit into units or "chunks" of scans. Processing is performed on each chunk as a separate unit. Each chunk is processed completely, from ingest through product generation, until a complete orbit (file) is processed. A binary file with random-access binary format for speed critical processing is used for product generation. After processing is complete, the product layers are copied to an HDF file.
2. Distributed Computing Environment (DCE)
The DCE provides services and tools that support the creation, use and maintenance of distributed applications in a heterogeneous computing environment. "Distributed computing" means the cooperation of two or more machines communicating over a network. The machines can physically be located anywhere, and are connected over the network. DCE provides interoperability and portability across heterogeneous platforms. DCE is based on three distributed computing models: client/server, remote procedure call and data sharing. The client/server model is a way of organizing a distributed application. The distributed application is divided into two parts, one part residing on each of the two computers that will be communicating during the distributed computation. The Remote Procedure Call (RPC) model is a way of communicating between parts of a distributed application. In this model, the client makes a procedure call, which is translated into network communications by the underlying RPC mechanism. The server receives the request, executes the procedure, and returns the results to the client. The data sharing model is a way of handling data in a distributed system. In this model, the data is shared by distributing it throughout the system: a copy of the server's data is sent to the client, and the client accesses the file locally.
2.1 DCE threads
A thread is a single, sequential flow of control within a process. In a traditional computer program, there is only one thread of control. Execution of the program proceeds sequentially, and at any given time, there is only one point in the program that is currently executing. The DCE Multithreading Service allows multiple threads, that is, multiple, concurrent flows of control, within a single process. The multiple threads can be mapped to multiple processors when they are available. All threads within a process use a common virtual address space. Threads may progress independent of one another. That is, one or more threads in a process can wait for I/O or events while others continue to run.
3. Suitability of AVHRR algorithm for distributed computing
The AVHRR algorithm lends itself to distributed computing. Each orbit of data is processed through all four steps from initialization to product generation. Processing each orbit can, therefore, be done independent of other orbits. The DCE thread capability can be used to concurrently process AVHRR GAC data. The client machine creates multiple threads and assigns the processing of one or more orbits to each thread created. The threads are then mapped to available processors. For example, in an ideal case, to process 1 day (14 orbits) of AVHRR GAC data, 14 threads can be created and mapped to 14 different processors on heterogeneous platforms. The 14 orbits of data are processed concurrently and independent of other orbits. Each process is blocked until all the processes are completed. Upon completion of all processes, the output file is written. The current processing rate of 5 hours (on HP735) for 14 orbits can be drastically improved with the proposed approach, depending upon the number of processors available. The distributed/parallel testbed available at the Landover facility can be used to prototype distributed computing of the AVHRR pathfinder processing.
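The one-thread-per-orbit scheme can be sketched with POSIX threads, which are used here as a stand-in for the DCE threads interface. Everything in the sketch is hypothetical: process_orbit is a placeholder for the four processing steps, and process_day simply waits for all 14 workers before any output would be written.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

#define NUM_ORBITS 14   /* one day of GAC data */

typedef struct {
    int orbit;    /* orbit index, 0..NUM_ORBITS-1 */
    int status;   /* set by the worker: 0 means success */
} orbit_job;

/* Placeholder for initialization, preprocessing, pixel-based processing
   and product generation of one orbit. */
static int process_orbit(int orbit)
{
    (void)orbit;
    return 0;
}

static void *orbit_worker(void *arg)
{
    orbit_job *job = (orbit_job *)arg;

    assert(job != NULL);
    job->status = process_orbit(job->orbit);
    return NULL;
}

/* Start one thread per orbit, wait for all of them, and return the
   number of orbits that were processed successfully. */
int process_day(void)
{
    pthread_t threads[NUM_ORBITS];
    orbit_job jobs[NUM_ORBITS];
    int started = 0, ok = 0, i;

    for (i = 0; i < NUM_ORBITS; i++) {
        jobs[i].orbit = i;
        jobs[i].status = -1;
        if (pthread_create(&threads[i], NULL, orbit_worker, &jobs[i]) != 0)
            break;   /* stop launching if a thread cannot be created */
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);   /* block until this orbit is done */
        if (jobs[i].status == 0)
            ok++;
    }
    return ok;
}
```

Since all threads of a process share one address space, a sketch like this only spreads work over the processors of a single machine. Reaching processors on other machines would additionally require each worker to hand its orbit to a remote server, for example through RPC.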
4. Summary and conclusions
The AVHRR/Land Pathfinder processing lends itself to being structured as multiple flows of control. The DCE multithreading capability can be used to distribute the processing to multiple processors on heterogeneous platforms. Concurrent processing can drastically improve performance and throughput. Such a study would give us a better understanding of the type of processing strings that relate to PGS design.
Prepared by: Narayan Prasad
1. Why support other modern programming languages?
Fortran certainly is the dominant scientific programming language, despite recent inroads by languages such as C and C++. Most of the Fortran code in existence today is in Fortran 77. Fortran 77 is a POSIX standard and is, therefore, supported by the toolkit. However, Fortran 77 is suitable for sequential processing, and is badly out of step with modern high performance computing architectures. Fortran was originally developed for serial machines with linear memory architectures. It is reaching its limitations on the latest generation of high-performance machines, and has difficulties when executing on parallel machines.
There are also other issues that are limitations in Fortran 77:
Interface performance
Input/Output
Communication and synchronization
Code tuning for various architectures
The following languages are a step toward bringing the convenience of sequential Fortran to the complex parallel machines of today.
1.1 Fortran 90
Fortunately, the deficiencies of Fortran 77 have been addressed with the introduction of Fortran 90. Fortran 90 contains all the features expected of a modern programming language, as well as some of the features of a modern object oriented language. Codes written in Fortran 90 are more concise, efficient, readable, less prone to error (because it is more structured), and better suited to modern computer architectures. Fortran 90 has features which assist in the expression of data parallelism. It provides features that can significantly facilitate optimization of array operations on many computer architectures. Other features of Fortran 90 that improve upon the features provided in Fortran 77 are:
Additional storage classes of objects - allocatable, automatic and assumed shape-objects, and the pointer facility
Modules - enables the practice of design and implementation using data abstractions
Interface blocks - enables specifying interfaces to subprograms explicitly, allowing a high quality compiler to use the information to provide better checking and optimization at the interface to other subprograms
Additional intrinsic procedures - support mathematical operations on arrays, including construction and transformation on arrays, MIL-STD bit manipulation procedures, etc.
Reliability - large increase in reliability
Increased efficiency - array expressions on array processors, modules (data storage optimization), pointers to arrays
Object Oriented Programming - good data abstraction, class hierarchies provided by nested data type definitions, Fortran 90 operations are essentially data type non-specific
Fortran 77, in short, is a complete subset of Fortran 90.
1.1.1 Current Status of Fortran 90
1.1.1.1 Standards
Fortran 90 is a current ANSI and ISO programming language standard. It was accepted by the committees as a standard in 1992. It is not POSIX compliant.
1.1.1.2 Compilers
NAGware Fortran 90 compiler
Platforms: Apollo DN10000, DECstations, HP 9000, IBM RS/6000, NeXT, Intel 386/486 (MS DOS), Sun 3, Sun 4 (SunOS 4.1); VAX/VMS is forthcoming
VAST-90
Platforms: SPARC (Sun 4), IBM RS/6000
PF90 Version 2.0
Platforms: Sun 4, IBM RS/6000, SGI, DECstations, Convex, Alliant, IBM 3090 (AIX), Cray
Salford FTN90
1.2 High Performance Fortran (HPF)
HPF was designed to provide a portable extension to Fortran 90 for writing data parallel applications. It includes features for mapping data to parallel processors, specifying data parallel operations, and methods for interfacing HPF programs to other programming paradigms. HPF's parallel constructs make it easy for the programmer to indicate potentially parallel operations. HPF is based on Fortran 90, which is the latest in a long line of Fortran standards. The data mapping features in HPF can describe how data is to be divided up among the processors in a parallel machine. Even when there are many more parallel operations than there are processors on the target machine, HPF allows the extra parallelism to be specified. This way when the program is ported to a more parallel machine, it can immediately take advantage of the extra speed available. Although HPF was motivated by parallel architectures, the constructs can be used on any computer, in much the same way that Fortran 90 vector assignments can also be used for scalar processors.
1.2.1 Current Status of HPF
The High Performance Fortran Forum (HPFF) produced the High Performance Fortran Language Specification, Version 1.0 (Final) in May 1993. This document contains all the technical features proposed for the language. It is the latest set of extensions to Fortran 90.
1.2.1.1 Standards
HPF is not a standard recognized by the formal national and international standards committees. The final version of the HPF language specification has been published.
1.2.1.2 Vendors/Compilers
Thinking Machines, Inc. has a compiler called cmf for their CM-2. Cray Research, Inc. is committed to producing a version of HPF for their MPP product, which will access the hardware directly for efficiency.
1.2.2 Concerns
Early versions of HPF are, however, likely to have only limited application areas, since they will have restricted MIMD (Multiple Instruction stream, Multiple Data stream) support, and neither unstructured data objects nor sophisticated load-balancing support, although such developments are likely in the future. Fortran 90 has been chosen as the base for HPF development, and Fortran 77 will not be supported. This is despite the fact that the US DoD requires Fortran 77 on all its contracts, and recommends against the use of Fortran 90. There are also worries that no consideration has been made for providing HPF semantics in C.
2. Code translation
2.1 Conversion of Fortran 77 to Fortran 90 and vice versa
Migrating old Fortran 77 codes is complicated because the syntax of good Fortran 90 programs is entirely different, and many of the Fortran 90 features that are especially valuable for high performance computing are completely new and require at least some code restructuring. Porting "pure" Fortran 77 only requires replacing the comment character "C" with "!", using "&" as the continuation character appended to the continued line, removing blanks embedded inside constants or identifiers, and checking the usage of some intrinsics. Most of this can be done automatically. Clearly, tools are needed to assist the conversion process. One such tool, available from Pacific-Sierra Research Corporation, is VAST-90, which makes migrating Fortran 77 codes to Fortran 90 fairly painless and efficient. The automated translations are often surprisingly sophisticated, and with a little tweaking by the programmer, extremely efficient, clean Fortran 90 programs can be obtained from even fairly messy Fortran 77 codes. Where possible, VAST-90 converts arithmetic IFs and GOTOs to block IFs, adds CYCLE and EXIT commands to loops, changes computed GOTOs and block IFs to CASE statements, changes COMMONs into MODULEs, generates and includes interface files, collapses loops to employ Fortran 90 array syntax, and more. VAST-90 is currently available for the IBM RS6000, SUN SPARC, HP9000, DEC5000, CONVEX, CRAY and VAX/VMS.
3. Language Bindings
The provision of a standard compiler-independent inter-language calling interface remains a challenge for the toolkit. With the PGS toolkit designed in C, most of the Fortran is expected to be provided via inter-language bindings.
3.1 Fortran-C binding
There is an ANSI POSIX standard for Fortran 77 bindings. There is an ISO and ANSI POSIX standard for C bindings. However, there is no standard for a link between Fortran 77 and C. In short, there is no inter-language standard for communication between languages. Inter-language communication is platform dependent.
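The platform dependence can be seen in how a C routine is made callable from Fortran 77. The sketch below follows a convention used by several Unix Fortran compilers, though not all of them: the external name is lowercase with a trailing underscore, and every argument is passed by reference. The routine name scalrd_ is hypothetical, and the sketch assumes that Fortran INTEGER and REAL correspond to C int and float, which does not hold on every architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Intended Fortran 77 call:   CALL SCALRD(N, FACTOR, VALUES)
   The trailing underscore and the pass-by-reference arguments follow one
   common compiler convention. Other compilers use a different external
   name or need a compiler option to produce this one. */
void scalrd_(const int *n, const float *factor, float *values)
{
    int i;

    assert(n != NULL && factor != NULL && values != NULL);
    for (i = 0; i < *n; i++)
        values[i] *= *factor;
}
```

A toolkit that provides Fortran bindings this way needs a per-platform layer, such as macros that generate the external name, to hide these differences from the science code.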
3.2 HPF-Fortran 90 binding
The new HPF language features fall into four categories with respect to Fortran 90:
New directives or special comments that suggest implementation strategies or assert facts about a program to a compiler
New language syntax like FORALL statement, the PURE and EXTRINSIC attributes for procedures, and some intrinsic functions
Library that defines a standard interface to routines that have proven valuable for high performance computing
Language restrictions on the use of sequence and storage associations that may require insertion of HPF directives into standard Fortran 90 programs to preserve correct semantics under HPF
Received ERBE Inversion Subsystem code from ERBE Data Management team at LaRC (04/26/94) for parallelization studies on the distributed/parallel testbed.
Earl Hansen (MISR) visited ECS STL. He indicated that MISR will be experimenting with distributed computing for their algorithms. Currently there is no MISR code that would be useful for PGS prototyping at ECS STL.
In collaboration with MODIS team at GSFC on AVHRR prototyping. Experiences and lessons learned are being shared.
DCE installation completed on the distributed/parallel testbed. DCE cell also configured.
Identification of AVHRR routines that require conversion for 64-bit architectures complete. Cray Y-MP EL at GSFC/HPCC is used for testing portability issues for 64-bit architectures.
In collaboration with Larry Fishtahler (GSFC - Dr. Barbara Putney's group) on AVHRR prototyping. Parallelization issues were discussed.
ASTER PGS Requirements Review at JPL, Pasadena, CA (18 April 1994).
Met with the ASTER team (Moshe Pniel) to discuss their prototype of the ASTER algorithm. Some parallelization studies can be done as the code is in Fortran, although their prototype currently runs on serial machines.
Established contact with Lyn Oleson (EDC) regarding getting Landsat Level 1A algorithm for PGS prototyping. This algorithm may be a good candidate and is being explored further.
CDF (Common Data Format) support team at GSFC working on creating a version for Cray architectures
We are expecting some representative Fortran code from MODIS shortly
Contact Dr. Bruce Guenther at NASA/GSFC regarding Level 1B calibration algorithm. It's the biggest of tall poles (3000 MFlops)
Familiarize with the Landsat Level 1A algorithm
Water leaving radiances (a tall pole) algorithm by Dr. Bob Evans at the University of Miami. This is an excellent algorithm for distributed processing. It has the complexity of ECS algorithms in general. It appears to be a good prototype to study PGS sizing issues.
Contact Dr. Mark Abbot at Oregon State University. They have a big parallel processing facility with a Connection Machine (CM5). Their experiences and lessons learned would be very valuable to the prototyping efforts.
Evaluate what parallel processors can do for the tall pole algorithms (in addition to testbed architectures)
Decide on a few parallel processors based on ECS requirements and vendor specifications. Work with vendors (IBM, Cray, Sequent, Convex, etc.) to evaluate their architectures and parallelization tools.