18.337J / 6.338J Applied Parallel Computing (SMA 5505)

Spring 2005

Photo of typical parallel computers.The photo represents an example of parallel computing. The Beowulf project consists of one front end and nine computers, resulting in high computing performance. (Photo by Alan Edelman.)

Course Highlights

This course features lecture notes and student projects.

Course Description

Applied Parallel Computing is an advanced interdisciplinary introduction to applied parallel computing on modern supercomputers.

Technical Requirements

Special software is required to use some of the files in this course: .gz, .tar.



Applied Parallel Computing is an advanced interdisciplinary introduction to applied parallel computing on modern supercomputers. Numerical topics include dense and sparse linear algebra, N-body problems, multigrid, fast-multipole, wavelets and Fourier transforms. Geometrical topics include partitioning and mesh generation. Other topics include applications oriented architecture, software systems, MPI, Cilk, data parallel systems, Parallel MATLAB®, caches and vector processors with hands-on emphasis on understanding the realities and myths of what is possible on the world's fastest machines. One emphasis for this course will be VHLLs or very high level languages for parallel computing. Are we there yet?


Linear Algebra (18.06)

Assignments and Projects

There are 3 homework assignments. The first assignment is to pick a topic of interest for your final project. You will need to hand in 2 progress reports, and a final project report. Also, you are required to do two presentations on the project, one before Spring break when you hand in your first progress report, and one at the end of the semester. Also you need to keep a Web site on your project. All reports and the Web site will be made public and archived.


Scribing (Commenting on Notes for One Lecture)5%
Project (Roughly Half Semester)50%


Every week there will be an optional discussion session. This is great place to get research ideas/project ideas.


3Parallel Prefix
4Dense Linear Algebra I
5Dense Linear Algebra II
6Sparse Linear Algebra I
7Parallel Computer ArchitectureHomework 0 due two days before Lec #7
8STAR-P DemoProject proposal due
9Sparse Linear Algebra II
11Domain Decomposition I
12-13Project Proposal and Progress PresentationHomework 1 due on Lec #13
14Sparse Linear Algebra III
15Fast Multipole I
17Fast Multipole II
18Fast Multipole III
19Domain Decomposition II
20Partitioning IProgress report due

Homework 2 due
21Partitioning II
22Mesh Generation
23Package Design of the Connection Machine
24SVM and SVD
26Student Project Presentations I
27Student Project Presentations II
28Student Project Presentations IIIFinal projects due

Lecture Notes

This section contains documents that could not be made accessible to screen reader software. A "#" symbol is used to denote such documents.

Selected Lecture Notes from Spring 2003 Class

The course notes below are a work in progress. The lecture numbers do not correspond to the class session numbers. The notes are available as a single file (PDF - 4.3 MB)# or as separate chapters below.

Chapter 1: Introduction (PDF - 1.2 MB)

1.1. The Machines
1.2. The Software
1.3. The Reality of High Performance Computing
1.4. Modern Algorithms
1.5. Compilers
1.6. Scientific Algorithms
1.7. History, State-of-Art, and Perspective

  • 1.7.1. Things that are not Traditional Supercomputers

1.8. Analyzing the Top500 List Using Excel

  • 1.8.1. Importing the XML File
  • 1.8.2. Filtering
  • 1.8.3. Pivot Tables

1.9. Parallel Computing: An Example
1.10. Exercises

Chapter 2: MPI, OpenMP, MATLAB®*P (PDF)

2.1. Programming Style
2.2. Message Passing

  • 2.2.1. Who am I?
  • 2.2.2. Sending and Receiving
  • 2.2.3. Tags and Communicators
  • 2.2.4. Performance, and Tolerance
  • 2.2.5. Who's got the floor?

2.3. More on Message Passing

  • 2.3.1. Nomenclature
  • 2.3.2. The Development of Message Passing
  • 2.3.3. Machine Characteristics
  • 2.3.4. Active Messages

2.4. OpenMP for Shared Memory Parallel Programming
2.5. STARP

Chapter 3: Parallel Prefix (PDF)

3.1. Parallel Prefix
3.2. The "Myth" of lg n
3.3. Applications of Parallel Prefix

  • 3.3.1. Segmented Scan
  • 3.3.2. Csanky's Matrix Inversion
  • 3.3.3. Babbage and Carry Look-Ahead Addition

3.4. Parallel Prefix in MPI

Chapter 4: Dense Linear Algebra (PDF)

4.1. Dense Matrices
4.2. Applications

  • 4.2.1. Uncovering the Structure from Seemingly Unstructured Problems

4.3. Records
4.4. Algorithms, and Mapping Matrices to Processors
4.5. The Memory Hierarchy
4.6. Single Processor Condiderations for Dense Linear Algebra

  • 4.6.1. LAPACK and the BLAS
  • 4.6.2. Reinventing Dense Linear Algebra Optimization

4.7. Parallel Computing Considerations for Dense Linear Algebra
4.8. Better Load Balancing

  • 4.8.1. Problems

Chapter 5: Sparse Linear Algebra (PDF)

5.1. Cyclic Reduction for Structured Sparse Linear Systems
5.2. Sparse Direct Methods

  • 5.2.1. LU Decomposition and Gaussian Elimination
  • 5.2.2. Parallel Factorization: The Multifrontal Algorithm

5.3. Basic Iterative Methods

  • 5.3.1. SuperLU-dist
  • 5.3.2. Jacobi Method
  • 5.3.3. Gauss-Seidel Method
  • 5.3.4. Splitting Matrix Method
  • 5.3.5. Weighted Splitting Matrix Method

5.4. Red-Black Ordering for Parallel Implementation
5.5. Conjugate Gradient Method

  • 5.5.1. Parallel Conjugate Gradient

5.6. Preconditioning
5.7. Symmetric Supernodes

  • 5.7.1. Unsymmetric Supernodes
  • 5.7.2. The Column Elimination Tree
  • 5.7.3. Relaxed Supernodes
  • 5.7.4. Supernodal Numeric Factorization

5.8. Efficient Sparse Matrix Algorithms

  • 5.8.1. Scalable Algorithms
  • 5.8.2. Cholesky Factorization
  • 5.8.3. Distributed Sparse Cholesky and the Model Problem
  • 5.8.4. Parallel Block-Oriented Sparse Cholesky Factorization

5.9. Load Balance with Cyclic Mapping

5.10. Heuristic Remapping
5.11. Scheduling Local Computations

Chapter 6: Parallel Machines (PDF)#

Chapter 7: FFT (PDF)

7.1. FFT

  • 7.1.1. Data Motion
  • 7.1.2. FFT on Parallel Machines
  • 7.1.3. Exercises

7.2. Matrix Multiplication
7.3. Basic Data Communication Operations

Chapter 8: Domain Decomposition (PDF)#

8.1. Geometric Issues

  • 8.1.1. Overlapping vs. Non-overlapping Regions
  • 8.2.2. Geometric Discretization

8.2. Algorithmic Issues

  • 8.2.1. Classical Iterations and their Block Equivalents
  • 8.2.2. Schwarz Approaches: Additive vs. Multiplicative
  • 8.2.3. Substructuring Approaches
  • 8.2.4. Accellerants

8.3. Theoretical Issues
8.4. A Domain Decomposition Assignment: Decomposing MIT

Chapter 9: Particle Methods (PDF)#

9.1. Reduce and Broadcast: A Function Viewpoint
9.2. Particle Methods: An Application
9.3. Outline
9.4. What is N-Body Simulation?
9.5. Examples
9.6. The Basic Algorithm

  • 9.6.1. Finite Difference and the Euler Method

9.7. Methods for Force Calculation

  • 9.7.1. Direct Force Calculation
  • 9.7.2. Potential Based Calculation
  • 9.7.3. Poisson Methods
  • 9.7.4. Hierarchical Methods

9.8. Quadtree (2D) and Octtree (3D) : Data Structures for Canonical Clustering
9.9. Barnes-hut Method (1986)

  • 9.9.1. Approximating Potentials

9.10. Outline
9.11. Introduction
9.12. Multipole Algorithm: An Overview
9.13. Multipole Expansion
9.14. Taylor Expansion
9.15. Operation No.1 - SHIFT
9.16. Operation No.2 - FLIP
9.17. Application on Quad Tree
9.18. Expansion from 2-D to 3-D
9.19. Parallel Implementation

Chapter 10: Partitioning and Load Balancing (PDF - 1.1 MB)

10.1. Motivation from the Parallel Sparse Matrix Vector Multiplication
10.2. Separators
10.3. Spectral Partitioning - One way to Slice a Problem in Half

  • 10.3.1. Electrical Networks
  • 10.3.2. Laplacian of a Graph
  • 10.3.3. Spectral Partitioning

10.4. Geometric Methods

  • 10.4.1. Geometric Graphs
  • 10.4.2. Geometric Partitioning: Algorithm and Geometric Modeling
  • 10.4.3. Other Graphs with Small Separators
  • 10.4.4. Other Geometric Methods
  • 10.4.5. Partitioning Software

10.5. Load-Balancing N-body Simulation for Non-uniform Particles

  • 10.5.1. Hierarchical Methods of Non-uniformly Distributed Particles
  • 10.5.2. The Communication Graph for N-body Simulations
  • 10.5.3. Near-field Graphs
  • 10.5.4. N-body Communication Graphs
  • 10.5.5. Geometric Modeling of N-body Graphs

Chapter 11: Mesh Generation (PDF)

11.1. How to Describe a Domain?
11.2. Types of Meshes
11.3. Refinement Methods

  • 11.3.1. Hierarchical Refinement
  • 11.3.2. Delaunay Triangulation
  • 11.3.2. Delaunay Refinement

11.4. Working With Meshes
11.5. Unsolved Problems

Chapter 12: Support Vector Machines and Singular Value Decomposition (PDF )

12.1. Support Vector Machines

  • 12.1.1. Learning Models
  • 12.1.2. Developing SVMs
  • 12.1.3. Applications

12.2. Singular Value Decomposition

Bibliography (PDF)

Selected Lecture Notes from Spring 2003 Class

1Introduction (PDF - 1.3 MB)
4Parallel Prefix (PDF)
5Parallel Computer Architecture I (PDF)
6Parallel Computer Architecture II (PDF)
7Dense Linear Algebra I

(PDF 1) (Courtesy of James Demmel. Used with permission.)
(PDF 2)
(PDF 3 - 1.1 MB)
(PDF 4) (Courtesy of Jack Dongarra, University of Tennessee. Used with permission.)
11Graph Partitioning (PDF)
14N-Body Problem - Barnes Hut (PDF)#
18Domain Decomposition (PDF)#
23Support Vector Machines (PDF 1) (PDF 2)# (Courtesy of Tong Wen. Used with permission.)


Special software is required to use some of the files in this section: .gz, .tar.

There are 3 homework assignments. Associated files are provided for assignments 0 and 1.

Homework 0: Buffon-Laplace Needle Problem (PDF)
hw0.tar.gz (GZ)

Homework 1: Parallel Matrix Power (via Prefix) (PDF)
hw1.tar.gz (GZ)

Homework 2: Parallel Sparse Conjugate Gradients (PDF)


The project counts toward 50% of the course grade. Students are required to hand in 2 progress reports, and a final project report. Below are samples of final student reports from the 2003 class.

All work is done by students named and used with permission.

Comparison of Programming and Synchronization Techniques (PDF)

Sean Lie

FFTW and MATLAB®*P (PDF)Richard Hu
Parallel Implementation of a Multi-Length Scale Finite Element Method (PDF)Trevor Tippetts
Java™ MPI in MATLAB®*P (PDF)Max Goldman and Da Guo
A Parallel Hierarchical Solver for the Poisson Equation (PDF)R. Sudarshan and Seung Lee
Sparse Matrix Implementation on MATLAB®*P (PDF)Stu Blair
Parallelizing Incremental Bayesian Segmentation (IBS)Joseph Hastings and Siddhartha Sen

Some Ideas

  • Parallel Jacobi Computation/Simulation
  • Running MATLAB®*P across Multiple Clusters via IMPI
  • 'Automatically' parallelize for-loops with mm mode?
  • Use Video Cards to do Interesting Parallel Computation
  • Parallel Filters
  • Write new front end for *p- Octave? R? Maxima? Write interpreter?
  • Try MATLAB® Optimization Stuff in *P
  • Computer Graphics: Ray Tracing
  • Computer Graphics: Simulate Cloth and Movement of Cloth
  • Computer Graphics: Any Sort of Animation
  • Simulate a Musical Instrument (Wind, String, Percussion, Other?)
  • Simulate a Stellar Collision (Scientific American, Nov 2002)
  • Parallel Linear Programming (Perhaps take Advantage of Star*P)
  • HTML Renderer
  • Integer Factorization by Quadratic or Numberfield Sieve
  • Replicate a known result: 2^20,996,011-1 is prime. 4-color map theorem. 4x4x4 3D tic-tac-toc-toe is first-player (?) win. Kepler conjecture.
  • Create a game playing program for some game other than Chess. Or chess, if you really want.
  • Generate Some Fractal Images
  • Program perturbation. Investigate the effects of typos for a particular programming language/compiler. Start with a working program and introduce perturbations (typos). What percentage cause a compile-time error?
  • Computational biology--Motif discovery: given one set of strings labeled positive, and another labeled negative, find the word (substring) that is present in each positives string not in any of the negatives, with the complication that there is noise: many strings are misclassified in the wrong set.
  • Computational biology. A genome typically contains instances of long chunks of DNA that are identical to another chunk somewhere else. Remove these duplicated chunks.
  • Keep in mind that the human genome will not fit in RAM.
  • Theorem prover. Either parallelize a theorem prover, or use one to prove something interesting.

Related Resources

Here are a list of links to interesting readings related to this course.

Q: Is Supercomputing dead? (Oct 1994)
A: There is a Future for High-performance Computing (Sep 98)

Is this the supercomputing model for the future? Beowulf
Or this? Blue Gene - Petaflop Supercomputer

Parallel MATLAB®

Documents Relating to STAR-P and Predecessors

Parallel MATLAB® Survey

Grid Computing

Grid Computing Info Center - All you Need to Know about Grid Computing

Globus - One of the Major Computational Grid Projects

World Record for Largest Dense Linear System Solved (Though Really a Structured Application of Multipole)

Held by UIUC

Parallel Architecture

MPI - Message Passing Interface

SETI@home - distributed Computing over the Internet

Parallel Softwares

Netlib - Collection of Mathematical Software, Many of them Parallel

The ACTS Toolkit - a Collection of Scientific Computation Software, Many of them will be Discussed in the Course

ParMETIS - A Parallel Graph Partitioning Software

Fast Computers

TOP500 Supercomputers

ASC - Home of the Fastest Computer in the World


Parascope - A Listing of Parallel Computing Sites

Pictures of Supercomputers

A Timeline of Supercomputing History



驗證碼請輸入7 + 4 =





Creative Commons授權條款 本站一切著作係採用 Creative Commons 授權條款授權。