Valkyrie should only be used for parallel program development and measurement. If you need to use a Linux or UNIX system, please use your student account on the Solaris machines in the Advanced Programming Environment (APE) lab, located in APM 6426.
For the time being, consult this web page for information on using Valkyrie. (ACS has not yet updated their web page since a recent software upgrade.)
When you first log in, the system will set up your ssh keys, beginning with a message like the following:

It doesn't appear that you have set up your ssh key.
This process will make the files:
/home/cs160s/<your account>/.ssh/identity.pub
/home/cs160s/<your account>/.ssh/identity
/home/cs160s/<your account>/.ssh/authorized_keys
Generating public/private rsa1 key pair.
You will then be asked 3 questions shown below. Be sure
to hit carriage return (entering no other input) in response to
each question:
Enter file in which to save the key (/home/cs160s/<your account>/.ssh/identity):
Created directory '/home/cs160s/<your account>/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cs160s/<your account>/.ssh/identity.
Your public key has been saved in /home/cs160s/<your account>/.ssh/identity.pub.
The key fingerprint is:
<several 2-digit hex numbers separated by :> <your account>@valkyrie.ucsd.edu
We'll be using the bash shell. Modify your .bash_profile using information found in /export/home/cs260x-public/bash_profile. (From now on we'll refer to the directory /export/home/cs260x-public as $(PUB).)
Eventually we'll use the Intel C++ compiler, but for now we'll use a special version of the GNU C++ compiler that incorporates the MPI libraries. To use these compilers, you must have the following environment variable set:
export PATH=/opt/mpich/myrinet/gnu/bin:$PATH
You may also want to set the MANPATH so you can access the MPI manual pages:
export MANPATH=/opt/mpich/gnu/man:$MANPATH
Both of these are set for you in the bash_profile file provided.
(The default profile also defines the PATH to
include $(PUB)/bin.)
Note: the ACS web page is out of date. The correct PATH and MANPATH for MPI are as described here.
To compile your programs, use the makefiles provided for you. These makefiles include an architecture (arch) file containing the appropriate compiler settings.
Run your program with the mpirun command. The command provides the -np flag so you can specify how many processes to run with. There are 16 nodes, numbered 0 through 15. Be sure to use the "-1" flag (the number one) to make sure that you don't run on the front end. Also be sure to run your program in a subdirectory of your home directory, as it is not possible to run in $(PUB).
To establish that your environment has been set up correctly, compile and run the parallel "hello world" program. This program prints "Hello world" from each process along with the process ID. It also reports the total number of processes in the run. The hello world program is found in $(PUB)/examples/Basic/hello. To run the program, use mpirun as follows:
mpirun -np 2 -1 ./hello
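For reference, here is a minimal sketch of what such a hello world program looks like in MPI. (This is an illustration only; the actual source in $(PUB)/examples/Basic/hello may differ in its details.)

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);                 /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* total number of processes */

    if (rank == 0)
        printf("# processes: %d\n", nprocs);
    printf("Hello world from node %d\n", rank);

    MPI_Finalize();                         /* shut down MPI */
    return 0;
}

Compile it with the MPI-enabled compilers on your PATH, or with the course makefiles described above.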
Here is some sample output:

# processes: 2
Hello world from node 0
Hello world from node 1

You must specify "./" before the executable. Note that any command line arguments come in the usual position, after the name of the executable. Thus, to run the Ring program (found in $(PUB)/examples/Ring) on 4 processes with command line arguments -t 5 and -s 1024, type:
mpirun -np 4 -1 $(PUB)/examples/Ring/ring -t 5 -s 1024
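To give a feel for what such a program does, here is a sketch of a ring-style message passing loop. It is not the actual Ring source; in particular, the meanings of -t (number of trips around the ring) and -s (message size in bytes) are our assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, nprocs, next, prev, i, t;
    int trips = 1, size = 1;   /* defaults; overridden by -t and -s below */
    char *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* crude flag parsing; -t and -s meanings are assumed, not taken
       from the actual Ring program */
    for (i = 1; i < argc - 1; i++) {
        if (!strcmp(argv[i], "-t")) trips = atoi(argv[i + 1]);
        if (!strcmp(argv[i], "-s")) size = atoi(argv[i + 1]);
    }

    buf  = malloc(size);
    next = (rank + 1) % nprocs;           /* right-hand neighbor */
    prev = (rank + nprocs - 1) % nprocs;  /* left-hand neighbor  */

    for (t = 0; t < trips; t++) {
        if (rank == 0) {  /* process 0 injects the message, then waits for it */
            MPI_Send(buf, size, MPI_BYTE, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, prev, 0, MPI_COMM_WORLD, &status);
        } else {          /* the others receive from the left and pass right */
            MPI_Recv(buf, size, MPI_BYTE, prev, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, size, MPI_BYTE, next, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("%d trip(s), %d byte message, %d processes\n",
               trips, size, nprocs);

    free(buf);
    MPI_Finalize();
    return 0;
}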
Sometimes you'll want to specify particular nodes to run on. To do this you need to specify a machines file listing the names of the physical nodes. The command line takes the following form:
mpirun -np <# NODES> -1 -machinefile <MACHINE_FILE> <program> <args>
The machine file contains a list of physical node names, one per line. The nodes are numbered from 0 to 15 and are named compute-0-0 through compute-0-15. (Each node contains 2 CPUs, but in effect you may use only 1 CPU per node.) Thus, to run the ring program with nodes 6, 7, 11, and 14 as logical processes 0-3, create the following file, say mfile:
compute-0-6
compute-0-7
compute-0-11
compute-0-14

To run, type:
mpirun -np 4 -1 -machinefile mfile ./ring -t 5 -s 1024
We have provided a Python script to generate randomized machine files: $(PUB)/bin/randMach.py. The command line argument specifies the number of processors in the machine file. For example, the command randMach.py 7 > mach was used to generate the following 7-line machine file:
compute-0-8
compute-0-0
compute-0-1
compute-0-15
compute-0-4
compute-0-9
compute-0-2
Generate a machine file with ONE entry for each node you want to use; do not list any machine entry twice. Then specify the number of processes you want to run with, along with the machine file. For example, if you want to run with 6 processes using this machine file p3
compute-0-1
compute-0-2
compute-0-8

you enter
mpirun -np 6 -machinefile p3 ./a.out
THIS IS SUBJECT TO CHANGE, as MKL is not yet available. To link with MKL, use the following on your load line:
${MKLPATH}/libmkl_lapack.a ${MKLPATH}/libmkl_ia32.a ${MKLPATH}/libguide.a -lpthread

where ${MKLPATH} has been set as follows (e.g. in your .bash_profile):
export MKLPATH=/opt/intel/mkl61/lib/32
If you want to use the FFT or DFT, add -lm to your link line.
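To illustrate, here is a sketch of a small C program that calls the LAPACK routine dgesv from libmkl_lapack to solve a 2x2 linear system. It assumes MKL exports the usual Fortran-style LAPACK interface (column-major arrays, all arguments passed by reference).

#include <stdio.h>

/* Fortran-style LAPACK prototype, assumed exported by libmkl_lapack */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* Solve A x = b for A = [3 1; 1 2], b = (9, 8); expect x = (2, 3).
       Arrays are stored column-major, per Fortran convention. */
    double a[4] = { 3.0, 1.0, 1.0, 2.0 };
    double b[2] = { 9.0, 8.0 };
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);  /* solution overwrites b */
    else
        printf("dgesv failed: info = %d\n", info);
    return 0;
}

Compile it as C (e.g. with gcc) and link using the load line shown above.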
Documentation can be found at http://developer.intel.com/software/products/mkl/docs/mklqref/index.htm.
Sometimes runaway processes will persist after a run. This can occur if you break a run using control-C. If you feel that the machine is slowing down, run the command
ganglia load_one | sort -n -k 2
which displays the load on each node. Since there are 2 CPUs per node, the load should be no more than 2.5 or so (i.e., at most 2 PBS jobs running per node) unless there are other processes running on the node. The extra load could be due to valid user jobs or to runaway processes.
To see what you have running on the compute nodes, use the cluster-ps command:

cluster-ps <username>

(If any nodes are down, you'll be notified.)
If you see that you have processes running:
compute-0-13:
cs260x 12208 0.0 0.1 5172 1228 ? S Oct14 0:00 ./parallel_jacobi 4 .001 100000
cs260x 12209 0.0 0.1 5904 1228 ? S Oct14 0:00 ./parallel_jacobi 4 .001 100000

use the cluster-kill command to delete them:
cluster-kill <username>

Ignore messages of the following form:
compute-0-13: kill 9363: Operation not permitted
Connection to compute-0-13 closed by remote host.
compute-0-14: kill 9904: Operation not permitted
Connection to compute-0-14 closed by remote host.
compute-0-15: kill 9662: Operation not permitted
Connection to compute-0-15 closed by remote host.
When done, re-run the cluster-ps command to make sure all is clear, but specify the user "cs260x" in order to search all course user IDs (including your instructor!). The following variant filters out extraneous commands, making it easier to locate runaway processes:
cluster-fork 'ps aux' | egrep "cs260x|compute-" | sed -f $PUB/bin/cl.sed

If you find other running processes, and the user is not logged in (you can find that out with the who command), then notify the user by email. Since email doesn't work on Valkyrie, you'll need to finger the user's real name (e.g. finger cs260x) and then check the UCSD database, as in finger username@ucsd.edu.
As a matter of etiquette, be sure to run cluster-ps before logging out. If you plan to be on the machine for a long time, it's a good idea to run this command occasionally, and before you start a long series of benchmark runs.
MPI documentation is found at http://www.cse.ucsd.edu/classes/fa05/cse260/testbeds.html#MPI. You can obtain man pages for the MPI calls used in the example programs described here at http://www-unix.mcs.anl.gov/mpi/www/www3/.
Copyright © 2005 Scott B. Baden. Last modified: Wed Sep 14 18:34:49 PDT 2005