VASP modifies IMAGES in the NEB calculation: how to deal with it?

Vasp transition state theory tools


phydig
Posts: 8
Joined: Sat Dec 09, 2006 5:22 am

VASP modifies IMAGES in the NEB calculation: how to deal with it?

Post by phydig »

Dear all:
When performing an NEB calculation I specified IMAGES=2 and created four subfolders named
00, 01, 02, and 03, respectively.
But I found that only 01/ contained output files, so I suspected that IMAGES might have been changed during execution.
I added some tracking statements to the source file, recompiled VASP, and reran it on my system by typing:
bsub -q rms -n 16 -e err.1 -o log.1 ~/bin/vasp_NEB

The information in the output files showed that VASP had indeed modified IMAGES.
The source file was modified like this (in main_mpi.F, line 81):
IF (IMAGES>0) THEN
   write(tiu6,*)"IN main_mpi.F call M_divide! IMAGES= ", IMAGES !ADD
   CALL M_divide( COMM_WORLD, IMAGES, COMM_CHAIN, COMM, .TRUE. )
   write(tiu6,*)"Returning from M_divide,IMAGES= ", IMAGES !ADD

The output was:
IN main_mpi.F call M_divide! IMAGES= 2
Returning from M_divide,IMAGES= 1

IMAGES was changed! So the endpoints of the chain were no longer 00/POSCAR and 03/POSCAR, but 00/POSCAR and 02/POSCAR,
as indicated by the following information generated from neb_init (chain_init calls neb_init).
(Some output statements were added to the source file.)
POSCAR: SYS:Pt100(2x2)+Nb+Ob 01
positions in cartesian coordinates
No initial velocities read in
CHAIN: Read ICHAIN 0
CHAIN: Running the NEB
Call neb_init
NEB Params: spring, spring2, efirst, elast, ispring, spower:
-5.00000000000000 -7.00000000000000 -103.462770000000
-102.541359000000 2 1.00000000000000
images= 1

POSCAR: System: Pt100(2x2)+NO Upright 00
positions in direct lattice
velocities in cartesian coordinates
idir=images+1 = 2

POSCAR: SYS:Pt100(2x2)TST 02
positions in direct lattice
No initial velocities read in
Current node is: 1
Return from neb_init

I have analyzed why IMAGES was changed to 1. It is passed as a parameter to the subroutine M_divide,

IF (IMAGES>0) THEN   !(main_mpi.F line 81)
   CALL M_divide( COMM_WORLD, IMAGES, COMM_CHAIN, COMM, .TRUE. )

and is modified by this statement in subroutine M_divide:

IF (NPAR >= COMM%NCPU) NPAR=COMM%NCPU

because COMM%NCPU here was equal to 1. But I requested 16 CPUs, so why?

I specified NPAR=16 and IMAGES=2; the following are the top lines of the OUTCAR:
vasp.4.6.9 24Apr03 complex
executed on True64 date 2006.12.09 17:56:06
running on 1 nodes
each image running on 1 nodes
distr: one band on 1 nodes, 1 groups

Why does the OUTCAR report running on 1 nodes and each image running on 1 nodes?
I expected these lines to read:
vasp.4.6.25 17Sep03 complex
executed on True64 date 2006.12.09 17:56:06
running on 16 nodes
each image running on 8 nodes
distr: one band on 1 nodes, 8 groups

My system is:
Compaq Tru64 UNIX V5.1A (Rev. 1885); Tue Mar 25 12:25:36 CST 2003
This is sigma-x, a Compaq AlphaServer SC V2.5 system

LSF version is:
Platform LSF AlphaServer SC V2.5 UK1, Feb 21 2003

Thanks a lot for your kind consideration.
graeme
Site Admin
Posts: 2256
Joined: Tue Apr 26, 2005 4:25 am

Post by graeme »

It sounds like lsf has only given the job 1 processor instead of 16. This would explain the output that you see, wouldn't it?

If this is the case, it is not a problem with the neb - rather something with the queuing system or mpi.

Can you run a regular (non-neb) calculation on many processors?

Are you using an mpi version of vasp, i.e. built with -DMPI?
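One quick way to check the first point: the OUTCAR header reports how many MPI processes actually started, so after a regular run that requested many processors you can do something like

grep "running on" OUTCAR

and it should say "running on 16 nodes" (rather than 1) if the MPI build really launched 16 processes.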
phydig
Posts: 8
Joined: Sat Dec 09, 2006 5:22 am

Post by phydig »

Thank you very much for your reply.

It seems that only 1 processor is performing the task. But when I type the bjobs command, 16 CPUs are listed:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
37318 phydig RUN rms sigma-x0 sigma-x2 *ijob ./v5 Dec 10 09:36
sigma-x2
sigma-x2
sigma-x2
sigma-x3
sigma-x3
sigma-x3
sigma-x3
sigma-x8
sigma-x6
sigma-x6
sigma-x6
sigma-x4
sigma-x4
sigma-x5
sigma-x5

This time I submitted my task by typing: bsub -q rms -n 16 -e err.10 -o log.10 mpijob ./v5
where v5 is the MPI version of VASP with the NEB implementation, and my MPI is Compaq MPI V1.96.
When I compiled VASP, I did use the -DMPI directive. But I am not sure whether the line
BLAS = -lcxml
in the original Makefile should be changed to
BLAS = -lcxmlp
because the system prints the following message each time I log in:
Compaq Extended Math Library is available on this machine.
Users can link your routines with -lcxml option (-lcxmlp for parallel).
I tried both options and found that both versions of VASP run.
When running a regular (non-NEB) calculation on many processors, the top lines of the OUTCAR are:
vasp.4.6.9 24Apr03 complex
executed on True64 date 2006.12.07 09:51:32
running on 1 nodes
distr: one band on 1 nodes, 1 groups
The task was submitted by typing bsub -q rms -n 16 ~/bin/vasp
This problem confuses me. Thanks a lot for your kind help.
graeme
Site Admin
Posts: 2256
Joined: Tue Apr 26, 2005 4:25 am

Post by graeme »

Since you get the same problem with a regular vasp calculation, this is probably not a problem with our neb routines. The problem is likely related to the fact that you are running vasp directly, instead of using mpirun. For your calculations, you should try replacing

~/bin/vasp

with

mpirun -np 16 -machinefile (some_file_that_you_get_from_pbs) ~/bin/vasp

To debug this, you can also try running this second command directly, using a machinefile containing a list of the machines you want to run on.
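For example, under LSF something along these lines should work (a rough sketch only; it assumes LSF exports the allocated hosts in the LSB_HOSTS variable, and the mpirun flags may differ for Compaq MPI, which often uses prun or dmpirun instead):

#!/bin/sh
# submit with: bsub -q rms -n 16 -e err.1 -o log.1 ./run_vasp.sh
# build a machinefile (one host name per line) from the hosts LSF allocated
for host in $LSB_HOSTS; do echo $host; done > machinefile
# start 16 MPI processes of the parallel vasp binary
mpirun -np 16 -machinefile machinefile ~/bin/vasp

The machinefile itself is just a plain list of node names, one per line (sigma-x2, sigma-x3, ...), so you can also write one by hand to test the mpirun command interactively.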
phydig
Posts: 8
Joined: Sat Dec 09, 2006 5:22 am

Post by phydig »

Thanks a lot for your kind help.
My system provides the command mpijob instead of mpirun, and the reference manual says that parallel and serial jobs should be submitted in the same way. I tried submitting my task like this:
bsub -q rms -n 16 prun ~/bin/vasp
Then the top lines of OUTCAR were:
vasp.4.6.9 24Apr03 complex
executed on True64 date 2006.11.19 08:03:30
running on 16 nodes
distr: one band on 1 nodes, 16 groups

But I have not yet been able to test the NEB method on my system, because the computational resources are not available at the moment.
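When the resources become available I plan to submit the NEB run in the same way, using the NEB binary from my first post (untested so far):
bsub -q rms -n 16 prun ~/bin/vasp_NEB
With IMAGES=2 and 16 CPUs, each image should then run on 8 CPUs.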
phydig
Posts: 8
Joined: Sat Dec 09, 2006 5:22 am

Post by phydig »

The parallel version of VASP can be compiled successfully on my system.
However, when run, VASP aborts immediately after starting with "LAPACK: Routine ZPOTRF failed!"
I have fixed this problem by lowering the optimization level for mpi.F.
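A per-file rule of roughly this form in the VASP makefile does the job (a sketch only; $(CPP), $(FC) and $(SUFFIX) are assumed to be defined as in the standard vasp.4.6 makefiles, and -O1 is just an example level):

mpi.o : mpi.F
	$(CPP)
	$(FC) -O1 -c $*$(SUFFIX)

The exact compiler flags are platform specific, so they are best copied from the generic .F.o rule of your own makefile with only the optimization level reduced.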
Since the NEB method essentially requires parallelization, might this lead to a decrease in efficiency?
If so, how can I improve the efficiency of my NEB calculations?
Thanks for your kind consideration.
graeme
Site Admin
Posts: 2256
Joined: Tue Apr 26, 2005 4:25 am

Post by graeme »

I don't think you will see any decrease in performance by changing the optimization level in mpi.F. Most of the time is spent in the blas/lapack and fft routines. If you use efficient versions of blas/lapack, such as goto, acml, or mkl, and possibly fftw for the fft routines, you will have an efficient binary.
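For example, on an x86 cluster the relevant makefile lines might look roughly like this (a sketch only; the library paths are placeholders, and on Tru64/Alpha the cxml library discussed above plays this role instead):

BLAS   = -L$(HOME)/lib -lgoto -lpthread
FFT3D  = fftw3d.o fft3dlib.o $(HOME)/lib/libfftw3.a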
phydig
Posts: 8
Joined: Sat Dec 09, 2006 5:22 am

Post by phydig »

Thank you very much!