NEB crash on IBM

Postby ashwin_r » Fri Dec 15, 2006 5:41 pm

Dear all,

I compiled VASP with the VTST code on an IBM P6x and I seem to be running into problems. Brief description of what I did first:
- When compiling, I had to add dstev.f to VASP's lapack_double.f file; I was subsequently able to link the LAPACK library directly to avoid this hack.
- Next, I had to pass a '-qextname=flush' option to the mpxlf compiler since there was an issue with an unresolved "flush" symbol in lanczos.o. I also tried commenting out all references to lanczos in chain.f and compiling without lanczos.o, which worked fine.

Now the problem: no matter how I compile, VASP works fine by itself, but the moment I switch on the elastic band part I get an error message saying "1525-108 Error encountered while attempting to allocate a data object. The program will stop." This can supposedly be fixed with the "-q64" option when compiling, which I tried; I also set all shell limits (data, stack, etc.) to unlimited. It appears (from the OUTCAR file) that the program crashes when trying to initialize the FFTs.

Sorry for being long-winded! Has anyone seen this problem before?

thanks!
Ashwin.
Last edited by ashwin_r on Sun Dec 17, 2006 6:30 pm, edited 1 time in total.
ashwin_r
 
Posts: 4
Joined: Fri Dec 15, 2006 5:20 pm
Location: Princeton University

Postby graeme » Sat Dec 16, 2006 6:32 am

If there is something in our code that is not compatible with the IBM compilers, we would certainly like to fix it.

We have it running here on a Power5 system. Increasing the default stack and data limits that you mention is definitely important. We have also used the native ESSL math libraries that IBM provides. These have BLAS, LAPACK and FFT routines.

The flush commands can be commented out. They are not compatible with all flavors of Fortran, and they are not essential. The only reason they are in the code is so that small files get updated after each ionic iteration.

I can try to reproduce your error on our local IBM machine, if you send your makefile. Also, if you learn what is going wrong, please let us know so that we can fix the code, or recommend how to update the makefile.

It's very strange that you see the problem when the NEB is turned on, but not for regular parallel calculations.

Here's one idea: have you made sure that your maxdata and maxstack limits are set when your job is run through LoadLeveler? You can add the -bmaxdata and -bmaxstack limits in your makefile, so that the vasp job is always allowed unlimited data sizes.
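
One way to check this is to print the limits from inside the job script itself, since a batch job does not necessarily inherit the limits of an interactive login. A minimal sketch, assuming a POSIX-like shell (ksh or bash) in which `ulimit -d` and `ulimit -s` report the soft data and stack limits; the function name `report_limits` is made up for illustration:

```shell
# Print the soft data-segment and stack limits seen by this shell.
# Call this at the top of the batch script, before VASP is launched.
report_limits() {
    echo "data limit:  $(ulimit -d)"
    echo "stack limit: $(ulimit -s)"
}

report_limits
```

If either line shows a number rather than "unlimited", the job is running with a restricted limit regardless of what the login shell reports.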

What about running a small neb; do you still see the error if you run something that should be within the default data size limit?

Good luck with this.
graeme
Site Admin
 
Posts: 1192
Joined: Tue Apr 26, 2005 4:25 am
Location: University of Texas at Austin

Postby ashwin_r » Sun Dec 17, 2006 4:36 pm

Thanks! I have been able to get this to run; apparently I was running out of memory. All I have done for now is use the brute-force approach and quadruple the number of processors; I need to run more tests to check the exact memory utilization. Incidentally, I can recommend two things when compiling: 1) If using lanczos.f, some compilers might require the '-qextname=flush' flag. 2) If only VASP's lapack_double.f and ESSL are being used on an IBM (see e.g. makefile.sp2), dstev.f (LAPACK routine for eigenvalues) will have to be included, although I prefer linking the LAPACK library directly if it is available on the machine.


I observe something interesting thus far: when I do a normal VASP run with 8 processors (12 GB limit), things work fine. However, when I run only 1 image (IMAGES = 1, ICHAIN = 0, SPRING = -5, LCLIMB = .TRUE.) with 8 processors (12 GB limit again), the code crashes with insufficient memory when planning the FFT. So why does the latter consume so much more memory than the former? Naively, I would expect both runs to handle only one structure, possibly with some additional information being stored in the NEB case.

thanks,
Ashwin.

Postby graeme » Sun Dec 17, 2006 4:56 pm

Ashwin, great to hear that this is working; thank you for posting to explain the memory problem.

I don't understand why a single-image NEB should be different from a normal VASP calculation. It might be interesting to check the number of bands and plane waves in the two calculations to see if the IMAGES tag changes how these values are calculated. The only other thing I can think of is that the regular VASP binary is built for the gamma point and the NEB version is not (I know this is unlikely). You can also change the parallel memory usage with the NPAR and NSIM tags, which might allow you to run on fewer processors for the same sized system.
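
A quick way to make that comparison is to pull the array-dimension lines out of each OUTCAR. This is only a sketch: it assumes the OUTCAR header reports the k-point, band and plane-wave counts on lines containing `NKPTS`, `NBANDS` and `NPLWV` (the layout may differ between VASP versions), and `outcar_dims` is a made-up helper name.

```shell
# Print the lines reporting k-points, bands and plane waves for each OUTCAR given.
outcar_dims() {
    for f in "$@"; do
        echo "== $f"
        grep -E 'NKPTS|NBANDS|NPLWV' "$f"
    done
}

# Example call (directory names are hypothetical): a plain run versus image 01.
# outcar_dims relax/OUTCAR neb/01/OUTCAR
```

If the counts differ between the two runs, the extra memory comes from the size of the calculation rather than from the NEB code itself.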
graeme
Site Admin
 
Posts: 1192
Joined: Tue Apr 26, 2005 4:25 am
Location: University of Texas at Austin

Postby ashwin_r » Sun Dec 17, 2006 5:44 pm

I checked the bands and plane waves as you suggested, and therein lies my problem. If I repeat my run by copying the *same* POSCAR file to directories 00, 01 and 02 and run the NEB, things work like a normal VASP run. My problem was that the structure in 01 was intermediate between 00 and 02, with a consequent reduction in symmetry and therefore more k-points and bands, which explains my increased memory requirements. Things appear to be working for now. Thanks for your help, Graeme!
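
For anyone who wants to repeat this control test, the directory setup can be sketched as below. It is purely a diagnostic (identical images are not a meaningful band), and the helper name `make_identical_images` is made up for illustration.

```shell
# Copy one structure file into the image directories 00, 01 and 02.
# $1 is the structure file; it defaults to POSCAR in the current directory.
make_identical_images() {
    src=${1:-POSCAR}
    for d in 00 01 02; do
        mkdir -p "$d" && cp "$src" "$d/POSCAR" || return 1
    done
}

# Example call, run from the NEB working directory:
# make_identical_images POSCAR
```

If this run fits in memory while the real band does not, the difference is in the intermediate structure, not in the NEB machinery.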

1525-108 Error encountered while attempting to allocate a da

Postby molesimu » Sat Jan 06, 2007 8:12 pm

Hello,

The same error message arose when I ran VASP NEB jobs on an IBM Power5 and a Linux cluster.

The installation of the NEB code and ordinary VASP jobs are fine on both machines. When I tested a VASP NEB job, one system, a reaction on the Pt(100) surface, works fine so far. When I tested the reaction on the Pt(111) surface, I got the error message ' 1525-108 Error encountered while attempting to allocate a data'. Both test systems have the same number of atoms; the difference is the structure of the surface.

The available memory is 24 GB per node.

Any comments and suggestions will be appreciated!
molesimu
 
Posts: 5
Joined: Wed Dec 27, 2006 10:45 pm

Postby graeme » Sun Jan 07, 2007 3:15 am

The first thing to check is that your limits are set to unlimited. Even if you have a lot of memory on your machine, AIX has a default limit for the stack and data segments.

Another possibility is that this is being caused by a bad line of code in the neb.F file. There is one variable which is not allocated properly (thanks to phydig for pointing this out). Most compilers don't seem to mind, but it would be good to check. I've made a new neb.F file with the problem fixed. The only difference in this file is how the variable old_tangent is allocated. If you get a chance to download it and try it, I would be very interested to see if this solves the problem. It is available at:

http://theory.cm.utexas.edu/vtsttools/downloads/neb.F

We have been making major revisions to these codes, and will be releasing a new major version (hopefully) this upcoming week. The NEB code in this new version has been redone and will not have this problem.

Postby molesimu » Mon Jan 08, 2007 1:12 am

The new neb.F was downloaded and the NEB was reinstalled. It seems the new neb.F does not solve this problem.

The stack and data limits were set to unlimited.

The error seems weird: NEB does not work for the Pt111 surface reaction, whereas it works for Pt100 with a system of the same size. A VASP relaxation of the system containing Pt111 also works.

Postby graeme » Mon Jan 08, 2007 2:31 am

Could you make the files available to us so that we can try to reproduce the error? I don't have any other ideas, but with the input files we could probably debug the problem.

Postby molesimu » Mon Jan 08, 2007 8:20 pm

When I use three times as many nodes for the Pt111 case as for the Pt100 case, the Pt111 case works.

In the Pt100 case, there are six images in the job and I used 6 nodes (each node has eight processors). In the Pt111 case, it now seems that at least 18 nodes are needed to run the job.
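
To put rough numbers on this, the memory available to each image can be worked out as below. This is a back-of-the-envelope sketch under several assumptions: that the 24 GB per node mentioned earlier applies to this machine, that the Pt111 job also uses six images (not stated), and that nodes are shared evenly among images. `mem_per_image` is a made-up helper.

```shell
# Memory available to each image, in GB, if nodes are shared evenly among images.
# usage: mem_per_image NODES GB_PER_NODE IMAGES
mem_per_image() {
    echo $(( $1 * $2 / $3 ))
}

mem_per_image 6 24 6    # Pt100 case: prints 24
mem_per_image 18 24 6   # Pt111 case: prints 72
```

Under those assumptions the Pt111 images need somewhere between 24 and 72 GB each.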

Graeme, could you please tell me how to send the inputs to you?

Thank you very much for your patient reply.

Postby graeme » Tue Jan 09, 2007 4:29 am

Ah, good to know that the Pt111 system works with many processors. This does seem to suggest that it is a memory limit problem. Are you sure that the stack and data are set to be unlimited in your makefile, or in your submission script? Adding a flag similar to:

-bmaxdata:0x80000000 -bmaxstack:0x10000000

in your LINK command will increase your limits for the vasp binary.
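
To see what those hexadecimal values amount to, shell arithmetic can convert them. A small sketch, assuming the values are byte counts and that the shell accepts `0x` constants in arithmetic expansion (bash and ksh do); `hex_to_mib` is a made-up helper name:

```shell
# Convert a byte count (hexadecimal or decimal) to MiB using shell arithmetic.
hex_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

hex_to_mib 0x80000000   # the -bmaxdata value: prints 2048
hex_to_mib 0x10000000   # the -bmaxstack value: prints 256
```

So the flags above request 2 GiB for data and 256 MiB for stack.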

To send your input files, if you have access to a web server you can post a link to them, or you can email a .tar.gz file to graeme at mail.utexas.edu, or you could ftp them to theory.cm.utexas.edu, using anonymous access.

