From: Axel Arnold
Subject: Re: [ESPResSo-users] mpi and compressed block files
Date: Thu, 06 Sep 2012 22:47:56 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120827 Thunderbird/15.0
Dear Martin,

this is basically the same problem you already ran into when reading back all Tcl variables: you read back all values, and some of them are incompatible with changing the number of nodes. Here it is the processor node grid, which of course has to fit the number of nodes at the time the checkpoint was written, i.e., you can only read this variable back if you don't change the number of processors.
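For illustration, the constraint at the Tcl level looks like this (a minimal sketch using the standard setmd interface; the concrete grid values are just examples):

    # On a serial run the stored grid is {1 1 1}; reading that block back
    # on 4 MPI ranks triggers "node_grid incompatible with current n_nodes".
    puts "node_grid = [setmd node_grid]"
    # A grid compatible with 4 ranks has to factorize 4, e.g.:
    # setmd node_grid 2 2 1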
Just like for Tcl variables, there are blacklists for not reading back certain setmd variables (see the user's guide), so you can specify which variables you really need to restore. However, in the case of checkpoints there is one more concern you should be aware of: if you use a thermostat that relies on random numbers, such as the standard Langevin thermostat, the random numbers are only reproducible if you use the same node_grid (and hence the same number of nodes) and restore the random seeds. Therefore, for true checkpointing, you need to save node_grid and restore it on the same number of nodes. In addition, you need to unconditionally recreate the Verlet lists, which requires the command "invalidate_system" right after writing the checkpoint.
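A minimal sketch of such a checkpoint writer (the block tags "variable", "random", "particles" and "interactions" are the ones documented in the user's guide; the file name and the particle fields are just examples):

    # Write a gzip-compressed checkpoint through a pipe.
    set out [open "|gzip -c - > checkpoint.block.gz" "w"]
    blockfile $out write variable all          ;# includes node_grid
    blockfile $out write random                ;# RNG state, needed for Langevin
    blockfile $out write particles {id pos v}
    blockfile $out write interactions
    close $out
    # The Verlet lists are not part of the checkpoint, so force their
    # rebuild right after writing it.
    invalidate_system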
In your case, it seems that you are creating the setup serially and then want to go parallel. In that case, saving random seeds etc. is not necessary, and you should only save those setmd variables that you actually changed during your setup script.
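For the serial-to-parallel case, reading back could then look like the following sketch; the blacklist variable name (blockfile_variable_blacklist) is the one the user's guide uses for setmd variables, so please check it against your version:

    # Skip node_grid when restoring, since the number of processors differs
    # from the run that wrote the file.
    set blockfile_variable_blacklist {node_grid}
    set in [open "|gzip -cd checkpoint.block.gz" "r"]
    while { [blockfile $in read auto] != "eof" } {}
    close $in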
Cheers,
Axel

On 09/06/2012 02:50 PM, Martin Lindén wrote:
Hi!

I am fairly new to ESPResSo and have some trouble with reading checkpoints, as described at the end of Sec. 10.1.7 in the user's guide for 3.1.0. To reproduce the problem:

1. Run blockread3.tcl in serial mode. This reads an uncompressed and a compressed version of a blockfile (identical content), and works as expected:

       Espresso blockread3.tcl

2. Run in MPI mode with one processor. Somewhat artificial, but works:

       mpirun -n 1 Espresso blockread3.tcl

3. The problem is MPI on multiple processors:

       mpirun -n 4 Espresso blockread3.tcl
       (...)
       WARNING: node_grid incompatible with current n_nodes, ignoring
       error waiting for process to exit: child process lost
       (is SIGCHLD ignored or trapped?)
           while executing
       "close $innnn"
           (file "blockread3.tcl" line 14)
       --------------------------------------------------------------------------
       mpirun noticed that the job aborted, but has no info as to the
       process that caused that situation.
       --------------------------------------------------------------------------

Two processors (mpirun -n 2 ...) sometimes go through and sometimes crash, but more than two always crash on my system. A temporary fix is of course to stay away from compressing the block files, but it would be nice to be able to work with compressed files when I go to larger systems.

System info:
ESPResSo-3.1.0 { Compilation status { FFTW } { BOND_ANGLE_HARMONIC } { LENNARD_JONES } { LJCOS } { LJCOS2 } { MPI_CORE } { EXCLUSIONS } }
mpirun (Open MPI) 1.5.4
gzip 1.4
Ubuntu 12.04, 64 bit

Sincerely,
Martin
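(For reference, the failing read pattern presumably looks like the following sketch, using the gzip-pipe idiom for compressed blockfiles from the user's guide; file names and the channel variable are placeholders:)

    # Plain blockfile: reads fine once node_grid fits the current run.
    set in [open "config.block" "r"]
    while { [blockfile $in read auto] != "eof" } {}
    close $in
    # Compressed copy, read through a gzip pipe; with several MPI ranks the
    # stored node_grid conflicts with n_nodes and closing the pipe fails
    # with the "child process lost" error quoted above.
    set in [open "|gzip -cd config.block.gz" "r"]
    while { [blockfile $in read auto] != "eof" } {}
    close $in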
--
JP Dr. Axel Arnold
ICP, Universität Stuttgart
Pfaffenwaldring 27
70569 Stuttgart, Germany
Tel: +49 711 685 67609
Email: address@hidden