The GPU version of LICOM3 under HIP framework and its large-scale application

A high-resolution (1/20°) global ocean general circulation model with Graphics Processing Unit (GPU) code implementations is developed based on the LASG/IAP Climate System Ocean Model version 3 (LICOM3) under the Heterogeneous-compute Interface for Portability (HIP) framework. Both the dynamic core and the physics package of LICOM3 are ported to the GPU, and three-dimensional parallelization is applied. The HIP version of LICOM3 (LICOM3-HIP) is 42 times faster than the same number of CPU cores when 384 AMD GPUs and CPU cores are used. LICOM3-HIP has excellent scalability; it can still obtain a speedup of more than four on 9216 GPUs compared with 384 GPUs. In this phase, we successfully performed a test of the 1/20° LICOM3-HIP using 6550 nodes and 26200 GPUs, and at this grand scale the time to solution of the model still increases, reaching about 2.72 simulated years per day (SYPD). The high performance is achieved by putting almost all of the computation inside GPUs, which greatly reduces the time cost of data transfer between CPUs and GPUs. At the same time, a 14-year spin-up integration following the phase 2 of the Ocean Model Intercomparison Project (OMIP-2) protocol of surface forcing has been conducted, and the preliminary results have been evaluated. We found that the model results differ little from those of the CPU version. Further comparison with observations and lower-resolution LICOM3 results suggests that the 1/20° LICOM3-HIP can not only reproduce the observations but also produce much smaller-scale activities, such as submesoscale eddies and frontal-scale structures.

Both the low-resolution (1°; Lin et al., 2020) and the high-resolution (1/10°; Li Y. et al., 2020) stand-alone versions of LICOM3 are also involved in OMIP-1 and OMIP-2; their outputs can be downloaded from the corresponding websites. The performances of these two versions of LICOM3 compared with other CMIP6 ocean models are shown in Tsujino et al. (2020) and Chassignet et al. (2020), respectively. The 1/10° version has also been applied to short-term ocean forecasting (Liu et al., 2020, under review).

Configurations of models
To investigate the GPU version, three configurations are employed in the present study: 1°, 1/10°, and 1/20°. Details of these models are listed in Table 1. The numbers of horizontal grid points for the three configurations are 360×218, 3600×2302 and 7200×3920, respectively. The low-resolution version has 30 vertical levels, while the other two high-resolution models have 55. Going from 1° to 1/20° increases the computational effort by about 8000 (20³) times (including a factor of 20 for the decreased time step); together with the vertical resolution increase from 30 to 55 levels, the total increase is approximately 15000 times. The original CPU version of the 1/20° model with MPI parallelism on Tianhe-1A only reached 0.31 SYPD using 9216 CPU cores. At that speed, a 10-year spin-up simulation of LICOM3 would take more than one month, which is not practical for climate research. Therefore, such simulations are suited to extreme-scale high-performance computers using the GPU version.
Besides the differences in grid points, three main aspects differ among the three experiments, particularly between the 1° version and the other two. First, the horizontal viscosity schemes are different: Laplacian for 1° and biharmonic for both 1/10° and 1/20°. The viscosity coefficient is one order of magnitude smaller for the 1/20° version than for the 1/10° version, namely, -1.0×10⁹ m⁴/s for 1/10° versus -1.0×10⁸ m⁴/s for 1/20°. Second, although the forcing dataset (JRA55-do; Tsujino et al., 2018) and the bulk formula for the three experiments all follow the OMIP-2 standard, the periods and temporal resolutions of the forcing fields are different: 6-hourly data from 1958 to 2018 for the 1° version, and daily mean data for 2016 only for both the 1/10° and 1/20° versions. Third, the 1° version is coupled with the CICE4 sea ice model via NCAR's flux coupler version 7, while the two higher-resolution models are stand-alone, without a coupler or sea ice model. Additionally, the two higher-resolution experiments employ the new HIP version of LICOM3 (i.e., LICOM3-HIP), while the low-resolution experiment uses the CPU version of LICOM3, the same version submitted to OMIP (Lin et al., 2020). We also list all the important information, such as the bathymetry data and the bulk formula, in Table 1, although these items are similar in the three configurations.
The spin-up experiments for the two high-resolution versions are both conducted for 14 years, forced by the daily JRA55-do dataset of 2016. The atmospheric variables include the wind vectors at 10 m, air temperature at 10 m, relative humidity at 10 m, total precipitation, the downward shortwave radiation flux, the downward longwave radiation flux, and the river runoff. According to the evolution of the kinetic energy, the models reach a quasi-equilibrium state after more than 10 years of spin-up. The daily mean data are output for storage and analysis.

Hardware and software environments of the testing system
The two higher-resolution experiments were performed on a heterogeneous Linux cluster supercomputer located at the Computer Network Information Center (CNIC) of the CAS, China. This supercomputer consists of 7000 nodes in total, each with a 1.9 GHz x64 CPU of 32 cores. Each node is also equipped with 4 gfx906 AMD GPU cards with 16 GB memory each.
Each GPU has 64 cores, with a total of 2560 threads per card. The nodes of both partitions are interconnected through high-performance InfiniBand (IB) networks. OpenMPI version 4.02 is employed for compiling, and the AMD GPU driver and libraries are ROCm 2.9 integrated with HIP version 2.8.

Introduction to HIP on an AMD hardware platform
AMD's HIP is a C++ runtime API and kernel language. It allows developers to create portable applications that can run on AMD's accelerators as well as on CUDA devices. HIP provides an API for an application to leverage GPU acceleration for both AMD and CUDA devices. It is syntactically similar to CUDA, and most CUDA API calls can be converted in place by replacing the string "cuda" with "hip". HIP supports a strong subset of the CUDA runtime functionality, and its open-source software is available on GitHub (https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-GUIDE.html). Some supercomputers install NVIDIA GPU cards, such as the P100 and V100, while others install AMD GPU cards, for example the AMD Vega 20; hence, our HIP version of LICOM3 can adapt to and gain very high performance at different supercomputer centers, such as Tianhe-2 and AMD clusters. For coding on AMD GPUs, our experience indicates that HIP is a good choice for high-performance model development. A model ported to HIP can be used on both AMD and NVIDIA GPU platforms, and the model version is easy to keep consistent and maintain. In the following, the successful simulation of LICOM3-HIP confirms that HIP is adequate for this purpose. Figure 1 demonstrates how the HIP implementation supports different types of GPUs. Besides the differences in naming and included libraries, there are other differences between HIP and CUDA: 1) the AMD Graphics Core Next (GCN) hardware "warp" (wavefront) size is 64; 2) device and host pointers allocated by the HIP API use flat addressing (unified virtual addressing is enabled by default); 3) dynamic parallelism is not currently supported; 4) some CUDA library functions do not have AMD equivalents; and 5) shared memory and registers per thread may differ between AMD and NVIDIA hardware. Despite these differences, the majority of CUDA code in applications can be easily translated to HIP, and vice versa.
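As a minimal illustration of this similarity (our own example, not part of the LICOM3 source), the following HIP program is written exactly as one would write the corresponding CUDA code; only the header, the "hip" prefixes of the runtime calls, and the launch macro differ.

```cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

// y = a*x + y; the kernel body and built-in indices are identical to CUDA.
__global__ void axpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<double> hx(n, 1.0), hy(n, 2.0);
    double *dx, *dy;
    hipMalloc((void **)&dx, n * sizeof(double));     // cudaMalloc  -> hipMalloc
    hipMalloc((void **)&dy, n * sizeof(double));
    hipMemcpy(dx, hx.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(double), hipMemcpyHostToDevice);
    // CUDA's <<<grid, block>>> launch becomes hipLaunchKernelGGL
    hipLaunchKernelGGL(axpy, dim3((n + 255) / 256), dim3(256), 0, 0, n, 3.0, dx, dy);
    hipMemcpy(hy.data(), dy, n * sizeof(double), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                    // expect 5.0
    hipFree(dx); hipFree(dy);
    return 0;
}
```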
Technical support in CUDA and HIP also differs in some respects. For example, CUDA applications can use CUDA-aware MPI to perform direct MPI communication between different nodes in the GPU space, but HIP applications have no such functionality so far. We have to transfer data from GPU memory to CPU memory to exchange data with other nodes, and then transfer them back to GPU memory.

Core computation process of LICOM3 and C transitional version
We first tried to apply LICOM on a heterogeneous computer about five years earlier, in cooperation with the NVIDIA Corporation. LICOM2 was adapted to the NVIDIA P80 using OpenACC directives (Jiang et al., 2019). That was a convenient implementation: LICOM2-gpu achieved a speedup of 6.6 on 4 NVIDIA GPUs compared with 4 Intel CPUs, but the speedup degraded when the number of GPUs was further increased.
This time we started from the CPU version of LICOM3. The code structure of LICOM3 includes four steps. The first step is the model setup; it involves the MPI partition and ocean block distribution. The second stage is model initialization, which includes reading the input data and initializing the variables. The third stage is the integration loop, the core computation of the model, which consists of seven major subroutines: "readyt", "readyc", "barotr", "bclinc", "tracer", "icesnow", and "convadj". When the model finishes one day's computation, the diagnostics and output subroutine writes out the predicted variables to files. The output files contain all the variables necessary to restart the model and for analysis. This structure is sketched below.
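The following schematic summarizes this structure (our paraphrase in C-style pseudocode; the roles of "barotr" and "tracer" are stated in later sections, the other roles are inferred from the subroutine names):

```cpp
// Schematic of the four-step LICOM3 structure described above (not actual model source).
void readyt(); void readyc(); void barotr(); void bclinc();
void tracer(); void icesnow(); void convadj();
void setup(); void initialize(); void output_daily();

void licom3_run(int ndays, int steps_per_day) {
    setup();                                 // step 1: MPI partition, ocean block distribution
    initialize();                            // step 2: read input data, initialize variables
    for (int day = 0; day < ndays; ++day) {  // step 3: integration loop (core computation)
        for (int step = 0; step < steps_per_day; ++step) {
            readyt();  readyc();             // prepare tendency terms (inferred)
            barotr();                        // barotropic (free-surface) equations
            bclinc();                        // baroclinic momentum equations (inferred)
            tracer();                        // temperature/salinity advection and diffusion
            icesnow(); convadj();            // ice/snow and convective adjustment (inferred)
        }
        output_daily();                      // step 4: daily diagnostics, restart and analysis output
    }
}
```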
To obtain high performance, using a native GPU development language is more efficient. The CUDA ecosystem provides both CUDA-C and CUDA-Fortran; for HIP, however, the support for Fortran is not as good as that for C++. We planned to push all the core computation code into GPUs; hence, the Fortran code of the seven major subroutines had to be converted to HIP/C++. Due to the complexity and the large number of lines in these subroutines (approximately 12000 lines of Fortran code), and to make sure the converted C/C++ code is correct, we rewrote them in C first, before finally converting them to HIP code.
A bit-reproducible climate model produces exactly the same numerical results for a given precision, regardless of its execution setup, which includes the choice of domain decomposition, the type of simulation (continuous or restart), the compilers, and the architectures executing the model (i.e., the same hardware and software produce the same result). The C transitional version is bit-reproducible with the F90 version of LICOM3 (the binary output data are identical under Linux according to the "diff" command).
This C transitional version became the starting point for the HIP/C++ code and reduced the complexity of developing the HIP version of LICOM3.

Optimization and tuning methods in LICOM3-HIP
The unit of computation in LICOM3-HIP is the horizontal grid point. For example, 1/20° corresponds to 7200×3920 grid points. For convenience of MPI parallelism, the grid points are grouped into blocks; that is, if Proc_x×Proc_y MPI processes are used in the x and y directions, then each block has B_x×B_y grid points, where Proc_x×B_x = 7200 and Proc_y×B_y = 3920. Each GPU process performs the 2-D or 3-D computation on its B_x×B_y grid points, just as the corresponding MPI process does. In practice, four lateral columns are added to B_x and B_y (two on each side, imt = B_x+4, jmt = B_y+4) for the halo. Table 2 lists the frequently used block definitions of LICOM3; a small sketch of the arithmetic follows.
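The block and halo arithmetic can be illustrated as follows (our own helper, not LICOM3 code; the 24×16 layout for 384 processes is an assumed example, the actual layouts are those of Table 2):

```cpp
#include <cstdio>

// Illustrative block/halo arithmetic for the 1/20-degree grid and an assumed
// 24 x 16 = 384-process decomposition.
int main() {
    const int nx_global = 7200, ny_global = 3920;    // global 1/20-degree grid
    const int proc_x = 24, proc_y = 16;              // assumed MPI layout (384 GPUs)
    const int bx = nx_global / proc_x;               // 300 interior points per block in x
    const int by = ny_global / proc_y;               // 245 interior points per block in y
    const int imt = bx + 4, jmt = by + 4;            // two halo points added on each side
    printf("block %d x %d, padded to imt x jmt = %d x %d\n", bx, by, imt, jmt);
    return 0;
}
```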
The original LICOM3 was written in F90. To adapt it to the GPU, Fortran/C hybrid programming is applied. As shown in Figure 2, the code is kept in F90 before entering device-stepon and after stepon-out. The core computation processes within the stepons are rewritten using HIP/C. Data structures in the CPU space remain the same as the original Fortran structures. The data commonly used by F90 and C are then declared in extra C include files as "extern" pointers in C syntax to reference them. In the GPU space, newly allocated GPU global memory holds the arrays corresponding to those in the CPU space, and hipMemcpy is called to copy them in and out, as sketched below.
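A minimal sketch of this data path (variable and symbol names are hypothetical; the real code exposes the Fortran data through extern pointers declared in extra C include files, as described above):

```cpp
#include <hip/hip_runtime.h>

// Host-side Fortran array made visible to C/C++; the symbol name and its mangling
// are hypothetical and compiler dependent.
extern "C" double licom_u_[];

static double *d_u = nullptr;            // device mirror, allocated once at start-up

void copy_u_to_device(int imt, int jmt, int km) {
    size_t bytes = (size_t)imt * jmt * km * sizeof(double);
    if (d_u == nullptr) hipMalloc((void **)&d_u, bytes);     // one-time allocation
    hipMemcpy(d_u, licom_u_, bytes, hipMemcpyHostToDevice);  // host -> device before kernels
}

void copy_u_to_host(int imt, int jmt, int km) {
    size_t bytes = (size_t)imt * jmt * km * sizeof(double);
    hipMemcpy(licom_u_, d_u, bytes, hipMemcpyDeviceToHost);  // device -> host after kernels
}
```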
Seven major subroutines (including their nested calls) are converted from Fortran to HIP. The calling sequence of the seven subroutines is maintained, but each subroutine is deeply re-coded in HIP to obtain the best performance. The data in the CPU space are 2-D or 3-D arrays; in the GPU space, we change them to 1-D arrays, which helps improve the data transfer speed between different GPU subroutines.
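The 1-D layout amounts to a flattened index; one possible convention (our own illustration, the actual ordering in LICOM3-HIP may differ) is:

```cpp
#include <hip/hip_runtime.h>

// Flattened 1-D indexing of an (i, j, k) field stored in a single device array.
__host__ __device__ inline int idx3(int i, int j, int k, int imt, int jmt) {
    return (k * jmt + j) * imt + i;      // i varies fastest, as in Fortran's first index
}
// Usage inside a kernel:  d_t[idx3(i, j, k, imt, jmt)] = ...;
```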
LICOM3-HIP uses two-level parallelism, with each MPI process corresponding to one ocean block. The computation within one MPI process is then pushed into the GPU. The latency of data copies between GPU and CPU is one of the bottlenecks of the daily computation loop. To reduce the data copy time, all read-only GPU variables are allocated and copied at the initialization stage.
Some data copies are still needed in the stepping loop, e.g., for the MPI calls in barotr.cpp. In fact, four subroutines call the halo update in each step, but the halo calls in "barotr" account for more than 95% of the total (Table 3); therefore, we use only the "barotr" subroutine to represent the halo communications.
The computation block in MPI (corresponding to one GPU) is a 3-D grid; in the HIP revision, 3-D parallelism is implemented. This adds extra parallelism inside one block compared with the MPI-only parallelism (which is only 2-D). To adapt to this change, some optimizations are needed, such as enlarging the global arrays to avoid data dependencies. A demonstration of using a temporary array to parallelize the computation inside a block is given in Figure 3. Figure 3a represents a loop of the original code in the k direction. Since the variable v(i,j,k) depends on v(i,j,k+1), it causes errors when the GPU threads are parallelized in the k direction. We therefore separate the computation into two HIP kernels. In the upper part of Figure 3b, a temporary array vt is used to hold the result of f1(), and this kernel can be parallelized over GPU threads in the k direction. Then, in the lower part of Figure 3b, we use vt to do the computations of f2() and f3(); this can still be parallelized over GPU threads in the k direction. Finally, this loop is fully parallelized; a schematic HIP rendering of the transformation is given below.
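Written as HIP kernels, the transformation of Figure 3 looks roughly as follows. This is our paraphrase of the figure, not the model source: f1, f2, f3 are placeholder operators standing in for the actual finite-difference terms, and idx3 is the flattened index introduced above.

```cpp
#include <hip/hip_runtime.h>

// Placeholder operators standing in for the real terms in Figure 3.
__device__ inline double f1(double x) { return x; }
__device__ inline double f2(double x) { return 0.5 * x; }
__device__ inline double f3(double x) { return 0.5 * x; }

__host__ __device__ inline int idx3(int i, int j, int k, int imt, int jmt) {
    return (k * jmt + j) * imt + i;
}

// Stage 1: vt(i,j,k) = f1(v(i,j,k)) for every level; no vertical dependency, parallel in k.
__global__ void stage1(const double *v, double *vt, int imt, int jmt, int km) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < imt && j < jmt && k < km)
        vt[idx3(i, j, k, imt, jmt)] = f1(v[idx3(i, j, k, imt, jmt)]);
}

// Stage 2: v(i,j,k) now only reads the temporary vt at level k+1, so the read-after-write
// dependency on v is gone and the k loop parallelizes as well.
__global__ void stage2(double *v, const double *vt, int imt, int jmt, int km) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < imt && j < jmt && k < km - 1)
        v[idx3(i, j, k, imt, jmt)] = f2(vt[idx3(i, j, k + 1, imt, jmt)])
                                   + f3(v[idx3(i, j, k, imt, jmt)]);
}
```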
Parallelism on the GPU is more like a shared-memory program, and memory write conflicts occur in the advection computation of the subroutine "tracer". We restructured the if-else tree in this subroutine so that the data conflicts between neighboring grid points are avoided, making the 3-D parallelism successful. Moreover, in this subroutine we use extra arithmetic operations in place of data movement to reduce cache usage. Since these operations can be parallelized over GPU threads and do not increase the total computation time, the reduced memory caching improves the final performance of this subroutine.
A notable problem when the resolution is increased to 1/20° is that the total size of the Fortran common blocks exceeds 2 GB. This causes no abnormality for C in the GPU space, but if the data are referenced by the GPU process, the system call in hipMemcpy triggers compile errors. In this situation, we have to change the data structure of the original Fortran arrays from "static" to "allocatable". Since a GPU is limited to 16 GB of GPU memory, the ocean block size on one GPU should not be too large. In practice, the 1/20° version starts from 384 GPUs (which is regarded as the baseline for speedup here); with fewer GPUs than that, GPU out-of-memory errors sometimes occur. Variable operations in CPU and GPU memory are at least one order of magnitude faster than data transfer between GPU and CPU through the 16x PCIe bus.
As shown in Fig. 4, we pack only the data necessary for the halo operation, reducing the transfer from imt×jmt to 4(imt+jmt) values. This change reduces the data buffer size to (4/imt+4/jmt) of the original. The larger imt and jmt are, the less data is transferred. At 384 GPUs, this change saves about 10% of the total computation time. The change is valuable for HIP since the platform has no CUDA-aware MPI installed; otherwise, the halo operation could be done directly in the GPU space, as POM.gpu does. A simplified sketch of the packing kernel is given below.
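A simplified, single-variable version of the boundary packing described above (our own sketch, assuming the two-point halo of Section "Optimization and tuning methods"; the east/west case is shown, the north/south rows are packed analogously):

```cpp
#include <hip/hip_runtime.h>

// Pack the two outermost interior columns on the west (i = 2, 3) and east
// (i = imt-4, imt-3) sides of one 2-D level into a small contiguous buffer of
// size 4*jmt, instead of copying the full imt*jmt field to the CPU for MPI.
__global__ void pack_east_west(const double *field, double *buf, int imt, int jmt) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= jmt) return;
    buf[0 * jmt + j] = field[j * imt + 2];
    buf[1 * jmt + j] = field[j * imt + 3];
    buf[2 * jmt + j] = field[j * imt + (imt - 4)];
    buf[3 * jmt + j] = field[j * imt + (imt - 3)];
}
// The packed buffer is then moved with hipMemcpy, exchanged with MPI, and an
// analogous kernel writes the received values into the halo columns (i = 0, 1, imt-2, imt-1).
```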

Model I/O optimization
The time cost of reading the daily forcing from disk increases to 200 s per model day after the model resolution is increased from 1° to 1/20°. This is equivalent to the computation time of one step when 1536 GPUs are applied; hence, we must optimize it to gain further overall speedup. The low performance is caused by reading the daily data and scattering it to all nodes every model day; we therefore rewrote the data reading strategy and performed the scattering of the 10 forcing variables in parallel. As a result, the input time is reduced to about 20 s, which is 1/10 of the original (shown below).
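One possible arrangement of such parallel reading and scattering is sketched here (our own sketch, not the actual LICOM3 implementation; read_forcing_field() is a hypothetical helper that reads one daily forcing field into a newly allocated global array):

```cpp
#include <mpi.h>
#include <cstdlib>

double *read_forcing_field(int var, long npoints);   // hypothetical I/O helper

// Instead of one rank reading and scattering all 10 forcing fields sequentially,
// each of the first 10 ranks reads one field, and the fields are then scattered
// from the ranks that hold them.
void load_daily_forcing(double **local, int nglobal, int nlocal, MPI_Comm comm) {
    const int nvar = 10;
    int rank;
    MPI_Comm_rank(comm, &rank);
    double *global = nullptr;
    if (rank < nvar) global = read_forcing_field(rank, nglobal);   // reads run concurrently
    for (int v = 0; v < nvar; ++v)
        MPI_Scatter(rank == v ? global : nullptr, nlocal, MPI_DOUBLE,
                    local[v], nlocal, MPI_DOUBLE, v, comm);
    free(global);   // no-op on ranks that read nothing
}
```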
As indicated, the core computation time is about 200 s per model day using 1536 GPUs. The output (65 GB of data in binary format) for one model day needs about 250 s, which also exceeds the GPU computation time for one step. We modified the output subroutine to a parallel version, which decreases the data writing time to 70 s on the test platform (this also depends on the system I/O performance).

Model performance in computing
Performing kilometer-scale global climate simulation is a challenging task (Palmer, 2014; Schär et al., 2020). As pointed out by Balaji et al. (2017), SYPD is a useful metric to evaluate the performance of a parallel model. Because a climate model often needs to be run for at least 30-50 years for each simulation, 0.2-0.3 SYPD would take too much time to finish an experiment. The common view is that at least 1-2 SYPD is an adequate entrance point for a realistic climate study. It also depends on the time scale of the study: for a 10-20-year simulation, 1-2 SYPD seems acceptable, while for a 50-100-year simulation, 5-10 SYPD is better. The NCEP weather prediction system throughput standard is 8 minutes to finish one model day, which is equivalent to about 0.5 SYPD.

Our results show that the 1/20° LICOM3-HIP runs well at all practical scales for a realistic climate study. To analyze the purely parallel performance, the I/O time has been excluded from the total simulation time in the results of the following tests. Figure 6 shows the SYPD at various parallel scales. The baseline of 384 GPUs achieves a 42-fold speedup over the same number of CPU cores. We also count the overall speedup of 384 GPUs in 96 nodes versus the total of 3072 CPU cores in the same 96 nodes, which gives an overall speedup of about 6-7 times. The figure also indicates that the SYPD keeps increasing at all scales. At the scale of 9216 GPUs, the SYPD first exceeds 2, which is seven times the result of the same number of CPUs. A quasi-whole-machine result (26200 GPUs, i.e., 26200×65 = 1703000 cores in total, with one process corresponding to one CPU core plus 64 GPU cores) indicates that the SYPD still increases, reaching 2.72.

When 384 GPUs are applied, the subroutine "barotr", which solves the barotropic equations, costs about 50% of the total time (Figure 8a). When the number of GPUs is increased to 9216, the time cost of each subroutine decreases, but the share of the subroutine "barotr" increases to 62% (Figure 8b). As mentioned above, this can be explained by "barotr" calling the halo update more than the other subroutines; hence, the memory data copies and communication latency make it slower.
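For reference, the SYPD values quoted in this section follow directly from the wall-clock time needed to simulate one model day; the small helper below (ours, not from the paper) illustrates the conversion, using the NCEP standard of 8 minutes per model day as an example.

```cpp
#include <cstdio>

// SYPD = (wall-clock seconds per day) / (seconds spent per simulated model day) / 365.
int main() {
    double sec_per_model_day = 8.0 * 60.0;                 // NCEP standard: 8 minutes
    double sypd = 86400.0 / sec_per_model_day / 365.0;     // simulated years per wall-clock day
    printf("SYPD = %.2f\n", sypd);                         // ~0.49, i.e. about 0.5 SYPD
    return 0;
}
```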

Model performance in climate research
To test whether the HIP version of LICOM meets the requirements of numerical precision for scientific usage, the daily mean sea surface height (SSH) fields simulated by the CPU and HIP versions are compared using the results of the 1/20° experiments on a particular day, March 1st of the 4th model year (Figures 9a, b). The general SSH spatial patterns of the two are visually very similar. Significant differences are only found in very limited areas, such as the eddy-rich regions near strong currents or at high latitudes (Figure 9c); in most places, the differences fall within the range of -0.1 to 0.1 cm. Because the hardware is different and the mathematical operation sequence of the HIP code is not always the same as that of the Fortran version, the HIP and CPU versions are not identical byte by byte. Therefore, it is hard to verify the correctness of the HIP results directly. Usually, an ensemble method is employed to evaluate the consistency of two model runs (Baker et al., 2015). Considering the unacceptable computing and storage resources this would require, besides mapping the differences between the two versions, we simply compute the root mean square error (RMSE) between them, which is only 0.18 cm, much smaller than the spatial variation of the field, which is 92 cm (about 0.2%). This indicates that the results of LICOM3-HIP are generally acceptable for research.
To evaluate the preliminary results of the global 1/20° simulation from LICOM3-HIP, the sea surface temperature (SST) of the GPU version is compared with the observed SST (Figure 10).

With the increasing horizontal resolution of observations, we now know that mesoscale eddies are ubiquitous in the ocean, with spatial scales of 100-300 km. Vigorous eddies usually occur along major ocean currents, such as the Kuroshio and its Extension, the Gulf Stream, and the Antarctic Circumpolar Current (Figure 11a). The eddies also capture more than 80% of the kinetic energy of the ocean, as estimated from satellite data (e.g., Chelton et al., 2011). Therefore, these mesoscale eddies must be resolved in an ocean model. To resolve the eddies of the global ocean, the horizontal resolution of a numerical model must be higher than 1/10°, but even that cannot resolve the eddies at high latitudes and in shallow waters (Hallberg, 2013). Therefore, still higher resolution is required to resolve the eddies globally. It is clear that the EKE of the 1° version is low, even in areas with strong currents, while the 1/10° version can reproduce most of the eddy-rich regions in the observations. The EKE increases further when the resolution is enhanced to 1/20°, indicating that many more eddy activities are resolved.

Table 4 summarizes detailed features of some published GPU versions of models. Various programming methods have been implemented for different models. A near-kilometer atmospheric model using 4888 GPUs was reported as a large-scale example of weather/climate studies. With the development of supercomputing, the horizontal resolution of ocean circulation models will keep increasing, and much more sophisticated physical processes will also be developed. LICOM3-HIP is larger in scale, not only in terms of grid size but also in terms of the final number of GPUs.

Application of ocean climate model beyond 10000 GPUs
We successfully performed a quasi-whole-machine (26200 GPUs) test, and the results indicate that the model still obtained an increasing SYPD (2.72). In practice, the application of an ocean climate model beyond 10000 GPUs is not easy, because running on many nodes with many GPUs requires that the network connection, the PCIe and memory speeds, and the input/output storage systems all work well and at their best performance. The three most common errors when running LICOM3-HIP are MPI hardware errors, CPU memory access errors, and GPU hardware errors. Suppose that the probability of an individual hardware (or software) error is 10⁻⁵. As the MPI (GPU) scale increases, the total error rate increases; and once a hardware error occurs, the model simulation fails.
When 384 GPUs are applied, the success rate within one hour can be expressed as (1-384×10⁻⁵)³ = 98.85%, and the failure rate is 1-(1-384×10⁻⁵)³ = 1.15%. Applying this formula, we can obtain the failure rates corresponding to 1000, 10000, and 26200 GPUs; the results are listed in Table 5. As shown in Table 5, at the middle scale (i.e., when 1000 GPUs are used) three failures will happen in 100 runs; when the scale increases to 10000 GPUs, a quarter of the runs will fail. The 10⁻⁵ error probability also indicates that a 10000-GPU task cannot run for 10 continuous hours on average. If the required run time is shortened, the success rate increases. For example, within 6 minutes, the success rate of the 26200-GPU task is (1-26200×10⁻⁶)³ = 92.34%, and its failure rate is 1-(1-26200×10⁻⁶)³ = 7.66%.
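These numbers follow from the simple reliability model with three independent error sources; the short program below (our own illustration of the arithmetic) reproduces them.

```cpp
#include <cstdio>
#include <cmath>

// Failure rate of a run with n GPUs, assuming three independent error sources
// (MPI, CPU memory, GPU hardware), each with per-GPU error probability p over the run time.
double failure_rate(int n, double p) {
    return 1.0 - std::pow(1.0 - n * p, 3);
}

int main() {
    const int counts[] = {384, 1000, 10000, 26200};
    for (int n : counts)        // per-hour probability p = 1e-5
        printf("%5d GPUs: failure within 1 h = %5.2f%%\n", n, 100.0 * failure_rate(n, 1e-5));
    // For a 6-minute window the per-GPU probability scales down to 1e-6.
    printf("26200 GPUs: failure within 6 min = %.2f%%\n", 100.0 * failure_rate(26200, 1e-6));
    return 0;   // prints 1.15%, 2.97%, 27.10%, ... and 7.66% for the 6-minute case
}
```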

Energy to solution
We also measured the energy to solution. A normalized simulation energy (E) is employed as a metric, defined as E = TDP × N / (SYPD/24), where TDP is the thermal design power of a node, N is the number of computer nodes used, and SYPD/24 is the number of simulated years per hour.
The smaller the E value, the better; a small E means that we can obtain more simulated years within a limited power supply. To calculate E, we estimate a TDP of 1380 W for a node of the present platform (1 AMD CPU and 4 GPUs) and 290 W for a reference Intel node (2 Intel 16-core CPUs). We only include the TDP of the CPUs and GPUs here.
Based on the above power estimates, the energy costs of the simulations are shown in Table 6 in MWh per simulated year (MWh/SY). The energy costs of the 1/20° LICOM3 simulations running on CPUs and GPUs are comparable when the number of MPI processes is within 1000. The energy costs of LICOM3 at 1/20° running on 384 (768) GPUs and CPU cores are about 6.234 (7.661) MWh/SY and 6.845 (6.280) MWh/SY, respectively, but the simulation speed of LICOM3 on GPUs is much faster than that on CPUs, about 42 times for 384 processes and 31 times for 768 processes. When the number of MPI processes is beyond 1000, the value of E for the GPU becomes much larger than that for the CPU, indicating that the GPUs are not fully loaded at such scales. A worked example of the E calculation is given below.
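As an illustration of the E formula, the following snippet (our own; the SYPD values of roughly 0.51 and 0.83 are approximate readings from the scaling results above) reproduces the GPU entries of Table 6.

```cpp
#include <cstdio>

// E [MWh per simulated year] = TDP_per_node [W] * nodes * (24 / SYPD) / 1e6,
// i.e. the node power multiplied by the wall-clock hours needed to simulate one year.
double energy_per_sy(double tdp_watts, int nodes, double sypd) {
    return tdp_watts * nodes * (24.0 / sypd) / 1.0e6;
}

int main() {
    // 384 GPUs = 96 AMD nodes (1380 W each), approximately 0.51 SYPD (assumed reading)
    printf("GPU, 384 : %.3f MWh/SY\n", energy_per_sy(1380.0, 96, 0.51));   // ~6.23
    // 768 GPUs = 192 AMD nodes, approximately 0.83 SYPD (assumed reading)
    printf("GPU, 768 : %.3f MWh/SY\n", energy_per_sy(1380.0, 192, 0.83));  // ~7.66
    return 0;
}
```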

Conclusions
A GPU version of LICOM3 under the HIP framework has been developed in the present study. The dynamic core and physics package are both ported to the GPU, and 3-D parallelization is applied. The new model has been implemented and achieves a good acceleration rate on a Linux cluster with AMD GPU cards. This is also the first time an ocean general circulation model has been fully applied on a heterogeneous supercomputer using the HIP framework. Based on our tests with the 1/20° configuration, LICOM3-HIP is 42 times faster than the same number of CPU cores when 384 AMD GPUs and CPU cores are used. LICOM3-HIP has good scalability; it can still obtain a speedup of more than four on 9216 GPUs compared with 384 GPUs. The SYPD, which is equivalent to the speedup, keeps increasing as the number of GPUs increases. We successfully performed a quasi-whole-machine test with 6550 nodes and 26200 GPUs using the 1/20° configuration. Based on the timing of kernel functions, the decreasing efficiency is mainly caused by the latency of data copies into and out of GPU memory in solving the barotropic equations, particularly when the number of GPUs exceeds 10000.
Using the 1/20° configuration of LICOM3-HIP, a 14-year spin-up integration was conducted. Because the hardware is different and the mathematical operation sequence of the GPU code is not always the same as that of the Fortran version, the GPU and CPU versions cannot be identical byte by byte. However, the comparison between the GPU and CPU versions of LICOM3 shows that the differences are very small in most places, indicating that the results from LICOM3-HIP can be used for practical research.
Further comparison with observations and the lower-resolution results suggests that the 1/20° configuration of LICOM3-HIP can not only reproduce the observed results but also produce much smaller-scale activities.
The eddy-resolving ocean circulation model, which is a fundamental platform for oceanographic research, ocean forecasting, and climate prediction and projection, can simulate the variations of circulation, temperature, salinity, and sea level at spatial scales larger than 15 km and temporal scales from the diurnal cycle to decadal variability. As mentioned above, 1-2 SYPD is a good entrance point for a realistic model used in climate research. The more practical GPU scale range for realistic simulation is around 384-1536 GPUs. At these scales, the model still achieves 0.5-1.22 SYPD. If we decrease the number of loops in the "barotr" procedure to 1/3 of the original in the spin-up simulation, the performance will reach 1-2.5 SYPD for 384-1536 GPUs. This performance will satisfy 10-50-year-scale climate studies. Besides, this version can certainly be used for short-term ocean prediction in the future.
Besides, the block size of 36×30×55 (the 1/20° setup on 26200 GPUs) is not a large computation task for one GPU. Since one GPU has 64 cores and 2560 threads in total, if a subroutine computation is 2-D, the work per thread is too small. Even for 3-D loops, the task is still not big enough to load the whole GPU. This indicates that more speedup will be gained when the LICOM resolution is increased to the kilometer level. The LICOM3-HIP code is written for 1/20° at present, but it is kilometer-ready GPU code.
There is still potential to further increase the speedup of LICOM3-HIP. At present, the bottleneck is the high-frequency data copies into and out of GPU memory in the barotropic part of LICOM3. Unless HIP-aware MPI is supported, the latency of data transfer between CPU and GPU cannot be entirely avoided. So far, we can only try to reduce the time consumed by decreasing the frequency or size of the data copies, or even by modifying the method used to solve the barotropic equations.