E-Infrastructures
H2020-EINFRA-2014-2015
EINFRA-4-2014: Pan-European High Performance Computing Infrastructure and Services

PRACE-4IP
PRACE Fourth Implementation Phase Project
Grant Agreement Number: EINFRA-653838

D7.1 Periodic Report on Applications Enabling
Final

Version: 1.0
Author(s): Paul Graham, EPCC; Sebastian Lührs, JUELICH
Date: 15.04.2016

Project and Deliverable Information Sheet

Project Ref. №: EINFRA-653838
Project Title: PRACE Fourth Implementation Phase Project
Project Web Site: http://www.prace-project.eu
Deliverable ID: D7.1
Deliverable Nature: Report
Dissemination Level: PU *
Contractual Date of Delivery: 29/April/2016
Actual Date of Delivery: 30/April/2016
EC Project Officer: Leonardo Flores Añover

* The dissemination levels are indicated as follows: PU – Public; CO – Confidential, only for members of the consortium (including the Commission Services); CL – Classified, as referred to in Commission Decision 2001/844/EC.

Document Control Sheet

Title: Periodic Report on Applications Enabling
ID: D7.1
Version: 1.0
Status: Final
Available at: http://www.prace-project.eu
Software Tool: Microsoft Word 2010
File(s): D7.1.docx

Document Authorship

Written by: Paul Graham, EPCC; Sebastian Lührs, JUELICH
Contributors: Gabriel Hautreux, CINES; Tristan Cabel, CINES; Stéphane Lanteri, INRIA; Dimitris Dellis, IASA; Volker Weinberg, LRZ; Bertrand Cirou, CINES; Juan Carlos Caja, BSC; Andrew Emerson, CINECA; John Donners, SURFsara; Ahmet Duran, ITU; Yakup Hundur, ITU; Jörg Hertzer, HLRS; Bärbel Große-Wöhrmann, HLRS; Juan Carlos Garcia, BSC; Jussi Heikonen, CSC; Esko Järvinen, CSC; Claudio Arlandini, CINECA; Raffaele Ponzini, CINECA; Vittorio Ruggiero, CINECA; Isabelle Dupays, IDRIS; Lola Falletti, IDRIS; Philippe Segers, GENCI; Thomas Palychata, GENCI
Reviewed by: Hervé Lozach, CEA; Thomas Eickermann, JUELICH
Approved by: MB/TB

Document Status Sheet

Version   Date             Status          Comments
0.1       18/March/2016    Draft           Set up document structure
0.2       24/March/2016    Draft           Added PA C content
0.3       01/April/2016    Draft           Added SHAPE content
0.4       03/April/2016    Draft           Formatting
0.5       05/April/2016    Draft           Content, formatting
0.6       11/April/2016    Draft
0.7       13/April/2016    Draft           After project-internal review
1.0       15/April/2016    Final version   Further post-review revision

Document Keywords

Keywords: PRACE, HPC, Research Infrastructure, Preparatory Access, SHAPE

Disclaimer

This deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement n° EINFRA-653838. It solely reflects the opinion of the parties to such agreements on a collective basis in the context of the Project and to the extent foreseen in such agreements. Please note that even though all participants to the Project are members of PRACE AISBL, this deliverable has not been approved by the Council of PRACE AISBL and therefore does not emanate from it nor should it be considered to reflect PRACE AISBL's individual opinion.

Copyright notices

© 2016 PRACE Consortium Partners. All rights reserved. This document is a project document of the PRACE project.
All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contract EINFRA-653838 for reviewing and dissemination purposes. All trademarks and other rights on third party products mentioned in this document are acknowledged as own by the respective holders. PRACE-4IP – EINFRA-653838 iii 15.04.2016 D7.1 Periodic Report on Applications Enabling Table of Contents Project and Deliverable Information Sheet ......................................................................................... i Document Control Sheet ........................................................................................................................ i Document Status Sheet ......................................................................................................................... ii Document Keywords ............................................................................................................................ iii Table of Contents ................................................................................................................................. iv List of Figures ........................................................................................................................................ v List of Tables......................................................................................................................................... vi References and Applicable Documents .............................................................................................. vi List of Acronyms and Abbreviations ................................................................................................. vii List of Project Partner Acronyms ....................................................................................................... ix Executive Summary .............................................................................................................................. 2 1 Introduction ................................................................................................................................... 3 2 T7.1.A Petascaling & Optimisation Support for Preparatory Access Projects – Preparatory Access Calls ............................................................................................................................................ 1 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 Cut-off statistics ................................................................................................................................... 1 Review Process ..................................................................................................................................... 3 Assigning of PRACE collaborators .................................................................................................... 4 Monitoring of projects ........................................................................................................................ 4 Hand-over between PRACE-3IP and PRACE-4IP PA type C projects ......................................... 5 PRACE Preparatory Access type C projects covered by this report .............................................. 6 Dissemination ....................................................................................................................................... 
9 Cut-off June 2014 ................................................................................................................................ 9 2.8.1 Numerical modeling of the interaction of light waves with nanostructures using a high order discontinuous finite element method, 2010PA2452 ............................................................................... 9 2.8.2 Large scale parallelized 3d mesoscopic simulations of the mechanical response to shear in disordered media, 2010PA2457 .......................................................................................................... 13 2.8.3 PICCANTE: an open source particle-in-cell code for advanced simulations on tier-0 systems, 2010PA2458 ........................................................................................................................................ 16 2.8.4 OpenFOAM capability for industrial large scale computation of the multiphase flow of future automotive component: step 2., 2010PA2431 ..................................................................................... 21 2.9 Cut-off September 2014 .....................................................................................................................22 2.9.1 Parallel subdomain coupling for non-matching mesh problems in ALYA, 2010PA2486 ........... 22 2.10 Cut-off December 2014 ......................................................................................................................25 2.10.1 Numerical simulation of complex turbulent flows with Discontinuous Galerkin method, 2010PA2737 ........................................................................................................................................ 25 2.11 Cut-off March 2015 ............................................................................................................................30 2.11.1 Large Eddy Simulation of unsteady gravity currents and implications for mixing, 2010PA2821 ........................................................................................................................................ 30 3 T7.1.B SHAPE ............................................................................................................................. 35 3.1 SHAPE Second call: Applications and Review Process ..................................................................35 3.2 SHAPE Second call: Status ................................................................................................................36 3.3 SHAPE second call: Project summaries ...........................................................................................37 3.3.1 Ergolines: HPC-based Design of a Novel Electromagnetic Stirrer for Steel Casting ................ 37 3.3.2 Cybeletech: Numerical simulations for plant breeding optimization ......................................... 39 3.3.3 RAPHI: rarefied flow simulations on Intel Xeon Phi ................................................................. 40 PRACE-4IP – EINFRA-653838 iv 15.04.2016 D7.1 Periodic Report on Applications Enabling 3.3.4 Open Ocean: High Performance Processing Chain - faster on‐line statistics calculation ........ 41 3.3.5 Hydros Innovation: Automatic Optimal Hull Design by Means of VPP Applications on HPC Platforms ............................................................................................................................................. 
43 3.3.6 Vortex: Numerical Simulation for Vortex-Bladeless .................................................................. 44 3.3.7 DesignMethods: Coupled sail and appendage design method for multihull based on numerical optimization ......................................................................................................................................... 45 3.3.8 Ingenieurbüro Tobias Loose: HPCWelding: parallelized welding analysis with LS-Dyna ....... 46 3.3.9 WB-Sails Ltd: CFD simulations of sails and sailboat performance ........................................... 47 3.3.10 Principia: HPC for Hydrodynamics Database Creation ................................................... 48 3.3.11 Algo’tech Informatique ...................................................................................................... 49 3.4 Summary of lessons learned and feedback .......................................................................................51 3.5 SHAPE third call ................................................................................................................................52 3.6 SHAPE: future ....................................................................................................................................53 4 Summary ...................................................................................................................................... 54 4.1 Preparatory Access Type C ...............................................................................................................54 4.1 SHAPE.................................................................................................................................................55 List of Figures Figure 1: Number of submitted and accepted proposals for PA type C per Cut-off. .............................. 2 Figure 2: Amount of PMs assigned to PA type C projects per Cut-off. .................................................. 2 Figure 3: Number of projects per scientific field. ................................................................................... 3 Figure 4: Timeline of the PA C projects. ................................................................................................ 5 Figure 5: View of the computational domain for the Y-shaped waveguide (left) and contour lines of the amplitude of the electric field (right)............................................................................................... 11 Figure 6: Strong scalability analysis of the DGTD solver with P2 (top), P3 (middle) and P4 (bottom) interpolation. ......................................................................................................................................... 12 Figure 7: The inverse average iteration time of initial ELASTO code, using up to 32 cores, as function of number of cores on Froggy and Curie, for system sizes 643, 2563, 5123 and 10243 ......................... 14 Figure 8: The percentage of time spent in MPI calls (left) and the average MPI message size during run, as function of number of cores on Curie, for system size 643, 2563 and 5123. .............................. 15 Figure 9: Most time consuming routines for strong (top) and weak (bottom) scaling tests, before (left) and after (right) the optimization work. ................................................................................................ 17 Figure 10: Old vs. 
new output strategy overview.................................................................................. 18 Figure 11: Comparison of the old and the new output strategy for a strong scaling test. ..................... 19 Figure 12: Two subdomains coupling and parallel partition. The lines show the connection between parallel partitions of the different subdomains. ..................................................................................... 23 Figure 13: Relative cost of using the subdomain coupling with a fixed number of two hundred iterations of the GMRES solver with a Krylov space of size ten. ......................................................... 23 Figure 14: Speed up using the subdomain coupling with a fixed number of two hundred iterations of the GMRES solver with a Krylov space of size ten, and the same case and configuration in one subdomain. ............................................................................................................................................ 24 Figure 15: Implicit coupling applied to the Navier-Stokes equations. (Left) Meshes (Right) Velocity and pressure. .......................................................................................................................................... 24 Figure 16: FSI benchmark. (Left) Geometry. (Right) Results. ............................................................. 25 Figure 17: Speedup on FERMI. The values are normalized by the speedup with 1024 cores. ............. 26 Figure 18: Speedup on HORNET. The values are normalized by the speedup with 24 cores. ............. 27 Figure 19: Percentage of the time consumed in the main steps of the computations. ........................... 29 Figure 20: Shear stress (left) and turbulent kinetic energy (right) profiles for different value of k are shown. The dashed line represent the modelled quantities, while the continuous lines the resolved ones........................................................................................................................................................ 30 Figure 21: Speedup vs. cores test1: gravity current in a channel. ......................................................... 32 Figure 22: Results for the channel test: propagation of the gravity current (density field) ................... 32 Figure 23: Benchmark for the 3D gravity current simulation. .............................................................. 33 PRACE-4IP – EINFRA-653838 v 15.04.2016 D7.1 Periodic Report on Applications Enabling Figure 24: Speedup vs. cores for the 3D gravity current test. ............................................................... 33 Figure 25: Top views of geometry and mesh, and velocity field .......................................................... 38 Figure 26: Speed-up as a function of the number of cores .................................................................... 39 List of Tables Table 1: Projects, which were established and finalized in the PRACE-3IP extension phase, but had to be finally reported in this deliverable. ..................................................................................................... 7 Table 2: Projects, which were established in the PRACE-3IP extension phase, but were supported by PRACE-4IP T7.1.A. ................................................................................................................................ 8 Table 3: Projects, which were established in PRACE-4IP. 
..................................................................... 8 Table 4: Strong scaling of the DGTD solver with P2 interpolation (on each node, we spawn two MPI process and eight OpenMP threads per MPI process). .......................................................................... 12 Table 5: Strong scaling of the DGTD solver with P3 interpolation (on each node, we spawn two MPI process and eight OpenMP threads per MPI process). .......................................................................... 12 Table 6: Strong scaling of the DGTD solver with P4 interpolation (on each node, we spawn two MPI process and eight OpenMP threads per MPI process). .......................................................................... 12 Table 7: Scalasca analysis of the piccante core routines. ...................................................................... 17 Table 8: Maximum writing speed by using the new output strategy..................................................... 19 Table 9: Scaling performances on FERMI ............................................................................................ 27 Table 10: Scaling performances on HORNET ...................................................................................... 27 Table 11: Scaling performances on MareNostrum................................................................................ 28 Table 12: I/O writing time in sec with and without HDF5.................................................................... 28 Table 13: SHAPE applications to the second call ................................................................................. 36 Table 14: Third call applications ........................................................................................................... 53 Table 15: White paper status of the current PA C projects. .................................................................. 55 References and Applicable Documents [1] http://www.prace-project.eu (identical to http://www.prace-ri.eu) [2] http://www.prace-ri.eu/IMG/pdf/D7.1.3_3ip.pdf [3] http://www.prace-ri.eu/white-papers/ [4] Hybrid MIMD/SIMD High Order DGTD Solver for the Numerical Modeling of Light/Matter Interaction on the Nanoscale, http://www.prace-ri.eu/IMG/pdf/WP207.pdf [5] Large Scale Parallelized 3d Mesoscopic Simulations of the Mechanical Response to Shear in Disordered Media, http://www.prace-ri.eu/IMG/pdf/WP208.pdf [6] Optimising PICCANTE – an Open Source Particle-in-Cell Code for Advanced Simulations on Tier-0 Systems, http://www.prace-ri.eu/IMG/pdf/WP209.pdf [7] Parallel Subdomain Coupling for non-matching Meshes in Alya, http://www.praceri.eu/IMG/pdf/WP210.pdf [8] A. Abbà , L. Bonaventura, M. Nini , M. Restelli. Dynamic models for Large Eddy Simulation of compressible flows with a high order DG method. Computers and Fluids. DOI: 10.1016/j.compfluid.2015.08.021, in press [9] M. Nini, A. Abbà, M. Germano, M. Restelli. Analysis of a Hybrid RANS/LES Model using RANS Reconstruction. 
Proceeding of iTi2014 - conference on turbulence, Bertinoro, September 21-24, 2014 [10] Numerical Simulation of complex turbulent Flows with Discontinuous Galerkin Method, http://www.prace-ri.eu/IMG/pdf/wp211.pdf [11] PRACE-3IP Deliverable 5.2 “Integrated HPC Access Programme for SMEs”, February 2013 PRACE-4IP – EINFRA-653838 vi 15.04.2016 D7.1 Periodic Report on Applications Enabling [12] PRACE-3IP Deliverable 5.3.1 “PRACE Integrated Access Programme Launch”, June 2013 [13] PRACE-3IP Deliverable 5.3.2 “Results of the Integrated Access Programme Pilots”, June 2014 [14] PRACE-3IP Deliverable 5.3.3 “Report on the SHAPE Implementation”, January 2015 List of Acronyms and Abbreviations aisbl BCO CoE CPU CUDA DARPA DEISA DoA EC EESI EoI ESFRI GB Gb/s GB/s GÉANT GFlop/s GHz GPU HET HMM HPC HPL ISC KB LINPACK MB MB MB/s MFlop/s MIC MooC Association International Sans But Lucratif (legal form of the PRACE-RI) Benchmark Code Owner Center of Excellence Central Processing Unit Compute Unified Device Architecture (NVIDIA) Defense Advanced Research Projects Agency Distributed European Infrastructure for Supercomputing Applications EU project by leading national HPC centres Description of Action (formerly known as DoW) European Commission European Exascale Software Initiative Expression of Interest European Strategy Forum on Research Infrastructures Giga (= 230 ~ 109) Bytes (= 8 bits), also GByte Giga (= 109) bits per second, also Gbit/s Giga (= 109) Bytes (= 8 bits) per second, also GByte/s Collaboration between National Research and Education Networks to build a multi-gigabit pan-European network. The current EC-funded project as of 2015 is GN4. Giga (= 109) Floating point operations (usually in 64-bit, i.e. DP) per second, also GF/s Giga (= 109) Hertz, frequency =109 periods or clock cycles per second Graphic Processing Unit High Performance Computing in Europe Taskforce. Taskforce by representatives from European HPC community to shape the European HPC Research Infrastructure. Produced the scientific case and valuable groundwork for the PRACE project. Hidden Markov Model High Performance Computing; Computing at a high performance level at any given time; often used synonym with Supercomputing High Performance LINPACK International Supercomputing Conference; European equivalent to the US based SCxx conference. Held annually in Germany. Kilo (= 210 ~103) Bytes (= 8 bits), also KByte Software library for Linear Algebra Management Board (highest decision making body of the project) Mega (= 220 ~ 106) Bytes (= 8 bits), also MByte Mega (= 106) Bytes (= 8 bits) per second, also MByte/s Mega (= 106) Floating point operations (usually in 64-bit, i.e. DP) per second, also MF/s Multi Integrated Cores Massively open online Course PRACE-4IP – EINFRA-653838 vii 15.04.2016 D7.1 MoU MPI NDA PA PATC PRACE PRACE 2 PRIDE RI SHAPE SME TB TB TCO TDP TFlop/s Tier-0 Tier-1 UNICORE WP Periodic Report on Applications Enabling Memorandum of Understanding. Message Passing Interface Non-Disclosure Agreement. Typically signed between vendors and customers working together on products prior to their general availability or announcement. Preparatory Access (to PRACE resources) PRACE Advanced Training Centres Partnership for Advanced Computing in Europe; Project Acronym The upcoming next phase of the PRACE Research Infrastructure following the initial five year period. 
Project Information and Dissemination Event Research Infrastructure SME HPC Adoption Programme in Europe Small and Medium Enterprises Technical Board (group of Work Package leaders) Tera (= 240 ~ 1012) Bytes (= 8 bits), also TByte Total Cost of Ownership. Includes recurring costs (e.g. personnel, power, cooling, maintenance) in addition to the purchase cost. Thermal Design Power Tera (= 1012) Floating-point operations (usually in 64-bit, i.e. DP) per second, also TF/s Denotes the apex of a conceptual pyramid of HPC systems. In this context the Supercomputing Research Infrastructure would host the Tier-0 systems National or topical HPC centres in the conceptual pyramid Uniform Interface to Computing Resources. Grid software for seamless access to distributed resources. Work Package PRACE-4IP – EINFRA-653838 viii 15.04.2016 D7.1 Periodic Report on Applications Enabling List of Project Partner Acronyms BADW-LRZ Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Germany (3rd Party to GCS) BILKENT Bilkent University, Turkey (3rd Party to UYBHM) BSC Barcelona Supercomputing Center - Centro Nacional de Supercomputacion, Spain CaSToRC Computation-based Science and Technology Research Center, Cyprus CCSAS Computing Centre of the Slovak Academy of Sciences, Slovakia CEA Commissariat à l’Energie Atomique et aux Energies Alternatives, France (3 rd Party to GENCI) CESGA Fundacion Publica Gallega Centro Tecnológico de Supercomputación de Galicia, Spain, (3rd Party to BSC) CINECA CINECA Consorzio Interuniversitario, Italy CINES Centre Informatique National de l’Enseignement Supérieur, France (3 rd Party to GENCI) CNRS Centre National de la Recherche Scientifique, France (3 rd Party to GENCI) CSC CSC Scientific Computing Ltd., Finland CSIC Spanish Council for Scientific Research (3rd Party to BSC) CYFRONET Academic Computing Centre CYFRONET AGH, Poland (3rd party to PNSC) EPCC EPCC at The University of Edinburgh, UK ETHZurich (CSCS) Eidgenössische Technische Hochschule Zürich – CSCS, Switzerland FIS FACULTY OF INFORMATION STUDIES, Slovenia (3rd Party to ULFME) GCS Gauss Centre for Supercomputing e.V. 
GENCI Grand Équipement National de Calcul Intensif, France GRNET Greek Research and Technology Network, Greece INRIA Institut National de Recherche en Informatique et Automatique, France (3rd Party to GENCI) IST Instituto Superior Técnico, Portugal (3rd Party to UC-LCA) IUCC INTER UNIVERSITY COMPUTATION CENTRE, Israel JKU Institut fuer Graphische und Parallele Datenverarbeitung der Johannes Kepler Universitaet Linz, Austria JUELICH Forschungszentrum Juelich GmbH, Germany KTH Royal Institute of Technology, Sweden (3rd Party to SNIC) LiU Linkoping University, Sweden (3rd Party to SNIC) NCSA NATIONAL CENTRE FOR SUPERCOMPUTING APPLICATIONS, Bulgaria NIIF National Information Infrastructure Development Institute, Hungary NTNU The Norwegian University of Science and Technology, Norway (3rd Party to SIGMA) NUI-Galway National University of Ireland Galway, Ireland PRACE Partnership for Advanced Computing in Europe aisbl, Belgium PSNC Poznan Supercomputing and Networking Center, Poland RISCSW RISC Software GmbH RZG Max Planck Gesellschaft zur Förderung der Wissenschaften e.V., Germany (3rd Party to GCS) SIGMA2 UNINETT Sigma2 AS, Norway SNIC Swedish National Infrastructure for Computing (within the Swedish Science Council), Sweden STFC Science and Technology Facilities Council, UK (3rd Party to EPSRC) SURFsara Dutch national high-performance computing and e-Science support center, part of the SURF cooperative, Netherlands UC-LCA Universidade de Coimbra, Laboratório de Computação Avançada, Portugal UCPH Københavns Universitet, Denmark UHEM Istanbul Technical University, Ayazaga Campus, Turkey UiO University of Oslo, Norway (3rd Party to SIGMA) ULFME UNIVERZA V LJUBLJANI, Slovenia UmU Umea University, Sweden (3rd Party to SNIC) UnivEvora Universidade de Évora, Portugal (3rd Party to UC-LCA) UPC Universitat Politècnica de Catalunya, Spain (3rd Party to BSC) UPM/CeSViMa Madrid Supercomputing and Visualization Center, Spain (3rd Party to BSC) USTUTT-HLRS Universitaet Stuttgart – HLRS, Germany (3rd Party to GCS) VSB-TUO VYSOKA SKOLA BANSKA - TECHNICKA UNIVERZITA OSTRAVA, Czech Republic WCNS Politechnika Wroclawska, Poland (3rd party to PNSC)

Executive Summary

Task T7.1 "Enabling Applications Codes for PRACE Systems" in Work Package 7 (WP7) of PRACE-4IP aims to provide application enabling support for HPC applications that are important to European researchers and small and medium enterprises, to ensure that these applications can effectively exploit HPC systems. There were two activities in T7.1:

T7.1.A Petascaling & Optimisation Support for Preparatory Access Projects: This activity provided code enabling and optimisation to European researchers as well as industrial projects to make their applications ready for Tier-0 systems. Projects can apply for such services at any time via the Preparatory Access Call Type C (PA C), with a cut-off every three months for the evaluation of the proposals. Five Preparatory Access Cut-offs have been carried out in PRACE-4IP. The report focuses on the optimization work and results gained by the completed projects in PRACE-4IP, and also reports on the last PA C projects of PRACE-3IP, which finished after the final deliverable of PRACE-3IP was completed and have therefore not been reported so far.
In total seven PA C projects have finished their work. Statistics on the PA C Cut-offs in PRACE-4IP, as well as a description of the call organization itself, are also included. The results of the completed projects have also been documented in white papers, which were published on the PRACE-RI website [1].

T7.1.B SHAPE: This activity continued the support for SHAPE (the SME HPC Adoption Programme in Europe). SHAPE aims to raise awareness and provide European SMEs with the expertise necessary to take advantage of the innovation possibilities created by High-Performance Computing (HPC), thus increasing their competitiveness. It holds regular calls, and successful applicants to the SHAPE programme get support effort from a PRACE HPC expert and access to machine time at a PRACE centre. In collaboration with the SME, the PRACE partner helps them try out their ideas for utilising HPC to enhance their business. This report focusses on the second call of SHAPE, looking at the results of the projects and lessons learned from the perspective of both the SMEs and the PRACE partners. In addition, it covers the recently closed third call for projects, outlining changes to the application process both already implemented and recommended for future calls.

1 Introduction

Computational simulations have proved to be a promising way of finding answers to research problems from a wide range of scientific fields. However, such complex problems often have computational demands that cannot be met by conventional computer systems, and supercomputers are therefore the tool of choice for today's simulations. PRACE offers a wide range of different Tier-0 and Tier-1 architectures to the scientific community as well as to innovative industrial projects. Efficient usage of such systems places high demands on the software packages involved, and in many cases advanced optimization work has to be applied to a code before it can make efficient use of the provided supercomputers. The complexity of supercomputers requires a high level of experience and advanced knowledge of different concepts regarding programming techniques, parallelization strategies, etc. Such demands often cannot be met by the applicants themselves and thus special assistance by supercomputing experts is essential.

PRACE offers such a service through the Preparatory Access Call type C (PA C) for Tier-0 systems. PA C is managed by Task 7.1.A "Petascaling and Optimization Support for Preparatory Access Projects". This includes the evaluation of the PA C proposals as well as the assignment of PRACE experts to these proposals. Furthermore, the support itself is provided and monitored within this task. Section 2 gives a more detailed description of PA C, and facts on the usage of PA C in PRACE-4IP are listed in Section 2.1. The review process, the assignment of PRACE experts to the projects and the monitoring of the support work are detailed in Section 2.2, Section 2.3 and Section 2.4 respectively. The contents of Sections 2.2 - 2.4 can already be found in deliverable D7.1.3 of PRACE-3IP [2]. They are repeated here for completeness and the benefit of the reader. Section 2.5 describes the relation and hand-over between the PRACE-3IP and the PRACE-4IP project regarding PA C.
Section 2.6 gives an overview of the Preparatory Access type C projects covered in PRACE-4IP and lists the projects supported by PRACE-3IP that were not reported in former deliverables. The announcement of the call is described in Section 2.7. Finally, the work done within the projects along with the outcome of the optimization work is presented in Sections 2.8 - 2.11.

The second part of this deliverable is the report on the SME HPC Adoption Programme in Europe (SHAPE), which is a pan-European programme to support the adoption of High Performance Computing (HPC) by European small to medium-size enterprises (SMEs). It was developed by PRACE under its PRACE-3IP European Union funded project, and continued under PRACE-4IP. The SHAPE programme, presented in the PRACE-3IP Deliverable 5.2 [11], aims to equip European SMEs with the awareness and expertise necessary to take advantage of the innovation possibilities opened by HPC, increasing their competitiveness. The mission of the Programme is to help European SMEs to demonstrate a tangible Return on Investment (ROI) by assessing and adopting solutions supported by HPC, thus facilitating innovation and/or increased operational efficiency in their businesses.

It can be challenging for SMEs to adopt HPC. They may have no in-house expertise, no access to hardware, or be unable to commit resources to a potentially risky endeavour. This is where SHAPE comes in, by making it easier for SMEs to make use of high-performance computing in their business - be it to improve product quality, reduce time to delivery or provide innovative new services to their customers. Successful applicants to the SHAPE programme get support effort from a PRACE HPC expert and access to machine time at a PRACE centre. In collaboration with the SME, the PRACE partner helps them try out their ideas for utilising HPC to enhance their business.

The initial SHAPE pilot [12][13] was launched in 2013 and helped 10 SMEs adopt HPC. A follow-up exercise to gauge the business impact showed that in almost all cases the pilot had been of real value to the SMEs, with tangible measures of the return on investment for the SHAPE work [14]. Following this pilot, the PRACE Council decided to operate the SHAPE programme as a permanent service. The SHAPE second call was launched in November 2014 and closed in January 2015, and is reported on in Sections 3.1 to 3.4. The third call for SHAPE was launched in November 2015 and closed in January 2016, and is reported in Section 3.5. Finally, Section 3.6 looks at the plans and recommendations for SHAPE going forward.

The deliverable closes with a summary in Section 4, which points out the outcomes of Task 7.1.A and Task 7.1.B.

2 T7.1.A Petascaling & Optimisation Support for Preparatory Access Projects – Preparatory Access Calls

Access to PRACE Tier-0 systems is managed through PRACE regular calls, which are issued twice a year. To apply for Tier-0 resources the application must meet technical criteria concerning scaling capability, memory requirements and runtime set-up. There are many important scientific and commercial applications which do not meet these criteria today. To support researchers, PRACE offers the opportunity to test and optimize applications on the envisaged Tier-0 system prior to applying for a regular production project.
This is the purpose of the Preparatory Access Call. The PA Call allows for submission of proposals at any time, whereby the review of these proposals takes place every three months. This procedure is also referred to as a Cut-off. Therefore, new projects can be admitted for preparatory purposes to PRACE Tier-0 systems once every quarter. It is possible to choose between three different types of access:

• Type A is meant for code scalability tests, the outcome of which is to be included in the proposal for a future PRACE Regular Call. Users receive a limited number of core hours; the allocation period is two months.
• Type B is intended for code development and optimization by the user. Users also receive a small number of core hours; the allocation period is six months.
• Type C is also designed for code development and optimization, with the core hours and the allocation period being the same as for Type B. The important difference is that Type C projects receive special assistance by PRACE experts to support the optimization. As well as access to the Tier-0 systems, the applicants also apply for 1 to 6 PMs of supporting work to be performed by PRACE experts.

The following Tier-0 systems were available for PA during the reporting period:

• CURIE, BULL Bullx cluster at GENCI-CEA, France (thin (TN), fat (FN) and hybrid (HN) nodes were available)
• FERMI, IBM Blue Gene/Q at CINECA, Italy
• HAZEL HEN, Cray XC40, replacing HORNET, Cray XC40, at GCS-HLRS, Germany
• MARENOSTRUM, IBM System X iDataplex at BSC, Spain (normal and hybrid nodes are available)
• SUPERMUC, IBM System X iDataplex at GCS-LRZ, Germany
• JUQUEEN, IBM Blue Gene/Q at GCS-JSC, Germany

2.1 Cut-off statistics

In PRACE-4IP, five Cut-offs for PA took place, resulting in three projects so far. Although the Cut-offs of June 2014, September 2014 and December 2014 took place in PRACE-3IP, they are included in the presented statistics because the corresponding projects are either reported in this deliverable or were taken over by PRACE-4IP. In the March 2015 Cut-off one project had to be rejected due to poor scalability. Another project was already supported by SHAPE in T7.1.B and used PA C to gain the needed computing time; since SHAPE already handles the support for this project, no extra support by T7.1.A had to be provided on this proposal. In the June 2015 and September 2015 Cut-offs no new proposals were accepted. In June 2015, only a single proposal applied for PA C, but this project was already supported by SHAPE and used PA C to gain the needed computing time. In September 2015, no new proposals applied for PA C.

Figure 1: Number of submitted and accepted proposals for PA type C per Cut-off.

Figure 1 presents the number of proposals submitted and accepted for each Cut-off covered in this deliverable. In total, 3 out of 6 proposals were accepted during the PRACE-4IP Cut-off phase, from the Cut-off in March 2015 to the Cut-off in December 2015. The two projects which were handled by SHAPE in T7.1.B are marked here as rejected because they received no extra support from PA C and T7.1.A. The Cut-off of March 2016 is currently in progress and therefore its final status is not yet available for this report.
Figure 2: Amount of PMs assigned to PA type C projects per Cut-off.

Figure 2 gives an overview of the number of PMs from PRACE-4IP assigned to the projects per Cut-off. In total 11 PMs were made available to these projects. Finally, Figure 3 provides an overview of the scientific fields which are covered by the supported projects in PRACE-4IP.

Figure 3: Number of projects per scientific field.

2.2 Review Process

The organization of the review procedure, the assignment of PRACE collaborators and the supervision of the PA C projects are managed by task 7.1.A. In this section, the review process for the preparatory access proposals of Type C is explained.

All preparatory access proposals undergo a technical review performed by technical staff of the hosting sites to ensure that the underlying codes are in principle able to run on the requested system. In parallel, all projects are additionally reviewed by Work Package 7 in order to assess their optimization requests. Each proposal is assigned to two WP7 reviewers. The review is performed by PRACE partners who all have a strong background in supercomputing. Currently a list of 24 experts is maintained, and the task leader has the responsibility to contact them to launch the review process.

As the procedure of reviewing proposals and establishing the collaboration between submitted projects and PRACE experts takes place only four times a year, it is necessary to keep the review process swift and efficient. A close collaboration between PRACE aisbl, T7.1.A and the hosting sites is important in this context. The process for both the technical and the WP7 review is limited to two weeks. In close collaboration with PRACE aisbl and the hosting sites, the whole procedure from PA Cut-off to project start on PRACE supercomputing systems is completed in less than six weeks.

Based on the proposals, the Type C reviewers need to focus on the following aspects:

• Does the project require support for achieving production runs on the chosen architecture?
• Are the performance problems and their underlying reasons well understood by the applicant?
• Is the amount of support requested reasonable for the proposed goals?
• Will the code optimisation be useful to a broader community, and is it possible to integrate the results achieved during the project into the main release of the code(s)?
• Will there be restrictions in disseminating the results achieved during the project?

Additionally, the task leader evaluates whether the level and type of support requested is still available within PRACE. Finally, the recommendation from WP7 to accept or reject the proposal is made. Based on the information provided by the reviewers, the Board of Directors makes the final decision on whether proposals are approved or rejected. The outcome is communicated to the applicant through PRACE aisbl. Approved proposals receive the contact data of the assigned PRACE collaborators; rejected projects are provided with further advice on how to address the shortcomings.
2.3 Assigning of PRACE collaborators

To ensure the success of the projects it is essential to assign suitable experts from the PRACE project. Based on the optimization issues and support requests described in the proposal, experts are chosen who are most familiar with the subject matter. This is done in two steps: First, summaries of the proposals describing the main optimization issues are distributed via the corresponding mailing lists. Here, personal data is explicitly removed from the reports to maintain the anonymity of the applicants. Interested experts can get in touch with the task leader, offering to work on one or more projects. Should the response not be sufficient to cover the support requirements of the projects, the task leader contacts the experts directly and asks them to contribute. There is one exception to this procedure, when a proposal has a close connection to a PRACE site which has already worked on the code: in this case this site is asked first whether it is able to extend the collaboration in the context of the new PA C project.

This procedure has proven to be extremely successful. No proposals had to be rejected in the past reporting period due to a lack of available support. The assignment of PRACE experts takes place concurrently with the review process so that the entire review can be completed within six weeks. This has proven to be a suitable approach, as the resulting overhead is negligible. As soon as the review process is finished, the support experts are introduced to the PIs and can start the work on the projects.

The role of the PRACE collaborator includes the following tasks:

• Formulating a detailed work plan together with the applicant,
• Participating in the optimization work,
• Reporting the status in the task 7.1.A phone conference every second month,
• Participating in the writing of the final report together with the PI (the PI has the main responsibility for this report), due at project end and requested by the PRACE office,
• Writing a white paper containing the results, which is published on the PRACE web site.

2.4 Monitoring of projects

Task 7.1.A includes the supervision of the Type C projects. This is challenging as the projects' durations (six months) and the intervals of the Cut-offs (three months) are not cleanly aligned. Due to this, projects do not necessarily start and end at the same time but overlap, i.e. at each point in time different projects might be in different phases. To solve this problem, a phone conference takes place in task 7.1.A every two months to discuss the status of running projects, to advise on how to proceed with new projects and to manage the finalization and reporting of finished projects. In addition, the T7.1.A task leader gives a status overview in a monthly WP7 conference call to address all PRACE collaborators who are involved in these projects. All project-relevant information is maintained on a PRACE wiki page, which is available to all PRACE collaborators. The T7.1.A task leader is also available to address urgent problems, and additional phone conferences are held in such cases. Twice a year, a WP7 face-to-face meeting is scheduled. This meeting gives all involved collaborators the opportunity to discuss the status of the projects and to exchange their experience.
2.5 Hand-over between PRACE-3IP and PRACE-4IP PA type C projects

The support for Preparatory Access Type C projects has been and is part of all PRACE projects (PRACE-1IP, -2IP, -3IP, -4IP). For the hand-over between the projects, the tasks decided to treat the affected projects in the following way:

The hand-over between the extension phase of PRACE-3IP and PRACE-4IP PA type C projects took place at the beginning of PRACE-4IP, February 1st, 2015. The Cut-offs which took place in June, September and December 2014 were still under the responsibility of T7.1 in PRACE-3IP. Projects out of the June 2014 cut-off ran until February 1st, 2015. These projects could be finalized within the context of the PRACE-3IP extension phase but could not be finally reported in the final deliverable D7.1.3 [2]. The project out of the September 2014 cut-off ran until April 30th, 2015 and was completely supported by PRACE-3IP, but could not be finally reported in the final PRACE-3IP deliverable. The project out of the December 2014 cut-off started on February 16th, 2015 and was completely supported by PRACE-4IP. Thus, no hand-over of ongoing projects was needed.

Figure 4: Timeline of the PA C projects.

The timeline of these projects is shown in the Gantt chart in Figure 4. The chart shows the time span of each project. Projects which were supported by PRACE-3IP but are reported in this deliverable are shown in red; PRACE-4IP projects are shown in green. The slightly different starting dates of the projects per Cut-off are the result of decisions made by the hosting members, which determine the exact start of the projects at their local sites. Additionally, PIs can set the starting date of their projects within a limited time frame. The final results of all these projects are described in this deliverable.

2.6 PRACE Preparatory Access type C projects covered by this report

Projects from Cut-off June 2014 and Cut-off September 2014 have their origin in the PRACE-3IP extension phase and were also finalized as part of this phase. Because of the overlap between the creation of the final deliverable D7.1.3 of the PRACE-3IP extension phase [2] and the creation of the corresponding final reports, these projects could not be reported in D7.1.3. For completeness, their results are reported in this deliverable. Table 1 lists the corresponding projects.
Cut-off June 2014

Title: Numerical modelling of the interaction of light waves with nanostructures using a high order discontinuous finite element method
Project leader: Stéphane Lanteri
PRACE expert: Gabriel Hautreux, Tristan Cabel
PRACE facility: CURIE TN, CURIE FN
PA number: 2010PA2452
Project's start: 15-Jul-2014
Project's end: 15-Jan-2015

Title: Large scale parallelized 3d mesoscopic simulations of the mechanical response to shear in disordered media
Project leader: Kirsten Martens
PRACE expert: Dimitris Dellis
PRACE facility: CURIE TN
PA number: 2010PA2457
Project's start: 01-Aug-2014
Project's end: 01-Feb-2015

Title: PICCANTE: an open source particle-in-cell code for advanced simulations on tier-0 systems
Project leader: Andrea Macchi
PRACE expert: Volker Weinberg
PRACE facility: FERMI, JUQUEEN
PA number: 2010PA2458
Project's start: 15-Jul-2014
Project's end: 15-Jan-2015

Title: OpenFOAM capability for industrial large scale computation of the multiphase flow of future automotive component: step 2
Project leader: Jerome Helie
PRACE expert: Gabriel Hautreux, Bertrand Cirou
PRACE facility: CURIE TN
PA number: 2010PA2431
Project's start: 01-Aug-2014
Project's end: 01-Feb-2015

Cut-off September 2014

Title: Parallel subdomain coupling for non-matching mesh problems in ALYA
Project leader: Guillaume Houzeaux
PRACE expert: Juan Carlos Caja
PRACE facility: MARENOSTRUM, FERMI
PA number: 2010PA2486
Project's start: 01-Nov-2014
Project's end: 30-Apr-2015

Table 1: Projects which were established and finalized in the PRACE-3IP extension phase but had to be finally reported in this deliverable.

Projects from Cut-off December 2014 also have their origin within the PRACE-3IP extension phase. However, due to the project start in February (see Section 2.5), the corresponding project was completely handed over to PRACE-4IP T7.1.A. Table 2 lists the key information of the corresponding project.

Cut-off December 2014

Title: Numerical simulation of complex turbulent flows with Discontinuous Galerkin method
Project leader: Antonella Abbà
PRACE expert: Andrew Emerson
PRACE facility: MARENOSTRUM, FERMI, HORNET
PA number: 2010PA2737
Project's start: 16-Feb-2015
Project's end: 15-Aug-2015

Table 2: Projects which were established in the PRACE-3IP extension phase but were supported by PRACE-4IP T7.1.A.

All remaining Cut-offs, starting with the Cut-off in March 2015, take place within the PRACE-4IP project phase and were supported by PRACE-4IP. Projects which were established in the Cut-off in December 2015 or the Cut-off in March 2016 are still in progress. Therefore, the final results cannot be presented in this deliverable but will be reported in a later deliverable. Table 3 lists all of these projects.

Cut-off March 2015

Title: Large Eddy Simulation of unsteady gravity currents and implications for mixing
Project leader: Claudia Adduce
PRACE expert: John Donners
PRACE facility: MARENOSTRUM, FERMI
PA number: 2010PA2821
Project's start: 20-Apr-2015
Project's end: 14-Nov-2015

Cut-off December 2015

Title: Optimization of Hybrid Molecular Dynamics-Self Consistent Field OCCAM CODE
Project leader: Antonio De Nicola
PRACE expert: Chandan Basu
PRACE facility: MARENOSTRUM, FERMI, HAZEL HEN
PA number: 2010PA3125
Project's start: 01-Feb-2016
Project's end: 31-Jul-2016

Title: HOVE Higher-Order finite-Volume unstructured code Enhancement for compressible turbulent flows
Project leader: Claudia Adduce
PRACE expert: Thomas Ponweiser
PRACE facility: SUPERMUC, HAZEL HEN
PA number: 2010PA3056
Project's start: 01-Feb-2016
Project's end: 31-Jul-2016

Table 3: Projects which were established in PRACE-4IP.

The evaluation of the March 2016 proposals is currently in progress; these proposals are therefore not listed here.

2.7 Dissemination

New PA Cut-offs are normally announced on the PRACE website [1]. After the low number of new proposals in the Cut-offs in June 2015 and September 2015, PRACE sites were asked to distribute an email to their users to advertise preparatory access and especially the possibility of dedicated support via PA C. Each successfully completed project should be made known to the public and therefore the PRACE collaborators are asked to write a white paper about the optimization work carried out. These white papers are published on the PRACE web page [3] and are also referenced by this deliverable.

2.8 Cut-off June 2014

This section and the following sub-sections describe the optimizations performed in the Preparatory Access type C projects. The projects are listed in accordance with the Cut-off dates at which they appeared. General information regarding the optimization work done as well as the achieved results is presented here using the recommended evaluation form. The application evaluation form ensures a consistent and coherent presentation of all projects which were managed in the context of PA C. Additionally, the white papers created by these projects are referenced so that the interested reader is provided with further information.

2.8.1 Numerical modeling of the interaction of light waves with nanostructures using a high order discontinuous finite element method, 2010PA2452

Code general features

Name: DIOGENeS – DIscOntinuous GalErkin Nano Solver
Scientific field: Computational nanophotonics
Short code description: The code is a high order finite element type solver for the numerical modeling of light interaction with nanometer scale structures. From the mathematical modeling point of view, one has to deal with the differential system of Maxwell equations in the time domain, coupled to an appropriate differential model of the behavior of the underlying material (which can be a dielectric and/or a noble metal) at optical frequencies. For the numerical solution of the resulting system of differential equations, the code implements a high order discontinuous finite element method (DGTD – Discontinuous Galerkin Time-Domain solver) which has been adapted to hybrid MIMD/SIMD computing in the context of the present project.
Programming language: Fortran 90
Supported compilers: pgf90, ifort, g95
Parallel implementation: The hybrid MIMD/SIMD parallelization in DIOGENeS combines the use of the MPI and OpenMP parallel programming models
Accelerator support: Not yet
Libraries: MPI
Building procedure: Standard Makefile
Web site: Not available yet
Licence: None

Topic 1

Main objectives
The main objectives of the project were to implement the OpenMP part of the hybrid MIMD/SIMD parallelization strategy of the code, and to demonstrate the benefits of this parallelization on the scaling properties of the code.
Accomplished work
The starting point software implementing the DGTD solver was programmed in Fortran and parallelized for distributed memory systems using an SPMD strategy combining a partitioning of the underlying tetrahedral mesh with a message passing programming model based on the MPI standard. During the project, the effort was put on introducing a fine-grain parallelization of the main loops of the solver using the OpenMP programming model. The specific feature to be optimized was thus the fine-grain shared memory parallelization of the underlying discontinuous finite element solver. The associated computer code has been equipped with OpenMP directives for the intra-node parallelization of the loops over the elements (tetrahedra) of the meshes, which occur in the core solver routines for updating the electric and magnetic field components at each time step. Simple DO PRIVATE constructs turned out to be sufficient to achieve acceptable speedups on the thin nodes of the Curie system. In order to further improve the scalability of the overall MPI/OpenMP solver on multi-node configurations of Curie, we also implemented a non-blocking communication protocol for exchanging the values of the electromagnetic field components attached to elements on each side of the faces located on the interfaces (surfaces) between neighbouring sub-meshes of the mesh partitioning. This protocol allows point-to-point communication operations to be partially overlapped with computation when updating the electric and magnetic field components.
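The essence of this hybrid strategy can be illustrated with a short, self-contained sketch. It is not the DIOGENeS source: the array names, problem sizes, the placeholder update kernel and the ring-shaped neighbour pattern are illustrative assumptions, and the real solver exchanges interface data with all neighbouring sub-meshes of its partition. The sketch combines the two ingredients described above, an OpenMP DO PRIVATE loop over the elements and a non-blocking exchange of interface values that is overlapped with the update of the interior elements.

! Minimal sketch (not the DIOGENeS source) of an OpenMP DO PRIVATE element loop
! combined with a non-blocking halo exchange that overlaps communication with
! the update of the interior elements. Names, sizes, the placeholder kernel and
! the ring neighbour pattern are illustrative assumptions only.
program hybrid_update_sketch
  use mpi
  implicit none
  integer, parameter :: ne_int = 100000      ! interior tetrahedra (placeholder)
  integer, parameter :: ne_ifc = 1000        ! tetrahedra on partition interfaces
  integer, parameter :: ndof   = 10          ! nodes per element (e.g. P2)
  real(8), allocatable :: e_fld(:,:), h_fld(:,:), halo_send(:), halo_recv(:)
  integer :: ierr, rank, nprocs, next, prev, req(2), ie

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  allocate(e_fld(ndof, ne_int+ne_ifc), h_fld(ndof, ne_int+ne_ifc))
  allocate(halo_send(ndof*ne_ifc), halo_recv(ndof*ne_ifc))
  e_fld = 0.0d0; h_fld = 0.0d0; halo_send = 0.0d0
  next = mod(rank+1, nprocs)                 ! ring pattern for illustration only
  prev = mod(rank-1+nprocs, nprocs)

  ! 1) Post the non-blocking exchange of the interface field values first ...
  call MPI_Irecv(halo_recv, size(halo_recv), MPI_DOUBLE_PRECISION, prev, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(halo_send, size(halo_send), MPI_DOUBLE_PRECISION, next, 0, &
                 MPI_COMM_WORLD, req(2), ierr)

  ! 2) ... then update the interior elements while the messages are in flight.
  !$OMP PARALLEL DO PRIVATE(ie)
  do ie = 1, ne_int
     e_fld(:, ie) = e_fld(:, ie) + 0.5d0 * h_fld(:, ie)   ! placeholder kernel
  end do
  !$OMP END PARALLEL DO

  ! 3) Wait for the halo data, then update the elements on the interfaces.
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
  !$OMP PARALLEL DO PRIVATE(ie)
  do ie = ne_int + 1, ne_int + ne_ifc
     e_fld(:, ie) = e_fld(:, ie) + 0.5d0 * h_fld(:, ie)   ! placeholder kernel
  end do
  !$OMP END PARALLEL DO

  call MPI_Finalize(ierr)
end program hybrid_update_sketch

In a real partitioned mesh the same pattern generalizes to one MPI_Isend/MPI_Irecv pair per neighbouring sub-mesh, with a single MPI_Waitall covering all outstanding requests before the interface elements are updated.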
Plots of the parallel speedup of the DGTD solver with P2 (top), P3 (middle) and P4 (bottom) interpolation are shown on Figure 6. The maximum number of cores that has been exploited is 8192 for a simulation based on the DGTD-P4 method. A quasi-ideal scaling is obtained up to 1024 cores. Achieving a better parallel speedup for a number of MPI processes greater than 1024 would probably require a finer tetrahedral mesh (with several million mesh vertices). PRACE-4IP – EINFRA-653838 11 15.04.2016 D7.1 Periodic Report on Applications Enabling Figure 6: Strong scalability analysis of the DGTD solver with P2 (top), P3 (middle) and P4 (bottom) interpolation. Number of cores 128 256 512 1024 Wall clock time 4066 sec 1972 sec 952 sec 462 sec Speed-up vs the first one 1.00 2.00 4.05 8.40 Number of Nodes 8 16 32 64 Number of MPI processes 16 32 64 128 Table 4: Strong scaling of the DGTD solver with P2 interpolation (on each node, we spawn two MPI process and eight OpenMP threads per MPI process). Number of cores 512 1024 2048 Wall clock time 2580 sec 1271 sec 646 sec Speed-up vs the first one 1.00 2.00 4.00 Number of Nodes 32 64 128 Number of MPI processes 64 128 256 Table 5: Strong scaling of the DGTD solver with P3 interpolation (on each node, we spawn two MPI process and eight OpenMP threads per MPI process). Number of cores 2048 4096 8192 Wall clock time 1714 sec 897 sec 529 sec Speed-up vs the first one 1.00 1.90 3.25 Number of Nodes 128 256 512 Number of MPI processes 256 512 1024 Table 6: Strong scaling of the DGTD solver with P4 interpolation (on each node, we spawn two MPI process and eight OpenMP threads per MPI process). The obtained results in terms of parallel performances are perfectly in line with our expectations and clearly demonstrate the potential of the high order discontinuous finite element solver that we are studying for exascale class simulations. In particular, these results open the route for possible future work toward the numerical treatment of more complex physical models relevant to nanophotonics which will require the use of higher resolution discretization meshes involving much more structures than what has been considered in the application selected for this project. The project also published a white paper, which can be found online under [4]. PRACE-4IP – EINFRA-653838 12 15.04.2016 D7.1 Periodic Report on Applications Enabling 2.8.2 Large scale parallelized 3d mesoscopic simulations of the mechanical response to shear in disordered media, 2010PA2457 Code general features Name ELASTO Scientific field Modeling the athermal shear response of 3d dense disordered material Short code description This code evolves the equations for a 3d lattice model for disordered systems under shear, which contain a stochastic local yielding part and a deterministic one for the long range elastic response, containing a convolution that is resolved in Fourier space. The parallelisation of the 3d FFT is divided in three steps. For clarity the x and y directions are considered parallelized and the z one the nonparallelized. First of all data are reorganized such that x is transposed with the z direction. Then each processor performs along the x direction multiple uni-dimensional FFT (multi-1DFFT). In the second step the y direction is transposed with the x one. Then multi1DFFT along the non-parallelized direction are performed. Finally, z is transposed with the y and the algorithm is repeated. 
Programming language FORTRAN 90 / MPI Supported compilers intel/14.0.3.174, mkl/14.0.3.174 and bullxmpi/1.2.8.2 Parallel implementation MPI (bullxmpi/1.2.8.2) Accelerator support N/A Libraries HDF5, MKL, FFTW3 Building procedure Makefile (see attachment) Web site N/A Licence N/A
Topic 1
Main objectives The aim of this project was to resolve serious scaling problems of the ELASTO code experienced on the PI's local cluster, named Froggy. Initial runs were performed on Froggy with problem sizes 64³, 256³ and 512³. In all of these runs the code slowed down when more than 16 cores (1 node) were used.
Accomplished work We can state that we enhanced the portability of the code from the Curie cluster to the local Ciment cluster "Froggy". After switching to the Intel compilers and Intel MPI, using the compiler flags that were used on Curie, recompiling the FFTW3 library and applying the minor code changes on Froggy, the performance and scaling of the code on Froggy is close to the performance and scaling on the Curie thin nodes. Note that for the large problem size of 1024³ on Curie the code scales almost linearly up to 4096 cores. PRACE-4IP – EINFRA-653838 13 15.04.2016 D7.1 Periodic Report on Applications Enabling
Scaling results For the small problem size of 64³ the slowdown already appears when going from 8 to 16 cores within a single node. The code was compiled and run on Curie without code changes, using the compilers and flags suggested in the PRACE Best Practice Guides. BullX MPI with the Intel compilers was used, and the FFTW3 library available on Curie provided the FFTW3 functions. The selected compiler optimization flags were: -O3 -xAVX -unroll -unroll-aggressive -ip
Figure 7: The inverse average iteration time of the initial ELASTO code, using up to 32 cores, as a function of the number of cores on Froggy and Curie, for system sizes 64³, 256³, 512³ and 1024³
The left side of Figure 7 shows the inverse average iteration time of the initial ELASTO code, using up to 32 cores, as a function of the number of cores on Froggy and Curie, for system sizes 64³, 256³ and 512³: depicted are the initial performance of the code on Froggy and on Curie, together with the performance on Froggy after applying the minor code, compiler and flag changes. The right side of the figure shows the inverse average iteration time of the ELASTO code as a function of the number of cores on Curie, for system sizes 64³, 256³, 512³ and 1024³, before and after applying the minor code changes. The main hardware difference between the two machines is the network interface: Curie uses QDR while Froggy uses FDR. Their CPUs are similar: E5-2680 on Curie, E5-2670 on Froggy. Surprisingly, on Curie the scaling is much better even with no code changes when using up to two nodes. Since the code appears to scale better on Curie, a number of runs with more than 32 cores were performed on Curie; these results are presented on the right side of Figure 7. There is a large discrepancy between the performance and scaling of the same runs on the two machines. On Curie the scaling of the code looks like a typical case, whereas on Froggy going from one to two nodes leads to a slowdown. In addition, for single-node runs, i.e. up to 16 cores, the performance on Froggy is much lower than on Curie.
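As a side note, the per-iteration timings discussed above are wall-clock measurements. A minimal sketch of how such an average iteration time could be collected with MPI_Wtime() is given below; the actual ELASTO code is Fortran, and the loop body and step count here are illustrative placeholders. The same point underlies the cpu_time()-to-MPI_Wtime() substitution described in the profiling work that follows: MPI_Wtime() measures elapsed time, which is the relevant quantity when processes are imbalanced.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nsteps = 100;           // number of iterations to time (illustrative)
    MPI_Barrier(MPI_COMM_WORLD);      // synchronise before starting the clock
    const double t0 = MPI_Wtime();

    for (int step = 0; step < nsteps; ++step) {
        // one ELASTO-like iteration would go here:
        // stochastic local yielding update + FFT-based elastic propagator
    }

    MPI_Barrier(MPI_COMM_WORLD);
    const double elapsed = MPI_Wtime() - t0;  // elapsed wall-clock time, not CPU time

    if (rank == 0) {
        const double avg = elapsed / nsteps;
        // the "inverse average iteration time" is the quantity plotted in Figure 7
        std::printf("average iteration time %.6f s, rate %.3f iterations/s\n",
                    avg, 1.0 / avg);
    }
    MPI_Finalize();
    return 0;
}
```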
These two findings (the slowdown when going from one node to two on Froggy, and the much lower single-node performance there) suggest that there is some problem on the PI's cluster related to the network, the batch system environment, the MPI implementation, etc. Inspecting the software in use on Froggy, we found that: PRACE-4IP – EINFRA-653838 14 15.04.2016 D7.1 Periodic Report on Applications Enabling
1. the MPI implementation is openmpi-1.6.4 with openib support, compiled with and used with the GNU-4.6.2 compilers;
2. fftw-3.3 is compiled with GNU-4.6.2;
3. Intel MPI and compiler version 13.0.1 are available on Froggy.
Inspecting the code, a number of non-crucial changes were applied. Initially all timings were performed using the Fortran cpu_time() function. All occurrences of cpu_time() calls were replaced with MPI_Wtime() calls to obtain an accurate measure of the elapsed time, since cpu_time() measures only the CPU time of a process and is not accurate in the case of load imbalance.
Profiling
Profiling of the code was performed using Scalasca and mpiP. In addition, some run-time variables were examined by inserting pieces of code at certain points. From the profiling runs, the percentage of run time spent in MPI calls as well as the MPI message sizes were collected. These measurements are presented in Figure 8.
Figure 8: The percentage of time spent in MPI calls (left) and the average MPI message size during the run, as a function of the number of cores on Curie, for system sizes 64³, 256³ and 512³.
The main conclusions from the profiling of the code on Curie are summarized below.
1. The code uses only a few MPI functions: a few MPI_Allreduce and MPI_Bcast calls and mainly point-to-point send/recv calls. When the code periodically saves the trajectory, for example every 1000 iterations, it uses the HDF5 library, which also issues MPI calls; these HDF5-originated calls were not profiled here.
2. During a multistep run, the first iteration takes more time to complete than the rest of the iterations. This should be taken into account, especially when running a small number of steps.
3. During execution, on some processes and for small numbers of iterations, the FFTW plane size is not identical on all processes, although the deviation is not large (±1-3). This introduces a small load imbalance.
4. The scalability of the code depends on the problem size. As shown in Figure 7, for a problem size of 1024³ the performance is almost linear up to 4096 cores. For smaller problem sizes, the speed-up starts to decrease beyond a certain number of cores.
5. The percentage of time spent in MPI calls during the run was measured, indicating that the communication time increases with an increasing number of cores.
6. The average MPI message size decreases with an increasing number of cores; the network performance (bandwidth/latency) depends on the message size.
The project also published a white paper, which can be found online under [5]. PRACE-4IP – EINFRA-653838 15 15.04.2016 D7.1 Periodic Report on Applications Enabling 2.8.3 PICCANTE: an open source particle-in-cell code for advanced simulations on tier-0 systems, 2010PA2458 Code general features Name piccante Scientific field Plasma Physics Short code description piccante is an open source, massively parallel, fully-relativistic Particle-In-Cell (PIC) code. PIC codes are widely used in plasma physics and astrophysics to study problems where kinetic effects are relevant. A PIC code integrates in time the Maxwell-Vlasov equations. The electromagnetic field equations are solved on a grid, while the plasma distribution function is sampled with numerical particles. piccante is primarily developed to study laser-plasma interaction.
The code allows to design a simulation with great flexibility (i.e. multiple laser pulses, arbitrary target geometry, mutiple particle species...). Programming language C++ Supported compilers successfully compiled on GNU compiler (g++ v 4.8.2) and IBM compiler (Blue Gene, XL v12.1) Parallel implementation yes, using mainly MPI (openMP is used in some code sections) Accelerator support no Libraries STL, GSL, BOOST (optional), jsoncpp (optional) Building procedure Makefile. Script for aided compilation on FERMI and JUQUEEN Web site http://aladyn.github.io/piccante/ Licence GNU GPLv3 Topic 1 Main objectives Code scaling and profiling tests Accomplished work: Strong scaling and weak scaling tests were performed on FERMI and JUQUEEN, using a typical simulation case. Output routines were disabled for these tests. Scaling was tested on up to 16384 computing cores. Scalasca and Vampir were used for detailed profiling of the code. Main results: A detailed profiling of the code allowed identifying the most time-consuming routines of the code as a function of the task number. PRACE-4IP – EINFRA-653838 16 15.04.2016 D7.1 Periodic Report on Applications Enabling The Scalasca analysis at the beginning of the project showed that the core routines of the code (everything but the output routines) scaled well on up to 16384 MPI tasks (the largest tested configuration). #cores Strong Scaling (sec/step) Strong Scaling (% ideal scal.) Weak Scaling (sec/step) Weak Scaling (% ideal scal.) 512 -- -- 3.785 100% 1024 7.538 100% 3.777 100% 2048 3.777 99.8% 3.777 100% 4096 1.885 100% 3.769 100% 8192 0.946 99.6% 3.769 100% 16384 0.485 97.2% 3.769 100% Table 7: Scalasca analysis of the piccante core routines. However, the profiling highlighted a very bad behaviour of the output routines. While showing a decent performance on a small number of MPI tasks (1024), they were requiring a major fraction of the computing time on larger numbers of MPI tasks. This guided us to spend most of our efforts on this problem (which was successfully addressed, see Topic 2 for details). At the end of the optimization work, piccante with all its function enabled, including very demanding output routines, proved to scale very well up to the maximum tested number of MPI_tasks (16384). See Figure 9 below showing the top ten most time consuming routines for strong and weak scaling tests, before and after the optimization work. Figure 9: Most time consuming routines for strong (top) and weak (bottom) scaling tests, before (left) and after (right) the optimization work. PRACE-4IP – EINFRA-653838 17 15.04.2016 D7.1 Periodic Report on Applications Enabling Topic 2 Main objectives: Output performance Accomplished work: When using more than 2048 MPI Tasks the usual MPI-IO routines (MPI_File_write) dramatically slow down the output process up to an unacceptable level on the BlueGene architecture. Two main routines were changed: grid-based output (electromagnetic fields, charge density and current density, fields in the followings) and particles coordinates (particle output in the following). Several new parallel output strategies were developed and tested: 1. Collective call to mpi_file_write_all: allowed for a good improvement: up to 5-10 times for grid-based and 5-10 times for particles (problem dependent). The scaling remained rather bad: linear increase of the output time with the number of MPI_tasks for a strong scenario. 2. 
Several files, one for each MPI_task: decent speedup for small number of MPI-tasks but extremely large number of output files (very difficult handling). 3. One file written by a limited (GroupSize = 32-128) number of “sub-master” MPI tasks (one every sub-group, with the other tasks sending their buffer to the “submaster”): improvement for fields of a factor up to 25 and up to 10 for particles (mainly limited by the size of the file). GroupSize was varied to find the best values for the architecture. The scaling with the number of MPI_Tasks was not ideal: sublinear in a strong scaling test. 4. Several files (2-16) for each output-item, one every 1024 (or 2048) MPI_tasks. The speedup at small total number of MPI_tasks remained good (10-20 for fields, about 10 for particles). The scaling was very good: nearly constant output time for a weak scaling test and decreasing for a strong scaling test. Solution 4) was chosen as the new output strategy (see Figure 10). Figure 10: Old vs. new output strategy overview. PRACE-4IP – EINFRA-653838 18 15.04.2016 D7.1 Periodic Report on Applications Enabling Main results: We targeted at least 8192 and we tested up to 16384 cores. The output time for a 2048 MPI-tasks on 2048 cores was improved by a factor of: 15 for particles and 200 for fields. The output time for 4096 MPI-tasks on 4096 cores was improved by a factor of: 34 for particles and 180 for fields. The output time for 8192 MPI-tasks on 8192 cores was improved by a factor of: 40 for particles and 600 for fields. The amount of rewriting of the output routines was significant (~ 50%). Figure 11: Comparison of the old and the new output strategy for a strong scaling test. #MPI-tasks Particles files size (GB) 1024 2048 4096 8192 16384 24 48 96 192 384 Fields files size particles GB/sec fields GB/sec (GB) 0.42 1.45 0.60 0.83 1.78 1.11 1.66 3.48 1.81 3.33 4.13 2.41 6.66 8.12 4.66 Table 8: Maximum writing speed by using the new output strategy. The new output allowed reaching a satisfactory maximum writing speed (total aggregated writing speed on 16384 MPI-Tasks): more than 8 GB/s for the particles files and 4.66 GB/s for the much smaller field files. Topic 3 Main objectives: Improved output strategies Accomplished work: The output routines were extensively rewritten. The “output manager” now allows to define a greater variety of output requests. In particular, it is now possible to produce reduced-dimension outputs (e.g. a plane in a 3D simulation or a PRACE-4IP – EINFRA-653838 19 15.04.2016 D7.1 Periodic Report on Applications Enabling 1D subset in a multidimensional run) and subdomain outputs (e.g. a small box in a large 3D simulation). The list of output requests can also be controlled in time: a given output can be active within a time interval with a given frequency, independently from the other requests. Main results: Several new output functions are now available. These functions allow a more flexible output strategy (both in time and space). Much smaller total output size can be achieved performing the output of only the relevant regions of the simulation when they are of interest (e.g. output frequency can be increased to better resolve crucial processes). This allows to increase the accuracy of the output data (time and spatial resolution) limiting the required disk space. Topic 4 Main objectives: Add input-file support Accomplished work: piccante is designed as a library. Thus, the user was originally required to write or modify the “main-1.cpp” file and re-compile the code for each simulation. 
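For illustration, a minimal sketch of the kind of JSON-driven setup adopted in the project (and described next) is given here. It uses the jsoncpp library referenced in the report; the input file name and the parameter keys are purely illustrative assumptions and do not reproduce the actual piccante input format.

```cpp
#include <json/json.h>   // jsoncpp
#include <fstream>
#include <iostream>

int main()
{
    // Read the whole input file into a Json::Value tree.
    std::ifstream in("simulation.json");   // illustrative file name
    if (!in) { std::cerr << "cannot open input file\n"; return 1; }

    Json::Value root;
    in >> root;                            // jsoncpp stream parsing

    // Pull a few hypothetical simulation parameters, with defaults.
    const int    nx     = root.get("nx", 256).asInt();
    const int    ny     = root.get("ny", 256).asInt();
    const double dt     = root.get("dt", 0.01).asDouble();
    const int    nsteps = root.get("nsteps", 1000).asInt();

    std::cout << "grid " << nx << "x" << ny
              << ", dt = " << dt << ", steps = " << nsteps << "\n";

    // The simulation objects (grid, particle species, laser pulses, output
    // requests) would then be built from these values instead of being
    // hard-coded in main-1.cpp and recompiled for every run.
    return 0;
}
```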
New functions were introduced to initialise a simulation by reading a JSON input file. A new main file was designed to allow a typical simulation setup without any code editing. Support for JSON parsing is provided by the library "jsoncpp", which is licensed as "Public Domain" (or under the MIT license in some jurisdictions): https://github.com/open-source-parsers/jsoncpp.
Main results: Support for simulation setup via an input file was successfully added. All the typical simulation parameters can now be controlled with a user-friendly and easy-to-read JSON input file.
Topic 5
Main objectives: Memory allocation strategies
Accomplished work: Particle coordinates are stored in large arrays (7 double variables are stored per particle: x,y,z,px,py,pz,w). We tested two main allocation strategies: [x1,y1,...,w1,x2,y2,...,w2,x3,...] and [x1,x2,...,xN,y1,y2,...,yN,...]. On an Intel Linux cluster one strategy proved to be slightly better, while on BlueGene/Q the differences were minimal.
Main results: We tested a few memory allocation strategies for the particles. The best allocation strategy is enabled by default in the code. This allows a slight performance gain on some architectures. PRACE-4IP – EINFRA-653838 20 15.04.2016 D7.1 Periodic Report on Applications Enabling
Topic 6
Main objectives: Hybridization (MPI+OpenMP) of the code
Accomplished work: We managed to introduce MPI+OpenMP hybridization for the electromagnetic solver and this provided a slight performance enhancement. We tested a similar improvement for the particle solver, but our preliminary results were unsatisfactory. We suspended the development of this feature to concentrate our efforts on more urgent topics (e.g. output optimization).
Main results: Only a limited hybridization was performed, which provides a slight performance gain in some code sections. This feature is temporarily disabled in the code.
Topic 7
Main objectives: Code refactoring
Accomplished work: Together with the development of new functions, a consistent code refactoring was pursued to ease maintenance.
Main results: Long and complicated functions were split into smaller and simpler ones. Unused or obsolete functions were deleted from the source files.
The project also published a white paper, which can be found online under [6].
2.8.4 OpenFOAM capability for industrial large scale computation of the multiphase flow of future automotive component: step 2., 2010PA2431
Due to an unexpectedly long administrative procedure for one of the project members to be accepted as a user on the compute platforms, project PA2431 only ran for two weeks. Because of this, not enough results could be produced for publication in a final report. Within the remaining time, the PI chose an industrial test case and updated it with the support and advice of the PRACE expert. This test case needed an older version of OpenFOAM than the one installed on Curie; two older versions were compiled by the PRACE expert before the right one was found. In the end, the test case was run successfully. Critical points for progressing on the parallelization of OpenFOAM for realistic geometries and complex phase change were identified, but at that point the time ran out. The applicant will continue the work in a follow-up project, to be submitted again, and will plan with somewhat more secure lead times and resources on his side.
PRACE-4IP – EINFRA-653838 21 15.04.2016 D7.1 Periodic Report on Applications Enabling 2.9 Cut-off September 2014 2.9.1 Parallel subdomain coupling for non-matching mesh problems in ALYA, 2010PA2486 Code general features Name Alya Scientific field Multi-physics problems, fluid flow, structural dynamics, thermal flow Short code description Alya is a multi-physics code developed at the Barcelona Supercomputing Center. It is based on a finite element formulation and is structured using a modular architecture, organised in kernel, modules and services. The kernel contains the facilities required to solve any set of discretized partial differential equations, while the modules provide the physical description of a given problem. Programming language Fortran Supported compilers ifort, gfortran Parallel implementation mpi+openmp+ompss Accelerator support No Libraries metis, hdf5 Building procedure Makefile Web site http://bsccase02.bsc.es/alya/overview/ Licence Open source Topic 1 Main objectives: This work aimed to implement Domain Composition Methods at the algebraic level to couple subdomains with non-matching meshes in a distributed memory supercomputer environment for multi-physics problems. The coupling is performed at the algebraic level, thus, it is almost independent of the problem considered. This approach enables us to solve multi-domain and multi-physics problems, using both single and multi-code approaches. Accomplished work: We have implemented a strategy to couple subdomains with non-matching meshes for distributed memory supercomputers. The method can be explicit (multi code) or implicit (single code). The latter one was implemented using an MPI communicator splitting in order to set inter- and intra-subdomain communicator. This enabled us not to affect the original parallelization of the code. In addition, the methodology is currently being tested for multiphysics simulations, like fluid-structure interactions; fluid-particle coupling; contact problems; thermal flows coupled with conjugate heat transfer. Main results: The proposed methodology was tested in explicit (multi-code) and implicit (single code) approaches. The multi-code coupling doesn’t affect the scalability of each code, because the extra communications have more or less of the same cost as that of one matrix-vector product PRACE-4IP – EINFRA-653838 22 15.04.2016 D7.1 Periodic Report on Applications Enabling communication of the normal parallelization of the code, and it is performed just once each coupling iteration or time step. In the implicit approach, the extra communications is performed after each matrix-vector product, and the impact can be significant. In Figure 12, two subdomains to be coupled are shown, where the lines represent communicating parallel partitions of each subdomain. Figure 12: Two subdomains coupling and parallel partition. The lines show the connection between parallel partitions of the different subdomains. Figure 13 shows the relative cost of using the subdomain coupling with a fixed number of two hundred iterations of the GMRES solver with a Krylov space of size ten compared to the same case and configuration without subdomain coupling. The results show that the cost is feasible and it is more than compensated for the ability of address different physical problems in each subdomain. In addition, the same result can be expressed in terms of the speed up, as shown in Figure 14. 
Figure 13: Relative cost of using the subdomain coupling with a fixed number of two hundred iterations of the GMRES solver with a Krylov space of size ten. PRACE-4IP – EINFRA-653838 23 15.04.2016 D7.1 Periodic Report on Applications Enabling Figure 14: Speed up using the subdomain coupling with a fixed number of two hundred iterations of the GMRES solver with a Krylov space of size ten, and the same case and configuration in one subdomain. Test cases are being executed for multi-physics problems as Fluid-Structure Interaction problems, particle transport, contact problems and thermal flows, as shown in Figure 15 for single code case and Figure 16 for the multi-code case. Figure 15: Implicit coupling applied to the Navier-Stokes equations. (Left) Meshes (Right) Velocity and pressure. PRACE-4IP – EINFRA-653838 24 15.04.2016 D7.1 Periodic Report on Applications Enabling Figure 16: FSI benchmark. (Left) Geometry. (Right) Results. The project also published a white paper, which can be found online under [7]. 2.10 Cut-off December 2014 2.10.1 Numerical simulation of complex turbulent flows with Discontinuous Galerkin method, 2010PA2737 Code general features Name DG-comp Scientific field Engineering and Energy Short code description The numerical code DG-comp solves the Navier-Stokes equations for unsteady compressible turbulent flows. DG-comp is based on FEMilaro, an open-source finite element library. The equations are discretized in space using a Local Discontinuous Galerkin (LDG) method on tetrahedral elements. The equations are projected in a space of polynomial functions defined in each element of the computational grid. For the numerical fluxes the classical BassiRebay definition is adopted. For the time integration the fourth-order, the SSPRK scheme is used. Other time integration schemes are available: classical explicit Runge-Kutta scheme up to fourth-order, and matrix-free exponential time integrators. A variety of sgs models for LES [8] and hybrid RANS/LES models [9] are implemented in the code. Programming Fortran language Supported compilers On HORNET: Cray Fortran 2.4.0 On FERMI: bgq-xl 1.0 On MARENOSTRUM: Intel 13.1, Gnu fortran 6.0 Parallel implementation MPI Accelerator support Libraries Fortran 90-2008 language, parallel HDF5 I/O library Building procedure Makefile with possibility of parallel compiling Web site http://code.google.com/p/femilaro/ Licence GPL3 PRACE-4IP – EINFRA-653838 25 15.04.2016 D7.1 Periodic Report on Applications Enabling Topic 1 Main objectives Scalability tests on several Tier-0 platforms. Accomplished work Strong scalability tests have been performed on the Tier-0 platforms FERMI, HORNET and MareNostrum. Main results The strong scalability analysis has been performed for the turbulent channel flow simulation with a sub grid anisotropic model for LES [8] using 6912 tetrahedral elements and a fourth order degree polynomial. The wallclock times have been evaluated for the time advancing computation, neglecting time for initialization procedures, input and output. Scaling analysis has been carried out with a range of processor cores varying from 1024 to 16384 on FERMI, with 24 up to 1536 cores on HORNET and 16 to 2048 cores on MareNostrum. For each platform, the speedup is normalized by the speedup obtained with the lowest number of cores. 
In the following figures, the scaling obtained with the numerical code on FERMI and on HORNET are compared with the linear optimum trend while in the tables Table 9 - Table 11 the details of the scaling tests on the three platforms are reported. Figure 17: Speedup on FERMI. The values are normalized by the speedup with 1024 cores. PRACE-4IP – EINFRA-653838 26 15.04.2016 D7.1 Periodic Report on Applications Enabling Figure 18: Speedup on HORNET. The values are normalized by the speedup with 24 cores. Number of cores 1024 2048 4096 8192 16384 Wall clock time [s] 8.1383 4.1183 2.0793 1.0675 0.5776 Speed-up vs the first one 1 1.9761 3.9139 7.6237 14.0901 Number of Nodes 64 128 256 512 1024 Number of processes 1024 2048 4096 8192 16384 Table 9: Scaling performances on FERMI Number of cores 24 48 96 192 384 768 1536 Wall clock time [s] 40.2157 20.7937 10.2777 5.2527 2.7128 1.3760 0.6596 Speed-up vs the first one 1 1.9340 3.9129 7.6562 14.8243 29.2265 60.9711 Number of Nodes 1 2 4 8 16 32 64 Number of processes 24 48 96 192 384 768 1536 Table 10: Scaling performances on HORNET Number of cores 16 32 64 128 256 512 Wall clock time [s] 52.5055 26.4568 13.3011 6.7066 3.3634 1.7149 PRACE-4IP – EINFRA-653838 Speed-up vs the first one 1 1.9846 3.9475 7.8289 15.6109 30.6179 27 Number of Nodes 1 2 4 8 16 32 Number of processes 16 32 64 128 256 512 15.04.2016 D7.1 Periodic Report on Applications Enabling Number of cores 1024 2048 Wall clock time [s] 0.8907 0.4802 Speed-up vs the first one 58.9492 109.3471 Number of Nodes 64 128 Number of processes 1024 2048 Table 11: Scaling performances on MareNostrum The results confirm the very good scalability properties of the numerical code. Actually, with a LDG method, most of the computations are local to each element. This means that, albeit the computational cost for DG methods is typically higher compared to other formulations, nevertheless DG methods lend themselves very well to parallel execution. Topic 2 Main objectives Optimization of the I-O strategies. Accomplished work Possible option to use HDF5 library to manage efficiently I/O. Main results During the simulations, the complete status of the computed solution is saved at several desired intermediate times. At each of these times, each process writes an output file for the associated domain partition; the file size depends on the dimension of the partition, on the grid resolution and on the amount of optional diagnostics quantities required. Every output file is compatible with Octave, which is used to perform post processing and data analysis (on a serial computer). During the present project, the I/O strategy has been improved and the use of the HDF5 library has been implemented. Without HDF5 10.81 10.82 10.81 HDF5 collective 11.77 11.49 5.57 5.78 5.17 4.8 14.8 13.3 5.9 8.22 8.01 9.95 Table 12: I/O writing time in sec with and without HDF5. In the Table 12 the I/O writing times (sec) with and without HDF5 are reported. The test has been carried out on Hornet with -O2 compiling option and 192 cores. In column 1 are the times for three different writing events in a simulation without HDF5. In columns 2-5 the times for three writing events in four different simulations with same settings but with collective HDF5. We can observe some data scattering but in average with HDF5 only 80% of the time without HDF5 is consumed. Furthermore, 63% of disk space is saved at each writing event (80 MB instead of 217 MB). 
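For illustration, the core of a collective parallel HDF5 write of this kind is sketched below. It is a generic pattern (HDF5 C API with the MPI-IO file driver and one hyperslab per rank) rather than the actual DG-comp implementation, which is written in Fortran; the file name, dataset name and local sizes are assumptions, and a parallel HDF5 build is required.

```cpp
#include <hdf5.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t nlocal = 1000;                    // values owned by this rank (illustrative)
    std::vector<double> data(nlocal, (double)rank); // this rank's part of the solution

    // Open one shared file with the MPI-IO driver.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("solution.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    // Global dataspace: nprocs * nlocal values, each rank owning one block.
    hsize_t gdims[1] = { nlocal * (hsize_t)nprocs };
    hid_t filespace = H5Screate_simple(1, gdims, nullptr);
    hid_t dset = H5Dcreate2(file, "fields", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // Select this rank's hyperslab in the file and describe the memory buffer.
    hsize_t offset[1] = { nlocal * (hsize_t)rank };
    hsize_t count[1]  = { nlocal };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
    hid_t memspace = H5Screate_simple(1, count, nullptr);

    // Collective write: all ranks take part in a single I/O operation.
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data.data());

    H5Pclose(dxpl);  H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset);  H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

Writing one shared, self-describing file in this way is what makes the collective-HDF5 variant both faster on average and more compact on disk than the per-process ASCII or raw output it replaces.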
Topic 3
Main objectives Code profiling in order to identify the points where the optimization effort should be directed.
Accomplished work Code profiling using the SCALASCA tools on FERMI has been carried out. PRACE-4IP – EINFRA-653838 28 15.04.2016 D7.1 Periodic Report on Applications Enabling
Main results A profiling analysis of the numerical code has been performed on FERMI using the SCALASCA tools. The code was compiled with xlf -O2. The analysis is limited to the main computation cycle. The SCALASCA analysis demonstrates that the computational effort is well distributed among the nodes. Figure 19 presents a synthesis of the profiling analysis.
Figure 19: Percentage of the time consumed in the main steps of the computations.
Almost all of the time is consumed in the computation of the right-hand-side terms of the equations. Within this, the integration over the element volume and the turbulence model are the most time-consuming procedures. Very little time is spent on diagnostic computations. Almost 18% of the time is spent in communications. The computational time appears evenly distributed among the different code portions and no evident concentration of effort has been discovered. The profiling confirms that all the point-to-point communications happen in the computation of the hyperbolic and viscous numerical fluxes, exchanged between elements through element surfaces, while the diagnostics evaluation involves collective MPI operations, such as MPI_ALLREDUCE. The numerical method and the LES models require a massive use of matrix-vector and matrix-matrix multiplications of relatively small (order 10x10 to 100x100) full matrices. These products are currently computed using the Fortran intrinsic functions MATMUL and DOT_PRODUCT. During the project, a test substituting these functions with the corresponding BLAS library routines has been conducted: no relevant performance improvement was observed, possibly due to the relatively small dimensions of the matrices involved.
Topic 4
Main objectives Improvement of the hybrid RANS/LES model. PRACE-4IP – EINFRA-653838 29 15.04.2016 D7.1 Periodic Report on Applications Enabling
Accomplished work To improve the turbulence modelling, and in particular the hybrid RANS/LES model, an analysis of the role of the blending factor has been conducted.
Main results From a turbulence modelling point of view, the project has been used to improve the turbulence modelling, and in particular the hybrid RANS/LES model [9]. In order to perform the analysis of geometrically complex flows, it is important to optimize the ratio between the RANS and LES computations. For this reason, the blending factor k used to combine LES and RANS has been analysed. Several simulations with different values of k have been performed to understand the role of k and how the results change when increasing or decreasing the RANS contribution. The test case chosen is the turbulent channel flow at Reτ = 180.
Figure 20: Shear stress (left) and turbulent kinetic energy (right) profiles for different values of k. The dashed lines represent the modelled quantities, while the continuous lines represent the resolved ones.
In Figure 20 the mean shear stress and the turbulent kinetic energy profiles for different values of k are shown; the dashed lines represent the modelled quantities and the continuous lines the resolved ones. Numerical results show that the blending factor directly affects the amount of resolved and modelled quantities.
Moreover, as shown by the comparison with DNS data, it is possible to obtain good results also for low value of k, i.e. increasing the RANS contribution and therefore reducing the number of turbulent scales resolved, and thus reducing the computational effort. The project also published a white paper, which can be found online under [10]. 2.11 Cut-off March 2015 2.11.1 Large Eddy Simulation of unsteady gravity currents and implications for mixing, 2010PA2821 Code general features Name Les Coast Scientific field Engeneering, Physics Short code description The model solves the 3D LES-filtered Navier-Stokes equations in Boussinesq, rigid-lid approximation. It makes use of generalised coordinates and an immersed-boundary method to implement the presence of obstacles and complex geometries. It can be used to simulate hydraulic laboratory conditions or realistic coastal-scale applications with PRACE-4IP – EINFRA-653838 30 15.04.2016 D7.1 Periodic Report on Applications Enabling Code general features environmental forcing. The Navier-Stokes solver adopts the Kim and Moin scheme with generalised coordinates. The convective terms are solved either with a central scheme or the QUICK algorithm. The pressure solver uses SOR or SOR+multigrid. The time scheme can be explicit or semiimplicit (fractional step method). The Large Eddy Simulation subgrid model can be a static-Smagorinsky, dynamic Smagorinsky (Germano 1992) or a Lagrangian dynamic subgrid-scale model (Meneveau 1996). Programming language Fortran 90 & Fortran 2003 Supported compilers INTEL 16.0.0 (tested), Gfortran 4.9 and others that support Fortran 2003 & assumed-rank arrays Parallel implementation MPI Accelerator support No Libraries MPI-3.0 compatible library, Gabriel 1.1 Building procedure Makefile Web site Licence Topic 1 Main objectives Refactoring of the code in order to increase the efficiency in the MPI communications. Accomplished work All codes have been upgraded to standard F90. All MPI communications have been improved by using MPI 3.0 derived types and neighbour collectives. The MPI communication in the original code was effectively serialized by a series of MPI_Gather operations and MPI_Ssend for nearest neighbour exchanges. This had to be changed throughout the code: • • replace of MPI_Ssend+MPI_Recv with MPI_Sendrecv or MPI_Neighbor_alltoallw replace a loop of MPI_Gather+MPI_Barrier with one MPI_Alltoall Matrices definitions and memory allocations have been organized in modules. MPI variables and functions have been organized in separated modules. Main results The first benchmark is a gravity current in a long and narrow channel. The current is generated by an ideal lock-exchange technique. At the beginning of the simulation, the dense fluid is positioned in a small volume on the left side of the tank. The light fluid occupies most of the volume on the right side of the tank and the upper part of the left side. At the beginning of the simulation, the dense fluid collapses and moves beneath the light fluid while the light fluid moves in the opposite direction. The grid is uniform on the xz plane, non-uniform in the vertical direction (y). Periodic boundary conditions are imposed in the transverse direction PRACE-4IP – EINFRA-653838 31 15.04.2016 D7.1 Periodic Report on Applications Enabling (x). Due to the geometry of the problem, the test is ideally suited for the 1-dimensional decomposition in the original code along the length of the channel. The original code runs on 16, 32 and 64 cores. 
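As an aside on the communication refactoring described above for Topic 1, the following sketch shows the general shape of the change from a serialised MPI_Ssend/MPI_Recv nearest-neighbour exchange to a single MPI_Sendrecv per direction on a Cartesian communicator. It is written in C++ as a generic illustration (the Les Coast code is Fortran), and the array layout, halo width and communicator are assumptions rather than the project's actual data structures.

```cpp
#include <mpi.h>
#include <vector>

// Exchange one halo layer with the left and right neighbours along one
// axis of a Cartesian decomposition (cart_comm must be a Cartesian
// communicator created with MPI_Cart_create).
void halo_exchange(std::vector<double>& field, int halo, MPI_Comm cart_comm)
{
    int left, right;
    // Ranks of the neighbours along dimension 0 (MPI_PROC_NULL at the ends).
    MPI_Cart_shift(cart_comm, 0, 1, &left, &right);

    const int n = (int)field.size();
    double* send_right = field.data() + n - 2 * halo; // last interior cells
    double* recv_left  = field.data();                // left ghost cells
    double* send_left  = field.data() + halo;         // first interior cells
    double* recv_right = field.data() + n - halo;     // right ghost cells

    // One combined send/receive per direction replaces the original
    // MPI_Ssend + MPI_Recv pair and removes the risk of serialisation.
    MPI_Sendrecv(send_right, halo, MPI_DOUBLE, right, 0,
                 recv_left,  halo, MPI_DOUBLE, left,  0,
                 cart_comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_left,  halo, MPI_DOUBLE, left,  1,
                 recv_right, halo, MPI_DOUBLE, right, 1,
                 cart_comm, MPI_STATUS_IGNORE);
}
```

With MPI 3.0 the same exchange can be expressed even more compactly with neighbourhood collectives such as MPI_Neighbor_alltoallw on the Cartesian communicator, which, together with derived datatypes, is the route taken in the project.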
The multi-grid solver limits the number of processes to a maximum of 64 for this particular benchmark. 7 Original 6 Optimized Speedup 5 4 3 2 1 0 0 10 20 30 40 50 60 70 Cores Figure 21: Speedup vs. cores test1: gravity current in a channel. Figure 21 shows that the code can now scale to the maximum number of cores (due to the limitations of the 1-D decomposition and the multi-grid solver). The optimized code is the starting point of topic 2, to change the decomposition for scaling beyond 64 cores. Figure 22: Results for the channel test: propagation of the gravity current (density field) Topic 2 Main objectives Increase the level of parallelization of the code. Accomplished work The code has been parallelized to a 2D decomposition, the multigrid solvers sor/slor have been changed introducing red/black schemes in the Cartesian and curvilinear coordinates. The MPI communications for the 2D parallelization have been organized in a separate library (Gabriel 1.1). The implicit three-diagonal solvers have been optimized. PRACE-4IP – EINFRA-653838 32 15.04.2016 D7.1 Periodic Report on Applications Enabling Main results Another benchmark in a square basin was selected to address the decomposition of the code. In this test, a gravity current is generated in a tank where a rigid wall separates two volumes of fluid at different density. The lock-exchange is realized by means of an opening in the central part of the separating wall. The aperture has a length, which is 1/8 of the length of the wall. The boundary conditions on velocity at the separating wall are realized by using an immersed boundary technique. The grid is not uniform, quasi-Cartesian. No periodic boundary condition is imposed. In this problem, there is not a preferential dimension along which to decompose the grid 1-D, and the use of a dynamic algorithm for the estimate of the sub grid stresses makes the requirements on the memory allocation a severe test. The scalability of the optimized code for this benchmark is limited to 32 cores by the multigrid solver and the 1-D decomposition. The change of the decomposition in the multigrid solver has pushed the limit to 256 cores for this benchmark. Unfortunately, the results are incorrect, which is still under investigation. However, it is expected that the code can be fixed without a significant impact on the scalability. Figure 23: Benchmark for the 3D gravity current simulation. 12 2decomp 10 Speedup 8 6 4 2 0 0 100 200 300 Cores 400 500 600 Figure 24: Speedup vs. cores for the 3D gravity current test. PRACE-4IP – EINFRA-653838 33 15.04.2016 D7.1 Periodic Report on Applications Enabling Production runs are planned to use a higher resolution with an upper limit of 2048 cores. This brings the code within reach of PRACE Tier-0 machines. It is expected that the code will be ready before the next Tier-0 call. Project results will be published in a separate white paper in the context of Work Package 7. PRACE-4IP – EINFRA-653838 34 15.04.2016 D7.1 Periodic Report on Applications Enabling 3 T7.1.B SHAPE In this section, the progress in task 7.1B, SHAPE will be discussed. Summary reports on the work in the second SHAPE call projects are provided. Also, the recently closed third call is reported on, and finally the future of SHAPE is discussed. 
3.1 SHAPE Second call: Applications and Review Process The review panel was composed as follows: • • • Two representatives from the SHAPE programme organisers; Two representatives from the PRACE Board of Directors; Two representatives from the Industry Advisory Committee. The two main criteria considered for the review of the applications were: • • Strength of the business case - The expertise and resources provided via SHAPE are expected to produce a significant Return on Investment for the company. In the midterm, the SME should be able to build on the results to, for instance, increase its market share, renew its investment or recruit dedicated staff. It would also be expected that the business plan for the project would lead the SME to further engage in HPC in the long term. Technical Adequacy - The applications are expected to fit the timeframe and resources available in the project. The project activity must only lean on expertise already available within PRACE partners. Other aspects that were considered: • • • The commitment of the SMEs to co-invest with PRACE in achieving the project goals. The effort should at least be equally split between the company and PRACE; The innovative aspect of the proposed solution; The social and economic impact on society as a whole. The applications were reviewed and ranked according to these criteria, and then the recommendations put forward to the board. The board approved 11 projects to go ahead from this call. A twelfth project was deemed not suitable for SHAPE, but was instead encouraged to engage with PRACE via the preparatory access calls. Following the Board’s approval of the recommendations, the successful SMEs were matched with PRACE partners, as shown in Table 13 below: Company Country Project Title Partner ALGO'TECH INFORMATIQUE France High performance to simulate electromagnetic disruption effects INRIA in embedded wiring CybeleTech France Numerical simulations for plant CINES breeding optimization Design Methods Italy Coupled sail and appendage design method for multihull based CINECA on numerical optimization PRACE-4IP – EINFRA-653838 35 15.04.2016 D7.1 Periodic Report on Applications Enabling Company Country Project Title Partner Ergolines s.r.l. Italy HPC-based Design of a Novel Electromagnetic Stirrer for Steel ITU Segment Casting Hydros Innovation Switzerland Automatic Optimal Hull Design by Means of VPP Applications on CINECA HPC Platforms Ingenieurbüro Tobias Loose Germany HPCWelding Open Ocean France High Performance Chain - HPPC Principia France HPC for Hydrodynamics database PSNC/CINES creation RAPHI Italy Optimad CINECA/INRIA VORTEX BLADELESS SL Spain VORTEX BSC WB-Sails Ltd Oy Simulation of sails and sailboat CSC performance Finland HLRS Processing IDRIS Table 13: SHAPE applications to the second call With regards to Principia, PSNC were originally approached to partner them, but the SME wished to work with a local centre and CINES were able to fulfil this. 3.2 SHAPE Second call: Status The approved projects were informed of their success in April 2015 and encouraged to start soon after. With a few exceptions, the projects were underway by the end of May. As such, the projects are somewhat unsynchronised – this is not entirely unexpected given the very diverse range of projects being undertaken, but there are other factors involved that should be taken into consideration for future calls and are discussed in more detail below. 
Each collaboration in SHAPE is expected to produce a white paper for publication on the PRACE website at the conclusion of the technical work. In addition, every project was required to provide a brief summary of their work for this deliverable (see section 3.3 below). The status (as of March 2016) is as follows: • • • • • • • Ergolines s.r.l. (ITU) – project complete, white paper ready for review April 6th 2016; Cybeletech (CINES) – project completed, white paper ready for review April 6th 2016; OPTIMAD Engineering (CINECA) – project completed, white paper ready for review April 6th 2016; Open Ocean (IDRIS) – project completed, white paper ready for review April 6th 2016; Hydros Innovation (CINECA) - project completed, white paper ready for review April 6th 2016; Vortex Bladeless SL (BSC) – project approaching completion, white paper ready for review April 6th 2016; Design Methods (CINECA) - project approaching completion, white paper ready for review April 6th 2016; PRACE-4IP – EINFRA-653838 36 15.04.2016 D7.1 • • • • Periodic Report on Applications Enabling Ingenieurbüro Tobias Loose (HLRS) – there have been various technical challenges but HLRS is working closely with third party software developers to overcome these, and the technical work is expected to finish by June. More details are included in the summary from HLRS in section 3.3.8; WB-Sails (CSC) – CSC has had various difficulties obtaining machine time mainly due to local restrictions on commercial usage for their platforms. These have now been overcome and technical work has begun, but it is expected that it will not be concluded until summer; Principia (CINES) – Due to issues mainly related to security concerns of the SME, there have been delays to starting this work. As such, little work has taken place yet but it is expected to start imminently. More details are given in section 3.3.10 below; Algo’tech (INRIA) – Similarly to Principia, there have been various delays to the technical work starting, but work is now underway. For the four projects which are unable to provide white papers for the current (April 6th 2016) round of deliverable reviews, it is expected that their white papers will be reviewed at a later review round, possibly alongside the third call projects. 3.3 SHAPE second call: Project summaries This section provides summaries of the eleven projects from the second call of the SHAPE programme. For each project there is a brief overview describing the problem to be solved, the activity undertaken, how PRACE was involved, the benefits to the SMEs, and finally the lessons learned for the further development of the SHAPE programme itself. The lessons learned are also discussed further in Section 3.4. Note that each pilot project is also producing a technical white paper that will cover the activities and results of the projects in greater detail than presented here. The intention of this section is to give a flavour of the broad range of projects and the diversity of the subject areas, along with summarising the benefits of the SHAPE programme to the SMEs. 3.3.1 Ergolines: HPC-based Design of a Novel Electromagnetic Stirrer for Steel Casting Overview Project partners: • Company: Ergolines s.r.l. 
o Isabella Mazza, Ergolines s.r.l., Physicist, isabella.mazza@ergolines.it o Cristiano Persi, Ergolines s.r.l., Mechanical Engineer, cristiano.persi@ergolines.it o Andrea Santoro, Ergolines s.r.l., Mechanical Engineer andrea.santoro@ergolines.it • Istanbul Technical University o Ahmet Duran, Istanbul Technical University (ITU), Mathematical Engineering, aduran@itu.edu.tr o Yakup Hundur, Istanbul Technical University, Physical Engineering, hundur@itu.edu.tr o Mehmet Tuncel, Istanbul Technical University, Mathematical Engineering, Computational Science and Engineering. • SHAPE contacts: isabella.mazza@ergolines.it, aduran@itu.edu.tr, hundur@itu.edu.tr PRACE-4IP – EINFRA-653838 37 15.04.2016 D7.1 Periodic Report on Applications Enabling As a general consideration, in order to simulate the effects of electromagnetic stirring on liquid steel, a dedicated customization of Ergolines’ current OpenFOAM code has been implemented so as to couple Electromagnetism with Fluid Dynamics. Due to the complexity of the multi-physical system under study, very fine discretization in terms of time and geometry is required. The use of HPC and the potential to take advantage of specialized expertise is key to meet this emerging industrial challenge. In order to better assess how parallelisation improves computational performance, the simulations have been carried out considering an increasing number of cores. Activity performed The project activities have been organised into four different phases: • • • • Porting: deploy and run the code on CINECA Fermi; Profiling: quantification of the computational time spent in each building block of the code; Conducting initial simulations and parameter optimization: EMS design has been following an iterative, multiple-simulation process including: 1) analysis of the geometrical constraints, 2) calculation of the EM performance, 3) fluid dynamic simulation 4) parameter optimization, 5) iteration of steps 2 to 4 until the required EM performance is achieved; Benchmarking: performance analysis for the current version and updated versions of the code via extensive simulations. Overview of the results The fluid dynamics of liquid steel in an electric arc furnace under the effect of electromagnetic stirring has been studied by means of HPC-based numerical simulations. The geometry, mesh and fluid dynamics of the system under study are represented in Figure 25. The velocity field generated by the EMS, which is located under the EAF, is also shown. We performed the scalability tests and the code has shown nearly linear speed-up up to 512 cores. Afterwards, speed-up saturation takes place if more than 512 cores are used as seen in Figure 26. (a) EAF geometry and mesh (b) Fluid-dynamic simulation: velocity field displayed as flux lines Figure 25: Top views of geometry and mesh, and velocity field PRACE-4IP – EINFRA-653838 38 15.04.2016 D7.1 Periodic Report on Applications Enabling Speed-up normalized to 20 cores Speed-up 60 50 40 30 simple 20 hierarchical Linear speed-up 10 0 0 200 400 600 800 1000 1200 Number of cores Figure 26: Speed-up as a function of the number of cores PRACE cooperation The project partners have prepared a detailed workplan to realize the HPC-based project. The project partners at ITU were awarded access to IBM-FERMI at CINECA through their Project 2010PA3012 “Parameter Optimization and Evaluating OpenFOAM Simulations for Magnetohydrodynamics” under the 21st Call for PRACE Preparatory Access call Type B. 
The project partners at ITU have prepared sequential job submit scripts and parallel job submit scripts to compile and run OpenFOAM with mathematical operators such as turbulence models and various mesh operators and the solver, and also to execute other related programs on IBM-FERMI at CINECA. They have provided the job submit scripts to Ergolines. They provided guidance for performance and scalability of the codes on HPC systems. The SHAPE contacts at ITU attended the PRACE F2F and telco meetings, and communicated with WP7.1 task leader. The project partners have prepared a white paper. Benefits for SME The SME appreciates how the use of HPC has been crucial to carry out the fluid-dynamic simulations by drastically reducing the computational times. Performing the simulations inHouse, on Ergolines’ workstations having 8 cores, required about 15 hours, while running the same calculations on CINECA Fermi took only about 20 minutes by a hierarchical method with 1024 cores. This dramatic advantage enabled to carry out an extensive analysis of the fluid-dynamic of the liquid steel in the furnace under the influence of electromagnetic stirring, providing key information for EMS design and industrialization. Lessons learned PRACE and the project partners at Ergolines s.r.l. and ITU enjoyed an excellent collaboration and completed the project successfully. All parties hope to have the chance to collaborate again in the future. 3.3.2 Cybeletech: Numerical simulations for plant breeding optimization Overview Breeding a new plant variety is a long process that requires a decade and thousands of experimental trials in the field so as to select the most robust and efficient traits. In order to help seed companies to reduce the duration and development cost of a new variety, this work proposes to simulate the growth of the tested genotypes instead of running experiments in the PRACE-4IP – EINFRA-653838 39 15.04.2016 D7.1 Periodic Report on Applications Enabling field. For this purpose, HPC technologies are then critical. In the first step, the plant growth model used in numerical simulations must be calibrated with plant phenotypes data. The project aims to define the optimal experimental protocol to be followed for calibrating the model, i.e. to answer three questions: What observables to measure? In which quantity? In which environments? To address these issues, computer simulations are run to compare the precision derived on the model parameters as a function of the data used in input. Optimization techniques are then used to identify the best protocol offering a balance between quality of the final result and experimental costs. Activity performed The partners performed an installation of the code and all of its third party libraries, and then defined an input dataset for simulation used as a benchmark. Thus the initial performance and correctness could be validated. Then the random number generator usage in the source code was modified to ensure repeatable results and timings. Intel Vtune was used to profile the code and identify which lines led to excessive time consumption. Subsequently, the performance was then improved for those lines of codes that concerned C++ object memory management and mathematical functions. Then the parallelism was improved by adding a master-slave approach to distribute the work among hundreds of MPI ranks. PRACE cooperation PRACE was involved for providing access to the Curie machine at TGCC (CEA computing centre). 
Benefits for SME The SME Cybeletech was able to compute all the simulations planned and made use of the whole 400k hours allocated. • Lessons learned • • 3.3.3 What worked well was holding a face to face meeting with the engineer from Cybeletech over a period of two days. It greatly speeded up the understanding of the code and the implementation of optimisations. A problem was a delay due to security requirements not being met in order to access Curie. Indeed, the SME has internet through an ADSL box. Porting from one linux environment to another linux environment is sometimes not straightforward for C++ codes. RAPHI: rarefied flow simulations on Intel Xeon Phi Overview Within this project, OPTIMAD Engineering srl wanted to explore the possibility of porting the KOPPA (Kinetic Octree Parallel PolyAtomic) numerical code onto the Intel Xeon Phi architecture using the CINECA GALILEO cluster. KOPPA is used for rarefied gas simulations and demands expensive computation when compared with other CFD or CAE applications. By porting the code onto the Xeon Phi architecture, the goal is to reduce the cost of simulations and thus raise the interest of using this code for industrial applications. Activity performed PRACE-4IP – EINFRA-653838 40 15.04.2016 D7.1 Periodic Report on Applications Enabling To investigate the performance profiler tools, such as Vtune and Scalasca, were used running the initial version of the code on the GALILEO cluster. A simple test case was chosen and the behaviour of the code observed, including whilst increasing the computational load with different input parameters. In order to optimize vectorization and the parallel scalability some parts of the code were then refactored. The computational time requirements have been decreased by almost a factor of eight and a good scalability has been obtained up to 64 cores as compared with only 16 cores of the initial version of the code. PRACE cooperation The cooperation has involved two engineers from OPTIMAD, Marco Cisternino and Haysam Talib, one research engineer from INRIA Bordeaux Sud-Ouest, Florian Bernard and an expert from CINECA, Vittorio Ruggiero. For the computations, an account has been created and 5,000 core hours have been allocated on the GALILEO machine at CINECA. Benefits for SME This project gave the SME a better understanding of the behaviour of MPI code on the Xeon Phi architecture. Pure task parallelism is probably not the right approach giving the best performance on this type of architecture. A hybrid parallelization (such as MPI+OpenMP) might be more suitable and is going to be investigated. Moreover, the vectorization effects (and the optimisation in a more general point of view) are an important aspect and have still to be improved. Memory requirements seem also to be a bottleneck since the code needs to handle a large amount of data, causing problems in memory access. The code, as it is, is not yet ready to have good performance on multi integrated cores (MIC) since the scalability is still too poor and does not exploit any advantages of MIC with respect to CPUs. Lessons learned The access to GALILEO worked well for the users and allowed small tests (compilation or very small cases), before running on more nodes. However, access to the actual resources in order to study code performance and scalability was cumbersome due to the large amount of jobs in the queues. This issue meant that scalability has been tested only on a very restricted number of nodes in order to avoid large waiting times. 
There was good and important communication between all the partners of the project, resulting in worthwhile improvements to the code and a better understanding of the architecture. From an industrial point of view, the project permitted OPTIMAD to get a hands-on feel for the MIC architecture and, through the support of the computing centre, gain insight into the code and its suitability for this architecture. The project raised issues regarding the performance of the application and gave indications on which improvements need to be introduced beforehand in order to increase the potential of the application for the MIC architecture. This type of information subsequently enables OPTIMAD to program in a more efficient and rational way for the transition to heterogeneous architectures, which is considered a strategic development goal within the company.

3.3.4 Open Ocean: High Performance Processing Chain - faster on-line statistics calculation

Overview
Open Ocean is a French SME that develops innovative on-line solutions to help plan and manage offshore developments. They conceived an oceanographic data study tool which computes and formats data (pre-processing and processing) and which provides relevant oceanographic information to industrial marine companies (post-processing) through a web interface. However, the "time-to-solution" of this post-processing step is too long and hence not compatible with industrial use. Therefore, the goal of this SHAPE project was to improve post-processing by optimising a parallelized Python program of Open Ocean which processes and computes statistics (e.g. wind speed) on big datasets. To carry this out, engineers from Open Ocean and IDRIS (CNRS computing centre) worked together to optimise this program by using a high-performance parallel machine and a parallel file system (GPFS, 100 GB/s bandwidth). The post-processing code was ported on to the Tier-1 Ada machine (an IBM cluster of Intel Xeon E5-4650 processors, 332 compute nodes) at IDRIS to analyse and optimise its performance. Details about the project and the activity performed can be found in the upcoming white paper "Shape Project Open Ocean: High Performance Processing Chain - faster on-line statistics calculation", which will be available via the PRACE website by summer 2016.

Activity performed
The work concerned the porting of the post-processing step onto the Ada machine and the analysis of the computation performance. The code was already partially optimised, as it had been parallelized using a specific software package (ProActive Parallel Suite), but to enable the porting of the post-processing code onto any machine it was necessary to dispense with that software. After this task was done, it was possible to compare the performance of the computations on the Open Ocean and IDRIS machines. A profiling of a realistic post-processing case was performed to help the Open Ocean team better understand the behaviour of their code.

PRACE cooperation
The PRACE cooperation has involved engineers from the Open Ocean SME and from IDRIS: Youen Kervella and Yves Moisan (Open Ocean), Lola Falletti and Sylvie Therond (IDRIS). Data were transferred to the Ada machine at IDRIS, and L. Falletti and S. Therond carried out the tests. Some meetings were held (in person or by phone) to give updates on project progress. Most communication was done through emails during the entirety of the project.
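The statistics computation described above operates on datasets far larger than a workstation's memory, which is why the bandwidth of the parallel file system matters. As generic background only (this is not Open Ocean's Python code; the file name, data layout and chunk size are assumptions), a post-processing pass of this kind can stream the data in fixed-size chunks and keep running statistics, so the cost is dominated by how fast the data can be fetched:

```c
/*
 * Generic sketch: chunked statistics over a large binary file of
 * double-precision wind speeds ("winds.bin" is a hypothetical file name).
 * Reading fixed-size chunks and keeping running statistics avoids holding
 * the whole dataset in memory.
 */
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

#define CHUNK 1048576   /* values read per iteration (8 MiB of doubles) */

int main(void)
{
    FILE *f = fopen("winds.bin", "rb");
    if (!f) { perror("winds.bin"); return 1; }

    double *buf = malloc(CHUNK * sizeof(double));
    double sum = 0.0, minv = DBL_MAX, maxv = -DBL_MAX;
    long long count = 0;
    size_t n;

    while ((n = fread(buf, sizeof(double), CHUNK, f)) > 0) {
        for (size_t i = 0; i < n; i++) {       /* update running statistics */
            double v = buf[i];
            sum += v;
            if (v < minv) minv = v;
            if (v > maxv) maxv = v;
        }
        count += (long long)n;
    }

    if (count > 0)
        printf("n=%lld mean=%.3f min=%.3f max=%.3f\n",
               count, sum / (double)count, minv, maxv);

    free(buf);
    fclose(f);
    return 0;
}
```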
Benefits for SME
The PRACE cooperation gave Open Ocean the opportunity to port their codes onto a high-performance computer system, thus familiarising them with the standards of this field. The in-depth knowledge of the IDRIS engineers also gave Open Ocean a new look at both their hardware and their file transfer solution. This study allowed Open Ocean to identify the main bottleneck of its post-processing program (i.e. fetching data from their dataset) and to reconsider their hardware choice. In addition, by porting the post-processing code to the IDRIS infrastructure, this PRACE project gave Open Ocean the opportunity to assess other job schedulers, such as SLURM or LoadLeveler, which greatly increases the portability and the efficiency of their software solution.

Lessons learned
The cooperation between Open Ocean and the IDRIS centre worked very well. The communication between the two teams was facilitated by the fact that they both spoke French and were not located too far apart. It was then easier to exchange mails and to organise meetings to increase the efficiency of the project work. The work that was done was different from what was first requested by Open Ocean; however, as both sides were responsive, the work plan was adjusted accordingly.
The engineers of Open Ocean had access to the Ada machine but in the end they did not need to use it: instead, their data were transferred onto the machine and the work was done by the IDRIS engineers. All of the codes were open source, which facilitated the work. More tests could be done, especially concerning the optimisation of the post-processing step. However, the work that was performed provides the company with a good base from which to improve their post-processing workflow.

3.3.5 Hydros Innovation: Automatic Optimal Hull Design by Means of VPP Applications on HPC Platforms

Overview
Hydros is a Swiss engineering and research company founded in 2007, with several patented designs in the field of marine and sailing yachting. The main scope of the project was to evaluate the feasibility of automatic optimal hull design on an HPC infrastructure and the impact of such a workflow on the day-to-day work of Hydros personnel. CINECA is the PRACE center that supported the SME.

Work Performed
The project was subdivided into:
• Validation of a 2 Degrees of Freedom (DoF) Computational Fluid Dynamics (CFD) analysis of an industrial hull design provided by Hydros, using the open-source code OpenFOAM;
• Scalability tests and comparison of the commercial code CAESES, used at present by Hydros, and the open-source code OpenFOAM for hull 2DoF modelling;
• Coupling of the CFD result into an existing CAD modification and optimization loop;
• Submission of a complete optimization loop for an industrial hull design using open-source code on the HPC platform, and evaluation of the usability of the solution provided.

PRACE cooperation
The cooperation has involved one engineer from Hydros, Alaric Lukowski, and two experts from CINECA, Raffaele Ponzini and Ivan Spisso. For the computations, accounts have been opened on the Tier-1 CINECA cluster GALILEO and 60,000 core hours have been allocated.

Benefits for SME
The project allowed the SME to analyse the feasibility of moving from a workflow based on workstations running commercial CFD codes to a new one involving HPC resources with open-source codes.
The potential benefits of the outcome are obvious: a considerable reduction of costs due to license expenses, and a reduction in time-to-market thanks to the shorter simulation times made possible by exploiting the parallel efficiency of the CFD codes on an HPC cluster.

Lessons learned
In the end, the commercial code results and the open-source ones matched within negligible differences. This very important result reassured the SME about the possibility of using open-source codes in production. However, while the results from CAESES were obtained essentially out-of-the-box, OpenFOAM could match them only after a long analysis and tweaking of the simulation parameters. This suggests that the open-source code requires a long learning curve and skilled engineers, offsetting some of the benefits of reduced licensing costs.
From the PRACE centre point of view, this project suggests that there is a strong need for PRACE as an innovation catalyst for SMEs, providing specific competences and deep expertise on CAE open-source codes.

3.3.6 Vortex: Numerical Simulation for Vortex-Bladeless

Overview
Vortex-Bladeless is a Spanish SME whose objective is to develop a new concept of wind turbine without blades, called Vortex, or vorticity wind turbine. This design aims to eliminate or reduce many of the existing problems in conventional wind energy and represents a new paradigm of wind energy. Due to the significant difference in the project concept, its scope is different from that of conventional wind turbines. It is particularly suitable for offshore configurations and it can be exploited in wind farms and in environments usually closed to existing designs due to the presence of high-intensity winds. Given its morphological simplicity, and considering that it is composed of a single structural component, its manufacturing, transport, storage and installation have clear advantages. The new wind turbine design has no bearings, gears, et cetera, so the maintenance requirements could be drastically reduced and its lifespan is expected to be longer than that of traditional wind turbines.
The Barcelona Supercomputing Center (BSC) is in charge of the simulations of the wind energy generation device. The Alya code, developed at BSC, is used to perform the Fluid-Structure Interaction (FSI) simulation for a scaled model of the real device.

Activity performed
The FSI problem posed by the interaction of the wind energy generator and the wind current in which it is embedded is solved using the Alya code. A comparison between the experimental results of a laboratory scaled device and the numerical simulation was performed. The first objective was to show that the code has the capacity to perform the simulation. In order to be able to do the simulation, the multi-code coupling ability of the Alya code was used, tuned and refined. Different algebraic solvers, mesh kinds and coupling algorithms were tested. The comparison between the numerical and the experimental results is satisfactory and allows the set-up of a full-scale device simulation to proceed.

PRACE cooperation
PRACE provided the expert support to adapt the code for this application and the machine time needed to perform the simulations.

Benefits for SME
The results of this SHAPE project are providing guidance and support to the company in the development of its wind energy device. Once the full laboratory results are properly reproduced, it is expected that full-scale simulations will be performed.
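As background to the multi-code coupling tested in the activity above, the skeleton below shows a generic partitioned fluid-structure coupling iteration with under-relaxation, in which the fluid and structural solvers exchange interface loads and displacements until the interface state stops changing. This is a textbook pattern, not the specific coupling algorithm implemented in Alya; the solver routines are stubs and all constants are illustrative.

```c
/*
 * Schematic partitioned fluid-structure coupling loop (fixed-point iteration
 * with under-relaxation). Solver calls are stubs; compile with -lm.
 */
#include <math.h>
#include <stdio.h>

#define NIF      100     /* hypothetical number of interface points */
#define MAX_ITER 50
#define TOL      1e-6
#define OMEGA    0.5     /* under-relaxation factor */

/* Stub: fluid solve for a given interface displacement, returning loads. */
static void fluid_solve(const double *disp, double *load)
{ for (int i = 0; i < NIF; i++) load[i] = 1.0 - 0.3 * disp[i]; }

/* Stub: structural solve for given loads, returning new displacements. */
static void structure_solve(const double *load, double *disp)
{ for (int i = 0; i < NIF; i++) disp[i] = 0.1 * load[i]; }

int main(void)
{
    double disp[NIF] = {0.0}, load[NIF], disp_new[NIF];

    for (int it = 0; it < MAX_ITER; it++) {
        fluid_solve(disp, load);            /* fluid sees the current geometry  */
        structure_solve(load, disp_new);    /* structure reacts to the loads    */

        /* Under-relaxed update and convergence check on the interface. */
        double res = 0.0;
        for (int i = 0; i < NIF; i++) {
            double d = disp_new[i] - disp[i];
            res += d * d;
            disp[i] += OMEGA * d;
        }
        res = sqrt(res / NIF);
        printf("coupling iteration %d, residual %.3e\n", it, res);
        if (res < TOL) break;
    }
    return 0;
}
```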
BSC and Vortex-Bladeless are also looking forward to cooperating again in the framework of European projects, or to investigating other collaboration possibilities.

Lessons learned
The experience of this collaboration within the SHAPE project framework has been satisfactory and encouraging. The main difficulties faced were gaining a full understanding of the PRACE SHAPE project procedures and communicating the progress and results of the work done by the BSC researchers. The best way found to cope with this difficulty was to hold teleconferences and to write periodic reports in non-specialist (computer science) language, so that the status and results of the work were clear to everyone.
One of the most confusing points was the application procedure for SHAPE projects. Given that two different applications for PRACE resources had to be made (the first to obtain approval of the project and the second to obtain the actual access for the computing time), it was felt that the resources of the first application were not being used properly. This issue has already been discussed at the face-to-face meeting, and it is foreseen that the next calls will include computational resources starting from the first application. The SHAPE application form could probably also contain a section on the computational resources needed, in case the SME already has some experience or an idea of what will be required.

3.3.7 DesignMethods: Coupled sail and appendage design method for multihull based on numerical optimization

Overview
Design Methods is an engineering firm with fifteen years of experience in the aerospace field. The mission of Design Methods is to provide multidisciplinary engineering consulting and design services to industries and design teams, supporting them with highly specialized competences in aerodynamic design, CAE analysis, software development, CAD modelling, numerical optimization environments and customized design tool development. They operate in the aerospace, automotive and marine fields. The main scope of the project was to evaluate the feasibility of a numerical optimization workflow for sailing boat sail plans and appendages. CINECA is the PRACE center that supported the SME.

Activity performed
The numerical optimization workflow integrates a parametric sail geometry module, an automatic mesh generator and a Velocity Prediction Program (VPP) based on both CFD computations and analytical models. The VMG (Velocity Made Good) is evaluated by solving the 6 Degrees of Freedom (DoF) equilibrium system, iterating between the VPP and the sail CFD analyses. The hull forces are modelled by empirical formulations tuned against a matrix of multiphase CFD solutions on the demihull. The appendage aerodynamic polars are estimated by applying preliminary design criteria from the aerospace literature. A significant part of the tool is already available to DesignMethods at a mature development stage, but it is implemented using commercial software that is very expensive, especially for an SME. The overall goals were thus:
• to investigate the possibility of replacing commercial codes with open-source software, through a benchmarking activity aimed at selecting suitable candidate codes and at highlighting their balance between performance, accuracy, robustness and HPC environment compatibility;
• to demonstrate the capability to efficiently take on computationally costly problems within an HPC environment.
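The VPP-CFD iteration described above alternates between evaluating sail forces and solving the boat's equilibrium until the two are consistent, after which the VMG is computed. The highly simplified sketch below uses a single drive-versus-drag balance instead of the full 6-DoF system, and a stubbed "CFD" force model; it is an illustration of the structure of such an iteration under assumed constants, not DesignMethods' tool.

```c
/*
 * Highly simplified VPP-style iteration: alternate a stubbed sail "CFD"
 * evaluation with an equilibrium update, then compute VMG. Compile with -lm.
 */
#include <math.h>
#include <stdio.h>

#define PI        3.14159265358979323846
#define TWA_DEG   50.0    /* true wind angle, degrees (illustrative)          */
#define DRAG_COEF 35.0    /* hull drag coefficient, N/(m/s)^2 (illustrative)  */

/* Stub for the sail CFD analysis: drive force as a function of boat speed. */
static double sail_drive(double v_boat)
{
    return 900.0 * exp(-0.15 * v_boat);   /* N, decays as the boat speeds up */
}

int main(void)
{
    double v = 3.0;                        /* initial boat-speed guess, m/s */
    for (int it = 0; it < 100; it++) {
        double drive = sail_drive(v);      /* "CFD" result for the current state */
        /* Equilibrium condition: drive == drag == DRAG_COEF * v^2. */
        double v_eq = sqrt(drive / DRAG_COEF);
        if (fabs(v_eq - v) < 1e-6) { v = v_eq; break; }
        v = 0.5 * (v + v_eq);              /* damped fixed-point update */
    }
    double twa = TWA_DEG * PI / 180.0;
    printf("boat speed %.2f m/s, VMG %.2f m/s\n", v, v * cos(twa));
    return 0;
}
```

In the real workflow the equilibrium involves all six degrees of freedom (including heel and trim) and the optimizer additionally varies the sail and appendage geometry, which is what makes the problem computationally costly enough to justify an HPC environment.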
PRACE cooperation
The cooperation has involved one engineer from DesignMethods, Ubaldo Cella, and two experts from CINECA, Raffaele Ponzini and Francesco Salvadore. For the computations, accounts have been opened on the Tier-1 CINECA cluster GALILEO and 5,000 core hours have been allocated.

Benefits for SME
The project allowed the SME to analyse the feasibility of moving from a workflow based on commercial CFD codes to a new one involving HPC resources with open-source codes. The potential benefit of the outcome is obvious: a considerable reduction of costs due to license expenses. A demonstration of the market value of the new workflow to the SME was obtained by applying it to a real industrial case, the design of an A-Class catamaran sail.

Lessons learned
The results showed that the open-source-based workflow was perfectly appropriate for reaching the goals and therefore demonstrated its feasibility.
From the point of view of the PRACE center, the outcome was not completely satisfying. The collaboration with the SME was particularly troubled, due firstly to the limited effort and feedback provided by the SME in the first few months of the activity, and later to requests for value-added contributions that, in CINECA's view, were outside the scope of the project and of the SHAPE programme itself. CINECA therefore recommends, for the following SHAPE calls, the creation of a PRACE statement of Terms and Conditions covering the scope of SHAPE projects and the extent of the support that PRACE centres are allowed to provide to SMEs, and that selected SMEs sign these Terms and Conditions prior to starting the technical phase of the project.

3.3.8 Ingenieurbüro Tobias Loose: HPCWelding: parallelized welding analysis with LS-Dyna

Overview
Partners:
• Ingenieurbüro Tobias Loose: Tobias Loose;
• Höchstleistungsrechenzentrum Stuttgart (HLRS): Jörg Hertzer, Bärbel Große-Wöhrmann;
• DYNAmore GmbH: Uli Göhner.

Ingenieurbüro Tobias Loose is an engineering office specializing in simulations for welding and heat treatment. Loose develops preprocessors (e.g. DynaWeld) and provides consulting and training for industrial customers. In addition, Loose is involved in its own research projects concerning welding and heat treatment simulations. In this project, the scaling behaviour of the welding application of the commercial FE code LS-DYNA has been tested on the CRAY XC40 "Hazel Hen" at HLRS.

Activity performed
A variety of test cases relevant for industrial applications have been set up and run on different numbers of compute cores (strong scaling tests). The results show that the implicit thermal and mechanical solver scales up to only 48 cores, depending on the particular test case, due to unbalanced workload. The explicit mechanical solver was tested up to 4080 cores with significant scaling. It was the first time that a welding simulation was performed on 4080 cores with the LS-DYNA explicit solver. The details will be presented in the project's white paper.

PRACE cooperation
HLRS granted access to the CRAY XC40 "Hazel Hen" in the frame of a PRACE Preparatory Access project. The staff at HLRS coached Loose and supported him in preparing and running the LS-DYNA cases in the HPC working environment.

Benefits for SME
During the SHAPE project Loose gained significant knowledge and experience in HPC.
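For reference, the strong scaling reported above is usually quantified by the speed-up and parallel efficiency relative to a baseline core count \(p_0\); the precise metric used in the project's white paper may differ, but a standard definition is

\[
S(p) = \frac{T(p_0)}{T(p)}, \qquad E(p) = \frac{p_0\,S(p)}{p},
\]

where \(T(p)\) is the wall-clock time of a fixed problem on \(p\) cores. A statement such as "scales up to 48 cores" then means that beyond roughly \(p = 48\) the efficiency \(E(p)\) falls below an acceptable level.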
The project clarified how HPC can be used for welding analysis consulting; in detail, which welding processes, welding tasks, modelling methods and analysis types are applicable on HPC, and with what effort. The overall effort for welding analysis on HPC is now much better known thanks to this SHAPE project, enabling a more precise cost estimation of welding consulting. This is a competitiveness improvement for Loose. The project's results show that the parallelized version of the LS-DYNA implicit solver, as applied to welding analysis, is not satisfactory and requires a revision, possibly in a follow-up project. The intention is to achieve a scaling behaviour similar to that of other LS-DYNA applications.

Lessons learned
• LS-DYNA is an appropriate code, but its implicit solver's scaling behaviour is disappointing in view of the good scaling behaviour of the LS-DYNA explicit solvers.
• The very small time steps needed for explicit analysis prohibit its application to welding tasks with long duration, e.g. up to 30000 s process time. It can now be distinguished which welding tasks allow explicit welding analysis and which do not.
• With the project's work, the scaling behaviour is now known with respect to model size, modelling technique (contact/no contact, solid/shell, transient/metatransient) and analysis type (thermal/mechanical, implicit/explicit).
• The main problem was that Loose, being involved in several consulting and research projects, has a tight schedule. Fortunately, the SHAPE project leader and the HLRS staff were flexible and respected this circumstance.
• The excellent HLRS support enables HPC for SMEs even if the company has little experience in supercomputing.

3.3.9 WB-Sails Ltd: CFD simulations of sails and sailboat performance

Overview
WB-Sails Ltd's partners in the project have been CSC – IT Center for Science in Finland and Next Limit Technologies in Spain. In the project, WB-Sails has deployed Next Limit's fluid simulation software XFlow on CSC's HPC cluster Taito. Preparatory simulations, mainly concerning sailboat aerodynamics, have been performed.

Activity performed
Numerous test runs have been performed, including scalability testing, implementing the XFlow GUI on the Taito-GPU system, and robustness testing of XFlow's various solvers, in particular for external aerodynamics as well as multiphase free-surface problems. The free-surface problem tests have been marred by a glitch in the software, which delivers different results in serial and parallel computing. The problem is yet to be solved by Next Limit.

PRACE cooperation
CSC has been extremely helpful in all aspects of the project: creating the connections to the HPC servers, scripting to automate the creation of batch runs, resolving problems associated with software compatibility, and arranging the possibility of using the GUI (necessary for post-processing) through GPU nodes. There has been a steep learning curve, by no means possible without the support of the HPC provider CSC.

Benefits for SME
The HPC cluster allows the SME to do runs at a much higher resolution than with the in-house workstation. Even if the computational power used so far is modest from a supercomputing perspective (128 cores on 8 nodes, around 2500 core hours per run), the ability to do runs overnight that normally take a week or more is much appreciated.
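Taking the reported figures at face value, the implied wall-clock time per run is

\[
t_{\mathrm{wall}} \approx \frac{2500\ \text{core-hours}}{128\ \text{cores}} \approx 19.5\ \text{hours},
\]

i.e. roughly an overnight to one-day turnaround, which is consistent with the comparison against week-long workstation runs.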
In addition, the high resolution of the runs (up to 42 million elements so far) provides a much more accurate picture of the flow phenomena, improving the SME's understanding of sail interaction and aerodynamics. In the long run, this will be reflected in the quality of the SME's products as well.

Lessons learned
Implementing and running relatively new commercial software on an HPC system is much more demanding than expected. In particular, the inability to work with the normal graphical interface for pre-processing, learning the Linux operating system from scratch, and batch scripting for parallel computing have all been both time consuming and a learning experience.
In addition to the co-operation between CSC and the SME, the co-operation between CSC and the software vendor Next Limit has worked well and helped WB-Sails a lot. CSC has given Next Limit access to the HPC systems for their testing, and together with Next Limit the SME has been able to create a launcher script that caters for the pre-processing phase. CSC was able to solve the post-processing problem, involving the software's GUI, in a novel, still experimental way of working through a VNC connection to Taito's GPU nodes. Together with Next Limit's launcher script, the Taito-GPU connection makes it possible to work in an almost office-like environment, hardly needing to use the Linux command line interface at all.
The SME has been forced to scale down its ambitions with regard to the objectives of the initial project (optimization through systematic geometry variation). Also, the inability to run free-surface cases in parallel has been a disappointment, but there is still hope that the issues can be solved with Next Limit, in order to fulfil this important part of the initial project.
A considerable part of the learning process has to do with the basic interfacing to the remote HPC system: installing and learning to use the various connecting software (sftp, ssh, VNC, etc.) on one's own operating system, which is not necessarily familiar to the HPC provider and not always compatible with the software on the server side. Even a simple feature such as keyboard functioning may cause surprising password-related issues that can take a couple of weeks to solve, through googling and trial and error.

3.3.10 Principia: HPC for Hydrodynamics Database Creation

Overview
Principia is a scientific engineering company that performs engineering studies for large industrial companies, and develops and industrializes added-value numerical software solutions. The main goal of this project is to optimize one of Principia's codes, Diodore, on an HPC infrastructure in order to improve the Deepline HPC product. Principia and CINES will be working together on this topic, aiming at transferring optimization and parallelization skills so that Principia staff can reuse the methodology on their other HPC codes.

Activity performed
The beginning of the project has been delayed due to security aspects and the availability of Principia's staff. The original goals were to profile, optimize and compile the targeted software on a machine administered by Principia and to run benchmarks on the PRACE infrastructure. Instead, the PRACE infrastructure will now be used throughout the project. The project is still in the starting phase, waiting for the NDA between Principia and the CINES experts to be established.

PRACE cooperation
For now, cooperation has been restricted to the organisation of meetings.
The project reporting has also been allowed to move to the third call.

Benefits for SME
Not yet applicable.

Lessons learned
From the PRACE point of view, this project took time to begin. Such delays could have been better anticipated given the security requirements requested by the SME. Investigating potential security requirements more thoroughly at the application stage would have been useful. If the exact workflow required by the SME had been known in advance, the appropriateness or otherwise of moving the code to another country's infrastructure would have been realised, and this may have alleviated some of the difficulties the project is now facing.

3.3.11 Algo'tech Informatique

Overview
ALGO'TECH INFORMATIQUE is an ISV located near Biarritz, in the south-west of France. It creates and sells a suite of software dedicated to electricity, used by design services to draw electrical schematics. Electrical devices have taken on a major role in all types of electrical, automated and embedded systems. Cables, both shielded and non-shielded, have thus become a serious issue in terms of safety, on-board weight (and hence performance and consumption), as well as cost and reliability. Today, the decision whether or not to shield a cable in response to electromagnetic effects is complex, and simulation has become mandatory to obtain a first level of decision. The main target of the project is to validate the possibility for Algo'Tech Informatique to use HPC in the area of electromagnetic simulations, to support its users up front when they design their installations, or later to eliminate electromagnetic effects when those installations are commissioned. Electromagnetic problems and disturbances occur quite frequently. It is essential to verify during the design stage that the cables are impervious to electromagnetic effects. Most importantly, there must be the means, when the installations are set up, to determine the best configurations to eliminate electromagnetic effects.
For several years, Algo'Tech Informatique has been developing, in partnership with the French General Directorate for Armament DGA (RAPID project) and the French Atomic Energy Commission CEA in GRAMAT, an electromagnetic simulator. This simulator uses the circuit simulator developed by Algo'Tech Informatique within the framework of the FRESH project (a European project of the Sixth Framework Research and Development Programme [FRDP], 2004-2007). The simulator is now in the validation phase. It operates on PCs to solve small and medium-size electromagnetic problems. Unfortunately, when an electrical installation is too big, computing on a PC becomes too time-consuming to meet user needs.
Previous activities performed under the HPC-PME Initiative: the preliminary studies carried out by the INRIA HiePACS team (INRIA is the French institute for computer science and applied mathematics) as part of the HPC-PME Initiative (HPC for SMEs, conducted in France by GENCI, INRIA and BPI-FRANCE) concluded that it was necessary to transfer the calculations to HPC.
Previous activities performed under the Fortissimo programme: it is possible to model the electromagnetic effects on the cables, wires, strands and harnesses that make up the connections of electrical systems: we can define a sparse linear system to solve.
For example, if we take an installation made up of 100 wires, each 100 metres long, then to obtain a good simulation the wires have to be cut into 1-metre sections, i.e. a total of 10,000 sections. Each of these sections involves about 100 equations to model the electromagnetic effects (depending on the number of contiguous wires), resulting in a 1,000,000 by 1,000,000 system: a sparse but quite voluminous system of linear equations that needs to be solved 1,000 times to generate a frequency sweep. Such a system cannot be solved within an acceptable time frame on a PC. The aim of the solution is to design the whole installation on a PC-type computer (desktop or laptop) and to be able to connect automatically to a computing centre to perform the calculations quickly and recover the results for modelling on the PC.
In the context of a first project (the FP7 Fortissimo programme), the use of cloud-based services to solve large-scale electromagnetic problems was addressed. A specific driver was implemented to extract the resolution step of the Algo'Tech simulator. The scalability was quite good, but in order to address larger problems (Algo'Tech would like to solve problems with about 5 million unknowns), HPC infrastructures are needed for the resolution step. This was the context in which Algo'Tech Informatique modified its electromagnetic simulator in order to include the calculation libraries of INRIA. This introduces parallelization into the source code of the simulator, allowing it to take advantage of multi-processor and multi-core architectures.

Activity Performed
The project led by Algo'Tech within SHAPE was delayed and no concrete activities were realised in 2015. Indeed, Algo'Tech first needed to finish its Fortissimo project before starting its SHAPE project; more precisely, the company had to modify its electromagnetic simulator and include the calculation libraries of INRIA. In the context of SHAPE, discussions have been held with INRIA since the beginning of 2016. They are currently working together to develop a Windows version of the Algo'Tech software (with a DELPHI integration). The aim is to obtain a native-format library (with Visual C++) as well as better use of threads and BLAS (with OpenBLAS).

PRACE cooperation
No access to PRACE supercomputers yet.

Benefits for SME
The goal is to apply electromagnetic simulation to harnesses in order to find the best position and route for the wires, so as to avoid problems due to electromagnetic disturbances. This prevents adverse electromagnetic effects that would require changing the design of the product, and it reduces operations and weight by avoiding non-essential shielding and armouring of cables. The target market is all SMEs and independent design offices working on embedded systems and equipment for the aeronautic, automotive, railway or shipbuilding industries, as well as machine manufacturers. Today, to the SME's knowledge, there is no entry-level product on the market that is easy to use without the need for electromagnetic experts. The need for such tools is growing fast, as electric control is becoming more and more important in all these industries. The objective is to propose this simulation tool in SaaS mode, including access to HPC resources. The product should be finalised and ready for the market by 2019. This product is proposed to three types of target customer:
1. Small companies with limited configurations, fewer than 100 wires and 200 connection points. The estimate is an average of five simulations a year at a fixed price of 200€ per simulation.
2. Enterprises with more substantial configurations, between 100 and 1000 wires and/or 2000 connection points. The price of a simulation will be linked to the number of wires or connection points and will vary from 200€ to 1000€. The estimate is 10 simulations a year at an average cost of 700€.
3. Enterprises with large or very large configurations, above 1000 wires and 2000 connection points. The SME banks on an average simulation cost of 1500€ and 30 simulations a year.

Lessons Learned
Solving large sparse systems of linear equations is a crucial and time-consuming step arising in many scientific and engineering applications. Consequently, many parallel techniques for sparse matrix factorization have been studied, designed and implemented. Solving a sparse linear system by a direct method is generally a highly irregular problem that induces some challenging algorithmic problems and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern hierarchical supercomputers. In this context, graph partitioning and nested dissection approaches have played a crucial role. The PaStiX solver has been widely used by industrial partners and academic teams, and the current project is an opportunity to demonstrate the efficiency of the approach for SMEs that have assessed and realised the technological leap needed to develop accelerated software using HPC facilities.

3.4 Summary of lessons learned and feedback
In this section, the salient points raised in the lessons learned are highlighted, and feedback garnered from the PRACE partners via the SHAPE tele-conferences and face-to-face meetings is also presented.
• Whilst not always the case, SMEs generally have a preference for working with a centre located in the same country as them.
• Routes for following on from the work of the initial SHAPE project should be investigated, e.g. highlighting further collaborative funding opportunities.
• Security requirements can cause delays – it is important to highlight these as early as possible to ensure correct choices are made at the start.
• PRACE resources (and shared HPC resources generally) may have queuing systems which offer quite a different experience to the SME, and may not meet their expectations – this needs to be highlighted early to the SME so that their expectations are managed.
• Regular communication between the SME and the partners appears to be a common theme across the projects which encountered no significant barriers.
• Flexibility is key, from both sides – the initial workplans may need to be adapted, and SMEs need to remain engaged and adapt their goals accordingly. Also, SMEs by their very nature may not be able to provide consistent effort in the same way as larger organisations, so an appreciation of this at the outset is beneficial.
• The dual-application procedure for SHAPE (once for a SHAPE expert, then again for machine time) was confusing for all concerned. An attempt has been made to address this in the third call (see Section 3.5).
• To avoid confusion and potential conflict, some SHAPE terms and conditions or "rules of engagement" should be drafted and signed up to by collaborators, to ensure that the scope and indeed the limitations of the SHAPE programme are clearly understood.
• Many of the applications to SHAPE involve third-party software, so potential licensing issues need to be investigated early in the project.
• Another potential issue could be successive applications from the same SME to SHAPE, which could be seen as unfair competition, using public money, with commercial HPC services.
• Some PRACE partner nations have policies restricting the usage of their HPC facilities for projects with industrial partners – consideration should be given to the appropriate choice of facility.

3.5 SHAPE third call
Following the successful pilot run of SHAPE under PRACE-3IP, and the subsequent second call for applications under PRACE-4IP, the third call was launched on November 16th, 2015 and closed on January 14th, 2016.
Given the feedback from the previous calls, the application process for this call was changed. One of the main issues repeatedly raised was the double application process: applying for SHAPE assistance and then, if successful, having to apply for machine time. To mitigate this, the application form was amended to give the SME the opportunity to include more technical information (if known; it was anticipated that many SMEs would not have this knowledge to hand at this stage, but if they did they could supply it here). In addition, the review team was composed as for the second call, but with the addition of the Preparatory Access Type C coordinator, who gave a high-level preparatory-access-style appraisal of the applications to try to identify any potential showstoppers early in the process. This was a useful exercise: the coordinator's input to the review was invaluable and provided an angle on the feasibility of the applications that might have been missed by the other board members. However, it is expected that the successful applicants will still have to go through the PA access scheme to get machine time: at least they will have confidence in being successful with their bid, but it is strongly recommended that this situation is revisited before the next call to see if it can be made a true single-application process.
The third call received eight applications, listed in Table 14 below. As of March 2016, the applications have been approved and are in the process of being matched with partners.

Company | Country | Project Title
ACOBIOM | France | MARS (Matrix of RNA-Seq)
Airinnova AB | Sweden | High level optimization in aerodynamic design
AmpliSIM | France | DemocraSIM: DEMOCRatic Air quality SIMulation
ANEMOS SRL | Italy | SUNSTAR: Simulation of UNSteady Turbulent flows for the AeRospace industry
BAC Engineering Consultancy Group | Spain | Numerical simulation of accidental fires with a spillage of oil in large buildings
Creo Dynamics AB | Sweden | Large scale aero-acoustics applications using open source CFD
FDD Engitec S.L. | Spain | Pressure drop simulation for a compressed gas closed system
Pharmacelera | Spain | HPC methodologies for PharmScreen
Table 14: Third call applications

3.6 SHAPE: future
The frequency of the SHAPE calls is being increased to every six months, so the next call is planned to open in June 2016 and then continue at six-monthly intervals after that.
As noted earlier, it is recommended that the SHAPE application process be reviewed and enhanced further to ensure it is a single-application process. With regard to PRACE-5IP, another recommendation is to have a pool of effort for SHAPE projects: at the moment partners already have effort, and they volunteer this effort to take on an SME collaboration. However, this is rather inflexible, for example if the SME has a preference to work with a particular partner but that partner has no effort remaining, or if the expertise required for a project lies with a partner without effort in the SHAPE-encompassing work package. In these cases, having a pool of effort that the partners could access as appropriate would make matching the SMEs with the most suitable PRACE partner a much more straightforward process.
The number of SMEs applying at each call has been decreasing, albeit slowly. Consideration should be given to ways of further raising awareness of the SHAPE programme and enhancing publicity. Indeed, steps are already being taken in this direction and SHAPE is going to liaise with the Industry Advisory Committee to gain advice and assistance with this.

4 Summary
Two parallel sub-tasks on application enabling in Work Package 7 of PRACE-4IP have been described, including final reports on the supported applications. These two activities have been organized into support projects formed on the basis of either scaling and optimisation support for Preparatory Access, or SHAPE.

4.1 Preparatory Access Type C
During PRACE-4IP, Task 7.1.A successfully performed five Cut-offs for Preparatory Access, including the associated review process and support for the approved projects. In total, four Preparatory Access type C projects have been supported or are currently supported by T7.1.A in PRACE-4IP. Most of the projects reported in this deliverable plan to produce, or have already produced, a white paper. Approved white papers are published online on the PRACE RI web page [3]. Table 15 gives an overview of the status of the white papers for all projects. The projects from the Cut-off of March 2016 are currently being reviewed and therefore do not appear in this deliverable.

Project ID | White paper | White paper status
2010PA2431 | – | No white paper produced
2010PA2452 | WP207: Hybrid MIMD/SIMD High Order DGTD Solver for the Numerical Modeling of Light/Matter Interaction on the Nanoscale | Published online [4]
2010PA2457 | WP208: Large Scale Parallelized 3d Mesoscopic Simulations of the Mechanical Response to Shear in Disordered Media | Published online [5]
2010PA2458 | WP209: Optimising PICCANTE – an Open Source Particle-in-Cell Code for Advanced Simulations on Tier-0 Systems | Published online [6]
2010PA2486 | WP210: Parallel Subdomain Coupling for non-matching Meshes in Alya | Published online [7]
2010PA2737 | WP211: Numerical Simulation of complex turbulent Flows with Discontinuous Galerkin Method | Published online [10]
2010PA2821 | – | Project results will be published in a separate white paper in the context of Work Package 7
2010PA3125 | – | Project finishes by the end of July 2016; a white paper will subsequently be produced
2010PA3056 | – | Project finishes by the end of July 2016; a white paper will subsequently be produced
Table 15: White paper status of the current PA C projects.
Table 15 shows the success of Task 7.1.A, as almost all finalised projects have published their results or plan to publish them in the near future.

4.2 SHAPE
The SHAPE programme continues, with a third call having just concluded and the next call planned to open in June 2016. Six of the second-call projects are finished, and their white papers will be delivered in April 2016. The remaining second-call projects are progressing. Recommendations have been made to enhance the SHAPE process, such as streamlining the application to a single-step process, improving publicity, and, in PRACE-5IP, creating a pool of effort for partners willing to work with the SMEs.