E-Infrastructures
H2020-EINFRA-2014-2015
EINFRA-4-2014: Pan-European High Performance Computing
Infrastructure and Services
PRACE-4IP
PRACE Fourth Implementation Phase Project
Grant Agreement Number: EINFRA-653838
D7.1
Periodic Report on Applications Enabling
Final
Version: 1.0
Author(s): Paul Graham, EPCC; Sebastian Lührs, JUELICH
Date: 15.04.2016
Project and Deliverable Information Sheet
PRACE Project
Project Ref. №: EINFRA-653838
Project Title: PRACE Fourth Implementation Phase Project
Project Web Site: http://www.prace-project.eu
Deliverable ID: D7.1
Deliverable Nature: Report
Dissemination Level: PU *
Contractual Date of Delivery: 29 / April / 2016
Actual Date of Delivery: 30 / April / 2016
EC Project Officer: Leonardo Flores Añover
* - The dissemination levels are indicated as follows: PU – Public, CO – Confidential, only for members of the consortium (including the Commission Services), CL – Classified, as referred to in Commission Decision 2991/844/EC.
Document Control Sheet
Document
  Title: Periodic Report on Applications Enabling
  ID: D7.1
  Version: 1.0
  Status: Final
  Available at: http://www.prace-project.eu
  Software Tool: Microsoft Word 2010
  File(s): D7.1.docx
Authorship
  Written by: Paul Graham, EPCC; Sebastian Lührs, JUELICH
  Contributors: Gabriel Hautreux, CINES; Tristan Cabel, CINES; Stéphane Lanteri, INRIA; Dimitris Dellis, IASA; Volker Weinberg, LRZ; Bertrand Cirou, CINES; Juan Carlos Caja, BSC; Andrew Emerson, CINECA; John Donners, SURFsara; Ahmet Duran, ITU; Yakup Hundur, ITU; Jörg Hertzer, HLRS; Bärbel Große-Wöhrmann, HLRS; Juan Carlos Garcia, BSC; Jussi Heikonen, CSC; Esko Järvinen, CSC; Claudio Arlandini, CINECA; Raffaele Ponzini, CINECA; Vittorio Ruggiero, CINECA; Isabelle Dupays, IDRIS; Lola Falletti, IDRIS; Philippe Segers, GENCI; Thomas Palychata, GENCI
  Reviewed by: Hervé Lozach, CEA; Thomas Eickermann, JUELICH
  Approved by: MB/TB
Document Status Sheet
Version  Date            Status          Comments
0.1      18/March/2016   Draft           Set up document structure
0.2      24/March/2016   Draft           Added PA C content
0.3      01/April/2016   Draft           Added SHAPE content
0.4      03/April/2016   Draft           Formatting
0.5      05/April/2016   Draft           Content, formatting
0.6      11/April/2016   Draft           After project-internal review
0.7      13/April/2016   Draft           Further post-review revision
1.0      15/April/2016   Final version
Document Keywords
Keywords: PRACE, HPC, Research Infrastructure, Preparatory Access, SHAPE
Disclaimer
This deliverable has been prepared by the responsible Work Package of the Project in
accordance with the Consortium Agreement and the Grant Agreement n° EINFRA-653838. It
solely reflects the opinion of the parties to such agreements on a collective basis in the context
of the Project and to the extent foreseen in such agreements. Please note that even though all
participants to the Project are members of PRACE AISBL, this deliverable has not been
approved by the Council of PRACE AISBL and therefore does not emanate from it nor
should it be considered to reflect PRACE AISBL’s individual opinion.
Copyright notices
© 2016 PRACE Consortium Partners. All rights reserved. This document is a project
document of the PRACE project. All contents are reserved by default and may not be
disclosed to third parties without the written consent of the PRACE partners, except as
mandated by the European Commission contract EINFRA-653838 for reviewing and
dissemination purposes.
All trademarks and other rights on third party products mentioned in this document are
acknowledged as owned by the respective holders.
Table of Contents
Project and Deliverable Information Sheet ......................................................................................... i
Document Control Sheet ........................................................................................................................ i
Document Status Sheet ......................................................................................................................... ii
Document Keywords ............................................................................................................................ iii
Table of Contents ................................................................................................................................. iv
List of Figures ........................................................................................................................................ v
List of Tables......................................................................................................................................... vi
References and Applicable Documents .............................................................................................. vi
List of Acronyms and Abbreviations ................................................................................................. vii
List of Project Partner Acronyms ....................................................................................................... ix
Executive Summary .............................................................................................................................. 2
1 Introduction .................................................................................................................................... 3
2 T7.1.A Petascaling & Optimisation Support for Preparatory Access Projects – Preparatory Access Calls ............................................................................................................................................ 1
2.1 Cut-off statistics ................................................................................................................................... 1
2.2 Review Process ..................................................................................................................................... 3
2.3 Assigning of PRACE collaborators .................................................................................................... 4
2.4 Monitoring of projects ......................................................................................................................... 4
2.5 Hand-over between PRACE-3IP and PRACE-4IP PA type C projects .......................................... 5
2.6 PRACE Preparatory Access type C projects covered by this report .............................................. 6
2.7 Dissemination ....................................................................................................................................... 9
2.8 Cut-off June 2014 ................................................................................................................................. 9
2.8.1 Numerical modeling of the interaction of light waves with nanostructures using a high order discontinuous finite element method, 2010PA2452 ................................................................................ 9
2.8.2 Large scale parallelized 3d mesoscopic simulations of the mechanical response to shear in disordered media, 2010PA2457 ........................................................................................................... 13
2.8.3 PICCANTE: an open source particle-in-cell code for advanced simulations on tier-0 systems, 2010PA2458 ......................................................................................................................................... 16
2.8.4 OpenFOAM capability for industrial large scale computation of the multiphase flow of future automotive component: step 2, 2010PA2431 ...................................................................................... 21
2.9 Cut-off September 2014 ..................................................................................................................... 22
2.9.1 Parallel subdomain coupling for non-matching mesh problems in ALYA, 2010PA2486 ........... 22
2.10 Cut-off December 2014 .................................................................................................................... 25
2.10.1 Numerical simulation of complex turbulent flows with Discontinuous Galerkin method, 2010PA2737 ......................................................................................................................................... 25
2.11 Cut-off March 2015 .......................................................................................................................... 30
2.11.1 Large Eddy Simulation of unsteady gravity currents and implications for mixing, 2010PA2821 ......................................................................................................................................... 30
3 T7.1.B SHAPE ............................................................................................................................. 35
3.1 SHAPE Second call: Applications and Review Process .................................................................. 35
3.2 SHAPE Second call: Status ................................................................................................................ 36
3.3 SHAPE second call: Project summaries ........................................................................................... 37
3.3.1 Ergolines: HPC-based Design of a Novel Electromagnetic Stirrer for Steel Casting ................. 37
3.3.2 Cybeletech: Numerical simulations for plant breeding optimization .......................................... 39
3.3.3 RAPHI: rarefied flow simulations on Intel Xeon Phi .................................................................. 40
3.3.4 Open Ocean: High Performance Processing Chain - faster on-line statistics calculation ........... 41
3.3.5 Hydros Innovation: Automatic Optimal Hull Design by Means of VPP Applications on HPC Platforms .............................................................................................................................................. 43
3.3.6 Vortex: Numerical Simulation for Vortex-Bladeless ................................................................... 44
3.3.7 DesignMethods: Coupled sail and appendage design method for multihull based on numerical optimization ......................................................................................................................................... 45
3.3.8 Ingenieurbüro Tobias Loose: HPCWelding: parallelized welding analysis with LS-Dyna ......... 46
3.3.9 WB-Sails Ltd: CFD simulations of sails and sailboat performance ............................................. 47
3.3.10 Principia: HPC for Hydrodynamics Database Creation ............................................................. 48
3.3.11 Algo'tech Informatique ............................................................................................................... 49
3.4 Summary of lessons learned and feedback ....................................................................................... 51
3.5 SHAPE third call ................................................................................................................................ 52
3.6 SHAPE: future .................................................................................................................................... 53
4 Summary ...................................................................................................................................... 54
4.1 Preparatory Access Type C ............................................................................................................... 54
4.2 SHAPE ................................................................................................................................................. 55
List of Figures
Figure 1: Number of submitted and accepted proposals for PA type C per Cut-off. .............................. 2
Figure 2: Amount of PMs assigned to PA type C projects per Cut-off. .................................................. 2
Figure 3: Number of projects per scientific field. ................................................................................... 3
Figure 4: Timeline of the PA C projects. ................................................................................................ 5
Figure 5: View of the computational domain for the Y-shaped waveguide (left) and contour lines of
the amplitude of the electric field (right)............................................................................................... 11
Figure 6: Strong scalability analysis of the DGTD solver with P2 (top), P3 (middle) and P4 (bottom)
interpolation. ......................................................................................................................................... 12
Figure 7: The inverse average iteration time of initial ELASTO code, using up to 32 cores, as function
of number of cores on Froggy and Curie, for system sizes 64^3, 256^3, 512^3 and 1024^3 ......................... 14
Figure 8: The percentage of time spent in MPI calls (left) and the average MPI message size during
run, as function of number of cores on Curie, for system size 64^3, 256^3 and 512^3. .............................. 15
Figure 9: Most time consuming routines for strong (top) and weak (bottom) scaling tests, before (left)
and after (right) the optimization work. ................................................................................................ 17
Figure 10: Old vs. new output strategy overview.................................................................................. 18
Figure 11: Comparison of the old and the new output strategy for a strong scaling test. ..................... 19
Figure 12: Two subdomains coupling and parallel partition. The lines show the connection between
parallel partitions of the different subdomains. ..................................................................................... 23
Figure 13: Relative cost of using the subdomain coupling with a fixed number of two hundred
iterations of the GMRES solver with a Krylov space of size ten. ......................................................... 23
Figure 14: Speed up using the subdomain coupling with a fixed number of two hundred iterations of
the GMRES solver with a Krylov space of size ten, and the same case and configuration in one
subdomain. ............................................................................................................................................ 24
Figure 15: Implicit coupling applied to the Navier-Stokes equations. (Left) Meshes (Right) Velocity
and pressure. .......................................................................................................................................... 24
Figure 16: FSI benchmark. (Left) Geometry. (Right) Results. ............................................................. 25
Figure 17: Speedup on FERMI. The values are normalized by the speedup with 1024 cores. ............. 26
Figure 18: Speedup on HORNET. The values are normalized by the speedup with 24 cores. ............. 27
Figure 19: Percentage of the time consumed in the main steps of the computations. ........................... 29
Figure 20: Shear stress (left) and turbulent kinetic energy (right) profiles for different value of k are
shown. The dashed line represent the modelled quantities, while the continuous lines the resolved
ones........................................................................................................................................................ 30
Figure 21: Speedup vs. cores test1: gravity current in a channel. ......................................................... 32
Figure 22: Results for the channel test: propagation of the gravity current (density field) ................... 32
Figure 23: Benchmark for the 3D gravity current simulation. .............................................................. 33
Figure 24: Speedup vs. cores for the 3D gravity current test. ............................................................... 33
Figure 25: Top views of geometry and mesh, and velocity field .......................................................... 38
Figure 26: Speed-up as a function of the number of cores .................................................................... 39
List of Tables
Table 1: Projects, which were established and finalized in the PRACE-3IP extension phase, but had to
be finally reported in this deliverable. ..................................................................................................... 7
Table 2: Projects, which were established in the PRACE-3IP extension phase, but were supported by
PRACE-4IP T7.1.A. ................................................................................................................................ 8
Table 3: Projects, which were established in PRACE-4IP. ..................................................................... 8
Table 4: Strong scaling of the DGTD solver with P2 interpolation (on each node, we spawn two MPI
process and eight OpenMP threads per MPI process). .......................................................................... 12
Table 5: Strong scaling of the DGTD solver with P3 interpolation (on each node, we spawn two MPI
process and eight OpenMP threads per MPI process). .......................................................................... 12
Table 6: Strong scaling of the DGTD solver with P4 interpolation (on each node, we spawn two MPI
process and eight OpenMP threads per MPI process). .......................................................................... 12
Table 7: Scalasca analysis of the piccante core routines. ...................................................................... 17
Table 8: Maximum writing speed by using the new output strategy..................................................... 19
Table 9: Scaling performances on FERMI ............................................................................................ 27
Table 10: Scaling performances on HORNET ...................................................................................... 27
Table 11: Scaling performances on MareNostrum................................................................................ 28
Table 12: I/O writing time in sec with and without HDF5.................................................................... 28
Table 13: SHAPE applications to the second call ................................................................................. 36
Table 14: Third call applications ........................................................................................................... 53
Table 15: White paper status of the current PA C projects. .................................................................. 55
References and Applicable Documents
[1] http://www.prace-project.eu (identical to http://www.prace-ri.eu)
[2] http://www.prace-ri.eu/IMG/pdf/D7.1.3_3ip.pdf
[3] http://www.prace-ri.eu/white-papers/
[4] Hybrid MIMD/SIMD High Order DGTD Solver for the Numerical Modeling of
Light/Matter Interaction on the Nanoscale, http://www.prace-ri.eu/IMG/pdf/WP207.pdf
[5] Large Scale Parallelized 3d Mesoscopic Simulations of the Mechanical Response to
Shear in Disordered Media, http://www.prace-ri.eu/IMG/pdf/WP208.pdf
[6] Optimising PICCANTE – an Open Source Particle-in-Cell Code for Advanced
Simulations on Tier-0 Systems, http://www.prace-ri.eu/IMG/pdf/WP209.pdf
[7] Parallel Subdomain Coupling for non-matching Meshes in Alya, http://www.prace-ri.eu/IMG/pdf/WP210.pdf
[8] A. Abbà, L. Bonaventura, M. Nini, M. Restelli. Dynamic models for Large Eddy
Simulation of compressible flows with a high order DG method. Computers and Fluids. DOI:
10.1016/j.compfluid.2015.08.021, in press
[9] M. Nini, A. Abbà, M. Germano, M. Restelli. Analysis of a Hybrid RANS/LES Model
using RANS Reconstruction. Proceeding of iTi2014 - conference on turbulence, Bertinoro,
September 21-24, 2014
[10] Numerical Simulation of complex turbulent Flows with Discontinuous Galerkin
Method, http://www.prace-ri.eu/IMG/pdf/wp211.pdf
[11] PRACE-3IP Deliverable 5.2 “Integrated HPC Access Programme for SMEs”, February
2013
[12] PRACE-3IP Deliverable 5.3.1 “PRACE Integrated Access Programme Launch”, June
2013
[13] PRACE-3IP Deliverable 5.3.2 “Results of the Integrated Access Programme Pilots”,
June 2014
[14] PRACE-3IP Deliverable 5.3.3 “Report on the SHAPE Implementation”, January 2015
List of Acronyms and Abbreviations
aisbl        Association International Sans But Lucratif (legal form of the PRACE-RI)
BCO          Benchmark Code Owner
CoE          Center of Excellence
CPU          Central Processing Unit
CUDA         Compute Unified Device Architecture (NVIDIA)
DARPA        Defense Advanced Research Projects Agency
DEISA        Distributed European Infrastructure for Supercomputing Applications; EU project by leading national HPC centres
DoA          Description of Action (formerly known as DoW)
EC           European Commission
EESI         European Exascale Software Initiative
EoI          Expression of Interest
ESFRI        European Strategy Forum on Research Infrastructures
GB           Giga (= 2^30 ~ 10^9) Bytes (= 8 bits), also GByte
Gb/s         Giga (= 10^9) bits per second, also Gbit/s
GB/s         Giga (= 10^9) Bytes (= 8 bits) per second, also GByte/s
GÉANT        Collaboration between National Research and Education Networks to build a multi-gigabit pan-European network. The current EC-funded project as of 2015 is GN4.
GFlop/s      Giga (= 10^9) Floating point operations (usually in 64-bit, i.e. DP) per second, also GF/s
GHz          Giga (= 10^9) Hertz, frequency = 10^9 periods or clock cycles per second
GPU          Graphic Processing Unit
HET          High Performance Computing in Europe Taskforce. Taskforce by representatives from the European HPC community to shape the European HPC Research Infrastructure. Produced the scientific case and valuable groundwork for the PRACE project.
HMM          Hidden Markov Model
HPC          High Performance Computing; Computing at a high performance level at any given time; often used as a synonym for Supercomputing
HPL          High Performance LINPACK
ISC          International Supercomputing Conference; European equivalent to the US based SCxx conference. Held annually in Germany.
KB           Kilo (= 2^10 ~ 10^3) Bytes (= 8 bits), also KByte
LINPACK      Software library for Linear Algebra
MB           Management Board (highest decision making body of the project)
MB           Mega (= 2^20 ~ 10^6) Bytes (= 8 bits), also MByte
MB/s         Mega (= 10^6) Bytes (= 8 bits) per second, also MByte/s
MFlop/s      Mega (= 10^6) Floating point operations (usually in 64-bit, i.e. DP) per second, also MF/s
MIC          Many Integrated Core
MooC         Massive Open Online Course
MoU          Memorandum of Understanding
MPI          Message Passing Interface
NDA          Non-Disclosure Agreement. Typically signed between vendors and customers working together on products prior to their general availability or announcement.
PA           Preparatory Access (to PRACE resources)
PATC         PRACE Advanced Training Centres
PRACE        Partnership for Advanced Computing in Europe; Project Acronym
PRACE 2      The upcoming next phase of the PRACE Research Infrastructure following the initial five year period.
PRIDE        Project Information and Dissemination Event
RI           Research Infrastructure
SHAPE        SME HPC Adoption Programme in Europe
SME          Small and Medium Enterprises
TB           Technical Board (group of Work Package leaders)
TB           Tera (= 2^40 ~ 10^12) Bytes (= 8 bits), also TByte
TCO          Total Cost of Ownership. Includes recurring costs (e.g. personnel, power, cooling, maintenance) in addition to the purchase cost.
TDP          Thermal Design Power
TFlop/s      Tera (= 10^12) Floating-point operations (usually in 64-bit, i.e. DP) per second, also TF/s
Tier-0       Denotes the apex of a conceptual pyramid of HPC systems. In this context the Supercomputing Research Infrastructure would host the Tier-0 systems
Tier-1       National or topical HPC centres in the conceptual pyramid
UNICORE      Uniform Interface to Computing Resources. Grid software for seamless access to distributed resources.
WP           Work Package
List of Project Partner Acronyms
BADW-LRZ           Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Germany (3rd Party to GCS)
BILKENT            Bilkent University, Turkey (3rd Party to UYBHM)
BSC                Barcelona Supercomputing Center - Centro Nacional de Supercomputacion, Spain
CaSToRC            Computation-based Science and Technology Research Center, Cyprus
CCSAS              Computing Centre of the Slovak Academy of Sciences, Slovakia
CEA                Commissariat à l’Energie Atomique et aux Energies Alternatives, France (3rd Party to GENCI)
CESGA              Fundacion Publica Gallega Centro Tecnológico de Supercomputación de Galicia, Spain (3rd Party to BSC)
CINECA             CINECA Consorzio Interuniversitario, Italy
CINES              Centre Informatique National de l’Enseignement Supérieur, France (3rd Party to GENCI)
CNRS               Centre National de la Recherche Scientifique, France (3rd Party to GENCI)
CSC                CSC Scientific Computing Ltd., Finland
CSIC               Spanish Council for Scientific Research (3rd Party to BSC)
CYFRONET           Academic Computing Centre CYFRONET AGH, Poland (3rd party to PNSC)
EPCC               EPCC at The University of Edinburgh, UK
ETHZurich (CSCS)   Eidgenössische Technische Hochschule Zürich – CSCS, Switzerland
FIS                FACULTY OF INFORMATION STUDIES, Slovenia (3rd Party to ULFME)
GCS                Gauss Centre for Supercomputing e.V.
GENCI              Grand Equipement National de Calcul Intensif, France
GRNET              Greek Research and Technology Network, Greece
INRIA              Institut National de Recherche en Informatique et Automatique, France (3rd Party to GENCI)
IST                Instituto Superior Técnico, Portugal (3rd Party to UC-LCA)
IUCC               INTER UNIVERSITY COMPUTATION CENTRE, Israel
JKU                Institut fuer Graphische und Parallele Datenverarbeitung der Johannes Kepler Universitaet Linz, Austria
JUELICH            Forschungszentrum Juelich GmbH, Germany
KTH                Royal Institute of Technology, Sweden (3rd Party to SNIC)
LiU                Linkoping University, Sweden (3rd Party to SNIC)
NCSA               NATIONAL CENTRE FOR SUPERCOMPUTING APPLICATIONS, Bulgaria
NIIF               National Information Infrastructure Development Institute, Hungary
NTNU               The Norwegian University of Science and Technology, Norway (3rd Party to SIGMA)
NUI-Galway         National University of Ireland Galway, Ireland
PRACE              Partnership for Advanced Computing in Europe aisbl, Belgium
PSNC               Poznan Supercomputing and Networking Center, Poland
RISCSW             RISC Software GmbH
RZG                Max Planck Gesellschaft zur Förderung der Wissenschaften e.V., Germany (3rd Party to GCS)
SIGMA2             UNINETT Sigma2 AS, Norway
SNIC               Swedish National Infrastructure for Computing (within the Swedish Science Council), Sweden
STFC               Science and Technology Facilities Council, UK (3rd Party to EPSRC)
SURFsara           Dutch national high-performance computing and e-Science support center, part of the SURF cooperative, Netherlands
UC-LCA             Universidade de Coimbra, Laboratório de Computação Avançada, Portugal
UCPH               Københavns Universitet, Denmark
UHEM               Istanbul Technical University, Ayazaga Campus, Turkey
UiO                University of Oslo, Norway (3rd Party to SIGMA)
ULFME              UNIVERZA V LJUBLJANI, Slovenia
UmU                Umea University, Sweden (3rd Party to SNIC)
UnivEvora          Universidade de Évora, Portugal (3rd Party to UC-LCA)
UPC                Universitat Politècnica de Catalunya, Spain (3rd Party to BSC)
UPM/CeSViMa        Madrid Supercomputing and Visualization Center, Spain (3rd Party to BSC)
USTUTT-HLRS        Universitaet Stuttgart – HLRS, Germany (3rd Party to GCS)
VSB-TUO            VYSOKA SKOLA BANSKA - TECHNICKA UNIVERZITA OSTRAVA, Czech Republic
WCNS               Politechnika Wroclawska, Poland (3rd party to PNSC)
Executive Summary
Task T7.1 “Enabling Applications Codes for PRACE Systems” in Work Package 7 (WP7) of
PRACE-4IP provides application enabling support for HPC applications that are important to
European researchers and small and medium enterprises, ensuring that these applications can
effectively exploit HPC systems. T7.1 comprised two activities:
T7.1.A Petascaling & Optimisation Support for Preparatory Access Projects:
This activity provided code enabling and optimisation support to European researchers as well as
industrial projects to make their applications ready for Tier-0 systems. Projects can apply for this
service at any time via the Preparatory Access Call Type C (PA C), with a cut-off every three
months for the evaluation of proposals. Five such cut-offs have taken place during PRACE-4IP.
The report focuses on the optimization work and the results obtained by the projects completed in
PRACE-4IP, and also covers the last PA C projects of PRACE-3IP, which finished after the final
PRACE-3IP deliverable had been completed and therefore had not yet been reported. In total,
seven PA C projects have finished their work. Statistics on the PA C cut-offs in PRACE-4IP as
well as a description of the call organization are also included. The results of the completed
projects have also been documented in white papers, which were published on the PRACE-RI
website [1].
T7.1.B SHAPE:
This activity continued the support for SHAPE (the SME HPC Adoption Programme in
Europe). SHAPE aims to raise awareness and provide European SMEs with the expertise
necessary to take advantage of the innovation possibilities created by High-Performance
Computing (HPC), thus increasing their competitiveness. It holds regular calls, and successful
applicants to the SHAPE programme get support effort from a PRACE HPC expert and
access to machine time at a PRACE centre. In collaboration with the SME, the PRACE
partner helps them try out their ideas for utilising HPC to enhance their business.
This report focusses on the second call of SHAPE, looking at the results of the projects and
lessons learned from the perspective of both the SMEs and the PRACE partners. In addition,
it covers the recently closed third call for projects, outlining changes to the application
process both already implemented and recommended for future calls.
1 Introduction
Computational simulation has proved to be a powerful way of answering research questions from
a wide range of scientific fields. However, complex problems often have computational demands
that cannot be met by conventional computer systems, and supercomputers are therefore the
platform of choice for today's large-scale simulations.
PRACE offers a wide range of Tier-0 and Tier-1 architectures to the scientific community as well
as to innovative industrial projects. Using such systems efficiently places high demands on the
software packages involved, and in many cases advanced optimization work has to be applied to a
code before it can exploit the provided supercomputers. This requires substantial experience and
knowledge of programming techniques, parallelization strategies and related concepts. Applicants
often cannot meet these demands themselves, so dedicated assistance from supercomputing
experts is essential.
PRACE offers such a service through the Preparatory Access Call type C (PA C) for Tier-0
systems. PA C is managed by Task 7.1.A “Petascaling and Optimization Support for
Preparatory Access Projects”. This includes the evaluation of the PA C proposals as well as
the assignment of PRACE experts to these proposals. Furthermore, the support itself is
provided and monitored within this task. Section 2 gives a more detailed description of PA C, and
statistics on the usage of PA C in PRACE-4IP are given in Section 2.1. The review process, the
assignment of PRACE experts to the projects and the monitoring of the support work are detailed
in Sections 2.2, 2.3 and 2.4 respectively. The contents of Sections 2.2 - 2.4 can already be found in
deliverable D7.1.3 of PRACE-3IP [2]; they are repeated here for completeness and the benefit of
the reader. Section 2.5 describes the hand-over between the PRACE-3IP and PRACE-4IP projects
regarding PA C. Section 2.6 gives an overview of the Preparatory Access type C projects covered
in PRACE-4IP and lists projects supported by PRACE-3IP that were not reported in earlier
deliverables. The announcement of the call is described in Section 2.7. Finally, the work done
within the projects, along with the outcome of the optimization work, is presented in Sections 2.8
- 2.11.
The second part of this deliverable is the report on the SME HPC Adoption Programme in
Europe (SHAPE), which is a pan-European programme to support the adoption of High
Performance Computing (HPC) by European small to medium-size enterprises (SMEs). It was
developed by PRACE under its PRACE-3IP European Union funded project, and continued
under PRACE-4IP.
The SHAPE programme, presented in the PRACE-3IP Deliverable 5.2 [11] aims to equip
European SMEs with the awareness and expertise necessary to take advantage of the
innovation possibilities opened by HPC, increasing their competitiveness. The mission of the
Programme is to help European SMEs to demonstrate a tangible Return on Investment (ROI)
by assessing and adopting solutions supported by HPC, thus facilitating innovation and/or
increased operational efficiency in their businesses.
It can be challenging for SMEs to adopt HPC. They may have no in-house expertise, no
access to hardware, or be unable to commit resources to a potentially risky endeavour. This is
where SHAPE comes in, by making it easier for SMEs to make use of high-performance
computing in their business - be it to improve product quality, reduce time to delivery or
provide innovative new services to their customers. Successful applicants to the SHAPE
programme get support effort from a PRACE HPC expert and access to machine time at a
PRACE centre. In collaboration with the SME, the PRACE partner helps them try out their
ideas for utilising HPC to enhance their business.
The initial SHAPE pilot [12][13] was launched in 2013 and helped 10 SMEs adopt HPC. A
follow-up exercise to gauge the business impact showed that in almost all cases the pilot had been
of real value to the SMEs, with tangible measures of the return on investment for the SHAPE
work [14]. Following this pilot, the PRACE Council decided to operate the SHAPE programme as
a permanent service.
The SHAPE second call was launched in November 2014 and closed in January 2015, and is
reported on in Sections 3.1 to 3.4. The third call for SHAPE was launched in November 2015 and
closed in January 2016, and is reported in Section 3.5. Finally, Section 3.6 looks at the plans and
recommendations for SHAPE going forward.
The deliverable closes with a summary in Section 4, which highlights the outcomes of Task 7.1.A
and Task 7.1.B.
2 T7.1.A Petascaling & Optimisation Support for Preparatory
Access Projects – Preparatory Access Calls
Access to PRACE Tier-0 systems is managed through PRACE regular calls, which are issued
twice a year. To apply for Tier-0 resources, an application must meet technical criteria concerning
scaling capability, memory requirements and runtime setup. There are many important scientific
and commercial applications which do not meet these criteria today. To support researchers,
PRACE offers the opportunity to test and optimize their applications on the envisaged Tier-0
system prior to applying for a regular production project. This is the purpose of the Preparatory
Access Call. The PA Call allows proposals to be submitted at any time, while the review of these
proposals takes place every three months; each review date is referred to as a Cut-off. New
projects can therefore be admitted to PRACE Tier-0 systems for preparatory purposes once every
quarter. It is possible to choose between three different types of access:
• Type A is meant for code scalability tests, the outcome of which is to be included in a proposal for a future PRACE Regular Call. Users receive a limited number of core hours; the allocation period is two months.
• Type B is intended for code development and optimization by the user. Users also get a small number of core hours; the allocation period is six months.
• Type C is also designed for code development and optimization, with the core hours and the allocation period being the same as for Type B. The important difference is that Type C projects receive special assistance by PRACE experts to support the optimization. As well as access to the Tier-0 systems, the applicants also apply for 1 to 6 PMs of supporting work to be performed by PRACE experts.
The following Tier-0 systems were available for PA during the reporting period:
• CURIE, BULL Bullx cluster at GENCI-CEA, France (thin (TN), fat (FN), and hybrid (HN) nodes were available)
• FERMI, IBM Blue Gene/Q at CINECA, Italy
• HAZEL HEN, Cray XC40, replacing HORNET, Cray XC40, at GCS-HLRS, Germany
• MARENOSTRUM, IBM System X iDataplex at BSC, Spain (normal and hybrid nodes are available)
• SUPERMUC, IBM System X iDataplex at GCS-LRZ, Germany
• JUQUEEN, IBM Blue Gene/Q at GCS-JSC, Germany
2.1 Cut-off statistics
In PRACE-4IP, five Cut-offs for PA took place, resulting in three supported projects so far.
Although the Cut-offs of June 2014, September 2014 and December 2014 took place in
PRACE-3IP, they are included in the presented statistics because the corresponding projects are
either reported in this deliverable or were taken over by PRACE-4IP.
In the March 2015 Cut-off, one project had to be rejected due to poor scalability. Another project
was already supported by SHAPE in T7.1.B and used PA C only to obtain the needed computing
time; since SHAPE already handled the support for this project, no extra support by T7.1.A had to
be provided for this proposal.
In the June 2015 and September 2015 Cut-offs, no new proposals were accepted. In June 2015,
only a single proposal applied for PA C, but this project was already supported by SHAPE and
used PA C only to obtain the needed computing time. In September 2015, no new proposals
applied for PA C.
Figure 1: Number of submitted and accepted proposals for PA type C per Cut-off.
Figure 1 presents the number of proposals submitted and accepted for each Cut-off covered in this
deliverable. In total, 3 out of 6 proposals were accepted during the PRACE-4IP Cut-off phase,
from the Cut-off in March 2015 to the Cut-off in December 2015. The two projects which were
handled by SHAPE and T7.1.B are marked here as rejected because they received no extra support
from PA C and T7.1.A. The Cut-off of March 2016 is currently in progress and therefore its final
status is not yet available for this report.
Figure 2: Amount of PMs assigned to PA type C projects per Cut-off.
Figure 2 gives an overview of the number of PMs from PRACE-4IP assigned to the projects
per Cut-off. In total 11 PMs were made available to these projects.
Finally, Figure 3 provides an overview of the scientific fields covered by the supported projects in
PRACE-4IP.
[Bar chart: number of projects per scientific area (Medicine and Life Sciences, Engineering and Energy, Earth Sciences and Environment, Chemistry and Materials, Astrophysics, Mathematics and Computer Science, Fundamental Physics, Financial & Social Science).]
Figure 3: Number of projects per scientific field.
2.2 Review Process
The organization of the review procedure, the assignment of PRACE collaborators and the
supervision of the PA C projects are managed by task 7.1.A. In this section, the review
process for the preparatory access proposals of Type C is explained.
All preparatory access proposals undergo a technical review performed by technical staff of
the hosting sites to ensure that the underlying codes are in principle able to run on the
requested system. In parallel, all projects are additionally reviewed by work package 7 in
order to assess their optimization requests. Each proposal is assigned to two WP7 reviewers.
The review is performed by PRACE partners who all have a strong background in
supercomputing. Currently a list of 24 experts is maintained and the task leader has the
responsibility to contact them to launch the review process. As the procedure of reviewing
proposals and establishing the collaboration of submitted projects and PRACE experts takes
place only four times a year it is necessary to keep the review process swift and efficient. A
close collaboration between PRACE aisbl, T7.1.A and the hosting sites is important in this
context. The process for both the technical and the WP7 review is limited to two weeks. In
close collaboration with PRACE aisbl and the hosting sites, the whole procedure from PA
Cut-off to project start on PRACE supercomputing systems is completed in less than six
weeks.
Based on the proposals the Type C reviewers need to focus on the following aspects:
• Does the project require support for achieving production runs on the chosen architecture?
• Are the performance problems and their underlying reasons well understood by the applicant?
• Is the amount of support requested reasonable for the proposed goals?
• Will the code optimisation be useful to a broader community, and is it possible to integrate the results achieved during the project into the main release of the code(s)?
• Will there be restrictions in disseminating the results achieved during the project?
Additionally, the task leader evaluates whether the level and type of support requested is still
available within PRACE. Finally, the recommendation from WP7 to accept or reject the
proposal is made.
Based on the information provided by the reviewers, the Board of Directors makes the final
decision on whether proposals are approved or rejected. The outcome is communicated to the
applicant through PRACE aisbl. Approved proposals receive the contact data of the assigned
PRACE collaborators; rejected projects are provided with further advice on how to address the
shortcomings.
2.3 Assigning of PRACE collaborators
To ensure the success of the projects, it is essential to assign suitable experts from the PRACE
project. Based on the optimization issues and support requests described in the proposal, experts
are chosen who are most familiar with the subject matter.
This is done in two steps: First, summaries of the proposals describing the main optimization
issues are distributed via corresponding mailing lists. Here, personal data is explicitly
removed from the reports to maintain the anonymity of the applicants. Interested experts can
get in touch with the task leader offering to work on one or more projects.
Should the response not be sufficient to cover the support requirements of the projects, the
task leader contacts the experts directly and asks them to contribute.
There is one exception to the procedure when a proposal has a close connection to a PRACE
site which has already worked on the code: In this case this site is asked first if they are able
to extend the collaboration in the context of the new PA C project.
This procedure has proven to be extremely successful. No proposals had to be rejected in the
past reporting period due to a lack of available support.
The assignment of PRACE experts takes place concurrently with the review process so that the
entire review can be completed within six weeks. This has proven to be a suitable approach, as the
resulting overhead is negligible.
As soon as the review process is finished, the support experts are introduced to the PIs and
can start the work on the projects. The role of the PRACE collaborator includes the following
tasks:
• Formulating a detailed work plan together with the applicant,
• Participating in the optimization work,
• Reporting the status in the task 7.1.A phone conference every second month,
• Participating in the writing of the final report together with the PI (the PI has the main responsibility for this report), due at project end and requested by the PRACE office,
• Writing a white paper containing the results, which is published on the PRACE web site.
2.4 Monitoring of projects
Task 7.1.A includes the supervision of the Type C projects. This is challenging as the
projects’ durations (six months) and the intervals of the Cut-offs (3 months) are not cleanly
aligned. Due to this, projects do not necessarily start and end at the same time but overlap, i.e.
at each point in time different projects might be in different phases. To solve this problem, a
phone conference takes place in task 7.1.A every two months to discuss the status of running
projects, to advise on how to proceed with new projects and to manage the finalization and
reporting of finished projects.
In addition, the T7.1.A task leader gives a status overview in a monthly WP7 conference call
to address all PRACE collaborators who are involved in these projects. All project relevant
information is maintained on a PRACE wiki page, which is available to all PRACE
collaborators.
The T7.1.A task leader is also available to address urgent problems and additional phone
conferences are held in such cases.
Twice a year, a WP7 face-to-face meeting is scheduled. This meeting gives all involved
collaborators the opportunity to discuss the status of the projects and to exchange their
experience.
2.5 Hand-over between PRACE-3IP and PRACE-4IP PA type C projects
The support for Preparatory Access Type C projects has been and is part of all PRACE
projects (PRACE-1IP, -2IP, -3IP, -4IP). For the hand-over between the projects, the tasks involved
agreed to treat the affected projects as follows:
The hand-over between the extension phase of PRACE-3IP and PRACE-4IP PA type C
projects took place at the beginning of PRACE-4IP, February 1st, 2015. The Cut-offs, which
took place in June, September and December 2014 were still under the responsibility of T7.1
in PRACE-3IP.
Projects out of the June 2014 cut-off ran until February 1st, 2015. These projects could be
finalized within the context of the PRACE-3IP extension phase but could not be finally
reported in the final deliverable D7.1.3 [2].
The project out of the September 2014 cut-off ran until April 30th, 2015 and was completely
supported by PRACE-3IP, but could not be finally reported in the final PRACE-3IP
deliverable.
The project out of the December 2014 cut-off started on February 16th, 2015; it was completely
supported by PRACE-4IP. Thus, no hand-over of ongoing projects was needed.
[Gantt chart covering May 2014 to June 2016 for the projects PA2431, PA2452, PA2457, PA2458, PA2486, PA2737, PA2821, PA3125 and PA3056, with bars coloured by supporting project (PRACE-3IP or PRACE-4IP) and markers for the Cut-off dates.]
Figure 4: Timeline of the PA C projects.
The timeline of these projects is shown in the Gantt chart in Figure 4. The chart shows the
time span of each project. Projects which were supported by PRACE-3IP but are reported in this
deliverable are shown in red; PRACE-4IP projects are shown in green.
The slightly different starting dates of the projects per Cut-off are the result of decisions made by
the hosting members, which determine the exact start of the projects at their local sites.
Additionally, PIs can set the starting date of their projects within a limited time frame.
The final results of all these projects are described in this deliverable.
2.6 PRACE Preparatory Access type C projects covered by this report
Projects from Cut-off June 2014 and Cut-off September 2014 have their origin in the PRACE-3IP extension phase and were also finalized as part of this phase. Because of the overlap
between the creation of the final deliverable D7.1.3 of the PRACE-3IP extension phase [2]
and the creation of the corresponding final reports, these projects could not be reported in
D7.1.3. For completeness, their results are reported in this deliverable. Table 1 lists the
corresponding projects.
Cut-off June 2014

Title: Numerical modelling of the interaction of light waves with nanostructures using a high order discontinuous finite element method
Project leader: Stéphane Lanteri
PRACE expert: Gabriel Hautreux, Tristan Cabel
PRACE facility: CURIE TN, CURIE FN
PA number: 2010PA2452
Project's start: 15-Jul-2014
Project's end: 15-Jan-2015

Title: Large scale parallelized 3d mesoscopic simulations of the mechanical response to shear in disordered media
Project leader: Kirsten Martens
PRACE expert: Dimitris Dellis
PRACE facility: CURIE TN
PA number: 2010PA2457
Project's start: 01-Aug-2014
Project's end: 01-Feb-2015

Title: PICCANTE: an open source particle-in-cell code for advanced simulations on tier-0 systems
Project leader: Andrea Macchi
PRACE expert: Volker Weinberg
PRACE facility: FERMI, JUQUEEN
PA number: 2010PA2458
Project's start: 15-Jul-2014
Project's end: 15-Jan-2015

Title: OpenFOAM capability for industrial large scale computation of the multiphase flow of future automotive component: step 2
Project leader: Jerome Helie
PRACE expert: Gabriel Hautreux, Bertrand Cirou
PRACE facility: CURIE TN
PA number: 2010PA2431
Project's start: 01-Aug-2014
Project's end: 01-Feb-2015

Cut-off September 2014

Title: Parallel subdomain coupling for non-matching mesh problems in ALYA
Project leader: Guillaume Houzeaux
PRACE expert: Juan Carlos Caja
PRACE facility: MARENOSTRUM, FERMI
PA number: 2010PA2486
Project's start: 01-Nov-2014
Project's end: 30-Apr-2015
Table 1: Projects, which were established and finalized in the PRACE-3IP extension phase, but had to be
finally reported in this deliverable.
Projects from Cut-off December 2014 also have their origin within the PRACE-3IP extension
phase. However, due to the project start in February 2015 (see Section 2.5), the corresponding
project was completely handed over to PRACE-4IP T7.1.A. Table 2 lists the key information of the
corresponding project.
Cut-off December 2014

Title: Numerical simulation of complex turbulent flows with Discontinuous Galerkin method
Project leader: Antonella Abba'
PRACE expert: Andrew Emerson
PRACE facility: MARENOSTRUM, FERMI, HORNET
PA number: 2010PA2737
Project's start: 16-Feb-2015
Project's end: 15-Aug-2015
Table 2: Projects, which were established in the PRACE-3IP extension phase, but were supported by
PRACE-4IP T7.1.A.
All remaining Cut-offs, starting with the Cut-off in March 2015, take place within the
PRACE-4IP project phase and were supported by PRACE-4IP. Projects which were
established in the Cut-off in December 2015 or the Cut-off in March 2016 are still in
progress. Therefore, the final results cannot be presented in this deliverable but will be
reported in a later deliverable. Table 3 lists all of these projects.
Cut-off March 2015

Title: Large Eddy Simulation of unsteady gravity currents and implications for mixing
Project leader: Claudia Adduce
PRACE expert: John Donners
PRACE facility: MARENOSTRUM, FERMI
PA number: 2010PA2821
Project's start: 20-Apr-2015
Project's end: 14-Nov-2015

Cut-off December 2015

Title: Optimization of Hybrid Molecular Dynamics-Self Consistent Field OCCAM CODE
Project leader: Antonio De Nicola
PRACE expert: Chandan Basu
PRACE facility: MARENOSTRUM, FERMI, HAZEL HEN
PA number: 2010PA3125
Project's start: 01-Feb-2016
Project's end: 31-Jul-2016

Title: HOVE Higher-Order finite-Volume unstructured code Enhancement for compressible turbulent flows
Project leader: Claudia Adduce
PRACE expert: Thomas Ponweiser
PRACE facility: SUPERMUC, HAZEL HEN
PA number: 2010PA3056
Project's start: 01-Feb-2016
Project's end: 31-Jul-2016
Table 3: Projects, which were established in PRACE-4IP.
The evaluation of the March 2016 proposals is currently in progress and is therefore not listed
here.
2.7 Dissemination
New PA Cut-offs are normally announced on the PRACE website [1].
After the low number of new proposals in the Cut-offs of June 2015 and September 2015, PRACE
sites were asked to distribute an email to their users to advertise preparatory access and especially
the possibility of dedicated support via PA C.
Each successfully completed project should be made known to the public and therefore the
PRACE collaborators are asked to write a white paper about the optimization work carried
out. These white papers are published on the PRACE web page [3] and are also referenced by
this deliverable.
2.8 Cut-off June 2014
This section and the following sub-sections describe the optimizations performed on the
Preparatory Projects type C. The projects are listed in accordance with the Cut-off dates in
which they appeared. General information regarding the optimization work done as well as
the achieved results is presented here using the recommended evaluation form. The
application evaluation form ensures a consistent and coherent presentation of all projects
which were managed in the context of PA C. Additionally the white papers created by these
projects are referenced so that the interested reader is provided with further information.
2.8.1 Numerical modeling of the interaction of light waves with nanostructures using
a high order discontinuous finite element method, 2010PA2452
Code general features

Name: DIOGENeS - DIscOntinuous GalErkin Nano Solver
Scientific field: Computational nanophotonics
Short code description: The code is a high order finite element type solver for the numerical modeling of light interaction with nanometer scale structures. From the mathematical modeling point of view, one has to deal with the differential system of Maxwell equations in the time domain, coupled to an appropriate differential model of the behavior of the underlying material (which can be a dielectric and/or a noble metal) at optical frequencies. For the numerical solution of the resulting system of differential equations, the code implements a high order discontinuous finite element method (DGTD – Discontinuous Galerkin Time-Domain solver) which has been adapted to hybrid MIMD/SIMD computing in the context of the present project.
Programming language: Fortran 90
Supported compilers: pgf90, ifort, g95
Parallel implementation: The hybrid MIMD/SIMD parallelization in DIOGENeS combines the use of the MPI and OpenMP parallel programming models
Accelerator support: Not yet
Libraries: MPI
Building procedure: Standard Makefile
Web site: Not available yet
Licence: None
Topic 1
Main objectives
The main objectives of the project were to implement the OpenMP part of the hybrid
MIMD/SIMD parallelization strategy of the code, and to demonstrate the benefits of this
parallelization on the scaling properties of the code.
Accomplished work
The starting point software implementing the DGTD solver was programmed in Fortran and
parallelized for a distributed memory system using a SPMD strategy combining a partitioning
of the underlying tetrahedral mesh and a message passing programming model using the MPI
standard. During the project, the effort was put on introducing a fine-grain parallelization of the main loops of the solver using the OpenMP programming model. The specific feature to be optimized was the fine-grain shared memory parallelization of the underlying discontinuous finite element solver. The associated computer code has thus been equipped with OpenMP directives for the intra-node parallelization of the loops over the elements (tetrahedra) of the meshes, which occur in the core solver routines for updating the electric and magnetic field components (at each time step). Simple DO constructs with PRIVATE clauses proved sufficient to achieve acceptable speedups on the thin nodes of the Curie system. In order to further improve the scalability of the overall MPI/OpenMP solver on multi-node configurations of Curie, we also implemented a non-blocking communication protocol for exchanging the values of the electromagnetic field components attached to elements on each side of the faces located on the interfaces (surfaces) between neighbouring sub-meshes in the mesh partitioning. This protocol allows point-to-point communication operations to be partially overlapped with computation when updating the electric and magnetic field components.
Main results
We limited ourselves to a parallel performance evaluation in terms of a strong scalability analysis. For that purpose, we selected a use case typical of optical guiding applications: a Y-shaped waveguide consisting of a nanosphere embedded in vacuum. The computational domain is shown in Figure 5 below. The constructed tetrahedral mesh consists of 520,704 vertices and 2,988,103 elements. The high order discontinuous finite element method designed for the solution of the system of time-domain Maxwell equations, coupled to a Drude model for the dispersion of noble metals at optical frequencies, is formulated on a tetrahedral mesh. Within each element (tetrahedron) of the mesh, the components of the electric and magnetic field, as well as the components of the electric polarization, are approximated by a nodal (Lagrange type) interpolation method. The unknowns of the problem are thus given by the values of these physical quantities at the nodes of the polynomial interpolation. For instance, for a linear (i.e. P1) interpolation of the fields, the number of DoFs (Degrees of Freedom) within a tetrahedron is 6x4 if the element is located in vacuum, and 9x4 if the element is located in the metallic structure. For a quadratic (i.e. P2) interpolation,
the corresponding figures are 6x10 and 9x10, and so on for higher interpolation degrees. The global number of DoFs is then the sum of these figures over the elements of the given mesh.
Figure 5: View of the computational domain for the Y-shaped waveguide (left) and contour lines of the
amplitude of the electric field (right).
The strong scalability analysis has been conducted on the thin nodes of the Curie system. Each run has been made considering eight OpenMP threads per socket and two sockets per node. Plots of the parallel speedup of the DGTD solver with P2 (top), P3 (middle) and P4 (bottom) interpolation are shown in Figure 6. The maximum number of cores that has been exploited is 8192, for a simulation based on the DGTD-P4 method. A quasi-ideal scaling is obtained up to 1024 cores. Achieving a better parallel speedup for a number of MPI processes greater than 1024 would probably require a finer tetrahedral mesh (with several million mesh vertices).
Figure 6: Strong scalability analysis of the DGTD solver with P2 (top), P3 (middle) and P4 (bottom)
interpolation.
Number of cores | Wall clock time | Speed-up vs the first one | Number of nodes | Number of MPI processes
 128 | 4066 sec | 1.00 |  8 |  16
 256 | 1972 sec | 2.00 | 16 |  32
 512 |  952 sec | 4.05 | 32 |  64
1024 |  462 sec | 8.40 | 64 | 128
Table 4: Strong scaling of the DGTD solver with P2 interpolation (on each node, we spawn two MPI processes and eight OpenMP threads per MPI process).
Number of cores | Wall clock time | Speed-up vs the first one | Number of nodes | Number of MPI processes
 512 | 2580 sec | 1.00 |  32 |  64
1024 | 1271 sec | 2.00 |  64 | 128
2048 |  646 sec | 4.00 | 128 | 256
Table 5: Strong scaling of the DGTD solver with P3 interpolation (on each node, we spawn two MPI processes and eight OpenMP threads per MPI process).
Number of cores | Wall clock time | Speed-up vs the first one | Number of nodes | Number of MPI processes
2048 | 1714 sec | 1.00 | 128 |  256
4096 |  897 sec | 1.90 | 256 |  512
8192 |  529 sec | 3.25 | 512 | 1024
Table 6: Strong scaling of the DGTD solver with P4 interpolation (on each node, we spawn two MPI processes and eight OpenMP threads per MPI process).
The obtained results in terms of parallel performance are fully in line with our expectations and clearly demonstrate the potential of the high order discontinuous finite element solver that we are studying for exascale class simulations. In particular, these results open the route for possible future work towards the numerical treatment of more complex physical models relevant to nanophotonics, which will require the use of higher resolution discretization meshes involving many more structures than what has been considered in the application selected for this project.
The project also published a white paper, which can be found online under [4].
2.8.2 Large scale parallelized 3d mesoscopic simulations of the mechanical
response to shear in disordered media, 2010PA2457
Code general features
Name: ELASTO
Scientific field: Modeling the athermal shear response of 3d dense disordered material
Short code description: This code evolves the equations for a 3d lattice model for disordered systems under shear, which contain a stochastic local yielding part and a deterministic one for the long range elastic response, containing a convolution that is resolved in Fourier space. The parallelisation of the 3d FFT is divided into three steps. For clarity, the x and y directions are considered parallelized and the z direction non-parallelized. First, the data are reorganized such that x is transposed with the z direction. Then each processor performs multiple one-dimensional FFTs (multi-1D FFT) along the x direction. In the second step the y direction is transposed with the x one, and multi-1D FFTs are performed along the non-parallelized direction. Finally, z is transposed with the y direction and the algorithm is repeated.
Programming language: FORTRAN 90 / MPI
Supported compilers: intel/14.0.3.174, mkl/14.0.3.174 and bullxmpi/1.2.8.2
Parallel implementation: MPI (bullxmpi/1.2.8.2)
Accelerator support: N/A
Libraries: HDF5, MKL, FFTW3
Building procedure: Makefile (see attachment)
Web site: N/A
Licence: N/A
Topic 1
Main objectives
The aim of this project was to resolve serious scaling problems of the ELASTO code experienced on the PI's local cluster, named Froggy. Initial runs were performed on Froggy with problem sizes 64³, 256³ and 512³. In all these runs, the code exhibited a slowdown when using more than 16 cores (1 node).
Accomplished work
We can state that we could enhance the portability of our code from the Curie cluster to the
local Ciment cluster "Froggy". Switching to Intel compilers and Intel MPI, using the compiler
flags that were used on Curie, recompiling the fftw3 library and applying the minor code
changes to our local cluster Froggy, the performance and scaling of code is close to the
performance/scaling on Curie Thin Nodes. Note that for the large problem size of 10243 on
Curie the codes scales almost linearly up to 4096 cores.
Scaling results
For the small problem size of 64³, a slowdown appears when going from 8 to 16 cores within a single node. The code was compiled and run on Curie without code changes, using the compilers and flags suggested in the PRACE Best Practice Guides. BullX MPI with the Intel compilers was used. The FFTW3 library available on Curie was used to provide the FFTW3 functions. The selected compiler optimization flags were:
-O3 -xAVX -unroll -unroll-aggressive -ip
Figure 7: The inverse average iteration time of the initial ELASTO code, using up to 32 cores, as a function of the number of cores on Froggy and Curie, for system sizes 64³, 256³, 512³ and 1024³.
The left side of Figure 7 shows the inverse average iteration time of the initial ELASTO code, using up to 32 cores, as a function of the number of cores on Froggy and Curie, for system sizes 64³, 256³ and 512³. Depicted are the initial performance of the code on Froggy and Curie, together with the performance on Froggy after applying minor code, compiler and flag changes. The right side of the figure shows the inverse average iteration time of the ELASTO code as a function of the number of cores on Curie, for system sizes 64³, 256³, 512³ and 1024³, before and after applying the minor code changes.
The main hardware difference between the two machines is the network interface: Curie uses QDR while Froggy uses FDR. Their CPUs are similar: E5-2680 on Curie, E5-2670 on Froggy. Surprisingly, on Curie with no code changes the scaling is much better when using up to two nodes. The inverse average iteration time of the original code as a function of the number of cores, up to 32 cores, is presented on the left side of Figure 7. Since the code appears to exhibit better scaling on Curie, a number of runs with more than 32 cores were performed on Curie. These results are presented on the right side of Figure 7.
There is a large discrepancy in the performance and scaling of the same runs on the two machines. On Curie, the code scaling looks like a typical case. On Froggy, on the other hand, going from one to two nodes leads to a slowdown. In addition, for single-node runs, i.e. up to 16 cores, the performance on Froggy is much lower than on Curie. These two findings suggest that there is some problem on the PI's cluster related to the network, the batch system environment, the MPI implementation, etc. Inspecting the software in use on Froggy, we found that:
1. The MPI implementation is openmpi-1.6.4 with openib support, compiled with and used with the GNU-4.6.2 compilers;
2. fftw-3.3 is compiled with GNU-4.6.2;
3. Intel MPI and compiler version 13.0.1 are available on Froggy.
Inspecting the code, a number of non-crucial changes were applied. Initially all timings were performed using the Fortran cpu_time() function. All occurrences of cpu_time() calls were replaced with MPI_Wtime() calls to get an accurate measure of the elapsed time, since cpu_time() measures only the CPU time of a process and is not accurate in the case of load imbalance.
Profiling
Profiling of the code was performed using Scalasca and mpiP. In addition, some run-time variables were examined by inserting pieces of code at certain points. From the profiling runs, the percentage of run time spent in MPI calls as well as the MPI message sizes were collected. These measurements are presented in Figure 8.
Figure 8: The percentage of time spent in MPI calls (left) and the average MPI message size during the run, as a function of the number of cores on Curie, for system sizes 64³, 256³ and 512³.
The main conclusions from the profiling of the code on Curie are summarized below.
1. The code uses only a few MPI functions: a few MPI_Allreduce and MPI_Bcast calls, and mainly point-to-point send/recv calls. When the code periodically saves the trajectory, for example every 1000 iterations, it uses the HDF5 library, which also issues MPI calls. These HDF5-originated calls were not profiled here.
2. During a multistep run, the first iteration takes more time to complete than the rest of the iterations. This fact should be taken into account, especially when one runs a small number of steps.
3. During execution, on some processes and for small numbers of iterations, the FFTW plane size is not identical on all processes, although the deviation is not large (± 1-3). This introduces a small load imbalance.
4. The scalability of the code depends on the problem size. As shown in Figure 7, for a problem size of 1024³ the performance is almost linear up to 4096 cores. For smaller problem sizes, the speed-up starts to decrease after a certain number of cores.
5. The percentage of time spent in MPI calls during the run was measured, indicating that the communication time increases with increasing number of cores.
6. The average MPI message size decreases with increasing number of cores. The network performance (bandwidth/latency) depends on the message size.
The project also published a white paper, which can be found online under [5].
2.8.3 PICCANTE: an open source particle-in-cell code for advanced simulations on
tier-0 systems, 2010PA2458
Code general features
Name: piccante
Scientific field: Plasma physics
Short code description: piccante is an open source, massively parallel, fully-relativistic Particle-In-Cell (PIC) code. PIC codes are widely used in plasma physics and astrophysics to study problems where kinetic effects are relevant. A PIC code integrates the Maxwell-Vlasov equations in time. The electromagnetic field equations are solved on a grid, while the plasma distribution function is sampled with numerical particles. piccante is primarily developed to study laser-plasma interaction. The code allows a simulation to be designed with great flexibility (i.e. multiple laser pulses, arbitrary target geometry, multiple particle species, ...).
Programming language: C++
Supported compilers: successfully compiled with the GNU compiler (g++ v 4.8.2) and the IBM compiler (Blue Gene, XL v12.1)
Parallel implementation: yes, using mainly MPI (OpenMP is used in some code sections)
Accelerator support: no
Libraries: STL, GSL, BOOST (optional), jsoncpp (optional)
Building procedure: Makefile. Script for aided compilation on FERMI and JUQUEEN
Web site: http://aladyn.github.io/piccante/
Licence: GNU GPLv3
Topic 1
Main objectives
Code scaling and profiling tests
Accomplished work:
Strong scaling and weak scaling tests were performed on FERMI and JUQUEEN, using a
typical simulation case. Output routines were disabled for these tests.
Scaling was tested on up to 16384 computing cores.
Scalasca and Vampir were used for detailed profiling of the code.
Main results:
A detailed profiling of the code allowed identifying the most time-consuming routines of the
code as a function of the task number.
The Scalasca analysis at the beginning of the project showed that the core routines of the code
(everything but the output routines) scaled well on up to 16384 MPI tasks (the largest tested
configuration).
#cores | Strong scaling (sec/step) | Strong scaling (% ideal scal.) | Weak scaling (sec/step) | Weak scaling (% ideal scal.)
  512 |   --  |   --  | 3.785 | 100%
 1024 | 7.538 |  100% | 3.777 | 100%
 2048 | 3.777 | 99.8% | 3.777 | 100%
 4096 | 1.885 |  100% | 3.769 | 100%
 8192 | 0.946 | 99.6% | 3.769 | 100%
16384 | 0.485 | 97.2% | 3.769 | 100%
Table 7: Scalasca analysis of the piccante core routines.
However, the profiling highlighted very bad behaviour of the output routines. While showing decent performance on a small number of MPI tasks (1024), they required a major fraction of the computing time on larger numbers of MPI tasks. This guided us to spend most of our efforts on this problem (which was successfully addressed, see Topic 2 for details).
At the end of the optimization work, piccante with all its functions enabled, including very demanding output routines, proved to scale very well up to the maximum tested number of MPI tasks (16384). See Figure 9 below, showing the ten most time-consuming routines for strong and weak scaling tests, before and after the optimization work.
Figure 9: Most time consuming routines for strong (top) and weak (bottom) scaling tests, before (left) and
after (right) the optimization work.
Topic 2
Main objectives:
Output performance
Accomplished work:
When using more than 2048 MPI Tasks the usual MPI-IO routines (MPI_File_write)
dramatically slow down the output process up to an unacceptable level on the BlueGene
architecture. Two main routines were changed: grid-based output (electromagnetic fields,
charge density and current density, fields in the followings) and particles coordinates (particle
output in the following). Several new parallel output strategies were developed and tested:
1. Collective calls to MPI_File_write_all: this allowed a good improvement, up to 5-10 times for grid-based output and 5-10 times for particles (problem dependent). The scaling remained rather bad: the output time increased linearly with the number of MPI tasks in a strong scaling scenario.
2. Several files, one per MPI task: decent speedup for small numbers of MPI tasks but an extremely large number of output files (very difficult to handle).
3. One file written by a limited number (GroupSize = 32-128) of “sub-master” MPI tasks (one per sub-group, with the other tasks sending their buffers to the “sub-master”): improvement for fields of a factor up to 25 and up to 10 for particles (mainly limited by the size of the file). GroupSize was varied to find the best values for the architecture. The scaling with the number of MPI tasks was not ideal: sublinear in a strong scaling test.
4. Several files (2-16) for each output item, one every 1024 (or 2048) MPI tasks. The speedup at small total numbers of MPI tasks remained good (10-20 for fields, about 10 for particles). The scaling was very good: nearly constant output time in a weak scaling test and decreasing output time in a strong scaling test.
Solution 4) was chosen as the new output strategy (see Figure 10).
Figure 10: Old vs. new output strategy overview.
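piccante itself is written in C++; purely to illustrate the idea behind strategy 4 (and consistently with the other Fortran sketches in this report), the following hypothetical Fortran/MPI-IO sketch splits the world communicator into groups of up to 1024 tasks, with each group writing its own file through a collective call at group-local offsets. All names and sizes are illustrative and do not come from the piccante sources.

program grouped_output_sketch
  ! Illustrative sketch (not piccante source): one output file per group of up to
  ! 1024 MPI tasks, written with collective MPI-IO.
  use mpi
  implicit none
  integer, parameter :: group_size = 1024, nloc = 100000
  integer :: ierr, world_rank, group_id, group_comm, group_rank, fh
  integer(kind=MPI_OFFSET_KIND) :: offset
  real(8) :: buf(nloc)
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  buf = real(world_rank, 8)

  ! Split the world communicator: tasks 0..1023 -> group 0, 1024..2047 -> group 1, ...
  group_id = world_rank / group_size
  call MPI_Comm_split(MPI_COMM_WORLD, group_id, world_rank, group_comm, ierr)
  call MPI_Comm_rank(group_comm, group_rank, ierr)

  ! One file per group, written collectively at a group-local offset
  write(fname, '(a,i4.4,a)') 'particles_', group_id, '.bin'
  call MPI_File_open(group_comm, trim(fname), MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                     MPI_INFO_NULL, fh, ierr)
  offset = int(group_rank, MPI_OFFSET_KIND) * int(nloc, MPI_OFFSET_KIND) * 8_MPI_OFFSET_KIND
  call MPI_File_write_at_all(fh, offset, buf, nloc, MPI_DOUBLE_PRECISION, &
                             MPI_STATUS_IGNORE, ierr)
  call MPI_File_close(fh, ierr)

  call MPI_Comm_free(group_comm, ierr)
  call MPI_Finalize(ierr)
end program grouped_output_sketch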
Main results:
We targeted at least 8192 cores and tested up to 16384 cores.
The output time for 2048 MPI tasks on 2048 cores was improved by a factor of 15 for particles and 200 for fields.
The output time for 4096 MPI tasks on 4096 cores was improved by a factor of 34 for particles and 180 for fields.
The output time for 8192 MPI tasks on 8192 cores was improved by a factor of 40 for particles and 600 for fields.
The amount of rewriting of the output routines was significant (~50%).
Figure 11: Comparison of the old and the new output strategy for a strong scaling test.
#MPI-tasks | Particles files size (GB) | Fields files size (GB) | particles GB/sec | fields GB/sec
 1024 |  24 | 0.42 | 1.45 | 0.60
 2048 |  48 | 0.83 | 1.78 | 1.11
 4096 |  96 | 1.66 | 3.48 | 1.81
 8192 | 192 | 3.33 | 4.13 | 2.41
16384 | 384 | 6.66 | 8.12 | 4.66
Table 8: Maximum writing speed by using the new output strategy.
The new output allowed reaching a satisfactory maximum writing speed (total aggregated
writing speed on 16384 MPI-Tasks): more than 8 GB/s for the particles files and 4.66 GB/s
for the much smaller field files.
Topic 3
Main objectives:
Improved output strategies
Accomplished work:
The output routines were extensively rewritten.
The “output manager” now allows a greater variety of output requests to be defined. In particular, it is now possible to produce reduced-dimension outputs (e.g. a plane in a 3D simulation or a
1D subset in a multidimensional run) and subdomain outputs (e.g. a small box in a large 3D
simulation).
The list of output requests can also be controlled in time: a given output can be active within a
time interval with a given frequency, independently from the other requests.
Main results:
Several new output functions are now available. These functions allow a more flexible output strategy (both in time and in space). A much smaller total output size can be achieved by performing the output of only the relevant regions of the simulation when they are of interest (e.g. the output frequency can be increased to better resolve crucial processes). This makes it possible to increase the accuracy of the output data (time and spatial resolution) while limiting the required disk space.
Topic 4
Main objectives:
Add input-file support
Accomplished work:
piccante is designed as a library. Thus, the user was originally required to write or modify the “main-1.cpp” file and re-compile the code for each simulation.
New functions were introduced to initialise a simulation by reading a JSON input file. A new main file was designed to allow a typical simulation setup without any code editing.
Support for JSON parsing is provided using the library “jsoncpp”, which is licensed as “Public Domain” (or with the MIT license in some jurisdictions): https://github.com/open-source-parsers/jsoncpp.
Main results:
Support for simulation setup via input-file was successfully added.
All the typical simulation parameters can now be controlled with a user-friendly and easy to
read JSON input-file.
Topic 5
Main objectives:
Memory allocation strategies
Accomplished work:
Particle coordinates are stored in large arrays (7 double variables per particle are stored: x, y, z, px, py, pz, w).
We tested two main allocation strategies: [x1,y1,...,w1,x2,y2,...,w2,x3...] and [x1,x2,...,xN,y1,y2,...,yN,...].
On an Intel linux cluster one strategy proved to be slightly better, while on BlueGene/Q the
differences were minimal.
Main results:
We tested a few memory allocation strategies for particles. The best allocation strategy is
enabled by default in the code. This allows a slight performance gain on some architectures.
Topic 6
Main objectives:
Hybridization (MPI+OpenMP) of the code
Accomplished work:
We managed to introduce MPI+OpenMP hybridization for the electromagnetic solver and this
provided a slight performance enhancement.
We tested a similar improvement for the particle solver, but our preliminary results were
unsatisfactory.
We suspended the development of this feature to concentrate our efforts on more urgent
topics (e.g. output optimization).
Main results:
Only a limited hybridization was performed, which provides a slight performance gain in
some code sections. This feature is temporarily disabled in the code.
Topic 7
Main objectives:
Code Refactoring
Accomplished work:
Together with the development of new functions, a consistent code refactoring was pursued to
ease maintenance.
Main results:
Long and complicated functions were split into smaller and simpler ones. Unused or obsolete functions were deleted from the source files.
The project also published a white paper, which can be found online under [6].
2.8.4 OpenFOAM capability for industrial large scale computation of the multiphase
flow of future automotive component: step 2., 2010PA2431
Due to an unexpectedly long administrative procedure for one of the project members to be accepted as a user on the computer platforms, project PA2431 only ran for two weeks. Because of this problem, insufficient results were produced for publication in a final report.
Within the remaining time, the PI chose an industrial test case and updated it with the support and advice of the PRACE expert. This test case needed an older version of OpenFOAM than the one installed on Curie. Two old versions were compiled by the PRACE expert before finding the right one. The test case was run successfully at the end.
Critical points were identified to progress on the parallelization of OpenFOAM for realistic geometries and complex phase change, but the time ran out at this point. The applicant will continue the work in a new project, to be submitted again, and will plan with better secured delays and resources on his side.
2.9 Cut-off September 2014
2.9.1 Parallel subdomain coupling for non-matching mesh problems in ALYA,
2010PA2486
Code general features
Name: Alya
Scientific field: Multi-physics problems, fluid flow, structural dynamics, thermal flow
Short code description: Alya is a multi-physics code developed at the Barcelona Supercomputing Center. It is based on a finite element formulation and is structured using a modular architecture, organised in kernel, modules and services. The kernel contains the facilities required to solve any set of discretized partial differential equations, while the modules provide the physical description of a given problem.
Programming language: Fortran
Supported compilers: ifort, gfortran
Parallel implementation: MPI + OpenMP + OmpSs
Accelerator support: No
Libraries: METIS, HDF5
Building procedure: Makefile
Web site: http://bsccase02.bsc.es/alya/overview/
Licence: Open source
Topic 1
Main objectives:
This work aimed to implement Domain Composition Methods at the algebraic level to couple
subdomains with non-matching meshes in a distributed memory supercomputer environment
for multi-physics problems. The coupling is performed at the algebraic level, thus, it is almost
independent of the problem considered. This approach enables us to solve multi-domain and
multi-physics problems, using both single and multi-code approaches.
Accomplished work:
We have implemented a strategy to couple subdomains with non-matching meshes for
distributed memory supercomputers. The method can be explicit (multi code) or implicit
(single code). The latter one was implemented using an MPI communicator splitting in order
to set inter- and intra-subdomain communicator. This enabled us not to affect the original
parallelization of the code. In addition, the methodology is currently being tested for multiphysics simulations, like fluid-structure interactions; fluid-particle coupling; contact
problems; thermal flows coupled with conjugate heat transfer.
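A minimal sketch of the communicator-splitting idea is shown below; the colouring rule and all names are hypothetical and do not reflect the actual Alya implementation.

program comm_split_sketch
  ! Sketch: split MPI_COMM_WORLD into intra-subdomain communicators (one colour
  ! per coupled subdomain); the rank-based colouring rule is purely illustrative.
  use mpi
  implicit none
  integer :: ierr, world_rank, world_size
  integer :: colour, intra_comm, intra_rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ! Example colouring: the first half of the ranks handle subdomain 0,
  ! the second half subdomain 1.
  colour = merge(0, 1, world_rank < world_size/2)

  ! Each subdomain gets its own intra-subdomain communicator; the existing
  ! parallelization can keep using intra_comm instead of MPI_COMM_WORLD,
  ! while MPI_COMM_WORLD remains available for inter-subdomain coupling.
  call MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, intra_comm, ierr)
  call MPI_Comm_rank(intra_comm, intra_rank, ierr)

  print *, 'world rank', world_rank, ' -> subdomain', colour, ', local rank', intra_rank

  call MPI_Comm_free(intra_comm, ierr)
  call MPI_Finalize(ierr)
end program comm_split_sketch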
Main results:
The proposed methodology was tested in the explicit (multi-code) and implicit (single-code) approaches. The multi-code coupling does not affect the scalability of each code, because the extra communication has roughly the same cost as one matrix-vector product
communication of the normal parallelization of the code, and it is performed just once per coupling iteration or time step.
In the implicit approach, the extra communication is performed after each matrix-vector product, and the impact can be significant. In Figure 12, two subdomains to be coupled are shown, where the lines represent communicating parallel partitions of each subdomain.
Figure 12: Two subdomains coupling and parallel partition. The lines show the connection between
parallel partitions of the different subdomains.
Figure 13 shows the relative cost of using the subdomain coupling with a fixed number of two hundred iterations of the GMRES solver with a Krylov space of size ten, compared to the same case and configuration without subdomain coupling. The results show that the cost is acceptable and is more than compensated for by the ability to address different physical problems in each subdomain. In addition, the same result can be expressed in terms of the speed-up, as shown in Figure 14.
Figure 13: Relative cost of using the subdomain coupling with a fixed number of two hundred iterations of
the GMRES solver with a Krylov space of size ten.
Figure 14: Speed up using the subdomain coupling with a fixed number of two hundred iterations of the
GMRES solver with a Krylov space of size ten, and the same case and configuration in one subdomain.
Test cases are being executed for multi-physics problems such as fluid-structure interaction, particle transport, contact problems and thermal flows, as shown in Figure 15 for the single-code case and Figure 16 for the multi-code case.
Figure 15: Implicit coupling applied to the Navier-Stokes equations. (Left) Meshes (Right) Velocity and
pressure.
Figure 16: FSI benchmark. (Left) Geometry. (Right) Results.
The project also published a white paper, which can be found online under [7].
2.10 Cut-off December 2014
2.10.1 Numerical simulation of complex turbulent flows with Discontinuous Galerkin
method, 2010PA2737
Code general features
Name: DG-comp
Scientific field: Engineering and Energy
Short code description: The numerical code DG-comp solves the Navier-Stokes equations for unsteady compressible turbulent flows. DG-comp is based on FEMilaro, an open-source finite element library. The equations are discretized in space using a Local Discontinuous Galerkin (LDG) method on tetrahedral elements. The equations are projected onto a space of polynomial functions defined in each element of the computational grid. For the numerical fluxes the classical Bassi-Rebay definition is adopted. For the time integration the fourth-order SSPRK scheme is used. Other time integration schemes are available: classical explicit Runge-Kutta schemes up to fourth order, and matrix-free exponential time integrators. A variety of sgs models for LES [8] and hybrid RANS/LES models [9] are implemented in the code.
Programming language: Fortran
Supported compilers: On HORNET: Cray Fortran 2.4.0; on FERMI: bgq-xl 1.0; on MARENOSTRUM: Intel 13.1, GNU Fortran 6.0
Parallel implementation: MPI
Accelerator support:
Libraries: Fortran 90-2008 language, parallel HDF5 I/O library
Building procedure: Makefile with possibility of parallel compiling
Web site: http://code.google.com/p/femilaro/
Licence: GPL3
Topic 1
Main objectives
Scalability tests on several Tier-0 platforms.
Accomplished work
Strong scalability tests have been performed on the Tier-0 platforms FERMI, HORNET and
MareNostrum.
Main results
The strong scalability analysis has been performed for the turbulent channel flow simulation with a subgrid anisotropic model for LES [8], using 6912 tetrahedral elements and polynomials of degree four. The wall-clock times have been evaluated for the time advancing computation, neglecting the time for initialization procedures, input and output. The scaling analysis has been carried out with a range of processor cores varying from 1024 to 16384 on FERMI, from 24 up to 1536 cores on HORNET and from 16 to 2048 cores on MareNostrum. For each platform, the speedup is normalized by the speedup obtained with the lowest number of cores. In the following figures, the scaling obtained with the numerical code on FERMI and on HORNET is compared with the ideal linear trend, while Table 9 to Table 11 report the details of the scaling tests on the three platforms.
Figure 17: Speedup on FERMI. The values are normalized by the speedup with 1024 cores.
Figure 18: Speedup on HORNET. The values are normalized by the speedup with 24 cores.
Number of cores | Wall clock time [s] | Speed-up vs the first one | Number of nodes | Number of processes
 1024 | 8.1383 |  1      |   64 |  1024
 2048 | 4.1183 |  1.9761 |  128 |  2048
 4096 | 2.0793 |  3.9139 |  256 |  4096
 8192 | 1.0675 |  7.6237 |  512 |  8192
16384 | 0.5776 | 14.0901 | 1024 | 16384
Table 9: Scaling performances on FERMI
Number of cores | Wall clock time [s] | Speed-up vs the first one | Number of nodes | Number of processes
  24 | 40.2157 |  1      |  1 |   24
  48 | 20.7937 |  1.9340 |  2 |   48
  96 | 10.2777 |  3.9129 |  4 |   96
 192 |  5.2527 |  7.6562 |  8 |  192
 384 |  2.7128 | 14.8243 | 16 |  384
 768 |  1.3760 | 29.2265 | 32 |  768
1536 |  0.6596 | 60.9711 | 64 | 1536
Table 10: Scaling performances on HORNET
Number of cores | Wall clock time [s] | Speed-up vs the first one | Number of nodes | Number of processes
  16 | 52.5055 |   1      |   1 |   16
  32 | 26.4568 |   1.9846 |   2 |   32
  64 | 13.3011 |   3.9475 |   4 |   64
 128 |  6.7066 |   7.8289 |   8 |  128
 256 |  3.3634 |  15.6109 |  16 |  256
 512 |  1.7149 |  30.6179 |  32 |  512
1024 |  0.8907 |  58.9492 |  64 | 1024
2048 |  0.4802 | 109.3471 | 128 | 2048
Table 11: Scaling performances on MareNostrum
The results confirm the very good scalability properties of the numerical code. With an LDG method, most of the computations are local to each element. This means that, although the computational cost of DG methods is typically higher than that of other formulations, DG methods lend themselves very well to parallel execution.
Topic 2
Main objectives
Optimization of the I-O strategies.
Accomplished work
Evaluation of the use of the HDF5 library to manage I/O efficiently.
Main results
During the simulations, the complete status of the computed solution is saved at several desired intermediate times. At each of these times, each process writes an output file for the associated domain partition; the file size depends on the dimension of the partition, on the grid resolution and on the amount of optional diagnostic quantities required. Every output file is compatible with Octave, which is used to perform post-processing and data analysis (on a serial computer).
During the present project, the I/O strategy has been improved and the use of the HDF5
library has been implemented.
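As an illustration of what collective parallel HDF5 output can look like (a hedged sketch with hypothetical file, dataset and size names, not the actual DG-comp implementation), each rank below writes its local portion of a distributed array into a single shared file using the MPI-IO driver and a collective transfer property:

program hdf5_collective_sketch
  ! Hedged sketch of collective parallel HDF5 output of one distributed 1-D array;
  ! file name, dataset name and sizes are illustrative, not those used by DG-comp.
  use mpi
  use hdf5
  implicit none
  integer, parameter :: nloc = 1000          ! local portion owned by each rank
  integer :: ierr, rank, nprocs
  integer(hid_t) :: plist_id, xfer_id, file_id, dset_id, filespace, memspace
  integer(hsize_t) :: dims(1), count(1), offset(1)
  real(8) :: localdata(nloc)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  localdata = real(rank, 8)

  call h5open_f(ierr)

  ! File created collectively through the MPI-IO driver
  call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, ierr)
  call h5pset_fapl_mpio_f(plist_id, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
  call h5fcreate_f('solution.h5', H5F_ACC_TRUNC_F, file_id, ierr, access_prp=plist_id)

  ! One global dataset; each rank writes its own contiguous hyperslab
  dims(1) = int(nloc, hsize_t) * int(nprocs, hsize_t)
  call h5screate_simple_f(1, dims, filespace, ierr)
  call h5dcreate_f(file_id, 'field', H5T_NATIVE_DOUBLE, filespace, dset_id, ierr)

  count(1)  = nloc
  offset(1) = int(rank, hsize_t) * int(nloc, hsize_t)
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, ierr)
  call h5screate_simple_f(1, count, memspace, ierr)

  ! Collective data transfer
  call h5pcreate_f(H5P_DATASET_XFER_F, xfer_id, ierr)
  call h5pset_dxpl_mpio_f(xfer_id, H5FD_MPIO_COLLECTIVE_F, ierr)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, localdata, count, ierr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=xfer_id)

  call h5pclose_f(xfer_id, ierr); call h5pclose_f(plist_id, ierr)
  call h5sclose_f(memspace, ierr); call h5sclose_f(filespace, ierr)
  call h5dclose_f(dset_id, ierr);  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
  call MPI_Finalize(ierr)
end program hdf5_collective_sketch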
Writing event | Without HDF5 | HDF5 collective (run 1) | HDF5 collective (run 2) | HDF5 collective (run 3) | HDF5 collective (run 4)
1 | 10.81 | 11.77 | 5.78 | 14.8 | 8.22
2 | 10.82 | 11.49 | 5.17 | 13.3 | 8.01
3 | 10.81 |  5.57 | 4.8  |  5.9 | 9.95
Table 12: I/O writing time in sec with and without HDF5.
Table 12 reports the I/O writing times (in seconds) with and without HDF5. The test has been carried out on Hornet with the -O2 compile option and 192 cores. Column 1 gives the times for three different writing events in a simulation without HDF5. Columns 2-5 give the times for three writing events in four different simulations with the same settings but with collective HDF5. We observe some scatter in the data, but on average only 80% of the time needed without HDF5 is consumed with HDF5. Furthermore, 63% of disk space is saved at each writing event (80 MB instead of 217 MB).
Topic 3
Main objectives
Code profiling in order to identify points where the optimization effort should be directed.
Accomplished work
Code profiling using SCALASCA tools on FERMI has been carried out.
Main results
A profiling analysis of the numerical code has been performed on FERMI using SCALASCA
tools. The code was compiled with xlf -O2. The analysis is limited to the main computation
cycle.
The SCALASCA analysis demonstrates that the computational effort is well distributed
between nodes. In Figure 19 a synthesis of the profiling analysis is represented.
Figure 19: Percentage of the time consumed in the main steps of the computations.
Almost all of the time is consumed in the computation of the right-hand-side terms of the equations. Within this, the integration over the element volume and the turbulence model are the most time-consuming procedures. Very little time is spent on diagnostic computations. Almost 18% of the time is spent in communications.
The computational time appears evenly distributed among the different code portions and no evident concentration of effort has been discovered.
The profiling confirms that all the point-to-point communications happen in the computation of the hyperbolic numerical fluxes and of the viscous ones, exchanged between elements through element surfaces, while the diagnostics evaluation involves collective MPI operations, such as MPI_ALLREDUCE.
The numerical method and the LES models require a massive use of matrix-vector and matrix-matrix multiplications of relatively small (order 10x10 to 100x100) full matrices. These products are currently computed using the Fortran intrinsic functions MATMUL and DOT_PRODUCT. During the project, a test substituting these functions with the corresponding BLAS routines was conducted: no relevant performance improvement was noticed, possibly due to the relatively small dimension of the matrices involved.
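For illustration, the kind of substitution that was tested is sketched below for a small dense block product; the routine is a generic example, not DG-comp source.

subroutine small_matmul_example(a, b, c, n)
  ! Sketch: computing C = A*B for a small dense n x n block, first with the
  ! Fortran intrinsic and then with the equivalent BLAS DGEMM call.
  implicit none
  integer, intent(in) :: n
  real(8), intent(in)  :: a(n,n), b(n,n)
  real(8), intent(out) :: c(n,n)

  ! Intrinsic version (as currently used in the code)
  c = matmul(a, b)

  ! Equivalent BLAS version that was benchmarked against it:  C := 1.0*A*B + 0.0*C
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
end subroutine small_matmul_example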
Topic 4
Main objectives
Improvement of the hybrid RANS/LES model.
Accomplished work
To improve the turbulence modelling, and in particular the hybrid RANS/LES model, an
analysis of the role of the blending factor has been conducted.
Main results
From a turbulence modelling point of view the project has been used to improve the
turbulence modelling, and in particular the hybrid RANS/LES model [9]. In order to perform
the analysis of geometrically complex flows, it is important to optimize the ratio between
RANS and LES computation. For this reason, the blending factor k used to combine LES and
RANS has been analysed. Several simulations with different values of k have been performed
to understand the rule of k and how the results change when increasing or decreasing the
RANS contribution. The test case chosen is the turbulent channel flow at Reτ =180.
Figure 20: Shear stress (left) and turbulent kinetic energy (right) profiles for different values of k. The dashed lines represent the modelled quantities, while the continuous lines represent the resolved ones.
In Figure 20 the mean shear stress and turbulent kinetic energy profiles for different values of k are shown; the dashed lines represent the modelled quantities and the continuous lines the resolved ones. Numerical results show that the blending factor directly affects the amount of resolved and modelled quantities. Moreover, as shown by the comparison with DNS data, it is possible to obtain good results also for low values of k, i.e. increasing the RANS contribution and therefore reducing the number of turbulent scales resolved, thus reducing the computational effort.
The project also published a white paper, which can be found online under [10].
2.11 Cut-off March 2015
2.11.1 Large Eddy Simulation of unsteady gravity currents and implications for
mixing, 2010PA2821
Code general features
Name: Les Coast
Scientific field: Engineering, Physics
Short code description: The model solves the 3D LES-filtered Navier-Stokes equations in the Boussinesq, rigid-lid approximation. It makes use of generalised coordinates and an immersed-boundary method to implement the presence of obstacles and complex geometries. It can be used to simulate hydraulic laboratory conditions or realistic coastal-scale applications with environmental forcing. The Navier-Stokes solver adopts the Kim and Moin scheme with generalised coordinates. The convective terms are solved either with a central scheme or the QUICK algorithm. The pressure solver uses SOR or SOR+multigrid. The time scheme can be explicit or semi-implicit (fractional step method). The Large Eddy Simulation subgrid model can be a static Smagorinsky, dynamic Smagorinsky (Germano 1992) or a Lagrangian dynamic subgrid-scale model (Meneveau 1996).
Programming language: Fortran 90 & Fortran 2003
Supported compilers: Intel 16.0.0 (tested), gfortran 4.9 and others that support Fortran 2003 & assumed-rank arrays
Parallel implementation: MPI
Accelerator support: No
Libraries: MPI-3.0 compatible library, Gabriel 1.1
Building procedure: Makefile
Web site:
Licence:
Topic 1
Main objectives
Refactoring of the code in order to increase the efficiency of the MPI communications.
Accomplished work
All code has been upgraded to standard F90.
All MPI communications have been improved by using MPI 3.0 derived types and neighbour collectives. The MPI communication in the original code was effectively serialized by a series of MPI_Gather operations and MPI_Ssend for nearest-neighbour exchanges. This had to be changed throughout the code (see the sketch below):
• replace MPI_Ssend+MPI_Recv with MPI_Sendrecv or MPI_Neighbor_alltoallw;
• replace a loop of MPI_Gather+MPI_Barrier with one MPI_Alltoall.
Matrix definitions and memory allocations have been organized in modules. MPI variables and functions have been organized in separate modules.
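As a hedged illustration of the first replacement listed above (array layout and neighbour handling are hypothetical, not taken from the Les Coast sources), a nearest-neighbour halo exchange written with MPI_Sendrecv looks like this:

subroutine halo_exchange(field, n, left, right, comm)
  ! Sketch: halo exchange with a single MPI_Sendrecv per direction instead of
  ! paired MPI_Ssend/MPI_Recv calls; left/right may be MPI_PROC_NULL at the edges.
  use mpi
  implicit none
  integer, intent(in) :: n, left, right, comm
  real(8), intent(inout) :: field(0:n+1)        ! interior 1..n plus two ghost cells
  integer :: ierr, status(MPI_STATUS_SIZE)

  ! Send the last interior cell to the right neighbour, receive the left ghost cell
  call MPI_Sendrecv(field(n), 1, MPI_DOUBLE_PRECISION, right, 0, &
                    field(0), 1, MPI_DOUBLE_PRECISION, left,  0, &
                    comm, status, ierr)

  ! Send the first interior cell to the left neighbour, receive the right ghost cell
  call MPI_Sendrecv(field(1),   1, MPI_DOUBLE_PRECISION, left,  1, &
                    field(n+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                    comm, status, ierr)
end subroutine halo_exchange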
Main results
The first benchmark is a gravity current in a long and narrow channel. The current is
generated by an ideal lock-exchange technique. At the beginning of the simulation, the dense
fluid is positioned in a small volume on the left side of the tank. The light fluid occupies most
of the volume on the right side of the tank and the upper part of the left side. At the beginning
of the simulation, the dense fluid collapses and moves beneath the light fluid while the light
fluid moves in the opposite direction. The grid is uniform on the xz plane, non-uniform in the
vertical direction (y). Periodic boundary conditions are imposed in the transverse direction
(x). Due to the geometry of the problem, the test is ideally suited for the 1-dimensional
decomposition in the original code along the length of the channel.
The original code runs on 16, 32 and 64 cores. The multi-grid solver limits the number of
processes to a maximum of 64 for this particular benchmark.
Figure 21: Speedup vs. cores for test 1 (gravity current in a channel), comparing the original and the optimized code.
Figure 21 shows that the code can now scale to the maximum number of cores (due to the
limitations of the 1-D decomposition and the multi-grid solver). The optimized code is the
starting point of topic 2, to change the decomposition for scaling beyond 64 cores.
Figure 22: Results for the channel test: propagation of the gravity current (density field)
Topic 2
Main objectives
Increase the level of parallelization of the code.
Accomplished work
The code has been parallelized to a 2D decomposition, the multigrid solvers sor/slor have
been changed introducing red/black schemes in the Cartesian and curvilinear coordinates.
The MPI communications for the 2D parallelization have been organized in a separate library
(Gabriel 1.1). The implicit three-diagonal solvers have been optimized.
Main results
Another benchmark in a square basin was selected to address the decomposition of the code.
In this test, a gravity current is generated in a tank where a rigid wall separates two volumes
of fluid at different density. The lock-exchange is realized by means of an opening in the
central part of the separating wall. The aperture has a length, which is 1/8 of the length of the
wall.
The boundary conditions on velocity at the separating wall are realized by using an immersed boundary technique. The grid is non-uniform and quasi-Cartesian. No periodic boundary condition is imposed. In this problem there is no preferential dimension along which to decompose the grid in 1-D, and the use of a dynamic algorithm for estimating the subgrid stresses makes the memory allocation requirements a severe test.
The scalability of the optimized code for this benchmark is limited to 32 cores by the
multigrid solver and the 1-D decomposition.
The change of the decomposition in the multigrid solver has pushed the limit to 256 cores for
this benchmark. Unfortunately, the results are incorrect, which is still under investigation.
However, it is expected that the code can be fixed without a significant impact on the
scalability.
Figure 23: Benchmark for the 3D gravity current simulation.
Figure 24: Speedup vs. cores for the 3D gravity current test with the 2-D decomposition.
Production runs are planned to use a higher resolution with an upper limit of 2048 cores. This
brings the code within reach of PRACE Tier-0 machines. It is expected that the code will be
ready before the next Tier-0 call.
Project results will be published in a separate white paper in the context of Work Package 7.
3 T7.1.B SHAPE
In this section, the progress in Task 7.1.B, SHAPE, is discussed. Summary reports on the work in the second SHAPE call projects are provided. The recently closed third call is also reported on, and finally the future of SHAPE is discussed.
3.1 SHAPE Second call: Applications and Review Process
The review panel was composed as follows:
• Two representatives from the SHAPE programme organisers;
• Two representatives from the PRACE Board of Directors;
• Two representatives from the Industry Advisory Committee.
The two main criteria considered for the review of the applications were:
• Strength of the business case - The expertise and resources provided via SHAPE are expected to produce a significant Return on Investment for the company. In the mid-term, the SME should be able to build on the results to, for instance, increase its market share, renew its investment or recruit dedicated staff. It would also be expected that the business plan for the project would lead the SME to further engage in HPC in the long term.
• Technical Adequacy - The applications are expected to fit the timeframe and resources available in the project. The project activity must only lean on expertise already available within PRACE partners.
Other aspects that were considered:
• The commitment of the SMEs to co-invest with PRACE in achieving the project goals. The effort should at least be equally split between the company and PRACE;
• The innovative aspect of the proposed solution;
• The social and economic impact on society as a whole.
The applications were reviewed and ranked according to these criteria, and then the
recommendations put forward to the board. The board approved 11 projects to go ahead from
this call. A twelfth project was deemed not suitable for SHAPE, but was instead encouraged
to engage with PRACE via the preparatory access calls.
Following the Board’s approval of the recommendations, the successful SMEs were matched
with PRACE partners, as shown in Table 13 below:
Company | Country | Project Title | Partner
ALGO'TECH INFORMATIQUE | France | High performance to simulate electromagnetic disruption effects in embedded wiring | INRIA
CybeleTech | France | Numerical simulations for plant breeding optimization | CINES
Design Methods | Italy | Coupled sail and appendage design method for multihull based on numerical optimization | CINECA
Ergolines s.r.l. | Italy | HPC-based Design of a Novel Electromagnetic Stirrer for Steel Segment Casting | ITU
Hydros Innovation | Switzerland | Automatic Optimal Hull Design by Means of VPP Applications on HPC Platforms | CINECA
Ingenieurbüro Tobias Loose | Germany | HPCWelding | HLRS
Open Ocean | France | High Performance Processing Chain - HPPC | IDRIS
Principia | France | HPC for Hydrodynamics database creation | PSNC/CINES
Optimad | Italy | RAPHI | CINECA/INRIA
VORTEX BLADELESS SL | Spain | VORTEX | BSC
WB-Sails Ltd Oy | Finland | Simulation of sails and sailboat performance | CSC
Table 13: SHAPE applications to the second call
With regards to Principia, PSNC were originally approached to partner them, but the SME
wished to work with a local centre and CINES were able to fulfil this.
3.2 SHAPE Second call: Status
The approved projects were informed of their success in April 2015 and encouraged to start
soon after. With a few exceptions, the projects were underway by the end of May. As such,
the projects are somewhat unsynchronised – this is not entirely unexpected given the very
diverse range of projects being undertaken, but there are other factors involved that should be
taken into consideration for future calls and are discussed in more detail below.
Each collaboration in SHAPE is expected to produce a white paper for publication on the
PRACE website at the conclusion of the technical work. In addition, every project was
required to provide a brief summary of their work for this deliverable (see section 3.3 below).
The status (as of March 2016) is as follows:
• Ergolines s.r.l. (ITU) – project complete, white paper ready for review April 6th 2016;
• Cybeletech (CINES) – project completed, white paper ready for review April 6th 2016;
• OPTIMAD Engineering (CINECA) – project completed, white paper ready for review April 6th 2016;
• Open Ocean (IDRIS) – project completed, white paper ready for review April 6th 2016;
• Hydros Innovation (CINECA) – project completed, white paper ready for review April 6th 2016;
• Vortex Bladeless SL (BSC) – project approaching completion, white paper ready for review April 6th 2016;
• Design Methods (CINECA) – project approaching completion, white paper ready for review April 6th 2016;
• Ingenieurbüro Tobias Loose (HLRS) – there have been various technical challenges but HLRS is working closely with third party software developers to overcome these, and the technical work is expected to finish by June. More details are included in the summary from HLRS in section 3.3.8;
• WB-Sails (CSC) – CSC has had various difficulties obtaining machine time, mainly due to local restrictions on commercial usage of their platforms. These have now been overcome and technical work has begun, but it is expected that it will not be concluded until summer;
• Principia (CINES) – Due to issues mainly related to security concerns of the SME, there have been delays to starting this work. As such, little work has taken place yet but it is expected to start imminently. More details are given in section 3.3.10 below;
• Algo’tech (INRIA) – Similarly to Principia, there have been various delays to the technical work starting, but work is now underway.
For the four projects which are unable to provide white papers for the current (April 6th 2016)
round of deliverable reviews, it is expected that their white papers will be reviewed at a later
review round, possibly alongside the third call projects.
3.3 SHAPE second call: Project summaries
This section provides summaries of the eleven projects from the second call of the SHAPE
programme. For each project there is a brief overview describing the problem to be solved,
the activity undertaken, how PRACE was involved, the benefits to the SMEs, and finally the
lessons learned for the further development of the SHAPE programme itself. The lessons
learned are also discussed further in Section 3.4.
Note that each pilot project is also producing a technical white paper that will cover the
activities and results of the projects in greater detail than presented here. The intention of this
section is to give a flavour of the broad range of projects and the diversity of the subject areas,
along with summarising the benefits of the SHAPE programme to the SMEs.
3.3.1 Ergolines: HPC-based Design of a Novel Electromagnetic Stirrer for Steel
Casting
Overview
Project partners:
• Company: Ergolines s.r.l.
  o Isabella Mazza, Ergolines s.r.l., Physicist, isabella.mazza@ergolines.it
  o Cristiano Persi, Ergolines s.r.l., Mechanical Engineer, cristiano.persi@ergolines.it
  o Andrea Santoro, Ergolines s.r.l., Mechanical Engineer, andrea.santoro@ergolines.it
• Istanbul Technical University
  o Ahmet Duran, Istanbul Technical University (ITU), Mathematical Engineering, aduran@itu.edu.tr
  o Yakup Hundur, Istanbul Technical University, Physical Engineering, hundur@itu.edu.tr
  o Mehmet Tuncel, Istanbul Technical University, Mathematical Engineering, Computational Science and Engineering.
• SHAPE contacts: isabella.mazza@ergolines.it, aduran@itu.edu.tr, hundur@itu.edu.tr
As a general consideration, in order to simulate the effects of electromagnetic stirring on
liquid steel, a dedicated customization of Ergolines’ current OpenFOAM code has been
implemented so as to couple Electromagnetism with Fluid Dynamics. Due to the complexity
of the multi-physical system under study, very fine discretization in terms of time and
geometry is required. The use of HPC and the potential to take advantage of specialized
expertise is key to meet this emerging industrial challenge. In order to better assess how
parallelisation improves computational performance, the simulations have been carried out
considering an increasing number of cores.
Activity performed
The project activities have been organised into four different phases:
• Porting: deploy and run the code on CINECA Fermi;
• Profiling: quantification of the computational time spent in each building block of the code;
• Conducting initial simulations and parameter optimization: EMS design has been following an iterative, multiple-simulation process including: 1) analysis of the geometrical constraints, 2) calculation of the EM performance, 3) fluid dynamic simulation, 4) parameter optimization, 5) iteration of steps 2 to 4 until the required EM performance is achieved;
• Benchmarking: performance analysis for the current version and updated versions of the code via extensive simulations.
Overview of the results
The fluid dynamics of liquid steel in an electric arc furnace under the effect of
electromagnetic stirring has been studied by means of HPC-based numerical simulations. The
geometry, mesh and fluid dynamics of the system under study are represented in Figure 25.
The velocity field generated by the EMS, which is located under the EAF, is also shown. We
performed the scalability tests and the code has shown nearly linear speed-up up to 512 cores.
Afterwards, speed-up saturation takes place if more than 512 cores are used as seen in Figure
26.
Figure 25: Top views of (a) the EAF geometry and mesh, and (b) the fluid-dynamic simulation with the velocity field displayed as flux lines.
Figure 26: Speed-up as a function of the number of cores, normalized to 20 cores, for the simple and hierarchical decomposition methods, compared with linear speed-up.
PRACE cooperation
The project partners have prepared a detailed workplan to realize the HPC-based project. The
project partners at ITU were awarded access to IBM-FERMI at CINECA through their
Project 2010PA3012 “Parameter Optimization and Evaluating OpenFOAM Simulations for
Magnetohydrodynamics” under the 21st Call for PRACE Preparatory Access call Type B.
The project partners at ITU have prepared sequential job submit scripts and parallel job
submit scripts to compile and run OpenFOAM with mathematical operators such as
turbulence models and various mesh operators and the solver, and also to execute other
related programs on IBM-FERMI at CINECA. They have provided the job submit scripts to
Ergolines. They provided guidance for performance and scalability of the codes on HPC
systems. The SHAPE contacts at ITU attended the PRACE F2F and telco meetings, and
communicated with WP7.1 task leader. The project partners have prepared a white paper.
Benefits for SME
The SME appreciates how the use of HPC has been crucial to carrying out the fluid-dynamic simulations by drastically reducing the computational times. Performing the simulations in-house, on Ergolines’ workstations with 8 cores, required about 15 hours, while running the same calculations on CINECA Fermi took only about 20 minutes using a hierarchical method with 1024 cores. This dramatic advantage made it possible to carry out an extensive analysis of the fluid dynamics of the liquid steel in the furnace under the influence of electromagnetic stirring, providing key information for EMS design and industrialization.
Lessons learned
PRACE and the project partners at Ergolines s.r.l. and ITU enjoyed an excellent collaboration
and completed the project successfully. All parties hope to have the chance to collaborate
again in the future.
3.3.2 Cybeletech: Numerical simulations for plant breeding optimization
Overview
Breeding a new plant variety is a long process that requires a decade and thousands of
experimental trials in the field so as to select the most robust and efficient traits. In order to
help seed companies to reduce the duration and development cost of a new variety, this work
proposes to simulate the growth of the tested genotypes instead of running experiments in the
field. For this purpose, HPC technologies are then critical. In the first step, the plant growth
model used in numerical simulations must be calibrated with plant phenotypes data. The
project aims to define the optimal experimental protocol to be followed for calibrating the
model, i.e. to answer three questions: What observables to measure? In which quantity? In
which environments? To address these issues, computer simulations are run to compare the
precision derived on the model parameters as a function of the data used in input.
Optimization techniques are then used to identify the best protocol offering a balance between
quality of the final result and experimental costs.
Activity performed
The partners installed the code and all of its third-party libraries, and then defined an input dataset to be used as a benchmark, so that the initial performance and correctness could be validated.
Then the usage of the random number generator in the source code was modified to ensure repeatable results and timings.
Intel VTune was used to profile the code and identify which lines led to excessive time consumption. The performance was then improved for those lines of code concerning C++ object memory management and mathematical functions. Finally, the parallelism was improved by adding a master-slave approach to distribute the work among hundreds of MPI ranks.
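For illustration only, the following minimal Python/mpi4py sketch shows the general master-slave pattern described above, together with a fixed per-task seed of the kind used to make results repeatable. It is not the Cybeletech code (which is written in C++), and the function and parameter names are hypothetical placeholders.

# Run with e.g.: mpirun -np 8 python master_slave.py
from mpi4py import MPI
import random

def run_growth_simulation(task):
    task_id, params = task
    rng = random.Random(task_id)   # fixed per-task seed: repeatable results and timings
    return task_id, sum(p * rng.uniform(0.9, 1.1) for p in params)

def master(comm, tasks):
    # Assumes at least as many tasks as workers.
    status = MPI.Status()
    n_workers = comm.Get_size() - 1
    results, next_task = [], 0
    for w in range(1, n_workers + 1):          # prime every worker with one task
        comm.send(tasks[next_task], dest=w, tag=1)
        next_task += 1
    for _ in range(len(tasks)):                # collect results, hand out remaining work
        res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results.append(res)
        src = status.Get_source()
        if next_task < len(tasks):
            comm.send(tasks[next_task], dest=src, tag=1)
            next_task += 1
        else:
            comm.send(None, dest=src, tag=0)   # no work left: stop this worker
    return results

def worker(comm):
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG)
        if task is None:                       # stop signal from the master
            break
        comm.send(run_growth_simulation(task), dest=0, tag=2)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        tasks = [(i, [1.0, 2.0, 3.0]) for i in range(1000)]  # hypothetical parameter sets
        print(len(master(comm, tasks)), "simulations completed")
    else:
        worker(comm)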
PRACE cooperation
PRACE was involved in providing access to the Curie machine at TGCC (the CEA computing centre).
Benefits for SME
The SME Cybeletech was able to run all the planned simulations and made use of the whole 400k hours allocated.
Lessons learned
• What worked well was holding a face-to-face meeting with the engineer from Cybeletech over a period of two days. It greatly sped up the understanding of the code and the implementation of optimisations.
• A problem was a delay due to security requirements not being met in order to access Curie. Indeed, the SME accesses the internet through an ADSL box.
• Porting from one Linux environment to another is sometimes not straightforward for C++ codes.
3.3.3 RAPHI: rarefied flow simulations on Intel Xeon Phi
Overview
Within this project, OPTIMAD Engineering srl wanted to explore the possibility of porting
the KOPPA (Kinetic Octree Parallel PolyAtomic) numerical code onto the Intel Xeon Phi
architecture using the CINECA GALILEO cluster. KOPPA is used for rarefied gas
simulations and demands expensive computation when compared with other CFD or CAE
applications. By porting the code onto the Xeon Phi architecture, the goal is to reduce the cost
of simulations and thus raise the interest of using this code for industrial applications.
Activity performed
To investigate the performance, profiler tools such as VTune and Scalasca were used on the initial version of the code on the GALILEO cluster. A simple test case was chosen and the behaviour of the code was observed, including while increasing the computational load with different input parameters. In order to optimize vectorization and parallel scalability, some parts of the code were then refactored. The computational time requirements were decreased by almost a factor of eight, and good scalability was obtained up to 64 cores, compared with only 16 cores for the initial version of the code.
PRACE cooperation
The cooperation has involved two engineers from OPTIMAD, Marco Cisternino and Haysam
Talib, one research engineer from INRIA Bordeaux Sud-Ouest, Florian Bernard and an expert
from CINECA, Vittorio Ruggiero. For the computations, an account has been created and
5,000 core hours have been allocated on the GALILEO machine at CINECA.
Benefits for SME
This project gave the SME a better understanding of the behaviour of MPI code on the Xeon Phi architecture. Pure task parallelism is probably not the right approach for obtaining the best performance on this type of architecture. A hybrid parallelization (such as MPI+OpenMP) might be more suitable and is going to be investigated. Moreover, the vectorization effects (and the optimisation from a more general point of view) are an important aspect and still have to be improved. Memory requirements also seem to be a bottleneck, since the code needs to handle a large amount of data, causing problems in memory access. The code, as it is, is not yet ready to achieve good performance on the Many Integrated Core (MIC) architecture, since the scalability is still too poor and does not exploit any advantages of MIC with respect to CPUs.
Lessons learned
The access to GALILEO worked well for the users and allowed small tests (compilation or very small cases) before running on more nodes. However, access to the actual resources in order to study code performance and scalability was cumbersome due to the large number of jobs in the queues. This issue meant that scalability was tested only on a very restricted number of nodes in order to avoid long waiting times.
There was good and important communication between all the partners of the project, resulting in nice improvements of the code and a better understanding of the architecture.
From an industrial point of view, the project permitted OPTIMAD to get a hands-on feel for
the MIC architecture and, through the support of the computing centre, gain insight on the
code and its suitability for this architecture.
The project raised issues regarding the performance of the application and gave indications on
which improvements need to be introduced beforehand in order to increase the potential of the
application for the MIC architecture. This type of information subsequently enables
OPTIMAD to program in a more efficient and rational way for the transition to heterogeneous
architectures, which is considered a strategic development goal within the company.
3.3.4 Open Ocean: High Performance Processing Chain - faster on‐line statistics
calculation
Overview
Open Ocean is a French SME that develops innovative on-line solutions to help plan and
manage offshore developments. They conceived an oceanographic data study tool which
computes and formats data (Pre-Processing and Processing) and which provides relevant
oceanographic information to industrial marine companies (Post-Processing) through a web
interface. However, the “time-to-solution” of this post-processing step is too long and hence
not compatible with industrial use. Therefore, the goal of this SHAPE project was to improve the post-processing by optimising one of Open Ocean's parallelized Python programs, which processes and computes statistics (e.g. wind speed) on big datasets. To carry this out, engineers from Open Ocean and IDRIS (the CNRS computing centre) worked together to optimise this program using a high-performance parallel machine and a parallel file system (GPFS, 100 GB/s bandwidth).
The post-processing code was ported onto the Tier-1 Ada machine (an IBM cluster of Intel E5-4650 processors with 332 compute nodes) at IDRIS to analyse and optimise its performance.
Details about the project and the activity performed can be found in the upcoming white paper
“Shape Project Open Ocean: High Performance Processing Chain - faster on‐line statistics
calculation” which will be available via the PRACE website by summer 2016.
Activity performed
The work performed concerned the porting of the post-processing step onto the Ada machine and the analysis of its computational performance. The code was already partially optimised, as it had been parallelized using specific software (ProActive Parallel Suite), but to enable the porting of the post-processing code onto any machine it was necessary to dispense with that software. After this task was done, it was possible to compare the performance of the computations on the Open Ocean and IDRIS machines. A profiling of a realistic post-processing case was performed to help the Open Ocean team to better understand the behaviour of their code.
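As an illustration of how such a profiling run can be set up with standard Python tooling, the sketch below uses the cProfile and pstats modules from the standard library. The data-loading and statistics functions are hypothetical placeholders and do not correspond to Open Ocean's actual routines.

# Minimal sketch of profiling a Python post-processing step with cProfile/pstats.
# load_chunk() and wind_speed_statistics() are hypothetical placeholders.
import cProfile
import pstats

def load_chunk(path):
    # Placeholder: in practice this would read a slice of the oceanographic dataset.
    return [float(i) for i in range(100000)]

def wind_speed_statistics(values):
    # Placeholder statistic: mean wind speed over the chunk.
    return sum(values) / len(values)

def post_process(paths):
    return [wind_speed_statistics(load_chunk(p)) for p in paths]

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    post_process(["chunk_%03d.nc" % i for i in range(10)])   # hypothetical file names
    profiler.disable()
    profiler.dump_stats("postproc.prof")
    # Show the 15 most expensive calls by cumulative time, e.g. to see whether
    # data fetching (I/O) or the statistics themselves dominate.
    pstats.Stats("postproc.prof").sort_stats("cumulative").print_stats(15)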
PRACE cooperation
The PRACE cooperation has involved engineers from Open Ocean SME and from IDRIS:
Youen Kervella and Yves Moisan (Open Ocean), Lola Falletti and Sylvie Therond (IDRIS).
Data have been transferred to the Ada machine at IDRIS and L. Falletti and S. Therond did
the tests.
Some meetings were held (in person or by phone) to give updates of project progress. Most
communication was done through emails during the entirety of the project.
Benefits for SME
The PRACE cooperation gave Open Ocean the opportunity to port their codes onto a high performance computing system, thus familiarizing them with the standards of this field. The in-depth knowledge of the IDRIS engineers also gave Open Ocean a new perspective on both their hardware and their file transfer solution. This study allowed Open Ocean to identify the main bottleneck of its post-processing program (i.e. fetching data from their dataset) and to reconsider their hardware choices. In addition, by porting the post-processing code to the IDRIS infrastructure, this PRACE project gave Open Ocean the opportunity to assess other job schedulers such as SLURM or LoadLeveler, which greatly increases the portability and the efficiency of their software solution.
Lessons learned
The cooperation between Open Ocean and the IDRIS centre worked very well. The
communication between the two teams was facilitated by the fact that they both spoke French
and they were not located too far away. It was then easier to exchange mails and to organise
meetings to increase the efficiency of the project work.
The work that was done was different from what was first requested by Open Ocean.
However, as both sides were reactive, the work plan was adjusted accordingly.
The engineers of Open Ocean had access to the Ada machine but in the end they did not need to use it: instead, their data were transferred onto the machine and the work was done by IDRIS engineers. All of the codes were open source, which facilitated the work.
More tests could be done, especially concerning the optimisation of the post-processing step.
However, the work that was performed provides the company with a good base to improve
their post-processing workflow.
3.3.5 Hydros Innovation: Automatic Optimal Hull Design by Means of VPP
Applications on HPC Platforms
Overview
Hydros is a Swiss engineering and research company founded in 2007, with several patented designs in the field of marine and sailing yachting. The main scope of the project was to evaluate the feasibility of automatic optimal hull design on an HPC infrastructure and the impact of such a workflow on the day-to-day work of Hydros personnel. CINECA is the PRACE center that supported the SME.
Work Performed
The project was subdivided into:
• Validation of a 2 Degrees of Freedom (DoF) Computational Fluid Dynamics (CFD) analysis of an industrial hull design provided by Hydros, using the open-source code OpenFOAM;
• Scalability tests and comparison of the commercial code CAESES currently used by Hydros and the open-source code OpenFOAM for hull 2DoF modelling;
• Coupling of the CFD result into an existing CAD modification and optimization loop;
• Submission of a complete optimization loop for an industrial hull design using open-source code on the HPC platform, and evaluation of the usability of the solution provided.
PRACE cooperation
The cooperation has involved one engineer from Hydros, Alaric Lukowski, and two experts
from CINECA, Raffaele Ponzini and Ivan Spisso. For the computations, accounts have been
opened on Tier-1 CINECA cluster GALILEO and 60,000 core hours have been allocated.
Benefits for SME
The Project allowed the SME to analyse the feasibility of moving from a workflow based on workstations running commercial CFD codes to a new one involving HPC resources with open source codes. The potential benefits of the outcome are obvious: an appreciable reduction of costs due to license expenses, and a reduction in time-to-market thanks to the reduction of simulation time made possible by exploiting the parallel efficiency of the CFD codes on an HPC cluster.
Lessons learned
The commercial code results and the open source ones ultimately matched to within negligible differences. This very important result reassured the SME about the possibility of using open source codes in production.
However, while the results from CAESES were obtained essentially out-of-the-box, OpenFOAM could match them only after a long analysis and tweaking of the simulation parameters. This suggests that the open source code requires a long learning curve and skilled engineers, which partly offsets the benefit of reduced licensing costs.
From the PRACE centre point of view, this project suggests that there is a strong need for
PRACE as an innovation catalyst for SMEs, providing specific competences and deep
expertise on CAE open source codes.
3.3.6 Vortex: Numerical Simulation for Vortex-Bladeless
Overview
Vortex-Bladeless is a Spanish SME whose objective is to develop a new concept of wind
turbine without blades called Vortex, or vorticity wind turbine. This design aims to eliminate
or reduce many of the existing problems in conventional wind energy.
This device represents a new paradigm in wind energy. Due to the significant difference in the project concept, its scope is different from that of conventional wind turbines. It is particularly suitable for offshore configurations, and it can be exploited in wind farms and in environments usually closed to conventional turbines due to the presence of high-intensity winds.
Given its morphological simplicity, and considering that it is composed of a single structural component, its manufacturing, transport, storage and installation have clear advantages. The new wind turbine design has no bearings, gears, et cetera, so the maintenance requirements could be drastically reduced and its lifespan is expected to be longer than that of traditional wind turbines.
The Barcelona Supercomputing Center (BSC) is in charge of the simulations of the wind
energy generation device. The Alya code, developed at BSC, is used to perform the Fluid-Structure Interaction (FSI) problem simulation for a scaled model of the real device.
Activity performed
The FSI problem posed by the interaction of the wind energy generator and the wind current
in which it is embedded is solved using the Alya code. A comparison between the
experimental results of a laboratory scaled device and the numerical simulation was
performed. The first objective was to show that the code has the capacity of performing the
simulation.
In order to be able to do the simulation, the multi-code coupling ability of the Alya code was
used, tuned and refined. Different algebraic solvers, mesh kinds and coupling algorithms were
tested.
The comparison between the numerical and the experimental results is satisfactory and allows the set-up of a full-scale device simulation to proceed.
PRACE cooperation
PRACE provided the expert support to adapt the code for this application and the machine
time needed to perform the simulations.
Benefits for SME
The results of this SHAPE project are providing guidance and support to the company in the
development of its wind energy device. Once the full laboratory results are properly reproduced, full-scale simulations are expected to be performed. Also, BSC and Vortex-Bladeless
are looking forward to cooperating again in the framework of European projects or
investigating other collaboration possibilities.
Lessons learned
The experience in this collaboration in the SHAPE project framework has been satisfactory
and encouraging. The main difficulties faced were the full understanding of the PRACE
SHAPE project procedures and the communication of the advances and results of the work done by the BSC researchers. The best way found to cope with this difficulty was to hold teleconferences and to write periodic reports in language that is not deeply technical (in computer science terms), so that the status and results of the work are clear to everyone.
One of the most confusing points was the application procedure for SHAPE projects. Given that two different applications for PRACE resources had to be made (the first one to get the approval of the project and the second one to get the actual access to the computing time), it was thought that the resources of the first application were not being used properly. This issue has already been discussed in the face-to-face meeting, and it is foreseen that the next calls will include computational resources starting from the first application. The SHAPE application form could probably contain a section on the computational resources needed, in case the SME has some experience or idea of what will be needed.
3.3.7 DesignMethods: Coupled sail and appendage design method for multihull
based on numerical optimization
Overview
Design Methods is an engineering firm with fifteen years of experience in the aerospace field.
The mission of Design Methods is to provide multidisciplinary engineering consulting and
design services to industries and design teams supporting them with highly specialized
competences on aerodynamic design, CAE analysis, software development, CAD modelling,
numerical optimization environment and customized design tools development. They operate
in aerospace, automotive and marine fields. The main scope of the project was to evaluate the
feasibility of a numerical optimization workflow for sailing boats sail plans and appendages.
CINECA is the PRACE center that supported the SME.
Activity performed
The numerical optimization workflow integrates a parametric sail geometry module, an automatic mesh generator and a Velocity Prediction Program (VPP) based on both CFD computations and analytical models. The VMG (Velocity Made Good) is evaluated by solving the 6 Degrees of Freedom (DoF) equilibrium system, iterating between the VPP and the sail CFD analyses. The hull forces are modelled by empirical formulations tuned against a matrix of multiphase CFD solutions on the demihull. The aerodynamic polars of the appendages are estimated by applying preliminary design criteria from the aerospace literature.
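For illustration, the following Python sketch outlines the shape of such a coupling loop, reduced to two state variables for brevity (the real problem is a 6-DoF equilibrium) and using purely synthetic force and equilibrium models; it is not DesignMethods' actual workflow, and all function names and constants are illustrative assumptions.

# Illustrative sketch (not DesignMethods' implementation) of the coupling loop:
# the VPP solves the equilibrium for a boat state given aerodynamic forces, the
# sail CFD analysis returns updated forces for that state, and the iteration
# continues until the Velocity Made Good (VMG) converges.
def sail_cfd_forces(boat_state):
    # Placeholder for a (costly) CFD run on the current sail trim/attitude.
    speed, heel = boat_state
    return {"drive": 2000.0 - 50.0 * speed, "side": 300.0 * heel}

def vpp_equilibrium(forces):
    # Placeholder VPP step: balance drive force against hull resistance and
    # side force against righting moment, returning the new state and VMG.
    speed = forces["drive"] / 120.0
    heel = forces["side"] / 1500.0
    vmg = speed * 0.7          # crude projection onto the upwind direction
    return (speed, heel), vmg

def solve_vmg(initial_state, tol=1e-3, max_iter=50):
    state, vmg_old = initial_state, 0.0
    for it in range(max_iter):
        forces = sail_cfd_forces(state)       # CFD analysis for the current state
        state, vmg = vpp_equilibrium(forces)  # equilibrium solve for the new state
        if abs(vmg - vmg_old) < tol:          # converged VMG: stop iterating
            return state, vmg, it + 1
        vmg_old = vmg
    raise RuntimeError("VPP/CFD coupling did not converge")

print(solve_vmg((5.0, 0.1)))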
A significant part of the tool is already available to DesignMethods at a mature development
stage but is implemented using very expensive (especially for an SME) commercial software.
The overall goals were thus:
• to investigate the possibility of replacing commercial codes with open source software, through a benchmarking activity aimed at selecting suitable candidate codes and at highlighting their balance between performance, accuracy, robustness and HPC environment compatibility;
• to demonstrate the capability to efficiently take on computationally costly problems within an HPC environment.
PRACE cooperation
The cooperation has involved one engineer from DesignMethods, Ubaldo Cella, and two
experts from CINECA, Raffaele Ponzini and Francesco Salvadore. For the computations,
accounts have been opened on Tier-1 CINECA cluster GALILEO and 5,000 core hours have
been allocated.
Benefits for SME
The Project allowed the SME to analyse the feasibility of moving from a workflow based on commercial CFD codes to a new one involving HPC resources with open source codes. The potential benefits of the outcome are obvious: an appreciable reduction of costs due to license expenses. A demonstration of the market value of the new workflow to the SME was obtained through its application to a real industrial case, the design of an A-Class catamaran sail.
Lessons learned
The results showed that the open source based workflow was perfectly appropriate for reaching the goals, and therefore demonstrated its feasibility.
From the point of view of the PRACE centre the outcome was not completely satisfying. The collaboration with the SME was particularly troubled, due firstly to the lack of effort and feedback from the SME in the first few months of the activity, and later to requests for value-added contributions that, in CINECA's view, were outside the scope of the Project and the SHAPE Programme itself.
CINECA therefore recommends, for the following SHAPE calls, the creation of a PRACE statement of Terms and Conditions covering the scope of the SHAPE projects and the extent of the support that PRACE centres are allowed to provide to SMEs, and that selected SMEs sign these Terms and Conditions prior to starting the technical phase of the project.
3.3.8 Ingenieurbüro Tobias Loose: HPCWelding: parallelized welding analysis with
LS-DYNA
Overview
Partners:
• Ingenieurbüro Tobias Loose: Tobias Loose;
• Höchstleistungsrechenzentrum Stuttgart (HLRS): Jörg Hertzer, Bärbel Große-Wöhrmann;
• DYNAmore GmbH: Uli Göhner.
Ingenieurbüro Tobias Loose is an engineering office specializing in simulations for welding
and heat treatment. Loose develops preprocessors (e.g. DynaWeld) and provides consulting
and training for industrial customers. In addition, Loose is involved in its own research
projects concerning welding and heat treatment simulations. In this project, the scaling
behaviour of the welding application of the commercial FE-code LS-DYNA has been tested
on the CRAY XC40 “HazelHen” at HLRS.
Activity performed
A variety of test cases relevant for industrial applications have been set up and run on
different numbers of compute cores (strong scaling tests). The results show that the implicit thermal and mechanical solvers scale only up to 48 cores, depending on the particular test case, due to an unbalanced workload. The explicit mechanical solver was tested up to 4080 cores with
significant scaling. It was the first time that a welding simulation was performed on 4080
cores with the LS-DYNA explicit solver. The details will be presented in the project's white
paper.
PRACE cooperation
HLRS granted access to the CRAY XC40 “HazelHen” in the frame of a PRACE Preparatory
Access project. The staff at HLRS coached Loose and supported him in preparing and
running the LS-DYNA cases in the HPC working environment.
Benefits for SME
During the SHAPE project Loose gained significant knowledge and experience in HPC. The project clarified how HPC can be used for welding analysis consulting: in detail, which welding processes, welding tasks, modelling methods and analysis types are applicable on HPC, and with which effort.
With the help of this SHAPE project the overall effort for welding analysis on HPC is now much better known, enabling more precise cost estimation for welding consulting. This is a competitiveness improvement for Loose.
The project's results show that the parallelized version of the LS-DYNA implicit solver, as applied to welding analysis, is not satisfactory and requires a revision, possibly in a follow-up project. The intention is to obtain a scaling behaviour similar to that of other LS-DYNA applications.
Lessons learned
• LS-DYNA is an appropriate code, but its implicit solver's scaling behaviour is disappointing in view of the good scaling behaviour of the LS-DYNA explicit solvers.
• The very small time steps needed for explicit analysis prohibit its application to welding tasks with long duration, e.g. up to 30000 s of process time (see the illustrative estimate after this list). It can now be distinguished which welding tasks allow explicit welding analysis and which do not.
• With the project's work the scaling behaviour is now known with respect to model size, modelling technique (contact/no contact, solid/shell, transient/metatransient) and analysis type (thermal/mechanical, implicit/explicit).
• The main problem was that Loose, being involved in several consulting and research projects, has a tight schedule. Fortunately, the SHAPE project leader and the HLRS staff were flexible and respected this circumstance.
• The excellent HLRS support enables HPC for SMEs even if the company has little experience in supercomputing.
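The following back-of-the-envelope estimate illustrates the second point above. The stable time step and throughput values are assumed orders of magnitude chosen purely for illustration; they were not measured in this project.

# Illustrative estimate of why explicit analysis becomes prohibitive for long
# welding processes. The stable time step is assumed to be on the order of a
# microsecond (a typical order of magnitude for fine solid meshes), not a value
# measured in this project.
stable_dt = 1.0e-6        # assumed stable explicit time step [s]
process_time = 30000.0    # long welding task mentioned above [s]

n_steps = process_time / stable_dt
print("explicit steps required: %.1e" % n_steps)   # about 3e10 steps

# Even at an (optimistically assumed) 1e5 time steps per hour of wall-clock time,
# the run would take on the order of 3e5 hours, i.e. decades.
steps_per_hour = 1.0e5
print("wall-clock hours: %.1e" % (n_steps / steps_per_hour))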
3.3.9 WB-Sails Ltd: CFD simulations of sails and sailboat performance
Overview
WB-Sails Ltd’s partners in the project have been CSC – IT Center for Science (Center of
Supercomputing) in Finland and Next Limit Technologies in Spain.
In the project, WB-Sails has deployed Next Limit’s fluid simulation software XFlow on the
CSC’s HPC cluster Taito.
Preparatory simulations mainly concerning sailboat aerodynamics have been performed.
Activity performed
Numerous test runs have been performed, including scalability testing, implementing the XFlow GUI on the Taito-GPU system, and robustness testing of XFlow's various solvers, in particular for external aerodynamics as well as for multiphase free surface problems. The free surface problem tests have been marred by a glitch in the software, which delivers different results in serial and in parallel runs. The problem is yet to be solved by Next Limit.
PRACE cooperation
CSC has been extremely helpful in all aspects of the project: creating the connections to the
HPC servers, scripting to automate the creation of batch runs, resolving problems associated
with the software compatibility, arranging the possibility for using the GUI (necessary for
post-processing) through GPU nodes. There has been a steep learning curve, which would by no means have been manageable without the support of the HPC provider CSC.
Benefits for SME
The HPC cluster allows the SME to do runs at a much higher resolution than with the in-house workstation. Even if the computational power used so far is modest from a supercomputing perspective (128 cores on 8 nodes, around 2500 core hours per run), the ability to do runs overnight that would normally take a week or more is much appreciated. Also, the high resolution of the runs (up to 42 million elements so far) provides a much more accurate picture of the flow phenomena, improving their understanding of sail interaction and aerodynamics. In the long run, this will be reflected in the quality of the SME's product as well.
Lessons learned
Implementing and running relatively new commercial software on an HPC system is much more demanding than expected. In particular, the inability to work with the normal graphical interface for the pre-processing, learning the Linux operating system from scratch, and batch scripting for parallel computing have all been both time-consuming and a learning experience.
In addition to the co-operation of CSC with the SME, the co-operation between CSC and the
software vendor Next Limit has worked well and helped WB-Sails a lot. CSC has given Next
Limit access to the HPC systems for their testing, and together with Next Limit the SME has been able to create a launcher script that caters for the pre-processing phase.
CSC was able to solve the post-processing problem, involving the software’s GUI, in a novel,
still experimental way of working through a VNC connection in Taito’s GPU-nodes. Together
with Next Limit’s launcher script, the Taito-GPU connection makes it possible to work in an
almost office-like environment, hardly needing to employ the Linux command line interface
at all.
The SME has been forced to scale down its ambitions with regard to the objectives of the initial project (optimization through systematic geometry variation). Also, the inability to run free surface cases in parallel has been a disappointment, but there is still hope that the issues can be solved with Next Limit, so as to fulfil this important part of the initial project.
A considerable part of the learning process has to do with the basic interfacing to the remote HPC system: installing and learning to use the various connecting software (sftp, ssh, VNC etc.) on one's own operating system, which is not necessarily familiar to the HPC provider and not always compatible with the software on the server side. Even such a simple feature as keyboard functioning, for example, may cause surprising password-related issues, and can take a couple of weeks to solve through googling and trial and error.
3.3.10 Principia: HPC for Hydrodynamics Database Creation
Overview
Principia is a scientific engineering company that performs engineering studies for large
industrial companies, and develops and industrializes added-value numerical software
solutions.
The main goal of this project is to optimize one of Principia's codes, Diodore, on an HPC infrastructure in order to improve the Deepline HPC product.
Principia and CINES will be working together on this topic, aiming to transfer optimization and parallelization skills so that Principia staff can reuse the methodology on others of their HPC codes.
Activity performed
The beginning of the project has been delayed due to security aspects and Principia's staff availability. The original goals were to profile, optimize and compile the targeted software on a machine administered by Principia and to run benchmarks on the PRACE infrastructure. Instead, the PRACE infrastructure will now be used throughout the project.
The project is still in the starting phase, waiting for the NDA between Principia and the CINES experts to be established.
PRACE cooperation
For now, cooperation has been restricted to meeting organization. The project reporting has also been allowed to move to the third call.
Benefits for SME
Not yet applicable.
Lessons learned
From the PRACE point of view, this project took time to begin. Such delays could have been better anticipated given the security requirements of the SME. Investigating potential security requirements more thoroughly at the application stage would have been useful. If the exact workflow required by the SME had been known in advance, the appropriateness or otherwise of moving the code to another country's infrastructure would have been realised earlier, which might have alleviated some of the difficulties the project is now facing.
3.3.11 Algo’tech Informatique
Overview
ALGO’TECH INFORMATIQUE is an ISV located near Biarritz, in the south-west of France.
It creates and sells a suite of software dedicated to electrical engineering. This software is used by design offices to draw electrical schematics.
Electrical devices have taken on a major role in all types of electrical, automated and
embedded systems. Cables, both shielded and non-shielded, have thus become a serious issue
in terms of safety, on-board weight and hence performance and consumption, as well as cost
and reliability. Today, the decision whether or not to shield a cable against electromagnetic effects is complex. Simulation has become mandatory to obtain a first level of decision support.
The main target of the project is to validate the possibility for Algo’Tech Informatique to use
HPC in the area of electromagnetic simulations to support them up front when designing their
installations or later to eliminate the effects when those installations are commissioned.
Electromagnetic problems and disturbances occur quite frequently. It is essential to verify
during the design stage that their cables are impervious to electromagnetic effects. Most
importantly, they must have the means, when the installations are set up, to determine the best
configurations to eliminate electromagnetic effects.
For several years, Algo’Tech Informatique has been developing, in partnership with the
French General Directorate for Armament DGA (RAPID project) and the French Atomic
Energy Commission CEA in GRAMAT, an electromagnetic simulator. This simulator uses
the circuit simulator developed by Algo’Tech Informatique within the framework of the
FRESH project (a European project of the Sixth Framework Research and Development
Program [FRDP] 2004-2007).
The simulator is now in the phase of validation. It operates on PCs to solve small and
medium-size electromagnetic problems. Unfortunately, when an electrical installation is too
big, computing on a PC becomes too time-consuming to meet user needs.
Previous activities performed under the HPC-PME Initiative: The preliminary studies carried
out by the INRIA HiePACS team (The French Institute for computer science and applied
mathematics) as part of the HPC-PME Initiative (HPC for SME, conducted in France by
GENCI, INRIA and BPI-FRANCE) concluded it was necessary to transfer calculations to
HPC.
Previous activities performed under the Fortissimo program: It is possible to model the
electromagnetic effects on the cables, wires, strands and harnesses that make up the
connections of electrical systems: we can define a sparse linear system to solve.
For example, for an installation made up of 100 wires, each 100 metres long, the wires have to be cut into 1-metre sections to obtain a good simulation – in other words, a total of 10,000 sections. Each of these sections involves about 100 equations to model the electromagnetic effects (depending on the number of contiguous wires), resulting in a 1,000,000 by 1,000,000 system: a sparse but quite voluminous system of linear equations that needs to be solved 1,000 times to generate a frequency sweep.
Such a system cannot be solved within an acceptable time frame on a PC. The aim of the
solution is to produce the whole installation on a PC-type computer (desktop or laptop) and be
able to connect automatically with a computing centre to quickly perform the calculations and
recover the results for modelling on the PC.
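To make the structure of such a frequency sweep concrete, the following Python sketch solves a toy-sized, frequency-dependent sparse system with SciPy's direct solver. The matrix assembly is a placeholder, not Algo'Tech's cable model, and the production approach relies on INRIA's parallel solver libraries (e.g. PaStiX) at problem sizes of the order of 10^6 unknowns.

# Illustrative sketch of a frequency sweep over sparse linear systems, using a
# deliberately tiny system and SciPy's sparse LU factorization. assemble_system()
# is a placeholder standing in for the per-section electromagnetic coupling
# equations described above.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def assemble_system(frequency, n=1000):
    # Placeholder: a frequency-dependent, diagonally dominant tridiagonal matrix.
    main = (4.0 + 1.0e-5 * frequency) * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
    b = np.ones(n)
    return A, b

results = []
for frequency in np.linspace(1.0e3, 1.0e6, 10):   # 10 sweep points instead of 1,000
    A, b = assemble_system(frequency)
    lu = spla.splu(A)          # sparse LU factorization (direct method)
    x = lu.solve(b)
    results.append((frequency, np.linalg.norm(x)))

print(results[0])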
In the context of a first project (the FP7 Fortissimo programme), we addressed the use of cloud-based services to solve large-scale electromagnetic problems. A specific driver was implemented to extract the resolution step of the Algo’Tech simulator. The scalability was quite good, but in order to address larger problems (Algo’Tech would like to solve problems with about 5 million unknowns), HPC infrastructures were needed for the resolution step.
This was the context in which Algo’Tech Informatique modified its electromagnetic simulator in order to include the calculation libraries of INRIA. This introduces parallelization into the source code of the simulator and allows it to take advantage of multi-processor and multi-core architectures.
Activity Performed
The SHAPE project led by Algo’Tech was delayed and no concrete activities were carried out in 2015. Indeed, Algo’Tech first needed to finish its Fortissimo project before starting its SHAPE project. More precisely, the company had to modify its electromagnetic simulator to include the calculation libraries of INRIA.
In the context of SHAPE, some discussions were held with INRIA at the beginning of 2016. They are currently working together to develop a Windows version of the Algo’Tech software (with a DELPHI integration). The aim is to obtain a native-format library (with Visual C++) as well as better use of threads and BLAS (with OpenBLAS).
PRACE cooperation
No access to PRACE supercomputers yet.
Benefits for SME
The goal is to apply electromagnetic simulation to harnesses in order to find the best position and routing of the wires and avoid problems due to electromagnetic disturbances. It prevents adverse electromagnetic effects that would require changing the design of the product, and it reduces operations and weight caused by non-essential shielding and armouring of cables.
The target market is all SMEs and independent design offices working on embedded systems and equipment for the aeronautics, automotive, railway or shipbuilding industries and for machine manufacturers.
Today, to the SME's knowledge there is no entry-level product on the market that is easy to use without the need for electromagnetic experts. The need for such tools is growing fast, as electric control is becoming more and more important in all these industries.
The objective is to offer this simulation tool in SaaS mode, including access to HPC resources. The product should be finalized and ready for the market by 2019.
This product is proposed to three types of target customer:
1. Small companies with limited configurations, fewer than 100 wires and 200 connection points. The estimate is an average of five simulations a year at a fixed price of 200€ per simulation.
2. Enterprises with larger configurations, between 100 and 1000 wires and/or up to 2000 connection points. The price of a simulation will be linked to the number of wires or connection points and will vary from 200€ to 1000€. The estimate is 10 simulations a year at an average cost of 700€.
3. Enterprises with large or very large configurations, above 1000 wires and 2000 connection points. The SME banks on an average simulation cost of 1500€ and 30 simulations a year.
Lessons Learned
Solving large sparse systems of linear equations is a crucial and time-consuming step, arising
in many scientific and engineering applications. Consequently, many parallel techniques for
sparse matrix factorization have been studied, designed and implemented. Solving a sparse
linear system by a direct method is generally a highly irregular problem that induces some
challenging algorithmic problems and requires a sophisticated implementation scheme in
order to fully exploit the capabilities of modern hierarchical supercomputers. In this context,
graph partitioning and nested dissection approaches have played a crucial role. The PaStiX
solver has been widely used by industrial partners and academic teams, and the current
project is the opportunity to demonstrate the efficiency of the approach for SMEs that have
assessed and realized the technological leap needed for developing accelerated software using
HPC facilities.
3.4 Summary of lessons learned and feedback
In this section, the salient points raised from the lessons learned are highlighted, and feedback garnered from the PRACE partners via the SHAPE teleconferences and face-to-face meetings is also presented.
• Whilst not always the case, generally SMEs have a preference for working with a centre located in the same country as them.
• Routes to follow-on work once the initial SHAPE project is complete should be investigated, e.g. highlighting further collaborative funding opportunities.
• Security requirements can cause delays – it is important to highlight these as early as possible to ensure correct choices are made at the start.
• PRACE resources (and shared HPC resources generally) may have queuing systems which offer quite a different experience to the SME, and may not meet their expectations – this needs to be highlighted early to the SME so that their expectations are managed.
• Regular communication between the SME and the partners appears to be a common theme across the projects which encountered no significant barriers.
• Flexibility is key, from both sides – the initial workplans may need to be adapted, and SMEs need to remain engaged and adapt their goals accordingly. Also, SMEs by their very nature may not be able to provide consistent effort in the same way as larger organisations, so an appreciation of this at the outset is beneficial.
• The dual-application procedure for SHAPE (once for a SHAPE expert, then again for machine time) was confusing for all concerned. An attempt has been made to address this in the third call (see section 3.5).
• To avoid confusion and potential conflict, some SHAPE terms and conditions or "rules of engagement" should be drafted and signed up to by collaborators, to ensure that the scope and indeed the limitations of the SHAPE programme are clearly understood.
• Many of the applications to SHAPE involve third-party software, so potential licensing issues need to be investigated early in the project.
• Another potential issue could be successive applications by the same SME to SHAPE, which could be pointed to as unfair competition with commercial HPC services using public money.
• Some PRACE partner nations have policies restricting the usage of their HPC facilities for projects with industrial partners – a consideration should be made here on the appropriate choice of facility.
3.5 SHAPE third call
Following the successful pilot run of SHAPE under PRACE-3IP, and the subsequent second
call for applications under PRACE-4IP, the third call was launched on November 16th, 2015 and closed on January 14th, 2016. Given the feedback from the previous calls, the application process
for this call was changed. One of the main issues repeatedly raised was the double application
process – applying for SHAPE assistance, and then if successful, having then to apply for
machine time. To mitigate this, the application form was amended to include the opportunity
for the SME to include more technical information (if known – it was anticipated that many
SMEs would not have the knowledge at hand at this stage, but if they did they could supply it
here). In addition, the review team was composed as for the second call, but with the addition
of the Preparatory Access Type C coordinator who gave a high-level preparatory access type
appraisal of the applications to try to identify any potential showstoppers early in the process.
This was a useful exercise: the coordinator’s input to the review was invaluable and provided
an angle on the feasibility of the applications that may have been missed by the other board
members. However, it is expected that the successful applicants will still have to go through
the PA access scheme to get machine time: at least they will have confidence in being
successful with their bid, but it is strongly recommended that this situation is revisited before
the next call to see if it can be truly made a single-application process.
The third call received eight applications, listed in Table 14 below. As of March 2016, the
applications have been approved and are in the process of being matched with partners.
Company | Country | Project Title
ACOBIOM | France | MARS (Matrix of RNA-Seq)
Airinnova AB | Sweden | High level optimization in aerodynamic design
AmpliSIM | France | DemocraSIM: DEMOCRatic Air quality SIMulation
ANEMOS SRL | Italy | SUNSTAR: Simulation of UNSteady Turbulent flows for the AeRospace industry
BAC Engineering Consultancy Group | Spain | Numerical simulation of accidental fires with a spillage of oil in large buildings
Creo Dynamics AB | Sweden | Large scale aero-acoustics applications using open source CFD
FDD Engitec S.L. | Spain | Pressure drop simulation for a compressed gas closed system
Pharmacelera | Spain | HPC methodologies for PharmScreen
Table 14: Third call applications
3.6 SHAPE: future
The frequency of the SHAPE calls is being increased, to every six months, so the next call is
planned to open June 2016 and then continue at 6-monthly intervals after that. As noted
earlier, it is recommended that the SHAPE application process be reviewed and enhanced
further to ensure it is a single-application process.
With regards to PRACE-5IP, another recommendation is to have a pool of effort for SHAPE
projects: at the moment partners already have effort, and they volunteer this effort to take on
an SME collaboration. However, this is rather inflexible, for example if the SME has a
preference to work with a particular partner but that partner has no effort remaining, or if the
expertise required for a project is with a partner without effort in the SHAPE-encompassing
work package. In these cases, having a pool of effort that the partners could access as appropriate would ensure that matching the SMEs with the most suitable PRACE partner is a much more straightforward process.
The number of SMEs applying at each call has been decreasing, albeit slowly. Consideration
should be given to ways of further raising awareness of the SHAPE programme and
enhancing publicity. Indeed, steps are already being taken in this direction and SHAPE is
going to liaise with the Industry Advisory Committee to gain advice and assistance with this.
4 Summary
Two parallel sub-tasks on application enabling in Work Package 7 of PRACE-4IP have been described, including final reports on the supported applications. These two activities have been organized into support projects, formed on the basis of either scaling and optimisation support for Preparatory Access or SHAPE.
4.1 Preparatory Access Type C
During PRACE-4IP, Task 7.1.A successfully performed five cut-offs for preparatory access, including the associated review process and support for the approved projects.
In total four Preparatory Access Type C projects have been supported or are currently supported by T7.1.A in PRACE-4IP. Most of the projects reported in this deliverable plan to produce or have already produced a white paper. Approved white papers are published online on the PRACE RI web page [3]. Table 15 gives an overview of the status of the white papers for all projects.
The projects from the March 2016 cut-off are currently being reviewed and therefore do not appear in this deliverable.
Project ID | White paper | White paper status
2010PA2431 | – | No white paper produced
2010PA2452 | WP207: Hybrid MIMD/SIMD High Order DGTD Solver for the Numerical Modeling of Light/Matter Interaction on the Nanoscale | Published online [4]
2010PA2457 | WP208: Large Scale Parallelized 3d Mesoscopic Simulations of the Mechanical Response to Shear in Disordered Media | Published online [5]
2010PA2458 | WP209: Optimising PICCANTE – an Open Source Particle-in-Cell Code for Advanced Simulations on Tier-0 Systems | Published online [6]
2010PA2486 | WP210: Parallel Subdomain Coupling for non-matching Meshes in Alya | Published online [7]
2010PA2737 | WP211: Numerical Simulation of complex turbulent Flows with Discontinuous Galerkin Method | Published online [10]
2010PA2821 | – | Project results will be published in a separate white paper in the context of Work Package 7.
2010PA3125 | – | Project finishes by the end of July 2016. White paper will subsequently be produced.
2010PA3056 | – | Project finishes by the end of July 2016. White paper will subsequently be produced.
Table 15: White paper status of the current PA C projects.
Table 15 shows the success of Task 7.1.A, as almost all finalized projects have published their results or plan to publish them in the near future.
4.2 SHAPE
The SHAPE programme continues, with a third call having just concluded and the next call
planned to open June 2016. Six of the second call projects are finished, and their white papers
will be delivered April 2016. The remaining second call projects are progressing.
Recommendations have been made to enhance the SHAPE process, such as by streamlining
the application to a single-step process, improving publicity, and in 5IP creating a pool of
effort for partners willing to work with the SMEs.