# True convergence of HPC and Big Data/AI, towards AI-exaflops Satoshi Matsuoka Professor, GSIC, Tokyo Institute of Technology / Director, AIST-Tokyo Tech. Big Data Open Invovation Lab

Fellow, Artificial Intelligence Research Genter, AIST, Japan / Vis. Researcher, Advanced Institute for Computational Science, Riken

> VI-HPS Keynote Presentation 2017/06/23

Seeheim, Germany

|                                                | H                          | PC Asia 2018                          |
|------------------------------------------------|----------------------------|---------------------------------------|
| <b>FP</b> (Asia 2018                           |                            | Call for Papers                       |
| International Conference on                    |                            | Important Dates<br>Call for Workshops |
|                                                | acific Region              | Venue                                 |
| Tokyo Japan, Jan. 28 - 31, 2018                | Workshop proposal deadline | June 30, 2017                         |
|                                                | Paper submission deadline  | July 28, 2017                         |
| Society for Industrial and Applied Mathematics | _f 🖪 🔠 in 😣                |                                       |

#### CONFERENCES >

About SIAM Activity Groups Advertising Books Careers & Jobs Conferences **Customer Service Digital Library Fellows Program** History Project Journals Membership Prizes & Recognitions Proceedings Public Awareness Reports Sections SIAM News Students

SIAM Home

#### siam news



SIAM Conference on Parallel Processing for Scientific Computing



#### **SUBMISSION DEADLINES**

August 21, 2017: Minisymposium Proposal Submissions September 18, 2017: Contributed Lecture, Poster and Minisymposium Presentation Abstracts



Haneda Airport

### Big Data Exploding with IoT

#### Zeta(=10<sup>21</sup>)Bytes

Peta(=10<sup>15</sup>)Bytes

#### Exa(=10<sup>18</sup>)Bytes

### Tera(=10<sup>12</sup>)Bytes

### TSUBAME2.0 Nov. 1, 2010 "The Greenest Production Supercomputer in the World"

System

(42 Racks)

1408 GPU Compute Nodes,

34 Nehalem "Fat Memory" Nodes

- GPU-centric (> 4000) high performance & low power
- Small footprint (~200m2 or 2000 sq.ft), low TCO
- High bandwidth memory, optical network, SSD storage...



HPC and BD/AI Convergence Example [Yutaka Akiyama, Tokyo Tech]



6

### **EBD vs. EBD :** Large Scale Homology Search for Metagenomics

- Revealing uncultured microbiomes and finding novel genes in various environments
- Applied for human health in recent years



### Development of Ultra-fast Homology Search Tools x100,000 ~ x1,000,000 c.f. high-end BLAST WS (both FLOPS and BYTES)



#### Plasma Protein Binding (PPB) Prediction by Machine Learning Application for peptide drug discovery



- Candidate peptides are tend to be degraded and excreted faster than small molecule drugs
- Strong needs to design bio-stable peptides for drug candidates



Previous PPB prediction software for small molecule can not predict peptide PPB



#### Molecular Dynamics Simulation for Membrane Permeability Application for peptide drug discovery



### RWBC-OIL 2-3: Tokyo Tech IT-Drug Discovery Factory Simulation & Big Data & AI at Top HPC Scale

(Tonomachi, Kawasaki-city: planned 2017, PI Yutaka Akiyama)





### EBD App2: Miyoshi Group (Weather Forecast Application)



#### **Big Data Assimilation** for severe weather forecast

Goal : Pinpoint (100-m resol.) forecast of severe local weather by updating 30-min forecast every 30 sec!

Revolutionary super-rapid 30-sec. cycle



## Tremendous Recent Rise in Interest by the Japanese Government on Big Data, DL, AI, and IoT

- Three national centers on Big Data and AI launched by three competing Ministries for FY 2016 (Apr 2015-)
  - METI AIRC (Artificial Intelligence Research Center): AIST (AIST internal budget + > \$200 million FY 2017), April 2015
    - Broad AI/BD/IoT, industry focus
  - MEXT AIP (Artificial Intelligence Platform): Riken and other institutions (\$~50 mil), April 2016
    - A separate Post-K related AI funding as well.
    - Narrowly focused on DNN
  - MOST Universal Communication Lab: NICT (\$50~55 mil)
    - Brain –related AI
  - \$1 billion commitment on inter-ministry AI research over 10 years



Vice Minsiter Tsuchiya@MEXT Annoucing AIP estabishment



### 2015- AI Research Center (AIRC), AIST





Director: Jun-ichi Tsujii



Matsuoka : Joint appointment as "Designated" Fellow since July 2017



DENSO IT LABORATORY, INC.

JST BigData CREST JST AI CREST

**Application Area** Natural Langauge Processing

Robotics

#### Characteristics of Big Data and Al Computing As BD / Al Dense LA: DNN

Graph Analytics e.g. Social Networks

Sort, Hash, e.g. DB, log analysis

Symbolic Processing: Traditional AI



As HPC Task Integer Ops & Sparse Matrices Data Movement, Large Memory Sparse and Random Data, Low Locality

Acceleration, Scaling

Opposite ends of HPC computing spectrum, but HPC simulation apps can also be categorized likewise



Acceleration via Supercomputers adapted to AI/BD

Inference, Training, Generation

As HPC Task Dense Matrices, Reduced Precision Dense and well organized neworks and Data



Acceleration, Scaling

### Sparse BYTES: The Graph500 – 2015~2016 – world #1 x 4 K Computer #1 Tokyo Tech[Matsuoka EBD CREST] Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu



### K-computer No.1 on Graph500: 4<sup>th</sup> Consecutive Time

- What is Graph500 Benchmark?
  - Supercomputer benchmark for data intensive applications.
  - Rank supercomputers by the performance of **Breadth-First Search** for very huge graph data.



This is achieved by a combination of high machine performance and **our software optimization**.

- Efficient Sparse Matrix Representation with Bitmap
- Vertex Reordering for Bitmap Optimization
- Optimizing Inter-Node Communications
- Load Balancing

etc.

 Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka, "Efficient Breadth-First Search on Massively Parallel and Distributed Memory Machines", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016 (to appear)

### TSUBAME-KFC/DL: TSUBAME3 Prototype [ICPADS2014]

Oil Immersive Cooling + Hot Water Cooling + High Density Packaging + Fine-Grained Power Monitoring and Control, <u>upgrade to /DL Oct. 2015</u>

> High Temperature Cooling Oil Loop 35~45°C ⇒ Water Loop 25~35°C (c.f. TSUBAME2: 7~17°C)

Single Rack High Density Oil Immersion 168 NVIDIA K80 GPUs + Xeon 413+TFlops (DFP) 1.5PFlops (SFP) ~60KW/rack

Container Facility 20 feet container (16m<sup>2</sup>) Fully Unmanned Operation

**Cooling Tower**:

Water 25~35°C

⇒ To Ambient Air

2013年11月/2014年6

Word #1 Green500

(Big Data) BYTES capabilities, in bandwidth and capacity, unilaterally important but often missing from modern HPC machines in their pursuit of FLOPS...

- Need <u>BOTH bandwidth and capacity</u> (BYTES) in a HPC-BD/AI machine:
  - Obvious for lefthand sparse ,bandwidthdominated apps
  - But also for righthand DNN: Strong scaling, large networks and datasets, in particular for future 3D dataset analysis such as CTscans, seismic simu. vs. analysis...)



(Source: http://www.dgi.com/images/cvmain\_overview/CV4DOverview\_Model\_001.jpg)





Proper arch. to support large memory cap. and BW, network latency and BW important

(Source: https://www.spineuniverse.com/imagelibrary/anterior-3d-ct-scan-progressive-kyphoscoliosis)

- 2017 Q2 TSUBAME3.0 Leading Machine Towards Exa & Big Data 1. "Everybody's Supercomputer" - High Performance (12~24 DP Petaflops, 125~325TB/s Mem, 55~185Tbit/s NW), innovative high cost/performance packaging & design, in mere 180m<sup>2</sup>...
- 2."Extreme Green" ~10GFlops/W power-efficient architecture, system-wide power control, advanced cooling, future energy reservoir load leveling & energy recovery
- 3. "Big Data Convergence" BYTES-Centric Architecture, Extreme high BW & capacity, deep memory hierarchy, extreme I/O acceleration, Big Data SW Stack for machine learning, graph processing, ...
- 4. "Cloud SC" dynamic deployment, container-based node co-location & dynamic configuration, resource elasticity, assimilation of public clouds...
- 5. "Transparency" full monitoring & user visibility of machine & job state, accountability via reproducibility

2006 TSUBAME1.0 80 Teraflops, #1 Asia #7 World "Everybody's Supercomputer"



2010 TSUBAME2.0 2.4 Petaflops #4 World "Greenest Production SC"

### 

2013 TSUBAME-KFC #1 Green 500

2013

**TSUBAME2.5** 

upgrade 5.7PF DFP

/17.1PF SFP

20% power

reduction

2017 TSUBAME3.0+2.5 ~18PF(DFP) 4~5PB/s Mem BW 10GFlops/W power efficiency Big Data & Cloud Convergence



Large Scale Simulation C Big Data Analytics Industrial Apps Overview of TSUBAME3.0 (#1 June 2017 Green 500) BYTES-centric Architecture, Scalability to all 2160 GPUs, all nodes, the entire memory hierarchy



### Early TSUBAME3 Architecture for Proposal Ultra High BW, Deep Mem Hierarchy, Low Latency NW



#### TSUBAME3: A Massively BYTES Centric Architecture for Converged BD/AI and HPC



~4 Terabytes/node Hierarchical Memory for Big Data / AI (c.f. K-compuer 16GB/node)

→ Over 2 Petabytes in TSUBAME3, Can be moved at 54 Terabyte/s or 1.7 Zetabytes / year

#### TSUBAME3: A Massively BYTES Centric Architecture for Converged BD/AI and HPC



~4 Terabytes/node Hierarchical Memory for Big Data / AI (c.f. K-compuer 16GB/node)

→ Over 2 Petabytes in TSUBAME3, Can be moved at 54 Terabyte/s or 1.7 Zetabytes / year

### TSUBAME3.0 Co-Designed SGI ICE-XA Blade (new) - No exterior cable mess (power, NW, water) - Plan to become a future HPE product



#### TSUBAME3.0 Compute Node SGI ICE-XA, a New GPU Compute Blade Co-



# Node Performance Comparison T2/2.5/3

| Metric                        | TSUBAME2.0<br>(2010) | TSUBAME2.5<br>(2013) | TSUBAME3.0<br>(2017) | Factor      |
|-------------------------------|----------------------|----------------------|----------------------|-------------|
| CPU Cores x Freq (GHz)        | 35.16                | 35.16                | 72.8                 | 2.07        |
| CPU Memory Capacity (GB)      | 54                   | 54                   | 256                  | 4.74        |
| CPU Memory Bandwidth (GB/s)   | 64                   | 64                   | 153.6                | 2.40        |
| GPU CUDA Cores                | 1,344                | 8,064                | 14,336               | 1.78        |
| GPU FP64 Peak (TFLOPS)        | 1.58                 | 3.93                 | 21.2                 | 13.4 & 5.39 |
| GPU FP32 Peak (TFLOPS)        | 3.09                 | 11.85                | 42.4                 | 13.7 & 3.58 |
| GPU FP16 (TFLOPS)             | 3.09                 | 11.85                | 84.8                 | 27.4 & 7.16 |
| GPU Memory Capacity (GB)      | 9                    | 18                   | 64                   | 7.1 & 3.56  |
| GPU Memory Bandwidth (GB/s)   | 450                  | 750                  | 2928                 | 6.5 & 3.90  |
| SSD Capacity (GB)             | 120                  | 120                  | 2000                 | 16.67       |
| SSD READ (MB/s)               | 550                  | 550                  | 2700                 | 4.91        |
| SSD WRITE (MB/s)              | 500                  | 500                  | 1800                 | 3.60        |
| Interconnect Bandwidth (Gbps) | 80                   | 80                   | 400                  | 5.00        |

# Hot Pluggable ICE-

### Smaller than the serverno cables or pipes

In India Maria

# 100Gbps x 4

**Figuid Cooled NVMe** 

2-25/10 83071

117 8

Xeon x 2

<mark>≥ 20 TeraFlops</mark>

**256 GByte Memory** 

D

PCIe NVMe

rive Bay x 4



### Integrated 100/200Gbps Fabric Backplane



MARAANAA

ELEKKKKKKK



# TSUBAME3.0 Datacenter





15 SGI ICE-XA Racks2 Network Racks3 DDN Storage Racks20 Total Racks

Compute racks cooled with 32 degrees warm water, Yearound ambient cooling **Av. PUE = 1.033** 

### Japanese Open Supercomputing Sites Aug. 2017 (pink=HPCI Sites)

| Peak<br>Rank | Institution                                | System                                                                                               | Double FP<br>Rpeak | Nov. 2016<br>Top500 |
|--------------|--------------------------------------------|------------------------------------------------------------------------------------------------------|--------------------|---------------------|
| 1            | U-Tokyo/Tsukuba U<br>JCAHP                 | Oakforest-PACS - PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C<br>1.4GHz, Intel Omni-Path              | 24.9               | 6                   |
| 2            | Tokyo Institute of<br>Technology GSIC      | TSUBAME 3.0 - HPE/SGI ICE-XA custom NVIDIA Pascal P100 + Intel Xeon, Intel OmniPath                  | 12.1               | NA                  |
| 3            | Riken AICS                                 | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect<br>Fujitsu                                      | 11.3               | 7                   |
| 4            | Tokyo Institute of<br>Technology GSIC      | TSUBAME 2.5 - Cluster Platform SL390s G7, Xeon X5670 6C 2.93GHz, Infiniband QDR, NVIDIA K20x NEC/HPE | 5.71               | 40                  |
| 5            | Kyoto University                           | Camphor 2 – Cray XC40 Intel Xeon Phi 68C 1.4Ghz                                                      | 5.48               | 33                  |
| 6            | Japan Aerospace<br>eXploration Agency      | SORA-MA - Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98GHz,<br>Tofu interconnect 2                   | 3.48               | 30                  |
| 7            | Information Tech.<br>Center, Nagoya U      | Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 2.2GHz, Tofu interconnect 2                                 | 3.24               | 35                  |
| 8            | National Inst. for<br>Fusion Science(NIFS) | Plasma Simulator - Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C<br>1.98GHz, Tofu interconnect 2          | 2.62               | 48                  |
| 9            | Japan Atomic Energy<br>Agency (JAEA)       | SGI ICE X, Xeon E5-2680v3 12C 2.5GHz, Infiniband FDR                                                 | 2.41               | 54                  |
| 10           | AIST AI Research<br>Center (AIRC)          | AAIC (AIST AI Cloud) – NEC/SMC Cluster, NVIDIA Pascal P100 + Intel<br>Xeon, Infiniband EDR           | 2.2                | NA                  |







#### Distributed Large-Scale Dynamic Graph Data Store

Keita Iwabuchi<sup>1, 2</sup>, Scott Sallinen<sup>3</sup>, Roger Pearce<sup>2</sup>, Brian Van Essen<sup>2</sup>, Maya Gokhale<sup>2</sup>, Satoshi Matsuoka<sup>1</sup>

- 1. Tokyo Institute of Technology (Tokyo Tech)
- 2. Lawrence Livermore National Laboratory (LLNL)

3. University of British Columbia



a place of mind The university of British columbia

#### Dynamic Graphs (temporal graph) Sparse Large Scale-free

- the structure of a graph changes dynamically over time
- many real-world graphs are classified into dynamic graph

Source: Jakob Enemark and Kim Sneppen, "Gene duplication models for directed networks with limits on growth", Journal of Statistical Mechanics: Theory and Experiment 2007

- social network, genome analysis, WWW, etc.
  - e.g., Facebook manages
    1.39 billion active users
    as of 2014, with more
    than 400 billion edges

• Most studies for large graphs have not focused on a dynamic

- graph data structure, but rather a static one, such as Graph 500
- Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory (this part is still at an early stage)

Based on K-Computer results, adaping to (1) deep memory hierarchy, (2) rapid dynamic graph changes



#### Node Level Dynamic Graph Data Store

Follows an adjacency-list format and leverages an open address hashing to construct its tables

Develop algorithms and SW exploiting large hierarchical memory



#### Dynamic Graph Construction (on-memory & NVM)

C.f. STINGER (single-node, on memory)

#### STINGER

 A state-of-the-art dynamic graph processing framework developed at Georgia Tech

Baseline model

• A naïve implementation using *Boost* library (C++) and the MPI communication framework



K. Iwabuchi, S. Sallinen, R. Pearce, B. V. Essen, M. Gokhale, and S. Matsuoka, Towards a distributed large-scale dynamic graph data store. In 2016 IEEE Interna- tional Parallel and Distributed Processing Symposium Workshops (IPDPSW) Large-scale Graph Colouring (vertex coloring)

- Color each vertices with the minimal #colours so that **no** two adjacent vertices have the same colour
- Compare our dynamic graph colouring algorithm on **DegAwareRHH** against:
  - 1. two static algorithms including GraphLab
  - 2. an another graph store implementation with same dynamic algorithm (Dynamic-MAP)



Scott Sallinen, Keita Iwabuchi, Roger Pearce, Maya Gokhale, Matei Ripeanu, "Graph Coloring as a Challenge Problem for Dynamic Graph Processing on Distributed Systems", SC' 16

**SC'16** 

### ScaleGraph Large-scale Graph Processing Framework enhanced w/ User-Friendly Python / Spark Interface

- ScaleGraph [Suzumura]
  - X10-based open source **Highly Scalable Large Scale Graph Analytics Library** beyond the scale of billions of vertices and edges on Distributed Systems
    - **XPregel**: Pregel-based bulk synchronous parallel graph processing framework
    - Built-in graph algorithms (Centrality, Connected Component, Clustering, etc.)
- NEW Development: Python Interface
  - Allow users to use ScaleGraph with Spark\* by easy python interface





\*Apache Spark: http://spark.apache.org/

#### Software stack

# Incremental Graph Community Detection

- Background
  - Community detection for large-scale time-evolving and dynamic graphs has been one of important research problems in graph computing.
  - It is time-wasting to compute communities entire graphs every time from scratch.
- Proposal
  - An incremental community detection algorithm based on core procedures in a state-of-the-art community detection algorithm named DEMON.
    - Ego Minus Ego, Label Propagation and Merge



Hiroki Kanezashi and Toyotaro Suzumura, An Incremental Local-First Community Detection Method for Dynamic Graphs, Third International Workshop on High Performance Big Graph Data Management, Analysis, and Mining (BigGraphs 2016), to appear



### GPU-based Distributed Sorting [Shamoto, IEEE BigData 2014, IEEE Trans. Big Data 2015]

- Sorting: Kernel algorithm for various EBD processing
- Fast sorting methods
  - Distributed Sorting: Sorting for distributed system
    - Splitter-based parallel sort
    - Radix sort
    - Merge sort
  - Sorting on heterogeneous architectures
    - Many sorting algorithms are accelerated by many cores and high memory bandwidth.
- Sorting for large-scale heterogeneous systems remains unclear
- We develop and evaluate <u>bandwidth and latency reducing</u> GPU-based HykSort on TSUBAME2.5 <u>via latency hiding</u>
  - Now preparing to release the sorting library



#### **GPU implementation of splitterbased sorting** (HykSort)

- Weak scaling performance (Grand Challenge on TSUBAME2.5)
  - 1 ~ 1024 nodes (2 ~ 2048 GPUs)
  - 2 processes per node
  - Each node has 2GB 64bit integer
- C.f. Yahoo/Hadoop Terasort: 0.02[TB/s]
  - Including I/O

#### **Performance prediction**





### Xtr2sort: Out-of-core Sorting Acceleration using GPU and Flash NVM [IEEE BigData2016]

How to combine deepening memory layers for future HPC/Big Data workloads, targeting Post Moore Era?

- Sample-sort-based Out-of-core Sorting Approach for Deep Memory Hierarchy Systems w/ GPU and Flash NVM
  - I/O chunking to fit device memory capacity of GPU
  - Pipeline-based Latency hiding to overlap data transfers between NVM, CPU, and GPU using asynchronous data transfers, e.g., cudaMemCpyAsync(), libaio



### Out-of-core GPU-MapReduce for

### Large-scale Graph Processing [IEEE Cluster 2014]

Emergence of large-scale graphs

- SNS, road network, smart grid, etc.
- Millions to trillions of vertices/edges
- $\rightarrow$  Need for fast graph processing on supercomputers

**Problem:** GPU memory capacity limits scalable large-scale graph processing

**Proposal:** Out-of-core GPU memory management on MapReduce

- Stream-based GPU MapReduce
- Out-of-core GPU sorting

#### **Experimental Results:**

performance improvement over CPUs

- Map: 1.41x, Reduce: 1.49x, Sort: 4.95x speedup
- Overlapping communication effectively



# Hierarchical, UseR-level and ON-demand File system(HuronFS) (IEEE ICPADS 2016) w/LLNL



- HuronFS: dedicated dynamic instances to provide "burst buffer" for caching data
- I/O requests from Compute Nodes are forwarded to HuronFS
- The whole system consists of several SHFS (Sub HuronFS)
  - Workload are distributed among all the SHFS using hash of file path
- Each SHFS consists of a Master and several IOnodes
  - Masters: controlling all IOnodes in the same SHFS and handling all I/O requests
  - IOnodes: storing actual data and transferring data with Compute Nodes
- Supporting TCP/IP, Infiniband (CCI framework)
- Supporting Fuse, LD\_PRELOAD

### HuronFS Basic IO Performance



Throughput from single IOnode

### Plans

- Continuing researching on auto buffer allocation
- Utilizing computation power on IOnodes
  - Data preprocessing
  - Format conversion



### GPU-Based Fast Signal Processing for Large Amounts of Snore Sound Data

Background

Snore sound (SnS) data carry very important information for diagnosis and evaluation of Primary Snoring and Obstructive Sleep Apnea (OSA). With the increasing number of collected SnS data from subjects, how to handle such large amount of data is a big challenge. In this study, we utilize the Graphics Processing Unit (GPU) to process a large amount of SnS data collected from two hospitals in China and Germany to accelerate the features extraction of biomedical signal.

• Acoustic features of SnS data

we extract **11** acoustic features from a large amount of SnS data, which can be visualized to help doctors and specialists to diagnose, research, and remedy the diseases efficiently.

| Subjects       | Total Time<br>(hours) | Data Size<br>(GB) | Data<br>format | Sampling Rate |
|----------------|-----------------------|-------------------|----------------|---------------|
| 57<br>(China + | 187.75                | 31.10             | WAV            | 16 kHz, Mono  |
| Germany)       |                       |                   |                |               |

#### Snore sound data information



Results of GPU and CPU based systems for processing SnS data

#### • Result

We set 1 CPU (with Python2.7, numpy 1.10.4 and scipy 0.17 packages) for processing 1 subject's data as our baseline. Result show that the GPU based system is almost  $4.6 \times$  faster than the CPU implementation. However, the speed-up decreases when increasing the data size. We think that this result should be caused by the fact that, the transmission of data is not hidden by other computations, as will be a real-world application.

\* Jian Guo, Kun Qian, Huijie Xu, Christoph Janott, Bjorn Schuller, Satoshi Matsuoka, "GPU-Based Fast Signal Processing for Large Amounts of Snore Sound Data", In proceedings of 5th IEEE Global Conference on Consumer Electronics (GCCE 2016), October 11-14, 2016.

### TSUBAME3.0 Container-Based Fine-grained Spatial Resource Allocations of Fat Nodes



Resource Isolation via UGE Containers (future Docker etc.)

| Job | Allocated Resource                      |  |  |  |
|-----|-----------------------------------------|--|--|--|
| 1   | CPU 2Cores, NIC0, GPU1, 32GB Mem        |  |  |  |
| 2   | CPU 8 Cores, 64GB Mem                   |  |  |  |
| 3   | CPU 4 Cores, GPU0、16GB Mem              |  |  |  |
| 4   | CPU 8 Cores, 64GB Mem                   |  |  |  |
| 5   | CPU 4 Cores, NIC2&3, GPU2&3, 48G<br>Mem |  |  |  |

Container configuration and deployment tied to Univa Grid Engine

# Background

A kind of resource assignment fragmentation

Multi-GPU batch-queue systems have many idle GPUs despite having jobs waiting, due to the *scattered idle-GPU problem* [1].



Scenario: Job 1 requests two GPUs on one node but each node has only one unoccupied GPU left.Result: Job 1 cannot run and two GPUs are left idle.

[1] P. Markthub et al. "Using rCUDA to Reduce GPU Resource-Assignment Fragmentation Caused by Job Scheduler," PDCAT2014

# Idle-GPU Problem in Multi-GPU Batch-Queue Systems



51

# **Previous Solution & Problems**



Increased communication overhead

#### **Previous Solution [1]:**

- Enable the system to serve more jobs by creating a GPU proxy that links with a remote GPU.
- Proven to reduce job waiting times as much as 25%.

System can satisfy more jobs

#### **Problems:**

- Remote GPU execution overhead
- Network congestion

The execution times of GPU communication intensive applications (e.g. LAMMPS, SRAD) may increase more than 5 times!!!

[1] P. Markthub et al. "Using rCUDA to Reduce GPU Resource-Assignment Fragmentation Caused by Job Scheduler," PDCAT2014

# **New Solution Overview**



Migrate execution on a remote GPU to a local GPU when it becomes available can solve the performance problems

#### **Propose:**

#### Low-overhead remote GPU execution middleware

- **1. mrCUDA:** an extension of *rCUDA* [1] to enable remote-to-local GPU migration.
- **2. MRQ:** a heuristic extension of job scheduling algorithms to make the best out of mrCUDA.
- [1] F. Silla, "Is remote GPU virtualization useful?" http://rcuda.net/pub/rCUDA barna 15.pdf, September 2015.

# mrCUDA

**Objective:** Enable seamless and on-demand remote-to-local GPU migration on rCUDA

- rCUDA handles remote GPU execution, while mrCUDA handles GPU migration.
- GPU migration starts after mrCUDA receives a migration command via its special UNIX socket.



**Migration Algorithm** – a modified version of Replay Method [1]:

- Intercept all CUDA invocations.
- Before migration: Pass all intercepted calls to *rCUDA* while recording some CUDA calls (e.g. cudaMalloc). To recreate remote GPU's states on local GPU
- **During migration:** Replay the recorded calls in order and memsync GPU data.
- After migration: Pass all intercepted calls to *libcudart* without recording.

[1] A. Nukada et al. "NVCR: A transparent checkpoint-restart library for NVIDIA CUDA," IPDPWS2011

### **Case Study: Migrating remote CUDA Execution of LAMMPS**



<sup>\*2</sup> nodes, Tesla K20c, InfiniBand 4xFDR **\*x%:** migrate after finish x% of total iterations

# **GPU Occupancy Patterns**

Systems can server more jobs concurrently with MRQ.

MRQ uses the same scheduling policy as FCFS.

Jobs do not experience significant execution time expansion, mainly thanks to GPU migration.



# Open Source Release of EBD System Software (install on T3/Amazon/ABCI)

- mrCUDA
  - rCUDA extension enabling remoteto-local GPU migration
  - <u>https://github.com/EBD-</u> <u>CREST/mrCUDA</u>
  - GPU 3.0
  - Co-Funded by NVIDIA
- Huron FS (w/LLNL)
  - I/O Burst Buffer for Inter Cloud Environment
  - <u>https://github.com/EBD-</u> <u>CREST/cbb</u>
  - Apache License 2.0
  - Co-funded by Amazon

- ScaleGraph Python
  - Python Extension for ScaleGraph X10-based Distributed Graph Library
  - <u>https://github.com/EBD-</u> <u>CREST/scalegraphpython</u>
  - Eclipse Public License v1.0
- GPUSort
  - GPU-based Large-scale Sort
  - <u>https://github.com/EBD-</u> <u>CREST/gpusort</u>
  - MIT License
- Others, including dynamic graph store

Estimated Compute Resource Requirements for Deep Learning [Source: Preferred Network Japan Inc.]

To complete the learning phase in one day



**JST-REST** "Development and Integration of Artificial Intelligence Technologies for Innovation Acceleration"

Fast and cost-effective deep learning algorithm platform for video processing in social infrastructure

Principal Investigator:Koichi ShinodaCollaborators:Satoshi Matsuoka

Tsuyoshi Murata Rio Yokota Tokyo Institute of Technology (Members RWBC-OIL 1-1 and 2-1)

# Background

- Video processing in smart society for safety and security
  - Intelligent transport systems Drive recorder video
  - Security systems Surveillance camera video
- Deep learning
  - Much higher performance than before
  - IT giants with large computational resources has formed a monopoly

Problems:





- Real-time accurate recognition of small objects and their movement
- Edge-computing without heavy traffic on Internet
- Flexible framework for training which can adapt rapidly to the environmental changes

### Research team





# 4 Layers of Parallelism in DNN Training

- Hyper Parameter Search
  - Searching optimal network configurations and parameters
  - Often use evolutionary algorithms
- Data Parallelism
  - Split and parallelize the batch data
  - Synchronous, asynchronous, hybrid, ...
- Model Parallelism
  - Split and parallelize the layer calculations in forward/backward propagation
- ILP and other low level Parallelism
  - Parallelize the convolution operations etc. (in reality matrix multiply)



#### Parallelizing Deep Neural Network Training Data Parallel SGD(Stochastic Gradient Descent)



Example AI Research: Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers Background Proposal

• In large-scale Asynchronous Stochastic Gradient Descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increase the error of trained DNN

We propose a empirical performance model for an ASGD deep learning system SPRINT which considers probability distribution of mini-batch size and staleness



Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016

### Performance Prediction of Future HW for CNN

Predicts the best performance with two future architectural extensions
 FP16: precision reduction to double the peak floating point performance
 EDR IB: 4xEDR InfiniBand (100Gbps) upgrade from FDR (56Gbps)

→ Not only flops, but also NW injection bandwidth is important for scalability

| Prediction of best parameters (average minibatch size $138\pm25\%$ ) |        |            |            |                        |  |  |  |
|----------------------------------------------------------------------|--------|------------|------------|------------------------|--|--|--|
|                                                                      | N_Node | N_Subbatch | Epoch Time | Average Minibatch Size |  |  |  |
| (Current HW)                                                         | 8      | 8          | 1779       | 165.1                  |  |  |  |
| FP16                                                                 | 7      | 22         | 1462       | 170.1                  |  |  |  |
| EDR IB                                                               | 12     | 11         | 1245       | 166.6                  |  |  |  |
| FP16 + EDR IB                                                        | 8      | 15         | 1128       | 171.5                  |  |  |  |

#### TSUBAME-KFC/DL ILSVRC2012 dataset deep learning Prediction of best parameters (average minibatch size 138±25%)





### METI AIST-AIRC ABCI



as the worlds first large-scale OPEN AI Infrastructure

- ABCI: <u>AI</u> Bridging <u>Cloud</u> Infrastructure
  - Top-Level SC compute & data capability for DNN (130<sup>200</sup> AI-Petaflops)
  - <u>Open Public & Dedicated</u> infrastructure for AI & Big Data Algorithms, Software and Applications
  - Platform to accelerate joint academic-industry R&D for AI in Japan



- < 3MW Power</li>
- < 1.1 Avg. PUE
- Operational 2017Q4 ~2018Q1



# The "Chicken or Egg Problem" of AI-HPC Infrastructures



- "On Premise" machines in clients => "Can't invest in big in AI machines unless we forecast good ROI. We don't have the experience in running on big machines."
- Public Clouds other than the giants => "Can't invest big in AI machines unless we forecast good ROI. We are cutthroat."
- Large scale supercomputer centers => "Can't invest big in AI machines unless we forecast good ROI. Can't sacrifice our existing clients and our machines are full"
- Thus the giants dominate, AI technologies, big data, and people stay behind the corporate firewalls…

# But Commercial Companies esp. the "AI Giants" are Leading AI R&D, are they not?

- Yes, but that is because their shot-term goals could harvest the low hanging fruits in DNN rejuvenated AI
- But AI/BD research is just beginning—— if we leave it to the interests of commercial companies, we cannot tackle difficult problems with no proven ROI
  - Very unhealthy for research
- This is different from more mature fields, such as pharmaceuticals or aerospace, where there is balanced investments and innovations in both academia/government and the industry



Meanwhile, Larry Page is planning to move its self-driving unit out of Google X, its

for human drivers.

A Google self-driving car on the road in Mountain View, C

### ABCI Prototype: AIST AI Cloud (AAIC) March 2017 (#3 June 2017 Green 500)

- 400x NVIDIA Tesla P100s and Infiniband EDR accelerate various AI workloads including ML (Machine Learning) and DL (Deep Learning).
- Advanced data analytics leveraged by 4PiB shared Big Data Storage and Apache Spark w/ its ecosystem.







### The "Real" ABCI – 2018Q1

#### • Extreme computing power

- w/ >130 AI-PFlops for AI/ML especially DNN
- x1 million speedup over high-end PC: 1 Day training for 3000-Year DNN training job
- TSUBAME-KFC (1.4 AI-Pflops) x 90 users (T2 avg)

#### • Big Data and HPC converged modern design

- For advanced data analytics (Big Data) and scientific simulation (HPC), etc.
- Leverage Tokyo Tech's "TSUBAME3" design, <u>but differences/enhancements</u> <u>being AI/BD centric</u>
- Ultra high BW & Low latency memory, network, and storage
  - For accelerating various AI/BD workloads
  - Data-centric architecture, optimizes data movement
- Big Data/AI and HPC SW Stack Convergence
  - Incl. results from JST-CREST EBD
  - Wide contributions from the PC Cluster community desirable.
- Ultra-Green (PUE<1.1), High Thermal (60KW) Rack
  - Custom, warehouse-like IDC building and internal pods
  - Final "commoditization" of HPC technologies into Clouds









### ABCI Cloud Infrastructure

#### Ultra-dense IDC design from ground-up

- Custom inexpensive lightweight "warehouse" building w/ substantial ABCI AI-IDC CG Image earthquake tolerance
- x20 thermal density of standard IDC
- Extreme green
  - Ambient warm liquid cooling, large Li-ion battery storage, and highefficiency power supplies, etc.
  - Commoditizing supercomputer cooling technologies Clouds (60KW/rack)
- Cloud ecosystem
  - Wide-ranging Big Data and HPC standard software stacks
- Advanced cloud-based operation
  - Incl. dynamic deployment, container-based virtualized provisioning, multitenant partitioning, and automatic failure recovery, etc.
  - Joining HPC and Cloud Software stack for real

#### Final piece in the commoditization of HPC (into IDC)





### ABCI Cloud Data Center "Commoditizing 60KW/rack Supercomputer".



• Single Floor, inexpensive build

- Hard concrete floor 2 tonnes/m2 weight tolerance for racks and cooling pods
- Number of Racks

• Initial: 90

- Max: 144
- Power Capacity
  - 3.25 MW (MAX)
- Cooling Capacity
  - 3.2 MW (Minimum in Summer)

#### Data Center Image

# Implementing 60KW cooling in Cloud IDC – Cooling Pods



Flat concrete slab – 2 tonnes/m2 weight tolerance



# ABCI Procurement Benchmarks



- Big Data Benchmarks
  - (SPEC CPU Rate)
  - Graph 500
  - MinuteSort
  - Node Local Storage I/O
  - Parallel FS I/O

# No traditional HPC Simulation Benchmarks Except SPECCPU

- AI/ML Benchmarks
  - Low precision GEMM
    - CNN Kernel, defines "AI-Flops"
  - Single Node CNN
    - AlexNet => RESNET?
    - ILSVRC2012 Dataset
  - Multi-Node CNN
    - Caffe+MPI (could allow other MPI-enabled frameworks)
  - Large Memory CNN
    - Convnet on Chainer
  - RNN / LSTM
    - OpenNMT RNN ( w/NICT UCL)
      - (collaboration

# **Basic Requirements for AI Cloud System**



#### Application

- Easy use of various ML/DL/Graph frameworks from Python, Jupyter Notebook, R, etc.
- ✓ Web-based applications and services provision

#### System Software

- ✓ HPC-oriented techniques for numerical libraries, BD Algorithm kernels, etc.
- $\checkmark\,$  Supporting long running jobs / workflow for DL
- Accelerated I/O and secure data access to large data sets
- ✓ User-customized environment based on Linux containers for easy deployment and reproducibility

#### Hardware

**0S** 

Modern supercomputing facilities based on commodity components

# Fujitsu Deep Learning Processor (DLUTM) Fujitsu





Supercomputer K technologies

- Architecture designed for Deep Learning
- High performance HBM2 memory
- Low power design

**DLU<sup>™</sup>** features

→ Goal: 10x Performance/Watt compared to others



Massively parallel : Apply supercomputer interconnect technology

- → Ability to handle large scale neural networks
- → TOFU Network derivative for massive scaling

Designed for Scalable Learning, technically superior to Google TPU2

"Exascale" Al possible in 1H2019





# Co-Design of BD/ML/AI with HPC using BD/ML/AI

Acceleration and Scaling of

BD/ML/AI via HPC and

**Technologies and** 

Infrastructures

### - for survival of HPC

Accelerating Conventional HPC Apps



Optimizing System Software and Ops



Future Big Data AI Supercomputer Design



Big Data Al-Oriented Supercomput

ers

ABCI: World's first and largest open 100 Peta AI-Flops AI Supercomputer, Fall 2017, for co-design <u>Mutual and Semi-</u> <u>Automated Co-</u> <u>Acceleration of</u> <u>HPC and BD/ML/AI</u>

Acceleration Scaling, and Control of HPC via BD/ML/AI and future SC designs Big Data and ML/AI Apps and Methodologies

Image and Video

Large Scale Graphs



**Robots / Drones** 

#### Sonar collects data from the HPC Center and applications, allowing users to access it with secure permissions (Slide courtesy Todd Gamblin @ LLNL)



LLNL-PRES-730739



#### We combine neural networks with queue simulation to predict resource utilization (Slide courtesy Todd Gamblin @ LLNL)

LLNL-PRES-730739





#### Power optimization using Deep Q-Network • Background

Kento Teranishi

Power optimization by frequency control in existing research  $P = f(x_1, x_2, ...)$  $T_{exe} = g(x_1, x_2, ...)$ Performance counter Frequency Temperature Frequency,... Detailed analysis is necessary Use Deep Learning for analysis. Low versatility Objective Implement the computer control system using Deep Q-Network. Counter Power Frequency Deep Q-Network (DQN) Frequency control Temperature Deep reinforcement learning etc. Calculate action value function Q from neural network Used for game playing AI, robot car, AlphaGO.

# We are implementing the US AI&BD strategies already ... in Japan, at AIRC w/ABCI

- Strategy 5: Develop shared public datasets and environments for AI training and testing. The depth, quality, and accuracy of training datasets and resources significantly affect AI performance. Researchers need to develop high quality datasets and environments and enable responsible access to high-quality datasets as well as to testing and training resources.
- Strategy 6: Measure and evaluate AI technologies through standards and benchmarks. Essential to advancements in AI are standards, benchmarks, testbeds, and community engagement that guide and evaluate progress in AI. Additional research is needed to develop a broad spectrum of evaluative techniques.

THE NATIONAL ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT STRATEGIC PLAN

National Science and Technology Council

Networking and Information Technology Research and Development Subcommittee

October 2016



What is worse: Moore's Law will end in the 2020's

- Much of underlying IT performance growth due to Moore's law
  - "LSI: x2 transistors in 1~1.5 years"
  - Causing qualitative "leaps" in IT and societal innovations
  - The main reason we have supercomputers and Google...
- •But this is slowing down & ending, by mid 2020s...!!!
  - End of Lithography shrinks
  - End of Dennard scaling
  - End of Fab Economics

The curse of <u>constant</u> <u>transistor power</u> shall



Gordon Moore

- How do we sustain "performance growth" beyond the "end of Moore"?
  - Not just one-time speed bumps
  - Will affect all aspects of IT, including BD/AI/ML/IoT, not just HPC
  - End of IT as we know it

### 20 year Eras towards of End of Moore's Law



flat performance

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten Dotted line extrapolations by C. Moore

Need to realize the next 20-year era of supercomputing

The "curse of constant transistor power"

- Ignorance of this is like ignoring global warming -
- Systems people have been telling the algorithm people that "FLOPS will be free, bandwidth is important, so devise algorithms under that assumption"
- This will certainly be true until exascale in 2020...
- But when Moore's Law ends in 2025-2030, constant transistor power (esp. for logic) = FLOPS will no longer be free!
- So algorithms that simply increase arithmetic intensity will no longer scale beyond that point
- Like countering global warming need disruptive change in computing in HW-SW-Alg-Apps etc. for the next 20 year era

### Performance growth via <u>data-centric computing:</u> <u>"From FLOPS to BYTES"</u>

- Identify the new parameter(s) for scaling over time
- Because data-related parameters (e.g. capacity and bandwidth) will still likely continue to grow towards 2040s
- Can grow transistor# for compute, but CANNOT use them AT THE SAME TIME(Dark Silicon) => <u>multiple computing units specialized to type of data</u>
- <u>Continued capacity growth</u>: 3D stacking (esp. direct silicon layering) and low power NVM (e.g. ReRAM)
- <u>Continued BW growth</u>: Data movement energy will be <u>capped constant</u> by dense 3D design and advanced optics from silicon photonics technologies
- Almost back to the old "vector" days(?), but no free lunch latency still problem, locality still important, need <u>general algorithmic acceleration</u> <u>thru data capacity and bandwidth</u>, not FLOPS

Many Core Era







Flops-Centric Algorithms and Apps

Flops-Centric System Software



Event

Hardware/Software System APIs Flops-Centric Massively Parallel Architecture



Transistor Lithography Scaling (CMOS Logic Circuits, DRAM/SRAM) ~2025 M-P Extinction



Bytes-Centric Algorithms and Apps

**Bytes-Centric System Software** 

Hardware/Software System APIs

Data-Centric Heterogeneous Architecture

Novel Devices + CMOS (Dark Silicon) (Nanophotonics, Non-Volatile Devices etc.) Post-Moore is NOT a More-Moore device as a panacea

Device & arch. advances improving data-related parameters over time

Runtime "Rebooting Computing" in terms of devices, architectures, software.New memory Devices PC-RAM Algorithms, and ReRAM applications necessary STT-MRAM => Co-Design even 3D architecture more important fabrication c.f. Exascale

