**Hybrid Parallel Programming for Massive Graph Analysis**

Kamesh Madduri

KMadduri@lbl.gov

Computational Research Division

Lawrence Berkeley National Laboratory

SIAM Annual Meeting 2010

July 12, 2010

**Hybrid Parallel Programming**

Large‐scale graph analysis utilizing

• Clusters of x86 multicore processors
  – MPI + OpenMP/UPC

• CPU + GPU
  – MPI + OpenCL

• FPGAs, accelerators
  – Host code + accelerator code

Why hybrid programming?

• Traditional sources of performance improvement are flat-lining

• We need new algorithms that exploit large on-chip memory, shared caches, and high DRAM bandwidth

Image source: Herb Sutter, “The Free Lunch Is Over”, Dr. Dobb’s Journal, 2005.

This talk: Two case studies

• MPI + OpenMP on shared-memory multicore processor clusters
  – Graph analytics on online social network crawls and synthetic “power-law” random graphs
  – Traversal and simplification of a DNA fragment assembly string graph arising in a de novo short-read genome assembly algorithm

Characterizing large-scale graph-theoretic computations

[Figure: the problem space characterized along two axes — locality (streaming data/local computation vs. random/global memory accesses) and computational complexity (O(N), O(N log N), O(N²)) — plotted against data size (N = number of vertices/edges, from 10⁴ through 10¹² and beyond, “Peta+”). Example queries range from “Enumerate all friends of K” and “List today’s top trending events” to “Enumerate all contacts of K within X hops” and “Find all events in the past six months similar to event ‘Y’”.]

Parallelization Strategy

[Figure: the same locality/complexity/data-size chart, annotated with parallelization strategies: replicate the graph on each node; partition the network, aggressively attempting to minimize communication; partition the network (at Peta+ scale).]

Minimizing Communication

• Irregular and memory-intensive graph problems: intra- and inter-node communication costs (plus I/O costs and memory latency) typically dominate local computational complexity

• Key to parallel performance: enhance data locality, avoid superfluous inter-node communication

– Avoid a P-way partitioning of the graph
– Instead, create replicas: partition the graph M_G/M_P ways (graph size M_G, per-node DRAM capacity M_P) and replicate that partitioning

[Figure: a multicore node (cores, caches, memory controller hub, DRAM of capacity M_P) and the graph plus its data structures, illustrating P = 12, M_G/M_P = 4: a 4-way partitioning with 3 replicas.]
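The replica arithmetic above can be sketched in a few lines. This is a minimal illustration, not the talk’s code; the function name and memory arguments are hypothetical, and memory is treated as a single scalar quantity (e.g., GB).

```python
import math

def replication_plan(num_processes, graph_mem, node_mem):
    """Partition the graph only as many ways as needed to fit one part
    in a node's DRAM, then replicate that partitioning across the
    remaining processes instead of doing a full P-way split."""
    num_parts = math.ceil(graph_mem / node_mem)   # M_G / M_P
    num_replicas = num_processes // num_parts     # P / (M_G / M_P)
    return num_parts, num_replicas

# The slide's example: P = 12, M_G/M_P = 4
# -> a 4-way partitioning with 3 replicas
parts, replicas = replication_plan(12, 4, 1)
```

Each replica group can then run traversals independently, trading memory for a large reduction in inter-node communication.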

Real‐world data

Assembled a collection of graphs for algorithm performance analysis, from some of the largest publicly-available network data sets.

| Name | # vertices | # edges | Type |
|---|---|---|---|
| Amazon-2003 | 473.30 K | 3.50 M | co-purchaser |
| eu-2005 | 862.00 K | 19.23 M | www |
| Flickr | 1.86 M | 22.60 M | social |
| wiki-Talk | 2.40 M | 5.02 M | collab |
| orkut | 3.07 M | 223.00 M | social |
| cit-Patents | 3.77 M | 16.50 M | cite |
| Livejournal | 5.28 M | 77.40 M | social |
| uk-2002 | 18.50 M | 198.10 M | www |
| USA-road | 23.90 M | 29.00 M | transp. |
| webbase-2001 | 118.14 M | 1.02 B | www |

“2D” **Graph** Partitioning Strategy

• Tuned for graphs with unbalanced degree distributions and incremental updates
  – Sort vertices by degree
  – Form roughly M_G/M_P local communities around “high-degree” vertices & partition adjacencies
  – Reorder vertices by degree, assign contiguous chunks to each of the M_G/M_P nodes
  – Assign ownership of any remaining low-degree vertices to processes

• Comparison: 1D p-way partitioning, and 1D p-way partitioning with vertices shuffled
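The degree-based reordering step can be sketched as follows. This is a hypothetical simplification under assumed inputs (an adjacency dictionary, a part count), not the tuned implementation: it only sorts by degree and hands out contiguous chunks, omitting the community-formation and adjacency-partitioning details.

```python
def degree_partition(adj, num_parts):
    """Sort vertices by decreasing degree, then assign contiguous chunks
    of the reordered vertex list to each of the num_parts (M_G/M_P) nodes,
    so high-degree vertices cluster in the low-numbered parts."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    chunk = -(-len(order) // num_parts)        # ceiling division
    return {v: i // chunk for i, v in enumerate(order)}

# Toy graph: vertex 0 is "high-degree"
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
owner = degree_partition(adj, 3)
```

Here vertex 0 (the highest-degree vertex) lands in part 0, and the remaining vertices fill the later parts in degree order.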

Parallel Breadth-First Search Implementation

• Expensive preprocessing (partitioning + reordering) step, currently untimed

[Figure: a 10-vertex example graph before and after reordering from a source vertex, with vertex 4 marked “high-degree”; the two vertex partitions and the corresponding sparse adjacency matrix nonzero structure.]

Parallel BFS Implementation

• Concurrency in each phase limited by the size of the frontier array

• Local computation: inspecting adjacencies, creating a list of unvisited vertices

• Parallel communication step: all-to-all exchange of frontier vertices
  – Potentially P² exchanges
  – Partitioning, replication, and reordering significantly reduce the number of messages
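The local side of each BFS phase can be sketched as a level-synchronous loop. This is a single-process sketch, not the distributed code: the comment marks where the all-to-all frontier exchange would sit in the MPI version.

```python
def level_sync_bfs(adj, source):
    """Level-synchronous BFS: each iteration expands the current frontier
    and gathers unvisited neighbors into the next frontier."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:               # local computation:
            for v in adj.get(u, ()):     # inspect adjacencies,
                if v not in dist:        # collect unvisited vertices
                    dist[v] = level
                    next_frontier.append(v)
        # (distributed version: all-to-all exchange of next_frontier here,
        #  delivering each new vertex to its owning process)
        frontier = next_frontier
    return dist
```

Concurrency per phase is bounded by `len(frontier)`, matching the bullet above.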

Single‐node Multicore Optimizations

1. Software prefetching on Intel Nehalem (supports 32 loads and 20 stores in flight)
   – Speculative loads of the index array and adjacencies of frontier vertices reduce compulsory cache misses.

2. Aligning adjacency lists to optimize memory accesses
   – 16-byte aligned loads and stores are faster.
   – Alignment helps reduce cache misses due to fragmentation.
   – 16-byte aligned non-temporal stores (during creation of the new frontier) are fast.

3. SIMD SSE integer intrinsics to process “high-degree” vertex adjacencies.

4. Fast atomics (BFS is lock-free with low contention, and CAS-based intrinsics have very low overhead).

5. Hugepage support (significant TLB miss reduction).

6. NUMA-aware memory allocation exploiting the first-touch policy.

Parallel Performance

• 32 nodes of NERSC’s Carver system
  – dual-socket, quad-core Intel Nehalem 2.67 GHz processors per node
  – 24 GB DDR3 1333 MHz memory per node, or roughly 768 GB aggregate memory

[Figure: billion edges traversed/second for three parallelization strategies — hybrid, 1D partitioning, and 1D partitioning + randomization — on two inputs: an Orkut crawl (3.07 M vertices, 227 M edges) and a synthetic RMAT network (2 billion vertices, 32 billion edges).]

Single-node performance: 300-500 M traversed edges/second.

Genome Assembly Preliminaries

[Figure: the assembly pipeline — a sample passes through a sequencer to produce DNA sequences/reads (e.g., ACATCGTCTG, TCGCGCTGAA); the genome assembler aligns the reads into contigs of nucleotides (e.g., ACACGTGTGCACTACTGCACTCTACTCCACTGACTA) and then “scaffolds” the contigs to reconstruct the genome, a collection of long strings.]

De novo Genome Assembly

• Genome assembly: “a big jigsaw puzzle”

• De novo: Latin expression meaning “from the beginning”
  – No prior reference organism
  – Computationally falls within the NP-hard class of problems

[Figure: the wet-lab pipeline — DNA extraction, fragment the DNA, clone into vectors, isolate vector DNA, sequence the library — followed by genome assembly: reads CTCTAGCTCTAA, AAGTCTCTAA, AGGTCTCTAA, AAGCTATCTAA assembled into CTCTAGCTCTAAGGTCTCTAACTAAGCTAATCTAA.]

Eulerian path‐based strategies

• Break up the (short) reads into overlapping strings of length k.

  k = 5
  ACGTTATATATTCTA → ACGTT, CGTTA, GTTAT, TTATA, ….. TTCTA
  CCATGATATATTCTA → CCATG, CATGA, ATGAT, TGATA, ….. TTCTA

• Construct a de Bruijn graph (a directed graph representing overlap between strings)
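The read-splitting step above is a one-liner; this sketch just makes the sliding window explicit.

```python
def kmers(read, k):
    """Slide left-to-right over the read, emitting every overlapping
    substring of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# k = 5, first read from the slide:
# kmers("ACGTTATATATTCTA", 5) starts ['ACGTT', 'CGTTA', 'GTTAT', ...]
# and ends with 'TTCTA'
```

A length-L read yields L − k + 1 k-mers, which is why short-read data sets blow up so quickly in memory.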

de Bruijn graphs

• Each (k-1)-mer represents a node in the graph

• An edge exists from node a to node b iff there exists a k-mer whose prefix is a and suffix is b.

[Figure: the de Bruijn graph for the reads AAGACTCCGACTGGGACTTT and ACTCCGACTGGGACTTTGAC, with 3-mer nodes AAG, AGA, GAC, ACT, CTT, TTT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, TGA.]

• Traverse the graph (if possible, identifying an Eulerian path) to form contigs.

• However, the correct assembly is just one of the many possible Eulerian paths.
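The node/edge rule above translates directly into code. A minimal sketch (edge list only, no counts or orientation, which the real representation tracks):

```python
def de_bruijn_edges(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read contributes a directed
    edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))
    return edges

# de_bruijn_edges(["AAGAC"], 4) -> [('AAG', 'AGA'), ('AGA', 'GAC')]
```

Assembly then becomes a path problem over these edges rather than an all-pairs overlap computation between reads.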

Steps in the de Bruijn graph-based assembly scheme

1. Preprocessing (FASTQ input data → sequences after error resolution)
2. Kmer spectrum (determine the appropriate value of k to use)
3. Preliminary de Bruijn graph construction
4. Vertex/edge compaction (lossless transformations)
5. Error resolution + further graph compaction
6. Scaffolding

Steps 3-5 are compute- and memory-intensive.

Graph construction

• Store edges only; represent vertices (kmers) implicitly.

• Distributed graph representation

• Sort by start vertex

• Edge storage format, at 2 bits per nucleotide:

  Read ‘x’: ACTAGGC, CTAGGCA →
  store the edge (ACTAGGCA), orientation, edge direction, edge id (y), originating read id (x), edge count
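The 2-bits-per-nucleotide encoding can be sketched as integer packing. This is an illustrative encoding (the A/C/G/T→0-3 mapping is an assumption; the real format also carries orientation and the other edge fields):

```python
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
DEC = 'ACGT'

def pack(seq):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    x = 0
    for base in seq:
        x = (x << 2) | ENC[base]
    return x

def unpack(x, length):
    """Recover the nucleotide string from its 2-bit packed form."""
    out = []
    for _ in range(length):
        out.append(DEC[x & 3])
        x >>= 2
    return ''.join(reversed(out))

# The slide's 8-base edge ACTAGGCA fits in 16 bits and round-trips.
```

Packed edges are fixed-width integers, so they sort fast and cut memory by 4× versus one byte per base.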

Vertex compaction

• High percentage of unique kmers: try compacting kmers from the same read first
  – If the kmer length is k, potentially a k-times space reduction!

  ACTAG, CTAGG, TAGGA, AGGAC → ACTAGGAC

• Parallelization: the computation can be done locally by sorting by read ID and traversing unit-cardinality kmers.
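The compaction of a run of consecutive k-mers is a simple merge; a minimal sketch, assuming the unit-cardinality run has already been identified:

```python
def compact_run(kmer_run):
    """Merge a run of consecutive k-mers from one read into a single
    string: each k-mer overlaps the next in its last k-1 characters,
    so every merge appends exactly one new character."""
    merged = kmer_run[0]
    for km in kmer_run[1:]:
        # sanity check: the k-mers really are consecutive in the read
        assert merged[-(len(km) - 1):] == km[:-1], "k-mers not consecutive"
        merged += km[-1]
    return merged

# The slide's example: four 5-mers compact into one 8-character string,
# compact_run(["ACTAG", "CTAGG", "TAGGA", "AGGAC"]) -> "ACTAGGAC"
```

A run of m k-mers (m·k characters stored separately) collapses to k + m − 1 characters, which is where the near-k-times space reduction comes from.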

Summary of various steps and analysis

A metagenomic data set (140 million reads, 20 Gbp), k = 45.

| Step | Memory footprint | Approach used | Parallelism & computational kernels |
|---|---|---|---|
| 1. Preprocessing | minimal | Streaming file read and write, kmer merging | “Pleasantly parallel”, I/O intensive |
| 2. Kmer spectrum | ~200 GB | 3 local sorts, 1 AlltoAll communication step | Parallel sort, AlltoAllv |
| 3. Graph construction | ~320 GB | Two sorts | Fully local computation |
| 4. Graph compaction | ~60 GB | 3+ local sorts, 2 AlltoAll communication steps + local graph traversal | AlltoAllv + local computation |
| 5. Error detection | ~35 GB | Connected components | Intensive AlltoAll communication |
| 6. Scaffolding | ? GB | Euler tours over components | Mostly local computation |

Parallel Implementation Details

• The data set under consideration requires 320 GB for in-memory processing
  – NERSC Franklin system [Cray XT4, 2.3 GHz quad-core Opterons, 8 GB memory per node]
  – Experimented with 64 nodes (256-way parallelism) and 128 nodes (512-way)

• MPI across nodes + OpenMP within a node

• Local sort: multicore-parallel radix sort

• Global sort: bin data in parallel + AlltoAll comm. + local sort + AlltoAll comm.
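The global-sort recipe above (bin, exchange, sort locally) can be simulated on one machine. This is a sketch of the idea under assumed inputs — per-rank lists of integer keys and range splitters — not the MPI code; `global_sort` and its arguments are illustrative names.

```python
import bisect

def global_sort(local_data, splitters):
    """Simulate the distributed sort: each 'rank' bins its records by
    destination range (the parallel binning step), an all-to-all exchange
    delivers bucket b to rank b, and each rank sorts what it received."""
    p = len(local_data)                 # splitters has length p - 1
    outgoing = [[[] for _ in range(p)] for _ in range(p)]
    for r, data in enumerate(local_data):
        for x in data:                  # bin by value range
            outgoing[r][bisect.bisect_left(splitters, x)].append(x)
    # all-to-all + local sort: rank b gathers bucket b from every rank
    return [sorted(x for r in range(p) for x in outgoing[r][b])
            for b in range(p)]

# Two "ranks", splitter 4: keys < 4 land on rank 0, the rest on rank 1.
# global_sort([[5, 1, 9], [7, 3, 2]], [4]) -> [[1, 2, 3], [5, 7, 9]]
```

Because the bins are value ranges, concatenating the per-rank outputs yields a globally sorted sequence, and the second AlltoAll in the real pipeline returns records to their original owners.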

Parallel Performance

[Figure: execution time in seconds per assembly step — preprocessing, kmer frequency, graph construction, graph compaction — on 64 nodes (340 seconds total) and 128 nodes (213 seconds total).]

• Comparison: Velvet (an open-source serial code) takes ~12 hours on a 500 GB machine.

Talk Summary

• Two examples of “hybrid” parallel programming for analyzing large-scale graphs
  – Up to 3× faster with hybrid approaches on 32 nodes

• For two different types of graphs, the strategies to achieve high performance differ
  – Social and information networks: low diameter, difficult to generate balanced partitions with low edge cuts
  – DNA fragment string graph: O(N) diameter, multiple connected components

• Single-node multicore optimizations + communication optimization (reducing data volume and the number of messages in the all-to-all exchange)

Acknowledgments

BERKELEY PAR LAB

Thank you!

Questions?

KMadduri@lbl.gov