Outline
1. Introduction to the course
   Code compilation and optimization
   Small loops, always small loops
   Examples of architectural specificities
2. …
3. …
   Interests and problems
   Simple fusion: variants
   Fusion with shifting
Objectives
1. Minimize the critical path for compaction. Equivalently, minimize the maximal duration of a path that has only zero-weight arcs (the clock period). Algorithm of Leiserson and Saxe, complexity $O(VE\log V)$.
2. Minimize the number of constraints for compaction. Equivalently, minimize the number of zero-weight arcs. Variant: with a clock-period constraint. Out-of-kilter algorithm, complexity $O(V^2E)$ or $O(E\,\ldots)$.
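Retiming of a circuit
Circuit: a graph $G=(V,E)$; $w(e)$ is the number of registers on arc $e$, $d(u)$ is the duration of operator $u$.
Retiming: a function $r: V \to \mathbb{Z}$ such that, for every arc $e=(u,v)$, $w_r(e) = w(e) + r(v) - r(u) \ge 0$ (a nonnegative number of registers). $G_r$ is the graph corresponding to the retiming.
Clock period $\Phi(G)$: maximal duration of a register-free path.
Goal: find $r$ such that $\Phi(G_r)$ is minimal.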
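The Leiserson-Saxe algorithm (1/2)
$W(u,v) = \min\{w(P) \mid u \xrightarrow{P} v\}$.
$D(u,v) = \max\{d(P) \mid u \xrightarrow{P} v \text{ and } w(P) = W(u,v)\}$.
Both are computed with Floyd-Warshall (all-pairs shortest paths) in $O(V^3)$.
Property 1: $\Phi(G) = D(u,v)$ for some $u$ and $v$.
Property 2: $\Phi(G) \le c \iff (D(u,v) > c \Rightarrow W(u,v) \ge 1)$.
Property 3: $W_r(u,v) = W(u,v) + r(v) - r(u)$ and $D_r(u,v) = D(u,v)$.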
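The definitions of $W$ and $D$ can be sketched as a Floyd-Warshall pass over pairs (registers, minus delay) compared lexicographically. This is a minimal illustration, not the course's code: the names `compute_W_D` and `clock_period` and the three-vertex example are assumptions, and every circuit is assumed to carry at least one register.

```python
from itertools import product

def compute_W_D(vertices, d, edges):
    """Return (W, D) as dicts keyed by (u, v); unreachable pairs are omitted."""
    INF = float("inf")
    # best[u, v] = (registers on the path, -(delay of all path vertices except v)),
    # minimized lexicographically: fewest registers first, then largest delay.
    best = {(u, v): (INF, 0) for u in vertices for v in vertices}
    for u in vertices:
        best[u, u] = (0, 0)  # empty path from u to u
    for (u, v), w in edges.items():
        best[u, v] = min(best[u, v], (w, -d[u]))
    for k, u, v in product(vertices, repeat=3):  # k is the outermost index
        a, b = best[u, k], best[k, v]
        if a[0] < INF and b[0] < INF:
            best[u, v] = min(best[u, v], (a[0] + b[0], a[1] + b[1]))
    W = {uv: p[0] for uv, p in best.items() if p[0] < INF}
    D = {(u, v): d[v] - best[u, v][1] for (u, v) in W}
    return W, D

def clock_period(W, D):
    """Property 1: the clock period is the largest D(u, v) with W(u, v) = 0."""
    return max(D[uv] for uv in W if W[uv] == 0)

# Hypothetical example: a ring a -> b -> c -> a with two registers on c -> a.
vertices = ["a", "b", "c"]
d = {"a": 3, "b": 7, "c": 7}
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
W, D = compute_W_D(vertices, d, edges)
```

In this example the register-free path a, b, c gives the clock period, and the pair (c, a) crosses the two registers.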
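The Leiserson-Saxe algorithm (2/2)
There exists a retiming $r$ such that $\Phi(G_r) \le c$ if and only if $r(v) - r(u) + W(u,v) \ge 1$ whenever $D(u,v) > c$.
Conclusion: add new arcs (clock arcs) from $u$ to $v$ of weight $W(u,v) - 1$ whenever $D(u,v) > c$, in addition to the initial arcs.
Bellman-Ford algorithm, $O(VE)$: a retiming exists if and only if there is no circuit of negative weight.
Preprocessing: $O(V^3)$.
Total complexity: $O(V^3 \log V)$, using the first two steps inside a binary search over the precomputed quantities $D(u,v)$.
Possible improvement to $O(VE \log V)$.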
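Optimized variant
1. For every vertex $v \in V$, $r(v) = 0$.
2. Repeat $|V| - 1$ times:
   (a) compute $\Delta(v)$, the maximal duration of a register-free path in $G_r$ ending at $v$;
   (b) for every $v$ such that $\Delta(v) > c$, $r(v) = r(v) + 1$.
3. If $\Phi(G_r) > c$, it is impossible to reach $c$; otherwise $r$ is the desired retiming.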
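The three steps of this variant can be sketched as follows. The function names and the example graph are assumptions made for illustration: the edge list is a guess at a correlator-style circuit with a zero-delay host `vh`, four operators of delay 3 and three of delay 7, and it is not taken from the slides.

```python
def path_delays(vertices, d, edges, r):
    """Delta(v): maximal duration of a register-free path of G_r ending at v."""
    zero_succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for (u, v), w in edges.items():
        if w + r[v] - r[u] == 0:          # arc without register after retiming
            zero_succ[u].append(v)
            indeg[v] += 1
    delta = dict(d)
    ready = [v for v in vertices if indeg[v] == 0]
    while ready:                           # topological sweep of the zero-weight DAG
        u = ready.pop()
        for v in zero_succ[u]:
            delta[v] = max(delta[v], delta[u] + d[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return delta

def feasible_retiming(vertices, d, edges, c):
    """Return a retiming with clock period <= c, or None if c is not reached."""
    r = {v: 0 for v in vertices}                        # step 1
    for _ in range(len(vertices) - 1):                  # step 2
        delta = path_delays(vertices, d, edges, r)
        for v in vertices:
            if delta[v] > c:
                r[v] += 1
    if max(path_delays(vertices, d, edges, r).values()) > c:   # step 3
        return None
    return r

# Assumed example circuit (hypothetical edge list).
vertices = ["vh", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
d = {"vh": 0, "v1": 3, "v2": 3, "v3": 3, "v4": 3, "v5": 7, "v6": 7, "v7": 7}
edges = {("vh", "v1"): 1, ("v1", "v2"): 1, ("v2", "v3"): 1, ("v3", "v4"): 1,
         ("v4", "v5"): 0, ("v5", "v6"): 0, ("v6", "v7"): 0, ("v7", "vh"): 0,
         ("v1", "v7"): 0, ("v2", "v6"): 0, ("v3", "v5"): 0}
initial_period = max(path_delays(vertices, d, edges, {v: 0 for v in vertices}).values())
r13 = feasible_retiming(vertices, d, edges, 13)
r12 = feasible_retiming(vertices, d, edges, 12)
```

On this assumed graph the initial period is 24, a target of 13 is reached, and a target of 12 is not. Running the feasibility test inside a binary search over candidate periods gives the minimal clock period.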
Leiserson-Saxe circuit retiming (1991)
Delay operators: move registers from outgoing to incoming edges. Retiming changes the number of registers and the clock period of the circuit.
Example (figure: circuit with three operators of delay 7 and four operators of delay 3; the +1 and +2 labels mark the retimed vertices):
Initial circuit: $\Phi(G) = 24$.
First retiming to apply, targeting $\Phi(G) = 13$: +1 on three vertices. First retiming applied: $\Phi(G) = 17$.
Second retiming to apply, targeting $\Phi(G) = 13$: +1 on three vertices. Second retiming applied: $\Phi(G) = 17$.
Complete optimal retiming (+2, +2, +1, +1): $\Phi(G) = 13$.
Nice $O(VE \log V)$ algorithm for clock period minimization. Similar concepts are used in software pipelining and loop shifting.
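Back to software pipelining (Leiserson-Saxe retiming + compaction by list scheduling)
Schedule: $\sigma(v,k) = s(v) + (r(v)+k)\,\Phi(G_r)$, with $s(v)$ the maximal duration of a register-free path of $G_r$ leading to $v$.
With $p$ resources: $\lambda \le (1-\frac{1}{p})\,d(P) + \lambda_p \le (1-\frac{1}{p})\,\Phi_{opt}(G) + \lambda_p$.
For an optimal schedule $\sigma_\infty(v,k) = s(v) + \lambda_\infty\,(r(v)+k)$ with $0 \le s(v) < \lambda_\infty$, every register-free path $P = (u_1, \dots, u_n)$ of $G_r$ satisfies $d(P) \le s(u_n) + d(u_n) - s(u_1) < \lambda_\infty + \max\{d(u) \mid u \in V\}$.
Hence $\Phi_{opt}(G) \le \lambda_\infty + \max\{d(u) - 1 \mid u \in V\}$, and therefore $\lambda \le (2-\frac{1}{p})\,\lambda_p + \max\{d(u) - 1 \mid u \in V\}$.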
Linear programming
$\min\ 11t + 10u$ subject to $2t + 3u \ge 5$, $3t + 2u \ge 4$, $5t + u \ge 12$, $t \ge 0$, $u \ge 0$
$= \max\ 5x + 4y + 12z$ subject to $2x + 3y + 5z \le 11$, $3x + 2y + z \le 10$, $x, y, z \ge 0$.
Problem 1 (the doped athlete): buy the cheapest mix of dopants $t$ and $u$ (priced 11 and 10 respectively) that supplies enough of 3 base elements (at least 5, 4 and 12 respectively), knowing how much of each element each dopant provides.
Problem 2 (the dealer): sell the three base elements at the highest possible price while remaining cheaper than each of the dopants.
Linear programming
Duality theorem: $\min\{c \cdot x \mid Ax \ge b,\ x \ge 0\} = \max\{y \cdot b \mid yA \le c,\ y \ge 0\}$.
Complexity. Rational solution: polynomial time (L. Khachiyan). Integer solution: NP-complete, and only an inequality holds between the two optima.
Algorithms: simplex algorithms, basis reduction algorithms in small dimensions (Lenstra), parametric linear programming (right-hand side).
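The duality theorem can be checked on the dopant example with exact arithmetic: a feasible primal point and a feasible dual point with the same objective value certify each other's optimality. The sketch below assumes the dopant prices are 11 and 10; the points `x` and `y` were found by hand and the helper names are illustrative.

```python
from fractions import Fraction

A = [[2, 3], [3, 2], [5, 1]]   # A[i][j]: amount of element i supplied by dopant j
b = [5, 4, 12]                 # required amount of each element
c = [11, 10]                   # assumed price of each dopant

def primal_feasible(x):
    """Check Ax >= b and x >= 0."""
    return all(xj >= 0 for xj in x) and all(
        sum(A[i][j] * x[j] for j in range(2)) >= b[i] for i in range(3))

def dual_feasible(y):
    """Check yA <= c and y >= 0."""
    return all(yi >= 0 for yi in y) and all(
        sum(y[i] * A[i][j] for i in range(3)) <= c[j] for j in range(2))

x = [Fraction(31, 13), Fraction(1, 13)]   # candidate primal solution (t, u)
y = [3, 0, 1]                             # candidate dual solution (x, y, z)
primal_value = sum(c[j] * x[j] for j in range(2))
dual_value = sum(y[i] * b[i] for i in range(3))
```

Both values coincide, so neither the buyer nor the dealer can do better under these assumed prices.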
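Note on the number of registers in a circuit
$S(G_r) = \sum_{e \in E} w_r(e) = \sum_{e=(u,v)} (w(e) + r(v) - r(u)) = S(G) + \sum_{v \in V} r(v)\,(\mathrm{indegree}(v) - \mathrm{outdegree}(v))$, where $S(G_r)$ is the number of registers after retiming by $r$.
Integer linear program, with $c(v) = \mathrm{indegree}(v) - \mathrm{outdegree}(v)$:
$\min\{\sum_{v \in V} r(v)\,c(v) \mid \forall e=(u,v) \in E,\ w(e) + r(v) - r(u) \ge 0\}$
Dual (a flow problem):
$\max\{-\sum_{e \in E} f(e)\,w(e) \mid \forall v \in V,\ \sum_{e=(\cdot,v)} f(e) - \sum_{e=(v,\cdot)} f(e) = c(v);\ \forall e \in E,\ f(e) \ge 0\}$
The constraint matrix is the incidence matrix of the graph, so the program is solvable in polynomial time.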
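Primal-dual approach
Retiming: a function $r: V \to \mathbb{Z}$ such that $\forall e=(u,v),\ r(v) - r(u) + w(e) \ge 0$.
Positive flow (union of circuits): a function $f: E \to \mathbb{Z}$, $f \ge 0$, such that $\forall v \in V,\ \sum_{e \text{ entering } v} f(e) = \sum_{e \text{ leaving } v} f(e)$. A vertex is entered as many times as it is left.
Principle: assign costs $\Gamma(r)$ and $\Sigma(f)$ to retimings and flows so that:
$\Gamma(r) \ge \Sigma(f)$ for all $r$ and $f$;
there exist $r$ and $f$ such that $\Gamma(r) = \Sigma(f)$.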
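Example: minimizing the number of zero-weight arcs
Retiming: let $v_r(e) = 1$ if $w_r(e) = 0$ and $v_r(e) = 0$ if $w_r(e) > 0$. We seek to minimize $\Gamma(r) = \sum_{e \in E} v_r(e)$.
Positive flow: let $z_f(e) = 0$ if $f(e) = 0$ and $z_f(e) = 1 - f(e)\,w_r(e)$ otherwise. We seek to maximize $\Sigma(f) = \sum_{e \in E} z_f(e)$.
Note: the maximization version is NP-complete (in the strong sense).
Property 1: $\sum_{e \in E} f(e)\,w(e) = \sum_{e \in E} f(e)\,w_r(e)$, so $\Sigma(f)$ does not depend on $r$.
Property 2: $v_r(e) \ge z_f(e)$. Conformity index: $ic(e) = v_r(e) - z_f(e) \ge 0$.
Property 3: $\Gamma(r) = \Sigma(f)$ ($r$ and $f$ optimal) $\iff \forall e \in E,\ ic(e) = 0$.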
Coloring of the arcs
Goal: $i_{f,r}(e) = 0$ for every arc $e$. The arcs are colored to indicate in which way $i_{f,r}(e)$ can decrease.
Conformity diagram (figure): each arc is colorless, green, black or red, depending on $w_r(e)$ and $f(e)$.
Decrease $i_{f,r}(e)$ on every non-conforming arc while keeping $i_{f,r}(e) = 0$ on the others.
Start from the zero flow and the zero shift: the non-conforming arcs are initially the zero-weight arcs (black).
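Minty's lemma
Let $G = (V,E)$ be a directed graph whose arcs are black, green, red or colorless. Let $e$ be a black arc. Exactly one of the following 2 propositions is true:
(a) There exists a cycle containing $e$, without colorless arcs, in which all black arcs have the same direction and all green arcs have the opposite direction.
(b) There exists a co-cycle containing $e$, without red arcs, in which all black arcs have the same direction and all green arcs have the opposite direction.
The proof is constructive, by simple marking: starting from $e$, follow the black arcs in the direction of $e$, the green arcs in the opposite direction, and the red arcs in both directions.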
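Algorithm
Color the arcs according to the conformity diagram for $r = f = 0$.
While non-conforming arcs exist:
1. Choose a non-conforming arc $e = (u,v)$ and apply Minty's lemma:
   (a) cycle: increment (resp. decrement) the flow of the arcs of the cycle that are oriented like $e$ (resp. opposite to $e$);
   (b) co-cycle: increment the shift of the vertices of the subset, defined by the co-cycle, that contains $v$.
   This makes $e$ conforming without creating any non-conforming arc.
2. Recolor the arcs according to the conformity diagram and the new $r$ and $f$.
Complexity: $O(|E|\,(|E| + |V|))$. Clock arcs can be added.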
Small example (figure): graph on vertices a, b, c, d, shown after two steps, and the result on the example. Legend: unit flow; non-conforming arc (black).
… and lifetimes. Example: Maxlive
Maxlive = maximum number of simultaneously live values.
Setting: dependence graph + fixed allocation (figure: kernel with operations a, b, c repeated along the time axis).
Maxlive = 3 (after b and before c).
Maxlive = 2 with the same allocation when b is delayed by 1.
Retiming changes both Maxlive and FIFO sizes (if used).
FIFO sizes: general constraints
More accurate use of memory. For each $e = (u,v) \in E$, define $t_{out}(u)$ and $t_{in}(e)$ such that:
if $t \ge \sigma(u,i) + t_{out}(u)$, the result of $(u,i)$ is available;
if $t \ge \sigma(v,i) + t_{in}(e)$, the value was already read by $(v,i)$.
By definition of the delays $d$, $d(e) > t_{out}(u) - t_{in}(e)$.
Reminder: for a modulo schedule $\sigma(u,i) = \lambda i + \rho_u$, the dependence constraint is $\lambda\,w(e) + \rho_v - \rho_u \ge d(e)$.
The FIFO needs at least $k+1$ locations if $\sigma(u,i+k) + t_{out}(u) \le \sigma(v,i+w(e)) + t_{in}(e)$, i.e., $\lambda k + \rho_u + t_{out}(u) \le \lambda\,w(e) + \rho_v + t_{in}(e)$.
Hence a FIFO of size at least $1 + w(e) + \lfloor (\rho_v - \rho_u + t_{in}(e) - t_{out}(u))/\lambda \rfloor$.
Linear system
Constraints. With retiming $r(u)$: $\rho_u = \lambda\,r(u) + s_u$, with $0 \le s_u < \lambda$ ($s_u$ is fixed).
Dependences: $r(v) - r(u) \ge \lceil (d(e) + s_u - s_v)/\lambda \rceil - w(e) = d_1(e)$.
FIFO size: $1 + w(e) + \lfloor (\rho_v - \rho_u + t_{in}(e) - t_{out}(u))/\lambda \rfloor = r(v) - r(u) + d_2(e)$.
Linear program: $\min\{\sum_{u \in V} M(u) \mid r(v) - r(u) \ge d_1(e),\ M(u) \ge r(v) - r(u) + d_2(e)\}$.
Variant: add a new vertex $u'$ for each $u$, with $r(u') = M(u) + r(u)$: $\min\{\sum_{u \in V} (r(u') - r(u)) \mid r(v) - r(u) \ge d_1(e),\ r(u') - r(v) \ge d_2(e)\}$.
Connection matrix $\Rightarrow$ totally unimodular $\Rightarrow$ polynomial problem.
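The constants $d_1(e)$ and $d_2(e)$ can be computed with integer arithmetic. In this sketch the sign of the $t_{in}$ and $t_{out}$ terms follows from the condition $\sigma(u,i+k) + t_{out}(u) \le \sigma(v,i+w(e)) + t_{in}(e)$; the function names and the numeric values are assumptions chosen for illustration.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers, b > 0."""
    return -(-a // b)

def d1(d_e, s_u, s_v, w_e, lam):
    """Lower bound on r(v) - r(u) imposed by the dependence e = (u, v)."""
    return ceil_div(d_e + s_u - s_v, lam) - w_e

def d2(t_out_u, t_in_e, s_u, s_v, w_e, lam):
    """FIFO size along e equals r(v) - r(u) + d2(e)."""
    return 1 + w_e + (s_v - s_u + t_in_e - t_out_u) // lam

def fifo_size(rho_u, rho_v, t_out_u, t_in_e, w_e, lam):
    """FIFO size computed directly from the offsets rho of the modulo schedule."""
    return 1 + w_e + (rho_v - rho_u + t_in_e - t_out_u) // lam

# Hypothetical edge: lambda = 4, w(e) = 1, d(e) = 2, t_out(u) = 2, t_in(e) = 0.
lam, w_e, d_e, t_out_u, t_in_e = 4, 1, 2, 2, 0
s_u, s_v, r_u, r_v = 1, 3, 0, 2
rho_u, rho_v = lam * r_u + s_u, lam * r_v + s_v
size_direct = fifo_size(rho_u, rho_v, t_out_u, t_in_e, w_e, lam)
size_via_d2 = r_v - r_u + d2(t_out_u, t_in_e, s_u, s_v, w_e, lam)
```

Both ways of computing the size agree because $\rho_v - \rho_u = \lambda\,(r(v) - r(u)) + s_v - s_u$, and the integer part $r(v) - r(u)$ moves out of the floor.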
With shared storage, e.g., a register bank: Maxlive
Need to consider each time $t$ modulo $\lambda$, actually all $\sigma(v,i) + t_{in}(e)$.
Produced at time $t$: $prod(e,t) = \{i \in \mathbb{N} \mid \sigma(u,i) + t_{out}(u) \le t\}$.
Consumed at $t$: $cons(e,t) = \{i \in \mathbb{N} \mid \sigma(v,i+w(e)) + t_{in}(e) \le t\}$.
Live along $e$ at $t$: $live(e,t) = |prod(e,t)| - |cons(e,t)|$.
Live at $t$: $live(u,t) = |prod(e,t)| - \min_{e=(u,v)} |cons(e,t)|$.
Note: a) $|cons(e,t)|$ is minimal when $\sigma(v,i+w(e)) + t_{in}(e)$ is maximal (last reader); b) $v$ can be forced to be the last reader with similar constraints: $\sigma(v,i+w(e)) + t_{in}(e) \ge \sigma(v',i+w(e')) + t_{in}(e')$.
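Inequalities with retiming
Constraints:
$\sigma(u,i) + t_{out}(u) \le t \iff i \le \lfloor (t - t_{out}(u) - s_u)/\lambda \rfloor - r(u)$.
$\sigma(v,i+w(e)) + t_{in}(e) \le t \iff i \le \lfloor (t - t_{in}(e) - s_v)/\lambda \rfloor - w(e) - r(v)$.
$live(e,t) = w_r(e) + s(e,t)$ with $w_r(e) = w(e) + r(v) - r(u)$ and $s(e,t) = \lfloor (t - t_{out}(u) - s_u)/\lambda \rfloor - \lfloor (t - t_{in}(e) - s_v)/\lambda \rfloor$.
Linear program:
$Opt_1 = \text{minimize } \max_{t \in [0..\lambda[} \sum_{u \in V} \max_{e=(u,v)} (w_r(e) + s(e,t))$
$= \min\{M \mid r(v) - r(u) \ge d_1(e)\ \forall e \in E;\ M(u,t) \ge r(v) - r(u) + w(e) + s(e,t)\ \forall u \in V,\ \forall e=(u,v),\ \forall t;\ M \ge \sum_{u \in V} M(u,t)\ \forall t\}$.
Strongly NP-complete.
If there is only one last reader (along $e_u$) for each $u$: polynomial, because $\max_t \sum_u (w_r(e_u) + s(e_u,t)) = \sum_u w_r(e_u) + \max_t \sum_u s(e_u,t)$.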
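The closed form $live(e,t) = w_r(e) + s(e,t)$ can be compared with a direct count of produced and consumed values. This is a sketch under assumptions: the parameter values are hypothetical, the schedule is $\sigma(u,i) = \lambda\,(i + r(u)) + s_u$, and $t$ is taken large enough to be in steady state.

```python
def live_closed(t, lam, w, r_u, r_v, s_u, s_v, t_out, t_in):
    """live(e, t) from the closed form w_r(e) + s(e, t)."""
    w_r = w + r_v - r_u
    s = (t - t_out - s_u) // lam - (t - t_in - s_v) // lam
    return w_r + s

def live_enum(t, lam, w, r_u, r_v, s_u, s_v, t_out, t_in, horizon=200):
    """live(e, t) by enumerating prod(e, t) and cons(e, t) over i < horizon."""
    def sigma_u(i):
        return lam * (i + r_u) + s_u
    def sigma_v(i):
        return lam * (i + r_v) + s_v
    prod = {i for i in range(horizon) if sigma_u(i) + t_out <= t}
    cons = {i for i in range(horizon) if sigma_v(i + w) + t_in <= t}
    return len(prod - cons)

# Hypothetical edge parameters.
params = dict(lam=4, w=1, r_u=0, r_v=2, s_u=1, s_v=3, t_out=1, t_in=0)
closed = [live_closed(t, **params) for t in range(100, 120)]
counted = [live_enum(t, **params) for t in range(100, 120)]
```

The two lists coincide; over one period the count oscillates between $w_r(e)$ and $w_r(e) + 1$ for these values.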
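Approximation up to $|V|$
$live(e,t) = w_r(e) + s(e,t)$, with $s(e,t) = \lfloor (t - t_{out}(u) - s_u)/\lambda \rfloor - \lfloor (t - t_{in}(e) - s_v)/\lambda \rfloor$.
$t_{out}(u) + s_u = \lambda\,\alpha_u + \beta_u$ with $0 \le \beta_u < \lambda$.
$t_{in}(e) + s_v = \lambda\,\delta_e + \gamma_e$ with $0 \le \gamma_e < \lambda$.
$s(e,t) = -\alpha_u + \delta_e + s'(e,t)$ with $s'(e,t) = \lfloor (t - \beta_u)/\lambda \rfloor - \lfloor (t - \gamma_e)/\lambda \rfloor$.
If $\beta_u = \gamma_e$, $s'(e,t) = 0$. Let $s_{min}(e) = 0$.
If $\beta_u < \gamma_e$, $s'(e,t) = 0$ or $1$. Let $s_{min}(e) = 0$.
If $\beta_u > \gamma_e$, $s'(e,t) = -1$ or $0$. Let $s_{min}(e) = -1$.
This leads to the polynomially solvable linear program:
$Opt_2 = \min\{\sum_{u \in V} M(u) \mid r(v) - r(u) \ge d_1(e)\ \forall e \in E;\ M(u) \ge r(v) - r(u) + w(e) - \alpha_u + \delta_e + s_{min}(e)\ \forall u \in V,\ \forall e = (u,v)\}$
$Opt_2 \le Opt_1 \le Opt_2 + |V|$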
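The decomposition behind the approximation can be checked numerically: for every $t$, $s(e,t)$ is either $-\alpha_u + \delta_e + s_{min}(e)$ or one more, which is where the additive gap of at most one per vertex comes from. The function names and sample values below are assumptions.

```python
def decompose(t_out, s_u, t_in, s_v, lam):
    """Return (alpha_u, beta_u, delta_e, gamma_e, s_min) for one edge."""
    alpha, beta = divmod(t_out + s_u, lam)
    delta, gamma = divmod(t_in + s_v, lam)
    s_min = -1 if beta > gamma else 0
    return alpha, beta, delta, gamma, s_min

def s(t, t_out, s_u, t_in, s_v, lam):
    """s(e, t) as defined for live(e, t)."""
    return (t - t_out - s_u) // lam - (t - t_in - s_v) // lam

def within_one(t_out, s_u, t_in, s_v, lam):
    """Check lower <= s(e, t) <= lower + 1 over several periods."""
    alpha, _, delta, _, s_min = decompose(t_out, s_u, t_in, s_v, lam)
    lower = -alpha + delta + s_min
    return all(lower <= s(t, t_out, s_u, t_in, s_v, lam) <= lower + 1
               for t in range(0, 3 * lam))

# Two hypothetical edges: one with beta < gamma, one with beta > gamma.
case_less = decompose(t_out=1, s_u=1, t_in=0, s_v=3, lam=4)
case_greater = decompose(t_out=6, s_u=1, t_in=0, s_v=2, lam=4)
```

Replacing $s(e,t)$ by its lower bound removes the dependence on $t$, which is what makes the relaxed program tractable.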
Loop fusion: interest and problems
Why? How? When?
Increase the size of basic blocks.
Useful for both spatial and temporal locality.
Useful for array contraction.
Be careful of over-fusion.
Polynomial algorithms? NP-completeness? Heuristics?
It is always difficult to find a cost model, but fusion is a complementary tool for more general transformations.
A few references for the basics
Kennedy and McKinley. Typed Fusion with Applications to Parallel and Sequential Code Generation.
McKinley and Kennedy. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution.
Roth and Kennedy. Loop Fusion in High Performance Fortran.
On the Complexity of Loop Fusion.
Array contraction
Goal: reduce the size of a local array, possibly into a scalar.
Before contraction:
    do i=2,n
      a(i) = d(i) + 1
      b(i) = a(i)/2
      c(i) = b(i) + b(i-1)
    enddo
After contraction of a into a scalar:
    do i=2,n
      a = d(i) + 1
      b(i) = a/2
      c(i) = b(i) + b(i-1)
    enddo
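The equivalence of the two loops can be illustrated with a small Python transcription. It is only a sketch: indices are shifted to start at 0, the initial value `b0` of the first element of `b` is an assumption, and `a` is contracted because each `a(i)` is used only in the iteration that produces it, whereas `b` cannot be contracted because of the reference to `b(i-1)`.

```python
def original(d, b0):
    """Loop with a full temporary array a."""
    n = len(d)
    a = [0] * n
    b = [b0] + [0] * (n - 1)
    c = [0] * n
    for i in range(1, n):
        a[i] = d[i] + 1
        b[i] = a[i] / 2
        c[i] = b[i] + b[i - 1]
    return b, c

def contracted(d, b0):
    """Same loop with a contracted into a scalar."""
    n = len(d)
    b = [b0] + [0] * (n - 1)
    c = [0] * n
    for i in range(1, n):
        a = d[i] + 1          # a is consumed in the same iteration
        b[i] = a / 2
        c[i] = b[i] + b[i - 1]
    return b, c

d_values = [0, 1, 3, 5]
result_original = original(d_values, 0)
result_contracted = contracted(d_values, 0)
```

Both versions return the same `b` and `c`, while the contracted one uses constant storage for `a`.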
Array assignments and forall loops
    A = B + C
    B = D + 1
    A[1:n] = A[0:n-1] + A[2:n+1]
equivalent to:
    forall(i=0,n+1) A(i) = B(i) + C(i) endforall
    forall(i=0,n+1) B(i) = D(i) + 1 endforall
    forall(i=1,n) A(i) = A(i-1) + A(i+1) endforall
and, then, to:
    doall(i=0,n+1) A(i) = B(i) + C(i) enddoall
    doall(i=0,n+1) B(i) = D(i) + 1 enddoall
    doall(i=0,n+1) A'(i) = A(i) enddoall
    doall(i=1,n) A(i) = A'(i-1) + A'(i+1) enddoall
Partial loop distribution
    DO i=1,n
      A(i) = 2*A(i) + 1
      B(i) = C(i-1) + A(i)
      C(i) = C(i-1) + G(i)
      D(i) = D(i-1) + A(i) + C(i-1)
      E(i) = E(i-1) + B(i)
      F(i) = D(i) + B(i-1)
    ENDDO
Dependence graph (figure): vertices A, B, C, D, E, F; the loop-carried dependences are labeled with distance 1.
Full distribution gives:
    DOPAR i=1,n
      A(i) = 2*A(i) + 1
    ENDDOPAR
    DOSEQ i=1,n
      C(i) = C(i-1) + G(i)
    ENDDOSEQ
    DOPAR i=1,n
      B(i) = C(i-1) + A(i)
    ENDDOPAR
    DOSEQ i=1,n
      E(i) = E(i-1) + B(i)
    ENDDOSEQ
    DOSEQ i=1,n
      D(i) = D(i-1) + A(i) + C(i-1)
    ENDDOSEQ
    DOPAR i=1,n
      F(i) = D(i) + B(i-1)
    ENDDOPAR
How to get the following?
    DOSEQ i=1,n
      C(i) = C(i-1) + G(i)
    ENDDOSEQ
    DOPAR i=1,n
      A(i) = 2*A(i) + 1
      B(i) = C(i-1) + A(i)
    ENDDOPAR
    DOSEQ i=1,n
      D(i) = D(i-1) + A(i) + C(i-1)
      E(i) = E(i-1) + B(i)
    ENDDOSEQ
    DOPAR i=1,n
      F(i) = D(i) + B(i-1)
    ENDDOPAR
Simple loop fusion/distribution
Original loop:
    DO i=2, n
      a(i) = f(i)
      b(i) = g(i)
      c(i) = a(i-1) + b(i)
      d(i) = a(i) + b(i-1)
      e(i) = d(i-1) + d(i)
    ENDDO
After distribution and maximal fusion into parallel loops:
    DOPAR i=2, n
      a(i) = f(i)
      b(i) = g(i)
    ENDDOPAR
    DOPAR i=2, n
      c(i) = a(i-1) + b(i)
      d(i) = a(i) + b(i-1)
    ENDDOPAR
    DOPAR i=2, n
      e(i) = d(i-1) + d(i)
    ENDDOPAR
Dependence graph (figure): vertices A, B, C, D, E; an edge labeled 1 prevents fusion, an unlabeled edge is a simple precedence.
Easy: compute longest dependence paths. A = 1, B = 1, C = 2 = max(1 + 1, 1 + 0), D = 2 = max(1 + 1, 1 + 0), E = 3 = 2 + 1.
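The longest-path computation on this example can be sketched as follows: a fusion-preventing edge counts 1, a simple precedence counts 0, and statements with the same level end up in the same parallel loop. The edge list is read off the loop body (a reference such as `a(i-1)` is treated as fusion-preventing), and the function name is illustrative.

```python
from collections import defaultdict

def fusion_levels(nodes, edges):
    """edges: list of (u, v, prevents_fusion). Returns (level, groups)."""
    succs = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v, prevents in edges:
        succs[u].append((v, 1 if prevents else 0))
        indeg[v] += 1
    level = {v: 1 for v in nodes}
    ready = [v for v in nodes if indeg[v] == 0]
    while ready:                      # topological order of the acyclic graph
        u = ready.pop()
        for v, w in succs[u]:
            level[v] = max(level[v], level[u] + w)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    groups = defaultdict(list)
    for v in nodes:
        groups[level[v]].append(v)
    return level, dict(groups)

nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "C", True),   # c(i) reads a(i-1)
         ("B", "C", False),  # c(i) reads b(i)
         ("A", "D", False),  # d(i) reads a(i)
         ("B", "D", True),   # d(i) reads b(i-1)
         ("D", "E", True)]   # e(i) reads d(i-1)
level, groups = fusion_levels(nodes, edges)
```

The three groups correspond to the three DOPAR loops of the example.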
Maximal unordered typed fusion
Constraint: two loops of different types (colors) cannot be fused.
Arbitrary number of colors: NP-complete (McKinley & Kennedy, 1994), even for chains of length 2.
Ordered typed fusion: polynomial, $O(T(V+E))$ (see below).
Fixed number of chains: polynomial, $O(d\,N^d)$ for $d$ chains of length at most $N$, by dynamic programming.
Two colors, no fusion-preventing edges: polynomial, $O(V+E)$.
Three colors, no fusion-preventing edges: NP-complete.
Two colors, fusion-preventing edges: NP-complete.
Two colors, fusion-preventing edges for one color only: NP-complete. Example: partial loop distribution for parallelism detection.
Ordered typed fusion
Figure: (a) original graph, (b) fuse black then white, (c) fuse white then black.
Optimization of type $T$:
$W(v)$: minimal number of loops of type $T$ that have to be placed before $v$.
$W(v) = 0$ if $v$ has no predecessor.
For $e = (u,v)$, $w(e) = 1$ if $T(u) = T$ and either $T(v) \ne T$ or $e$ is fusion-preventing. Otherwise $w(e) = 0$.
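One step of ordered typed fusion can be sketched by propagating $W$ along the dependence graph. The recurrence $W(v) = \max_{e=(u,v)} (W(u) + w(e))$ is an assumption consistent with the definitions above, and the four-loop example and the function name are hypothetical.

```python
def typed_fusion_levels(nodes, type_of, edges, T):
    """edges: list of (u, v, fusion_preventing). Returns (W, groups of type T)."""
    succs = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v, preventing in edges:
        # w(e) = 1 if u has type T and v cannot share u's fused loop
        w = 1 if type_of[u] == T and (type_of[v] != T or preventing) else 0
        succs[u].append((v, w))
        indeg[v] += 1
    W = {v: 0 for v in nodes}
    ready = [v for v in nodes if indeg[v] == 0]
    while ready:
        u = ready.pop()
        for v, w in succs[u]:
            W[v] = max(W[v], W[u] + w)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    groups = {}
    for v in nodes:
        if type_of[v] == T:
            groups.setdefault(W[v], []).append(v)   # same W: fused together
    return W, groups

nodes = ["n1", "n2", "n3", "n4"]
type_of = {"n1": "black", "n2": "white", "n3": "black", "n4": "black"}
edges = [("n1", "n2", False), ("n2", "n3", False)]
W, groups = typed_fusion_levels(nodes, type_of, edges, "black")
```

Here `n3` cannot join `n1` because the white loop `n2` sits between them, so the black loops split into two fused loops. The same procedure would then be run for the next type on the graph obtained after fusing the first one.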
Figure, black then white: (a) weighted graph for fusing black, (b) fusion of black, (c) weighted graph for fusing white after black, (d) fusion of white.
Figure, white then black: (a) weighted graph for fusing white, (b) fusion of white, (c) weighted graph for fusing black after white, (d) fusion of black.