casesandybridge

sandybridge  时间:2021-03-27  阅读:()
PLQCDlibraryforLatticeQCDonmulti-coremachinesA.
Abdel-Rehim,aC.
Alexandrou,a,bN.
Anastopoulos,cG.
Koutsou,aI.
LiabotisdandN.
PapadopouloucaTheCyprusInstitute,CaSToRC,20KonstantinouKavaStreet,2121Aglantzia,Nicosia,CyprusbDepartmentofPhysics,UniversityofCyprus,P.
O.
Box20537,1678Nicosia,CypruscComputingSystemsLaboratory,SchoolofElectricalandComputerEngineering,NationalTechnicalUniversityofAthens,ZografouCampus,15773Zografou,Athens,GreecedGreekResearchandTechnologyNetwork,56MesogionAv.
,11527,Athens,GreeceE-mail:a.
abdel-rehim@cyi.
ac.
cy,c.
alexandrou@cyi.
ac.
cy,g.
koutsou@cyi.
ac.
cy,anastop@cslab.
ece.
ntua.
gr,iliaboti@grnet.
gr,nikela@cslab.
ece.
ntua.
grPLQCDisastand–alonesoftwarelibrarydevelopedunderPRACEforlatticeQCD.
Itpro-videsanimplementationoftheDiracoperatorforWilsontypefermionsandfewefcientlin-earsolvers.
Thelibraryisoptimizedformulti-coremachinesusingahybridparallelizationwithOpenMP+MPI.
ThemainobjectivesofthelibraryistoprovideascalableimplementationoftheDiracoperatorforefcientcomputationofthequarkpropagator.
Inthiscontribution,adescrip-tionofthePLQCDlibraryisgiventogetherwithsomebenchmarkresults.
31stInternationalSymposiumonLatticeFieldTheoryLATTICE2013July29August3,2013Mainz,GermanySpeaker.
cCopyrightownedbytheauthor(s)underthetermsoftheCreativeCommonsAttribution-NonCommercial-ShareAlikeLicence.
http://pos.
sissa.
it/arXiv:1405.
0700v1[hep-lat]4May2014PLQCDA.
Abdel-Rehim1.
IntroductionComputerhardwareforcommodityclustersaswellassupercomputershasevolvedtremen-douslyinthelastfewyears.
Nowadaysatypicalcomputenodehasbetween16and64coresandpossiblyanacceleratorsuchasaGraphicsProcessingUnit(GPU)orlatelyanIntelManyIntegratedCore(MIC)card.
Thistrendofpackingmanylow-poweredbutmassivelyparallelpro-cessingunitsisexpectedtocontinueassupercomputingtechnologypursuestheExascaleregime.
Thecurrenttechnologytrendsindicatethatbandwidthtomainmemorywillcontinuetolagbehindcomputationalpower,whichrequiresarethinkingofthedesignoflatticeQCDcodessuchthattheycanefcientlyrunonsucharchitectures.
Takingthisintoaccount,PRACE[1]allocatedresourcesforcommunitycodescalingactivitiesinmanycomputationallyintensiveareasincludinglatticeQCD.
TheworkpresentedherewasdevelopedunderPRACEfocusingonscalingcodesformulti-coremachines.
Theworkwepresentdealswithcommunitycodes,andmorespecicallyoncertaincomputationallyintensivekernelsinthesecodes,inordertoimprovetheirscalingandperformanceformulti-corearchitectures.
WehavecarriedoutoptimizationworkonthetmLQCD[2,3]codeandhavedevelopedanewhybridMPI/OpenMPlibrary(PLQCD)withoptimizedimplementationsoftheWilsonDirackernelandaselectedsetoflinearsolvers.
OurpartnersinthisprojecthavealsoperformedoptimizationworkfortheMolecularDynamicsintegratorsusedinHybridMonteCarlocodes,andalsoforLandaugaugexing.
ThiswasdonewithintheChromasoftwaresuite[4]andwillnotbediscussedhere(See[5]formoreinformation).
Manyothercommunitycodesofcourseexistbutwerenotconsideredinthiswork(See[6]foranoverview).
Inwhatfollows,wewillrstpresenttheworkcarriedoutforthecaseofPLQCD,whereweim-plementedtheWilsonDiracoperatorandassociatedlinearalgebrafunctionsusingMPI+OpenMP.
Inadditiontousingthishybridapproachforparallelism,wealsoimplementadditionaloptimiza-tionssuchasoverlappingcommunicationandcomputation,usingcompilerintrinsicsforvector-izationaswellasimplementingthenewAdvancedVectorInstructions[7](AVXforIntelorQPXforBlue/GeneQ)thatbecamerecentlyavailableinnewgenerationofprocessorssuchastheIntelSandy-Bridge.
TheworkdoneforthecaseofthetmLQCDpackagewillthenbepresented,whereweimplementedsomenewefcientlinearsolvers,inparticularthosebasedondeationsuchastheEigCGsolver[8],forwhichwewillgivesomebenchmarkresults.
2.
DiracoperatoroptimizationsAkeycomponentofthelatticeDiracoperatoristhehoppingpartgivenbyEq.
2.
1.
ψ(x)=3∑=0[U(x)(1γ)φ(x+e)+U(xe)(1+γ)φ(xe)],(2.
1)where,U(x)isthegaugelinkmatrixinthedirectionatsitex,γaretheDiracmatricesandeisaunitvectorinthedirection.
φandψaretheinputandoutputspinorsrespectively.
Equation2.
1canbere-writtenintermsoftwoauxiliaryeldsθ+(x)=(1γ)φ(x)andθ(x)=U(x)(1+γ)φ(x)asψ(x)=3∑=0[U(x)θ+(x+e)+θ(xe)].
(2.
2)2PLQCDA.
Abdel-RehimBecauseofthestructureoftheγmatrices,onlytheuppertwospincomponentsofθ±needtobecomputedbecausethelowertwospincomponentsarerelatedtotheupperones[9].
Inthefollowingwedescribesomeoftheoptimizationsperformedforthehoppingmatrix.
2.
1HybridparallelizationwithMPIandOpenMPOpenMPprovidesasimpleapproachformulti-threadingsinceitisimplementedascompilerdirectives.
Onecanincrementallyaddmulti-threadingtothecodeandalsousethesamecodewithmulti-threadingturnedonandoff.
Sincethemaincomponentinthehoppingmatrix(Diracoperator)isalarge"forloop"overlatticesites,itisnaturaltousethefor-loopparallelconstructofopenMP.
TheperformanceofthehybridcodeisthentestedagainstthepureMPIversion.
Weperformaweakscalingtestbyxingthelocalvolumepercore(orthread)andincreasethenumberofMPIprocesses.
ThetestwasdoneontheHoppermachineatNERSCwhichisaCrayXE6[10].
Eachcomputenodehas2twelve-coreAMD'MagnyCours'at2.
1-GHzsuchthateach6coressharethesamecache.
WendperformancefortheHybridversionismaximumwhenassigningatmost6threadsperMPIprocesssuchthatthese6OMPthreadssharethesameL3cache.
InFig.
1weshowtheperformanceofthepureMPIandtheMPI+openMPwith6threadsperMPIprocessforatotalnumberofcoresupto49,152cores.
FromtheseresultswerstnoticethatusingOpenMPleadstoaslightdegradationinperformanceascomparedtothepureMPIcase.
However,asweseeinthecasewithlocalvolumeof124,thehybridapproachperformsbetteraswegotoalargenumberofcores.
Similarbehaviorhasbeenalsoobservedforothercodesfromdifferentcomputationalsciences(seethecasestudiesonHopper[11]).
Figure1:WeakscalingtestforthehoppingmatrixonaCrayXE6machinewithlocallatticevolumepercore84(left)and124(right).
2.
2OverlappingcommunicationwithcomputationTypicallyinlatticecodesonerstcomputestheauxiliaryhalf-spinoreldsθ±asgiveninEquation(2.
2)andthencommunicatestheirvaluesontheboundariesbetweenneighboringpro-cessesinthe+anddirections.
Inablockingcommunicationscheme,computationhaltsuntilcommunicationoftheboundariescompletes.
Analternativeapproachistooverlapcommunica-tionswithcomputationsbydividingthelatticesitesintobulksites,forwhichnearestneighborsare3PLQCDA.
Abdel-Rehimavailablelocally,andboundarysites,forwhichthenearestneighborsarelocatedonneighboringprocesses,andthereforecanonlybeoperateduponaftercommunication.
Theorderofoperationsforcomputingtheresultψisthendoneasfollows:Computeθ+andbegincommunicatingthemtotheneighboringMPIprocessinthedirection.
ComputeθandbegincommunicatingthemtotheneighboringMPIprocessinthe+direction.
Computetheresultψ(x)onthebulksiteswhiletheneighborsarebeingcommunicated.
Waitforthecommunicationsinthedirectionstonish,thencomputethecontributions∑3=0[U(x)θ+(x+e)]totheresultontheboundarysites.
Waitforthecommunicationsinthe+directionstonish,thencomputethecontributions∑3=0[θ(xe)]ontheboundarysites.
Communicationisdoneusingnon-blockingMPIfunctionsMPI_Isend,MPI_IrecvandMPI_Wait.
Apossibledrawbackofthisapproachisthatonewillaccessψ(x)andU(x)inanunorderedfash-iondifferentfromtheorderitisstoredinmemory.
This,however,canbecircumventedpartiallybyusinghintsinthecodeforprefetching.
Wehavetestedtheeffectofprefetchingincaseofsequen-tialandrandomaccessofspinorandlinkelds.
Thetestwasdoneusingaseparatebenchmarkkernelcodewhichisolatesthelink-spinormultiplication.
AscanbeseeninFig.
3,prefetchingbecomesimportantforalargenumberofsites,i.
e.
whendata(spinorsandlinks)cannottinthecachememory,whichisatypicalsituationforlatticecalculations.
Itisalsonotedthataccessingthesitesrandomlyreducestheperformance,aswouldbeexpected.
Inthiscaseonecanimprovethesituationbydeningapointerarray,e.
g.
forthespinorsψ(i)=&ψ(x[i])wherex[i]isthesitetobeaccessedatstepiintheloopsuchasweshowinpseudo-codeinFig.
2.
Thesepointerscanbedenedapriori.
ThisimprovesthepredictiveabilityofthehardwareasisshowninFig.
3wherewecomparethedifferentprefetchingandaddressingschemes.
Sequentialaccessfor(i=0;iSandyBridgeprocessorsandlaterbyAMDintheirBull-dozerprocessor.
The16XMMregistersofSSE3arenow256-bitwideandknownasYMMregisters.
AVX-capableoatingpointunitsareabletoperformon4doubleprecisionoatingpointnumbersor8singleprecision.
Implementingtheseextensionsinthevectorizedpartsoflatticecodeshasthepotentialofprovidingagainofuptoafactor2inanidealsituation,althoughinprac-ticethisdependsonthelayoutoflatticedata.
Weprovidedanimplementationoftheseextensionsusinginlineintrinsics.
InthisimplementationasingleSU(3)matrixmultipliestwoSU(3)vectorssimultaneously.
Againofaboutafactorof1.
5isachievedforthehoppingmatrixinthetmLQCDcodeindoubleprecisionasshowninFig.
(4).
Forillustration,acodesnippetformultiplyingtwocomplexnumbersbytwocomplexnumbersusingAVXisshowninFig.
(5).
3.
EigCGsolverforTwisted-MassfermionsTwisted-MassfermionsoffertheadvantageofautomaticO(a)improvementwhentunedtomaximaltwist[12].
Withinthisdevelopmentworkwehaveaddedanincrementaldeationalgo-rithm,knownasEigCG,tothetmLQCDpackage.
Numericaltestsshowedaconsiderablespeed-upofthesolutionofthelinearsystemsonthelargestvolumessimulatedbytheEuropeanTwistedMassCollaboration(ETMC).
Forillustration,weshowinFig.
(6)thetimetosolutionwithEigCGonaTwisted-Masscongurationwith2+1+1dynamicalavorswithlatticesize483*96atβ=2.
1,andpionmass≈230MeV.
Inthiscasethetotalnumberofeigenvectorsdeatedwas300whichwasbuiltincrementallybycomputing10eigenvectorsduringthesolutionoftherst30right-handsidesusingasearchsubspaceofsize60.
Allsystemsaresolvedindoubleprecisiontorelativetoleranceof108.
5PLQCDA.
Abdel-RehimFigure4:Comparingtheperformanceofthehop-pingmatrixoftmLQCDusingSSE3andAVXindoubleprecisiononanIntelSandyBridgeprocessor.
#include/*t0:a+b*I,e+f*Iandt1:c+d*I,g+h*I*return:(ac-bd)+(ad+bc)*I,*(eg-fh)+(eh+fg)*I*/staticinline__m256dcomplex_mul_regs_256(__m256dt0,__m256dt1){__m256dt2;t2=t1;t1=_mm256_unpacklo_pd(t1,t1);t2=_mm256_unpackhi_pd(t2,t2);t1=_mm256_mul_pd(t1,t0);t2=_mm256_mul_pd(t2,t0);t2=_mm256_shuffle_pd(t2,t2,5);t1=_mm256_addsub_pd(t1,t2);returnt1;}Figure5:MultiplyingtwocomplexnumbersbytwocomplexnumberoftypedoubleusingAVXin-structions.
Figure6:Solutiontimeperprocessfortherst35right-handsidesusingIncrementalEigCGascomparedtoCGonaTwisted-Masscongurationwithlatticesize483*96atβ=2.
1,andpionmass≈230MeV.
4.
ConclusionsandSummaryWehavecarriedoutdevelopmenteffortforafewselectedkernelsusedinlatticeQCD.
TherstoftheseeffortsincludedthedevelopmentofahybridMPI/OpenMPlibrarywhichincludesparallelizedkernelsfortheWilsonDiracoperatorandfewassociatedsolvers.
Anumberofparal-lelizationstrategieshavebeeninvestigated,suchasforoverlappingcommunicationwithcomputa-tions.
ThecodehasbeenshowntoscalefairlywellontheCrayXE6.
Intermsofsingleprocessperformance,wecarriedoutinitialvectorizationeffortsforAVXwhereweseeanimprovementof1.
5comparedtotheideal2.
Inadditionwehaveinvestigatedseveraldata-orderingandassociatedprefetchingstrategies.
ForthecaseoftmLQCD,themainsoftwarecodeoftheETMCcollaboration,wehaveimple-6PLQCDA.
Abdel-Rehimmentedanefcientlinearsolverwhichincrementallydeatedthetwisted-massDiracoperatortogiveaspeed-upofabout3timeswhenenoughright-hand-sidesarerequired.
Thisisalreadyinuseinproductionprojects,suchasinRefs.
[14]and[13].
Allcodesarepubliclyavailable.
PLQCDisavailablethroughtheHPCFORGEwebsiteattheSwissNationalSupercomputingCentre(CSCS)wheremoreinformationisavailablewithinthecodedocumentation.
OurEigCGimplementationintmLQCDisavailableviagit-hub.
AcknowledgementsThistalkwasapartofacodingsessionsponsoredpartiallybythePRACE-2IPproject,aspartofthe"CommunityCodesDevelopment"WorkPackage8.
PRACE-2IPisa7thFrameworkEUfundedproject(http://www.
prace-ri.
eu/,grantagreementnumber:RI-283493).
Wewouldliketothanktheorganizersofthe2013Latticemeetingfortheirstrongsupporttomakethecodingsessionasuccessandprovideallorganizationsupport.
WewouldliketothankC.
Urbach,A.
Deuzmann,B.
Kostrzewa,HubertSimma,S.
Krieg,andL.
Scorzatoforverystimulatingdiscussionsduringthedevelopmentofthisproject.
WeacknowledgethecomputingresourcesfromTier-0machinesofPRACEincludingJUQUEENandCuriemachinesaswellastheTodimachineatCSCS.
WealsoacknowledgethecomputingsupportfromNERSCandtheHoppermachine.
References[1]http://www.
prace-ri.
eu/.
[2]K.
JansenandC.
Urbach,Comput.
Phys.
Commun.
180,2717(2009),[arXiv:0905.
3331].
[3]ETMCollaboration,https://github.
com/etmc/tmLQCD.
[4]http://usqcd.
jlab.
org/usqcd-docs/chroma/.
[5]SeethepublicdeliverableD8.
3onthePRACEwebsiteunderPRACE-2IP.
[6]A.
Deuzeman,PoS(LATTICE2013).
[7]SeetheIntelDevelopermanual.
[8]A.
StathopoulosandK.
Orginos,Computinganddeatingeigenvalueswhilesolvingmultipleright-handsidelinearsystemswithanapplicationtoquantumchromodynamics,SIAMJ.
Sci.
Comput.
2010;32(1):439–462,[arXiv:0707.
0131].
[9]SeeforexamplethedocumentationoftheDDHMCcodebyM.
L¨uscher.
[10]TheHopperCrayXE6machineatNERSC.
[11]SeedocumentationforcombiningMPIandopenMPontheNERSCwebsite.
[12]R.
Frezzottietal.
[AlphaCollaboration],LatticeQCDwithachirallytwistedmassterm,JHEP0108,058(2001)[hep-lat/0101001].
[13]C.
Alexandrou,M.
Constantinou,S.
Dinter,V.
Drach,K.
Hadjiyiannakou,K.
Jansen,G.
KoutsouandA.
Vaquero,arXiv:1309.
7768[hep-lat].
[14]A.
Abdel-Rehim,C.
Alexandrou,M.
Constantinou,V.
Drach,K.
Hadjiyiannakou,K.
Jansen,G.
KoutsouandA.
Vaquero,arXiv:1310.
6339[hep-lat].
7

数脉科技:六月优惠促销,免备案香港物理服务器,E3-1230v2处理器16G内存,350元/月

数脉科技六月优惠促销发布了!数脉科技对香港自营机房的香港服务器进行超低价促销,可选择30M、50M、100Mbps的优质bgp网络。更大带宽可在选购时选择同样享受优惠,目前仅提供HKBGP、阿里云产品,香港CN2、产品优惠码续费有效,仅限新购,每个客户可使用于一个订单。新客户可以立减400元,或者选择对应的机器用相应的优惠码,有需要的朋友可以尝试一下。点击进入:数脉科技官方网站地址数脉科技是一家成...

wordpress投资主题模版 白银黄金贵金属金融投资网站主题

wordpress投资主题模版是一套适合白银、黄金、贵金属投资网站主题模板,绿色大气金融投资类网站主题,专业高级自适应多设备企业CMS建站主题 完善的外贸企业建站功能模块 + 高效通用的后台自定义设置,简洁大气的网站风格设计 + 更利于SEO搜索优化和站点收录排名!点击进入:wordpress投资主题模版安装环境:运行环境:PHP 7.0+, MYSQL 5.6 ( 最低主机需求 )最新兼容:完美...

Digital-VM80美元新加坡和日本独立服务器

Digital-VM商家的暑期活动促销,这个商家提供有多个数据中心独立服务器、VPS主机产品。最低配置月付80美元,支持带宽、流量和IP的自定义配置。Digital-VM,是2019年新成立的商家,主要从事日本东京、新加坡、美国洛杉矶、荷兰阿姆斯特丹、西班牙马德里、挪威奥斯陆、丹麦哥本哈根数据中心的KVM架构VPS产品销售,分为大硬盘型(1Gbps带宽端口、分配较大的硬盘)和大带宽型(10Gbps...

sandybridge为你推荐
云爆发云出十里未及孤村什么意思中老铁路一带一路的火车是什么火车梦之队官网梦之队是哪个国家的?甲骨文不满赔偿工作不满半年被辞退,请问赔偿金是怎么算的?刘祚天DJ这个职业怎么样?地陷裂口天上顿时露出一个大窟窿地上也裂开了,一到黑幽幽的深沟可以用什么四字词语来?百花百游百花蛇草的作用seo优化工具SEO优化工具哪个好用点啊?www.se333se.com米奇网www.qvod333.com 看电影的效果好不?广告法中国的广告法有哪些。
科迈动态域名 日本软银 免费ftp空间 sockscap 贵州电信宽带测速 空间服务商 商务主机 linux空间 股票老左 徐正曦 空间技术网 搜索引擎提交入口 江苏双线服务器 便宜空间 php服务器 免费个人主页 测试网速命令 闪讯网 电信主机托管 japanese50m咸熟 更多