PLQCD library for Lattice QCD on multi-core machines

A. Abdel-Rehim (a), C. Alexandrou (a,b), N. Anastopoulos (c), G. Koutsou (a), I. Liabotis (d) and N. Papadopoulou (c)

(a) The Cyprus Institute, CaSToRC, 20 Konstantinou Kavafi Street, 2121 Aglantzia, Nicosia, Cyprus
(b) Department of Physics, University of Cyprus, P.O. Box 20537, 1678 Nicosia, Cyprus
(c) Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus, 15773 Zografou, Athens, Greece
(d) Greek Research and Technology Network, 56 Mesogion Av., 11527, Athens, Greece

E-mail: a.abdel-rehim@cyi.ac.cy, c.alexandrou@cyi.ac.cy, g.koutsou@cyi.ac.cy, anastop@cslab.ece.ntua.gr, iliaboti@grnet.gr, nikela@cslab.ece.ntua.gr

PLQCD is a stand-alone software library developed under PRACE for lattice QCD. It provides an implementation of the Dirac operator for Wilson-type fermions and a few efficient linear solvers. The library is optimized for multi-core machines using a hybrid parallelization with OpenMP+MPI. The main objective of the library is to provide a scalable implementation of the Dirac operator for efficient computation of the quark propagator. In this contribution, a description of the PLQCD library is given together with some benchmark results.

31st International Symposium on Lattice Field Theory LATTICE 2013
July 29 – August 3, 2013
Mainz, Germany

Speaker.

(c) Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/

arXiv:1405.0700v1 [hep-lat] 4 May 2014

PLQCD, A. Abdel-Rehim

1. Introduction

Computer hardware for commodity clusters as well as supercomputers has evolved tremendously in the last few years.
Nowadays a typical compute node has between 16 and 64 cores and possibly an accelerator such as a Graphics Processing Unit (GPU) or, more recently, an Intel Many Integrated Core (MIC) card. This trend of packing many low-powered but massively parallel processing units is expected to continue as supercomputing technology pursues the Exascale regime. Current technology trends indicate that bandwidth to main memory will continue to lag behind computational power, which requires rethinking the design of lattice QCD codes so that they can run efficiently on such architectures.

Taking this into account, PRACE [1] allocated resources for community code scaling activities in many computationally intensive areas, including lattice QCD. The work presented here was developed under PRACE, focusing on scaling codes for multi-core machines. It deals with community codes, and more specifically with certain computationally intensive kernels in these codes, in order to improve their scaling and performance on multi-core architectures. We have carried out optimization work on the tmLQCD [2,3] code and have developed a new hybrid MPI/OpenMP library (PLQCD) with optimized implementations of the Wilson Dirac kernel and a selected set of linear solvers. Our partners in this project have also performed optimization work for the Molecular Dynamics integrators used in Hybrid Monte Carlo codes, and also for Landau gauge fixing. This was done within the Chroma software suite [4] and will not be discussed here (see [5] for more information). Many other community codes of course exist but were not considered in this work (see [6] for an overview).

In what follows, we will first present the work carried out for the case of PLQCD, where we implemented the Wilson Dirac operator and associated linear algebra functions using MPI+OpenMP. In addition to using this hybrid approach for parallelism, we also implement additional optimizations such as overlapping communication and computation, using compiler intrinsics for vectorization, and using the new Advanced Vector Extensions [7] (AVX for Intel, or QPX for Blue Gene/Q) that recently became available in the new generation of processors such as the Intel Sandy Bridge. The work done for the case of the tmLQCD package will then be presented, where we implemented some new efficient linear solvers, in particular those based on deflation such as the EigCG solver [8], for which we will give some benchmark results.
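
2. Dirac operator optimizations

A key component of the lattice Dirac operator is the hopping part given by Eq. (2.1):

\[
\psi(x) = \sum_{\mu=0}^{3} \left[ U_\mu(x)\,(1-\gamma_\mu)\,\phi(x+e_\mu) + U_\mu^\dagger(x-e_\mu)\,(1+\gamma_\mu)\,\phi(x-e_\mu) \right], \tag{2.1}
\]

where $U_\mu(x)$ is the gauge link matrix in the direction $\mu$ at site $x$, $\gamma_\mu$ are the Dirac matrices and $e_\mu$ is a unit vector in the direction $\mu$. $\phi$ and $\psi$ are the input and output spinors, respectively. Equation (2.1) can be rewritten in terms of two auxiliary fields, $\theta^+_\mu(x) = (1-\gamma_\mu)\,\phi(x)$ and $\theta^-_\mu(x) = U_\mu^\dagger(x)\,(1+\gamma_\mu)\,\phi(x)$, as

\[
\psi(x) = \sum_{\mu=0}^{3} \left[ U_\mu(x)\,\theta^+_\mu(x+e_\mu) + \theta^-_\mu(x-e_\mu) \right]. \tag{2.2}
\]

Because of the structure of the $\gamma_\mu$ matrices, only the upper two spin components of $\theta^\pm_\mu$ need to be computed, since the lower two spin components are related to the upper ones [9]. In the following we describe some of the optimizations performed for the hopping matrix.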

2.1 Hybrid parallelization with MPI and OpenMP

OpenMP provides a simple approach to multi-threading since it is implemented as compiler directives. One can incrementally add multi-threading to the code and also use the same code with multi-threading turned on or off. Since the main component in the hopping matrix (Dirac operator) is a large for loop over lattice sites, it is natural to use the parallel for-loop construct of OpenMP. The performance of the hybrid code is then tested against the pure MPI version. We perform a weak scaling test by fixing the local volume per core (or thread) and increasing the number of MPI processes. The test was done on the Hopper machine at NERSC, which is a Cray XE6 [10]. Each compute node has two twelve-core AMD 'Magny-Cours' processors at 2.1 GHz, with each group of 6 cores sharing the same cache. We find that the performance of the hybrid version is maximal when assigning at most 6 threads per MPI process, such that these 6 OpenMP threads share the same L3 cache. In Fig. 1 we show the performance of the pure MPI and the MPI+OpenMP versions with 6 threads per MPI process for a total of up to 49,152 cores. From these results we first notice that using OpenMP leads to a slight degradation in performance compared to the pure MPI case. However, as we see in the case with a local volume of $12^4$, the hybrid approach performs better as we go to a large number of cores. Similar behavior has also been observed for other codes from different computational sciences (see the case studies on Hopper [11]).

Figure 1: Weak scaling test for the hopping matrix on a Cray XE6 machine with local lattice volume per core $8^4$ (left) and $12^4$ (right).
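
The loop-level threading described in this section can be sketched as follows. This is an illustrative example and not code from PLQCD: the function names `axpy_sites` and `norm2_sites` and the flat `double` arrays are assumptions made here. When the file is compiled without OpenMP support, the pragmas are ignored and the same source runs serially, which is the on/off property mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i] for every local site (hypothetical per-site kernel).
 * With OpenMP enabled, the iterations are divided among the threads. */
static void axpy_sites(long nsites, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < nsites; i++)
        y[i] += a * x[i];
}

/* Sum over sites of x[i]^2; the reduction clause gives each thread a
 * private partial sum that is combined at the end of the loop. */
static double norm2_sites(long nsites, const double *x)
{
    double sum = 0.0;
    assert(x != NULL);
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < nsites; i++)
        sum += x[i] * x[i];
    return sum;
}
```

With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` flag, and the number of threads per MPI process can be set through the `OMP_NUM_THREADS` environment variable.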

2.2 Overlapping communication with computation

Typically in lattice codes one first computes the auxiliary half-spinor fields $\theta^\pm_\mu$ as given in Equation (2.2) and then communicates their values on the boundaries between neighboring processes in the $+\mu$ and $-\mu$ directions. In a blocking communication scheme, computation halts until communication of the boundaries completes. An alternative approach is to overlap communication with computation by dividing the lattice sites into bulk sites, for which nearest neighbors are available locally, and boundary sites, for which the nearest neighbors are located on neighboring processes and which can therefore only be operated upon after communication. The order of operations for computing the result $\psi$ is then as follows:

• Compute $\theta^+_\mu$ and begin communicating them to the neighboring MPI process in the $-\mu$ direction.
• Compute $\theta^-_\mu$ and begin communicating them to the neighboring MPI process in the $+\mu$ direction.
• Compute the result $\psi(x)$ on the bulk sites while the neighbors are being communicated.
• Wait for the communications in the $-\mu$ directions to finish, then compute the contributions $\sum_{\mu=0}^{3} U_\mu(x)\,\theta^+_\mu(x+e_\mu)$ to the result on the boundary sites.
• Wait for the communications in the $+\mu$ directions to finish, then compute the contributions $\sum_{\mu=0}^{3} \theta^-_\mu(x-e_\mu)$ on the boundary sites.

Communication is done using the non-blocking MPI functions MPI_Isend, MPI_Irecv and MPI_Wait.
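
The scheme above relies on knowing in advance which local sites are bulk and which are boundary. The following sketch shows one way such a split could be precomputed. It is not taken from PLQCD: the function names, the lexicographic site ordering, and the assumption that all four directions are distributed across processes are choices made here for illustration. The non-blocking MPI calls themselves are omitted; they would be posted before the loop over the bulk part of the list and waited on before the boundary part.

```c
#include <assert.h>
#include <stdlib.h>

/* Lexicographic index of site x in a local lattice of extent L,
 * with x[0] running fastest (an assumed ordering). */
static int site_index(const int x[4], const int L[4])
{
    return x[0] + L[0] * (x[1] + L[1] * (x[2] + L[2] * x[3]));
}

/* Fill order[] (length L0*L1*L2*L3) with all bulk sites first, followed
 * by all boundary sites. A site counts as boundary if any coordinate lies
 * on a face of the local lattice, i.e. a neighbor lives on another process.
 * Returns the number of bulk sites. */
static int split_bulk_boundary(const int L[4], int *order)
{
    int V = L[0] * L[1] * L[2] * L[3];
    int nbulk = 0, nbnd = 0;
    int x[4];
    int *bnd = malloc((size_t)V * sizeof *bnd);
    assert(bnd != NULL && order != NULL);

    for (x[3] = 0; x[3] < L[3]; x[3]++)
    for (x[2] = 0; x[2] < L[2]; x[2]++)
    for (x[1] = 0; x[1] < L[1]; x[1]++)
    for (x[0] = 0; x[0] < L[0]; x[0]++) {
        int on_boundary = 0;
        for (int mu = 0; mu < 4; mu++)
            if (x[mu] == 0 || x[mu] == L[mu] - 1)
                on_boundary = 1;
        if (on_boundary)
            bnd[nbnd++] = site_index(x, L);
        else
            order[nbulk++] = site_index(x, L);
    }
    for (int i = 0; i < nbnd; i++)   /* append boundary sites after bulk */
        order[nbulk + i] = bnd[i];
    free(bnd);
    return nbulk;
}
```

For a local volume of $4^4$ this gives $2^4 = 16$ bulk sites out of 256, which illustrates why small local volumes leave little bulk work to hide the communication behind.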

A possible drawback of this approach is that one accesses $\psi(x)$ and $U_\mu(x)$ in an order different from the order in which they are stored in memory. This, however, can be partially circumvented by using prefetching hints in the code. We have tested the effect of prefetching for sequential and random access of spinor and link fields. The test was done using a separate benchmark kernel which isolates the link-spinor multiplication. As can be seen in Fig. 3, prefetching becomes important for a large number of sites, i.e. when the data (spinors and links) cannot fit in the cache memory, which is the typical situation for lattice calculations. It is also noted that accessing the sites randomly reduces the performance, as would be expected. In this case one can improve the situation by defining a pointer array, e.g. for the spinors ψ(i) = &ψ(x[i]), where x[i] is the site to be accessed at step i in the loop, as we show in pseudo-code in Fig. 2. These pointers can be defined a priori. This improves the predictive ability of the hardware, as is shown in Fig. 3, where we compare the different prefetching and addressing schemes.
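
The pointer-array idea can be sketched in C as below. This is an illustration written for this text and not the benchmark kernel used for the measurements. The `spinor` struct, the function names, and the use of the GCC/Clang builtin `__builtin_prefetch` as the prefetch hint are all assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical site payload: 4 spin x 3 color complex numbers = 24 doubles. */
typedef struct { double s[24]; } spinor;

/* Build the pointer table once, before the compute loop:
 * ptr[i] = &psi[x[i]], where x[i] is the site visited at step i. */
static void build_pointer_table(spinor *psi, const int *x, size_t n,
                                spinor **ptr)
{
    assert(psi != NULL && x != NULL && ptr != NULL);
    for (size_t i = 0; i < n; i++)
        ptr[i] = &psi[x[i]];
}

/* Visit the sites in table order, hinting the next site into cache while
 * the current one is processed. The "work" is just a sum of the first
 * component, standing in for a real kernel. */
static double sum_first_component(spinor **ptr, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 1 < n)
            __builtin_prefetch(ptr[i + 1], 0, 3);  /* read, keep in cache */
#endif
        sum += ptr[i]->s[0];
    }
    return sum;
}
```

Because the address of the next site is read from the table rather than computed from an index lookup inside the kernel, the prefetch can be issued one step ahead at negligible cost.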

Sequential access: for (i=0; i [...]

[...] Sandy Bridge processors and later by AMD in their Bulldozer processor. The 16 XMM registers of SSE3 are now 256 bits wide and known as YMM registers. AVX-capable floating-point units are able to operate on 4 double-precision or 8 single-precision floating-point numbers at once. Implementing these extensions in the vectorized parts of lattice codes has the potential to provide a gain of up to a factor of 2 in an ideal situation, although in practice this depends on the layout of the lattice data. We provided an implementation of these extensions using inline intrinsics. In this implementation a single SU(3) matrix multiplies two SU(3) vectors simultaneously. A gain of about a factor of 1.5 is achieved for the hopping matrix in the tmLQCD code in double precision, as shown in Fig. 4. For illustration, a code snippet for multiplying two complex numbers by two complex numbers using AVX is shown in Fig. 5.
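
When vectorizing such a kernel it helps to have a plain scalar version with the same data layout to compare against. The following sketch performs the same two complex multiplications as the AVX routine of Fig. 5, using arrays of four doubles in place of 256-bit registers. The function name `complex_mul_pairs_ref` and the array-based interface are assumptions introduced here for illustration.

```c
#include <assert.h>

/* Scalar reference for multiplying two pairs of complex numbers.
 * Layout mirrors one 256-bit register of doubles:
 *   t0 = {a, b, e, f}  meaning  a+b*I and e+f*I
 *   t1 = {c, d, g, h}  meaning  c+d*I and g+h*I
 * Result:
 *   out = {ac-bd, ad+bc, eg-fh, eh+fg}
 * The output array must not alias either input. */
static void complex_mul_pairs_ref(const double t0[4], const double t1[4],
                                  double out[4])
{
    assert(out != t0 && out != t1);
    for (int k = 0; k < 4; k += 2) {
        out[k]     = t0[k] * t1[k]     - t0[k + 1] * t1[k + 1];  /* real */
        out[k + 1] = t0[k] * t1[k + 1] + t0[k + 1] * t1[k];      /* imag */
    }
}
```

Feeding the same inputs to this routine and to the vectorized one, and comparing the four output doubles, is a quick way to catch a wrong shuffle mask or a swapped sign in the intrinsic version.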

3. EigCG solver for Twisted-Mass fermions

Twisted-Mass fermions offer the advantage of automatic O(a) improvement when tuned to maximal twist [12]. Within this development work we have added an incremental deflation algorithm, known as EigCG, to the tmLQCD package. Numerical tests showed a considerable speed-up of the solution of the linear systems on the largest volumes simulated by the European Twisted Mass Collaboration (ETMC). For illustration, we show in Fig. 6 the time to solution with EigCG on a Twisted-Mass configuration with 2+1+1 dynamical flavors with lattice size $48^3 \times 96$ at $\beta = 2.1$ and pion mass ≈ 230 MeV. In this case the total number of eigenvectors deflated was 300, which were built incrementally by computing 10 eigenvectors during the solution of each of the first 30 right-hand sides, using a search subspace of size 60. All systems are solved in double precision to a relative tolerance of $10^{-8}$.
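
As a brief illustration of how accumulated eigenvectors are commonly used in deflation methods of this kind (see Ref. [8] for the actual algorithm), the approximate eigenvectors can be used to build a deflated initial guess for CG. The symbols below are notation introduced here and not taken from the text: $A$ is the Hermitian positive-definite operator being inverted, $b$ a right-hand side, and $V$ the matrix whose columns are the accumulated approximate eigenvectors.

```latex
% Eigenvectors accumulated in the setup quoted above:
%   30 right-hand sides x 10 eigenvectors each = 300 columns in V.
\[
  N_{\mathrm{ev}} = 30 \times 10 = 300 .
\]
% Deflated (Galerkin-projected) initial guess for a new right-hand side b:
\[
  x_0 = V \left( V^\dagger A V \right)^{-1} V^\dagger b ,
\]
% so that the initial residual r_0 = b - A x_0 is orthogonal to the
% deflation subspace: V^\dagger r_0 = 0.
```

Starting CG from $x_0$ removes the components of the error along the low modes captured in $V$, which are typically the ones that slow down convergence.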

Figure 4: Comparing the performance of the hopping matrix of tmLQCD using SSE3 and AVX in double precision on an Intel Sandy Bridge processor.

    #include <immintrin.h>

    /* t0: a+b*I, e+f*I and t1: c+d*I, g+h*I
     * return: (ac-bd)+(ad+bc)*I,
     *         (eg-fh)+(eh+fg)*I */
    static inline __m256d complex_mul_regs_256(__m256d t0, __m256d t1)
    {
        __m256d t2;
        t2 = t1;
        t1 = _mm256_unpacklo_pd(t1, t1);    /* (c, c, g, g) */
        t2 = _mm256_unpackhi_pd(t2, t2);    /* (d, d, h, h) */
        t1 = _mm256_mul_pd(t1, t0);         /* (ac, bc, eg, fg) */
        t2 = _mm256_mul_pd(t2, t0);         /* (ad, bd, eh, fh) */
        t2 = _mm256_shuffle_pd(t2, t2, 5);  /* (bd, ad, fh, eh) */
        t1 = _mm256_addsub_pd(t1, t2);      /* (ac-bd, bc+ad, eg-fh, fg+eh) */
        return t1;
    }

Figure 5: Multiplying two complex numbers by two complex numbers of type double using AVX instructions.

Figure 6: Solution time per process for the first 35 right-hand sides using Incremental EigCG as compared to CG on a Twisted-Mass configuration with lattice size $48^3 \times 96$ at $\beta = 2.1$ and pion mass ≈ 230 MeV.

4. Conclusions and Summary

We have carried out development effort for a few selected kernels used in lattice QCD. The first of these efforts included the development of a hybrid MPI/OpenMP library which includes parallelized kernels for the Wilson Dirac operator and a few associated solvers. A number of parallelization strategies have been investigated, such as overlapping communication with computation. The code has been shown to scale fairly well on the Cray XE6. In terms of single-process performance, we carried out initial vectorization efforts for AVX, where we see an improvement of a factor of 1.5 compared to the ideal factor of 2. In addition we have investigated several data-ordering and associated prefetching strategies.

For the case of tmLQCD, the main software code of the ETMC, we have implemented an efficient linear solver which incrementally deflates the twisted-mass Dirac operator to give a speed-up of about 3 times when enough right-hand sides are required. This is already in use in production projects, such as in Refs. [13] and [14].

All codes are publicly available. PLQCD is available through the HPCFORGE website at the Swiss National Supercomputing Centre (CSCS), where more information is available within the code documentation. Our EigCG implementation in tmLQCD is available via GitHub.

Acknowledgements

This talk was a part of a coding session sponsored partially by the PRACE-2IP project, as part of the "Community Codes Development" Work Package 8. PRACE-2IP is a 7th Framework EU-funded project (http://www.prace-ri.eu/, grant agreement number: RI-283493). We would like to thank the organizers of the 2013 Lattice meeting for their strong support in making the coding session a success and for providing all organizational support. We would like to thank C. Urbach, A. Deuzeman, B. Kostrzewa, Hubert Simma, S. Krieg, and L. Scorzato for very stimulating discussions during the development of this project. We acknowledge the computing resources from Tier-0 machines of PRACE, including the JUQUEEN and Curie machines, as well as the Todi machine at CSCS. We also acknowledge the computing support from NERSC and the Hopper machine.

References

[1] http://www.prace-ri.eu/.
[2] K. Jansen and C. Urbach, Comput. Phys. Commun. 180, 2717 (2009) [arXiv:0905.3331].
[3] ETM Collaboration, https://github.com/etmc/tmLQCD.
[4] http://usqcd.jlab.org/usqcd-docs/chroma/.
[5] See the public deliverable D8.3 on the PRACE website under PRACE-2IP.
[6] A. Deuzeman, PoS(LATTICE 2013).
[7] See the Intel Developer manual.
[8] A. Stathopoulos and K. Orginos, Computing and deflating eigenvalues while solving multiple right-hand side linear systems with an application to quantum chromodynamics, SIAM J. Sci. Comput. 32(1), 439–462 (2010) [arXiv:0707.0131].
[9] See for example the documentation of the DDHMC code by M. Lüscher.
[10] The Hopper Cray XE6 machine at NERSC.
[11] See documentation for combining MPI and OpenMP on the NERSC website.
[12] R. Frezzotti et al. [Alpha Collaboration], Lattice QCD with a chirally twisted mass term, JHEP 0108, 058 (2001) [hep-lat/0101001].
[13] C. Alexandrou, M. Constantinou, S. Dinter, V. Drach, K. Hadjiyiannakou, K. Jansen, G. Koutsou and A. Vaquero, arXiv:1309.7768 [hep-lat].
[14] A. Abdel-Rehim, C. Alexandrou, M. Constantinou, V. Drach, K. Hadjiyiannakou, K. Jansen, G. Koutsou and A. Vaquero, arXiv:1310.6339 [hep-lat].
