casesandybridge

sandybridge  时间:2021-03-27  阅读:()
PLQCDlibraryforLatticeQCDonmulti-coremachinesA.
Abdel-Rehim,aC.
Alexandrou,a,bN.
Anastopoulos,cG.
Koutsou,aI.
LiabotisdandN.
PapadopouloucaTheCyprusInstitute,CaSToRC,20KonstantinouKavaStreet,2121Aglantzia,Nicosia,CyprusbDepartmentofPhysics,UniversityofCyprus,P.
O.
Box20537,1678Nicosia,CypruscComputingSystemsLaboratory,SchoolofElectricalandComputerEngineering,NationalTechnicalUniversityofAthens,ZografouCampus,15773Zografou,Athens,GreecedGreekResearchandTechnologyNetwork,56MesogionAv.
,11527,Athens,GreeceE-mail:a.
abdel-rehim@cyi.
ac.
cy,c.
alexandrou@cyi.
ac.
cy,g.
koutsou@cyi.
ac.
cy,anastop@cslab.
ece.
ntua.
gr,iliaboti@grnet.
gr,nikela@cslab.
ece.
ntua.
grPLQCDisastand–alonesoftwarelibrarydevelopedunderPRACEforlatticeQCD.
Itpro-videsanimplementationoftheDiracoperatorforWilsontypefermionsandfewefcientlin-earsolvers.
Thelibraryisoptimizedformulti-coremachinesusingahybridparallelizationwithOpenMP+MPI.
ThemainobjectivesofthelibraryistoprovideascalableimplementationoftheDiracoperatorforefcientcomputationofthequarkpropagator.
Inthiscontribution,adescrip-tionofthePLQCDlibraryisgiventogetherwithsomebenchmarkresults.
31stInternationalSymposiumonLatticeFieldTheoryLATTICE2013July29August3,2013Mainz,GermanySpeaker.
cCopyrightownedbytheauthor(s)underthetermsoftheCreativeCommonsAttribution-NonCommercial-ShareAlikeLicence.
http://pos.
sissa.
it/arXiv:1405.
0700v1[hep-lat]4May2014PLQCDA.
Abdel-Rehim1.
IntroductionComputerhardwareforcommodityclustersaswellassupercomputershasevolvedtremen-douslyinthelastfewyears.
Nowadaysatypicalcomputenodehasbetween16and64coresandpossiblyanacceleratorsuchasaGraphicsProcessingUnit(GPU)orlatelyanIntelManyIntegratedCore(MIC)card.
Thistrendofpackingmanylow-poweredbutmassivelyparallelpro-cessingunitsisexpectedtocontinueassupercomputingtechnologypursuestheExascaleregime.
Thecurrenttechnologytrendsindicatethatbandwidthtomainmemorywillcontinuetolagbehindcomputationalpower,whichrequiresarethinkingofthedesignoflatticeQCDcodessuchthattheycanefcientlyrunonsucharchitectures.
Takingthisintoaccount,PRACE[1]allocatedresourcesforcommunitycodescalingactivitiesinmanycomputationallyintensiveareasincludinglatticeQCD.
TheworkpresentedherewasdevelopedunderPRACEfocusingonscalingcodesformulti-coremachines.
Theworkwepresentdealswithcommunitycodes,andmorespecicallyoncertaincomputationallyintensivekernelsinthesecodes,inordertoimprovetheirscalingandperformanceformulti-corearchitectures.
WehavecarriedoutoptimizationworkonthetmLQCD[2,3]codeandhavedevelopedanewhybridMPI/OpenMPlibrary(PLQCD)withoptimizedimplementationsoftheWilsonDirackernelandaselectedsetoflinearsolvers.
OurpartnersinthisprojecthavealsoperformedoptimizationworkfortheMolecularDynamicsintegratorsusedinHybridMonteCarlocodes,andalsoforLandaugaugexing.
ThiswasdonewithintheChromasoftwaresuite[4]andwillnotbediscussedhere(See[5]formoreinformation).
Manyothercommunitycodesofcourseexistbutwerenotconsideredinthiswork(See[6]foranoverview).
Inwhatfollows,wewillrstpresenttheworkcarriedoutforthecaseofPLQCD,whereweim-plementedtheWilsonDiracoperatorandassociatedlinearalgebrafunctionsusingMPI+OpenMP.
Inadditiontousingthishybridapproachforparallelism,wealsoimplementadditionaloptimiza-tionssuchasoverlappingcommunicationandcomputation,usingcompilerintrinsicsforvector-izationaswellasimplementingthenewAdvancedVectorInstructions[7](AVXforIntelorQPXforBlue/GeneQ)thatbecamerecentlyavailableinnewgenerationofprocessorssuchastheIntelSandy-Bridge.
TheworkdoneforthecaseofthetmLQCDpackagewillthenbepresented,whereweimplementedsomenewefcientlinearsolvers,inparticularthosebasedondeationsuchastheEigCGsolver[8],forwhichwewillgivesomebenchmarkresults.
2.
DiracoperatoroptimizationsAkeycomponentofthelatticeDiracoperatoristhehoppingpartgivenbyEq.
2.
1.
ψ(x)=3∑=0[U(x)(1γ)φ(x+e)+U(xe)(1+γ)φ(xe)],(2.
1)where,U(x)isthegaugelinkmatrixinthedirectionatsitex,γaretheDiracmatricesandeisaunitvectorinthedirection.
φandψaretheinputandoutputspinorsrespectively.
Equation2.
1canbere-writtenintermsoftwoauxiliaryeldsθ+(x)=(1γ)φ(x)andθ(x)=U(x)(1+γ)φ(x)asψ(x)=3∑=0[U(x)θ+(x+e)+θ(xe)].
(2.
2)2PLQCDA.
Abdel-RehimBecauseofthestructureoftheγmatrices,onlytheuppertwospincomponentsofθ±needtobecomputedbecausethelowertwospincomponentsarerelatedtotheupperones[9].
Inthefollowingwedescribesomeoftheoptimizationsperformedforthehoppingmatrix.
2.
1HybridparallelizationwithMPIandOpenMPOpenMPprovidesasimpleapproachformulti-threadingsinceitisimplementedascompilerdirectives.
Onecanincrementallyaddmulti-threadingtothecodeandalsousethesamecodewithmulti-threadingturnedonandoff.
Sincethemaincomponentinthehoppingmatrix(Diracoperator)isalarge"forloop"overlatticesites,itisnaturaltousethefor-loopparallelconstructofopenMP.
TheperformanceofthehybridcodeisthentestedagainstthepureMPIversion.
Weperformaweakscalingtestbyxingthelocalvolumepercore(orthread)andincreasethenumberofMPIprocesses.
ThetestwasdoneontheHoppermachineatNERSCwhichisaCrayXE6[10].
Eachcomputenodehas2twelve-coreAMD'MagnyCours'at2.
1-GHzsuchthateach6coressharethesamecache.
WendperformancefortheHybridversionismaximumwhenassigningatmost6threadsperMPIprocesssuchthatthese6OMPthreadssharethesameL3cache.
InFig.
1weshowtheperformanceofthepureMPIandtheMPI+openMPwith6threadsperMPIprocessforatotalnumberofcoresupto49,152cores.
FromtheseresultswerstnoticethatusingOpenMPleadstoaslightdegradationinperformanceascomparedtothepureMPIcase.
However,asweseeinthecasewithlocalvolumeof124,thehybridapproachperformsbetteraswegotoalargenumberofcores.
Similarbehaviorhasbeenalsoobservedforothercodesfromdifferentcomputationalsciences(seethecasestudiesonHopper[11]).
Figure1:WeakscalingtestforthehoppingmatrixonaCrayXE6machinewithlocallatticevolumepercore84(left)and124(right).
2.
2OverlappingcommunicationwithcomputationTypicallyinlatticecodesonerstcomputestheauxiliaryhalf-spinoreldsθ±asgiveninEquation(2.
2)andthencommunicatestheirvaluesontheboundariesbetweenneighboringpro-cessesinthe+anddirections.
Inablockingcommunicationscheme,computationhaltsuntilcommunicationoftheboundariescompletes.
Analternativeapproachistooverlapcommunica-tionswithcomputationsbydividingthelatticesitesintobulksites,forwhichnearestneighborsare3PLQCDA.
Abdel-Rehimavailablelocally,andboundarysites,forwhichthenearestneighborsarelocatedonneighboringprocesses,andthereforecanonlybeoperateduponaftercommunication.
Theorderofoperationsforcomputingtheresultψisthendoneasfollows:Computeθ+andbegincommunicatingthemtotheneighboringMPIprocessinthedirection.
ComputeθandbegincommunicatingthemtotheneighboringMPIprocessinthe+direction.
Computetheresultψ(x)onthebulksiteswhiletheneighborsarebeingcommunicated.
Waitforthecommunicationsinthedirectionstonish,thencomputethecontributions∑3=0[U(x)θ+(x+e)]totheresultontheboundarysites.
Waitforthecommunicationsinthe+directionstonish,thencomputethecontributions∑3=0[θ(xe)]ontheboundarysites.
Communicationisdoneusingnon-blockingMPIfunctionsMPI_Isend,MPI_IrecvandMPI_Wait.
Apossibledrawbackofthisapproachisthatonewillaccessψ(x)andU(x)inanunorderedfash-iondifferentfromtheorderitisstoredinmemory.
This,however,canbecircumventedpartiallybyusinghintsinthecodeforprefetching.
Wehavetestedtheeffectofprefetchingincaseofsequen-tialandrandomaccessofspinorandlinkelds.
Thetestwasdoneusingaseparatebenchmarkkernelcodewhichisolatesthelink-spinormultiplication.
AscanbeseeninFig.
3,prefetchingbecomesimportantforalargenumberofsites,i.
e.
whendata(spinorsandlinks)cannottinthecachememory,whichisatypicalsituationforlatticecalculations.
Itisalsonotedthataccessingthesitesrandomlyreducestheperformance,aswouldbeexpected.
Inthiscaseonecanimprovethesituationbydeningapointerarray,e.
g.
forthespinorsψ(i)=&ψ(x[i])wherex[i]isthesitetobeaccessedatstepiintheloopsuchasweshowinpseudo-codeinFig.
2.
Thesepointerscanbedenedapriori.
ThisimprovesthepredictiveabilityofthehardwareasisshowninFig.
3wherewecomparethedifferentprefetchingandaddressingschemes.
Sequentialaccessfor(i=0;iSandyBridgeprocessorsandlaterbyAMDintheirBull-dozerprocessor.
The16XMMregistersofSSE3arenow256-bitwideandknownasYMMregisters.
AVX-capableoatingpointunitsareabletoperformon4doubleprecisionoatingpointnumbersor8singleprecision.
Implementingtheseextensionsinthevectorizedpartsoflatticecodeshasthepotentialofprovidingagainofuptoafactor2inanidealsituation,althoughinprac-ticethisdependsonthelayoutoflatticedata.
Weprovidedanimplementationoftheseextensionsusinginlineintrinsics.
InthisimplementationasingleSU(3)matrixmultipliestwoSU(3)vectorssimultaneously.
Againofaboutafactorof1.
5isachievedforthehoppingmatrixinthetmLQCDcodeindoubleprecisionasshowninFig.
(4).
Forillustration,acodesnippetformultiplyingtwocomplexnumbersbytwocomplexnumbersusingAVXisshowninFig.
(5).
3.
EigCGsolverforTwisted-MassfermionsTwisted-MassfermionsoffertheadvantageofautomaticO(a)improvementwhentunedtomaximaltwist[12].
Withinthisdevelopmentworkwehaveaddedanincrementaldeationalgo-rithm,knownasEigCG,tothetmLQCDpackage.
Numericaltestsshowedaconsiderablespeed-upofthesolutionofthelinearsystemsonthelargestvolumessimulatedbytheEuropeanTwistedMassCollaboration(ETMC).
Forillustration,weshowinFig.
(6)thetimetosolutionwithEigCGonaTwisted-Masscongurationwith2+1+1dynamicalavorswithlatticesize483*96atβ=2.
1,andpionmass≈230MeV.
Inthiscasethetotalnumberofeigenvectorsdeatedwas300whichwasbuiltincrementallybycomputing10eigenvectorsduringthesolutionoftherst30right-handsidesusingasearchsubspaceofsize60.
Allsystemsaresolvedindoubleprecisiontorelativetoleranceof108.
5PLQCDA.
Abdel-RehimFigure4:Comparingtheperformanceofthehop-pingmatrixoftmLQCDusingSSE3andAVXindoubleprecisiononanIntelSandyBridgeprocessor.
#include/*t0:a+b*I,e+f*Iandt1:c+d*I,g+h*I*return:(ac-bd)+(ad+bc)*I,*(eg-fh)+(eh+fg)*I*/staticinline__m256dcomplex_mul_regs_256(__m256dt0,__m256dt1){__m256dt2;t2=t1;t1=_mm256_unpacklo_pd(t1,t1);t2=_mm256_unpackhi_pd(t2,t2);t1=_mm256_mul_pd(t1,t0);t2=_mm256_mul_pd(t2,t0);t2=_mm256_shuffle_pd(t2,t2,5);t1=_mm256_addsub_pd(t1,t2);returnt1;}Figure5:MultiplyingtwocomplexnumbersbytwocomplexnumberoftypedoubleusingAVXin-structions.
Figure6:Solutiontimeperprocessfortherst35right-handsidesusingIncrementalEigCGascomparedtoCGonaTwisted-Masscongurationwithlatticesize483*96atβ=2.
1,andpionmass≈230MeV.
4.
ConclusionsandSummaryWehavecarriedoutdevelopmenteffortforafewselectedkernelsusedinlatticeQCD.
TherstoftheseeffortsincludedthedevelopmentofahybridMPI/OpenMPlibrarywhichincludesparallelizedkernelsfortheWilsonDiracoperatorandfewassociatedsolvers.
Anumberofparal-lelizationstrategieshavebeeninvestigated,suchasforoverlappingcommunicationwithcomputa-tions.
ThecodehasbeenshowntoscalefairlywellontheCrayXE6.
Intermsofsingleprocessperformance,wecarriedoutinitialvectorizationeffortsforAVXwhereweseeanimprovementof1.
5comparedtotheideal2.
Inadditionwehaveinvestigatedseveraldata-orderingandassociatedprefetchingstrategies.
ForthecaseoftmLQCD,themainsoftwarecodeoftheETMCcollaboration,wehaveimple-6PLQCDA.
Abdel-Rehimmentedanefcientlinearsolverwhichincrementallydeatedthetwisted-massDiracoperatortogiveaspeed-upofabout3timeswhenenoughright-hand-sidesarerequired.
Thisisalreadyinuseinproductionprojects,suchasinRefs.
[14]and[13].
Allcodesarepubliclyavailable.
PLQCDisavailablethroughtheHPCFORGEwebsiteattheSwissNationalSupercomputingCentre(CSCS)wheremoreinformationisavailablewithinthecodedocumentation.
OurEigCGimplementationintmLQCDisavailableviagit-hub.
AcknowledgementsThistalkwasapartofacodingsessionsponsoredpartiallybythePRACE-2IPproject,aspartofthe"CommunityCodesDevelopment"WorkPackage8.
PRACE-2IPisa7thFrameworkEUfundedproject(http://www.
prace-ri.
eu/,grantagreementnumber:RI-283493).
Wewouldliketothanktheorganizersofthe2013Latticemeetingfortheirstrongsupporttomakethecodingsessionasuccessandprovideallorganizationsupport.
WewouldliketothankC.
Urbach,A.
Deuzmann,B.
Kostrzewa,HubertSimma,S.
Krieg,andL.
Scorzatoforverystimulatingdiscussionsduringthedevelopmentofthisproject.
WeacknowledgethecomputingresourcesfromTier-0machinesofPRACEincludingJUQUEENandCuriemachinesaswellastheTodimachineatCSCS.
WealsoacknowledgethecomputingsupportfromNERSCandtheHoppermachine.
References[1]http://www.
prace-ri.
eu/.
[2]K.
JansenandC.
Urbach,Comput.
Phys.
Commun.
180,2717(2009),[arXiv:0905.
3331].
[3]ETMCollaboration,https://github.
com/etmc/tmLQCD.
[4]http://usqcd.
jlab.
org/usqcd-docs/chroma/.
[5]SeethepublicdeliverableD8.
3onthePRACEwebsiteunderPRACE-2IP.
[6]A.
Deuzeman,PoS(LATTICE2013).
[7]SeetheIntelDevelopermanual.
[8]A.
StathopoulosandK.
Orginos,Computinganddeatingeigenvalueswhilesolvingmultipleright-handsidelinearsystemswithanapplicationtoquantumchromodynamics,SIAMJ.
Sci.
Comput.
2010;32(1):439–462,[arXiv:0707.
0131].
[9]SeeforexamplethedocumentationoftheDDHMCcodebyM.
L¨uscher.
[10]TheHopperCrayXE6machineatNERSC.
[11]SeedocumentationforcombiningMPIandopenMPontheNERSCwebsite.
[12]R.
Frezzottietal.
[AlphaCollaboration],LatticeQCDwithachirallytwistedmassterm,JHEP0108,058(2001)[hep-lat/0101001].
[13]C.
Alexandrou,M.
Constantinou,S.
Dinter,V.
Drach,K.
Hadjiyiannakou,K.
Jansen,G.
KoutsouandA.
Vaquero,arXiv:1309.
7768[hep-lat].
[14]A.
Abdel-Rehim,C.
Alexandrou,M.
Constantinou,V.
Drach,K.
Hadjiyiannakou,K.
Jansen,G.
KoutsouandA.
Vaquero,arXiv:1310.
6339[hep-lat].
7

HostWebis:美国/法国便宜服务器,100Mbps不限流量,高配置大硬盘,$44/月起

hostwebis怎么样?hostwebis昨天在webhosting发布了几款美国高配置大硬盘机器,但报价需要联系客服。看了下该商家的其它产品,发现几款美国服务器、法国服务器还比较实惠,100Mbps不限流量,高配置大硬盘,$44/月起,有兴趣的可以关注一下。HostWebis是一家国外主机品牌,官网宣称1998年就成立了,根据目标市场的不同,以不同品牌名称提供网络托管服务。2003年,通过与W...

ShineServers(5美元/月)荷兰VPS、阿联酋VPS首月五折/1核1G/50GB硬盘/3TB流量/1Gbps带宽

优惠码50SSDOFF 首月5折50WHTSSD 年付5折15OFF 85折优惠,可循环使用荷兰VPSCPU内存SSD带宽IPv4价格购买1核1G50G1Gbps/3TB1个$ 9.10/月链接2核2G80G1Gbps/5TB1个$ 12.70/月链接2核3G100G1Gbps/7TB1个$ 16.30/月链接3核4G150G1Gbps/10TB1个$ 18.10/月链接阿联酋VPSCPU内存SS...

香港服务器租用多少钱一个月?影响香港服务器租用价格因素

香港服务器租用多少钱一个月?香港服务器受到很多朋友的青睐,其中免备案成为其特色之一。很多用户想了解香港云服务器价格多少钱,也有同行询问香港服务器的租赁价格,一些实际用户想要了解香港服务器的市场。虽然价格是关注的焦点,但价格并不是香港服务器的全部选择。今天小编介绍了一些影响香港服务器租赁价格的因素,以及在香港租一个月的服务器要花多少钱。影响香港服务器租赁价格的因素:1.香港机房选择香港机房相当于选择...

sandybridge为你推荐
易烊千玺弟弟创魔方世界纪录王俊凯.易烊千玺编舞吉尼斯记录云爆发什么是蒸汽云爆炸?要具备那些条件?对对塔今儿老师给推荐了一个叫对对塔的学习网站,看起来挺不错的,有用过的人吗?管不管用?哪些功能比较好啊?巨星prince去世Whitney Houston因什么去世的?关键字关键词标签里写多少个关键词为最好巫正刚阿迪三叶草彩虹板鞋的鞋带怎么穿?详细点,最后有图解。高分求www.7788dy.com回家的诱惑 哪个网站更新的最快啊汴京清谈都城汴京,数百万家,尽仰石炭,无一燃薪者的翻译官人放题求日本放题系列电影,要全集越多越好,求给力铂金血痕为什么我有红血痕?
顶级域名 域名投资 长沙服务器租用 郑州服务器租用 3322免费域名 免费申请域名和空间 华为云服务 冰山互联 fdcservers fastdomain softbank官网 日志分析软件 密码泄露 回程路由 美国php空间 gg广告 上海域名 40g硬盘 工信部icp备案号 爱奇艺会员免费试用 更多