Regular Expressions: The Complete Tutorial
Jan Goyvaerts

Copyright 2006, 2007 Jan Goyvaerts. All rights reserved. Last updated July 2007.

No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the author.

This book is published exclusively at http://www.regular-expressions.info/print.html

Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information is provided on an "as is" basis. The author and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.
Table of Contents

Tutorial
1. Regular Expression Tutorial
2. Literal Characters
3. First Look at How a Regex Engine Works Internally
4. Character Classes or Character Sets
5. The Dot Matches (Almost) Any Character
6. Start of String and End of String Anchors
7. Word Boundaries
8. Alternation with The Vertical Bar or Pipe Symbol
9. Optional Items
10. Repetition with Star and Plus
11. Use Round Brackets for Grouping
12. Named Capturing Groups
13. Unicode Regular Expressions
14. Regex Matching Modes
15. Possessive Quantifiers
16. Atomic Grouping
17. Lookahead and Lookbehind Zero-Width Assertions
18. Testing The Same Part of a String for More Than One Requirement
19. Continuing at The End of The Previous Match
20. If-Then-Else Conditionals in Regular Expressions
21. XML Schema Character Classes
22. POSIX Bracket Expressions
23. Adding Comments to Regular Expressions
24. Free-Spacing Regular Expressions

Examples
1. Sample Regular Expressions
2. Matching Floating Point Numbers with a Regular Expression
3. How to Find or Validate an Email Address
4. Matching a Valid Date
5. Matching Whole Lines of Text
6. Deleting Duplicate Lines From a File
8. Find Two Words Near Each Other
9. Runaway Regular Expressions: Catastrophic Backtracking
10. Repeating a Capturing Group vs. Capturing a Repeated Group

Tools & Languages
1. Specialized Tools and Utilities for Working with Regular Expressions
2. Using Regular Expressions with Delphi for .NET and Win32
3. EditPad Pro: Convenient Text Editor with Full Regular Expression Support
4. What Is grep?
5. Using Regular Expressions in Java
6. Java Demo Application using Regular Expressions
7. Using Regular Expressions with JavaScript and ECMAScript
8. JavaScript RegExp Example: Regular Expression Tester
9. MySQL Regular Expressions with The REGEXP Operator
10. Using Regular Expressions with The Microsoft .NET Framework
11. C# Demo Application
12. Oracle Database 10g Regular Expressions
13. The PCRE Open Source Regex Library
14. Perl's Rich Support for Regular Expressions
15. PHP Provides Three Sets of Regular Expression Functions
16. POSIX Basic Regular Expressions
17. PostgreSQL Has Three Regular Expression Flavors
18. PowerGREP: Taking grep Beyond The Command Line
19. Python's re Module
20. How to Use Regular Expressions in REALbasic
21. RegexBuddy: Your Perfect Companion for Working with Regular Expressions
22. Using Regular Expressions with Ruby
23. Tcl Has Three Regular Expression Flavors
24. VBScript's Regular Expression Support
25. VBScript RegExp Example: Regular Expression Tester
26. How to Use Regular Expressions in Visual Basic
27. XML Schema Regular Expressions

Reference
1. Basic Syntax Reference
2. Advanced Syntax Reference
3. Unicode Syntax Reference
4. Syntax Reference for Specific Regex Flavors
5. Regular Expression Flavor Comparison
6. Replacement Text Reference

Introduction

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is «.*\.txt».

But you can do much more with regular expressions. In a text editor like EditPad Pro or a specialized text processing tool like PowerGREP, you could use the regular expression «\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b» to search for an email address. Any email address, to be exact. A very similar regular expression (replace the first «\b» with «^» and the last one with «$») can be used by a programmer to check if the user entered a properly formatted email address. In just one line of code, whether that code is written in Perl, PHP, Java, a .NET language or a multitude of other languages.

Complete Regular Expression Tutorial

Do not worry if the above example or the quick start make little sense to you. Any non-trivial regex looks daunting to anybody not familiar with them. But with just a bit of experience, you will soon be able to craft your own regular expressions like you have never done anything else. The tutorial in this book explains everything bit by bit.

This tutorial is quite unique because it not only explains the regex syntax, but also describes in detail how the regex engine actually goes about its work. You will learn quite a lot, even if you have already been using regular expressions for some time. This will help you to understand quickly why a particular regex does not do what you initially expected, saving you lots of guesswork and head scratching when writing more complex regexes.

Applications & Languages That Support Regexes

There are many software applications and programming languages that support regular expressions. If you are a programmer, you can save yourself lots of time and effort. You can often accomplish with a single regular expression in one or a few lines of code what would otherwise take dozens or hundreds.

Not Only for Programmers

If you are not a programmer, you can use regular expressions in many situations just as well. They will make finding information a lot easier. You can use them in powerful search and replace operations to quickly make changes across large numbers of files. A simple example is «gr[ae]y», which will find both spellings of the word grey in one operation, instead of two. There are many text editors and search and replace tools with decent regex support.

Part 1

Tutorial

1. Regular Expression Tutorial

In this tutorial, I will teach you all you need to know to be able to craft powerful time-saving regular expressions. I will start with the most basic concepts, so that you can follow this tutorial even if you know nothing at all about regular expressions yet.

But I will not stop there. I will also explain how a regular expression engine works on the inside, and alert you to the consequences. This will help you to understand quickly why a particular regex does not do what you initially expected. It will save you lots of guesswork and head scratching when you need to write more complex regexes.

What Regular Expressions Are Exactly - Terminology

Basically, a regular expression is a pattern describing a certain amount of text. Their name comes from the mathematical theory on which they are based. But we will not dig into that. Since most people including myself are too lazy to type, you will usually find the name abbreviated to regex or regexp. I prefer regex, because it is easy to pronounce the plural "regexes". In this book, regular expressions are printed between guillemets: «regex». They clearly separate the pattern from the surrounding text and punctuation.

This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text „regex". A "match" is the piece of text, or sequence of bytes or characters, that the pattern was found to correspond to by the regex processing software. Matches are indicated by double quotation marks, with the left one at the base of the line.

«\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b» is a more complex pattern. It describes a series of letters, digits, dots, underscores, percentage signs and hyphens, followed by an at sign, followed by another series of letters, digits, dots and hyphens, finally followed by a single dot and between two and four letters. In other words: this pattern describes an email address.

With the above regular expression pattern, you can search through a text file to find email addresses, or verify if a given string looks like an email address.
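
The two uses just mentioned, searching and verifying, can be sketched in a few lines. This is only an illustration in Python's re module, which is an assumption of this example rather than the flavor the tutorial focuses on. The pattern is the email pattern as described in the paragraph above; the helper names find_emails and looks_like_email are hypothetical, and the case-insensitive flag is added because the pattern itself lists only uppercase letters.

```python
import re

# The email pattern as described in the text. re.IGNORECASE is an assumption
# of this sketch: the pattern only lists A-Z, so case is ignored via the flag.
SEARCH = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Same pattern with the word boundaries replaced by ^ and $ for validation.
VALIDATE = re.compile(r"^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return SEARCH.findall(text)


def looks_like_email(value):
    """Return True if the whole value looks like an email address."""
    return VALIDATE.search(value) is not None


found = find_emails("Write to joe@example.com or ann.lee@mail.example.org today.")
print(found)
print(looks_like_email("joe@example.com"), looks_like_email("not an email"))
```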

In this tutorial, I will use the term "string" to indicate the text that I am applying the regular expression to. I will indicate strings using regular double quotes. The term "string" or "character string" is used by programmers to indicate a sequence of characters. In practice, you can use regular expressions with whatever data you can access using the application or programming language you are working with.

Different Regular Expression Engines

A regular expression "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly. Rather, the application will invoke it for you when needed, making sure the right regular expression is applied to the right file or data.

As usual in the software world, different regular expression engines are not fully compatible with each other. It is not possible to describe every kind of engine and regular expression syntax (or "flavor") in this tutorial. I will focus on the regex flavor used by Perl 5, for the simple reason that this regex flavor is the most popular one, and deservedly so. Many more recent regex engines are very similar, but not identical, to the one of Perl 5. Examples are the open source PCRE engine (used in many tools and languages like PHP), the .NET regular expression library, and the regular expression package included with version 1.4 and later of the Java JDK. I will point out to you whenever differences in regex flavors are important, and which features are specific to the Perl-derivatives mentioned above.

Give Regexes a First Try

You can easily try the following yourself in a text editor that supports regular expressions, such as EditPad Pro. If you do not have such an editor, you can download the free evaluation version of EditPad Pro to try this out. EditPad Pro's regex engine is fully functional in the demo version. As a quick test, copy and paste the text of this page into EditPad Pro. Then select Search|Show Search Panel from the menu. In the search pane that appears near the bottom, type in «regex» in the box labeled "Search Text". Mark the "Regular expression" checkbox, and click the Find First button. This is the leftmost button on the search panel. See how EditPad Pro's regex engine finds the first match. Click the Find Next button, which sits next to the Find First button, to find further matches. When there are no further matches, the Find Next button's icon will flash briefly.

Now try to search using the regex «reg(ular expressions?|ex(p|es)?)». This regex will find all names, singular and plural, I have used on this page to say "regex". If we only had plain text search, we would have needed 5 searches. With regexes, we need just one search. Regexes save you time when using a tool like EditPad Pro. Select Count Matches in the Search menu to see how many times this regular expression can match the file you have open in EditPad Pro.

If you are a programmer, your software will run faster since even a simple regex engine applying the above regex once will outperform a state of the art plain text search algorithm searching through the data five times. Regular expressions also reduce development time. With a regex engine, it takes only one line (e.g. in Perl, PHP, Java or .NET) or a couple of lines (e.g. in C using PCRE) of code to, say, check if the user's input looks like a valid email address.

2. Literal Characters

The most basic regular expression consists of a single literal character, e.g.: «a». It will match the first occurrence of that character in the string. If the string is "Jack is a boy", it will match the „a" after the "J". The fact that this "a" is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using word boundaries. We will get to that later.

This regex can match the second „a" too. It will only do so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its "Find Next" or "Search Forward" function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match.

Similarly, the regex «cat» will match „cat" in "About cats and dogs". This regular expression consists of a series of three literal characters. This is like saying to the regex engine: find a «c», immediately followed by an «a», immediately followed by a «t».

Note that regex engines are case sensitive by default. «cat» does not match "Cat", unless you tell the regex engine to ignore differences in case.

Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 11 characters with special meanings: the opening square bracket «[», the backslash «\», the caret «^», the dollar sign «$», the period or dot «.», the vertical bar or pipe symbol «|», the question mark «?», the asterisk or star «*», the plus sign «+», the opening round bracket «(» and the closing round bracket «)». These special characters are often called "metacharacters".

If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match „1+1=2", the correct regex is «1\+1=2». Otherwise, the plus sign will have a special meaning.

Note that «1+1=2», with the backslash omitted, is a valid regex. So you will not get an error message. But it will not match "1+1=2". It would match „111=2" in "123+111=234", due to the special meaning of the plus character.

If you forget to escape a special character where its use is not allowed, such as in «+1», then you will get an error message.
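
The three cases above (escaped plus, unescaped plus, and a plus where it is not allowed) can be tried out directly. The following is a minimal sketch using Python's re module, which is an assumption of this example; other flavors report the error in their own way.

```python
import re

# Escaped plus: matches the literal text 1+1=2.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus: valid regex, but "+" now means "one or more of the previous 1".
unescaped_same = re.search(r"1+1=2", "1+1=2")
unescaped_other = re.search(r"1+1=2", "123+111=234")

# A plus with nothing in front of it to repeat is rejected when compiling.
try:
    re.compile(r"+1")
    invalid_rejected = False
except re.error:
    invalid_rejected = True

print(escaped.group())          # 1+1=2
print(unescaped_same)           # None
print(unescaped_other.group())  # 111=2
print(invalid_rejected)         # True
```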

Most regular expression flavors treat the brace «{» as a literal character, unless it is part of a repetition operator like «{1,3}». So you generally do not need to escape it with a backslash, though you can do so if you want. An exception to this rule is the java.util.regex package: it requires all literal braces to be escaped.

All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. E.g. «\d» will match a single digit from 0 to 9.

Escaping a single metacharacter with a backslash works in all regular expression flavors. Many flavors also support the «\Q...\E» escape sequence. All the characters between the «\Q» and the «\E» are interpreted as literal characters. E.g. «\Q*\d+*\E» matches the literal text „*\d+*". The «\E» may be omitted at the end of the regex, so «\Q*\d+*» is the same as «\Q*\d+*\E». This syntax is supported by the JGsoft engine, Perl and PCRE, both inside and outside character classes. Java supports it outside character classes only, and quantifies it as one token.

Special Characters and Programming Languages

If you are a programmer, you may be surprised that characters like the single quote and double quote are not special characters. That is correct. When using a regular expression or grep tool like PowerGREP or the search function of a text editor like EditPad Pro, you should not escape or repeat the quote characters like you do in a programming language.

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string. So the regex «1\+1=2» must be written as "1\\+1=2" in C++ code. The C++ compiler will turn the escaped backslash in the source code into a single backslash in the string that is passed on to the regex library. To match „c:\temp", you need to use the regex «c:\\temp». As a string in C++ source code, this regex becomes "c:\\\\temp". Four backslashes to match a single one indeed.

See the tools and languages section in this book for more information on how to use regular expressions in various programming languages.
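
The same double layer of escaping shows up in most languages with backslash escapes in string literals. As a hedged illustration in Python (an assumption of this example; the text above uses C++), an ordinary string literal needs the four backslashes, while a raw string literal lets you write the regex as the regex engine sees it.

```python
import re

# Ordinary literal: the language halves the backslashes before re sees them.
regex_escaped = "c:\\\\temp"

# Raw literal: backslashes are passed through untouched.
regex_raw = r"c:\\temp"

# The subject text is c:\temp, with a single backslash.
subject = "c:\\temp"

same_regex = regex_escaped == regex_raw
match = re.search(regex_raw, subject)

print(same_regex)     # True
print(match.group())  # c:\temp
```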

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use «\t» to match a tab character (ASCII 0x09), «\r» for carriage return (0x0D) and «\n» for line feed (0x0A). More exotic non-printables are «\a» (bell, 0x07), «\e» (escape, 0x1B), «\f» (form feed, 0x0C) and «\v» (vertical tab, 0x0B). Remember that Windows text files use "\r\n" to terminate lines, while UNIX text files use "\n".

You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use «\xA9». Another way to search for a tab is to use «\x09». Note that the leading zero is required.

Most regex flavors also support the tokens «\cA» through «\cZ» to insert ASCII control characters. The letter after the backslash is always a lowercase c. The second letter is an uppercase letter A through Z, to indicate Control+A through Control+Z. These are equivalent to «\x01» through «\x1A» (26 decimal). E.g. «\cM» matches a carriage return, just like «\r» and «\x0D». In XML Schema regular expressions, «\c» is a shorthand character class that matches any character allowed in an XML name.

If your regular expression engine supports Unicode, use «\uFFFF» rather than «\xFF» to insert a Unicode character. The euro currency sign occupies code point 0x20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with «\u20AC».
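
A short sketch of these escapes, assuming Python's re module with ordinary (Unicode) string patterns. Note that the «\cA» through «\cZ» tokens are left out here because, as far as I know, Python's re does not accept them.

```python
import re

# A tab can be written as \t or as its hexadecimal code \x09.
tab_by_escape = re.search(r"\t", "a\tb")
tab_by_hex = re.search(r"\x09", "a\tb")

# The copyright sign is character 0xA9 in Latin-1 (and in Unicode).
copyright_sign = re.search(r"\xA9", "\u00a9 2007")

# The euro sign lives at code point 0x20AC.
euro = re.search(r"\u20AC", "price: \u20ac5")

print(tab_by_escape.start(), tab_by_hex.start())
print(copyright_sign.group() == "\u00a9", euro.group() == "\u20ac")
```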

3. First Look at How a Regex Engine Works Internally

Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly why a particular regex does not do what you initially expected. This will save you lots of guesswork and head scratching when you need to write more complex regexes.

There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex «regex|regex not» to the string "regex not". If the resulting match is only „regex", the engine is regex-directed. If the result is „regex not", then it is text-directed. The reason behind this is that the regex-directed engine is "eager".
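
The test described above takes one line in most languages. As a sketch, here it is applied to Python's re module (chosen for this example only; it is not one of the flavors named in the paragraph above):

```python
import re

# Apply the alternation test: which alternative wins on "regex not"?
result = re.search(r"regex|regex not", "regex not").group()

# An eager, regex-directed engine stops at the first alternative that works.
engine_kind = "regex-directed" if result == "regex" else "text-directed"

print(repr(result), engine_kind)
```

Python's re reports „regex" here, which is the regex-directed outcome according to the test.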

In this tutorial, after introducing a new regex token, I will explain step by step how the regex engine actually processes that token. This inside look may seem a bit long-winded at certain times. But understanding how the regex engine works will enable you to use its full power and help you avoid common mistakes.

The Regex-Directed Engine Always Returns the Leftmost Match

This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.

When applying «cat» to "He captured a catfish for his cat.", the engine will try to match the first token in the regex «c» to the first character in the string, "H". This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the «c» with the "e". This fails too, as does matching the «c» with the space. Arriving at the 4th character in the string, «c» matches „c". The engine will then try to match the second token «a» to the 5th character, „a". This succeeds too. But then, «t» fails to match "p". At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: "a". Again, «c» fails to match here and the engine carries on. At the 15th character in the string, «c» again matches „c". The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that «a» matches „a" and «t» matches „t".

The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches. The first match is considered good enough.
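
The walk-through above can be mimicked with a few lines of code. The function below is only a toy model for regexes made of literal characters (the name leftmost_literal is hypothetical, and real engines are far more elaborate); it is compared against Python's re module, which is an assumption of this sketch.

```python
import re


def leftmost_literal(pattern, text):
    """Toy model: try each start position in turn, token by token."""
    for start in range(len(text) - len(pattern) + 1):
        for offset, token in enumerate(pattern):
            if text[start + offset] != token:
                break  # this start position fails; move on to the next one
        else:
            return start  # all tokens matched: report the match eagerly
    return None


text = "He captured a catfish for his cat."
toy_start = leftmost_literal("cat", text)
real_start = re.search(r"cat", text).start()

# Positions are zero-based here, so index 14 is the 15th character.
print(toy_start, real_start)
```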

In this first example of the engine's internals, our regex engine simply appears to work like a regular text search routine. A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In following examples, the way the engine works will have a profound impact on the matches it will find. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.

4. Character Classes or Character Sets

With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use «[ae]». You could use this in «gr[ae]y» to match either „gray" or „grey". Very useful if you do not know whether the document you are searching through is written in American or British English.

A character class matches only a single character. «gr[ae]y» will not match "graay", "graey" or any such thing. The order of the characters inside a character class does not matter. The results are identical.

You can use a hyphen inside a character class to specify a range of characters. «[0-9]» matches a single digit between 0 and 9. You can use more than one range. «[0-9a-fA-F]» matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. «[0-9a-fxA-FX]» matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.

Useful Applications

Find a word, even if it is misspelled, such as «sep[ae]r[ae]te» or «li[cs]en[cs]e».
Find an identifier in a programming language with «[A-Za-z_][A-Za-z_0-9]*».
Find a C-style hexadecimal number with «0[xX][A-Fa-f0-9]+».
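
A quick way to get a feel for these patterns is to run them over a few sample strings. The sketch below uses Python's re module (an assumption of this example); the sample strings are made up for illustration.

```python
import re

# One character out of a set: both spellings, but never two vowels in a row.
spellings = re.findall(r"gr[ae]y", "gray grey graay")

# Identifier: a letter or underscore, then letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "x1 = _tmp + 42")

# C-style hexadecimal number.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF | 0X1a")

print(spellings)    # ['gray', 'grey']
print(identifiers)  # ['x1', '_tmp']
print(hex_numbers)  # ['0xFF', '0X1a']
```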

Negated Character Classes

Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters.

It is important to remember that a negated character class still must match a character. «q[^u]» does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string "Iraq". It will match the q and the space after the q in "Iraq is a country". Indeed: the space will be part of the overall match, because it is the "character that is not a u" that is matched by the negated character class in the above regexp. If you want the regex to match the q, and only the q, in both strings, you need to use negative lookahead: «q(?!u)». But we will get to that later.
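
The difference between the negated class and the negative lookahead is easy to see side by side. A minimal sketch, assuming Python's re module:

```python
import re

# The negated class must consume a character, so nothing matches at end of string.
class_at_end = re.search(r"q[^u]", "Iraq")
class_mid = re.search(r"q[^u]", "Iraq is a country")

# Negative lookahead only checks what follows; it consumes nothing.
lookahead_at_end = re.search(r"q(?!u)", "Iraq")
lookahead_mid = re.search(r"q(?!u)", "Iraq is a country")

print(class_at_end)                 # None
print(repr(class_mid.group()))      # 'q '
print(lookahead_at_end.group(), lookahead_mid.group())  # q q
```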

Metacharacters Inside Character Classes

Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use «[+*]». Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. «[\\x]» matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. «[x^]» matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. «[]x]» matches a closing bracket or an x. «[^]x]» matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both «[-x]» and «[x-]» match an x or a hyphen.
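
The placement rules above can be checked quickly. The sketch below assumes Python's re module, which follows the same positional conventions for the closing bracket, caret and hyphen; other flavors may differ in details.

```python
import re

star_or_plus = re.findall(r"[+*]", "1+2*3")          # ordinary inside a class
backslash_or_x = re.findall(r"[\\x]", r"a\x")        # backslash must be escaped
bracket_or_x = re.findall(r"[]x]", "a]x")            # ] right after [
not_bracket_or_x = re.findall(r"[^]x]", "a]x")       # ] right after the caret
x_or_caret = re.findall(r"[x^]", "x^y")              # caret not in first position
hyphen_first = re.findall(r"[-x]", "a-x")            # hyphen right after [
hyphen_last = re.findall(r"[x-]", "a-x")             # hyphen right before ]

print(star_or_plus, backslash_or_x, bracket_or_x)
print(not_bracket_or_x, x_or_caret, hyphen_first, hyphen_last)
```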

You can use all non-printable characters in character classes just like you can use them outside of character classes. E.g. «[$\u20AC]» matches a dollar or euro sign, assuming your regex flavor supports Unicode.

The JGsoft engine, Perl and PCRE also support the «\Q...\E» sequence inside character classes to escape a string of characters. E.g. «[\Q[-]\E]» matches „[", „-" or „]".

POSIX regular expressions treat the backslash as a literal character inside character classes. This means you can't use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-). To use these characters, position them as explained above in this section. This also means that special tokens like shorthands are not available in POSIX regular expressions. See the tutorial topic on POSIX bracket expressions for more information.

Shorthand Character Classes

Since certain character classes are used often, a series of shorthand character classes are available. «\d» is short for «[0-9]».

«\w» stands for "word character". Exactly which characters it matches differs between regex flavors. In all flavors, it will include «[A-Za-z]». In most, the underscore and digits are also included. In some flavors, word characters from other languages may also match. The best way to find out is to do a couple of tests with the regex flavor you are using. In the screenshot, you can see the characters matched by «\w» in RegexBuddy using various scripts.

«\s» stands for "whitespace character". Again, which characters this actually includes depends on the regex flavor. In all flavors discussed in this tutorial, it includes «[ \t]». That is: «\s» will match a space or a tab. In most flavors, it also includes a carriage return or a line feed as in «[ \t\r\n]». Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.

Shorthand character classes can be used both inside and outside the square brackets. «\s\d» matches a whitespace character followed by a digit. «[\s\d]» matches a single character that is either whitespace or a digit. When applied to "1 + 2 = 3", the former regex will match „ 2" (space two), while the latter matches „1" (one). «[\da-fA-F]» matches a hexadecimal digit, and is equivalent to «[0-9a-fA-F]».

Negated Shorthand Character Classes

The above three shorthands also have negated versions. «\D» is the same as «[^\d]», «\W» is short for «[^\w]» and «\S» is the equivalent of «[^\s]».

Be careful when using the negated shorthands inside square brackets. «[\D\S]» is not the same as «[^\d\s]». The latter will match any character that is not a digit or whitespace. So it will match „x", but not "8". The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, «[\D\S]» will match any character, digit, whitespace or otherwise.
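
Both pitfalls (shorthands inside versus outside brackets, and negated shorthands inside brackets) can be verified with a few calls. This is a sketch assuming Python's re module; the short test strings are chosen for illustration.

```python
import re

subject = "1 + 2 = 3"

# Two tokens in sequence versus one class containing both shorthands.
sequence = re.search(r"\s\d", subject).group()
either = re.search(r"[\s\d]", subject).group()

# [^\d\s]: neither digit nor whitespace. [\D\S]: non-digit OR non-whitespace.
neither = re.findall(r"[^\d\s]", "x 8")
anything = re.findall(r"[\D\S]", "x 8")

print(repr(sequence), repr(either))  # ' 2' '1'
print(neither)                       # ['x']
print(anything)                      # ['x', ' ', '8']
```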

Repeating Character Classes

If you repeat a character class by using the «?», «*» or «+» operators, you will repeat the entire character class, and not just the character that it matched. The regex «[0-9]+» can match „837" as well as „222".

If you want to repeat the matched character, rather than the class, you will need to use backreferences. «([0-9])\1+» will match „222" but not "837". When applied to the string "833337", it will match „3333" in the middle of this string.
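
The contrast between repeating the class and repeating the matched character looks like this in practice. A minimal sketch, assuming Python's re module, where «\1» refers back to the first capturing group as in the text:

```python
import re

# Repeating the class: each repetition may match a different digit.
any_digits = re.search(r"[0-9]+", "837").group()

# Repeating the matched character: \1 must equal what group 1 captured.
same_digit = re.search(r"([0-9])\1+", "222").group()
no_repeat = re.search(r"([0-9])\1+", "837")
in_middle = re.search(r"([0-9])\1+", "833337").group()

print(any_digits, same_digit, no_repeat, in_middle)
```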
If you do not want that, you need to use lookahead and lookbehind.

But I digress. I did not yet explain how character classes work inside the regex engine. Let us take a look at that first.

Looking Inside The Regex Engine

As I already said: the order of the characters inside a character class does not matter. «gr[ae]y» will match „grey" in "Is his hair grey or gray", because that is the leftmost match. We already saw how the engine applies a regex consisting only of literal characters. Below, I will explain how it applies a regex that has more than one permutation. That is: «gr[ae]y» can match both „gray" and „grey".

Nothing noteworthy happens for the first twelve characters in the string. The engine will fail to match «g» at every step, and continue with the next character in the string. When the engine arrives at the 13th character, „g" is matched. The engine will then try to match the remainder of the regex with the text. The next token in the regex is the literal «r», which matches the next character in the text. So the third token, «[ae]», is attempted at the next character in the text ("e"). The character class gives the engine two options: match «a» or match «e». It will first attempt to match «a», and fail.

But because we are using a regex-directed engine, it must continue trying to match all the other permutations of the regex pattern before deciding that the regex cannot be matched with the text starting at character 13. So it will continue with the other option, and find that «e» matches „e". The last regex token is «y», which can be matched with the following character as well. The engine has found a complete match with the text starting at character 13. It will return „grey" as the match result, and look no further. Again, the leftmost match was returned, even though we put the «a» first in the character class, and „gray" could have been matched in the string. But the engine simply did not get that far, because another equally valid match was found to the left of it.
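
The outcome of this walk-through can be confirmed in code. The sketch assumes Python's re module; positions are zero-based there, so index 12 corresponds to the 13th character.

```python
import re

text = "Is his hair grey or gray"

a_first = re.search(r"gr[ae]y", text)
e_first = re.search(r"gr[ea]y", text)

# The order inside the class makes no difference: the leftmost match wins.
print(a_first.group(), a_first.start())
print(e_first.group(), e_first.start())
```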

5. The Dot Matches (Almost) Any Character

In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.

The dot matches a single character, without caring what that character is. The only exceptions are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class «[^\n]» (UNIX regex flavors) or «[^\r\n]» (Windows regex flavors).

This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain newlines, so the dot could never match them.

Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex flavors discussed here have an option to make the dot match all characters, including newlines. In RegexBuddy, EditPad Pro or PowerGREP, you simply tick the checkbox labeled "dot matches newline".

In Perl, the mode where the dot also matches newlines is called "single-line mode". This is a bit unfortunate, because it is easy to mix up this term with "multi-line mode". Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.

Other languages and regex libraries have adopted Perl's terminology. When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).

In all programming languages and regex libraries I know, activating single-line mode has no effect other than making the dot match newlines. So if you expose this option to your users, please give it a clearer label like was done in RegexBuddy, EditPad Pro and PowerGREP.

JavaScript and VBScript do not have an option to make the dot match line break characters. In those languages, you can use a character class such as «[\s\S]» to match any character. This character class matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
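
To see the default behavior, the "dot matches newline" mode and the «[\s\S]» workaround next to each other, here is a small sketch. It assumes Python's re module, where the mode is called re.DOTALL rather than single-line mode.

```python
import re

subject = "a\nb"

# By default the dot refuses to match the line feed between a and b.
default_dot = re.search(r"a.b", subject)

# With re.DOTALL the dot matches any character, including newlines.
dotall_dot = re.search(r"a.b", subject, re.DOTALL)

# The character class workaround needs no mode at all.
class_workaround = re.search(r"a[\s\S]b", subject)

print(default_dot)                         # None
print(repr(dotall_dot.group()))            # 'a\nb'
print(repr(class_workaround.group()))      # 'a\nb'
```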

Use The Dot Sparingly

The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.

I will illustrate this with a simple example. Let's say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is «\d\d.\d\d.\d\d». Seems fine at first. It will match a date like „02/12/03" just fine. Trouble is: „02512703" is also considered a valid date by this regular expression. In this match, the first dot matched „5", and the second matched „7". Obviously not what we intended.

«\d\d[- /.]\d\d[- /.]\d\d» is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.

This regex is still far from perfect. It matches „99/99/99" as a valid date. «[0-1]\d[- /.][0-3]\d[- /.]\d\d» is a step ahead, though it will still match „19/39/99". How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.
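
The three date regexes can be compared on the sample values from the text. A minimal sketch, assuming Python's re module; the helper name matches is hypothetical.

```python
import re

LOOSE = r"\d\d.\d\d.\d\d"
BETTER = r"\d\d[- /.]\d\d[- /.]\d\d"
STRICTER = r"[0-1]\d[- /.][0-3]\d[- /.]\d\d"


def matches(regex, subject):
    """Return True if regex matches somewhere in subject."""
    return re.search(regex, subject) is not None


print(matches(LOOSE, "02/12/03"), matches(LOOSE, "02512703"))      # True True
print(matches(BETTER, "02/12/03"), matches(BETTER, "02512703"))    # True False
print(matches(BETTER, "99/99/99"), matches(STRICTER, "99/99/99"))  # True False
print(matches(STRICTER, "19/39/99"))                               # True
```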

Use Negated Character Sets Instead of the Dot

I will explain this in depth when I present you the repeat operators star and plus, but the warning is important enough to mention it here as well. I will illustrate with an example.

Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so «".*"» seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on "Put a "string" between double quotes", it will match „"string"" just fine. Now go ahead and test it on "Houston, we have a problem with "string one" and "string two". Please respond."

Ouch. The regex matches „"string one" and "string two"". Definitely not what we intended. The reason for this is that the star is greedy.

In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we will do the same. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is «"[^"\r\n]*"».
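
The greedy dot and the negated character class can be compared directly on the "Houston" sentence. A sketch assuming Python's re module:

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot-star: runs from the first quote to the last quote on the line.
greedy = re.findall(r'".*"', text)

# Negated class: cannot step over a quote, so each string is matched separately.
negated = re.findall(r'"[^"\r\n]*"', text)

print(greedy)   # ['"string one" and "string two"']
print(negated)  # ['"string one"', '"string two"']
```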

6. Start of String and End of String Anchors

Thus far, I have explained literal characters and character classes. In both cases, putting one in a regex will cause the regex engine to try to match a single character.

Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to "anchor" the regex match at a certain position. The caret «^» matches the position before the first character in the string. Applying «^a» to "abc" matches „a". «^b» will not match "abc" at all, because the «b» cannot be matched right after the start of the string, matched by «^». See below for the inside view of the regex engine.

Similarly, «$» matches right after the last character in the string. «c$» matches „c" in "abc", while «a$» does not match at all.
UsefulApplicationsWhenusingregularexpressionsinaprogramminglanguagetovalidateuserinput,usinganchorsisveryimportant.
Ifyouusethecodeif($input=~m/\d+/)inaPerlscripttoseeiftheuserenteredanintegernumber,itwillaccepttheinputeveniftheuserentered"qsdf4ghjk",because\d+matchesthe4.
Thecorrectregextouseis^\d+$.
Because"startofstring"mustbematchedbeforethematchof\d+,and"endofstring"mustbematchedrightafterit,theentirestringmustconsistofdigitsfor^\d+$tobeabletomatch.
Itiseasyfortheusertoaccidentallytypeinaspace.
WhenPerlreadsfromalinefromatextfile,thelinebreakwillalsobestoredinthevariable.
Sobeforevalidatinginput,itisgoodpracticetotrimleadingandtrailingwhitespace.
^\s+matchesleadingwhitespaceand\s+$matchestrailingwhitespace.
InPerl,youcoulduse$input=~s/^\s+|\s+$//g.
Handyuseofalternationand/gallowsustodothisinasinglelineofcode.
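The same trim-then-validate idea can be sketched in Python. The function name is_integer is a hypothetical helper introduced here for illustration; only re.sub and re.search from the standard library are used.

```python
import re

def is_integer(user_input: str) -> bool:
    """Return True if the input, once trimmed, consists of digits only."""
    # Trim leading and trailing whitespace, mirroring the Perl one-liner.
    trimmed = re.sub(r'^\s+|\s+$', '', user_input)
    # The anchors force the digits to span the entire string.
    return re.search(r'^\d+$', trimmed) is not None

# Without anchors, a single digit anywhere is enough for a match.
unanchored_accepts_garbage = re.search(r'\d+', 'qsdf4ghjk') is not None
```

For example, is_integer("  42\n") accepts the input after trimming, while is_integer("qsdf4ghjk") rejects it.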

Using ^ and $ as Start of Line and End of Line Anchors

If you have a string consisting of multiple lines, like "first line\nsecond line" (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. ^ can then match at the start of the string (before the "f" in the above string), as well as after each line break (between "\n" and "s"). Likewise, $ will still match at the end of the string (after the last "e"), and also before every line break (between "e" and "\n").

In text editors like EditPad Pro or GNU Emacs, and regex tools like PowerGREP, the caret and dollar always match at the start and end of each line. This makes sense because those applications are designed to work with entire files, rather than short strings.

In all programming languages and libraries discussed in this book, except Ruby, you have to explicitly activate this extended functionality. It is traditionally called "multi-line mode". In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m;. In .NET, the anchors match before and after newlines when you specify RegexOptions.Multiline, such as in Regex.Match("string", "regex", RegexOptions.Multiline).
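For comparison, here is a small sketch of the same switch in Python, where the option is spelled re.MULTILINE. Python is used here only as an illustration; the variable names are assumptions.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ matches only at the start of the string.
starts_default = re.findall(r'^\w+', text)

# Multi-line mode: ^ also matches after each line break,
# and $ also matches before each line break.
starts_multiline = re.findall(r'^\w+', text, flags=re.MULTILINE)
ends_multiline = re.findall(r'\w+$', text, flags=re.MULTILINE)
```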

Permanent Start of String and End of String Anchors

\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string. These two tokens never match at line breaks. This is true in all regex flavors discussed in this tutorial, even when you turn on "multiline mode". In EditPad Pro and PowerGREP, where the caret and dollar always match at the start and end of lines, \A and \Z only match at the start and the end of the entire file.

Zero-Length Matches

We saw that the anchors match at a position, rather than matching a character. This means that when a regex only consists of one or more anchors, it can result in a zero-length match. Depending on the situation, this can be very useful or undesirable. Using ^\d*$ to test if the user entered a number (notice the use of the star instead of the plus), would cause the script to accept an empty string as a valid input. See below.

However, matching only a position can be very useful. In email, for example, it is common to prepend a "greater than" symbol and a space to each line of the quoted message. In VB.NET, we can easily do this with Dim Quoted As String = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline). We are using multi-line mode, so the regex ^ matches at the start of the quoted message, and after each newline. The Regex.Replace method will remove the regex match from the string, and insert the replacement string (greater than symbol and a space). Since the match does not include any characters, nothing is deleted. However, the match does include a starting position, and the replacement string is inserted there, just like we want it.
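The quoting trick carries over to other flavors. A minimal Python sketch follows; the sample message and variable names are made up for illustration.

```python
import re

original = "Hi Jan,\nThanks for the tutorial."

# ^ in multi-line mode matches the zero-length position at the start of
# every line, so the replacement text is inserted and nothing is deleted.
quoted = re.sub(r'^', '> ', original, flags=re.MULTILINE)

print(quoted)
```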

Strings Ending with a Line Break

Even though \Z and $ only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception. If the string ends with a line break, then \Z and $ will match at the position before that line break, rather than at the very end of the string. This "enhancement" was introduced by Perl, and is copied by many regex flavors, including Java, .NET and PCRE. In Perl, when reading a line from a file, the resulting string will end with a line break. Reading a line from a file with the text "joe" results in the string "joe\n". When applied to this string, both ^[a-z]+$ and \A[a-z]+\Z will match "joe".

If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A[a-z]+\z does not match "joe\n". \z matches after the line break, which is not matched by the character class.
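Flavors differ in how they spell these anchors, so it pays to test. As one illustration (an assumption of this sketch, since the text above describes Perl-style flavors): in Python's re module, $ has the same "before a final line break" exception, but Python's \Z matches only at the very end of the string, so it behaves like the \z described above.

```python
import re

line = "joe\n"

# $ may match just before the trailing line break.
dollar_match = re.search(r'^[a-z]+$', line)

# In Python's re module, \Z matches only at the absolute end of the
# string, so the trailing "\n" makes this attempt fail.
strict_match = re.search(r'\A[a-z]+\Z', line)
```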

Looking Inside the Regex Engine

Let's see what happens when we try to match ^4$ to "749\n486\n4" (where \n represents a newline character) in multi-line mode. As usual, the regex engine starts at the first character: "7". The first token in the regular expression is ^. Since this token is a zero-width token, the engine does not try to match it with the character, but rather with the position before the character that the regex engine has reached so far. ^ indeed matches the position before "7". The engine then advances to the next regex token: 4. Since the previous token was zero-width, the regex engine does not advance to the next character in the string. It remains at "7". 4 is a literal character, which does not match "7". There are no other permutations of the regex, so the engine starts again with the first regex token, at the next character: "4". This time, ^ cannot match at the position before the 4. This position is preceded by a character, and that character is not a newline. The engine continues at "9", and fails again. The next attempt, at "\n", also fails. Again, the position before "\n" is preceded by a character, "9", and that character is not a newline.

Then, the regex engine arrives at the second "4" in the string. The ^ can match at the position before the "4", because it is preceded by a newline character. Again, the regex engine advances to the next regex token, 4, but does not advance the character position in the string. 4 matches "4", and the engine advances both the regex token and the string character. Now the engine attempts to match $ at the position before (indeed: before) the "8". The dollar cannot match here, because this position is followed by a character, and that character is not a newline.

Yet again, the engine must try to match the first token again. Previously, it was successfully matched at the second "4", so the engine continues at the next character, "8", where the caret does not match. Same at the six and the newline.

Finally, the regex engine tries to match the first token at the third "4" in the string. With success. After that, the engine successfully matches 4 with "4". The current regex token is advanced to $, and the current character is advanced to the very last position in the string: the void after the string. No regex token that needs a character to match can match here. Not even a negated character class. However, we are trying to match a dollar sign, and the mighty dollar is a strange beast. It is zero-width, so it will try to match the position before the current character. It does not matter that this "character" is the void after the string. In fact, the dollar will check the current character. It must be either a newline, or the void after the string, for $ to match the position before the current character. Since that is the case after the example, the dollar matches successfully.

Since $ was the last token in the regex, the engine has found a successful match: the last "4" in the string.

Another Inside Look

Earlier I mentioned that ^\d*$ would successfully match an empty string. Let's see why.

There is only one "character" position in an empty string: the void after the string. The first token in the regex is ^. It matches the position before the void after the string, because it is preceded by the void before the string. The next token is \d*. As we will see later, one of the star's effects is that it makes the \d, in this case, optional. The engine will try to match \d with the void after the string. That fails, but the star turns the failure of the \d into a zero-width success. The engine will proceed with the next regex token, without advancing the position in the string. So the engine arrives at $, and the void after the string. We already saw that those match. At this point, the entire regex has matched the empty string, and the engine reports success.

Caution for Programmers

A regular expression such as $ all by itself can indeed match after the string. If you query the engine for the character position, it will return the length of the string if string indices are zero-based, or the length+1 if string indices are one-based in your programming language. If you query the engine for the length of the match, it will return zero.

What you have to watch out for is that String[Regex.MatchPosition] may cause an access violation or segmentation fault, because MatchPosition can point to the void after the string. This can also happen with ^ and ^$ if the last character in the string is a newline.
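The following sketch shows the same situation in Python, where indices are zero-based and out-of-range indexing raises an IndexError rather than crashing. The guard expression and variable names are assumptions chosen for illustration.

```python
import re

subject = "abc"
match = re.search(r'$', subject)

position = match.start()              # points at the void after the string
length = match.end() - match.start()  # zero-length match

# Guard before indexing: subject[position] would raise IndexError here.
char_at_match = subject[position] if position < len(subject) else None

# ^\d*$ happily accepts the empty string.
empty_accepted = re.search(r'^\d*$', '') is not None
```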

7. Word Boundaries

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

There are four different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between a word character and a non-word character following right after the word character.
Between a non-word character and a word character following right after the non-word character.

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters". The exact list of characters is different for each regex flavor, but all word characters are always matched by the short-hand character class \w. All non-word characters are always matched by \W.

In Perl and the other regex flavors discussed in this tutorial, there is only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

Note that \w usually also matches digits. So \b4\b can be used to match a 4 that is not part of a larger number. This regex will not match "44 sheets of a4". So saying "\b matches before and after an alphanumeric sequence" is more exact than saying "before and after a word".

Negated Word Boundary

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
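A short Python sketch can confirm both claims; the second sample string and the variable names are assumptions added for illustration.

```python
import re

# \b4\b only matches a 4 that stands alone as a "word".
no_match = re.findall(r'\b4\b', '44 sheets of a4')
standalone = re.findall(r'\b4\b', 'page 4 of 44')

# \B matches exactly where \b does not: here, between "a" and "b"
# and between "c" and "d".
non_boundaries = [m.start() for m in re.finditer(r'\B', 'ab cd')]
```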

Looking Inside the Regex Engine

Let's see what happens when we apply the regex \bis\b to the string "This island is beautiful". The engine starts with the first token \b at the first character "T". Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-width. i does not match "T", so the engine retries the first token at the next character position.

\b cannot match at the position between the "T" and the "h". It cannot match between the "h" and the "i" either, and neither between the "i" and the "s".

The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.

Advancing a character and restarting with the first regex token, \b matches between the space and the second "i" in the string. Continuing, the regex engine finds that i matches "i" and s matches "s". Now, the engine tries to match the second \b at the position before the "l". This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the "s" in "island". Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.

But \b matches at the position before the third "i" in the string. The engine continues, and finds that i matches "i" and s matches "s". The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word "is" in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the "is" in "This".
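The outcome of this walkthrough can be verified with a few lines of Python (a sketch for illustration; the variable names are assumptions). Offsets are zero-based.

```python
import re

subject = "This island is beautiful"

# With word boundaries, only the standalone word "is" qualifies.
whole_word = re.search(r'\bis\b', subject)

# Without them, the first "is" found is the one inside "This".
plain = re.search(r'is', subject)

print(whole_word.start(), plain.start())
```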

Tcl Word Boundaries

Word boundaries, as described above, are supported by all regular expression flavors described in this book, except for the two POSIX RE flavors and the Tcl regexp command. POSIX does not support word boundaries at all. Tcl uses a different syntax.

In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl's). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).

Tcl uses the letter "y" instead of the letter "b" to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don't discriminate between the start and the end of a word.

Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.

The only regex engine that supports Tcl-style word boundaries (besides Tcl itself) is the JGsoft engine. In PowerGREP and EditPad Pro, \b and \B are Perl-style word boundaries, and \y, \Y, \m and \M are Tcl-style word boundaries.

In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds "whole words only" occurrences of "word" just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters. E.g. if you want to match any word, \y\w+\y will give the same result as \m.+\M. Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.

If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl's \m and (?<=\w)(?!\w) to emulate \M.

<[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. The sharp brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it's OK if the second character class matches nothing. So our regex will match a tag like "<B>". When matching "<HTML>", the first character class will match "H". The star will cause the second character class to be repeated three times, matching "T", "M" and "L" with each step.

I could also have used <[A-Za-z0-9]+>. I did not, because this regex would match "<1>", which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.

Limiting Repetition

Modern regex flavors, like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a non-negative integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
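To see the role of the word boundaries, the two regexes can be tried on numbers just inside and just outside each range. This is a Python sketch; the test strings are assumptions chosen for illustration.

```python
import re

# Exactly four digits, the first one non-zero: 1000 to 9999.
four_digits = re.findall(r'\b[1-9][0-9]{3}\b', '999 1000 9999 10000')

# Three to five digits, the first one non-zero: 100 to 99999.
three_to_five = re.findall(r'\b[1-9][0-9]{2,4}\b', '99 100 99999 100000')
```

Without the \b tokens, the first regex would also find "1000" inside "10000".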

Watch Out for The Greediness!

Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like "This is a <EM>first</EM> test". You might expect the regex to match "<EM>" and when continuing after that match, "</EM>".

But it does not. The regex will match "<EM>first</EM>". Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex. Let's take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.

Like the plus, the star and the repetition using curly braces are greedy.

Looking Inside The Regex Engine

The first token in the regex is <. This is a literal, and the first place where it can match is the first "<" in the string. The next token is the dot, repeated by the greedy plus, so the engine repeats the dot as many times as it can. The dot matches "E", then "M", and the next character is the ">". You should see the problem by now. The dot matches the ">", and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: >.

So far, <.+ has matched "<EM>first</EM> test" and the engine has arrived at the end of the string. > cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plus requires the dot to match only once.) Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.

So the match of .+ is reduced to "EM>first</EM> tes". The next token in the regex is still >. But now the next character in the string is the last "t". Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to "<EM>first</EM> te". But > still cannot match. So the engine continues backtracking until the match of .+ is reduced to "EM>first</EM". Now, > can match the next character in the string. The last token in the regex has been matched. The engine reports that "<EM>first</EM>" has been successfully matched.

Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.
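
Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes <.+?>. Let's have another look inside the regex engine.

Again, < matches the first "<" in the string. The next token is the dot, this time repeated by a lazy plus, which tells the engine to repeat the dot as few times as possible. The minimum is one, so the engine matches the dot with "E" and then continues with > and "M". This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to "EM", and the engine tries again to continue with >. Now, ">" is matched successfully. The last token in the regex has been matched. The engine reports that "<EM>" has been successfully matched. That's more like it.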

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when such a regex is used repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.

Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do not get the speed penalty, but they also do not support lazy repetition operators.
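The three variants discussed in this chapter can be compared side by side. This is a minimal Python sketch; the exact patterns <.+>, <.+?> and <[^>]+> and the sample string are taken as assumptions here, matching the chapter's description of a greedy plus, a lazy plus, and a negated character class.

```python
import re

html = "This is a <EM>first</EM> test"

greedy = re.findall(r'<.+>', html)      # the plus grabs as much as it can
lazy = re.findall(r'<.+?>', html)       # the lazy plus stops at the first ">"
negated = re.findall(r'<[^>]+>', html)  # cannot run past ">" in the first place
```

The lazy and negated versions return the same matches on this input; they differ in how much backtracking the engine performs to get there.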

Repeating \Q...\E Escape Sequences

The \Q...\E sequence escapes a string of characters, matching them as literal characters. The JGsoft engine, Perl and PCRE treat the escaped characters as individual characters. If you place a quantifier after the \E, it will only be applied to the last character. E.g. if you apply \Q*\d+*\E+ to "*\d+**\d+*", the match will be "*\d+**". Only the asterisk is repeated. (The plus repeats a token one or more times, as I'll explain later in this tutorial.)

The Java engine, however, applies the quantifier to the whole \Q...\E sequence. So in Java, the above example matches the whole subject string "*\d+**\d+*".

If you want Java to return the same match as Perl, you'll need to split off the asterisk from the escape sequence, like this: \Q*\d+\E\*+. If you want Perl to repeat the whole sequence like Java does, simply group it: (?:\Q*\d+*\E)+.
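Python's re module has no \Q...\E syntax, so the sketch below swaps in re.escape() to build the literal part of the pattern. That substitution is an assumption of this example, not something the tutorial covers; it reproduces both behaviors described above.

```python
import re

literal = r'*\d+*'
subject = r'*\d+**\d+*'

# Quantifier placed directly after the escaped text: it applies only
# to the last escaped character (the asterisk), as in Perl.
last_char_repeated = re.match(re.escape(literal) + '+', subject).group()

# Grouping the escaped text first makes the quantifier repeat all of it.
whole_repeated = re.match('(?:' + re.escape(literal) + ')+', subject).group()
```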

11. Use Round Brackets for Grouping

By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. I have already used round brackets for this purpose in previous topics throughout this tutorial.

Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.

Round Brackets Create a Backreference

Besides grouping part of a regular expression together, round brackets also create a "backreference". A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.

That is, unless you use non-capturing parentheses. Remembering part of the regex match in a backreference slows down the regex engine because it has more work to do. If you do not use the backreference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.

The regex Set(Value)? matches "Set" or "SetValue". In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain "Value".

If you do not use the backreference, you can optimize this regular expression into Set(?:Value)?. The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex. That question mark is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.
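Here is how the capturing and non-capturing versions look in Python, offered as an illustrative sketch. The patterns Set(Value)? and Set(?:Value)? are assumed from the description above. Note one flavor difference: Python reports a group that did not participate in the match as None rather than as an empty string.

```python
import re

capturing = re.compile(r'Set(Value)?')
non_capturing = re.compile(r'Set(?:Value)?')

# The optional group took part in the match only for "SetValue".
group_for_set = capturing.search('Set').group(1)
group_for_setvalue = capturing.search('SetValue').group(1)

# The non-capturing version matches the same text but stores no group.
overall = non_capturing.search('SetValue').group()
group_count = non_capturing.groups
```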

How to Use Backreferences

Backreferences allow you to reuse part of the regex match. You can reuse it inside the regular expression (see below), or afterwards. What you can do with it afterwards depends on the tool you are using. In EditPad Pro or PowerGREP, you can use the backreference in the replacement text during a search-and-replace operation by typing \1 (backslash one) into the replacement text. If you searched for EditPad (Lite|Pro) and use "\1 version" as the replacement, the actual replacement will be "Lite version" in case "EditPad Lite" was matched, and "Pro version" in case "EditPad Pro" was matched.

EditPad Pro and PowerGREP have a unique feature that allows you to change the case of the backreference. \U1 inserts the first backreference in uppercase, \L1 in lowercase and \F1 with the first character in uppercase and the remainder in lowercase. Finally, \I1 inserts it with the first letter of each word capitalized, and the other letters in lowercase.

Regex libraries in programming languages also provide access to the backreference. In Perl, you can use the magic variables $1, $2, etc. to access the part of the string matched by the backreference. In .NET (dot net), you can use the Match object that is returned by the Match method of the Regex class. This object has a property called Groups, which is a collection of Group objects. To get the string matched by the third backreference in C#, you can use MyMatch.Groups[3].Value.

The .NET (dot net) Regex class also has a method Replace that can do a regex-based search-and-replace on a string. In the replacement text, you can use $1, $2, etc. to insert backreferences.

To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are not counted. This fact means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex regular expression.
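The same replacement and numbering rules can be tried in Python, where the replacement text also uses \1. This is a sketch for illustration; the sample sentences and the second pattern are assumptions.

```python
import re

pattern = re.compile(r'EditPad (Lite|Pro)')

# \1 in the replacement text inserts whatever the first group matched.
replaced = pattern.sub(r'\1 version', 'I use EditPad Pro daily')

# Non-capturing parentheses are skipped when numbering the groups.
numbered = re.search(r'(?:Edit)(Pad) (Lite|Pro)', 'EditPad Lite')
first_group = numbered.group(1)
second_group = numbered.group(2)
```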

The Entire Regex Match As Backreference Zero

Certain tools make the entire regex match available as backreference zero. In EditPad Pro or PowerGREP, you can use the entire regex match in the replacement text during a search and replace operation by typing \0 (backslash zero) into the replacement text. In Perl, the magic variable $& holds the entire regex match.

In libraries like .NET (dot net), where backreferences are made available as an array or numbered list, the item with index zero holds the entire regex match. Using backreference zero is more efficient than putting an extra pair of round brackets around the entire regex, because that would force the engine to continuously keep an extra copy of the entire regex match.

Using Backreferences in The Regular Expression

Backreferences can not only be used after a match has been found, but also during the match. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag. Here's how: <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>. This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.

You can reuse the same backreference more than once. ([a-c])x\1x\1 will match "axaxa", "bxbxb" and "cxcxc". If a backreference was not used in a particular match attempt (such as in the first example where the question mark made the first backreference optional), it is simply empty. Using an empty backreference in the regex is perfectly fine. It will simply be replaced with nothingness.

A backreference cannot be used inside itself. ([abc]\1) will not work. Depending on your regex flavor, it will either give an error message, or it will fail to match anything without an error message. Therefore, \0 cannot be used inside a regex, only in the replacement.
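A short Python sketch of backreferences used inside the regex follows. The tag-pair pattern is an assumption reconstructed from the description above (one capturing group around [A-Z][A-Z0-9]*, reused as \1 in the closing tag); the sample strings and variable names are also illustrative.

```python
import re

# Capture the tag name, then require the same name in the closing tag.
tag_pair = re.compile(r'<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>')
found = tag_pair.search('Testing <B><I>bold italic</I></B> text')

# The same backreference can be used more than once.
candidates = ['axaxa', 'bxbxb', 'cxcxc', 'axbxa']
repeated = [s for s in candidates if re.fullmatch(r'([a-c])x\1x\1', s)]
```

The closing </I> is skipped because \1 holds "B" at that point, so the match extends to </B>.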
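
Looking Inside The Regex Engine

Let's see how the regex engine applies the above regex to the string "Testing <B><I>bold italic</I></B> text". The first token in the regex is the literal <. The regex engine traverses the string until it can match at the first "<" in the string. The next token is [A-Z], inside the first pair of capturing parentheses, and it matches "B". The engine advances to [A-Z0-9] and ">". This match fails. However, because of the star, that's perfectly fine. The position in the string remains at ">". The position in the regex is advanced to [^>].

This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine to store what was matched inside them into the first backreference. In this case, "B" is stored.

After storing the backreference, the engine proceeds with the match attempt. [^>] does not match ">". Again, because of another star, this is not a problem. The position in the string remains at ">", and the position in the regex is advanced to >. These obviously match. The next token is a dot, repeated by a lazy star. Because of the laziness, the regex engine will initially skip this token, taking note that it should backtrack in case the remainder of the regex fails.

The engine has now arrived at the second < in the regex, and the second "<" in the string. These match, but the next token, /, does not match "I", so the engine backtracks and lets the lazy dot consume one more character. This continues until the dot has consumed "<I>bold italic". At this point, < matches the third "<" in the string and / matches "/". The next token is \1. The engine does not substitute the backreference into the regular expression; each time it arrives at \1, it reads the value stored at that moment, which is "B". This fails to match "I", so the engine backtracks again.

Backtracking continues until the dot has consumed "<I>bold italic</I>". At this point, < matches "<", / matches "/", and \1, which still holds "B", matches "B". The last token in the regex, >, matches ">". A complete match has been found: "<B><I>bold italic</I></B>".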

Repetition and Backreferences

As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between ([abc]+) and ([abc])+. Though both successfully match "cab", the first regex will put "cab" into the first backreference, while the second regex will only store "b". That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, "c" was stored. The second time "a" and the third time "b". Each time, the previous value was overwritten, so "b" remains.

This also means that ([abc]+)=\1 will match "cab=cab", and that ([abc])+=\1 will not. The reason is that when the engine arrives at \1, it holds b which fails to match "c". Obvious when you look at a simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.
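The difference between repeating inside and outside the group is easy to confirm in Python (an illustrative sketch; variable names are assumptions).

```python
import re

# Quantifier inside the group: the whole run is captured.
inside = re.search(r'([abc]+)', 'cab').group(1)

# Quantifier outside the group: each iteration overwrites the last,
# so only the final character remains.
outside = re.search(r'([abc])+', 'cab').group(1)

matches_with_inside = re.search(r'([abc]+)=\1', 'cab=cab') is not None
matches_with_outside = re.search(r'([abc])+=\1', 'cab=cab') is not None
```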

Useful Example: Checking for Doubled Words

When editing text, doubled words such as "the the" easily creep in. Using the regex \b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word, simply type in "\1" as the replacement text and click the Replace button.
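Outside a text editor, the same search-and-replace can be scripted. A minimal Python sketch, with a made-up sample sentence:

```python
import re

text = "It is is the the best"

# Replace each doubled word with a single copy of the captured word.
cleaned = re.sub(r'\b(\w+)\s+\1\b', r'\1', text)

print(cleaned)
```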

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Round brackets cannot be used inside character classes, at least not as metacharacters. When you put a round bracket in a character class, it is treated as a literal character. So the regex [(a)b] matches "a", "b", "(" and ")".

Backreferences also cannot be used inside a character class. The \1 in a regex like (a)[\1b] will be interpreted as an octal escape in most regex flavors. So this regex will match an "a" followed by either \x01 or a b.

12. Named Capturing Groups

All modern regular expression engines support capturing groups, which are numbered from left to right, starting with one. The numbers can then be used in backreferences to match the same text again in the regular expression, or to use part of the regex match for further processing. In a complex regular expression with many capturing groups, the numbering can get a little confusing.

Named Capture with Python, PCRE and PHP

Python's regex module was the first to offer a solution: named capture. By assigning a name to a capturing group, you can easily reference it by name. (?P<name>group) captures the match of group into the backreference "name". You can reference the contents of the group with the numbered backreference \1 or the named backreference (?P=name).

The open source PCRE library has followed Python's example, and offers named capture using the same syntax. The PHP preg functions offer the same functionality, since they are based on PCRE.

Python's sub() function allows you to reference a named group as "\1" or "\g<name>". This does not work in PHP. In PHP, you can use double-quoted string interpolation with the $regs parameter you passed to preg_match(): "$regs['name']".
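A brief Python sketch of the three pieces of syntax mentioned above: (?P<name>...) to capture, (?P=name) to refer back inside the regex, and \g<name> in the replacement text. The group names and sample strings are assumptions for illustration.

```python
import re

# Named capture plus a named backreference inside the regex.
doubled = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', 'Paris in the the spring')
by_name = doubled.group('word')
by_number = doubled.group(1)   # named groups are numbered as well

# Named references in the replacement text of sub().
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')
```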

Named Capture with .NET's System.Text.RegularExpressions

The regular expression classes of the .NET framework also support named capture. Unfortunately, the Microsoft developers decided to invent their own syntax, rather than follow the one pioneered by Python. Currently, no other regex flavor supports Microsoft's version of named capture.

Here is an example with two capturing groups in .NET style: (?<first>group)(?'second'group). As you can see, .NET offers two syntaxes to create a capturing group: one using sharp brackets, and the other using single quotes. The first syntax is preferable in strings, where single quotes may need to be escaped. The second syntax is preferable in ASP code, where the sharp brackets are used for HTML tags. You can use the pointy bracket flavor and the quoted flavor interchangeably.

To reference a capturing group inside the regex, use \k<name> or \k'name'. Again, you can use the two syntactic variations interchangeably.

When doing a search-and-replace, you can reference the named group with the familiar dollar sign syntax: "${name}". Simply use a name instead of a number between the curly braces.

Names and Numbers for Capturing Groups

Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex (a)(?P<x>b)(c)(?P<y>d) matches "abcd" as expected. If you do a search-and-replace with this regex and the replacement "\1\2\3\4", you will get "abcd". All four groups were numbered from left to right, from one till four. Easy and logical.

Things are quite a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches "abcd". However, if you do a search-and-replace with "$1$2$3$4" as the replacement, you will get "acbd". Probably not what you expected.

The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.

To make things simple, when using .NET's regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively. To keep things compatible across regex flavors, I strongly recommend that you do not mix named and unnamed capturing groups at all. Either give a group a name, or make it non-capturing as in (?:nocapture). Non-capturing groups are more efficient, since the regex engine does not need to keep track of their matches.

Other Regex Flavors

EditPad Pro and PowerGREP support both the Python syntax and the .NET syntax for named capture. However, they will number named groups along with unnamed capturing groups, just like Python does.

RegexBuddy also supports both Python's and Microsoft's style. RegexBuddy will convert one flavor of named capture into the other when generating source code snippets for Python, PHP/preg, PHP, or one of the .NET languages.

None of the other regex flavors discussed in this book support named capture.

13. Unicode Regular Expressions

Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead. With more and more software being required to support multiple languages, or even just any language, Unicode has been strongly gaining popularity in recent years. Using different character sets for different languages is simply too cumbersome for programmers and users.

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors discussed in this tutorial, Java, XML and the .NET framework use Unicode-based regex engines. Perl supports Unicode starting with version 5.6. PCRE can optionally be compiled with Unicode support. Note that PCRE is far less flexible in what it allows for the \p tokens, despite its name "Perl-compatible". The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.

RegexBuddy's regex engine is fully Unicode-based starting with version 2.0.0. RegexBuddy 1.x.x did not support Unicode at all. PowerGREP uses the same Unicode regex engine starting with version 3.0.0. Earlier versions would convert Unicode files to ANSI prior to grepping with an 8-bit (i.e. non-Unicode) regex engine. EditPad Pro supports Unicode starting with version 6.0.0.
Characters, Code Points and Graphemes or How Unicode Makes a Mess of Things
Most people would consider "à" a single character. Unfortunately, it need not be, depending on the meaning of the word "character".
All Unicode regex engines discussed in this tutorial treat any single Unicode code point as a single character. When this tutorial tells you that the dot matches any single character, this translates into Unicode parlance as "the dot matches any single Unicode code point". In Unicode, "à" can be encoded as two code points: U+0061 (a) followed by U+0300 (grave accent). In this situation, . applied to "à" will match "a" without the accent. ^.$ will fail to match, since the string consists of two code points. ^..$ matches "à".
The Unicode code point U+0300 (grave accent) is a combining mark. Any code point that is not a combining mark can be followed by any number of combining marks. This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.
Unfortunately, "à" can also be encoded with the single Unicode code point U+00E0 (a with grave accent). The reason for this duality is that many historical character sets encode "a with grave accent" as a single character. Unicode's designers thought it would be useful to have a one-on-one mapping with popular legacy character sets, in addition to the Unicode way of separating marks and base letters (which makes arbitrary combinations not supported by legacy character sets possible).
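A short Python sketch makes the two encodings concrete. Python is not one of the flavors discussed in this tutorial, but its re module also treats one code point as one character when matching str patterns, so it shows the same behavior of the dot. The variable names are chosen for this example.

```python
import re

decomposed = "a\u0300"   # U+0061 followed by the combining mark U+0300
precomposed = "\u00e0"   # the single code point U+00E0

# The dot matches one code point: just the "a", without the accent.
dot_on_decomposed = re.match(r".", decomposed).group()

# ^.$ fails on the two-code-point form, while ^..$ succeeds.
one_dot = re.fullmatch(r".", decomposed)
two_dots = re.fullmatch(r"..", decomposed)

# The precomposed form is a single code point, so one dot is enough.
one_dot_precomposed = re.fullmatch(r".", precomposed)

print(len(decomposed), len(precomposed))
print(repr(dot_on_decomposed), one_dot, two_dots, one_dot_precomposed)
```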
How to Match a Single Unicode Grapheme
Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, RegexBuddy and PowerGREP: simply use \X. You can consider \X the Unicode version of the dot in regex engines that use plain ASCII. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+.
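The idea behind \P{M}\p{M}* can also be written out by hand. The sketch below is an illustration in Python, whose standard re module has no \p tokens, so it consults the unicodedata module instead of a regex. The function name split_graphemes is hypothetical. Unlike the regex, it starts a new cluster for a mark that has no base character in front of it.

```python
import unicodedata

def split_graphemes(text):
    """Group each non-mark code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            # A combining mark is attached to the cluster before it.
            clusters[-1] += ch
        else:
            # Any other code point starts a new cluster.
            clusters.append(ch)
    return clusters

print(split_graphemes("a\u0300b"))   # decomposed à followed by b
print(split_graphemes("\u00e0b"))    # precomposed à followed by b
```

Both calls yield two clusters, which is what a loop over \X or \P{M}\p{M}* would report as two graphemes.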
Matching a Specific Code Point
To match a specific Unicode code point, use \uFFFF where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits. E.g. \u00E0 matches "à", but only when encoded as a single code point U+00E0.
Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.
In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. With canonical equivalence turned on, Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of "à", while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.
JavaScript, which does not offer any Unicode support through its RegExp class, does support \uFFFF for matching a single Unicode code point as part of its string syntax.
XML Schema does not have a regex token for matching Unicode code points. However, you can easily use XML numeric character references (entities) to insert literal code points into your regular expression.
Unicode Character Properties
In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to a particular category with \p{}. You can match a single character not belonging to a particular category with \P{}.
Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is "à" encoded as U+0061 U+0300, it matches "a" without the accent. If the input is "à" encoded as U+00E0, it matches "à" with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".
You should now understand why \P{M}\p{M}* is the equivalent of \X. \P{M} matches a code point that is not a combining mark, while \p{M}* matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*. This last regex will always match "à", regardless of how it is encoded.
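The categories mentioned above can be looked up directly. The following Python sketch uses the standard unicodedata module rather than a regex, since Python's re module does not support \p{L}. The helper is_letter_with_marks is a hypothetical stand-in for matching an entire string against \p{L}\p{M}*.

```python
import unicodedata

# General category of each code point discussed in the text.
categories = {f"U+{ord(ch):04X}": unicodedata.category(ch)
              for ch in "a\u00e0\u0300"}
print(categories)

def is_letter_with_marks(text):
    """True if text is one letter followed only by combining marks."""
    if not text or not unicodedata.category(text[0]).startswith("L"):
        return False
    return all(unicodedata.category(ch).startswith("M") for ch in text[1:])

# Both encodings of "à" pass, regardless of how the accent is stored.
print(is_letter_with_marks("\u00e0"), is_letter_with_marks("a\u0300"))
```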
The .NET Regex class and PCRE are case sensitive when they check the part between curly braces of a \p token. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. All other regex engines described in this tutorial will match the space in both cases, ignoring the case of the property between the curly braces. Still, I recommend you make a habit of using the same uppercase and lowercase combination as I did in the list of properties below. This will make your regular expressions work with all Unicode regex engines.
In addition to the standard notation, \p{L}, Java, Perl, PCRE and the JGsoft engine allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches "Al" or "àl" or any Unicode letter followed by a literal "l".
Perl and the JGsoft engine also support the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the underscores or use hyphens or spaces instead.
• \p{L} or \p{Letter}: any kind of letter from any language.
  o \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  o \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  o \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
  o \p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
  o \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
  o \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
• \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
  o \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character that does not take up extra space (e.g. accents, umlauts, etc.).
  o \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  o \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
• \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
  o \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
  o \p{Zl} or \p{Line_Separator}: line separator character U+2028.
  o \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
• \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
  o \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
  o \p{Sc} or \p{Currency_Symbol}: any currency sign.
  o \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
  o \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
• \p{N} or \p{Number}: any kind of numeric character in any script.
  o \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
  o \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
  o \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0..9 (excluding numbers from ideographic scripts).
• \p{P} or \p{Punctuation}: any kind of punctuation character.
  o \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
  o \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
  o \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
  o \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
  o \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
  o \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
  o \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
• \p{C} or \p{Other}: invisible control characters and unused code points.
  o \p{Cc} or \p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
  o \p{Cf} or \p{Format}: invisible formatting indicator.
  o \p{Co} or \p{Private_Use}: any code point reserved for private use.
  o \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
  o \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
Unicode Scripts
The Unicode standard places each assigned code point (character) into one script. A script is a group of code points used by a particular human writing system. Some scripts like Thai correspond with a single human language. Other scripts like Latin span multiple languages.
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.
A special script is the Common script. This script contains all sorts of characters that are common to a wide range of scripts. It includes all sorts of punctuation, whitespace and miscellaneous symbols.
All assigned Unicode code points (those matched by \P{Cn}) are part of exactly one Unicode script. All unassigned Unicode code points (those matched by \p{Cn}) are not part of any Unicode script at all.
Very few regular expression engines support Unicode scripts today. Of all the flavors discussed in this tutorial, only the JGsoft engine, Perl and PCRE can match Unicode scripts. Here's a complete list of all Unicode scripts:
1. \p{Common}
2. \p{Arabic}
3. \p{Armenian}
4. \p{Bengali}
5. \p{Bopomofo}
6. \p{Braille}
7. \p{Buhid}
8. \p{CanadianAboriginal}
9. \p{Cherokee}
10. \p{Cyrillic}
11. \p{Devanagari}
12. \p{Ethiopic}
13. \p{Georgian}
14. \p{Greek}
15. \p{Gujarati}
16. \p{Gurmukhi}
17. \p{Han}
18. \p{Hangul}
19. \p{Hanunoo}
20. \p{Hebrew}
21. \p{Hiragana}
22. \p{Inherited}
23. \p{Kannada}
24. \p{Katakana}
25. \p{Khmer}
26. \p{Lao}
27. \p{Latin}
28. \p{Limbu}
29. \p{Malayalam}
30. \p{Mongolian}
31. \p{Myanmar}
32. \p{Ogham}
33. \p{Oriya}
34. \p{Runic}
35. \p{Sinhala}
36. \p{Syriac}
37. \p{Tagalog}
38. \p{Tagbanwa}
39. \p{TaiLe}
40. \p{Tamil}
41. \p{Telugu}
42. \p{Thaana}
43. \p{Thai}
44. \p{Tibetan}
45. \p{Yi}
Instead of the \p{Latin} syntax you can also use \p{IsLatin}. The "Is" syntax is useful for distinguishing between scripts and blocks, as explained in the next section. Unfortunately, PCRE does not support "Is" as of this writing.
Unicode Blocks
The Unicode standard divides the Unicode character map into different blocks or ranges of code points. Each block is used to define characters of a particular script like "Tibetan" or belonging to a particular group like "Braille Patterns". Most blocks include unassigned code points, reserved for future expansion of the Unicode standard.
Note that Unicode blocks do not correspond 100% with scripts. An essential difference between blocks and scripts is that a block is a single contiguous range of code points, as listed below. Scripts consist of characters taken from all over the Unicode character map. Blocks may include unassigned code points (i.e. code points matched by \p{Cn}). Scripts never include unassigned code points. Generally, if you're not sure whether to use a Unicode script or Unicode block, use the script.
E.g. the Currency block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks instead, for historical reasons, even though both are currency symbols, and the yen symbol is not a Latin character. You should not blindly use any of the blocks listed below based on their names. Instead, look at the ranges of characters they actually match. A tool like RegexBuddy can be very helpful with this. E.g. the Unicode property \p{Sc} or \p{Currency_Symbol} would be a better choice than the Unicode block \p{InCurrency_Symbols} when trying to find all currency symbols.
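The difference between the currency block and the currency property is easy to check. This Python sketch compares the code point range U+20A0..U+20CF, taken from the block list in this chapter, with the general category Sc reported by the standard unicodedata module. The helper names are made up for this example.

```python
import unicodedata

# The Currency Symbols block as a plain range of code points.
CURRENCY_BLOCK = range(0x20A0, 0x20D0)   # U+20A0..U+20CF inclusive

def in_currency_block(ch):
    return ord(ch) in CURRENCY_BLOCK

def is_currency_symbol(ch):
    # Category "Sc" is what \p{Sc} or \p{Currency_Symbol} matches.
    return unicodedata.category(ch) == "Sc"

for ch in "$\u00a5\u20ac":   # dollar, yen, euro
    print(f"U+{ord(ch):04X}", in_currency_block(ch), is_currency_symbol(ch))
```

The dollar and yen signs have the currency property but lie outside the block, while the euro sign satisfies both tests.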
1. \p{InBasic_Latin}: U+0000..U+007F
2. \p{InLatin-1_Supplement}: U+0080..U+00FF
3. \p{InLatin_Extended-A}: U+0100..U+017F
4. \p{InLatin_Extended-B}: U+0180..U+024F
5. \p{InIPA_Extensions}: U+0250..U+02AF
6. \p{InSpacing_Modifier_Letters}: U+02B0..U+02FF
7. \p{InCombining_Diacritical_Marks}: U+0300..U+036F
8. \p{InGreek_and_Coptic}: U+0370..U+03FF
9. \p{InCyrillic}: U+0400..U+04FF
10. \p{InCyrillic_Supplementary}: U+0500..U+052F
11. \p{InArmenian}: U+0530..U+058F
12. \p{InHebrew}: U+0590..U+05FF
13. \p{InArabic}: U+0600..U+06FF
14. \p{InSyriac}: U+0700..U+074F
15. \p{InThaana}: U+0780..U+07BF
16. \p{InDevanagari}: U+0900..U+097F
17. \p{InBengali}: U+0980..U+09FF
18. \p{InGurmukhi}: U+0A00..U+0A7F
19. \p{InGujarati}: U+0A80..U+0AFF
20. \p{InOriya}: U+0B00..U+0B7F
21. \p{InTamil}: U+0B80..U+0BFF
22. \p{InTelugu}: U+0C00..U+0C7F
23. \p{InKannada}: U+0C80..U+0CFF
24. \p{InMalayalam}: U+0D00..U+0D7F
25. \p{InSinhala}: U+0D80..U+0DFF
26. \p{InThai}: U+0E00..U+0E7F
27. \p{InLao}: U+0E80..U+0EFF
28. \p{InTibetan}: U+0F00..U+0FFF
29. \p{InMyanmar}: U+1000..U+109F
30. \p{InGeorgian}: U+10A0..U+10FF
31. \p{InHangul_Jamo}: U+1100..U+11FF
32. \p{InEthiopic}: U+1200..U+137F
33. \p{InCherokee}: U+13A0..U+13FF
34. \p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400..U+167F
35. \p{InOgham}: U+1680..U+169F
36. \p{InRunic}: U+16A0..U+16FF
37. \p{InTagalog}: U+1700..U+171F
38. \p{InHanunoo}: U+1720..U+173F
39. \p{InBuhid}: U+1740..U+175F
40. \p{InTagbanwa}: U+1760..U+177F
41. \p{InKhmer}: U+1780..U+17FF
42. \p{InMongolian}: U+1800..U+18AF
43. \p{InLimbu}: U+1900..U+194F
44. \p{InTai_Le}: U+1950..U+197F
45. \p{InKhmer_Symbols}: U+19E0..U+19FF
46. \p{InPhonetic_Extensions}: U+1D00..U+1D7F
47. \p{InLatin_Extended_Additional}: U+1E00..U+1EFF
48. \p{InGreek_Extended}: U+1F00..U+1FFF
49. \p{InGeneral_Punctuation}: U+2000..U+206F
50. \p{InSuperscripts_and_Subscripts}: U+2070..U+209F
51. \p{InCurrency_Symbols}: U+20A0..U+20CF
52. \p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0..U+20FF
53. \p{InLetterlike_Symbols}: U+2100..U+214F
54. \p{InNumber_Forms}: U+2150..U+218F
55. \p{InArrows}: U+2190..U+21FF
56. \p{InMathematical_Operators}: U+2200..U+22FF
57. \p{InMiscellaneous_Technical}: U+2300..U+23FF
58. \p{InControl_Pictures}: U+2400..U+243F
59. \p{InOptical_Character_Recognition}: U+2440..U+245F
60. \p{InEnclosed_Alphanumerics}: U+2460..U+24FF
61. \p{InBox_Drawing}: U+2500..U+257F
62. \p{InBlock_Elements}: U+2580..U+259F
63. \p{InGeometric_Shapes}: U+25A0..U+25FF
64. \p{InMiscellaneous_Symbols}: U+2600..U+26FF
65. \p{InDingbats}: U+2700..U+27BF
66. \p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0..U+27EF
67. \p{InSupplemental_Arrows-A}: U+27F0..U+27FF
68. \p{InBraille_Patterns}: U+2800..U+28FF
69. \p{InSupplemental_Arrows-B}: U+2900..U+297F
70. \p{InMiscellaneous_Mathematical_Symbols-B}: U+2980..U+29FF
71. \p{InSupplemental_Mathematical_Operators}: U+2A00..U+2AFF
72. \p{InMiscellaneous_Symbols_and_Arrows}: U+2B00..U+2BFF
73. \p{InCJK_Radicals_Supplement}: U+2E80..U+2EFF
74. \p{InKangxi_Radicals}: U+2F00..U+2FDF
75. \p{InIdeographic_Description_Characters}: U+2FF0..U+2FFF
76. \p{InCJK_Symbols_and_Punctuation}: U+3000..U+303F
77. \p{InHiragana}: U+3040..U+309F
78. \p{InKatakana}: U+30A0..U+30FF
79. \p{InBopomofo}: U+3100..U+312F
80. \p{InHangul_Compatibility_Jamo}: U+3130..U+318F
81. \p{InKanbun}: U+3190..U+319F
82. \p{InBopomofo_Extended}: U+31A0..U+31BF
83. \p{InKatakana_Phonetic_Extensions}: U+31F0..U+31FF
84. \p{InEnclosed_CJK_Letters_and_Months}: U+3200..U+32FF
85. \p{InCJK_Compatibility}: U+3300..U+33FF
86. \p{InCJK_Unified_Ideographs_Extension_A}: U+3400..U+4DBF
87. \p{InYijing_Hexagram_Symbols}: U+4DC0..U+4DFF
88. \p{InCJK_Unified_Ideographs}: U+4E00..U+9FFF
89. \p{InYi_Syllables}: U+A000..U+A48F
90. \p{InYi_Radicals}: U+A490..U+A4CF
91. \p{InHangul_Syllables}: U+AC00..U+D7AF
92. \p{InHigh_Surrogates}: U+D800..U+DB7F
93. \p{InHigh_Private_Use_Surrogates}: U+DB80..U+DBFF
94. \p{InLow_Surrogates}: U+DC00..U+DFFF
95. \p{InPrivate_Use_Area}: U+E000..U+F8FF
96. \p{InCJK_Compatibility_Ideographs}: U+F900..U+FAFF
97. \p{InAlphabetic_Presentation_Forms}: U+FB00..U+FB4F
98. \p{InArabic_Presentation_Forms-A}: U+FB50..U+FDFF
99. \p{InVariation_Selectors}: U+FE00..U+FE0F
100. \p{InCombining_Half_Marks}: U+FE20..U+FE2F
101. \p{InCJK_Compatibility_Forms}: U+FE30..U+FE4F
102. \p{InSmall_Form_Variants}: U+FE50..U+FE6F
103. \p{InArabic_Presentation_Forms-B}: U+FE70..U+FEFF
104. \p{InHalfwidth_and_Fullwidth_Forms}: U+FF00..U+FFEF
105. \p{InSpecials}: U+FFF0..U+FFFF
Not all Unicode regex engines use the same syntax to match Unicode blocks. Perl and Java use the \p{InBlock} syntax as listed above.
.NET and XML use \p{IsBlock} instead. The JGsoft engine supports both notations. I recommend you use the "In" notation if your regex engine supports it. "In" can only be used for Unicode blocks, while "Is" can also be used for Unicode properties and scripts, depending on the regular expression flavor you're using. By using "In", it's obvious you're matching a block and not a similarly named property or script.
In .NET and XML, you must omit the underscores but keep the hyphens in the block names. E.g. use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. Perl and Java allow you to use an underscore, hyphen, space or nothing for each underscore or hyphen in the block's name. .NET and XML also compare the names case sensitively, while Perl and Java do not. \p{islatinextended-a} throws an error in .NET, while \p{inlatinextended-a} works fine in Perl and Java.
The JGsoft engine supports all of the above notations. You can use "In" or "Is", ignore differences in upper and lower case, and use spaces, underscores and hyphens as you like. This way you can keep using the syntax of your favorite programming language, and have it work as you'd expect in PowerGREP or EditPad Pro.
The actual names of the blocks are the same in all regular expression engines. The block names are defined in the Unicode standard. PCRE does not support Unicode blocks.
Alternative Unicode Regex Syntax
Unicode is a relatively new addition to the world of regular expressions. As you guessed from my explanations of different notations, different regex engine designers unfortunately have different ideas about the syntax to use. Perl and Java even support a few additional alternative notations that you may encounter in regular expressions created by others. I recommend against using these notations in your own regular expressions, to maintain clarity and compatibility with other regex flavors, and understandability by people more familiar with other flavors.
If you are just getting started with Unicode regular expressions, you may want to skip this section until later, to avoid confusion (if the above didn't confuse you already).
In Perl and PCRE regular expressions, you may encounter a Unicode property like \p{^Lu} or \p{^Letter}. These are negated properties identical to \P{Lu} or \P{Letter}. Since very few regex flavors support the \p{^L} notation, and all Unicode-compatible regex flavors (including Perl and PCRE) support \P{L}, I strongly recommend you use the latter syntax.
Perl (but not PCRE) and Java support the \p{IsL} notation, prefixing one-letter and two-letter Unicode property notations with "Is". Since very few regex flavors support the \p{IsL} notation, and all Unicode-compatible regex flavors (including Perl and Java) support \p{L}, I strongly recommend you use the latter syntax.
Perl and Java allow you to omit the "In" when matching Unicode blocks, so you can write \p{Arrows} instead of \p{InArrows}. Perl can also match Unicode scripts, and some scripts like "Hebrew" have the same name as a Unicode block. In that situation, Perl will match the Hebrew script instead of the Hebrew block when you write \p{Hebrew}. While there are no Unicode properties with the same names as blocks, the property \p{Currency_Symbol} is confusingly similar to the block \p{Currency}. As I explained in the section on Unicode blocks, the characters they match are quite different. To avoid all such confusion, I strongly recommend you use the "In" syntax for blocks, the "Is" syntax for scripts (if supported), and the shorthand syntax \p{Lu} for properties.
Again, the JGsoft engine supports all of the above oddball notations. This is only done to allow you to copy and paste regular expressions and have them work as they do in Perl or Java. You should consider these notations deprecated.
Do You Need To Worry About Different Encodings?
While you should always keep in mind the pitfalls created by the different ways in which accented characters can be encoded, you don't always have to worry about them. If you know that your input string and your regex use the same style, then you don't have to worry about it at all. Converting text to one consistent style is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, have library routines for normalizing strings. If you normalize both the subject and regex before attempting the match, there won't be any inconsistencies.
If you are using Java, you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells the Java regex engine to consider canonically equivalent characters as identical. E.g. the regex à encoded as U+00E0 will match "à" encoded as U+0061 U+0300, and vice versa. None of the other regex engines currently support canonical equivalence while matching.
If you type the à key on the keyboard, all word processors that I know of will insert the code point U+00E0 into the file. So if you're working with text that you typed in yourself, any regex that you type in yourself will match in the same way.
Finally, if you're using PowerGREP to search through text files encoded using a traditional Windows (often called "ANSI") or ISO-8859 code page, PowerGREP will always use the one-on-one substitution. Since all the Windows or ISO-8859 code pages encode accented characters as a single code point, all software that I know of will use a single Unicode code point for each character when converting the file to Unicode.
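Normalizing both the regex and the subject can be sketched with Python's standard unicodedata.normalize routine. The wrapper normalized_search is a hypothetical helper written for this example. It assumes the pattern contains only literal text, since normalizing a pattern that uses character classes or escapes could change its meaning.

```python
import re
import unicodedata

def normalized_search(pattern, subject, form="NFC"):
    """Search after bringing pattern and subject to the same normal form."""
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, subject))

pattern = "\u00e0"        # à as the single code point U+00E0
subject = "voila\u0300"   # ends in a + combining grave accent

plain = re.search(pattern, subject)               # encodings differ
composed = normalized_search(pattern, subject)    # both composed (NFC)
decomposed = normalized_search(pattern, subject, "NFD")  # both decomposed

print(plain, bool(composed), bool(decomposed))
```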
14. Regex Matching Modes
Most regular expression engines discussed in this tutorial support the following four matching modes:
• /i makes the regex match case insensitive.
• /s enables "single-line mode". In this mode, the dot matches newlines.
• /m enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string.
• /x enables "free-spacing mode". In this mode, whitespace between regex tokens is ignored, and an unescaped # starts a comment.
Two languages that don't support all of the above are JavaScript and Ruby. Some regex flavors also have additional modes or options that have single letter equivalents. These are very implementation-dependent.
Most tools that support regular expressions have checkboxes or similar controls that you can use to turn these modes on or off. Most programming languages allow you to pass option flags when constructing the regex object. E.g. in Perl, m/regex/i turns on case insensitivity, while Pattern.compile("regex", Pattern.CASE_INSENSITIVE) does the same in Java.
Specifying Modes Inside The Regular Expression
Sometimes, the tool or language does not provide the ability to specify matching options. E.g. the handy String.matches() method in Java does not take a parameter for matching options like Pattern.compile() does.
In that situation, you can add a mode modifier to the start of the regex. E.g. (?i) turns on case insensitivity, while (?ism) turns on all three options.
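As an illustration of both ways of selecting modes, here is a Python sketch. Python's re module names the four modes IGNORECASE, DOTALL, MULTILINE and VERBOSE, and also accepts a mode modifier at the start of the regex. The patterns and subject strings are invented for the example.

```python
import re

# Modes passed as option flags when calling the regex functions.
case_insensitive = re.search(r"regex", "REGEX", re.IGNORECASE)
single_line = re.fullmatch(r"a.b", "a\nb", re.DOTALL)
multi_line = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
free_spacing = re.fullmatch(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", "2007-07", re.VERBOSE)

# Without the flag, the dot does not match the line break.
no_flag = re.fullmatch(r"a.b", "a\nb")

# The same modes requested by a modifier at the start of the regex.
inline = re.fullmatch(r"(?ism)^a.b$", "A\nB")

print(bool(case_insensitive), bool(single_line), multi_line,
      bool(free_spacing), no_flag, bool(inline))
```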
Turning Modes On and Off for Only Part of The Regular Expression
Modern regex flavors allow you to apply modifiers to only part of the regular expression. If you insert the modifier (?ism) in the middle of the regex, the modifier only applies to the part of the regex to the right of the modifier. You can turn off modes by preceding them with a minus sign. All modes after the minus sign will be turned off. E.g. (?i-sm) turns on case insensitivity, and turns off both single-line mode and multi-line mode.
Not all regex flavors support this. JavaScript and Python apply all mode modifiers to the entire regular expression. They don't support the (?-ismx) syntax, since turning off an option is pointless when mode modifiers apply to the whole regular expression. All options are off by default.
You can quickly test how the regex flavor you're using handles mode modifiers. The regex (?i)te(?-i)st should match "test" and "TEst", but not "teST" or "TEST".
Modifier Spans
Instead of using two modifiers, one to turn an option on, and one to turn it off, you can use a modifier span. (?i)ignorecase(?-i)casesensitive(?i)ignorecase is equivalent to (?i)ignorecase(?-i:casesensitive)ignorecase. You have probably noticed the resemblance between the modifier span and the non-capturing group (?:group). Technically, the non-capturing group is a modifier span that does not change any modifiers. It is obvious that the modifier span does not create a backreference.
Modifier spans are supported by all regex flavors that allow you to use mode modifiers in the middle of the regular expression, and by those flavors only. These include the JGsoft engine, .NET, Java, Perl and PCRE.
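A modifier span can be tried out with the sketch below. It assumes Python 3.6 or later, where the re module accepts modifier spans even though it still rejects a bare mode modifier in the middle of the regex. That behavior is more recent than what this tutorial describes for Python, so treat this as an illustration of the span syntax rather than of any flavor covered here.

```python
import re

# Case insensitivity applies to "te" only; "st" stays case sensitive.
span = re.compile(r"(?i:te)st")
results = {s: bool(span.fullmatch(s)) for s in ["test", "TEst", "teST", "TEST"]}
print(results)

# A span that turns case insensitivity off for the middle part only.
mixed = re.compile(r"(?i)ignorecase(?-i:casesensitive)ignorecase")
print(bool(mixed.fullmatch("IGNORECASEcasesensitiveIgnoreCase")))
print(bool(mixed.fullmatch("ignorecaseCASESENSITIVEignorecase")))
```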
15. Possessive Quantifiers
When discussing the repetition operators or quantifiers, I explained the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match.
Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found.
Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches.
How Possessive Quantifiers Work
Several modern regular expression flavors, including the JGsoft flavor, Java and PCRE, have a third kind of quantifier: the possessive quantifier. Like a greedy quantifier, a possessive quantifier will repeat the token as many times as possible. Unlike a greedy quantifier, it will not give up matches as the engine backtracks. With a possessive quantifier, the deal is all or nothing. You can make a quantifier possessive by placing an extra + after it. E.g. * is greedy, *? is lazy, and *+ is possessive. ++, ?+ and {n,m}+ are all possessive as well.
Let's see what happens if we try to match "[^"]*+" against "abc" (the double quotes are part of both the regex and the subject string). The first " in the regex matches the opening quote. [^"] matches "a", "b" and "c" as it is repeated by the star. The final " then matches the closing quote and we found an overall match. In this case, the end result is the same, whether we use a greedy or possessive quantifier. There is a slight performance increase though, because the possessive quantifier doesn't have to remember any backtracking positions.
The performance increase can be significant in situations where the regex fails. If the subject is "abc (no closing quote), the above matching process will happen in the same way, except that the second " fails. When using a possessive quantifier, there are no steps to backtrack to. The regular expression does not have any alternation or non-possessive quantifiers that can give up part of their match to try a different permutation of the regular expression. So the match attempt fails immediately when the second " fails.
Had we used a greedy quantifier instead, the engine would have backtracked. After the " failed at the end of the string, the [^"]* would give up one match, leaving it with "ab". The " would then fail to match "c". [^"]* backtracks to just "a", and " fails to match "b". Finally, [^"]* backtracks to match zero characters, and " fails to match "a". Only at this point have all backtracking positions been exhausted, and does the engine give up the match attempt. Essentially, this regex performs as many needless steps as there are characters following the unmatched opening quote.
When Possessive Quantifiers Matter
The main practical benefit of possessive quantifiers is to speed up your regular expression. In particular, possessive quantifiers allow your regex to fail faster. In the above example, when the closing quote fails to match, we know the regular expression couldn't have possibly skipped over a quote. So there's no need to backtrack and check for the quote. We make the regex engine aware of this by making the quantifier possessive. In fact, some engines, including the JGsoft engine, detect that [^"]* and " are mutually exclusive when compiling your regular expression, and automatically make the star possessive.
Now, linear backtracking like a regex with a single quantifier does is pretty fast. It's unlikely you'll notice the speed difference. However, when you're nesting quantifiers, a possessive quantifier may save your day. Nesting quantifiers means that you have one or more repeated tokens inside a group, and the group is also repeated. That's when catastrophic backtracking often rears its ugly head. In such cases, you'll depend on possessive quantifiers and/or atomic grouping to save the day.
Possessive Quantifiers Can Change The Match Result
Using possessive quantifiers can change the result of a match attempt. Since no backtracking is done, matches that would require a greedy quantifier to backtrack will not be found with a possessive quantifier. E.g. ".*" will match "abc" (including the quotes) in the string "abc"x, but ".*+" will not match this string at all.
In both regular expressions, the first " will match the first " in the string. The repeated dot then matches the remainder of the string: abc"x. The second " then fails to match at the end of the string.
Now, the paths of the two regular expressions diverge. The possessive dot-star wants it all. No backtracking is done. Since the " failed, there are no permutations left to try, and the overall match attempt fails. The greedy dot-star, while initially grabbing everything, is willing to give back. It will backtrack one character at a time. Backtracking to abc", the " fails to match "x". Backtracking to abc, the " matches the second " in the string. An overall match "abc" was found.
Essentially, the lesson here is that when using possessive quantifiers, you need to make sure that whatever you're applying the possessive quantifier to should not be able to match what should follow it. The problem in the above example is that the dot also matches the closing quote. This prevents us from using a possessive quantifier. The negated character class in the previous section cannot match the closing quote, so we can make it possessive.
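The change in match result can be reproduced with a small Python experiment. To keep the sketch runnable on Python versions older than 3.11, which lack native possessive quantifiers, it swaps in a common emulation: a lookahead that captures the repeated token, followed by a backreference, written (?=(X*))\1. This relies on the lookahead not being re-entered once it has succeeded, which is how Python's re module behaves. The emulation adds a capturing group that a real possessive quantifier would not create.

```python
import re

subject = '"abc"x'

# Greedy dot-star: backtracks from the end until the closing quote fits.
greedy = re.search(r'".*"', subject)

# Emulated possessive dot-star: the lookahead grabs abc"x and never gives
# any of it back, so the closing quote in the regex cannot match.
possessive_like = re.search(r'"(?=(.*))\1"', subject)

# Emulated possessive negated class: it cannot run past the closing quote,
# so making it possessive does not cost us the match.
safe = re.search(r'"(?=([^"]*))\1"', subject)

print(greedy.group(), possessive_like, safe.group())
```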
Using Atomic Grouping Instead of Possessive Quantifiers
Technically, possessive quantifiers are a notational convenience to place an atomic group around a single quantifier. All regex flavors that support possessive quantifiers also support atomic grouping. But not all regex flavors that support atomic grouping support possessive quantifiers. With those flavors, you can achieve the exact same results using an atomic group.
Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.
E.g. (?:a|b)*+b and (?>(?:a|b)*)b both fail to match "b". a|b will match the "b". The star is satisfied, and the fact that it's possessive or the atomic group will cause the star to forget all its backtracking positions. The second b in the regex has nothing left to match, and the overall match attempt fails.
In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking positions. I.e. if an "a" is matched, it won't come back to try b if the rest of the regex fails. Since the star is outside of the group, it is a normal, greedy star. When the second b fails, the greedy star will backtrack to zero iterations. Then, the second b matches the "b" in the subject string.
This distinction is particularly important when converting a regular expression written by somebody else using possessive quantifiers to a regex flavor that doesn't have possessive quantifiers. You could, of course, let a tool like RegexBuddy do the job for you.
16. Atomic Grouping
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic. Atomic grouping is supported by most modern regular expression flavors, including the JGsoft flavor, Java, PCRE, .NET, Perl and Ruby. The first three of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping.
An example will make the behavior of atomic groups clear. The regular expression a(bc|b)c (capturing group) matches "abcc" and "abc". The regex a(?>bc|b)c (atomic group) matches "abcc" but not "abc".
When applied to "abc", both regexes will match a to "a", bc to "bc", and then c will fail to match at the end of the string. Here their paths diverge. The regex with the capturing group has remembered a backtracking position for the alternation. The group will give up its match, b then matches "b" and c matches "c". Match found!
The regex with the atomic group, however, exited from an atomic group after bc was matched. At that point, all backtracking positions for tokens inside the group are discarded. In this example, the alternation's option to try b at the second position in the string is discarded. As a result, when c fails, the regex engine has no alternatives left to try.
Of course, the above example isn't very useful. But it does illustrate very clearly how atomic grouping eliminates certain matches. Or more importantly, it eliminates certain match attempts.
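The two regexes from this example can be compared in Python. Python's re module only gained the (?>group) syntax in version 3.11, so this sketch swaps in an emulation that also runs on older versions: a capturing lookahead followed by a backreference, (?=(bc|b))\1. It relies on the statement above that lookaround groups are atomic. The emulation introduces a capturing group that a real atomic group does not have.

```python
import re

capturing = re.compile(r"a(bc|b)c")
# Stand-in for a(?>bc|b)c: once the lookahead has settled on "bc", the
# engine cannot come back to try the alternative "b".
atomic_like = re.compile(r"a(?=(bc|b))\1c")

for subject in ["abcc", "abc"]:
    print(subject,
          bool(capturing.fullmatch(subject)),
          bool(atomic_like.fullmatch(subject)))
```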
RegexOptimizationUsingAtomicGroupingConsidertheregex\b(integer|insert|in)\bandthesubject"integers".
Obviously,becauseofthewordboundaries,thesedon'tmatch.
What'snotsoobviousisthattheregexenginewillspendquitesomeeffortfiguringthisout.
\bmatchesatthestartofthestring,andintegermatchesinteger".
Theregexenginemakesnotethattherearetomorealternativesinthegroup,andcontinueswith\b.
Thisfailstomatchbetweenthe"r"and"s".
Sotheenginebacktrackstotrythesecondalternativeinsidethegroup.
Thesecondalternativematchesin",butthenfailstomatchs.
Sotheenginebacktracksoncemoretothethirdalternative.
inmatchesin".
\bfailsbetweenthe"n"and"t"thistime.
Theregexenginehasnomorerememberedbacktrackingpositions,soitdeclaresfailure.
Thisisquitealotofworktofigureout"integers"isn'tinourlistofwords.
Wecanoptimizethisbytellingtheregularexpressionenginethatifitcan'tmatch\bafteritmatchedinteger",thenitshouldn'tbothertryinganyoftheotherwords.
Thewordwe'veencounteredinthesubjectstringisalongerword,anditisn'tinourlist.
Wecandothismyturningthecapturinggroupintoanatomicgroup:\b(>integer|insert|in)\b.
Now,whenintegermatches,theengineexitsfromanatomicgroup,andthrowsawaythebacktrackingpositionsitstoredforthealternation.
When\bfails,theenginegivesupimmediately.
Thissavingscanbesignificantwhenscanningalargefileforalonglistofkeywords.
Thissavingswillbevitalwhenyouralternativescontainrepeatedtokens(nottomentionrepeatedgroups)thatleadtocatastrophicbacktracking.
Don't be too quick to make all your groups atomic.
As we saw in the first example above, atomic grouping can exclude valid matches too.
Compare how \b(?>integer|insert|in)\b and \b(?>in|integer|insert)\b behave when applied to "insert".
The former regex matches, while the latter fails.
If the groups weren't atomic, both regexes would match.
Remember that alternation tries its alternatives from left to right.
If the second regex matches "in", it won't try the two other alternatives due to the atomic group.
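The effect of alternative order inside an atomic group can be tried out with a short Python sketch. Python's re module only gained native atomic groups in version 3.11, so this sketch swaps in a common emulation instead: a lookahead that captures the alternation, followed by a backreference that consumes what was captured. The helper name matched is made up for this illustration.

```python
import re

# Ordinary capturing group: the engine may backtrack into the alternation.
plain = re.compile(r"\b(in|integer|insert)\b")

# Emulated atomic groups (an assumption of this sketch, not native syntax):
# the lookahead captures the first alternative that matches, and \1 then
# consumes exactly that text. Once the lookahead has succeeded, the engine
# does not go back into it to try the remaining alternatives.
atomic_in_first = re.compile(r"\b(?=(in|integer|insert))\1\b")
atomic_in_last = re.compile(r"\b(?=(integer|insert|in))\1\b")


def matched(pattern, subject):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(subject)
    return m.group(0) if m else None


print(matched(plain, "insert"))             # insert
print(matched(atomic_in_first, "insert"))   # None
print(matched(atomic_in_last, "insert"))    # insert
print(matched(atomic_in_last, "integers"))  # None
```

With "in" listed first, the emulated atomic group commits to "in" and the word boundary then fails, which is exactly the lost match described above.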
17. Lookahead and Lookbehind Zero-Width Assertions
Perl 5 introduced two very powerful constructs: "lookahead" and "lookbehind".
Collectively, these are called "lookaround".
They are also called "zero-width assertions".
They are zero-width just like the start and end of line, and start and end of word anchors that I already explained.
The difference is that lookarounds will actually match characters, but then give up the match and only return the result: match or no match.
That is why they are called "assertions".
They do not consume characters in the string, but only assert whether a match is possible or not.
Lookarounds allow you to create regular expressions that are impossible to create without them, or that would get very long-winded without them.
Positive and Negative Lookahead
Negative lookahead is indispensable if you want to match something not followed by something else.
When explaining character classes, I already explained why you cannot use a negated character class to match a "q" not followed by a "u".
Negative lookahead provides the solution: q(?!u).
The negative lookahead construct is the pair of round brackets, with the opening bracket followed by a question mark and an exclamation point.
Inside the lookahead, we have the trivial regex u.
Positive lookahead works just the same.
q(?=u) matches a q that is followed by a u, without making the u part of the match.
The positive lookahead construct is a pair of round brackets, with the opening bracket followed by a question mark and an equals sign.
You can use any regular expression inside the lookahead.
(Note that this is not the case with lookbehind.
I will explain why below.)
If it contains capturing parentheses, the backreferences will be saved.
Note that the lookahead itself does not create a backreference.
So it is not included in the count towards numbering the backreferences.
If you want to store the match of the regex inside a backreference, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)).
The other way around will not work, because the lookahead will already have discarded the regex match by the time the backreference is to be saved.
Regex Engine Internals
First, let's see how the engine applies q(?!u) to the string "Iraq".
The first token in the regex is the literal q.
As we already know, this will cause the engine to traverse the string until the "q" in the string is matched.
The position in the string is now the void behind the string.
The next token is the lookahead.
The engine takes note that it is inside a lookahead construct now, and begins matching the regex inside the lookahead.
So the next token is u.
This does not match the void behind the string.
The engine notes that the regex inside the lookahead failed.
Because the lookahead is negative, this means that the lookahead has successfully matched at the current position.
At this point, the entire regex has matched, and "q" is returned as the match.
Let's try applying the same regex to "quit".
q matches "q".
The next token is the u inside the lookahead.
The next character is the "u".
These match.
The engine advances to the next character: "i".
However, it is done with the regex inside the lookahead.
The engine notes success, and discards the regex match.
This causes the engine to step back in the string to "u".
Because the lookahead is negative, the successful match inside it causes the lookahead to fail.
Since there are no other permutations of this regex, the engine has to start again at the beginning.
Since q cannot match anywhere else, the engine reports failure.
Let's take one more look inside, to make sure you understand the implications of the lookahead.
Let's apply q(?=u)i to "quit".
I have made the lookahead positive, and put a token after it.
Again, q matches "q" and u matches "u".
Again, the match from the lookahead must be discarded, so the engine steps back from "i" in the string to "u".
The lookahead was successful, so the engine continues with i.
But i cannot match "u".
So this match attempt fails.
All remaining attempts will fail as well, because there are no more q's in the string.
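The three walkthroughs above can be replayed in Python, whose re module supports both kinds of lookahead. This is only a sketch to check the outcomes; the helper name find is made up for the illustration.

```python
import re


def find(pattern, subject):
    """Return (matched text, (start, end)) or None when nothing matches."""
    m = re.search(pattern, subject)
    return (m.group(0), m.span()) if m else None


# Negative lookahead: a q that is not followed by a u.
print(find(r"q(?!u)", "Iraq"))   # ('q', (3, 4))
print(find(r"q(?!u)", "quit"))   # None

# Positive lookahead: the u is checked but not made part of the match.
print(find(r"q(?=u)", "quit"))   # ('q', (0, 1))

# After the lookahead the engine is back in front of the u, so i cannot match.
print(find(r"q(?=u)i", "quit"))  # None

# Capturing parentheses inside the lookahead still store a backreference.
print(re.search(r"q(?=(u))", "quit").group(1))  # u
```

Note that the match span for "quit" ends right after the q in every successful case, which shows that the lookahead consumed nothing.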
Positive and Negative Lookbehind
Lookbehind has the same effect, but works backwards.
It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.
[...] /c) {
    # Bold
  } elsif ($string =~ m/\GI>/c) {
    # Italics
  } else {
    # ...etc...
  }
}
The regex in the while loop searches for the tag's opening bracket, and the regexes inside the loop check which tag we found.
This way you can parse the tags in the file in the order they appear in the file, without having to write a single big regex that matches all tags you are interested in.
\G in Other Programming Languages
This flexibility is not available with most other programming languages.
E.g. in Java, the position for \G is remembered by the Matcher object.
The Matcher is strictly associated with a single regular expression and a single subject string.
What you can do though is to add a line of code to make the match attempt of the second Matcher start where the match of the first Matcher ended.
\G will then match at this position.
The \G token is supported by the JGsoft engine, .NET, Java, Perl and PCRE.
20. If-Then-Else Conditionals in Regular Expressions
A special construct (?ifthen|else) allows you to create conditional regular expressions.
If the if part evaluates to true, then the regex engine will attempt to match the then part.
Otherwise, the else part is attempted instead.
The syntax consists of a pair of round brackets.
The opening bracket must be followed by a question mark, immediately followed by the if part, immediately followed by the then part.
This part can be followed by a vertical bar and the else part.
You may omit the else part, and the vertical bar with it.
For the if part, you can use the lookahead and lookbehind constructs.
Using positive lookahead, the syntax becomes (?(?=regex)then|else).
Because the lookahead has its own parentheses, the if and then parts are clearly separated.
Remember that the lookaround constructs do not consume any characters.
If you use a lookahead as the if part, then the regex engine will attempt to match the then or else part (depending on the outcome of the lookahead) at the same position where the if was attempted.
Alternatively, you can check in the if part whether a capturing group has taken part in the match thus far.
Place the number of the capturing group inside round brackets, and use that as the if part.
Note that although the syntax for a conditional check on a backreference is the same as a number inside a capturing group, no capturing group is created.
The number and the brackets are part of the if-then-else syntax started with (?.
For the then and else, you can use any regular expression.
If you want to use alternation, you will have to group the then or else together using parentheses, like in (?(?=condition)(then1|then2|then3)|(else1|else2|else3)).
Otherwise, there is no need to use parentheses around the then and else parts.
Looking Inside the Regex Engine
The regex (a)?b(?(1)c|d) matches "bd" and "abc".
It does not match "bc", but does match "bd" in "abd".
Let's see how this regular expression works on each of these four subject strings.
When applied to "bd", a fails to match.
Since the capturing group containing a is optional, the engine continues with b at the start of the subject string.
Since the whole group was optional, the group did not take part in the match.
Any subsequent backreference to it like \1 will fail.
Note that (a)? is very different from (a?).
In the former regex, the capturing group does not take part in the match if a fails, and backreferences to the group will fail.
In the latter regex, the capturing group always takes part in the match, capturing either "a" or nothing.
Backreferences to a capturing group that took part in the match and captured nothing always succeed.
Conditionals evaluating such groups execute the "then" part.
In short: if you want to use a reference to a group in a conditional, use (a)? instead of (a?).
Continuing with our regex, b matches "b".
The regex engine now evaluates the conditional.
The first capturing group did not take part in the match at all, so the "else" part or d is attempted.
d matches "d" and an overall match is found.
Moving on to our second subject string "abc", a matches "a", which is captured by the capturing group.
Subsequently, b matches "b".
The regex engine again evaluates the conditional.
The capturing group took part in the match, so the "then" part or c is attempted.
c matches "c" and an overall match is found.
Our third subject "bc" does not start with "a", so the capturing group does not take part in the match attempt, like we saw with the first subject string.
b still matches "b", and the engine moves on to the conditional.
The first capturing group did not take part in the match at all, so the "else" part or d is attempted.
d does not match "c" and the match attempt at the start of the string fails.
The engine does try again starting at the second character in the string, but fails since b does not match "c".
The fourth subject "abd" is the most interesting one.
Like in the second string, the capturing group grabs the "a" and the b matches.
The capturing group took part in the match, so the "then" part or c is attempted.
c fails to match "d", and the match attempt fails.
Note that the "else" part is not attempted at this point.
The capturing group took part in the match, so only the "then" part is used.
However, the regex engine isn't done yet.
It will restart the regular expression from the beginning, moving ahead one character in the subject string.
Starting at the second character in the string, a fails to match "b".
The capturing group does not take part in the second match attempt which started at the second character in the string.
The regex engine moves beyond the optional group, and attempts b, which matches.
The regex engine now arrives at the conditional in the regex, and at the third character in the subject string.
The first capturing group did not take part in the current match attempt, so the "else" part or d is attempted.
d matches "d" and an overall match "bd" is found.
If you want to avoid this last match result, you need to use anchors.
^(a)?b(?(1)c|d)$ does not find any matches in the last subject string.
The caret will fail to match at the second and third characters in the string.
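Since Python supports conditionals that test a numbered capturing group, the four subject strings can be checked with a short sketch. The variable and helper names are made up for the illustration.

```python
import re

conditional = re.compile(r"(a)?b(?(1)c|d)")
anchored = re.compile(r"^(a)?b(?(1)c|d)$")


def matched(pattern, subject):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(subject)
    return m.group(0) if m else None


for subject in ("bd", "abc", "bc", "abd"):
    print(subject, "->", matched(conditional, subject))
# bd -> bd
# abc -> abc
# bc -> None
# abd -> bd

# With anchors, the second attempt inside "abd" is no longer possible.
print(matched(anchored, "abd"))  # None
```

The "abd" result shows the restart described above: the match begins at the second character, where the group did not take part and the "else" part is used.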
Regex Flavors
Conditionals are supported by the JGsoft engine, Perl, PCRE and the .NET framework.
All these flavors, except Perl, also support named capturing groups.
They allow you to use the name of a capturing group instead of its number as the if test, e.g. (?(test)c|d) when the optional group matching a is named "test".
Python supports conditionals using a numbered or named capturing group.
Python does not support conditionals using lookaround, even though Python does support lookaround outside conditionals.
Instead of a conditional like (?(?=regex)then|else), you can alternate two opposite lookarounds: (?=regex)then|(?!regex)else.
Example: Extract Email Headers
The regex ^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+)) extracts the From, To, and Subject headers from an email message.
The name of the header is captured into the first backreference.
If the header is the From or To header, it is captured into the second backreference as well.
The second part of the pattern is the if-then-else conditional (?(2)\w+@\w+\.[a-z]+|.+).
The if part checks if the second capturing group took part in the match thus far.
It will have if the header is the From or To header.
In that case, the then part of the conditional \w+@\w+\.[a-z]+ tries to match an email address.
To keep the example simple, we use an overly simple regex to match the email address, and we don't try to match the display name that is usually also part of the From or To header.
If the second capturing group did not participate in the match thus far, the else part .+ is attempted instead.
This simply matches the remainder of the line, allowing for any test subject.
Finally, we place an extra pair of round brackets around the conditional.
This captures the contents of the email header matched by the conditional into the third backreference.
The conditional itself does not capture anything.
When implementing this regular expression, the first capturing group will store the name of the header ("From", "To", or "Subject"), and the third capturing group will store the value of the header.
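A minimal sketch of how this could look in Python, which supports conditionals on numbered groups. It assumes the header name is followed by a colon and a single space, and the sample message with its addresses is invented for the illustration.

```python
import re

# Group 1: header name. Group 2: set only for From/To. Group 3: header value.
HEADER = re.compile(
    r"^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))",
    re.MULTILINE,
)

# Made-up sample message, used only to exercise the regex.
message = (
    "From: jan@example.com\n"
    "To: reader@example.org\n"
    "Subject: Regex conditionals\n"
)

headers = [(m.group(1), m.group(3)) for m in HEADER.finditer(message)]
for name, value in headers:
    print(name, "=>", value)
# From => jan@example.com
# To => reader@example.org
# Subject => Regex conditionals
```

Only groups 1 and 3 are read here. Group 2 exists purely so the conditional can tell address headers apart from the subject line.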
You could try to match even more headers by putting another conditional into the "else" part.
E.g. ^((From|To)|(Date)|Subject): ((?(2)\w+@\w+\.[a-z]+|(?(3)mm/dd/yyyy|.+))) would match a "From", "To", "Date" or "Subject", and use the regex mm/dd/yyyy to check if the date is valid.
Obviously, the date validation regex is just a dummy to keep the example simple.
The header is captured in the first group, and its validated contents in the fourth group.
As you can see, regular expressions using conditionals quickly become unwieldy.
I recommend that you only use them if one regular expression is all your tool allows you to use.
When programming, you're far better off using the regex ^(From|To|Date|Subject): (.+) to capture one header with its unvalidated contents.
In your source code, check the name of the header returned in the first capturing group, and then use a second regular expression to validate the contents of the header returned in the second capturing group of the first regex.
Though you'll have to write a few lines of extra code, this code will be much easier to understand and maintain.
If you precompile all the regular expressions, using multiple regular expressions will be just as fast, if not faster, than the one big regex stuffed with conditionals.
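The two-step approach might be sketched in Python as follows. It assumes the first regex captures the header name in group 1 and the rest of the line in group 2. The validator table, the function name parse_headers, the date pattern (a stand-in for the dummy mm/dd/yyyy check) and the sample message are all assumptions made for this illustration.

```python
import re

# Step 1: one simple regex that captures any header with unvalidated contents.
HEADER = re.compile(r"^(From|To|Date|Subject): (.+)", re.MULTILINE)

# Step 2: one precompiled validator per header name (hypothetical choices).
ADDRESS = re.compile(r"\w+@\w+\.[a-z]+")
VALIDATORS = {
    "From": ADDRESS,
    "To": ADDRESS,
    "Date": re.compile(r"\d\d/\d\d/\d{4}"),  # placeholder date check
}


def parse_headers(message):
    """Return a dict of the headers whose contents pass validation."""
    result = {}
    for m in HEADER.finditer(message):
        name, value = m.group(1), m.group(2)
        validator = VALIDATORS.get(name)
        # Headers without a validator (Subject) are accepted as they are.
        if validator is None or validator.fullmatch(value):
            result[name] = value
    return result


sample = (
    "From: jan@example.com\n"
    "To: not an address\n"
    "Date: 03/20/2021\n"
    "Subject: Hello\n"
)
print(parse_headers(sample))
```

Each regex stays small enough to read at a glance, and adding a header only means adding one entry to the table.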
21. XML Schema Character Classes
XML Schema Regular Expressions support the usual six shorthand character classes, plus four more.
These four aren't supported by any other regular expression flavor.
\i matches any character that may be the first character of an XML name, i.e. [_:A-Za-z].
\c matches any character that may occur after the first character in an XML name, i.e. [-._:A-Za-z0-9].
\I and \C are the respective negated shorthands.
Note that the \c shorthand syntax conflicts with the control character syntax used in many other regex flavors.
You can use these four shorthands both inside and outside character classes using the bracket notation.
They're very useful for validating XML references and values in your XML schemas.
The regular expression \i\c* matches an XML name like "xml:schema".
In other regular expression flavors, you'd have to spell this out as [_:A-Za-z][-._:A-Za-z0-9]*.
The latter regex also works with XML's regular expression flavor.
It just takes more time to type in.
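Python's re module has no \i or \c shorthands, so the spelled-out form is the one to use there. A small sketch, with a made-up helper name, checking a few candidate names against it:

```python
import re

# Spelled-out equivalent of \i\c* as given above (ASCII characters only).
XML_NAME = re.compile(r"[_:A-Za-z][-._:A-Za-z0-9]*")


def is_xml_name(text):
    """True when the whole string has the shape of an XML name."""
    return XML_NAME.fullmatch(text) is not None


print(is_xml_name("xml:schema"))  # True
print(is_xml_name("_private"))    # True
print(is_xml_name("1st"))         # False, a digit cannot start a name
print(is_xml_name("-dash"))       # False, a hyphen cannot start a name
```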
The regex [...] matches an opening XML tag without any attributes.
[...] matches any closing tag.
[...] matches an opening tag with any number of attributes.
Putting it all together, [...] matches either an opening tag with attributes or a closing tag.
Character Class Subtraction
While the regex flavor it defines is quite limited, the XML Schema adds a new regular expression feature not previously seen in any (popular) regular expression flavor: character class subtraction.
Currently, this feature is only supported by the JGsoft and .NET regex engines (in addition to those implementing the XML Schema standard).
Character class subtraction makes it easy to match any single character present in one list (the character class), but not present in another list (the subtracted class).
The syntax for this is [class-[subtract]].
If the character after a hyphen is an opening bracket, XML regular expressions interpret the hyphen as the subtraction operator rather than the range operator.
E.g. [a-z-[aeiuo]] matches a single letter that is not a vowel (i.e. a single consonant).
Without the character class subtraction feature, the only way to do this would be to list all consonants: [b-df-hj-np-tv-z].
This feature is more than just a notational convenience, though.
You can use the full character class syntax within the subtracted character class.
E.g. to match all Unicode letters except ASCII letters (i.e. all non-English letters), you could easily use [\p{L}-[\p{IsBasicLatin}]].
Nested Character Class Subtraction
Since you can use the full character class syntax within the subtracted character class, you can subtract a class from the class being subtracted.
E.g. [0-9-[0-6-[0-3]]] first subtracts 0-3 from 0-6, yielding [0-9-[4-6]], or [0-37-9], which matches any character in the string "0123789".
The class subtraction must always be the last element in the character class.
[0-9-[4-6]a-f] is not a valid regular expression.
It should be rewritten as [0-9a-f-[4-6]].
The subtraction works on the whole class.
E.g. [\p{Ll}\p{Lu}-[\p{IsBasicLatin}]] matches all uppercase and lowercase Unicode letters, except any ASCII letters.
The \p{IsBasicLatin} is subtracted from the combination of \p{Ll}\p{Lu} rather than from \p{Lu} alone.
This regex will not match "abc".
While you can use nested character class subtraction, you cannot subtract two classes sequentially.
To subtract ASCII letters and Greek letters from a class with all Unicode letters, combine the ASCII and Greek letters into one class, and subtract that, as in [\p{L}-[\p{IsBasicLatin}\p{IsGreek}]].
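Python's re module does not implement class subtraction, so the sketch below swaps in a different technique to reproduce the same results: a negative lookahead placed in front of the base class. This is an emulation under that assumption, not the XML Schema syntax itself, and the pattern names are made up.

```python
import re

# [a-z-[aeiuo]] emulated: a lowercase letter that is not a vowel.
consonant = re.compile(r"(?![aeiuo])[a-z]")

# [0-9-[4-6]] emulated: a digit that is not 4, 5 or 6.
digit_minus_456 = re.compile(r"(?![4-6])[0-9]")

# [0-9-[0-6-[0-3]]] emulated: the inner lookahead removes 0-3 from 0-6,
# and the outer lookahead removes what is left (4-6) from 0-9.
nested = re.compile(r"(?!(?![0-3])[0-6])[0-9]")

print("".join(consonant.findall("abcdef")))            # bcdf
print("".join(digit_minus_456.findall("0123456789")))  # 0123789
print("".join(nested.findall("0123456789")))           # 0123789
```

The last two lines print the same digits, matching the claim above that the nested subtraction reduces to [0-9-[4-6]].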
Notational Compatibility with Other Regex Flavors
Note that a regex like [a-z-[aeiuo]] will not cause any errors in regex flavors that do not support character class subtraction.
But it won't match what you intended either.
E.g. in Perl, this regex consists of a character class followed by a literal ].
The character class matches a character that is either in the range a-z, or a hyphen, or an opening bracket, or a vowel.
Since the a-z range and the vowels are redundant, you could write this character class as [a-z-[] or [-[a-z].
A hyphen after a range is treated as a literal character, just like a hyphen immediately after the opening bracket.
This is true in all regex flavors, including XML.
E.g. [a-z-_] matches a lowercase letter, a hyphen or an underscore in both Perl and XML Schema.
While the last paragraph strictly speaking means that the XML Schema character class syntax is incompatible with Perl and the majority of other regex flavors, in practice there's no difference.
Using non-alphanumeric characters in character class ranges is very bad practice, as it relies on the order of characters in the ASCII character table, which makes the regular expression hard to understand for the programmer who inherits your work.
E.g. while [A-[] would match any uppercase letter or an opening square bracket in Perl, this regex is much clearer when written as [A-Z[].
The former regex would cause an error in XML Schema, because it interprets -[] as an empty subtracted class, leaving an unbalanced [.
22. POSIX Bracket Expressions
POSIX bracket expressions are a special kind of character class.
POSIX bracket expressions match one character out of a set of characters, just like regular character classes.
The main purpose of the bracket expressions is that they adapt to the user's or application's locale.
A locale is a collection of rules and settings that describe language and cultural conventions, like sort order, date format, etc.
The POSIX standard also defines these locales.
Generally, only POSIX-compliant regular expression engines have proper and full support for POSIX bracket expressions.
Some non-POSIX regex engines support POSIX character classes, but usually don't support collating sequences and character equivalents.
Regular expression engines that support Unicode use Unicode properties and scripts to provide functionality similar to POSIX bracket expressions.
In Unicode regex engines, shorthand character classes like \w normally match all relevant Unicode characters, alleviating the need to use locales.
Character Classes
Don't confuse the POSIX term "character class" with what is normally called a regular expression character class.
[x-z0-9] is an example of what we call a "character class" and POSIX calls a "bracket expression".
[:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]].
These two regular expressions match exactly the same: a single character that is either "x", "y", "z" or a digit.
The class names must be written all lowercase.
POSIX bracket expressions can be negated.
[^x-z[:digit:]] matches a single character that is not x, y, z or a digit.
A major difference between POSIX bracket expressions and the character classes in other regex flavors is that POSIX bracket expressions treat the backslash as a literal character.
This means you can't use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-).
To include a caret, place it anywhere except right after the opening bracket.
[x^] matches an x or a caret.
You can put the closing bracket right after the opening bracket, or right after the negating caret.
[]x] matches a closing bracket or an x.
[^]x] matches any character that is not a closing bracket or an x.
The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret.
Both [-x] and [x-] match an x or a hyphen.
Exactly which POSIX character classes are available depends on the POSIX locale.
The following are usually supported, often also by regex engines that don't support POSIX itself.
I've also indicated equivalent character classes that you can use in ASCII and Unicode regular expressions if the POSIX classes are unavailable.
Some classes also have Perl-style shorthand equivalents.
Java does not support POSIX bracket expressions, but does support POSIX character classes using the \p operator.
Though the \p syntax is borrowed from the syntax for Unicode properties, the POSIX classes in Java only match ASCII characters as indicated below.
The class names are case sensitive.
Unlike the POSIX syntax which can only be used inside a bracket expression, Java's \p can be used inside and outside bracket expressions.
POSIX: [:alnum:]
Description: Alphanumeric characters
ASCII: [a-zA-Z0-9]
Unicode: [\p{L&}\p{Nd}]
Shorthand:
Java: \p{Alnum}

POSIX: [:alpha:]
Description: Alphabetic characters
ASCII: [a-zA-Z]
Unicode: \p{L&}
Shorthand:
Java: \p{Alpha}

POSIX: [:ascii:]
Description: ASCII characters
ASCII: [\x00-\x7F]
Unicode: \p{InBasicLatin}
Shorthand:
Java: \p{ASCII}

POSIX: [:blank:]
Description: Space and tab
ASCII: [ \t]
Unicode: [\p{Zs}\t]
Shorthand:
Java: \p{Blank}

POSIX: [:cntrl:]
Description: Control characters
ASCII: [\x00-\x1F\x7F]
Unicode: \p{Cc}
Shorthand:
Java: \p{Cntrl}

POSIX: [:digit:]
Description: Digits
ASCII: [0-9]
Unicode: \p{Nd}
Shorthand: \d
Java: \p{Digit}

POSIX: [:graph:]
Description: Visible characters (i.e. anything except spaces, control characters, etc.)
ASCII: [\x21-\x7E]
Unicode: [^\p{Z}\p{C}]
Shorthand:
Java: \p{Graph}

POSIX: [:lower:]
Description: Lowercase letters
ASCII: [a-z]
Unicode: \p{Ll}
Shorthand:
Java: \p{Lower}

POSIX: [:print:]
Description: Visible characters and spaces (i.e. anything except control characters, etc.)
ASCII: [\x20-\x7E]
Unicode: \P{C}
Shorthand:
Java: \p{Print}

POSIX: [:punct:]
Description: Punctuation characters.
ASCII:
Unicode: \p{P}
Shorthand:
Java: \p{Punct}

POSIX: [:space:]
Description: All whitespace characters, including line breaks
ASCII: [ \t\r\n\v\f]
Unicode: [\p{Z}\t\r\n\v\f]
Shorthand: \s
Java: \p{Space}

POSIX: [:upper:]
Description: Uppercase letters
ASCII: [A-Z]
Unicode: \p{Lu}
Shorthand:
Java: \p{Upper}

POSIX: [:word:]
Description: Word characters (letters, numbers and underscores)
ASCII: [A-Za-z0-9_]
Unicode: [\p{L}\p{N}\p{Pc}]
Shorthand: \w
Java:

POSIX: [:xdigit:]
Description: Hexadecimal digits
ASCII: [A-Fa-f0-9]
Unicode: [A-Fa-f0-9]
Shorthand:
Java: \p{XDigit}

Collating Sequences
A POSIX locale can have collating sequences to describe how certain characters or groups of characters should be ordered.
E.g. in Spanish, "ll" like in "tortilla" is treated as one character, and is ordered between "l" and "m" in the alphabet.
You can use the collating sequence element [.span-ll.] inside a bracket expression to match "ll".
E.g. the regex torti[[.span-ll.]]a matches "tortilla".
Notice the double square brackets.
One pair for the bracket expression, and one pair for the collating sequence.
I do not know of any regular expression engine that supports collating sequences, other than POSIX-compliant engines that are part of a POSIX-compliant system.
Note that a fully POSIX-compliant regex engine will treat "ll" as a single character when the locale is set to Spanish.
This means that torti[^x]a also matches "tortilla".
[^x] matches a single character that is not an "x", which includes "ll" in the Spanish POSIX locale.
In any other regular expression engine, or in a POSIX engine not using the Spanish locale, torti[^x]a will match the misspelled word "tortila" but will not match "tortilla", as [^x] cannot match the two characters "ll".
Finally, note that not all regex engines claiming to implement POSIX regular expressions actually have full support for collating sequences.
Sometimes, these engines use the regular expression syntax defined by POSIX, but don't have full locale support.
You may want to try the above matches to see if the engine you're using does.
E.g. Tcl's regexp command supports collating sequences, but Tcl only supports the Unicode locale, which does not define any collating sequences.
The result is that in Tcl, a collating sequence specifying a single character will match just that character, and all other collating sequences will result in an error.
Character Equivalents
A POSIX locale can define character equivalents that indicate that certain characters should be considered as identical for sorting.
E.g. in French, accents are ignored when ordering words.
"élève" comes before "être" which comes before "événement".
"é" and "ê" are all the same as "e", but "l" comes before "t" which comes before "v".
With the locale set to French, a POSIX-compliant regular expression engine will match "e", "é" and "ê" when you use the character equivalent [=e=] in the bracket expression [[=e=]].
If a character does not have any equivalents, the character equivalence token simply reverts to the character itself.
E.g. [[=x=][=z=]] is the same as [xz] in the French locale.
Like collating sequences, POSIX character equivalents are not available in any regex engine that I know of, other than those following the POSIX standard.
And those that do may not have the necessary POSIX locale support.
Here too Tcl's regexp command supports character equivalents, but the Unicode locale, the only one Tcl supports, does not define any character equivalents.
This effectively means that [[=x=]] and [x] are exactly the same in Tcl, and will only match "x", for any character you may try instead of "x".
23. Adding Comments to Regular Expressions
If you have worked through the entire tutorial, I guess you will agree that regular expressions can quickly become rather cryptic.
Therefore, many modern regex flavors allow you to insert comments into regexes.
The syntax is (?#comment) where "comment" can be whatever you want, as long as it does not contain a closing round bracket.
The regex engine ignores everything after the (?# until the first closing round bracket.
E.g. I could clarify the regex to match a valid date by writing it as (?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01]).
Now it is instantly obvious that this regex matches a date in yyyy-mm-dd format.
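Python is one of the flavors that accepts this comment syntax, so the commented date regex can be tried directly. A short sketch, assuming the separator class is [- /.] and using sample dates made up for the illustration:

```python
import re

# The (?#...) comments are ignored by the engine and create no groups.
DATE = re.compile(
    r"(?#year)(19|20)\d\d[- /.]"
    r"(?#month)(0[1-9]|1[012])[- /.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)

m = DATE.fullmatch("2007-07-31")
print(m.groups())                    # ('20', '07', '31')
print(DATE.fullmatch("2007-13-31"))  # None, there is no month 13
```

Note that group 1 holds only the century digits, because the capturing parentheses surround (19|20) and not the whole year.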
Some software, such as RegexBuddy, EditPad Pro and PowerGREP can apply syntax coloring to regular expressions while you write them.
That makes the comments really stand out, enabling the right comment in the right spot to make a complex regular expression much easier to understand.
Regex comments are supported by the JGsoft engine, .NET, Perl, PCRE, Python and Ruby.
To make your regular expression even more readable, you can turn on free-spacing mode.
All flavors that support comments also support free-spacing mode.
In addition, Java supports free-spacing mode, even though it doesn't support (?#)-style comments.
24. Free-Spacing Regular Expressions
The JGsoft engine, .NET, Java, Perl, PCRE, Python and Ruby support a variant of the regular expression syntax called free-spacing mode.
You can turn on this mode with the (?x) mode modifier, or by turning on the corresponding option in the application or passing it to the regex constructor in your programming language.
In free-spacing mode, whitespace between regular expression tokens is ignored.
Whitespace includes spaces, tabs and line breaks.
Note that only whitespace between tokens is ignored.
E.g. a b c is the same as abc in free-spacing mode, but \ d and \d are not the same.
The former matches " d", while the latter matches a digit.
\d is a single regex token composed of a backslash and a "d".
Breaking up the token with a space gives you an escaped space (which matches a space), and a literal "d".
Likewise, grouping modifiers cannot be broken up.
(?>atomic) is the same as (?> ato mic ) and as ( ?>ato mic).
They all match the same atomic group.
They're not the same as (? >atomic).
In fact, the latter will cause a syntax error.
The ?> grouping modifier is a single element in the regex syntax, and must stay together.
This is true for all such constructs, including lookaround, named groups, etc.
A character class is also treated as a single token.
[abc] is not the same as [ a b c ].
The former matches one of three letters, while the latter matches those three letters or a space.
In other words: free-spacing mode has no effect inside character classes.
Spaces and line breaks inside character classes will be included in the character class.
This means that in free-spacing mode, you can use \  or [ ] to match a single space.
Use whichever you find more readable.
Comments in Free-Spacing Mode
Another feature of free-spacing mode is that the # character starts a comment.
The comment runs until the end of the line.
Everything from the # until the next line break character is ignored.
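In Python, free-spacing mode is the re.VERBOSE flag. The sketch below exercises the whitespace rules and the # comments described here. The patterns and variable names are small made-up examples, and other flavors may differ in details.

```python
import re

# Whitespace between tokens is ignored, so this is the same as abc.
spaced = re.compile(r"a b c", re.VERBOSE)

# An escaped space or a space inside a character class is still matched.
escaped_space = re.compile(r"a\ b", re.VERBOSE)
class_space = re.compile(r"a[ ]b", re.VERBOSE)

# "\ d" is an escaped space followed by a literal d, not the digit shorthand.
broken_token = re.compile(r"\ d", re.VERBOSE)

# A # starts a comment that runs until the end of the line.
commented = re.compile(r"""
    [0-9]+   # one or more digits
    [ ]      # a single space, kept because it is inside a character class
    [a-z]+   # one or more lowercase letters
""", re.VERBOSE)

print(bool(spaced.fullmatch("abc")))           # True
print(bool(escaped_space.fullmatch("a b")))    # True
print(bool(class_space.fullmatch("a b")))      # True
print(bool(broken_token.fullmatch(" d")))      # True
print(bool(broken_token.fullmatch("7")))       # False
print(bool(commented.fullmatch("42 apples")))  # True
```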
Putting it all together, I could clarify the regex to match a valid date by writing it across multiple lines as:
# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator
(0[1-9]|[12][0-9]|3[01])   # day (group 3)
Part 2
Examples
1. Sample Regular Expressions
Below, you will find many example patterns that you can use for and adapt to your own purposes.
Key techniques used in crafting each regex are explained, with links to the corresponding pages in the tutorial where these concepts and techniques are explained in great detail.
If you are new to regular expressions, you can take a look at these examples to see what is possible.
Regular expressions are very powerful.
They do take some time to learn.
But you will earn back that time quickly when using regular expressions to automate searching or editing tasks in EditPad Pro or PowerGREP, or when writing scripts or applications in a variety of languages.
RegexBuddy offers the fastest way to get up to speed with regular expressions.
RegexBuddy will analyze any regular expression and present it to you in an easy-to-understand, detailed outline.
The outline links to RegexBuddy's regex tutorial (the same one you find on this website), where you can always get in-depth information with a single click.
Oh, and you definitely do not need to be a programmer to take advantage of regular expressions!
Grabbing HTML Tags
[...] matches the opening and closing pair of a specific HTML tag.
Anything between the tags is captured into the first backreference.
The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do.
This regex will not properly match tags nested inside themselves, like in "onetwoone".
[...] will match the opening and closing pair of any HTML tag.
Be sure to turn off case sensitivity.
The key in this solution is the use of the backreference \1 in the regex.
Anything between the tags is captured into the second backreference.
This solution will also not match tags nested in themselves.
Trimming Whitespace
You can easily trim unnecessary whitespace from the start and the end of a string or the lines in a text file by doing a regex search-and-replace.
Search for ^[ \t]+ and replace with nothing to delete leading whitespace (spaces and tabs).
Search for [ \t]+$ to trim trailing whitespace.
Do both by combining the regular expressions into ^[ \t]+|[ \t]+$.
Instead of [ \t] which matches a space or a tab, you can expand the character class into [ \t\r\n] if you also want to strip line breaks.
Or you can use the shorthand \s instead.
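As a sketch of the search-and-replace in Python: the function names are made up, and re.MULTILINE is used so that ^ and $ also match at every line break, which is what trimming each line of a file requires.

```python
import re


def trim_lines(text):
    """Strip leading and trailing spaces and tabs from every line."""
    return re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)


def trim_all(text):
    """Strip all whitespace, line breaks included, from both ends."""
    return re.sub(r"^\s+|\s+$", "", text)


print(repr(trim_lines("  hello \t\n\tworld  ")))  # 'hello\nworld'
print(repr(trim_all("\n  hello world \n")))       # 'hello world'
```

The first function keeps the line break between the two lines, because [ \t] does not match it. The second removes line breaks at the ends because \s does.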
IP Addresses
Matching an IP address is another good example of a trade-off between regex complexity and exactness.
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b will match any IP address just fine, but will also match "999.999.999.999" as if it were a valid IP address.
Whether this is a problem depends on the files or data you intend to apply the regex to.
To restrict all 4 numbers in the IP address to 0..255, you can use this complex beast: \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b (everything on a single line).
The long regex stores each of the 4 numbers of the IP address into a capturing group.
You can use these groups to further process the IP number.
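The difference between the quick and the strict regex can be checked with a short Python sketch. It assumes each number is written as 25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?, and it builds the long regex from one octet pattern to keep the source readable. The names OCTET, STRICT and QUICK are made up for the illustration.

```python
import re

# One number in the range 0..255, in its own capturing group.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four octets separated by literal dots, between word boundaries.
STRICT = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")
QUICK = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

print(bool(QUICK.search("999.999.999.999")))   # True
print(bool(STRICT.search("999.999.999.999")))  # False

# The capturing groups give access to the individual numbers.
m = STRICT.search("host 192.168.1.255 is up")
print([int(n) for n in m.groups()])            # [192, 168, 1, 255]
```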
If you don't need access to the individual numbers, you can shorten the regex with a quantifier to: \b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b.
Similarly, you can shorten the quick regex to \b(?:\d{1,3}\.){3}\d{1,3}\b
More Detailed Examples
Numeric Ranges. Since regular expressions work with text rather than numbers, matching specific numeric ranges requires a bit of extra care.
Matching a Floating Point Number. Also illustrates the common mistake of making everything in a regular expression optional.
Matching an Email Address. There's a lot of controversy about what is a proper regex to match email addresses. It's a perfect example showing that you need to know exactly what you're trying to match (and what not), and that there's always a trade-off between regex complexity and accuracy.
Matching Valid Dates. A regular expression that matches 31-12-1999 but not 31-13-1999.
Matching Complete Lines. Shows how to match complete lines in a text file rather than just the part of the line that satisfies a certain requirement. Also shows how to match lines in which a particular regex does not match.
Removing Duplicate Lines or Items. Illustrates simple yet clever use of capturing parentheses or backreferences.
Regex Examples for Processing Source Code. How to match common programming language syntax such as comments, strings, numbers, etc.
Two Words Near Each Other. Shows how to use a regular expression to emulate the "near" operator that some tools have.
Common Pitfalls
Catastrophic Backtracking. If your regular expression seems to take forever, or simply crashes your application, it has likely contracted a case of catastrophic backtracking. The solution is usually to be more specific about what you want to match, so the number of matches the engine has to try doesn't rise exponentially.
Making Everything Optional. If all the parts in your regex are optional, it will match a zero-width string anywhere. Your regex will need to express the fact that different parts are optional depending on which parts are present.
Repeating a Capturing Group vs. Capturing a Repeated Group. Repeating a capturing group will capture only the last iteration of the group. Capture a repeated group if you want to capture all iterations.
722.
MatchingFloatingPointNumberswithaRegularExpressionInthisexample,Iwillshowyouhowyoucanavoidacommonmistakeoftenmadebypeopleinexperiencedwithregularexpressions.
Asanexample,wewilltrytobuildaregularexpressionthatcanmatchanyfloatingpointnumber.
Ourregexshouldalsomatchintegers,andfloatingpointnumberswheretheintegerpartisnotgiven(i.
e.
zero).
Wewillnottrytomatchnumberswithanexponent,suchas1.
5e8(150millioninscientificnotation).
Atfirstthought,thefollowingregexseemstodothetrick:0-9]*\.
[0-9]*.
Thisdefinesafloatingpointnumberasanoptionalsign,followedbyanoptionalseriesofdigits(integerpart),followedbyanoptionaldot,followedbyanotheroptionalseriesofdigits(fractionpart).
Spellingouttheregexinwordsmakesitobvious:everythinginthisregularexpressionisoptional.
Thisregularexpressionwillconsiderasignbyitselforadotbyitselfasavalidfloatingpointnumber.
Infact,itwillevenconsideranemptystringasavalidfloatingpointnumber.
ThisregularexpressioncancauseserioustroubleifitisusedinascriptinglanguagelikePerlorPHPtoverifyuserinput.
Not escaping the dot is also a common mistake. A dot that is not escaped will match any character, including a dot. If we had not escaped the dot, "4.4" would be considered a floating point number, and "4X4" too.

When creating a regular expression, it is more important to consider what it should not match than what it should. The above regex will indeed match a proper floating point number, because the regex engine is greedy. But it will also match many things we do not want, which we have to exclude.
Here is a better attempt: [-+]?([0-9]*\.[0-9]+|[0-9]+). This regular expression will match an optional sign that is either followed by zero or more digits followed by a dot and one or more digits (a floating point number with optional integer part), or followed by one or more digits (an integer). This is a far better definition. Any match will include at least one digit, because there is no way around the [0-9]+ part. We have successfully excluded the matches we do not want: those without digits. We can optimize this regular expression as: [-+]?[0-9]*\.?[0-9]+.

If you also want to match numbers with exponents, you can use: [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?. Notice how I made the entire exponent part optional by grouping it together, rather than making each element in the exponent optional.
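The difference between the all-optional regex and the stricter ones is easy to check by hand. The following is a minimal sketch in Python, assuming the whole input must be a number (hence fullmatch); the constant and function names are illustrative and not part of the original text.

```python
import re

# Everything is optional, so the empty string and a lone sign or dot match.
NAIVE = re.compile(r"[-+]?[0-9]*\.?[0-9]*")
# At least one digit is required after the optional dot.
BETTER = re.compile(r"[-+]?[0-9]*\.?[0-9]+")
# Same as BETTER, with the whole exponent grouped and made optional.
WITH_EXPONENT = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")


def is_match(pattern, text):
    """Return True if the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["", "+", ".", "42", "-.5", "3.14", "1.5e8"]:
        print(repr(sample),
              is_match(NAIVE, sample),
              is_match(BETTER, sample),
              is_match(WITH_EXPONENT, sample))
```

Running the loop shows the first pattern accepting "", "+" and ".", while the other two reject all three.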
3. How to Find or Validate an Email Address

The regular expression I receive the most feedback on, not to mention "bug" reports, is the one you'll find right in the tutorial's introduction: \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn't match. Usually, the "bug" report also includes a suggestion to make the regex "perfect".

As I explain below, my claim only holds true when one accepts my definition of what a valid email address really is, and what it's not. If you want to use a different definition, you'll have to adapt the regex. Matching a valid email address is a perfect example showing that (1) before writing a regex, you have to know exactly what you're trying to match, and what not; and (2) there's often a trade-off between what's exact, and what's practical.

The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the email addresses it matches can be handled by 99% of all email software out there. If you're looking for a quick solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of alternatives to choose from, read on.
If you want to use the regular expression above, there are two things you need to understand. First, long regexes make it difficult to nicely format paragraphs. So I didn't include a-z in any of the three character classes. This regex is intended to be used with your regex engine's "case insensitive" option turned on. (You'd be surprised how many "bug" reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.

The previous paragraph also applies to all following examples. You may need to change word boundaries into start/end-of-string anchors, or vice versa. And you will need to turn on the case insensitive matching option.
Trade-Offs in Validating Email Addresses

Yes, there are a whole bunch of email addresses that my pet regex doesn't match. The most frequently quoted example is addresses on the .museum top level domain, which is longer than the 4 letters my regex allows for the top level domain. I accept this trade-off because the number of people using .museum email addresses is extremely low. I've never had a complaint that the order forms or newsletter subscription forms on the JGsoft websites refused a .museum address (which they would, since they use the above regex to validate the email address).

To include .museum, you could use ^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$. However, then there's another trade-off. This regex will match "john@mail.office". It's far more likely that John forgot to type in the .com top level domain rather than having just created a new .office top level domain without ICANN's permission.

This shows another trade-off: do you want the regex to check if the top level domain exists? My regex doesn't. Any combination of two to four letters will do, which covers all existing and planned top level domains except .museum. But it will match addresses with invalid top-level domains like "asdf@asdf.asdf". By not being overly strict about the top-level domain, I don't have to update the regex each time a new top-level domain is created, whether it's a country code or generic domain.
^[A-Z0-9._%-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|gov|biz|info|name|aero|biz|info|jobs|museum)$ could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By the time you read this, the list might already be out of date. If you use this regular expression, I recommend you store it in a global constant in your application, so you only have to update it in one place. You could list all country codes in the same manner, even though there are almost 200 of them.
Email addresses can be on servers on a subdomain, e.g. "john@server.department.company.com". All of the above regexes will match this email address, because I included a dot in the character class after the @ symbol. However, the above regexes will also match "john@aol...com", which is not valid due to the consecutive dots. You can exclude such matches by replacing [A-Z0-9.-]+\. with (?:[A-Z0-9-]+\.)+ in any of the above regexes. I removed the dot from the character class and instead repeated the character class and the following literal dot. E.g. \b[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b will match "john@server.department.company.com" but not "john@aol...com".
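The effect of moving the dot out of the character class can be tried out with a short Python sketch. The character class used for the part before the @ is an assumption made for illustration, as are the names LOOSE, STRICT and accepts; adapt them to your own definition of a valid address.

```python
import re

# ASSUMPTION: [A-Z0-9._%-] for the part before the @ is illustrative only.
# The domain class includes a dot, so "aol...com" slips through.
LOOSE = re.compile(r"^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

# The domain is now a repeated "label plus dot" group, so two dots can never be adjacent.
STRICT = re.compile(r"^[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)


def accepts(pattern, address):
    """Return True if the pattern validates the address."""
    return pattern.search(address) is not None


if __name__ == "__main__":
    for address in ["john@server.department.company.com", "john@aol...com", "john@mail.office"]:
        print(address, accepts(LOOSE, address), accepts(STRICT, address))
```

Both patterns rely on the case insensitive option, which is why the classes only list A-Z.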
Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main reason is that I don't trust all my email software to be able to handle much else. Even though John.O'Hara@theharas.com is a syntactically valid email address, there's a risk that some software will misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL statement will cause it to fail if strings are delimited with single quotes. And of course, it's been many years already that domain names can include non-English characters. Most software and even domain name registrars, however, still stick to the 37 characters they're used to.

The conclusion is that to decide which regular expression to use, whether you're trying to match an email address or something else that's vaguely defined, you need to start with considering all the trade-offs. How bad is it to match something that's not valid? How bad is it not to match something that is valid? How complex can your regular expression be? How expensive would it be if you had to change the regular expression later? Different answers to these questions will require a different regular expression as the solution. My email regex does what I want, but it may not do what you want.
Regexes Don't Send Email

Don't go overboard in trying to eliminate invalid email addresses with your regular expression. If you have to accept .museum domains, allowing any 6-letter top level domain is often better than spelling out a list of all current domains. The reason is that you don't really know whether an address is valid until you try to send an email to it. And even that might not be enough. Even if the email arrives in a mailbox, that doesn't mean somebody still reads that mailbox.

The same principle applies in many situations. When trying to match a valid date, it's often easier to use a bit of arithmetic to check for leap years, rather than trying to do it in a regex. Use a regular expression to find potential matches or check if the input uses the proper syntax, and do the actual validation on the potential matches returned by the regular expression. Regular expressions are a powerful tool, but they're far from a panacea.
The Official Standard: RFC 2822

Maybe you're wondering why there's no "official" fool-proof regex to match email addresses. Well, there is an official definition, but it's hardly fool-proof. The official standard is known as RFC 2822. It describes the syntax that valid email addresses must adhere to. You can (but you shouldn't; read on) implement it with this regular expression:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This regex has two parts: the part before the @, and the part after the @.
There are two alternatives for the part before the @. It can consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.

The part after the @ also has two alternatives. It can either be a fully qualified domain name (e.g. regular-expressions.info), or it can be a literal Internet address between square brackets. The literal Internet address can either be an IP address, or a domain-specific routing address.
The reason you shouldn't use this regex is that it only checks the basic syntax of email addresses. john@aol.com.nospam would be considered a valid email address according to RFC 2822. Obviously, this email address won't work, since there's no "nospam" top-level domain. It also doesn't guarantee your email software will be able to handle it. Not all applications support the syntax using double quotes or square brackets. In fact, RFC 2822 itself marks the notation using square brackets as obsolete.

We get a more practical implementation of RFC 2822 if we omit the syntax using double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

A further change you could make is to allow any two-letter country code top level domain, and only specific generic top level domains. This regex filters dummy email addresses like asdf@adsf.adsf. You will need to update it as new top-level domains are added.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|biz|info|name|aero|biz|info|jobs|museum)\b

So even when following official standards, there are still trade-offs to be made. Don't blindly copy regular expressions from online libraries or discussion forums. Always test them on your own data and with your own applications.
4. Matching a Valid Date

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]) matches a date in yyyy-mm-dd format from between 1900-01-01 and 2099-12-31, with a choice of four separators. The year is matched by (19|20)\d\d. I used alternation to allow the first two digits to be 19 or 20. The round brackets are mandatory. Had I omitted them, the regex engine would go looking for 19 or the remainder of the regular expression, which matches a date between 2000-01-01 and 2099-12-31. Round brackets are the only way to stop the vertical bar from splitting up the entire regular expression into two options.

The month is matched by 0[1-9]|1[012], again enclosed by round brackets to keep the two options together. By using character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.

The last part of the regex consists of three options. The first matches the numbers 01 through 09, the second 10 through 29, and the third matches 30 or 31.

Smart use of alternation allows us to exclude invalid dates such as 2000-00-00 that could not have been excluded without using alternation. To be really perfectionist, you would have to split up the month into various options to take into account the length of the month. The above regex still matches 2003-02-31, which is not a valid date. Making leading zeros optional could be another enhancement.
If you want to require the delimiters to be consistent, you could use a backreference. (19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01]) will match "1999-01-01" but not "1999/01-01".
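A quick way to see the backreference at work is the following Python sketch. The names CONSISTENT and same_separator are made up for this illustration, and fullmatch is used on the assumption that the whole string should be a date.

```python
import re

# Group 2 captures the first separator; \2 forces the second separator to be the same character.
CONSISTENT = re.compile(r"(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])")


def same_separator(text):
    """Return True if text is a yyyy-mm-dd style date using one separator throughout."""
    return CONSISTENT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["1999-01-01", "1999/01-01", "2007.07.15"]:
        print(sample, same_separator(sample))
```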
Again, how complex you want to make your regular expression depends on the data you are using it on, and how big a problem it is if an unwanted match slips through. If you are validating the user's input of a date in a script, it is probably easier to do certain checks outside of the regex. For example, excluding February 29th when the year is not a leap year is far easier to do in a scripting language. It is far easier to check if a year is divisible by 4 (and not divisible by 100 unless divisible by 400) using simple arithmetic than using regular expressions.

Here is how you could check a valid date in Perl. Note that I added anchors to make sure the entire variable is a date, and not a piece of text containing a date. I also added round brackets to capture the year into a backreference.
sub isvaliddate {
  my $input = shift;
  if ($input =~ m!^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$!) {
    # At this point, $1 holds the year, $2 the month and $3 the day of the date entered
    if ($3 == 31 and ($2 == 4 or $2 == 6 or $2 == 9 or $2 == 11)) {
      return 0; # 31st of a month with 30 days
    } elsif ($3 >= 30 and $2 == 2) {
      return 0; # February 30th or 31st
    } elsif ($2 == 2 and $3 == 29 and not ($1 % 4 == 0 and ($1 % 100 != 0 or $1 % 400 == 0))) {
      return 0; # February 29th outside a leap year
    } else {
      return 1; # Valid date
    }
  } else {
    return 0; # Not a date
  }
}

To match a date in mm/dd/yyyy format, rearrange the regular expression to (0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d. For dd-mm-yyyy format, use (0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d.
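For readers who do not use Perl, the same split between syntax check and arithmetic can be sketched in Python. This is only a rough port under stated assumptions: the function name is_valid_date is made up here, and fullmatch takes over the role of the start and end anchors.

```python
import re

# Syntax check only: year 1900-2099, month 01-12, day 01-31, any of four separators.
DATE = re.compile(r"((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")


def is_valid_date(text):
    """Return True if text is a syntactically and arithmetically valid yyyy-mm-dd date."""
    match = DATE.fullmatch(text)  # fullmatch plays the role of the ^ and $ anchors
    if match is None:
        return False  # not a date at all
    year, month, day = (int(part) for part in match.groups())
    if day == 31 and month in (4, 6, 9, 11):
        return False  # 31st of a month with 30 days
    if month == 2 and day >= 30:
        return False  # February 30th or 31st
    is_leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    if month == 2 and day == 29 and not is_leap:
        return False  # February 29th outside a leap year
    return True


if __name__ == "__main__":
    for sample in ["2003-02-31", "2004-02-29", "1900-02-29", "2000-02-29", "2003-04-31"]:
        print(sample, is_valid_date(sample))
```

The regex only decides whether the input looks like a date. The calendar rules stay in plain arithmetic, where they are easier to read and change.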
5. Matching Whole Lines of Text

Often, you want to match complete lines in a text file rather than just the part of the line that satisfies a certain requirement. This is useful if you want to delete entire lines in a search-and-replace in a text editor, or collect entire lines in an information retrieval tool.

To keep this example simple, let's say we want to match lines containing the word "John". The regex John makes it easy enough to locate those lines. But the software will only indicate "John" as the match, not the entire line containing the word.

The solution is fairly simple. To specify that we need an entire line, we will use the caret and dollar sign and turn on the option to make them match at embedded newlines. In software aimed at working with text files like EditPad Pro and PowerGREP, the anchors always match at embedded newlines. To match the parts of the line before and after the match of our original regular expression John, we simply use the dot and the star. Be sure to turn off the option for the dot to match newlines.

The resulting regex is: ^.*John.*$. You can use the same method to expand the match of any regular expression to an entire line, or a block of complete lines. In some cases, such as when using alternation, you will need to group the original regex together using round brackets.
Finding Lines Containing or Not Containing Certain Words

If a line can meet any out of a series of requirements, simply use alternation in the regular expression. ^.*\b(one|two|three)\b.*$ matches a complete line of text that contains any of the words "one", "two" or "three". The first backreference will contain the word the line actually contains. If it contains more than one of the words, then the last (rightmost) word will be captured into the first backreference. This is because the star is greedy. If we make the first star lazy, like in ^.*?\b(one|two|three)\b.*$, then the backreference will contain the first (leftmost) word.
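The greedy and lazy variants can be compared side by side. The sketch below uses Python with the multi-line option turned on so the anchors match at line breaks; the sample line and variable names are invented for the illustration.

```python
import re

# Greedy first star: the engine backtracks from the end of the line, so the rightmost word is captured.
GREEDY = re.compile(r"^.*\b(one|two|three)\b.*$", re.MULTILINE)
# Lazy first star: the engine expands from the start of the line, so the leftmost word is captured.
LAZY = re.compile(r"^.*?\b(one|two|three)\b.*$", re.MULTILINE)

line = "count three, then two, then one"
greedy_word = GREEDY.search(line).group(1)
lazy_word = LAZY.search(line).group(1)

if __name__ == "__main__":
    print("greedy captures:", greedy_word)
    print("lazy captures:", lazy_word)
```

Both patterns match the same complete line. Only the contents of the first backreference differ.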
If a line must satisfy all of multiple requirements, we need to use lookahead. ^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$ matches a complete line of text that contains all of the words "one", "two" and "three". Again, the anchors must match at the start and end of a line and the dot must not match line breaks. Because of the caret, and the fact that lookahead is zero-width, all three lookaheads are attempted at the start of each line. Each lookahead will match any piece of text on a single line (.*?) followed by one of the words. All three must match successfully for the entire regex to match. Note that instead of words like \bword\b, you can put any regular expression, no matter how complex, inside the lookahead. Finally, .*$ causes the regex to actually match the line, after the lookaheads have determined it meets the requirements.
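As a minimal sketch of the three-lookahead pattern in use, the following Python snippet collects only those lines that contain all three words. The sample text and names are assumptions made for the example.

```python
import re

# Each lookahead scans the current line independently from its start; .*$ then consumes the line.
ALL_THREE = re.compile(r"^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$", re.MULTILINE)

text = "one two three\nthree and one\ntwo, three, one and more\nnone of them"
matching_lines = ALL_THREE.findall(text)

if __name__ == "__main__":
    for found in matching_lines:
        print(found)
```

The order of the words on the line does not matter, because every lookahead starts over at the beginning of the line.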
If your condition is that a line should not contain something, use negative lookahead. ^((?!regexp).)*$ matches a complete line that does not match regexp. Notice that unlike before, when using positive lookahead, I repeated both the negative lookahead and the dot together. For the positive lookahead, we only need to find one location where it can match. But the negative lookahead must be tested at each and every character position in the line. We must test that regexp fails everywhere, not just somewhere.

Finally, you can combine multiple positive and negative requirements as follows: ^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$. When checking multiple positive requirements, the .* at the end of the regular expression full of zero-width assertions made sure that we actually matched something. Since the negative requirement must match the entire line, it is easy to replace the .* with the negative test.
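The negative and the combined patterns can be exercised with a few sample lines. This Python sketch uses a non-capturing group around the lookahead and dot, which is a small deviation from the regexes in the text made so that only whole lines are returned; the sample text and names are illustrative.

```python
import re

# The negative lookahead is tested before every single character of the line.
WITHOUT = re.compile(r"^(?:(?!avoid|illegal).)*$", re.MULTILINE)
# Two positive requirements checked at the start of the line, then the negative test consumes the line.
COMBINED = re.compile(
    r"^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)(?:(?!avoid|illegal).)*$",
    re.MULTILINE,
)

text = (
    "a must-have and mandatory line\n"
    "must-have, mandatory, but illegal\n"
    "plain line\n"
    "avoid this one"
)
clean_lines = [m.group(0) for m in WITHOUT.finditer(text)]
required_lines = [m.group(0) for m in COMBINED.finditer(text)]

if __name__ == "__main__":
    print(clean_lines)
    print(required_lines)
```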
6. Deleting Duplicate Lines From a File

If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete (subsequent) duplicate lines. Simply open the file in your favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$

"[^"\r\n]*" matches a single-line string that does not allow the quote character to appear inside the string. Using the negated character class is more efficient than using a lazy dot. "[^"]*" allows the string to span across multiple lines.

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.

You can adapt the above regexes to match any sequence delimited by two (possibly different) characters. If we use "b" for the starting character, "e" for the end, and "x" as the escape character, the version without escape becomes b[^e\r\n]*e, and the version with escape becomes b[^ex\r\n]*(?:x.[^ex\r\n]*)*e.
Numbers

\b\d+\b matches a positive integer number. Do not forget the word boundaries! [-+]?\b\d+\b allows for a sign.

\b0[xX][0-9a-fA-F]+\b matches a C-style hexadecimal number.

((\b[0-9]+)?\.)?[0-9]+\b matches an integer number as well as a floating point number with optional integer part. (\b[0-9]+\.([0-9]+\b)?|\.[0-9]+\b) matches a floating point number with optional integer as well as optional fractional part, but does not match an integer number.

((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b matches a number in scientific notation. The mantissa can be an integer or floating point number with optional integer part. The exponent is optional. \b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b also matches a number in scientific notation. The difference with the previous example is that if the mantissa is a floating point number, the integer part is mandatory.

If you read through the floating point number example, you will notice that the above regexes are different from what is used there. The above regexes are more stringent. They use word boundaries to exclude numbers that are part of other things like identifiers. You can prepend [-+]? to all of the above regexes to include an optional sign in the regex. I did not do so above because in programming languages, the + and - are usually considered operators rather than signs.
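Why the word boundaries matter becomes visible on a line of source code. The Python sketch below runs the integer and hexadecimal regexes over a made-up statement and, for contrast, a version without boundaries; the variable names are hypothetical.

```python
import re

INTEGER = re.compile(r"\b\d+\b")
HEXADECIMAL = re.compile(r"\b0[xX][0-9a-fA-F]+\b")
NO_BOUNDARIES = re.compile(r"\d+")  # for contrast only

source = "x1 = 42 + 0x1F"
integers = INTEGER.findall(source)
hex_numbers = HEXADECIMAL.findall(source)
sloppy = NO_BOUNDARIES.findall(source)

if __name__ == "__main__":
    print(integers)     # digits inside the identifier and the hex literal are skipped
    print(hex_numbers)
    print(sloppy)       # picks up fragments of x1 and 0x1F as well
```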
Reserved Words or Keywords

Matching reserved words is easy. Simply use alternation to string them together: \b(first|second|third|etc)\b Again, do not forget the word boundaries.
8. Find Two Words Near Each Other

Some search tools that use boolean operators also have a special operator called "near". Searching for "term1 near term2" finds all occurrences of term1 and term2 that occur within a certain "distance" from each other. The distance is a number of words. The actual number depends on the search tool, and is often configurable. You can easily perform the same task with the proper regular expression.

Emulating "near" with a Regular Expression

With regular expressions you can describe almost any text pattern, including a pattern that matches two words near each other. This pattern is relatively simple, consisting of three parts: the first word, a certain number of unspecified words, and the second word. An unspecified word can be matched with the shorthand character class \w+. The spaces and other characters between the words can be matched with \W+ (uppercase W this time).

The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b. The quantifier {1,6}? makes the regex require at least one word between "word1" and "word2", and allow at most six words.
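The pattern lends itself to a small helper that fills in the two words and the maximum distance. The following Python sketch is one possible way to do that; the function name near and its max_words parameter are assumptions made for this example, not part of any search tool.

```python
import re


def near(word1, word2, max_words=6):
    """Build a regex matching word1, then 1 to max_words other words, then word2."""
    return re.compile(
        r"\b" + re.escape(word1)
        + r"\W+(?:\w+\W+){1," + str(max_words) + r"}?"
        + re.escape(word2) + r"\b"
    )


if __name__ == "__main__":
    pattern = near("quick", "fox")
    for sample in ["the quick brown fox jumps", "the quick fox", "quick a b c d e f g fox"]:
        found = pattern.search(sample)
        print(sample, "->", found.group(0) if found else None)
```

Calling re.escape on the two words keeps characters such as dots or brackets in a search term from being read as regex syntax.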
If the words may also occur in reverse order, we need to specify the opposite pattern as well: \b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b

If you want to find any pair of two words out of a list of words, you can use: \b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b. This regex will also find a word near itself, e.g. it will match "word2 near word2".
9. Runaway Regular Expressions: Catastrophic Backtracking

Consider the regular expression (x+x+)+y. Before you scream in horror and say this contrived example should be written as xx+y to match exactly the same without those terribly nested quantifiers: just assume that each "x" represents something more complex, with certain strings being matched by both "x". See the section on HTML files below for a real example.

Let's see what happens when you apply this regex to "xxxxxxxxxxy". The first x+ will match all 10 "x" characters. The second x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining "x". The group has now matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches "y" and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer explodes. Almost.
The above regex turns ugly when the "y" is missing from the subject string. When y fails, the regex engine backtracks. The group has one iteration it can backtrack into. The second x+ matched only one "x", so it can't backtrack. But the first x+ can give up one "x". The second x+ promptly matches "xx". The group again has one iteration, fails the next one, and the y fails. Backtracking again, the second x+ now has one backtracking position, reducing itself to match "x". The group tries a second iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's first iteration reduces itself to 7 characters. The second x+ matches "xxx". Failing y, the second x+ is reduced to "xx" and then "x". Now, the group can match a second iteration, with one "x" for each x+. But this (7,1),(1,1) combination fails too. So it goes to (6,4) and then (6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.
If you try this regex on a 10x string in RegexBuddy's debugger, it'll take 2559 steps to figure out the final y is missing. For an 11x string, it needs 5119 steps. For 12, it takes 10239 steps. Clearly we have an exponential complexity of O(2^n) here. At 16x the debugger bows out at 100,000 steps, diagnosing a bad case of catastrophic backtracking.
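The three step counts quoted above happen to fit a simple closed form, which makes the doubling explicit. The Python sketch below is only a curve fit of those three numbers, not a description of how RegexBuddy counts steps; the function name fitted_steps is made up, and extrapolating beyond 12 characters is an assumption.

```python
def fitted_steps(n):
    """Closed form reproducing the step counts quoted for n = 10, 11 and 12 (a fit, nothing more)."""
    return 5 * 2 ** (n - 1) - 1


if __name__ == "__main__":
    for n in range(10, 17):
        print(n, fitted_steps(n))
```

If the fit were to hold for longer strings, the count would first pass 100,000 at 16 characters, which is at least consistent with the debugger giving up there.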
RegexBuddy is forgiving in that it detects it's going in circles, and aborts the match attempt. Other regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version 5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.
Possessive Quantifiers and Atomic Grouping to The Rescue

In the above example, the sane thing to do is obviously to rewrite it as xx+y, which eliminates the nested quantifiers entirely. Nested quantifiers are repeated or alternated tokens inside a group that is itself repeated or alternated. These almost always lead to catastrophic backtracking. About the only situation where they don't is when the start of each alternative inside the group is not optional, and mutually exclusive with the start of all the other alternatives, and mutually exclusive with the token that follows it (inside its alternative inside the group). E.g. (a+b+|c+d+)+y is safe. If anything fails, the regex engine will backtrack through the whole regex, but it will do so linearly. The reason is that all the tokens are mutually exclusive. None of them can match any characters matched by any of the others. So the match attempt at each backtracking position will fail, causing the regex engine to backtrack linearly. If you test this on "aaaabbbbccccdddd", RegexBuddy needs only 14 steps rather than 100,000+ steps to figure it out.
However, it's not always possible or easy to rewrite your regex to make everything mutually exclusive. So we need a way to tell the regex engine not to backtrack. When we've grabbed all the x's, there's no need to backtrack. There couldn't possibly be a "y" in anything matched by either x+. Using a possessive quantifier, our regex becomes (x+x+)++y. This fails the 16x string in merely 8 steps. That's 7 steps to match all the x's, and 1 step to figure out that y fails. No backtracking is done. Using an atomic group, the regex becomes (?>(x+x+)+)y with the exact same results.
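Not every regex flavor offers possessive quantifiers or atomic groups. To my knowledge Python's re module only gained both in version 3.11. A commonly used workaround, shown here as a sketch and not as something taken from this book, is to capture the greedy run inside a lookahead and consume it with a backreference. Since the engine does not backtrack into a lookahead once it has succeeded, the group behaves much like an atomic one.

```python
import re

# Emulation of (?>(x+x+)+)y for engines without atomic groups:
# the lookahead captures the run of x's, \1 consumes exactly that text,
# and the engine cannot go back into the lookahead to try shorter runs.
ATOMIC_LIKE = re.compile(r"(?=((?:x+x+)+))\1y")

matched = ATOMIC_LIKE.search("xxxxxxxxxxy")
failed = ATOMIC_LIKE.search("x" * 40)  # no y: fails quickly instead of exploring every split

if __name__ == "__main__":
    print(matched.group(0) if matched else None)
    print(failed)
```

The emulation adds a capturing group, so the numbers of any later groups shift by one.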
A Real Example: Matching CSV Records

Here's a real example from a technical support case I once handled. The customer was trying to find lines in a comma-delimited text file where the 12th item on a line started with a "P". He was using the innocent-looking regex ^(.*?,){11}P.

At first sight, this regex looks like it should do the job just fine. The lazy dot and comma match a single comma-delimited field, and the {11} skips the first 11 fields. Finally, the P checks if the 12th field indeed starts with P. In fact, this is exactly what will happen when the 12th field indeed starts with a P.

The problem rears its ugly head when the 12th field does not start with a P. Let's say the string is "1,2,3,4,5,6,7,8,9,10,11,12,13". At that point, the regex engine will backtrack. It will backtrack to the point where ^(.*?,){11} had consumed "1,2,3,4,5,6,7,8,9,10,11", giving up the last match of the comma. The next token is again the dot. The dot matches a comma. The dot matches the comma! However, the comma does not match the "1" in the 12th field, so the dot continues until the 11th iteration of .*?, has consumed "11,12,". You can already see the root of the problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of the double repetition (star inside {11}), this leads to a catastrophic amount of backtracking.
The regex engine now checks whether the 13th field starts with a P. It does not. Since there is no comma after the 13th field, the regex engine can no longer match the 11th iteration of .*?,. But it does not give up there. It backtracks to the 10th iteration, expanding the match of the 10th iteration to "10,11,". Since there is still no P, the 10th iteration is expanded to "10,11,12,". Reaching the end of the string again, the same story starts with the 9th iteration, subsequently expanding it to "9,10,", "9,10,11,", "9,10,11,12,". But between each expansion, there are more possibilities to be tried. When the 9th iteration consumes "9,10,", the 10th could match just "11," as well as "11,12,". Continuously failing, the engine backtracks to the 8th iteration, again trying all possible combinations for the 9th, 10th, and 11th iterations.

You get the idea: the possible number of combinations that the regex engine will try for each line where the 12th field does not start with a P is huge. All this would take a long time if you ran this regex on a large CSV file where most rows don't have a P at the start of the 12th field.
Preventing Catastrophic Backtracking

The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. If repeating the inner loop 4 times and the outer loop 7 times results in the same overall match as repeating the inner loop 6 times and the outer loop 2 times, you can be sure that the regex engine will try all those combinations.

In our example, the solution is to be more exact about what we want to match. We want to match 11 comma-delimited fields. The fields must not contain commas. So the regex becomes: ^([^,\r\n]*,){11}P. If the P cannot be found, the engine will still backtrack. But it will backtrack only 11 times, and each time the [^,\r\n] is not able to expand beyond the comma, forcing the regex engine to the previous one of the 11 iterations immediately, without trying further options.
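The two CSV regexes can be compared on short rows in Python. This is a minimal sketch with invented sample rows and names. It also shows a side effect that follows from the dot matching the delimiter: as far as I can tell, the original regex can report a match when the P sits in the 13th field.

```python
import re

ORIGINAL = re.compile(r"^(.*?,){11}P")         # the dot can swallow commas
IMPROVED = re.compile(r"^([^,\r\n]*,){11}P")   # a field can never contain a comma

p_in_12th = "1,2,3,4,5,6,7,8,9,10,11,P12,13"
no_p = "1,2,3,4,5,6,7,8,9,10,11,12,13"
p_in_13th = "1,2,3,4,5,6,7,8,9,10,11,12,P13"


def hits(pattern, row):
    """Return True if the pattern finds a match in the row."""
    return pattern.search(row) is not None


if __name__ == "__main__":
    for row in (p_in_12th, no_p, p_in_13th):
        print(row, hits(ORIGINAL, row), hits(IMPROVED, row))
```

On rows this short both patterns finish instantly. The difference in running time only shows up on longer rows that fail to match.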
See the Difference with RegexBuddy

If you try this example with RegexBuddy's debugger, you will see that the original regex ^(.*?,){11}P needs 29,687 steps to conclude the regex cannot match "1,2,3,4,5,6,7,8,9,10,11,12". If the string is "1,2,3,4,5,6,7,8,9,10,11,12,13", just 3 characters more, the number of steps doubles to 60,315. It's not too hard to imagine that at this kind of exponential rate, attempting this regex on a large file with long lines could easily take forever. RegexBuddy's debugger will abort the attempt after 100,000 steps, to prevent it from running out of memory.

Our improved regex ^([^,\r\n]*,){11}P, however, needs just forty-eight steps to fail, whether the subject string has 12 numbers, 13 numbers, 16 numbers or a billion. While the complexity of the original regex was exponential, the complexity of the improved regex is constant with respect to whatever follows the 12th field. The reason is the regex fails immediately when it discovers the 12th field doesn't start with a P. It simply backtracks 12 times without expanding again, and that's it.

The complexity of the improved regex is linear to the length of the first 11 fields. 36 steps are needed in our example. That's the best we can do, since the engine does have to scan through all the characters of the first 11 fields to find out where the 12th one begins. Our improved regex is a perfect solution.
Alternative Solution Using Atomic Grouping

In the above example, we could easily reduce the amount of backtracking to a very low level by better specifying what we wanted. But that is not always possible in such a straightforward manner. In that case, you should use atomic grouping to prevent the regex engine from backtracking.

Using atomic grouping, the above regex becomes ^(?>(.*?,){11})P. Everything between (?>) is treated as one single token by the regex engine, once the regex engine leaves the group. Because the entire group is one token, no backtracking can take place once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the regex must retry the entire regex at the next position in the string.
Let's see how ^(?>(.*?,){11})P is applied to "1,2,3,4,5,6,7,8,9,10,11,12,13". The caret matches at the start of the string and the engine enters the atomic group. The star is lazy, so the dot is initially skipped. But the comma does not match "1", so the engine backtracks to the dot. That's right: backtracking is allowed here. The star is not possessive, and is not immediately enclosed by an atomic group. That is, the regex engine did not cross the closing round bracket of the atomic group. The dot matches "1", and the comma matches too. {11} causes further repetition until the atomic group has matched "1,2,3,4,5,6,7,8,9,10,11,". Now, the engine leaves the atomic group. Because the group is atomic, all backtracking information is discarded and the group is now considered a single token. The engine now tries to match P to the "1" in the 12th field. This fails.
So far, everything happened just like in the original, troublesome regular expression. Now comes the difference. P failed to match, so the engine backtracks. The previous token is an atomic group, so the group's entire match is discarded and the engine backtracks further to the caret. The engine now tries to match the caret at the next position in the string, which fails. The engine walks through the string until the end, and declares failure. Failure is declared after 30 attempts to match the caret, and just one attempt to match the atomic group, rather than after 30 attempts to match the caret and a huge number of attempts to try all combinations of both quantifiers in the regex.

That is what atomic grouping and possessive quantifiers are for: efficiency by disallowing backtracking. The most efficient regex for our problem at hand would be ^(?>((?>[^,\r\n]*),){11})P, since possessive, greedy repetition of the star is faster than a backtracking lazy dot. If possessive quantifiers are available, you can reduce clutter by writing ^(?>([^,\r\n]*+,){11})P.
Quickly Matching a Complete HTML File

Another common situation where catastrophic backtracking occurs is when trying to match "something" followed by "anything" followed by "another something" followed by "anything", where the lazy dot .*? is used. The more "anything", the more backtracking. Sometimes, the lazy dot is simply a symptom of a lazy programmer. ".*?" is not appropriate to match a double-quoted string, since you don't really want to allow anything between the quotes. A string can't have (unescaped) embedded quotes, so "[^"\r\n]*" is more appropriate, and won't lead to catastrophic backtracking when combined in a larger regular expression.

However, sometimes "anything" really is just that. The problem is that "another something" also qualifies as "anything", giving us a genuine x+x+ situation.
Suppose you want to use a regular expression to match a complete HTML file, and extract the basic parts from the file. If you know the structure of HTML files, writing the regex <html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html> is very straight-forward. With the "dot matches newlines" or "single line" matching mode turned on, it will work just fine on valid HTML files.

Unfortunately, this regular expression won't work nearly as well on an HTML file that misses some of the tags. The worst case is a missing </html> tag at the end of the file. When </html> fails to match, the regex engine backtracks, giving up the match for </body>.*?. It will then further expand the lazy dot before </body>, looking for a second closing "</body>" tag in the HTML file. When that fails, the engine gives up <body[^>]*>.*?, and starts looking for a second opening "<body[^>]*>" tag all the way to the end of the file. Since that also fails, the engine proceeds looking all the way to the end of the file for a second closing head tag, a second closing title tag, etc.

If you run this regex in RegexBuddy's debugger, the output will look like a sawtooth. The regex matches the whole file, backs up a little, matches the whole file again, backs up some more, backs up yet some more, matches everything again, etc. until each of the 7 .*? tokens has reached the end of the file. The result is that this regular expression has a worst case complexity of N^7. If you double the length of the HTML file with the missing </html> tag by appending text at the end, the regular expression will take 128 times (2^7) as long to figure out the HTML file isn't valid. This isn't quite as disastrous as the 2^N complexity of our first example, but will lead to very unacceptable performance on larger invalid files.
In this situation, we know that each of the literal text blocks in our regular expression (the HTML tags, which function as delimiters) will occur only once in a valid HTML file. That makes it very easy to package each of the lazy dots (the delimited content) in an atomic group.

<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html> will match a valid HTML file in the same number of steps as the original regex. The gain is that it will fail on an invalid HTML file almost as fast as it matches a valid one. When </html> fails to match, the regex engine backtracks, giving up the match for the last lazy dot. But then, there's nothing further to backtrack to. Since all of the lazy dots are in an atomic group, the regex engine has discarded their backtracking positions. The groups function as a "do not expand further" roadblock. The regex engine is forced to announce failure immediately.
I'm sure you've noticed that each atomic group also contains an HTML tag after the lazy dot. This is critical. We do allow the lazy dot to backtrack until its matching HTML tag was found. E.g. when .*?</body> is processing "Last paragraph</p></body>", the </ regex tokens will match "</" in "</p>". However, b will fail "p". At that point, the regex engine will backtrack and expand the lazy dot to include "</p>". Since the regex engine hasn't left the atomic group yet, it is free to backtrack inside the group. Once </body> has matched, and the regex engine leaves the atomic group, it discards the lazy dot's backtracking positions. Then it can no longer be expanded.
Essentially,whatwe'vedoneistobindarepeatedregextoken(thelazydottomatchHTMLcontent)tothenon-repeatedregextokenthatfollowsit(theliteralHTMLtag).
Sinceanything,includingHTMLtags,canappearbetweentheHTMLtagsinourregularexpression,wecannotuseanegatedcharacterclassinsteadofthelazydottopreventthedelimitingHTMLtagsfrombeingmatchedasHTMLcontent.
ButwecananddidachievethesameresultbycombiningeachlazydotandtheHTMLtagfollowingitintoanatomicgroup.
AssoonastheHTMLtagismatched,thelazydot'smatchislockeddown.
ThisensuresthatthelazydotwillnevermatchtheHTMLtagthatshouldbematchedbytheliteralHTMLtagintheregularexpression.

10. Repeating a Capturing Group vs. Capturing a Repeated Group

When creating a regular expression that needs a capturing group to grab part of the text matched, a common mistake is to repeat the capturing group instead of capturing a repeated group. The difference is that the repeated capturing group will capture only the last iteration, while a group capturing another group that's repeated will capture all iterations. An example will make this clear.

Let's say you want to match a tag like "!abc!" or "!123!". Only these two are possible, and you want to capture the "abc" or "123" to figure out which tag you got. That's easy enough: «!(abc|123)!» will do the trick.

Now let's say that the tag can contain multiple sequences of "abc" and "123", like "!abc123!" or "!123abcabc!". The quick and easy solution is «!(abc|123)+!». This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag's label into the capturing group. When this regex matches "!abc123!", the capturing group stores only "123". When it matches "!123abcabc!", it only stores "abc".

This is easy to understand if we look at how the regex engine applies «!(abc|123)+!» to "!abc123!". First, «!» matches "!". The engine then enters the capturing group. It makes note that capturing group #1 was entered when the engine reached the position between the first and second character in the subject string. The first token in the group is «abc», which matches "abc". A match is found, so the second alternative isn't tried. (The engine does store a backtracking position, but this won't be used in this example.) The engine now leaves the capturing group. It makes note that capturing group #1 was exited when the engine reached the position between the 4th and 5th characters in the string.

After having exited from the group, the engine notices the plus. The plus is greedy, so the group is tried again. The engine enters the group again, and takes note that capturing group #1 was entered between the 4th and 5th characters in the string. It also makes note that since the plus is not possessive, it may be backtracked. That is, if the group cannot be matched a second time, that's fine. In this backtracking note, the regex engine also saves the entrance and exit positions of the group during the previous iteration of the group.

«abc» fails to match "123", but «123» succeeds. The group is exited again. The exit position between characters 7 and 8 is stored.

The plus allows for another iteration, so the engine tries again. Backtracking info is stored, and the new entrance position for the group is saved. But now, both «abc» and «123» fail to match "!". The group fails, and the engine backtracks. While backtracking, the engine restores the capturing positions for the group. Namely, the group was entered between characters 4 and 5, and exited between characters 7 and 8.

The engine proceeds with «!», which matches "!". An overall match is found. The overall match spans the whole subject string. The capturing group spans characters 5, 6 and 7, or "123". Backtracking information is discarded when a match is found, so there's no way to tell after the fact that the group had a previous iteration that matched "abc". (The only exception to this is the .NET regex engine, which does preserve backtracking information for capturing groups after the match attempt.)

The solution to capturing "abc123" in this example should be obvious now: the regex engine should enter and leave the group only once. This means that the plus should be inside the capturing group rather than outside. Since we do need to group the two alternatives, we'll need to place a second capturing group around the repeated group: «!((abc|123)+)!». When this regex matches "!abc123!", capturing group #1 will store "abc123", and group #2 will store "123". Since we're not interested in the inner group's match, we can optimize this regular expression by making the inner group non-capturing: «!((?:abc|123)+)!».
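
The difference between repeating a capturing group and capturing a repeated group is easy to verify with a few lines of Java. This is a small sketch, not part of the demo application; the class name RepeatedGroupDemo and the helper groupOne() are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    /** Returns what capturing group #1 holds after matching the whole subject, or null if there is no match. */
    static String groupOne(String regex, String subject) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Repeated capturing group: only the last iteration is kept.
        System.out.println(groupOne("!(abc|123)+!", "!abc123!"));         // 123
        System.out.println(groupOne("!(abc|123)+!", "!123abcabc!"));      // abc

        // Capturing group around a repeated non-capturing group: all iterations are kept.
        System.out.println(groupOne("!((?:abc|123)+)!", "!abc123!"));     // abc123
        System.out.println(groupOne("!((?:abc|123)+)!", "!123abcabc!"));  // 123abcabc
    }
}
```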

Part 3

Tools & Languages

1. Specialized Tools and Utilities for Working with Regular Expressions

These tools and utilities have regular expressions as the core of their functionality.

grep - The utility from the UNIX world that first made regular expressions popular
PowerGREP - Next generation grep for Microsoft Windows
RegexBuddy - Learn, create, understand, test, use and save regular expressions. RegexBuddy makes working with regular expressions easier than ever before.

General Applications with Notable Support for Regular Expressions

There are a lot of applications these days that support regular expressions in one way or another, enhancing certain parts of their functionality. But certain applications stand out from the crowd by implementing a full-featured Perl-style regular expression flavor and allowing regular expressions to be used instead of literal search terms throughout the application.

EditPad Pro - Convenient text editor with a powerful regex-based search and replace feature, as well as regex-based customizable syntax coloring.

Programming Languages and Libraries

If you are a programmer, you can save a lot of coding time by using regular expressions. With a regular expression, you can do powerful string parsing in only a handful of lines of code, or maybe even just a single line. A regex is faster to write and easier to debug and maintain than dozens or hundreds of lines of code to achieve the same by hand.

Delphi - Delphi does not have built-in regex support. Delphi for .NET can use the .NET framework regex support. For Win32, there are several PCRE-based VCL components available.
Java - Java 4 and later include an excellent regular expressions library in the java.util.regex package.
JavaScript - If you use JavaScript to validate user input on a web page at the client side, using JavaScript's built-in regular expression support will greatly reduce the amount of code you need to write.
.NET (dot net) - Microsoft's new development framework includes a poorly documented, but very powerful regular expression package, that you can use in any .NET-based programming language such as C# (C sharp) or VB.NET.
PCRE - Popular open source regular expression library written in ANSI C that you can link directly into your C and C++ applications, or use through an .so (UNIX/Linux) or a .dll (Windows).
Perl - The text-processing language that gave regular expressions a second life, and introduced many new features. Regular expressions are an essential part of Perl.
PHP - Popular language for creating dynamic web pages, with three sets of regex functions. Two implement POSIX ERE, while the third is based on PCRE.
POSIX - The POSIX standard defines two regular expression flavors that are implemented in many applications, programming languages and systems.
Python - Popular high-level scripting language with a comprehensive built-in regular expression library.
REALbasic - Cross-platform development tool similar to Visual Basic, with a built-in RegEx class based on PCRE.
Ruby - Another popular high-level scripting language with comprehensive regular expression support as a language feature.
Tcl - Tcl, a popular "glue" language, offers three regex flavors: two POSIX-compatible flavors, and an "advanced" Perl-style flavor.
VBScript - Microsoft scripting language used in ASP (Active Server Pages) and Windows scripting, with a built-in RegExp object implementing the regex flavor defined in the JavaScript standard.
Visual Basic 6 - Last version of Visual Basic for Win32 development. You can use the VBScript RegExp object in your VB6 applications.
XML Schema - The W3C XML Schema standard defines its own regular expression flavor for validating simple types using pattern facets.

Databases

Modern databases often offer built-in regular expression features that can be used in SQL statements to filter columns using a regular expression. With some databases you can also use regular expressions to extract the useful part of a column, or to modify columns using a search-and-replace.

MySQL - MySQL's REGEXP operator works just like the LIKE operator, except that it uses a POSIX Extended Regular Expression.
Oracle - Oracle Database 10g adds 4 regular expression functions that can be used in SQL and PL/SQL statements to filter rows and to extract and replace regex matches. Oracle implements POSIX Extended Regular Expressions.
PostgreSQL - PostgreSQL provides matching operators and extraction and substitution functions using the "Advanced Regular Expression" engine also used by Tcl.

2. Using Regular Expressions with Delphi for .NET and Win32

Use System.Text.RegularExpressions with Delphi for .NET

When developing Borland Delphi WinForms and VCL.NET applications, you can access all classes that are part of the Common Language Runtime (CLR), including System.Text.RegularExpressions. Simply add this namespace to the uses clause, and you can access the .NET regex classes such as Regex, Match and Group. You can use them with Delphi just as they can be used by C# and VB developers.

PCRE-based Components for Delphi for Windows/Win32

If your application is a good old Windows application using the Win32 API, you obviously cannot use the regex support from the .NET framework. Delphi itself does not provide a regular expression library, so you will need to use a third party VCL component. I recommend that you use a component that is based on the open source PCRE library. This is a very fast library, written in C. The regex syntax it supports is very complete. There are a few Delphi components that implement regular expressions purely in Delphi. Though that may sound like an advantage, the pure Delphi libraries I have seen do not support a full-featured modern regex syntax.

There are many PCRE-based VCL components available. Most are free, some are not. Some compile PCRE into a DLL that you need to ship along with your application, others link the PCRE OBJ files directly into your Delphi EXE.

One such component is TPerlRegEx, which I developed myself. You can download TPerlRegEx for free at http://www.regular-expressions.info/delphi.html. TPerlRegEx Delphi source, PCRE C sources, PCRE OBJ files and DLL are included. You can choose to link the OBJ files directly into your application, or to use the DLL. TPerlRegEx has full support for regex search-and-replace and regex splitting, which PCRE does not. Full documentation is included with the download as a help file. RegexBuddy's Win32 Delphi code snippets are based on the TPerlRegEx component.

3. EditPad Pro: Convenient Text Editor with Full Regular Expression Support

EditPad Pro is one of the most convenient text editors available on the Microsoft Windows platform. You can use EditPad Pro all day long without it getting in the way of what you are trying to do. When you use search & replace and the spell checker functionality, for example, you do not get a nasty popup window blocking your view of the document you are working on, but a small, extra pane just below the text. If you often work with many files at the same time, you will save time with the tabbed interface and the Project functionality for opening and saving sets of related files.

EditPad Pro's Regular Expression Support

EditPad Pro doesn't use a limited and outdated regular expression engine like so many other text editors do. EditPad Pro uses the same full-featured regular expression engine used by PowerGREP and RegexBuddy. EditPad Pro's regex flavor is fully compatible with the flavors used by Perl, Java, .NET and many other modern Perl-style regular expression flavors. All regex operators explained in the tutorial in this book are available in EditPad Pro.

EditPad Pro integrates with RegexBuddy. You can instantly fire up RegexBuddy to edit the regex you want to use in EditPad Pro, or select one from a RegexBuddy library.

Search and Replace Using Regular Expressions

Pressing Ctrl+F in EditPad Pro will make the search and replace pane appear. Mark the box labeled "regular expressions" to enable regex mode. Type in the regex you want to search for, and hit the Find First or Find Next button. EditPad Pro will then highlight the search match. If the search pane takes up too much space, simply close it after entering the regular expression. Press Ctrl+F3 to find the first match, or F3 to find the next one.

When there are no further regex matches, EditPad Pro doesn't interrupt you with a popup message that you have to OK. The text cursor and selection will simply stay where they were, and the find button that you clicked will flash briefly. This may seem a little subtle at first, but you'll quickly appreciate EditPad Pro staying out of your way and keeping you productive.

Replacing text is just as easy. First, type the replacement text, using backreferences if you want, in the Replace box. Search for the match you want to replace as above. To replace the current match, click the Replace button. To replace it and immediately search for the next match, click the Replace Next button. Or, click Replace All to get it over with.

Syntax Coloring or Highlighting Schemes

Like many modern text editors, EditPad Pro supports syntax coloring or syntax highlighting for various popular file formats and programming languages. What makes EditPad Pro unique is that you can use regular expressions to define your own syntax coloring schemes for file types not supported by default.

To create your own coloring scheme, all you need to do is download the custom syntax coloring schemes editor (only available if you have purchased EditPad Pro), and use regular expressions to specify the different syntactic elements of the file format or programming language you want to support. The regex engine used by the syntax coloring is identical to the one used by EditPad Pro's search and replace feature, so everything you learned in the tutorial in this book applies. Syntax coloring schemes can be shared on the EditPad Pro website.

The advantage is that you do not need to learn yet another scripting language or use a specific development tool to create your own syntax coloring schemes for EditPad Pro. All you need is decent knowledge of regular expressions.

File Navigation Schemes for Text Folding and Navigation

Text editors catering to programmers often allow you to fold certain sections in source code files to get a better overview. Another common feature is a sidebar showing you the file's structure, enabling you to quickly jump to a particular class definition or method implementation.

EditPad Pro also offers both these features, with one key difference. Most text editors only support folding and navigation for a limited set of file types, usually the more popular programming languages. If you use a less common language or file format, not to mention a custom one, you're out of luck.

EditPad Pro, however, implements folding and navigation using file navigation schemes. A bunch of them are included with EditPad Pro. These schemes are fully editable, and you can even create your own. Many file navigation schemes have been shared by other EditPad Pro users.

You can create and edit these schemes with a special file navigation scheme editor, which you can download after buying EditPad Pro. Like the syntax coloring schemes, file navigation schemes are based entirely on regular expressions. Because file navigation schemes are extremely flexible, editing them will take some effort. But with a bit of practice, you can make EditPad Pro's code folding and file navigation work just the way you want, and support all the file types that you work with, even proprietary ones.

More Information on EditPad Pro and Free Trial Download

EditPad Pro works under Windows 98, ME, NT4, 2000, XP and Vista. For more information on EditPad Pro, please visit www.editpadpro.com.

4. What Is grep

Grep is a tool that originated from the UNIX world during the 1970s. It can search through files and folders (directories in UNIX) and check which lines in those files match a given regular expression. Grep will output the filenames and the line numbers or the actual lines that matched the regular expression. All in all, it is a very useful tool for locating information stored anywhere on your computer, even (or especially) if you do not really know where to look.

Using grep

If you type grep regex *.txt, grep will search through all text files in the current folder. It will apply the regex to each line in the files, and print (i.e. display) each line on which a match was found. This means that grep is inherently line-based. Regex matches cannot span multiple lines.

If you like to work on the command line, the traditional grep tool will make a lot of tasks easier. All Linux distributions (except tiny floppy-based ones) install a version of grep by default, usually GNU grep. If you are using Microsoft Windows, you will need to download and install it separately. If you use Borland development tools, you already have Borland's Turbo GREP installed.

grep not only works with globbed files, but also with anything you supply on the standard input. When used with standard input, grep will print all lines it reads from standard input that match the regex. E.g.: the Linux find command will glob the current directory and print all file names it finds, so find | grep regex will print only the file names that match regex.

Grep's Regex Engine

Most versions of grep use a regex-directed engine, like the regex flavors discussed in the regex tutorial in this book. However, grep does not support all the fancy regex features that modern regex flavors support. Usually, support is limited to character classes (no shorthands), the dot, the start and end of line anchors, alternation with the vertical bar, and greedy repetition with the question mark, star and plus.

Depending on the version you have, you may need to escape the question mark, plus and vertical bar to give them their special meaning. Originally, grep did not support these metacharacters. They are usually still treated as literal characters when unescaped, for backward compatibility.

An enhanced version of grep is called egrep. It uses a text-directed engine. Since neither grep nor egrep support any of the special features like backreferences, lazy repetition, or lookaround, and because grep and egrep only indicate whether a match was found on a particular line or not, this distinction does not matter, except that the text-directed engine is faster.

GNU grep, the most popular version of grep on Linux, uses both a text-directed and a regex-directed engine. If you use advanced features like backreferences, which GNU grep supports (but not traditional grep and egrep), it will use the regex-directed engine. Otherwise, it uses the faster text-directed engine. Again, for the tasks that grep is designed for, this does not matter to you, the user.

Beyond The Command Line

If you like to work on the command line, then the traditional grep tool is for you. But if you like to use a graphical user interface, there are many grep-like tools available for Windows and other platforms. Simply search for "grep" on your favorite software download site. Unfortunately, many grep tools come with poor documentation, leaving it up to you to figure out exactly which regex flavor they use. Just because they claim to be Perl-compatible does not mean that they actually are. Some are almost perfectly compatible (though never identical), but others fail miserably when you want to use advanced and very useful constructs like lookaround.

One Windows-based grep tool that stands out from the crowd is PowerGREP, which I will discuss next.

5. Using Regular Expressions in Java

Java 4 (JDK 1.4) and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many 3rd party regex packages available for Java. I will only discuss Sun's regex library that is now part of the JDK. Its quality is excellent, better than most of the 3rd party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

Java 5 and 6 use the same regular expression flavor (with a few minor fixes), and provide the same regular expression classes. They add a few advanced functions not discussed on this page.

Quick Regex Methods of The String Class

The Java String class has several methods that allow you to perform an operation using a regular expression on that string in a minimal amount of code. The downside is that you cannot specify options such as "case insensitive" or "dot matches newline". For performance reasons, you should also not use these methods if you will be using the same regular expression often.

myString.matches("regex") returns true or false depending on whether the string can be matched entirely by the regular expression. It is important to remember that String.matches() only returns true if the entire string can be matched. In other words: "regex" is applied as if you had written "^regex$" with start and end of string anchors. This is different from most other regex libraries, where the "quick match test" method returns true if the regex can be matched anywhere in the string. If myString is "abc" then myString.matches("bc") returns false. «bc» matches "abc", but «^bc$» (which is really being used here) does not.
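
The difference between a whole-string test and a "match anywhere" test can be seen in a short sketch. The class MatchesVsFind and its two helper methods are illustrative names, not part of the JDK or of the demo application.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // True only if the regex matches the entire subject string.
    static boolean matchesEntirely(String subject, String regex) {
        return subject.matches(regex);
    }

    // True if the regex matches anywhere inside the subject string.
    static boolean containsMatch(String subject, String regex) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesEntirely("abc", "bc"));   // false
        System.out.println(containsMatch("abc", "bc"));     // true
        System.out.println(matchesEntirely("abc", ".*bc")); // true
    }
}
```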

myString.replaceAll("regex", "replacement") replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal "2" if there are fewer than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal "2" in the replacement text.

In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there are fewer than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: "\\$".
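
The replacement-text rules above can be tried out with the following sketch. The class and method names are invented for the example. Matcher.quoteReplacement() is a standard JDK method (Java 5 and later) that escapes dollar signs and backslashes, which is one way to treat user-supplied replacement text literally.

```java
import java.util.regex.Matcher;

class ReplaceAllDemo {
    // Swaps "key=value" into "value=key" using backreferences $1 and $2.
    static String swap(String s) {
        return s.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    // Inserts a literal dollar sign: the replacement text is \$ followed by $1.
    static String toPrice(String s) {
        return s.replaceAll("(\\d+) dollars", "\\$$1");
    }

    // Returns the subject unchanged if the user-supplied replacement text is invalid.
    static String safeReplace(String s, String regex, String userReplacement) {
        try {
            return s.replaceAll(regex, userReplacement);
        } catch (IllegalArgumentException | IndexOutOfBoundsException ex) {
            return s;
        }
    }

    // Treats the replacement as plain text, so "$1" is inserted literally.
    static String literalReplace(String s, String regex, String text) {
        return s.replaceAll(regex, Matcher.quoteReplacement(text));
    }

    public static void main(String[] args) {
        System.out.println(swap("key=value"));                  // value=key
        System.out.println(toPrice("5 dollars"));               // $5
        System.out.println(safeReplace("abc", "(b)", "$4"));    // abc
        System.out.println(literalReplace("abc", "b", "$1"));   // a$1c
    }
}
```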

myString.split("regex") splits the string at each regex match. The method returns an array of strings where each element is a part of the original string between two regex matches. The matches themselves are not included in the array. Use myString.split("regex", n) to get an array containing at most n items. The result is that the string is split at most n-1 times. The last item in the array is the unsplit remainder of the original string.
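
A few calls show how the limit parameter changes the result of split(). The wrapper class SplitDemo and its helpers are illustrative only; they return the array formatted with Arrays.toString() so the result is easy to read.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits without a limit; trailing empty strings are dropped.
    static String split(String subject, String regex) {
        return Arrays.toString(subject.split(regex));
    }

    // Splits with an explicit limit (positive: at most limit items; negative: keep trailing empties).
    static String split(String subject, String regex, int limit) {
        return Arrays.toString(subject.split(regex, limit));
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,c,d", ","));     // [a, b, c, d]
        System.out.println(split("a,b,c,d", ",", 2));  // [a, b,c,d]
        System.out.println(split("a,b,,", ","));       // [a, b]
        System.out.println(split("a,b,,", ",", -1));   // [a, b, , ]
    }
}
```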

Using The Pattern Class

In Java, you compile a regular expression by using the Pattern.compile() class factory. This factory returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US ASCII characters, causes the dot to match line breaks and causes the start and end of string anchors to match at embedded line breaks as well. When working with Unicode strings, specify Pattern.UNICODE_CASE if you want to make the regex case insensitive for all characters in all languages. You should always specify Pattern.CANON_EQ to ignore differences in Unicode encodings, unless you are sure your strings contain only US ASCII characters and you want to increase performance.

If you will be using the same regular expression often in your source code, you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the Pattern.compile() class factory. If you use one of the String methods above, the only way to specify options is to embed mode modifiers into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
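
As a rough illustration of the options above, the sketch below applies the same regexes once with flags passed to Pattern.compile() and once with embedded mode modifiers. The class FlagsDemo and its method names are hypothetical helpers written for this example.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Case insensitive + dot matches line breaks, set through flags.
    static boolean helloWithFlags(String subject) {
        Pattern p = Pattern.compile("hello.world",
                                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(subject).matches();
    }

    // The same options, set through embedded mode modifiers.
    static boolean helloWithModifiers(String subject) {
        return subject.matches("(?is)hello.world");
    }

    // With MULTILINE, ^ and $ also match at embedded line breaks.
    static boolean hasLineB(String subject, boolean multiLine) {
        Pattern p = multiLine ? Pattern.compile("^b$", Pattern.MULTILINE)
                              : Pattern.compile("^b$");
        return p.matcher(subject).find();
    }

    // UNICODE_CASE extends case insensitivity beyond US ASCII (here: e-acute).
    static boolean unicodeCase(String subject) {
        Pattern p = Pattern.compile("\u00e9",
                                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(helloWithFlags("HELLO\nWORLD"));      // true
        System.out.println(helloWithModifiers("HELLO\nWORLD"));  // true
        System.out.println(hasLineB("a\nb\nc", true));           // true
        System.out.println(hasLineB("a\nb\nc", false));          // false
        System.out.println(unicodeCase("\u00c9"));               // true
    }
}
```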

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.

Using The Matcher Class

Except for splitting a string (see previous paragraph), you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously.

To create a Matcher object, simply call the matcher() method of your Pattern object like this: myMatcher = myPattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, indicating there are no further matches, the next call to myMatcher.find() will find the first match again. The Matcher is automatically reset to the start of the string when find() fails.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the backreference. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched by the regular expression or pair of capturing parentheses.
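
A typical find() loop that reads start(), end() and group() might look like the sketch below. The class name FindAllDemo, the regex and the sample subject are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Compiled once and reused for every subject string.
    static final Pattern ADDRESS = Pattern.compile("(\\w+)@(\\w+)\\.com");

    // Describes every match as "start-end match user=group1".
    static List<String> describeMatches(String subject) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(subject);
        while (m.find()) {
            result.add(m.start() + "-" + m.end() + " " + m.group()
                       + " user=" + m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("joe@example.com, ann@test.com")) {
            System.out.println(line);
        }
        // 0-15 joe@example.com user=joe
        // 17-29 ann@test.com user=ann
    }
}
```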

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:

  StringBuffer myStringBuffer = new StringBuffer();
  myMatcher = myPattern.matcher("subject");
  while (myMatcher.find()) {
    if (checkIfThisMatchShouldBeReplaced()) {
      myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
    }
  }
  myMatcher.appendTail(myStringBuffer);

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is much faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.
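
Here is one way the placeholders could be filled in, as a self-contained sketch: it doubles every even number in a string and leaves odd numbers untouched. The class and method names are invented. Matcher.quoteReplacement() is used so the computed text is inserted literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectiveReplaceDemo {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Doubles even numbers; skipped (odd) matches are copied over unchanged.
    // Assumes the numbers are small enough to fit in an int.
    static String doubleEvenNumbers(String subject) {
        StringBuffer buffer = new StringBuffer();
        Matcher m = NUMBER.matcher(subject);
        while (m.find()) {
            int n = Integer.parseInt(m.group());
            if (n % 2 == 0) {
                // Appends the text since the last append, then the replacement.
                m.appendReplacement(buffer,
                        Matcher.quoteReplacement(Integer.toString(n * 2)));
            }
        }
        // Appends whatever follows the last replaced match.
        m.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleEvenNumbers("1 2 3 4")); // 1 4 3 8
    }
}
```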

Regular Expressions, Literal Strings and Backslashes

In literal Java strings the backslash is an escape character. The literal string "\\" is a single backslash. In regular expressions, the backslash is also an escape character. The regular expression «\\» matches a single backslash. This regular expression as a Java string becomes "\\\\". That's right: 4 backslashes to match a single one.

The regex «\w» matches a word character. As a Java string, this is written as "\\w".

The same backslash-mess occurs when providing replacement strings for methods like String.replaceAll() as literal Java strings in your Java code. In the replacement text, a dollar sign must be encoded as \$ and a backslash as \\ when you want to replace the regex match with an actual dollar sign or backslash. However, backslashes must also be escaped in literal Java strings. So a single dollar sign in the replacement text becomes "\\$" when written as a literal Java string. The single backslash becomes "\\\\". Right again: 4 backslashes to insert a single one.
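
The escaping rules can be checked with a short sketch. The class BackslashDemo and its helper methods are made-up names; the sample paths are only there to have some backslashes to work with.

```java
class BackslashDemo {
    // Regex \\ (Java literal "\\\\") matches one backslash in the subject.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    // Replacement text \\ (Java literal "\\\\") inserts one backslash.
    static String toBackslashes(String path) {
        return path.replaceAll("/", "\\\\");
    }

    // Regex \w+ is written as "\\w+" in Java source code.
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    // Replacement text \$ (Java literal "\\$") inserts a dollar sign; $0 is the whole match.
    static String addDollarSign(String s) {
        return s.replaceAll("\\d+", "\\$$0");
    }

    public static void main(String[] args) {
        System.out.println(toForwardSlashes("C:\\temp\\file.txt")); // C:/temp/file.txt
        System.out.println(toBackslashes("C:/temp/file.txt"));      // C:\temp\file.txt
        System.out.println(isWord("abc_123"));                      // true
        System.out.println(addDollarSign("cost: 5"));               // cost: $5
    }
}
```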

Java Demo Application using Regular Expressions

To really get to grips with the java.util.regex package, I recommend that you study the demo application I created. The demo code has lots of comments that clearly indicate what my code does, why I coded it that way, and which other options you have. The demo code also catches all exceptions that may be thrown by the various methods, something I did not explain above.

The demo application covers almost every aspect of the java.util.regex package. You can use it to learn how to use the package, and to quickly test regular expressions while coding.

6. Java Demo Application using Regular Expressions

package regexdemo;

import java.util.regex.*;
import java.awt.*;
import java.awt.event.*;
import javax.swing.*;

/**
 * Regular Expressions Demo
 * Demonstration showing how to use the java.util.regex package that is part of
 * the JDK 1.4 and later
 * Copyright (c) 2003 Jan Goyvaerts.  All rights reserved.
 * Visit http://www.regular-expressions.info for a detailed tutorial
 * to regular expressions.
 * This source code is provided for educational purposes only, without any warranty of any kind.
 * Distribution of this source code and/or the application compiled
 * from this source code is prohibited.
 * Please refer everybody interested in getting a copy of the source code to
 * http://www.regular-expressions.info
 * @author Jan Goyvaerts
 * @version 1.0
 */
public class FrameRegexDemo extends JFrame {

  // Code generated by the JBuilder 9 designer to create the frame depicted below
  // has been omitted for brevity

  /** The easiest way to check if a particular string matches a regular expression
   *  is to simply call String.matches() passing the regular expression to it.
   *  It is not possible to set matching options this way, so the checkboxes
   *  in this demo are ignored when clicking btnMatch.
   *
   *  One disadvantage of this method is that it will only return true if
   *  the regex matches the *entire* string.  In other words, an implicit \A
   *  is prepended to the regex and an implicit \z is appended to it.
   *  So you cannot use matches() to test if a substring anywhere in the string
   *  matches the regex.
   *
   *  Note that when typing in a regular expression into textRegex,
   *  backslashes are interpreted at the regex level.
   *  Typing in \( will match a literal ( and \\ matches a literal backslash.
   *  When passing literal strings in your source code, you need to escape
   *  backslashes in strings as usual.
   *  The string "\\(" matches a literal ( character
   *  and "\\\\" matches a single literal backslash.
   */
  void btnMatch_actionPerformed(ActionEvent e) {
    textReplaceResults.setText("n/a");
    // Calling the Pattern.matches static method is an alternative way
    // if (Pattern.matches(textRegex.getText(), textSubject.getText())) {
    try {
      if (textSubject.getText().matches(textRegex.getText())) {
        textResults.setText("The regex matches the entire subject");
      }
      else {
        textResults.setText("The regex does not match the entire subject");
      }
    } catch (PatternSyntaxException ex) {
      textResults.setText("You have an error in your regular expression:\n" +
                          ex.getDescription());
    }
  }

  /** The easiest way to perform a regex search-and-replace on a string
   *  is to call the string's replaceFirst() and replaceAll() methods.
   *  replaceAll() will replace all substrings that match the regular expression
   *  with the replacement string, while replaceFirst() will only replace
   *  the first match.
   *
   *  Again, you cannot set matching options this way, so the checkboxes
   *  in this demo are ignored when clicking btnReplace.
   *
   *  In the replacement text, you can use $0 to insert the entire regex match,
   *  and $1, $2, $3, etc. for the backreferences (text matched by the part in the
   *  regex between the first, second, third, etc. pair of round brackets).
   *  \$ inserts a single $ character.
   *
   *  $$ or other improper use of the $ sign throws an IllegalArgumentException.
   *  If you reference a group that does not exist (e.g. $4 if there are only
   *  3 groups), replaceAll() throws an IndexOutOfBoundsException.
   *  Be sure to properly handle these exceptions if you allow the end user
   *  to type in the replacement text.
   *
   *  Note that in the memo control, you type \$ to insert a dollar sign,
   *  and \\ to insert a backslash.  If you provide the replacement string as a
   *  string literal in your Java code, you need to use "\\$" and "\\\\".
   *  This is because backslashes need to be escaped in Java string literals too.
   */
  void btnReplace_actionPerformed(ActionEvent e) {
    try {
      textReplaceResults.setText(
        textSubject.getText().replaceAll(
          textRegex.getText(), textReplace.getText())
      );
      textResults.setText("n/a");
    } catch (PatternSyntaxException ex) {
      // textRegex does not contain a valid regular expression
      textResults.setText("You have an error in your regular expression:\n" +
                          ex.getDescription());
      textReplaceResults.setText("n/a");
    } catch (IllegalArgumentException ex) {
      // textReplace contains inappropriate dollar signs
      textResults.setText("You have an error in the replacement text:\n" +
                          ex.getMessage());
      textReplaceResults.setText("n/a");
    } catch (IndexOutOfBoundsException ex) {
      // textReplace contains a backreference that does not exist
      // (e.g. $4 if there are only three groups)
      textResults.setText("Non-existent group in the replacement text:\n" +
                          ex.getMessage());
      textReplaceResults.setText("n/a");
    }
  }

  /** Show the results of splitting a string. */
  void printSplitArray(String[] array) {
    textResults.setText(null);
    for (int i = 0; i < array.length; i++) {
      textResults.append(Integer.toString(i) + ": \"" + array[i] + "\"\r\n");
    }
  }

  /** The easiest way to split a string into an array of strings is to call
   *  the string's split() method.  The string is split at each regex match,
   *  and the regex matches themselves are thrown away.
   *
   *  If the split would result in trailing empty strings (when the regex matches
   *  at the end of the string), the trailing empty strings are also thrown away.
   *  If you want to keep the empty strings, call split(regex, -1).  The -1 tells
   *  the split() method to add trailing empty strings to the resulting array.
   *
   *  You can limit the number of items in the resulting array by specifying a
   *  positive number as the second parameter to split().  The limit you specify
   *  is the number of items the array will at most contain.  The regex is applied
   *  at most limit-1 times, and the last item in the array contains the unsplit
   *  remainder of the original string.  If you are only interested in the first
   *  3 items in the array, specify a limit of 4 and disregard the last item.
   *  This is more efficient than having the string split completely.
   */
  void btnSplit_actionPerformed(ActionEvent e) {
    textReplaceResults.setText("n/a");
    try {
      printSplitArray(textSubject.getText().split(textRegex.getText() /*, Limit*/ ));
    } catch (PatternSyntaxException ex) {
      // textRegex does not contain a valid regular expression
      textResults.setText("You have an error in your regular expression:\n" +
                          ex.getDescription());
    }
  }

  /** Figure out the regex options to be passed to the Pattern.compile()
   *  class factory based on the state of the checkboxes.
   */
  int getRegexOptions() {
    int Options = 0;
    if (checkCanonEquivalence.isSelected()) {
      // In Unicode, certain characters can be encoded in more than one way.
      // Many letters with diacritics can be encoded as a single character
      // identifying the letter with the diacritic, and encoded as two
      // characters: the letter by itself followed by the diacritic by itself
      // Though the internal representation is different, when the string is
      // rendered to the screen, the result is exactly the same.
      Options |= Pattern.CANON_EQ;
    }
    if (checkCaseInsensitive.isSelected()) {
      // Omitting UNICODE_CASE causes only US ASCII characters to be matched
      // case insensitively.  This is appropriate if you know beforehand that
      // the subject string will only contain US ASCII characters
      // as it speeds up the pattern matching.
      Options |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
    }
    if (checkDotAll.isSelected()) {
      // By default, the dot will not match line break characters.
      // Specify this option to make the dot match all characters,
      // including line breaks
      Options |= Pattern.DOTALL;
    }
    if (checkMultiLine.isSelected()) {
      // By default, the caret ^ and dollar $ only match at the start
      // and the end of the string.  Specify this option to make ^ also match
      // after line breaks in the string, and make $ match before line breaks.
      Options |= Pattern.MULTILINE;
    }
    return Options;
  }

  /** Pattern constructed by btnObjects */
  Pattern compiledRegex;
  /** Matcher object that will search the subject string using compiledRegex */
  Matcher regexMatcher;
  JLabel jLabel8 = new JLabel();
  JButton btnAdvancedReplace = new JButton();

  /** If you will be using a particular regular expression often,
   *  you should create a Pattern object to store the regular expression.
   *  You can then reuse the regex as often as you want by reusing the
   *  Pattern object.
   *
   *  To use the regular expression on a string, create a Matcher object
   *  by calling compiledRegex.matcher() passing the subject string to it.
   *  The Matcher will do the actual searching, replacing or splitting.
   *
   *  You can create as many Matcher objects from a single Pattern object
   *  as you want, and use the Matchers at the same time.  To apply the regex
   *  to another subject string, either create a new Matcher using
   *  compiledRegex.matcher() or tell the existing Matcher to work on a new
   *  string by calling regexMatcher.reset(subjectString).
 */
void btnObjects_actionPerformed(ActionEvent e) {
  compiledRegex = null;
  textReplaceResults.setText("n/a");
  try {
    // If you do not want to specify any options (this is the case when
    // all checkboxes in this demo are unchecked), you can omit the
    // second parameter for the Pattern.compile() class factory.
    compiledRegex = Pattern.compile(textRegex.getText(), getRegexOptions());
    // Create the object that will search the subject string
    // using the regular expression.
    regexMatcher = compiledRegex.matcher(textSubject.getText());
    textResults.setText("Pattern and Matcher objects created.");
  } catch (PatternSyntaxException ex) {
    // textRegex does not contain a valid regular expression
    textResults.setText("You have an error in your regular expression:\n"
                        + ex.getDescription());
  } catch (IllegalArgumentException ex) {
    // This exception indicates a bug in getRegexOptions
    textResults.setText("Undefined bit values are set in the regex options");
  }
}

/** Print the results of a search produced by regexMatcher.find()
 * and stored in regexMatcher.
 */
void printMatch() {
  try {
    textResults.setText("Index of the first character in the match: "
                        + Integer.toString(regexMatcher.start()) + "\n");
    textResults.append("Index of the first character after the match: "
                       + Integer.toString(regexMatcher.end()) + "\n");
    textResults.append("Length of the match: "
                       + Integer.toString(regexMatcher.end() - regexMatcher.start()) + "\n");
    textResults.append("Matched text: " + regexMatcher.group() + "\n");
    if (regexMatcher.
groupCount()>0){//Capturingparenthesesarenumbered1.
.
groupCount()//groupnumberzeroistheentireregexmatchfor(inti=1;i**YoualsoneedtousethePatternandMatcherobjectsforthe*search-and-replaceifyouwanttousetheregexoptionssuchas*"caseinsensitive"or"dotall".
 *
 * See the btnReplace notes for the special $-syntax in the replacement text.
 */
void btnObjReplace_actionPerformed(ActionEvent e) {
  if (regexMatcher == null) {
    textResults.setText("Click Create Objects to create the Matcher object");
  } else {
    try {
      textReplaceResults.setText(regexMatcher.replaceAll(textReplace.getText()));
    } catch (IllegalArgumentException ex) {
      // textReplace contains inappropriate dollar signs
      textResults.setText("You have an error in the replacement text:\n" + ex.getMessage());
      textReplaceResults.setText("n/a");
    } catch (IndexOutOfBoundsException ex) {
      // textReplace contains a backreference that does not exist
      // (e.g. $4 if there are only three groups)
      textResults.setText("Non-existent group in the replacement text:\n" + ex.getMessage());
      textReplaceResults.setText("n/a");
    }
  }
}

/** Using Matcher.appendReplacement() and Matcher.appendTail() you can implement
 * a search-and-replace of arbitrary complexity. These routines allow you
 * to compute the replacement string in your own code. So the replacement text
 * can be whatever you want.
 *
 * To do this, simply call Matcher.find() in a loop. For each match returned
 * by find(), call appendReplacement() with whatever replacement text you want.
 * When find() can no longer find matches, call appendTail().
 *
 * appendReplacement() appends the substring between the end of the previous
 * match that was replaced with appendReplacement() and the current match.
 * If this is the first call to appendReplacement() since creating the Matcher
 * or calling reset(), then the appended substring starts at the start of
 * the string. Then, the specified replacement text is appended.
 * If the replacement text contains dollar signs, they will be interpreted
 * as usual. E.g. $1 is replaced with the match between the first pair of
 * capturing parentheses.
 *
 * appendTail() appends the substring between the end of the previous match
 * that was replaced with appendReplacement() and the end of the string.
 * If appendReplacement() was not called since creating the Matcher or
 * calling reset(), the entire subject string is appended.
 *
 * The above means that you should call Matcher.reset() before starting the
 * operation, unless you're sure the Matcher is freshly constructed.
 * If certain matches do not need to be replaced, simply skip calling
 * appendReplacement() for those matches. (Calling appendReplacement() with
 * Matcher.group() as the replacement text will only hurt performance and
 * may get you into trouble with dollar signs that may appear in the match.)
 */
void btnAdvancedReplace_actionPerformed(ActionEvent e) {
  if (regexMatcher == null) {
    textResults.
setText("Click Create Objects to create the Matcher object");
  } else {
    // We will store the replacement text here
    StringBuffer replaceResult = new StringBuffer();
    while (regexMatcher.find()) {
      try {
        // In this example, we simply replace the regex match with the same text
        // in uppercase. Note that appendReplacement parses the replacement
        // text to substitute $1, $2, etc. with the contents of the
        // corresponding capturing parentheses just like replaceAll()
        regexMatcher.appendReplacement(replaceResult, regexMatcher.group().toUpperCase());
      } catch (IllegalStateException ex) {
        // appendReplacement() was called without a prior successful call to find()
        // This exception indicates a bug in your source code
        textResults.setText("appendReplacement() called without a prior "
                            + "successful call to find()");
        textReplaceResults.setText("n/a");
        return;
      } catch (IllegalArgumentException ex) {
        // Replacement text contains inappropriate dollar signs
        textResults.setText("Error in the replacement text:\n" + ex.getMessage());
        textReplaceResults.setText("n/a");
        return;
      } catch (IndexOutOfBoundsException ex) {
        // Replacement text contains a backreference that does not exist
        // (e.g. $4 if there are only three groups)
        textResults.setText("Non-existent group in the replacement text:\n" + ex.getMessage());
        textReplaceResults.setText("n/a");
        return;
      }
    }
    regexMatcher.appendTail(replaceResult);
    textReplaceResults.setText(replaceResult.toString());
    textResults.setText("n/a");
    // After using appendReplacement and appendTail, the Matcher object must be
    // reset so we can use appendReplacement and appendTail again.
    // In practice, you will probably put this call at the start of the routine
    // where you want to use appendReplacement and appendTail.
    // I did not do that here because this way you can click on the Next Match
    // button a couple of times to skip a few matches, and then click on the
    // Advanced Replace button to observe that appendReplacement() will copy the
    // skipped matches unchanged.
    regexMatcher.reset();
  }
}

/** If you want to split many strings using the same regular expression,
 * you should create a Pattern object and call Pattern.split()
 * rather than String.split(). Both methods produce exactly the same results.
 * However, when creating a Pattern object, you can specify options such as
 * "case insensitive" and "dot all".
 *
 * Note that no Matcher object is used.
 */
void btnObjSplit_actionPerformed(ActionEvent e) {
  textReplaceResults.setText("n/a");
  if (compiledRegex == null) {
    textResults.setText("Please click Create Objects to compile the regex");
  } else {
    printSplitArray(compiledRegex.split(textSubject.
getText() /*, Limit*/));
    }
  }
}
// ActionListener classes generated by JBuilder 9 have been omitted for brevity

7. Using Regular Expressions with JavaScript and ECMAScript

JavaScript 1.2 and later has built-in support for regular expressions. MSIE 4 and later, Netscape 4 and later, all versions of Firefox, and most other modern web browsers support JavaScript 1.2. If you use JavaScript to validate user input on a web page at the client side, using JavaScript's regular expression support will greatly reduce the amount of code you need to write.

JavaScript's regular expression flavor is part of the ECMA-262 standard for the language. This means your regular expressions should work exactly the same in all implementations of JavaScript (i.e. in different web browsers).

In JavaScript, a regular expression is written in the form of /pattern/modifiers where "pattern" is the regular expression itself, and "modifiers" are a series of characters indicating various options. The "modifiers" part is optional. This syntax is borrowed from Perl.
JavaScript supports the following modifiers, a subset of those supported by Perl:

/g enables "global" matching. When using the replace() method, specify this modifier to replace all matches, rather than only the first one.
/i makes the regex match case insensitive.
/m enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string.

You can combine multiple modifiers by stringing them together as in /regex/gim. Notably absent is an option to make the dot match line break characters.

Since forward slashes delimit the regular expression, any forward slashes that appear in the regex need to be escaped. E.g. the regex 1/2 is written as /1\/2/ in JavaScript.
JavaScript implements Perl-style regular expressions. However, it lacks quite a number of advanced features available in Perl and other modern regular expression flavors:

No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.
Lookbehind is not supported at all. Lookahead is fully supported.
No atomic grouping or possessive quantifiers.
No Unicode support, except for matching single characters with \uFFFF.
No named capturing groups. Use numbered capturing groups instead.
No mode modifiers to set matching options within the regular expression.
No conditionals.
No regular expression comments. Describe your regular expression with JavaScript // comments instead, outside the regular expression string.

Regexp Methods of The String Class

To test if a particular regex matches (part of) a string, you can call the string's match() method: if (myString.match(/regex/)) { /*Success!*/ }. If you want to verify user input, you should use anchors to make sure that you are testing against the entire string. To test if the user entered a number, use: myString.match(/^\d+$/). /\d+/ matches any string containing one or more digits, but /^\d+$/ matches only strings consisting entirely of digits.

To do a search and replace with regexes, use the string's replace() method: myString.replace(/replaceme/g, "replacement"). Using the /g modifier makes sure that all occurrences of "replaceme" are replaced. The second parameter is a normal string with the replacement text. If the regexp contains capturing parentheses, you can use backreferences in the replacement text. $1 in the replacement text inserts the text matched by the first capturing group, $2 the second, etc. up to $9.

Finally, using a string's split() method allows you to split the string into an array of strings using a regular expression to determine the positions at which the string is split. E.g. myArray = myString.split(/,/) splits a comma-delimited list into an array. The commas themselves are not included in the resulting array of strings.

How to Use The JavaScript RegExp Object

Each JavaScript execution thread (i.e. each browser window or frame) contains one pre-initialized RegExp object. Usually, you will not use this object directly. The easiest way to create a new regexp instance is to simply use the special regex syntax: myregexp = /regex/. If you have the regular expression in a string (e.g. because it was typed in by the user), you can use the RegExp constructor: myregexp = new RegExp(regexstring). Modifiers can be specified as a second parameter: myregexp = new RegExp(regexstring, "gim").

I recommend that you do not use the RegExp constructor with a literal string, because in literal strings, backslashes must be escaped. The regular expression \w+ can be created as re = /\w+/ or as re = new RegExp("\\w+"). The latter is definitely harder to read. The regular expression \\ matches a single backslash. In JavaScript, this becomes re = /\\/ or re = new RegExp("\\\\").

Whichever way you create "myregexp", you can pass it to the String methods explained above instead of a literal regular expression: myString.replace(myregexp, "replacement").

If you want to retrieve the part of the string that was matched, call the exec() function of the RegExp object that you created, e.g.: mymatch = myregexp.exec("subject"). This function returns an array. The zeroth item in the array will hold the text that was matched by the regular expression. The following items contain the text matched by the capturing parentheses in the regexp, if any. mymatch.index indicates the character position in the subject string at which the pattern matched.

Calling the exec() function also changes a number of properties of the RegExp object. Note that even though you can create multiple "myregexp" instances, each JavaScript thread of execution only has one global RegExp object. This means that the property values of all the "myregexp" instances will all be the same, and indicate the result of the very last call to exec(). The lastMatch property holds the text matched by the last call to exec(). lastIndex, a property of the individual regexp instance that is updated when the /g modifier is used, stores the index in the subject string at which the next match attempt will start, i.e. the position of the first character after the match. leftContext stores the part of the subject string to the left of the regexp match, and rightContext the part to the right.

8. JavaScript RegExp Example: Regular Expression Tester

Form fields: Regexp, Subject string, Replacement text, Result.

9. MySQL Regular Expressions with The REGEXP Operator

MySQL's support for regular expressions is rather limited, but still very useful. MySQL only has one operator that allows you to work with regular expressions. This is the REGEXP operator, which works just like the LIKE operator, except that instead of using the _ and % wildcards, it uses a POSIX Extended Regular Expression (ERE). Despite the "extended" in the name of the standard, the POSIX ERE flavor is a fairly basic regex flavor by modern standards, as you can see in the regex flavor comparison in this book. Still, it makes the REGEXP operator far more powerful and flexible than the simple LIKE operator.

One important difference between the LIKE and REGEXP operators is that the LIKE operator only returns True if the pattern matches the whole string. E.g. WHERE testcolumn LIKE 'jg' will return only rows where testcolumn is identical to "jg", except for differences in case perhaps. On the other hand, WHERE testcolumn REGEXP 'jg' will return all rows where testcolumn has "jg" anywhere in the string. Use WHERE testcolumn REGEXP '^jg$' to get only rows where testcolumn is identical to "jg". The equivalent of WHERE testcolumn LIKE 'jg%' would be WHERE testcolumn REGEXP '^jg'. There's no need to put a .* at the end of the regex (the REGEXP equivalent of LIKE's %), since partial matches are accepted.

MySQL does not offer any matching modes. POSIX EREs don't support mode modifiers inside the regular expression, and MySQL's REGEXP operator does not provide a way to specify modes outside the regular expression. The REGEXP operator always applies regular expressions case insensitively, the dot matches all characters including newlines, and the caret and dollar only match at the very start and end of the string. In other words: MySQL treats newline characters like ordinary characters.

Remember that MySQL supports C-style escape sequences in strings. While POSIX ERE does not support tokens like \n to match non-printable characters like line breaks, MySQL does support this escape in its strings. So "WHERE testcolumn REGEXP '\n'" returns all rows where testcolumn contains a line break. MySQL converts the \n in the string into a single line break character before parsing the regular expression. This also means that backslashes need to be escaped. The regex \\ to match a single backslash becomes '\\\\' as a MySQL string, and the regex \$ to match a dollar symbol becomes '\\$' as a MySQL string. All this is unlike other databases like Oracle, which don't support \n and don't require backslashes to be escaped.

To return rows where the column doesn't match the regular expression, use WHERE testcolumn NOT REGEXP 'pattern'. The RLIKE operator is a synonym of the REGEXP operator. WHERE testcolumn RLIKE 'pattern' and WHERE testcolumn NOT RLIKE 'pattern' are identical to WHERE testcolumn REGEXP 'pattern' and WHERE testcolumn NOT REGEXP 'pattern'. I recommend you use REGEXP instead of RLIKE, to avoid confusion with the LIKE operator.

10. Using Regular Expressions with The Microsoft .NET Framework

The Microsoft .NET Framework, which you can use with any .NET programming language such as C# (C sharp) or Visual Basic.NET, has solid support for regular expressions. The documentation of the regular expression classes is very poor, however. Read on to learn how to use regular expressions in your .NET applications.

In the text below, I will use VB.NET syntax to explain the various classes. After the text, you will find a complete application written in C# to illustrate how to use regular expressions in great detail. I recommend that you download the source code, read the source code and play with the application. That will give you a clear idea how to use regexes in your own applications.

As you can see in the regular expression flavor comparison, .NET's regex flavor is very feature-rich. The only noteworthy feature that's lacking is possessive quantifiers. There are no differences in the regex flavor supported by .NET versions 1.x, 2.0 and 3.0, except for one feature added in .NET 2.0: character class subtraction. It works exactly the way it does in XML Schema regular expressions. The XML Schema standard first defined this feature and its syntax.

System.Text.RegularExpressions Overview (Using VB.NET Syntax)

The regex classes are located in the namespace System.Text.RegularExpressions. To make them available, place Imports System.Text.RegularExpressions at the start of your source code.

The Regex class is the one you use to compile a regular expression. For efficiency, regular expressions are compiled into an internal format. If you plan to use the same regular expression repeatedly, construct a Regex object as follows: Dim RegexObj as Regex = New Regex("regular expression"). You can then call RegexObj.IsMatch("subject") to check whether the regular expression matches the subject string. The Regex constructor allows an optional second parameter of type RegexOptions. You could specify RegexOptions.IgnoreCase as the final parameter to make the regex case insensitive. Other options are RegexOptions.Singleline which causes the dot to match newlines and RegexOptions.Multiline which causes the caret and dollar to match at embedded newlines in the subject string.

Call RegexObj.Replace("subject", "replacement") to perform a search-and-replace using the regex on the subject string, replacing all matches with the replacement string. In the replacement string, you can use $& to insert the entire regex match into the replacement text. You can use $1, $2, $3, etc. to insert the text matched between capturing parentheses into the replacement text. Use $$ to insert a single dollar sign into the replacement text. To replace with the first backreference immediately followed by the digit 9, use ${1}9. If you type $19, and there are fewer than 19 backreferences, the $19 will be interpreted as literal text, and appear in the result string as such. To insert the text from a named capturing group, use ${name}. Improper use of the $ sign may produce an undesirable result string, but will never cause an exception to be raised.

RegexObj.Split("Subject") splits the subject string along regex matches, returning an array of strings. The array contains the text between the regex matches. If the regex contains capturing parentheses, the text matched by them is also included in the array. If you want the entire regex matches to be included in the array, simply place round brackets around the entire regular expression when instantiating RegexObj.

The Regex class also contains several static methods that allow you to use regular expressions without instantiating a Regex object. This reduces the amount of code you have to write, and is appropriate if the same regular expression is used only once or reused seldom. Note that member overloading is used a lot in the Regex class. All the static methods have the same names (but different parameter lists) as other non-static methods. Regex.IsMatch("subject", "regex") checks if the regular expression matches the subject string. Regex.Replace("subject", "regex", "replacement") performs a search-and-replace. Regex.Split("subject", "regex") splits the subject string into an array of strings as described above. All these methods accept an optional additional parameter of type RegexOptions, like the constructor.

The System.Text.RegularExpressions.Match Class

If you want more information about the regex match, call Regex.Match() to construct a Match object. If you instantiated a Regex object, use Dim MatchObj as Match = RegexObj.Match("subject"). If not, use the static version: Dim MatchObj as Match = Regex.Match("subject", "regex").

Either way, you will get an object of class Match that holds the details about the first regex match in the subject string. MatchObj.Success indicates if there actually was a match. If so, use MatchObj.Value to get the contents of the match, MatchObj.Length for the length of the match, and MatchObj.Index for the start of the match in the subject string. The start of the match is zero-based, so it effectively counts the number of characters in the subject string to the left of the match.

If the regular expression contains capturing parentheses, use the MatchObj.Groups collection. MatchObj.Groups.Count indicates the number of capturing groups plus one, because the count includes the zeroth group, which is the entire regex match. MatchObj.Groups(3).Value gets the text matched by the third pair of round brackets. MatchObj.Groups(3).Length and MatchObj.Groups(3).Index get the length of the text matched by the group and its index in the subject string, relative to the start of the subject string. MatchObj.Groups("name") gets the details of the named group "name".

To find the next match of the regular expression in the same subject string, call MatchObj.NextMatch() which returns a new Match object containing the results for the second match attempt. You can continue assigning MatchObj = MatchObj.NextMatch() until MatchObj.Success is False.

Note that after calling RegexObj.Match(), the resulting Match object is independent from RegexObj. This means you can work with several Match objects created by the same Regex object simultaneously.

Regular Expressions, Literal Strings and Backslashes

In literal C# strings, as well as in C++ and many other .NET languages, the backslash is an escape character. The literal string "\\" is a single backslash. In regular expressions, the backslash is also an escape character. The regular expression \\ matches a single backslash. This regular expression, as a C# string, becomes "\\\\". That's right: 4 backslashes to match a single one.

The regex \w matches a word character. As a C# string, this is written as "\\w".

To make your code more readable, you should use C# verbatim strings. In a verbatim string, a backslash is an ordinary character. This allows you to write the regular expression in your C# code as you would write it in a tool like RegexBuddy or PowerGREP, or as the user would type it into your application. The regex to match a backslash is written as @"\\" when using C# verbatim strings. The backslash is still an escape character in the regular expression, so you still need to double it. But doubling is better than quadrupling. To match a word character, use the verbatim string @"\w".

.NET Framework Demo Application using Regular Expressions (C# Syntax)

To really get to grips with the regex support of the Microsoft .NET Framework, I recommend that you study the demo application I created. It is written in C#. The demo is fairly simple, so you should understand the source code even if you do not use C# yourself. The demo code has lots of comments that clearly indicate what my code does, why I coded it that way, and which other options you have. The demo code also catches all exceptions that may be thrown by the various methods, something I did not explain above.

The demo application covers every aspect of the System.Text.RegularExpressions package. You can use it to learn how to use the package, and to quickly test regular expressions while coding.

11. C# Demo Application

using System;
using System.Drawing;
using System.Collections;
using System.ComponentModel;
using System.Windows.Forms;
using System.Data;
// This line allows us to use classes like Regex and Match
// without having to spell out the entire location.
using System.Text.RegularExpressions;

namespace RegexDemo {
  /// Application showing the use of regular expressions in the .NET framework
  /// Copyright (c) 2003 Jan Goyvaerts. All rights reserved.
  /// Visit http://www.regular-expressions.info for a detailed tutorial on regular expressions.
  ///
  /// This source code is provided for educational purposes only, without
  /// any warranty of any kind. Distribution of this source code and/or the
  /// application compiled from this source code is prohibited. Please refer
  /// everybody interested in getting a copy of the source code to
  /// http://www.regular-expressions.info where it can be downloaded.
  public class FormRegex : System.Windows.Forms.Form {
    // Designer-generated code to create the form has been omitted for brevity

    private void checkDotAll_Click(object sender, System.EventArgs e) {
      // "Dot all" and "ECMAScript" are mutually exclusive options.
      if (checkDotAll.Checked) checkECMAScript.Checked = false;
    }

    private void checkECMAScript_Click(object sender, System.EventArgs e) {
      // "Dot all" and "ECMAScript" are mutually exclusive options.
      if (checkECMAScript.Checked) checkDotAll.Checked = false;
    }

    private RegexOptions getRegexOptions() {
      // "Dot all" and "ECMAScript" are mutually exclusive options.
      // If we include them both, then the Regex() constructor or the
      // Regex.Match() method will raise an exception
      System.Diagnostics.Trace.Assert(
        !(checkDotAll.Checked && checkECMAScript.Checked),
        "DotAll and ECMAScript options are mutually exclusive");
      // Construct a RegexOptions object
      // If the options are predetermined, you can simply pass something like
      // RegexOptions.Multiline | RegexOptions.IgnoreCase
      // directly to the Regex() constructor or the Regex.Match() method
      RegexOptions options = new RegexOptions();
      // If true, the dot matches any character, including a newline
      // If false, the dot matches any character, except a newline
      if (checkDotAll.Checked) options |= RegexOptions.Singleline;
      // If true, the caret ^ matches after a newline, and the dollar $ matches
      // before a newline, as well as at the start and end of the subject string
      // If false, the caret only matches at the start of the string
      // and the dollar only at the end of the string
      if (checkMultiLine.Checked) options |= RegexOptions.Multiline;
      // If true, the regex is matched case insensitively
      if (checkIgnoreCase.Checked) options |= RegexOptions.IgnoreCase;
      // If true, \w, \d and \s match ASCII characters only,
      // and \10 is backreference 1 followed by a literal 0
      // rather than octal escape 10.
      if (checkECMAScript.Checked) options |= RegexOptions.ECMAScript;
      return options;
    }

    private void btnMatch_Click(object sender, System.
EventArgs e) {
      // This method illustrates the easiest way to test if a string can be
      // matched by a regex using the System.Text.RegularExpressions.Regex.IsMatch
      // static method. This way is recommended when you only want to validate
      // a single string every now and then.
      // Note that IsMatch() will also return True if the regex matches part of
      // the string only. If you only want it to return True if the regex matches
      // the entire string, simply prepend a caret and append a dollar sign
      // to the regex to anchor it at the start and end.
      // Note that when typing in a regular expression into textRegex,
      // backslashes are interpreted at the regex level.
      // So typing in \( will match a literal ( character and \\ matches a
      // literal backslash. When passing literal strings in your source code,
      // you need to escape backslashes in strings as usual.
      // So the string "\\(" matches a literal ( and "\\\\" matches a single
      // literal backslash.
      // To reduce confusion, I suggest you use verbatim strings instead:
      // @"\(" matches a literal ( and @"\\" matches a literal backslash.
      // You can omit the last parameter with the regex options
      // if you don't want to specify any.
      textReplaceResults.Text = "N/A";
      try {
        if (Regex.IsMatch(textSubject.Text, textRegex.Text, getRegexOptions())) {
          textResults.Text = "The regex matches part or all of the subject";
        } else {
          textResults.Text = "The regex cannot be matched in the subject";
        }
      } catch (Exception ex) {
        // Most likely cause is a syntax error in the regular expression
        textResults.Text = "Regex.IsMatch() threw an exception:\r\n" + ex.Message;
      }
    }

    private void btnGetMatch_Click(object sender, System.
EventArgs e) {
      // Illustrates the easiest way to get the text of the first match
      // using the System.Text.RegularExpressions.Regex.Match static method.
      // Useful for easily extracting a string from another string.
      // You can omit the last parameter with the regex options
      // if you don't want to specify any.
      // If there's no match, Regex.Match.Value returns an empty string.
      // If you are only interested in part of the regex match, you can use
      // .Groups[3].Value instead of .Value to get the text matched between
      // the third pair of round brackets in the regular expression
      textReplaceResults.Text = "N/A";
      try {
        textResults.Text = Regex.Match(textSubject.Text, textRegex.Text, getRegexOptions()).Value;
      } catch (Exception ex) {
        // Most likely cause is a syntax error in the regular expression
        textResults.Text = "Regex.Match() threw an exception:\r\n" + ex.Message;
      }
    }

    private void btnReplace_Click(object sender, System.
EventArgs e) {
      // Illustrates the easiest way to do a regex-based search-and-replace on
      // a single string using the System.Text.RegularExpressions.Regex.Replace
      // static method. This method will replace ALL matches of the regex in
      // the subject with the replacement text.
      // If there are no matches, Replace() returns the subject string unchanged.
      // If you only want to replace certain matches, you have to use the method
      // illustrated in btnRegexObjReplace_click.
      // You can omit the last parameter with the regex options
      // if you don't want to specify any.
      // In the replacement text (textReplace.Text), you can use $& to insert
      // the entire regex match, and $1, $2, $3, etc. for the backreferences
      // (text matched by the part in the regex between the first, second,
      // third, etc. pair of round brackets)
      // $$ inserts a single $ character
      // $` (dollar backtick) inserts the text in the subject
      // to the left of the regex match
      // $' (dollar single quote) inserts the text in the subject
      // to the right of the end of the regex match
      // $_ inserts the entire subject text
      try {
        textReplaceResults.Text = Regex.Replace(textSubject.Text, textRegex.Text,
                                                textReplace.Text, getRegexOptions());
        textResults.Text = "N/A";
      } catch (Exception ex) {
        // Most likely cause is a syntax error in the regular expression
        textResults.Text = "Regex.Replace() threw an exception:\r\n" + ex.Message;
        textReplaceResults.Text = "N/A";
      }
    }

    private void printSplitArray(string[] array) {
      textResults.
Text="";for(inti=0;i1){//matchObj.
Groups[0]holdstheentireregexmatchalsoheldby//matchObjitself.
TheotherGroupobjectsholdthematchesfor//capturingparenthesesintheregexfor(inti=1;i3".
\33 is interpreted as the 33rd group, and raises an error if there are fewer groups. If you used named capturing groups, you can use them in the replacement text with r"\g<name>".

The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression. Therefore, you should use raw strings for the replacement text, as I did in the examples above. The re.sub() function will also interpret \n and \t in raw strings. If you want "c:\temp" as the replacement text, either use r"c:\\temp" or "c:\\\\temp". The 3rd backreference is r"\3" or "\\3".
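
The backslash rules for replacement text are easier to see in a short session. The following is only a sketch: the subject strings, patterns and variable names are made up for illustration, and the \g<1> form is shown as one way to keep a group number apart from a digit that follows it.

```python
import re

# Hypothetical sample data, chosen only for illustration.
subject = "2007-07-15"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \1, \2 and \3 insert the text matched by the numbered groups.
swapped = re.sub(pattern, r"\3/\2/\1", subject)

# \g<1> separates the group number from the literal digit 0 that follows.
with_digit = re.sub(r"(\d{4})", r"\g<1>0", "2007")

# Named groups are referenced with \g<name>.
named = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
               r"\g<month>/\g<year>", "2007-07")

# A literal backslash must be doubled in the replacement text,
# even when the replacement is written as a raw string.
path = re.sub(r"DIR", r"c:\\temp", "DIR")

print(swapped, with_digit, named, path)
```

Writing the replacement as a raw string keeps the number of backslashes you type equal to the number that re.sub() actually sees.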

Splitting Strings

re.split(regex, subject) returns a list of strings. The list contains the parts of subject between all the regex matches in the subject. Adjacent regex matches will cause empty strings to appear in the list. The regex matches themselves are not included in the list. If the regex contains capturing groups, then the text matched by the capturing groups is included in the list. The capturing groups are inserted between the substrings that appeared to the left and right of the regex match. If you don't want the capturing groups in the list, convert them into non-capturing groups. The re.split() function does not offer an option to suppress capturing groups.

You can specify an optional third parameter to limit the number of times the subject string is split. Note that this limit controls the number of splits, not the number of strings that will end up in the list. The unsplit remainder of the subject is added as the final string to the list. If there are no capturing groups, the list will contain at most limit+1 items.
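
As a rough illustration of the splitting rules described above, the sketch below uses made-up subject strings. The limit is passed by its keyword name, maxsplit, which is how Python's re module names the third parameter.

```python
import re

# Split on commas with optional surrounding whitespace.
# The regex matches themselves are dropped from the result.
parts = re.split(r"\s*,\s*", "one, two,three ,four")

# Adjacent regex matches produce an empty string in the result.
adjacent = re.split(r",", "a,,b")

# A capturing group causes the text it matched to be included as well.
with_delims = re.split(r"\s*(,)\s*", "a , b,c")

# maxsplit limits the number of splits, not the number of items:
# two splits yield three items, the last one being the unsplit remainder.
limited = re.split(r",", "a,b,c,d", maxsplit=2)

print(parts, adjacent, with_delims, limited)
```

Turning (,) into the non-capturing group (?:,) in the third call would give the same result as a plain split without the delimiters.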

Match Details

re.search() and re.match() return a Match object, while re.finditer() generates an iterator to iterate over Match objects. This object holds lots of useful information about the regex match. I will use m to signify a Match object in the discussion below.

m.group() returns the part of the string matched by the entire regular expression. m.start() returns the offset in the string of the start of the match. m.end() returns the offset of the character beyond the match. m.span() returns a 2-tuple of m.start() and m.end(). You can use m.start() and m.end() to slice the subject string: subject[m.start():m.end()].

If you want the results of a capturing group rather than the overall regex match, specify the name or number of the group as a parameter. m.group(3) returns the text matched by the third capturing group. m.group('groupname') returns the text matched by a named group 'groupname'. If the group did not participate in the overall match, m.group() returns None, while m.start() and m.end() return -1.

If you want to do a regular expression based search-and-replace without using re.sub(), call m.expand(replacement) to compute the replacement text. The function returns the replacement string with backreferences etc. substituted.
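
The Match object methods discussed above can be tried out as follows. This is a minimal sketch: the subject string, the regex and the group name "word" are hypothetical, and the optional third group is there only to show what an unmatched group reports.

```python
import re

# Hypothetical subject and pattern; group 3 is optional and will not match.
subject = "Order 66 shipped"
m = re.search(r"(?P<word>[A-Za-z]+) (\d+)(x)?", subject)

whole = m.group()                      # text matched by the entire regex
span = m.span()                        # (m.start(), m.end())
sliced = subject[m.start():m.end()]    # same text, obtained by slicing
number = m.group(2)                    # second capturing group
word = m.group("word")                 # named capturing group
missing = m.group(3)                   # group that did not participate
missing_start = m.start(3)             # offset reported for that group

# expand() computes replacement text without calling re.sub().
expanded = m.expand(r"\2 \g<word>")

print(whole, span, number, word, missing, missing_start, expanded)
```

Checking a group against None before using it avoids surprises when part of the pattern is optional.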

Regular Expression Objects

If you want to use the same regular expression more than once, you should compile it into a regular expression object. Regular expression objects are more efficient, and make your code more readable. To create one, just call re.compile(regex) or re.compile(regex, flags). The flags are the matching options described above for the re.search() and re.match() functions.

The regular expression object returned by re.compile() provides all the functions that the re module also provides directly: search(), match(), findall(), finditer(), sub() and split(). The difference is that they use the pattern stored in the regex object, and do not take the regex as the first parameter. re.compile(regex).search(subject) is equivalent to re.search(regex, subject).
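
To make the equivalence concrete, here is a small sketch of a compiled regular expression object being reused. The pattern, the subject strings and the name word_re are assumptions made for this example only.

```python
import re

# Compile once, with a flag, then reuse the object for several operations.
word_re = re.compile(r"\b[a-z]+\b", re.IGNORECASE)

first = word_re.search("Hello World").group()     # first match only
all_words = word_re.findall("Hello big World")    # all matches as strings
replaced = word_re.sub("X", "one two")            # search-and-replace

# split() is available on compiled objects as well.
pieces = re.compile(r"\s*;\s*").split("a ; b;c")

# The module-level function with the same regex and flags
# finds the same match as the compiled object.
same = re.search(r"\b[a-z]+\b", "Hello World", re.IGNORECASE).group()

print(first, all_words, replaced, pieces, same)
```

Note that the flags are given once, when compiling, rather than on every call.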

20. How to Use Regular Expressions in REALbasic

REALbasic includes a built-in RegEx class. Internally, this class is based on the open source PCRE library. What this means to you as a REALbasic developer is that the RegEx class provides you with a rich flavor of Perl-compatible regular expressions. The regular expression tutorial in this book does not explicitly mention REALbasic. Everything said in the tutorial about PCRE's regex flavor also applies to REALbasic. The only exceptions are the case insensitive and "multi-line" matching modes. In PCRE, they're off by default, while in REALbasic they're on by default.

REALbasic uses the UTF-8 version of PCRE. This means that if you want to process non-ASCII data that you've retrieved from a file or the network, you'll need to use REALbasic's TextConverter class to convert your strings into UTF-8 before passing them to the RegEx object. You'll also need to use the TextConverter to convert the strings returned by the RegEx class from UTF-8 back into the encoding your application is working with.
The RegEx Class
To use a regular expression, you need to create a new instance of the RegEx class. Assign your regular expression to the SearchPattern property. You can set various options in the Options property, which is an instance of the RegExOptions class.
To check if a regular expression matches a particular string, call the Search method of the RegEx object, and pass the subject string as a parameter. This method returns an instance of the RegExMatch class if a match is found, or Nil if no match is found. To find the second match in the same subject string, call the Search method again, without any parameters. Do not pass the subject string again, since doing so restarts the search from the beginning of the string. Keep calling Search without any parameters until it returns Nil to iterate over all regular expression matches in the string.
The RegExMatch Class
When the RegEx.Search method finds a match, it stores the match's details in a RegExMatch object. This object has three properties. The SubExpressionCount property returns the number of capturing groups in the regular expression plus one. E.g. it returns 3 for the regex (1)(2). The SubExpressionString property returns the substring matched by the regular expression or a capturing group. SubExpressionString(0) returns the whole regex match, while SubExpressionString(1) through SubExpressionString(SubExpressionCount-1) return the matches of the capturing groups. SubExpressionStartB returns the byte offset of the start of the match of the whole regex or one of the capturing groups, depending on the numeric index you pass as a parameter to the property.
The RegExOptions Class
The RegExOptions class has nine properties to set various options for your regular expression.
Set CaseSensitive (False by default) to True to treat uppercase and lowercase letters as different characters. This option is the inverse of "case insensitive mode" or /i in other programming languages.
Set DotMatchAll (False by default) to True to make the dot match all characters, including line break characters. This option is the equivalent of "single line mode" or /s in other programming languages.
Set Greedy (True by default) to False if you want quantifiers to be lazy, effectively making .* the same as .*?. I strongly recommend against setting Greedy to False. Simply use the .*? syntax instead. This way, somebody reading your source code will clearly see when you're using greedy quantifiers and when you're using lazy quantifiers when they look only at the regular expression.
The LineEndType option is the only one that takes an Integer instead of a Boolean. This option affects which character the caret and dollar treat as the "end of line" character. The default is 0, which accepts both \r and \n as end-of-line characters. Set it to 1 to auto-detect the host platform, and use \n when your application runs on Windows and Linux, and \r when it runs on a Mac. Set it to 2 for Mac (\r), 3 for Windows (\n) and 4 for UNIX (\n). I recommend you leave this option as zero, which is most likely to give you the results you intended. This option is actually a modification to the PCRE library made in REALbasic. PCRE supports only option 4, which often confuses Windows developers since it causes test$ to fail against "test\r\n" as Windows uses \r\n for line breaks.
Set MatchEmpty (True by default) to False if you want to skip zero-length matches.
Set ReplaceAllMatches (False by default) to True if you want the RegEx.Replace method to search-and-replace all regex matches in the subject string rather than just the first one.
Set StringBeginIsLineBegin (True by default) to False if you don't want the start of the string to be considered the start of the line. This can be useful if you're processing a large chunk of data as several separate strings, where only the first string should be considered as starting the (conceptual) overall string. Similarly, set StringEndIsLineEnd (True by default) to False if the string you're passing to the Search method isn't really the end of the whole chunk of data you're processing.
Set TreatTargetAsOneLine (False by default) to True to make the caret and dollar match at the start and the end of the string only. By default, they will also match after and before embedded line breaks. This option is the inverse of the "multi-line mode" or /m in other programming languages.
REALbasic RegEx Source Code Example
  ' Prepare a regular expression object
  Dim myRegEx As RegEx
  Dim myMatch As RegExMatch
  myRegEx = New RegEx
  myRegEx.Options.TreatTargetAsOneLine = True
  myRegEx.SearchPattern = "regex"
  ' Pop up all matches one by one
  myMatch = myRegEx.Search(SubjectString)
  While myMatch <> Nil
    MsgBox(myMatch.SubExpressionString(0))
    myMatch = myRegEx.Search()
  Wend
Searching and Replacing
In addition to finding regex matches in a string, you can replace the matches with another string.
To do so, set the ReplacementPattern property of your RegEx object, and then call the Replace method. Pass the source string as a parameter to the Replace method. The method will return a copy of the string with the replacement(s) applied. The RegEx.Options.ReplaceAllMatches property determines if only the first regex match or if all regex matches will be replaced.
In the ReplacementPattern string, you can use $&, $0 or \0 to insert the whole regular expression match into the replacement. Use $1 or \1 for the match of the first capturing group, $2 or \2 for the second, etc.
If you want more control over how the replacements are made, you can iterate over the regex matches like in the code snippet above, and call the RegExMatch.Replace method for each match. This method is a bit of a misnomer, since it doesn't actually replace anything. Rather, it returns the RegEx.ReplacementPattern string with all references to the match and capturing groups substituted. You can use this result to make the replacements on your own. This method is also useful if you want to collect a combination of capturing groups for each regex match.
21. RegexBuddy: Your Perfect Companion for Working with Regular Expressions
Regular expressions remain a complex beast, even with a detailed regular expression tutorial at your disposal. RegexBuddy is a specialized tool that makes working with regular expressions much easier. RegexBuddy lays out any regular expression in an easy-to-grasp tree of regex building blocks. RegexBuddy updates the tree as you edit the regular expression. It is much easier still to work with the regex tree directly. Delete and move regex building blocks, and add new ones by selecting from clear descriptions. You can get a good overview of complex regular expressions by collapsing grouping and alternation blocks in the tree.
Interactive Regex Tester and Debugger
Even though RegexBuddy's regex tree makes it very clear how a regular expression works, the only way to be 100% sure whether a particular regex pattern does what you want is to test it. RegexBuddy provides a safe environment where you can interactively test and debug your regular expressions on sample text and files.
RegexBuddy can highlight regex matches and capturing groups. The highlighting is automatically updated as you edit the regex, so you can instantly see the effects of your changes. For detailed tests, RegexBuddy provides complete details about matches and capturing groups. You can easily test regex search-and-replace and split actions.
The key advantages of testing regular expressions with RegexBuddy are safety and speed. RegexBuddy cannot modify valuable files and actual data. You only see what the effect would be. Opening a sample file or copying and pasting sample data to test a regular expression is much quicker than transferring the regex to the tool or source code you want to use it with, and creating your own test environment.
Quickly Develop Efficient Software
Many popular programming languages support regular expressions. If you are a programmer, using regular expressions enables you to do in a single line or a handful of lines of code what would otherwise require dozens or hundreds. When you use RegexBuddy, testing a single regular expression is far easier than debugging handwritten code that does the same.
If others need to maintain your code later, they will benefit from RegexBuddy's regex analysis to quickly understand your code. You can insert RegexBuddy's regex tree as a comment in your source code.
RegexBuddy makes developing software with regexes even easier by providing you with auto-generated code snippets. Instead of remembering the correct classes and function calls, and how to represent a regex in source code, just tell RegexBuddy which language you are using and what you want to do. Copy and paste your custom-generated code snippet into your code editor, and run.
Using regular expressions not only saves you time. Unless you spend a lot of time hand-optimizing your own text searching and processing code, using regular expressions will speed up your software. This is certainly true if your language has a built-in regex engine that works at a lower level than your own code can.
Collect and Save Regular Expressions
Use RegexBuddy to collect your own library of handy regular expressions. You can save a regex with only one click. If you type in a brief description with each regex you store, RegexBuddy's regex lookup enables you to quickly find a previously saved regex that does what you want. RegexBuddy also comes with a standard library of common regular expressions that you can use in a wide variety of situations.
Find out More and Get Your Own Copy of RegexBuddy
RegexBuddy works under Windows 98, ME, NT4, 2000, XP and Vista, as well as most versions of Linux for Intel Pentium and AMD Athlon PCs. For more information on RegexBuddy, please visit www.regexbuddy.com.
You will quickly earn the money you pay for RegexBuddy back many times over in the time and frustration you will save. RegexBuddy makes working with regular expressions much easier, quicker and more efficient.
22. Using Regular Expressions with Ruby
Ruby supports regular expressions as a language feature. In Ruby, a regular expression is written in the form of /pattern/modifiers where "pattern" is the regular expression itself, and "modifiers" are a series of characters indicating various options. The "modifiers" part is optional. This syntax is borrowed from Perl. Ruby supports the following modifiers:
/i makes the regex match case insensitive.
/m makes the dot match newlines. Ruby indeed uses /m, whereas Perl and many other programming languages use /s for "dot matches newlines".
/x tells Ruby to ignore whitespace between regex tokens.
/o causes any #{...} substitutions in a particular regex literal to be performed just once, the first time it is evaluated. Otherwise, the substitutions will be performed every time the literal generates a Regexp object.
You can combine multiple modifiers by stringing them together as in /regex/im.
In Ruby, the caret and dollar always match before and after newlines. Ruby does not have a modifier to change this. Use \A and \Z to match at the start or the end of the string.
Since forward slashes delimit the regular expression, any forward slashes that appear in the regex need to be escaped. E.g. the regex 1/2 is written as /1\/2/ in Ruby.
How To Use The Regexp Object
/regex/ creates a new object of the class Regexp. You can assign it to a variable to repeatedly use the same regular expression, or use the literal regex directly. To test if a particular regex matches (part of) a string, you can either use the =~ operator or call the regexp object's match() method, e.g.: print "success" if subject =~ /regex/ or print "success" if /regex/.match(subject).
The =~ operator returns the character position in the string of the start of the match (which evaluates to true in a boolean test), or nil if no match was found (which evaluates to false). The match() method returns a MatchData object (which also evaluates to true), or nil if no match was found. In a string context, the MatchData object evaluates to the text that was matched. So print(/\w+/.match("test")) prints "test", while print(/\w+/ =~ "test") prints "0". The first character in the string has index zero. Switching the order of the =~ operator's operands makes no difference.
Search And Replace
Use the sub() and gsub() methods of the String class to search-and-replace the first regex match, or all regex matches, respectively, in the string. Specify the regular expression you want to search for as the first parameter, and the replacement string as the second parameter, e.g.: result = subject.gsub(/before/, "after").
To re-insert the regex match, use \0 in the replacement string. You can use the contents of capturing groups in the replacement string with backreferences \1, \2, \3, etc. Note that numbers escaped with a backslash are treated as octal escapes in double-quoted strings. Octal escapes are processed at the language level, before the sub() function sees the parameter. To prevent this, you need to escape the backslashes in double-quoted strings. So to use the first backreference as the replacement string, either pass '\1' or "\\1". '\\1' also works.
Splitting Strings and Collecting Matches
To collect all regex matches in a string into an array, pass the regexp object to the string's scan() method, e.g.: myarray = mystring.scan(/regex/). Sometimes, it is easier to create a regex to match the delimiters rather than the text you are interested in. In that case, use the split() method instead, e.g.: myarray = mystring.split(/delimiter/). The split() method discards all regex matches, returning the text between the matches. The scan() method does the opposite.
If your regular expression contains capturing groups, scan() returns an array of arrays. Each element in the overall array will contain an array consisting of the text matched by each of the capturing groups. The overall regex match is not included, so wrap the whole regex in an extra capturing group if you need it.
23. Tcl Has Three Regular Expression Flavors
Tcl 8.2 and later support three regular expression flavors. The Tcl man pages dub them Basic Regular Expressions (BRE), Extended Regular Expressions (ERE) and Advanced Regular Expressions (ARE). BRE and ERE are mainly for backward compatibility with previous versions of Tcl. These flavors implement the two flavors defined in the POSIX standard. AREs are new in Tcl 8.2. They're the default and recommended flavor. This flavor implements the POSIX ERE flavor, with a whole bunch of added features. Most of these features are inspired by similar features in Perl regular expressions.
Tcl's regular expression support is based on a library developed for Tcl by Henry Spencer. This library has since been used in a number of other programming languages and applications, such as the PostgreSQL database and the wxWidgets GUI library for C++. Everything said about Tcl in this regular expression tutorial applies to any tool that uses Henry Spencer's Advanced Regular Expressions.
There are a number of important differences between Tcl Advanced Regular Expressions and Perl-style regular expressions. Tcl uses \m, \M, \y and \Y for word boundaries. Perl and most other modern regex flavors use \b and \B. In Tcl, these last two match a backspace and a backslash, respectively.
Tcl also takes a completely different approach to mode modifiers. The (?letters) syntax is the same, but the available mode letters and their meanings are quite different. Instead of adding mode modifiers to the regular expression, you can pass more descriptive switches like -nocase to the regexp and regsub commands for some of the modes. Mode modifier spans in the style of (?modes:regex) are not supported. Mode modifiers must appear at the start of the regex. They affect the whole regex. Mode modifiers in the regex override command switches.
Tcl supports these modes:
(?i) or -nocase makes the regex match case insensitive.
(?c) makes the regex match case sensitive. This mode is the default.
(?x) or -expanded activates the free-spacing regexp syntax.
(?t) disables the free-spacing regexp syntax. This mode is the default. The "t" stands for "tight", the opposite of "expanded".
(?b) tells Tcl to interpret the remainder of the regular expression as a Basic Regular Expression.
(?e) tells Tcl to interpret the remainder of the regular expression as an Extended Regular Expression.
(?q) tells Tcl to interpret the remainder of the regular expression as plain text. The "q" stands for "quoted".
(?s) selects "non-newline-sensitive matching", which is the default. The "s" stands for "single line". In this mode, the dot and negated character classes will match all characters, including newlines. The caret and dollar will match only at the very start and end of the subject string.
(?p) or -linestop enables "partial newline-sensitive matching". In this mode, the dot and negated character classes will not match newlines. The caret and dollar will match only at the very start and end of the subject string.
(?w) or -lineanchor enables "inverse partial newline-sensitive matching". The "w" stands for "weird". (Don't look at me! I didn't come up with this.) In this mode, the dot and negated character classes will match all characters, including newlines. The caret and dollar will match after and before newlines.
(?n) or -line enables what Tcl calls "newline-sensitive matching". The dot and negated character classes will not match newlines. The caret and dollar will match after and before newlines. Specifying (?n) or -line is the same as specifying (?pw) or -linestop -lineanchor.
(?m) is a historical synonym for (?n). I recommend you never use it, to avoid confusion with Perl's (?m).
If you use regular expressions with Tcl and other programming languages, be careful when dealing with the newline-related matching modes. Tcl's designers found Perl's /m and /s modes confusing. They are confusing, but at least Perl has only two, and they both affect only one thing. In Perl, /m or (?m) enables "multi-line mode", which makes the caret and dollar match after and before newlines. By default, they match at the very start and end of the string only. In Perl, /s or (?s) enables "single line mode". This mode makes the dot match all characters, including line breaks. By default, it doesn't match line breaks. Perl does not have a mode modifier to exclude line breaks from negated character classes. In Perl, [^a] matches anything except "a", including newlines. The only way to exclude newlines is to write [^a\n]. Perl's default matching mode is like Tcl's (?p), except for the difference in negated character classes.
Why compare Tcl with Perl? .NET, Java, PCRE and Python support the same (?m) and (?s) modifiers with the exact same defaults and effects as in Perl. JavaScript lacks /s and Ruby lacks /m, but at least they don't introduce completely different options. Negated character classes work the same in all these languages and libraries.
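Since Python is one of the flavors that follows Perl here, the Perl-style behavior described above can be observed with a short sketch like the following. The two-line subject string is made up for the example. It illustrates only the Perl-style modes, not Tcl's.

```python
import re

subject = "first\nsecond"   # hypothetical two-line subject

# Anchors: by default ^ and $ do not match at the embedded newline; (?m) changes that.
anchors_default = re.findall(r"^\w+$", subject)
anchors_multiline = re.findall(r"(?m)^\w+$", subject)

# Dot: by default it does not match a newline; (?s) makes it match any character.
dot_default = re.search(r"first.second", subject)
dot_single_line = re.search(r"(?s)first.second", subject)

# Negated character classes match newlines regardless of mode,
# unless the newline is excluded explicitly.
negated = re.search(r"first[^a]second", subject)
negated_no_newline = re.search(r"first[^a\n]second", subject)

print(anchors_default, anchors_multiline)
```

When translating a Tcl regex that relies on (?p) or (?n), each negated character class therefore needs an explicit \n added to keep the same behavior in a Perl-style flavor.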
It's unfortunate that Tcl didn't follow Perl's standard, since Tcl's four options are just as confusing as Perl's two options. Together they make a very nice alphabet soup.
If you ignore the fact that Tcl's options affect negated character classes, you can use the following table to translate between Tcl's newline modes and Perl-style newline modes. Note that the defaults are different. If you don't use any switches, (?s). and . are equivalent in Tcl, but not in Perl.
  Tcl             Perl       Anchors                                   Dot
  (?s) (default)  (?s)       Start and end of string only              Any character
  (?p)            (default)  Start and end of string only              Any character except newlines
  (?w)            (?sm)      Start and end of string, and at newlines  Any character
  (?n)            (?m)       Start and end of string, and at newlines  Any character except newlines
Regular Expressions as Tcl Words
You can insert regular expressions in your Tcl source code either by enclosing them with double quotes (e.g. "my regexp") or by enclosing them with curly braces (e.g. {my regexp}). Since the braces don't do any substitution like the quotes, they're by far the best choice for regular expressions.
The only thing you need to worry about is that unescaped braces in the regular expression must be balanced. Escaped braces don't need to be balanced, but the backslash used to escape the brace remains part of the regular expression. You can easily satisfy these requirements by escaping all braces in your regular expression, except those used as a quantifier. This way your regex will work as expected, and you don't need to change it at all when pasting it into your Tcl source code, other than putting a pair of braces around it.
The regular expression ^\{\d{3}\\$ matches a string that consists entirely of an opening brace, three digits and one backslash. In Tcl, this becomes {^\{\d{3}\\$}. There's no doubling of backslashes or any sort of escaping needed, as long as you escape literal braces in the regular expression. { and \{ are both valid regular expressions to match a single opening brace in a Tcl ARE (and any Perl-style regex flavor, for that matter). Only the latter will work correctly in a Tcl literal enclosed with braces.
Finding Regex Matches
In Tcl, you can use the regexp command to test if a regular expression matches (part of) a string, and to retrieve the matched part(s). The syntax of the command is:
  regexp ?switches? regexp subject ?matchvar? ?group1var? ?group2var? ...
Immediately after the regexp command, you can place zero or more switches from the list above to indicate how Tcl should apply the regular expression. The only required parameters are the regular expression and the subject string. You can specify a literal regular expression using braces as I just explained. Or, you can reference any string variable holding a regular expression read from a file or user input.
If you pass the name of a variable as an additional argument, Tcl will store the part of the string matched by the regular expression into that variable. Tcl will not set the variable to an empty string if the match attempt fails. If the regular expression has capturing groups, you can add additional variable names to capture the text matched by each group. If you specify fewer variables than the regex has capturing groups, the text matched by the additional groups is not stored. If you specify more variables than the regex has capturing groups, the additional variables will be set to an empty string if the overall regex match was successful.
The regexp command returns 1 if (part of) the string could be matched, and zero if there's no match.
The following script applies the regular expression my regex case insensitively to the string stored in the variable subjectstring and displays the result:
  if [
    regexp -nocase {my regex} $subjectstring matchresult
  ] then {
    puts $matchresult
  } else {
    puts "my regex could not match the subject string"
  }
The regexp command supports three more switches that aren't regex mode modifiers. The -all switch causes the command to return a number indicating how many times the regex could be matched. The variables storing the regex and group matches will store the last match in the string only.
The -inline switch tells the regexp command to return an array with the substring matched by the regular expression and all substrings matched by all capturing groups. If you also specify the -all switch, the array will contain the first regex match, all the group matches of the first match, then the second regex match, the group matches of the second match, etc.
The -start switch must be followed by a number (as a separate Tcl word) that indicates the character offset in the subject string at which Tcl should attempt the match. Everything before the starting position will be invisible to the regex engine. This means that \A will match at the character offset you specify with -start, even if that position is not at the start of the string.
Replacing Regex Matches
With the regsub command, you can replace regular expression matches in a string.
  regsub ?switches? regexp subject replacement ?resultvar?
Just like the regexp command, regsub takes zero or more switches followed by a regular expression. It supports the same switches, except for -inline. Remember to specify -all if you want to replace all matches in the string.
The argument after the regexp should be the subject string, followed by the replacement text. You can specify a literal replacement using the brace syntax, or reference a string variable. The regsub command recognizes a few metacharacters in the replacement text. You can use \0 as a placeholder for the whole regex match, and \1 through \9 for the text matched by one of the first nine capturing groups. You can also use & as a synonym of \0. Note that there's no backslash in front of the ampersand. & is substituted with the whole regex match, while \& is substituted with a literal ampersand. Use \\ to insert a literal backslash. You only need to escape backslashes if they're followed by a digit, to prevent the combination from being seen as a backreference. Again, to prevent unnecessary duplication of backslashes, you should enclose the replacement text with braces instead of double quotes. The replacement text \1 becomes {\1} when using braces, and "\\1" when using quotes.
The final argument is optional. If you pass a variable reference as the final argument, that variable will receive the string with the replacements applied, and regsub will return an integer indicating the number of replacements made. If you omit the final argument, regsub will return the string with the replacements applied.
24. VBScript's Regular Expression Support
VBScript has built-in support for regular expressions. If you use VBScript to validate user input on a web page at the client side, using VBScript's regular expression support will greatly reduce the amount of code you need to write.
Microsoft made some significant enhancements to VBScript's regular expression support in version 5.5 of Internet Explorer. Version 5.5 implements quite a few essential regex features that were missing in previous versions of VBScript. Internet Explorer 6.0 does not expand the regular expression functionality. Whenever this book mentions VBScript, the statements refer to VBScript's version 5.5 regular expression support.
In fact, the regular expression flavor used in the version 5.5 VBScript object is the same one used by JavaScript and JScript. The regex flavor is part of the ECMA-262 standard for JavaScript. Therefore, everything said about JavaScript's regular expression flavor in this book also applies to VBScript.
JavaScript and VBScript implement Perl-style regular expressions. However, they lack quite a number of advanced features available in Perl and other modern regular expression flavors:
No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.
Lookbehind is not supported at all. Lookahead is fully supported.
No atomic grouping or possessive quantifiers.
No Unicode support, except for matching single characters with \uFFFF.
No named capturing groups. Use numbered capturing groups instead.
No mode modifiers to set matching options within the regular expression.
No conditionals.
No regular expression comments. Describe your regular expression with VBScript apostrophe comments instead, outside the regular expression string.
Version 1.0 of the RegExp object even lacks basic features like lazy quantifiers. This is the main reason this book does not discuss VBScript RegExp 1.0. All versions of Internet Explorer prior to 5.5 include version 1.0 of the RegExp object. There are no other versions than 1.0 and 5.5.
How to Use the VBScript RegExp Object
You can use regular expressions in VBScript by creating one or more instances of the RegExp object. This object allows you to find regular expression matches in strings, and replace regex matches in strings with other strings. The functionality offered by VBScript's RegExp object is pretty much bare bones. However, it's more than enough for simple input validation and output formatting tasks typically done in VBScript.
The advantage of the RegExp object's bare-bones nature is that it's very easy to use. Create one, put in a regex, and let it match or replace. Only four properties and three methods are available.
After creating the object, assign the regular expression you want to search for to the Pattern property. If you want to use a literal regular expression rather than a user-supplied one, simply put the regular expression in a double-quoted string. By default, the regular expression is case sensitive. Set the IgnoreCase property to True to make it case insensitive. The caret and dollar only match at the very start and very end of the subject string by default. If your subject string consists of multiple lines separated by line breaks, you can make the caret and dollar match at the start and the end of those lines by setting the Multiline property to True. VBScript does not have an option to make the dot match line break characters. Finally, if you want the RegExp object to return or replace all matches instead of just the first one, set the Global property to True.
  ' Prepare a regular expression object
  Set myRegExp = New RegExp
  myRegExp.IgnoreCase = True
  myRegExp.Global = True
  myRegExp.Pattern = "regex"
After setting the RegExp object's properties, you can invoke one of the three methods to perform one of three basic tasks.
The Test method takes one parameter: a string to test the regular expression on. Test returns True or False, indicating if the regular expression matches (part of) the string. When validating user input, you'll typically want to check if the entire string matches the regular expression. To do so, put a caret at the start of the regex, and a dollar at the end, to anchor the regex at the start and end of the subject string.
The Execute method also takes one string parameter. Instead of returning True or False, it returns a MatchCollection object. If the regex could not match the subject string at all, MatchCollection.Count will be zero. If the RegExp.Global property is False (the default), MatchCollection will contain only the first match. If RegExp.Global is True, MatchCollection will contain all matches.
The Replace method takes two string parameters. The first parameter is the subject string, while the second parameter is the replacement text. If the RegExp.Global property is False (the default), Replace will return the subject string with the first regex match (if any) substituted with the replacement text. If RegExp.Global is True, Replace will return the subject string with all matches replaced.
You can specify an empty string as the replacement text. This will cause the Replace method to return the subject string with all regex matches deleted from it. To re-insert the regex match as part of the replacement, include "$&" in the replacement text. E.g. to enclose each regex match in the string between square brackets, specify "[$&]" as the replacement text. If the regexp contains capturing parentheses, you can use backreferences in the replacement text. $1 in the replacement text inserts the text matched by the first capturing group, $2 the second, etc. up to $9. To include a literal dollar sign in the replacements, put two consecutive dollar signs in the string you pass to the Replace method.
Getting Information about Individual Matches
The MatchCollection object returned by the RegExp.Execute method is a collection of Match objects. It has only two read-only properties. The Count property indicates how many matches the collection holds. The Item property takes an index parameter (ranging from zero to Count-1), and returns a Match object. The Item property is the default member, so you can write MatchCollection(7) as a shorthand to MatchCollection.Item(7).
The easiest way to process all matches in the collection is to use a For Each construct, e.g.:
  ' Pop up a message box for each match
  Set myMatches = myRegExp.Execute(subjectString)
  For Each myMatch in myMatches
    msgbox myMatch.Value, 0, "Found Match"
  Next
The Match object has four read-only properties. The FirstIndex property indicates the number of characters in the string to the left of the match. If the match was found at the very start of the string, FirstIndex will be zero. If the match starts at the second character in the string, FirstIndex will be one, etc. Note that this is different from the VBScript Mid function, which extracts the first character of the string if you set the start parameter to one. The Length property of the Match object indicates the number of characters in the match. The Value property returns the text that was matched.
The SubMatches property of the Match object is a collection of strings. It will only hold values if your regular expression has capturing groups. The collection will hold one string for each capturing group. The Count property indicates the number of strings in the collection. The Item property takes an index parameter, and returns the text matched by the capturing group. The Item property is the default member, so you can write SubMatches(7) as a shorthand to SubMatches.Item(7). Unfortunately, VBScript does not offer a way to retrieve the match position and length of capturing groups.
Also unfortunate is that the SubMatches property does not hold the complete regex match as SubMatches(0). Instead, SubMatches(0) holds the text matched by the first capturing group, while SubMatches(SubMatches.Count-1) holds the text matched by the last capturing group. This is different from most other programming languages. E.g. in VB.NET, Match.Groups(0) returns the whole regex match, and Match.Groups(1) returns the first capturing group's match. Note that this is also different from the backreferences you can use in the replacement text passed to the RegExp.Replace method. In the replacement text, $1 inserts the text matched by the first capturing group, just like most other regex flavors do. $0 is not substituted with anything but inserted literally.
25. VBScript RegExp Example: Regular Expression Tester
  [...] 0 Then
      Set match = matches(0)
      msg = "Found match """ & match.Value & _
            """ at position " & match.FirstIndex & vbCRLF
      If match.SubMatches.Count > 0 Then
        For I = 0 To match.SubMatches.Count-1
          msg = msg & "Group #" & I+1 & " matched """ & _
                match.SubMatches(I) & """" & vbCRLF
        Next
      End If
      msgbox msg, 0, "VBScript Regular Expression Tester"
    Else
      msgbox "No match", 0, "VBScript Regular Expression Tester"
    End If
  End Sub

  Sub btnMatchGlobal_OnClick
    Set re = New RegExp
    re.Pattern = document.demoMatch.regex.value
    re.Global = True
    Set matches = re.Execute(document.demoMatch.subject.value)
    If matches.Count > 0 Then
      msg = "Found " & matches.Count & " matches:" & vbCRLF
      For Each match In Matches
        msg = msg & "Found match """ & match.Value & _
              """ at position " & match.FirstIndex & vbCRLF
      Next
      msgbox msg, 0, "VBScript Regular Expression Tester"
    Else
      msgbox "No match", 0, "VBScript Regular Expression Tester"
    End If
  End Sub

  Sub btnReplace_OnClick
    Set re = New RegExp
    re.Pattern = document.demoMatch.regex.value
    re.Global = True
    document.demoMatch.result.value = _
      re.Replace(document.demoMatch.subject.value, _
                 document.demoMatch.replacement.value)
  End Sub
Regexp:
Subject string:
Replacement text:
Result:
26. How to Use Regular Expressions in Visual Basic
Unlike Visual Basic.NET, which has access to the excellent regular expression support of the .NET framework, good old Visual Basic 6 does not ship with any regular expression support.
However, VB6 does make it very easy to use functionality provided by ActiveX and COM libraries. One such library is Microsoft's VBScript scripting library, which has decent regular expression capabilities starting with version 5.5. It implements the same regular expression flavor used in JavaScript, as standardized in the ECMA-262 standard for JavaScript. This library is part of Internet Explorer 5.5 and later. It is available on all computers running Windows XP or Vista, and previous versions of Windows if the user upgraded to IE 5.5 or later. That includes almost every Windows PC that is used to connect to the Internet.
To use this library in your Visual Basic application, select Project|References in the VB IDE's menu. Scroll down the list to find the item "Microsoft VBScript Regular Expressions 5.5". It's immediately below the "Microsoft VBScript Regular Expressions 1.0" item. Make sure to tick the 5.5 version, not the 1.0 version. The 1.0 version is only provided for backward compatibility. Its capabilities are less than satisfactory.
After adding the reference, you can see which classes and class members the library provides. Select View|Object Browser in the menu. In the Object Browser, select the "VBScript_RegExp_55" library in the drop-down list in the upper left corner. For a detailed description, see the VBScript regular expression reference in this book. Anything said about JavaScript's flavor of regular expressions in the tutorial also applies to VBScript's flavor. The only difference between VB6 and VBScript is that you'll need to use a Dim statement to declare the objects prior to creating them. Here's a complete code snippet. It's the two code snippets on the VBScript page put together, with three Dim statements added.
' Prepare a regular expression object
Dim myRegExp As RegExp
Dim myMatches As MatchCollection
Dim myMatch As Match
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Global = True
myRegExp.Pattern = "regex"
Set myMatches = myRegExp.Execute(subjectString)
For Each myMatch In myMatches
  MsgBox (myMatch.Value)
Next
27. XML Schema Regular Expressions

The W3C XML Schema standard defines its own regular expression flavor. You can use it in the pattern facet of simple type definitions in your XML schemas. E.g. the following defines the simple type "SSN" using a regular expression to require the element to contain a valid US social security number.

Compared with other regular expression flavors, the XML schema flavor is quite limited in features. Since it's only used to validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you won't really miss the features often found in other flavors. The limitations allow schema validators to be implemented with efficient text-directed engines.

Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries and lookaround. XML schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to be considered valid. If you have the pattern regexp, the XML schema validator will apply it in the same way as, say, Perl, Java or .NET would do with the pattern ^regexp$.

If you want to accept all elements with "regex" somewhere in the middle of their contents, you'll need to use the regular expression .*regex.*. The two .* tokens expand the match to cover the whole element, assuming it doesn't contain line breaks. If you want to allow line breaks, you can use something like [\s\S]*regex[\s\S]*. Combining a shorthand character class with its negated version results in a character class that matches anything.
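The implicit anchoring described above can be imitated in other flavors. The following is a minimal sketch in Python, whose re module is not the XML Schema flavor; it only assumes that re.fullmatch is a fair stand-in for a validator that requires the whole element to match. The helper name xml_schema_valid is made up for this illustration.

```python
import re

def xml_schema_valid(pattern: str, value: str) -> bool:
    """Return True only if the pattern covers the entire value,
    mimicking the implicit ^...$ anchoring of XML Schema patterns."""
    return re.fullmatch(pattern, value) is not None

multiline = "line 1\na regex\nline 3"

print(xml_schema_valid(r"regex", "regex"))                  # whole value is "regex"
print(xml_schema_valid(r"regex", "a regex here"))           # extra text, so no match
print(xml_schema_valid(r".*regex.*", "a regex here"))       # .* covers the rest
print(xml_schema_valid(r".*regex.*", multiline))            # the dot stops at line breaks
print(xml_schema_valid(r"[\s\S]*regex[\s\S]*", multiline))  # [\s\S] also spans line breaks
```

The five calls print True, False, True, False and True, in that order.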
XML schemas do not provide a way to specify matching modes. The dot never matches line breaks, and patterns are always applied case sensitively. If you want to apply literal case insensitively, you'll need to rewrite it as [lL][iI][tT][eE][rR][aA][lL].
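Rewriting a literal by hand is tedious, so the conversion can be automated. The sketch below is one possible approach in Python; the function name case_insensitive_literal is hypothetical, and re.escape escapes according to Python's rules, which may not coincide exactly with the XML Schema escaping rules for every character.

```python
import re

def case_insensitive_literal(text: str) -> str:
    """Turn a literal into a pattern that matches it in any letter case,
    for flavors that offer no case insensitive mode."""
    parts = []
    for ch in text:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            # A cased letter: list both variants in a character class.
            parts.append(f"[{lower}{upper}]")
        else:
            # Anything else is matched literally (escaped for Python's flavor).
            parts.append(re.escape(ch))
    return "".join(parts)

pattern = case_insensitive_literal("literal")
print(pattern)  # [lL][iI][tT][eE][rR][aA][lL]
print(bool(re.fullmatch(pattern, "LiTeRaL")))  # True
```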
XML regular expressions don't have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You can use the XML syntax for this, or simply copy the character directly from a character map.

Lazy quantifiers are not available. Since the pattern is anchored at the start and the end of the subject string anyway, and only a success/failure result is returned, the only potential difference between a greedy and lazy quantifier would be performance. You can never make a fully anchored pattern match or fail by changing a greedy quantifier into a lazy one or vice versa.
XML regular expressions support the following:

Character classes, including shorthands, ranges and negated classes.
Character class subtraction.
The dot, which matches any character except line breaks.
Alternation and groups.
Greedy quantifiers ?, *, + and {n,m}
Unicode properties and blocks

XML Character Classes

Despite its limitations, XML schema regular expressions introduce two handy features. The special short-hand character classes \i and \c make it easy to match XML names. No other regex flavor supports these.

Character class subtraction makes it easy to match a character that is in a certain list, but not in another list. E.g. [a-z-[aeiou]] matches an English consonant. This feature is now also available in the JGsoft and .NET regex engines. It is particularly handy when working with Unicode properties. E.g. [\p{L}-[\p{IsBasicLatin}]] matches any letter that is not an English letter.
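Flavors without character class subtraction can often get the same effect with a negative lookahead placed in front of the class. This is a swapped-in technique, not subtraction itself. A minimal sketch in Python, whose re module has no subtraction syntax:

```python
import re

# Emulates [a-z-[aeiou]]: the lookahead rejects vowels,
# then [a-z] consumes one of the remaining letters.
consonant = re.compile(r"(?![aeiou])[a-z]")

print(consonant.findall("regular"))  # ['r', 'g', 'l', 'r']
```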
Part 4
Reference

1. Basic Syntax Reference

Characters

Character: Any character except [\^$.|?*+()
Description: All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier).
Example: a matches "a"

Character: \ (backslash) followed by any of [\^$.|?*+(){}
Description: A backslash escapes special characters to suppress their special meaning.
Example: \+ matches "+"

Character: \Q...\E
Description: Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
Example: \Q+-*/\E matches "+-*/"

Character: \xFF where FF are 2 hexadecimal digits
Description: Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes.
Example: \xA9 matches "©" when using the Latin-1 code page.

Character: \n, \r and \t
Description: Match an LF character, CR character and a tab character respectively. Can be used in character classes.
Example: \r\n matches a DOS/Windows CRLF line break.

Character: \a, \e, \f and \v
Description: Match a bell character (\x07), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes.

Character: \cA through \cZ
Description: Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes.
Example: \cM\cJ matches a DOS/Windows CRLF line break.
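Several of the tokens above can be tried out directly. The sketch below uses Python's re module, which is only one of the flavors covered here: it has no \Q...\E (re.escape is used as the nearest counterpart) and does not accept \cM\cJ, so the numeric form \x0D\x0A is shown instead.

```python
import re

# A backslash suppresses the special meaning of a single metacharacter.
plus_signs = re.findall(r"\+", "1+1=2")

# re.escape() plays the role of \Q...\E: it escapes a whole literal string.
literal_hit = re.search(re.escape("+-*/"), "a+-*/b").group()

# \xA9 is the copyright sign (U+00A9, the same value as in Latin-1).
copyright_hit = re.search(r"\xA9", "\u00a9 2007").group()

# \r\n matches a CRLF pair; \x0D\x0A spells the same pair numerically.
subject = "one\r\ntwo\r\n"
crlf_count = len(re.findall(r"\r\n", subject))
crlf_hex_count = len(re.findall(r"\x0D\x0A", subject))

print(plus_signs, literal_hit, ascii(copyright_hit), crlf_count, crlf_hex_count)
```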
Character Classes or Character Sets [abc]

Character: [ (opening square bracket)
Description: Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except \n, \r, \t and \xFF

Character: Any character except ^-]\
Description: All characters except the listed special characters add that character to the possible matches for the character class.
Example: [abc] matches "a", "b" or "c"

Character: \ (backslash) followed by any of ^-]\
Description: A backslash escapes special characters to suppress their special meaning.
Example: [\^\]] matches "^" or "]"

Character: - (hyphen) except immediately after the opening [
Description: Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)
Example: [a-zA-Z0-9] matches any letter or digit

Character: ^ (caret) immediately after the opening [
Description: Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)
Example: [^a-d] matches "x" (any character except a, b, c or d)

Character: \d, \w and \s
Description: Shorthand character classes matching digits 0-9, word characters (letters, digits and the underscore) and whitespace respectively. Can be used inside and outside character classes.
Example: [\d\s] matches a character that is a digit or whitespace

Character: \D, \W and \S
Description: Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)
Example: \D matches a character that is not a digit

Character: [\b]
Description: Inside a character class, \b is a backspace character.
Example: [\b\t] matches a backspace or tab character

Dot

Character: . (dot)
Description: Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.
Example: . matches "x" or (almost) any other character

Anchors

Character: ^ (caret)
Description: Matches at the start of the string the regex pattern is applied to.
Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.
Example: ^. matches "a" in "abc\ndef". Also matches "d" in "multi-line" mode.

Character: $ (dollar)
Description: Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
Example: .$ matches "f" in "abc\ndef". Also matches "c" in "multi-line" mode.

Character: \A
Description: Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks.
Example: \A. matches "a" in "abc"

Character: \Z
Description: Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks, except for the very last line break if the string ends with a line break.
Example: .\Z matches "f" in "abc\ndef"

Character: \z
Description: Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks.
Example: .\z matches "f" in "abc\ndef"

Word Boundaries

Character: \b
Description: Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.
Example: .\b matches "c" in "abc"

Character: \B
Description: Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).
Example: \B.\B matches "b" in "abc"

Alternation

Character: | (pipe)
Description: Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.
Example: abc|def|xyz matches "abc", "def" or "xyz"

Character: | (pipe)
Description: The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.
Example: abc(def|xyz) matches "abcdef" or "abcxyz"

Quantifiers

Character: ? (question mark)
Description: Makes the preceding item optional.
Greedy, so the optional item is included in the match if possible.
Example: abc? matches "ab" or "abc"

Character: ??
Description: Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. This construct is often excluded from documentation because of its limited use.
Example: abc?? matches "ab" or "abc"

Character: * (star)
Description: Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.
Example: ".*" matches the text "def" "ghi" in the string abc "def" "ghi" jkl

Character: *? (lazy star)
Description: Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.
Example: ".*?" matches the text "def" in the string abc "def" "ghi" jkl

Character: + (plus)
Description: Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.
Example: ".+" matches the text "def" "ghi" in the string abc "def" "ghi" jkl

Character: +? (lazy plus)
Description: Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
Example: ".+?" matches the text "def" in the string abc "def" "ghi" jkl

Character: {n} where n is an integer >= 1
Description: Repeats the previous item exactly n times.
Example: a{3} matches "aaa"

Character: {n,m} where n >= 1 and m >= n
Description: Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.
Example: a{2,4} matches "aa", "aaa" or "aaaa"

Character: {n,m}? where n >= 1 and m >= n
Description: Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.
Example: a{2,4}? matches "aaaa", "aaa" or "aa"

Character: {n,} where n >= 1
Description: Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times.
Example: a{2,} matches "aaaaa" in "aaaaa"

Character: {n,}? where n >= 1
Description: Repeats the previous item at least n times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.
Example: a{2,}? matches "aa" in "aaaaa"
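The difference between the greedy and lazy quantifiers in this reference is easy to observe. The sketch below reruns several of the examples with Python's re module, assuming its behavior is representative of the other backtracking flavors for these tokens.

```python
import re

subject = 'abc "def" "ghi" jkl'

# Greedy star: runs to the last quote it can still end on.
greedy_star = re.search(r'".*"', subject).group()
# Lazy star: stops at the first closing quote.
lazy_star = re.search(r'".*?"', subject).group()

# Greedy and lazy {n,} on a run of five a's.
greedy_min = re.search(r"a{2,}", "aaaaa").group()
lazy_min = re.search(r"a{2,}?", "aaaaa").group()

# Greedy and lazy optional item.
greedy_opt = re.search(r"abc?", "abc").group()
lazy_opt = re.search(r"abc??", "abc").group()

print(greedy_star)  # "def" "ghi"
print(lazy_star)    # "def"
print(greedy_min, lazy_min)  # aaaaa aa
print(greedy_opt, lazy_opt)  # abc ab
```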
2. Advanced Syntax Reference

Grouping and Backreferences

Character: (regex)
Description: Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
Example: (abc){3} matches "abcabcabc". First group matches "abc".

Character: (?:regex)
Description: Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.
Example: (?:abc){3} matches "abcabcabc". No groups.

Character: \1 through \9
Description: Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.
Example: (abc|def)=\1 matches "abc=abc" or "def=def", but not "abc=def" or "def=abc".

Modifiers

Character: (?i)
Description: Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)
Example: te(?i)st matches "teST" but not "TEST".

Character: (?-i)
Description: Turn off case insensitivity for the remainder of the regular expression.
Example: (?i)te(?-i)st matches "TEst" but not "TEST".
Character: (?s)
Description: Turn on "dot matches newline" for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

Character: (?-s)
Description: Turn off "dot matches newline" for the remainder of the regular expression.

Character: (?m)
Description: Caret and dollar match after and before newlines for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

Character: (?-m)
Description: Caret and dollar only match at the start and end of the string for the remainder of the regular expression.

Character: (?x)
Description: Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments.

Character: (?-x)
Description: Turn off free-spacing mode.

Character: (?i-sm)
Description: Turns on the option "i", and turns off "s" and "m" for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

Character: (?i-sm:regex)
Description: Matches the regex inside the span with the option "i" turned on, and "s" and "m" turned off.
Example: (?i:te)st matches "TEst" but not "TEST".
Atomic Grouping and Possessive Quantifiers

Character: (?>regex)
Description: Atomic groups prevent the regex engine from backtracking back into the group (forcing the group to discard part of its match) after a match has been found for the group. Backtracking can occur inside the group before it has matched completely, and the engine can backtrack past the entire group, discarding its match entirely. Eliminating needless backtracking provides a speed increase. Atomic grouping is often indispensable when nesting quantifiers to prevent a catastrophic amount of backtracking as the engine needlessly tries pointless permutations of the nested quantifiers.
Example: x(?>\w+)x is more efficient than x\w+x if the second x cannot be matched.

Character: ?+, *+, ++ and {m,n}+
Description: Possessive quantifiers are a limited yet syntactically cleaner alternative to atomic grouping. Only available in a few regex flavors. They behave as normal greedy quantifiers, except that they will not give up part of their match for backtracking.
Example: x++ is identical to (?>x+)

Lookaround

Character: (?=regex)
Description: Zero-width positive lookahead.
Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends.
Example: t(?=s) matches the second "t" in "streets".

Character: (?!regex)
Description: Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.
Example: t(?!s) matches the first "t" in "streets".
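The two lookahead examples can be checked by looking at where each match starts. A short sketch in Python, whose re module supports both forms of lookahead:

```python
import re

subject = "streets"  # the two t's sit at index 1 and index 5

positive = re.search(r"t(?=s)", subject)  # a "t" that is followed by "s"
negative = re.search(r"t(?!s)", subject)  # a "t" that is not followed by "s"

# The lookahead is zero-width: the match itself is only the "t".
print(positive.group(), positive.start())  # t 5
print(negative.group(), negative.start())  # t 1
```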
Character: (?<name>regex)
Description: Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.

Character: (?'name'regex)
Description: Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the single quotes. The name may consist of letters and digits.

Character: \k<name>
Description: Substituted with the text matched by the capturing group with the given name.
Example: (?<group>abc|def)=\k<group> matches "abc=abc" or "def=def", but not "abc=def" or "def=abc".

Character: \k'name'
Description: Substituted with the text matched by the capturing group with the given name.
Example: (?'group'abc|def)=\k'group' matches "abc=abc" or "def=def", but not "abc=def" or "def=abc".

Character: (?(name)then|else)
Description: If the capturing group "name" took part in the match attempt thus far, the "then" part must match for the overall regex to match. If the capturing group "name" did not take part in the match, the "else" part must match for the overall regex to match.
Example: (?<group>a)?(?(group)b|c) matches "ab", the first "c" and the second "c" in "babxcac"

Python Syntax for Named Capture and Backreferences

Character: (?P<name>regex)
Description: Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.

Character: (?P=name)
Description: Substituted with the text matched by the capturing group with the given name. Not a group, despite the syntax using round brackets.
Example: (?P<group>abc|def)=(?P=group) matches "abc=abc" or "def=def", but not "abc=def" or "def=abc".
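The Python syntax above can be run as-is in Python's re module, which also supports the conditional (with the group written in the (?P<name>...) form). A minimal sketch reproducing both examples:

```python
import re

# Named capture and named backreference, Python syntax.
named = re.compile(r"(?P<group>abc|def)=(?P=group)")
same = named.fullmatch("abc=abc")
different = named.fullmatch("abc=def")

# Conditional: "b" is required if the group matched "a", otherwise "c".
conditional = re.compile(r"(?P<group>a)?(?(group)b|c)")
hits = [m.group(0) for m in conditional.finditer("babxcac")]

print(same.group("group"))  # abc
print(different)            # None
print(hits)                 # ['ab', 'c', 'c']
```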
XML Character Classes

Character: \i
Description: Matches any character that may be the first character of an XML name, i.e. [_:A-Za-z].

Character: \c
Description: \c matches any character that may occur after the first character in an XML name, i.e. [-._:A-Za-z0-9]
Example: \i\c* matches an XML name like "xml:schema"

Character: \I
Description: Matches any character that cannot be the first character of an XML name, i.e. [^_:A-Za-z].

Character: \C
Description: Matches any character that cannot occur in an XML name, i.e. [^-._:A-Za-z0-9].

Character: [abc-[xyz]]
Description: Subtracts character class "xyz" from character class "abc". The result matches any single character that occurs in the character class "abc" but not in the character class "xyz".
Example: [a-z-[aeiou]] matches any letter that is not a vowel (i.e. a consonant).
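In flavors that lack \i and \c, the equivalent character classes listed above can be spelled out by hand. The sketch below does so in Python; it uses exactly the ASCII classes given in this reference, so it is only an approximation for names containing other characters.

```python
import re

# \i\c* written out with the classes given above:
# \i is [_:A-Za-z] and \c is [-._:A-Za-z0-9].
xml_name = re.compile(r"[_:A-Za-z][-._:A-Za-z0-9]*")

print(bool(xml_name.fullmatch("xml:schema")))  # True
print(bool(xml_name.fullmatch("1st-element")))  # False, a digit cannot start a name
```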
POSIX Bracket Expressions

Character: [:alpha:]
Description: Matches one character from a POSIX character class. Can only be used in a bracket expression.
Example: [[:digit:][:lower:]] matches one of "0" through "9" or "a" through "z"

Character: [.span-ll.]
Description: Matches a POSIX collation sequence. Can only be used in a bracket expression.
Example: [[.span-ll.]] matches "ll" in the Spanish locale

Character: [=x=]
Description: Matches a POSIX character equivalence. Can only be used in a bracket expression.
Example: [[=e=]] matches "e" and "ê" in the French locale
5. Regular Expression Flavor Comparison

The table below compares which regular expression flavors support which regex features and syntax. The features are listed in the same order as in the regular expression reference.

The comparison shows regular expression flavors rather than particular applications or programming languages implementing one of those regular expression flavors.

JGsoft: This flavor is used by the JGsoft products, including PowerGREP and EditPad Pro.
.NET: This flavor is used by programming languages based on the Microsoft .NET framework versions 1.x, 2.0 or 3.0. It is generally also the regex flavor used by applications developed in these programming languages.
Java: The regex flavor of the java.util.regex package, available in Java 4 (JDK 1.4.x) and later. A few features were added in Java 5 (JDK 1.5.x) and Java 6 (JDK 1.6.x). It is generally also the regex flavor used by applications developed in Java.
Perl: The regex flavor used in the Perl programming language, as of version 5.8. Versions prior to 5.6 do not support Unicode.
PCRE: The open source PCRE library. The feature set described here is available in PCRE 5.x and 6.x.
ECMA (JavaScript): The regular expression syntax defined in the 3rd edition of the ECMA-262 standard, which defines the scripting language commonly known as JavaScript.
Python: The regex flavor supported by Python's built-in re module.
Ruby: The regex flavor built into the Ruby programming language.
Tcl ARE: The regex flavor developed by Henry Spencer for the regexp command in Tcl 8.2 and 8.4, dubbed Advanced Regular Expressions.
POSIX BRE: Basic Regular Expressions as defined in the IEEE POSIX standard 1003.2.
POSIX ERE: Extended Regular Expressions as defined in the IEEE POSIX standard 1003.2.
XML: The regular expression flavor defined in the XML Schema standard.
Applications and languages implementing one of the above flavors are:

AceText: Version 2 and later use the JGsoft engine. Version 1 did not support regular expressions at all.
C#: As a .NET programming language, C# can use the System.Text.RegularExpressions classes, listed as ".NET" below.
Delphi for .NET: As a .NET programming language, the .NET version of Delphi can use the System.Text.RegularExpressions classes, listed as ".NET" below.
Delphi for Win32: Delphi for Win32 does not have built-in regular expression support. Many free PCRE wrappers are available.
EditPad Pro: Version 6 and later use the JGsoft engine. Earlier versions used PCRE, without Unicode support.
egrep: The traditional UNIX egrep command uses the "POSIX ERE" flavor, though not all implementations fully adhere to the standard.
grep: The traditional UNIX grep command uses the "POSIX BRE" flavor, though not all implementations fully adhere to the standard.
Java: The regex flavor of the java.util.regex package is listed as "Java" in the table below.
JavaScript: JavaScript's regex flavor is listed as "ECMA" in the table below.
MySQL: MySQL uses POSIX Extended Regular Expressions, listed as "POSIX ERE" in the table below.
Oracle: Oracle Database 10g implements POSIX Extended Regular Expressions, listed as "POSIX ERE" in the table below. Oracle supports backreferences \1 through \9, though these are not part of the POSIX ERE standard.
Perl: Perl's regex flavor is listed as "Perl" in the table below.
PHP: PHP's ereg functions implement the "POSIX ERE" flavor, while the preg functions implement the "PCRE" flavor.
PostgreSQL: PostgreSQL 7.4 and later uses Henry Spencer's "Advanced Regular Expressions" flavor, listed as "Tcl ARE" in the table below. Earlier versions used POSIX Extended Regular Expressions, listed as "POSIX ERE".
PowerGREP: Version 3 and later use the JGsoft engine. Earlier versions used PCRE, without Unicode support.
Python: Python's regex flavor is listed as "Python" in the table below.
REALbasic: REALbasic's RegEx class is a wrapper around PCRE.
RegexBuddy: Version 3 and later use a special version of the JGsoft engine that emulates all the regular expression flavors in this comparison. Version 2 supported the JGsoft regex flavor only. Version 1 used PCRE, without Unicode support.
Ruby: Ruby's regex flavor is listed as "Ruby" in the table below.
Tcl: Tcl's Advanced Regular Expression flavor, the default flavor in Tcl 8.2 and later, is listed as "Tcl ARE" in the table below. Tcl's Extended Regular Expression and Basic Regular Expression flavors are listed as "POSIX ERE" and "POSIX BRE" in the table below.
VBScript: VBScript's RegExp object uses the same regex flavor as JavaScript, which is listed as "ECMA" in the table below.
Visual Basic 6: Visual Basic 6 does not have built-in support for regular expressions, but can easily use the "Microsoft VBScript Regular Expressions 5.5" COM object, which implements the "ECMA" flavor listed below.
Visual Basic.NET: As a .NET programming language, VB.NET can use the System.Text.RegularExpressions classes, listed as ".NET" below.
XML: The XML Schema regular expression flavor is listed as "XML" in the table below.
Characters

Feature: Backslash escapes one metacharacter
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: \Q...\E escapes a string of metacharacters
Supported by: JGsoft, Java, Perl, PCRE

Feature: \x00 through \xFF (ASCII character)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Feature: \n (LF), \r (CR) and \t (tab)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XML

Feature: \f (form feed) and \v (vtab)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Feature: \a (bell) and \e (escape)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE

Feature: \b (backspace) and \B (backslash)
Supported by: Tcl ARE

Feature: \cA through \cZ (control character)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Tcl ARE

Feature: \ca through \cz (control character)
Supported by: JGsoft, .NET, Perl, PCRE, JavaScript, Tcl ARE

Character Classes or Character Sets

Feature: [abc] character class
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: [a-z] character class range
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: [^abc] negated character class
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: Backslash escapes one character class metacharacter
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XML

Feature: \Q...\E escapes a string of character class metacharacters
Supported by: JGsoft, Java, Perl, PCRE

Feature: \d, \w and \s shorthand character classes
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XML

Feature: \D, \W and \S shorthand negated character classes
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XML

Feature: [\b] backspace
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Dot

Feature: . (dot; any character except line break)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Anchors

Feature: ^ (start of string/line)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE

Feature: $ (end of string/line)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE

Feature: \A (start of string)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE

Feature: \Z (end of string, before final line break)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE

Feature: \z (end of string)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Ruby

Word Boundaries

Feature: \b (at the beginning or end of a word)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby

Feature: \B (NOT at the beginning or end of a word)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby

Feature: \y (at the beginning or end of a word)
Supported by: JGsoft, Tcl ARE

Feature: \Y (NOT at the beginning or end of a word)
Supported by: JGsoft, Tcl ARE

Feature: \m (at the beginning of a word)
Supported by: JGsoft, Tcl ARE

Feature: \M (at the end of a word)
Supported by: JGsoft, Tcl ARE

Alternation

Feature: | (alternation)
Supported by: JGsoft,
.NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX ERE, XML

Quantifiers

Feature: ? (0 or 1)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX ERE, XML

Feature: * (0 or more)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: + (1 or more)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX ERE, XML

Feature: {n} (exactly n)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: {n,m} (between n and m)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: {n,} (n or more)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: ? after any of the above quantifiers to make it "lazy"
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Grouping and Backreferences

Feature: (regex) (numbered capturing group)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE, POSIX ERE, XML

Feature: (?:regex) (non-capturing group)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Feature: \1 through \9 (backreferences)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, POSIX BRE

Feature: \10 through \99 (backreferences)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Modifiers

Feature: (?i) (case insensitive)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE

Feature: (?s) (dot matches newlines)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python, Ruby

Feature: (?m) (^ and $ match at line breaks)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python

Feature: (?x) (free-spacing mode)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE

Feature: (?n) (explicit capture)
Supported by: JGsoft, .NET

Feature: (?-ismxn) (turn off mode modifiers)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Ruby

Feature: (?ismxn:group) (mode modifiers local to group)
Supported by: JGsoft, .NET, Java, Perl, PCRE, Ruby

Atomic Grouping and Possessive Quantifiers

Feature: (?>regex) (atomic group)
Supported by: JGsoft,
.NET, Java, Perl, PCRE, Ruby

Feature: ?+, *+, ++ and {m,n}+ (possessive quantifiers)
Supported by: JGsoft, Java, PCRE

Lookaround

Feature: (?=regex) (positive lookahead)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Feature: (?!regex) (negative lookahead)
Supported by: JGsoft, .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE

Feature: (?<name>regex) (named capturing group)
Supported by: JGsoft, .NET

Feature: (?'name'regex) (named capturing group)
Supported by: JGsoft, .NET

Feature: \k<name> (named backreference)
Supported by: JGsoft, .NET

Feature: \k'name' (named backreference)
Supported by: JGsoft, .NET

Python Syntax for Named Capture and Backreferences

Feature: (?P<name>regex) (named capturing group)
Supported by: JGsoft, PCRE, Python

Feature: (?P=name) (named backreference)
Supported by: JGsoft, PCRE, Python

XML Character Classes

Feature: \i, \I, \c and \C shorthand XML name character classes
Supported by: XML

Feature: [abc-[abc]] character class subtraction
Supported by: JGsoft, .NET, XML

POSIX Bracket Expressions

Feature: [:alpha:] POSIX character class
Supported by: JGsoft, Perl, PCRE, Ruby, Tcl ARE, POSIX BRE, POSIX ERE

Feature: \p{Alpha} POSIX character class
Supported by: Java

Feature: \P{Alpha} negated POSIX character class
Supported by: Java

Feature: [.span-ll.] POSIX collation sequence
Supported by: Tcl ARE, POSIX BRE, POSIX ERE

Feature: [=x=] POSIX character equivalence
Supported by: Tcl ARE, POSIX BRE, POSIX ERE
6. Replacement Text Reference

The table below compares the various tokens that the various tools and languages discussed in this book recognize in the replacement text during search-and-replace operations.

The list of replacement text flavors is not the same as the list of regular expression flavors in the regex features comparison. The reason is that the replacements are not made by the regular expression engine, but by the tool or programming library providing the search-and-replace capability. The result is that tools or languages using the same regex engine may behave differently when it comes to making replacements. E.g. the PCRE library does not provide a search-and-replace function. All tools and languages implementing PCRE use their own search-and-replace feature, which may result in differences in the replacement text syntax. So these are listed separately.

To make the table easier to read, I did group tools and languages that use the exact same replacement text syntax. The labels for the replacement text flavors are only relevant in the table below. E.g. the .NET framework does have a built-in search-and-replace function in its Regex class, which is used by all tools and languages based on the .NET framework. So these are listed together under ".NET".

Note that the escape rules below only refer to the replacement text syntax. If you type the replacement text in an input box in the application you're using, or if you retrieve the replacement text from user input in the software you're developing, these are the only escape rules that apply. If you pass the replacement text as a literal string in programming language source code, you'll need to apply the language's string escape rules on top of the replacement text escape rules.
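The layering of string escapes on top of replacement text escapes is easiest to see in code. The sketch below uses Python's re.sub as one example of a replacement text flavor; the pattern and subject are made up for this illustration.

```python
import re

pattern = r"(?P<key>\w+)=(?P<value>\w+)"
subject = "abc=def"

# Raw string: the backslashes reach re.sub untouched, so \2 and \1
# are seen as backreferences.
swapped_raw = re.sub(pattern, r"\2=\1", subject)

# Ordinary string: each backslash must be doubled to survive
# Python's own string escape rules first.
swapped_escaped = re.sub(pattern, "\\2=\\1", subject)

# Named backreferences in the replacement text use \g<name>.
swapped_named = re.sub(pattern, r"\g<value>=\g<key>", subject)

# Without doubling, "\2" and "\1" become the control characters
# chr(2) and chr(1) before re.sub ever sees them.
mangled = re.sub(pattern, "\2=\1", subject)

print(swapped_raw, swapped_escaped, swapped_named)  # def=abc def=abc def=abc
print(ascii(mangled))  # '\x02=\x01'
```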
A flavor can have four levels of support (or non-support) for a particular token. A "YES" in the table below indicates the token will be substituted. A "no" indicates the token will remain in the replacement as literal text. Note that languages that use variable interpolation in strings may still replace tokens indicated as unsupported below, if the syntax of the token corresponds with the variable interpolation syntax. E.g. in Perl, $0 is replaced with the name of the script. Finally, "error" indicates the token will result in an error condition or exception, preventing any replacements being made at all.
JGsoft: This flavor is used by the JGsoft products, including PowerGREP, EditPad Pro and AceText.
.NET: This flavor is used by programming languages based on the Microsoft .NET framework versions 1.x, 2.0 or 3.0. It is generally also the regex flavor used by applications developed in these programming languages.
Java: The regex flavor of the java.util.regex package, available in Java 4 (JDK 1.4.x) and later. A few features were added in Java 5 (JDK 1.5.x) and Java 6 (JDK 1.6.x). It is generally also the regex flavor used by applications developed in Java.
Perl: The regex flavor used in the Perl programming language, as of version 5.8.
ECMA (JavaScript): The regular expression syntax defined in the 3rd edition of the ECMA-262 standard, which defines the scripting language commonly known as JavaScript. The VBScript RegExp object, which is also commonly used in VB6 applications, uses the same implementation with the same search-and-replace features.
Python: The regex flavor supported by Python's built-in re module.
Ruby: The regex flavor built into the Ruby programming language.
Tcl: The regex flavor used by the regsub command in Tcl 8.2 and 8.4, dubbed Advanced Regular Expressions in the Tcl man pages.
PHP ereg: The replacement text syntax used by the ereg_replace and eregi_replace functions in PHP.
PHP preg: The replacement text syntax used by the preg_replace function in PHP.
REALbasic: The replacement text syntax used by the ReplaceText property of the RegEx class in REALbasic.
Oracle: The replacement text syntax used by the REGEXP_REPLACE function in Oracle Database 10g.
Postgres: The replacement text syntax used by the regexp_replace function in PostgreSQL.
Syntax Using Backslashes

Feature: \& (whole regex match)
Supported by: JGsoft, Ruby, Postgres

Feature: \0 (whole regex match)
Supported by: JGsoft, Python, Ruby, Tcl, PHP ereg, PHP preg, REALbasic

Feature: \1 through \9 (backreference)
Supported by: JGsoft, Perl, Python, Ruby, Tcl, PHP ereg, PHP preg, REALbasic, Oracle, Postgres

Feature: \10 through \99 (backreference)
Supported by: JGsoft, Python, PHP preg, REALbasic

Feature: \10 through \99 treated as \1 through \9 (and a literal digit) if fewer than 10 groups
Supported by: JGsoft

Feature: \g<group> (named backreference)
Supported by: JGsoft, Python

Feature: \` (backtick; subject text to the left of the match)
Supported by: JGsoft, Ruby

Feature: \' (straight quote; subject text to the right of the match)
Supported by: JGsoft, Ruby

Feature: \+ (highest-numbered participating group)
Supported by: JGsoft, Ruby

Feature: Backslash escapes one backslash and/or dollar
Supported by: JGsoft, Java, Perl, Python, Ruby, Tcl, PHP ereg, PHP preg, REALbasic, Oracle, Postgres

Feature: Unescaped backslash as literal text
Supported by: JGsoft, .NET, Perl, JavaScript, Python, Ruby, Tcl, PHP ereg, PHP preg, REALbasic, Oracle, Postgres

Syntax Using Dollar Signs

Feature: $& (whole regex match)
Supported by: JGsoft, .NET, Perl, JavaScript, REALbasic

Feature: $0 (whole regex match)
Supported by: JGsoft, .NET, Java, PHP preg, REALbasic

Feature: $1 through $9 (backreference)
Supported by: JGsoft, .NET, Java, Perl, JavaScript, PHP preg, REALbasic

Feature: $10 through $99 (backreference)
Supported by: JGsoft, .NET, Java, Perl, JavaScript, PHP preg, REALbasic

Feature: $10 through $99 treated as $1 through $9 (and a literal digit) if fewer than 10 groups
Supported by: JGsoft, Java, JavaScript

Feature: ${1} through ${99} (backreference)
Supported by: JGsoft, .NET, Perl, PHP preg

Feature: ${group} (named backreference)
Supported by: JGsoft, .NET

Feature: $` (backtick; subject text to the left of the match)
Supported by: JGsoft, .NET, Perl, JavaScript, REALbasic

Feature: $' (straight quote; subject text to the right of the match)
Supported by: JGsoft, .NET, Perl, JavaScript, REALbasic

Feature: $_ (entire subject string)
Supported by: JGsoft, .NET, Perl, JavaScript

Feature: $+ (highest-numbered participating group)
Supported by: JGsoft, Perl

Feature: $+ (highest-numbered group in the regex)
Supported by: .NET, JavaScript

Feature: $$ (escape dollar with another dollar)
Supported by: JGsoft, .NET, JavaScript

Feature: $ (unescaped dollar as literal text)
Supported by: JGsoft, .NET, JavaScript, Python, Ruby, Tcl, PHP ereg, PHP preg, Oracle, Postgres

Tokens Without a Backslash or Dollar

Feature: & (whole regex match)
Supported by: Tcl

General Replacement Text Behavior

Feature: Backreferences to non-existent groups are silently removed
Supported by: JGsoft, Perl, Ruby, Tcl, PHP preg, REALbasic, Oracle, Postgres

Highest-Numbered Capturing Group

The $+ token is listed twice, because it doesn't have the same meaning in the languages that support it.
It was introduced in Perl, where the $+ variable holds the text matched by the highest-numbered capturing group that actually participated in the match. In several languages and libraries that intended to copy this feature, such as .NET and JavaScript, $+ is replaced with the highest-numbered capturing group, whether it participated in the match or not.

E.g. in the regex a(\d)|x(\w) the highest-numbered capturing group is the second one. When this regex matches "a4", the first capturing group matches "4", while the second group doesn't participate in the match attempt at all. In Perl, $+ will hold the "4" matched by the first capturing group, which is the highest-numbered group that actually participated in the match. In .NET or JavaScript, $+ will be substituted with nothing, since the highest-numbered group in the regex didn't capture anything. When the same regex matches "xy", Perl, .NET and JavaScript will all store "y" in $+.

Also note that .NET numbers named capturing groups after all non-named groups. This means that in .NET, $+ will always be substituted with the text matched by the last named group in the regex, whether it is followed by non-named groups or not, and whether it actually participated in the match or not.
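
The two meanings of $+ can be imitated in Python to make the difference concrete. This is only a sketch: Python's re module has no $+ token of its own, and the helper names highest_participating and highest_in_regex are invented here for illustration. The second helper also ignores the .NET rule that named groups are numbered after non-named ones.

```python
import re


def highest_participating(match):
    """Perl-style $+: text of the highest-numbered group that took part in the match."""
    # Walk from the last group in the regex down to group 1.
    for index in range(match.re.groups, 0, -1):
        text = match.group(index)
        if text is not None:  # None means the group did not participate
            return text
    return ""


def highest_in_regex(match):
    """.NET/JavaScript-style $+: text of the last group in the regex, empty if unused."""
    count = match.re.groups
    if count == 0:
        return ""
    text = match.group(count)
    return text if text is not None else ""


pattern = re.compile(r"a(\d)|x(\w)")
first = pattern.search("a4")   # only group 1 participates
second = pattern.search("xy")  # only group 2 participates

print(repr(highest_participating(first)), repr(highest_in_regex(first)))
print(repr(highest_participating(second)), repr(highest_in_regex(second)))
```

For the subject "a4" the two helpers disagree, because only the first group participated. For "xy" they agree, because the participating group is also the last group in the regex.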
