Regex In R Cheat Sheet



regex {base}    R Documentation

Regular Expressions as used in R

Description

This help page documents the regular expression patterns supported by grep and related functions grepl, regexpr, gregexpr, sub and gsub, as well as by strsplit and optionally by agrep and agrepl.

Details

A ‘regular expression’ is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE. There is also fixed = TRUE which can be considered to use a literal regular expression.
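As a minimal sketch of the three modes, the following compares how the same idea is expressed in each. The vector x and the result names are hypothetical and chosen only for illustration.

```r
x <- c("a.b", "axb")

# Default (extended) mode: "." is a metacharacter matching any character.
ere_hits <- grepl("a.b", x)

# fixed = TRUE: the pattern is taken literally, so "." only matches a dot.
fixed_hits <- grepl("a.b", x, fixed = TRUE)

# perl = TRUE: Perl-like syntax; here the dot is escaped to make it literal.
perl_hits <- grepl("a\\.b", x, perl = TRUE)

print(rbind(ere_hits, fixed_hits, perl_hits))
```

Only the default-mode call matches "axb", because only there is the dot left unescaped and treated as a metacharacter.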

Other functions which use regular expressions (often via the use of grep) include apropos, browseEnv, help.search, list.files and ls. These will all use extended regular expressions.

Patterns are described here as they would be printed by cat: (do remember that backslashes need to be doubled when entering R character strings, e.g. from the keyboard).
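The doubling rule can be illustrated with a short sketch. The pattern and the sample string "order 66" are made up for this example.

```r
pattern <- "\\d+"          # typed with two backslashes in R source code
cat(pattern, "\n")         # cat shows the pattern with a single backslash
n_chars <- nchar(pattern)  # the string holds three characters: \ d +

# The regex engine receives \d+ and matches a run of digits.
digits <- regmatches("order 66", regexpr(pattern, "order 66"))
print(digits)
```

The string that reaches the regex engine is what cat prints, not what was typed.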

Long regular expression patterns may or may not be accepted: the POSIX standard only requires up to 256 bytes.

Extended Regular Expressions

This section covers the regular expressions allowed in the default mode of grep, grepl, regexpr, gregexpr, sub, gsub, regexec and strsplit. They use an implementation of the POSIX 1003.2 standard: that allows some scope for interpretation and the interpretations here are those currently used by R. The implementation supports some extensions to the standard.

Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. The whole expression matches zero or more characters (read ‘character’ as ‘byte’ if useBytes = TRUE).

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.

Escaping non-metacharacters with a backslash is implementation-dependent. The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as LF, \r as CR and \t as TAB. (Note that these will be interpreted by R's parser in literal character strings.)

A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, when it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit, and [^abc] matches anything except the characters a, b or c. A range of characters may be specified by giving the first and last characters, separated by a hyphen. (Because their interpretation is locale- and implementation-dependent, character ranges are best avoided. Some but not all implementations include both cases in ranges when doing caseless matching.) The only portable way to specify all ASCII letters is to list them all as the character class
[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz].
(The current implementation uses numerical order of the encoding, normally a single-byte encoding or Unicode points.)
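A brief sketch of a character class and its negation, using the two classes quoted in the text. The input strings are hypothetical.

```r
x <- c("a1", "b", "c3")

# [0123456789] matches any single digit, so this flags elements containing one.
has_digit <- grepl("[0123456789]", x)

# [^abc] matches any character other than a, b or c; deleting those
# characters leaves only the a/b/c characters behind.
kept <- gsub("[^abc]", "", "aXbYcZ")

print(has_digit)
print(kept)
```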

Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.

[:alnum:]

Alphanumeric characters: [:alpha:] and [:digit:].

[:alpha:]

Alphabetic characters: [:lower:] and [:upper:].

[:blank:]

Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.

[:cntrl:]

Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In another character set, these are the equivalent characters, if any.

[:digit:]

Digits: 0 1 2 3 4 5 6 7 8 9.

[:graph:]

Graphical characters: [:alnum:] and [:punct:].

[:lower:]

Lower-case letters in the current locale.

[:print:]

Printable characters: [:alnum:], [:punct:] and space.

[:punct:]

Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

[:space:]

Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters.

[:upper:]

Upper-case letters in the current locale.

[:xdigit:]

Hexadecimal digits:
0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.

For example, [[:alnum:]] means [0-9A-Za-z], except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.) Most metacharacters lose their special meaning inside a character class. To include a literal ], place it first in the list. Similarly, to include a literal ^, place it anywhere but first. Finally, to include a literal -, place it first or last (or, for perl = TRUE only, precede it by a backslash). (Only ^ - \ ] are special inside character classes.)
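The named classes and the placement rules for ], ^ and - can be sketched as follows. The sample strings are illustrative only, and the results assume ASCII input.

```r
# Named classes need their own brackets inside the bracket list.
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")
squeezed <- gsub("[[:space:]]+", " ", "a \t b\n c")

# Literal ] first, ^ not first, - last: the class contains exactly ] ^ -
has_special <- grepl("[]^-]", c("a]b", "a^b", "a-b", "abc"))

print(no_punct)
print(squeezed)
print(has_special)
```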

The period . matches any single character. The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]). Symbols \d, \s, \D and \S denote the digit and space classes and their negations (these are all extensions).

The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. The symbols \< and \> match the empty string at the beginning and end of a word. The symbol \b matches the empty string at either edge of a word, and \B matches the empty string provided it is not at an edge of a word. (The interpretation of ‘word’ depends on the locale and implementation: these are all extensions.)
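A small sketch of anchors, word boundaries and the \d shorthand in the default mode. The inputs are hypothetical, and the results assume ASCII text.

```r
x <- c("cat", "concatenate", "the cat sat")

# ^ and $ anchor the match to the start and end of the string.
exact <- grepl("^cat$", x)

# \b matches at either edge of a word; \< and \> match at its start and end.
word_b     <- grepl("\\bcat\\b", x)
word_angle <- grepl("\\<cat\\>", x)

# \d is the digit class shorthand.
masked <- gsub("\\d", "#", "a1b22")

print(rbind(exact, word_b, word_angle))
print(masked)
```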

A regular expression may be followed by one of several repetition quantifiers:

?

The preceding item is optional and will be matched at most once.

*

The preceding item will be matched zero or more times.

+

The preceding item will be matched one or more times.

{n}

The preceding item is matched exactly n times.

{n,}

The preceding item is matched n or more times.

{n,m}

The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)
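The difference between greedy and minimal repetition, and the interval quantifier {n,m}, can be sketched like this. The markup string is a made-up example.

```r
x <- "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs from the first "<" to the last ">", here the whole string.
greedy <- sub("<.*>", "", x)

# Minimal: .*? stops at the first ">" it can, so only "<b>" is removed.
minimal <- sub("<.*?>", "", x)

# {2,3}: at least two and at most three repetitions of "a".
two_or_three <- grepl("^a{2,3}$", c("a", "aa", "aaa", "aaaa"))

print(greedy)
print(minimal)
print(two_or_three)
```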

Regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating the substrings that match the concatenated subexpressions.

Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. For example, abba|cde matches either the string abba or the string cde. Note that alternation does not work inside character classes, where | has its literal meaning.

Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
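The precedence rules can be checked with a short sketch. The test strings are hypothetical and extend the abba|cde example from the text.

```r
x <- c("abba", "cde", "abbcde")

# Parentheses make the anchors apply to the whole alternation.
strict <- grepl("^(abba|cde)$", x)

# Without them, alternation binds last: "^abba" OR "cde$".
loose <- grepl("^abba|cde$", x)

# Inside a character class, | is just a literal character.
pipe_in_class <- grepl("[a|b]", c("|", "c"))

# Repetition binds tighter than concatenation: in "ab+" only "b" repeats.
rep_char  <- grepl("^ab+$", c("abbb", "abab"))
rep_group <- grepl("^(ab)+$", c("abbb", "abab"))

print(rbind(strict, loose))
print(pipe_in_class)
print(rbind(rep_char, rep_group))
```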

The backreference \N, where N = 1 ... 9, matches the substring previously matched by the Nth parenthesized subexpression of the regular expression. (This is an extension for extended regular expressions: POSIX defines them only for basic ones.)
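A minimal sketch of backreferences, both inside a pattern and in the replacement argument of sub. The inputs are illustrative.

```r
# (.)\1 matches any character immediately followed by itself.
has_double <- grepl("(.)\\1", c("hello", "world"))

# In the replacement, \1 and \2 refer to the captured groups.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

print(has_double)
print(swapped)
```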

Perl-like Regular Expressions

The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and strsplit switches to the PCRE library that implements regular expression pattern matching using the same syntax and semantics as Perl 5.x, with just a few differences.

For complete details please consult the man pages for PCRE, especially man pcrepattern and man pcreapi, on your system or from the sources at https://www.pcre.org. (The version in use can be found by calling extSoftVersion. It need not be the version described in the system's man page. PCRE1 (reported as version < 10.00 by extSoftVersion) has been feature-frozen for some time (essentially 2012); the man pages at https://www.pcre.org/original/doc/html/ should be a good match. PCRE2 (PCRE version >= 10.00) has man pages at https://www.pcre.org/current/doc/html/).

Perl regular expressions can be computed byte-by-byte or (UTF-8) character-by-character: the latter is used in all multibyte locales and if any of the inputs are marked as UTF-8 (see Encoding), or as Latin-1 except in a Latin-1 locale.

All the regular expressions described for extended regular expressions are accepted except \< and \>: in Perl all backslashed metacharacters are alphanumeric and backslashed symbols always are interpreted as a literal character. { is not special if it would be the start of an invalid interval specification. There can be more than 9 backreferences (but the replacement in sub can only refer to the first 9).

Character ranges are interpreted in the numerical order of the characters, either as bytes in a single-byte locale or as Unicode code points in UTF-8 mode. So in either case [A-Za-z] specifies the set of ASCII letters.

In UTF-8 mode the named character classes only match ASCII characters: see \p below for an alternative.

The construct (?...) is used for Perl extensions in a variety of ways depending on what immediately follows the ?.

Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x). These can be concatenated, so for example, (?im) sets caseless multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx). These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set ‘ungreedy’ mode (so matching is minimal unless ? is used as part of the repetition quantifier, when it is greedy). Initially none of these options are set.
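The mode options can be tried out with a sketch along these lines. The two-line string and the result names are hypothetical.

```r
x <- "one\ntwo"

# (?i): caseless matching.
caseless <- grepl("(?i)hello", "HELLO", perl = TRUE)

# (?m): ^ and $ also match at line breaks inside the subject.
plain_anchor <- grepl("^two$", x, perl = TRUE)
multiline    <- grepl("(?m)^two$", x, perl = TRUE)

# (?s): the dot also matches a newline.
plain_dot   <- grepl("one.two", x, perl = TRUE)
single_line <- grepl("(?s)one.two", x, perl = TRUE)

# (?U): quantifiers become minimal by default, so a+ takes a single "a".
ungreedy <- sub("(?U)a+", "", "aaab", perl = TRUE)

print(c(caseless, plain_anchor, multiline, plain_dot, single_line))
print(ungreedy)
```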

If you want to remove the special meaning from a sequence of characters, you can do so by putting them between \Q and \E. This is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in Perl, $ and @ cause variable interpolation.
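As a short sketch, quoting with \Q...\E turns the + below into an ordinary character. The inputs are made up for illustration.

```r
x <- c("1+1=2", "11=2")

# Quoted: every character between \Q and \E is literal.
quoted <- grepl("\\Q1+1=2\\E", x, perl = TRUE)

# Unquoted: "1+" means one or more "1", so the pattern needs "11=2".
unquoted <- grepl("1+1=2", x, perl = TRUE)

print(rbind(quoted, unquoted))
```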

The escape sequences \d, \s and \w represent any decimal digit, space character and ‘word’ character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered) respectively, and their upper-case versions represent their negation. Vertical tab was not regarded as a space character in a C locale before PCRE 8.34. Sequences \h, \v, \H and \V match horizontal and vertical space or the negation. (In UTF-8 mode, these do match non-ASCII Unicode code points.)

There are additional escape sequences: \cx is cntrl-x for any x, \ddd is the octal character (for up to three digits unless interpretable as a backreference, as \1 to \7 always are), and \xhh specifies a character by two hex digits. In a UTF-8 locale, \x{h...} specifies a Unicode code point by one or more hex digits. (Note that some of these will be interpreted by R's parser in literal character strings.)

Outside a character class, \A matches at the start of a subject (even in multiline mode, unlike ^), \Z matches at the end of a subject or before a newline at the end, \z matches only at end of a subject, and \G matches at first matching position in a subject (which is subtly different from Perl's end of the previous match). \C matches a single byte, including a newline, but its use is warned against. In UTF-8 mode, \R matches any Unicode newline character (not just CR), and \X matches any number of Unicode characters that form an extended Unicode sequence. \X, \R and \B cannot be used inside a character class (with PCRE1, they are treated as characters X, R and B; with PCRE2 they cause an error).

A hyphen (minus) inside a character class is treated as a range, unless it is first or last character in the class definition. It can be quoted to represent the hyphen literal (\-). PCRE1 allows an unquoted hyphen at some other locations inside a character class where it cannot represent a valid range, but PCRE2 reports an error in such cases.

In UTF-8 mode, some Unicode properties may be supported via \p{xx} and \P{xx} which match characters with and without property xx respectively. For a list of supported properties see the PCRE documentation, but for example Lu is ‘upper case letter’ and Sc is ‘currency symbol’. (This support depends on the PCRE library being compiled with ‘Unicode property support’ which can be checked via pcre_config. PCRE2 when compiled with Unicode support always supports also Unicode properties.)
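A sketch of the Lu property, assuming the PCRE library in use was built with Unicode property support. The inputs are hypothetical; they are written with \u escapes so that R marks them as UTF-8 and matching happens in UTF-8 mode.

```r
# "\u00c9cole" starts with an upper-case E acute, "\u00e9cole" with a lower-case one.
words <- c("\u00c9cole", "\u00e9cole")

# \p{Lu} matches an upper-case letter, including non-ASCII ones.
starts_upper <- grepl("^\\p{Lu}", words, perl = TRUE)

print(starts_upper)
```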

The sequence (?# marks the start of a comment which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part at all in the pattern matching.

If the extended option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.
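Both comment styles can be sketched as follows. The phone-number-like pattern is a made-up example.

```r
# (?#...) is an inline comment and takes no part in matching.
inline <- grepl("a(?#this text is ignored)b", "ab", perl = TRUE)

# With (?x), unescaped whitespace is ignored and # starts a comment
# that runs to the next newline in the pattern.
pattern <- "(?x) \\d{3}  # three digits
             -           # a literal hyphen
             \\d{4}      # four digits"
extended <- grepl(pattern, "555-1234", perl = TRUE)

print(c(inline, extended))
```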

The pattern (?:...) groups characters just as parentheses do but does not make a backreference.

Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
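The assertions can be illustrated with a small sketch. The strings are hypothetical.

```r
# Positive lookahead: "foo" only when followed by "bar"; "bar" is not consumed.
ahead <- sub("foo(?=bar)", "X", "foobar", perl = TRUE)

# Negative lookahead: "foo" only when NOT followed by "bar".
not_ahead <- grepl("foo(?!bar)", c("foobar", "foobaz"), perl = TRUE)

# Positive lookbehind: digits that are preceded by a dollar sign.
x <- "price: $42, cost: 17"
behind <- regmatches(x, gregexpr("(?<=\\$)\\d+", x, perl = TRUE))[[1]]

print(ahead)
print(not_ahead)
print(behind)
```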

regexpr and gregexpr support ‘named capture’. If groups are named, e.g., '(?<first>[A-Z][a-z]+)' then the positions of the matches are also returned by name. (Named backreferences are not supported by sub.)
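A minimal sketch of reading named captures from regexpr through the capture.start, capture.length and capture.names attributes of its result. The input string and the group name last are hypothetical additions to the first group from the text.

```r
x <- "Ada Lovelace"
m <- regexpr("(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)", x, perl = TRUE)

# One row per input string, one column per capture group.
cap_start  <- attr(m, "capture.start")[1, ]
cap_length <- attr(m, "capture.length")[1, ]

# Cut each group out of the subject and label it with the group name.
parts <- substring(x, cap_start, cap_start + cap_length - 1)
names(parts) <- attr(m, "capture.names")

print(parts)
```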

Atomic grouping, possessive qualifiers and conditional and recursive patterns are not covered here.

Author(s)

This help page is based on the TRE documentation and the POSIX standard, and the pcre2pattern man page from PCRE2 10.35.

See Also

grep, apropos, browseEnv, glob2rx, help.search, list.files, ls, strsplit and agrep.

The TRE regexp syntax.

The POSIX 1003.2 standard at https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html.

The pcre2pattern or pcrepattern man page (found as part of https://www.pcre.org/original/pcre.txt), and details of Perl's own implementation at https://perldoc.perl.org/perlre.