This is a Regular Expressions Cheat Sheet. A regular expression (regex or regexp) is a pattern of characters that describes a set of strings. Regular expressions are one of the most widely used tools in natural language processing, and they can speed up common text manipulation tasks. Use this cheat sheet as a handy reminder when working with regular expressions.
More on Regular Expressions
To process regexes, you use a “regex engine.” Each engine uses a slightly different syntax, called a regex flavor. Python and R, two common programming languages we discuss on myTechMint, each have their own engine.
Since a regex describes a pattern of text, it can be used to check whether a pattern exists in a text, extract substrings from longer strings, and make adjustments to text. A regex can be as simple as a specific word, or it can describe a more general pattern of characters, such as the top-level domain in a URL.
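The three uses described above can be sketched with Python's built-in `re` module. The sample sentence and the variable names are made up for illustration, and the email pattern is deliberately simplistic.

```python
import re

text = "Contact us at help@example.com or visit https://example.org/docs"

# 1. Check whether a pattern exists in the text.
has_email = re.search(r"\w+@\w+\.\w+", text) is not None

# 2. Extract substrings: pull the top-level domain out of each host name.
tlds = re.findall(r"\w+\.(com|org|net)\b", text)

# 3. Adjust text: mask the user part of the email address.
masked = re.sub(r"\w+@", "***@", text)

print(has_email)
print(tlds)
print(masked)
```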
Definitions
- Literal Character: A literal character is the most basic regular expression you can use. It simply matches the actual character you write. So if you are trying to represent an “r,” you would write r.
- Metacharacter: A metacharacter is a character that has a special meaning to the regex engine instead of matching itself. Metacharacters can do things like signify the beginning of a line (^), the end of a line ($), or match any single character (.). To match a metacharacter literally, put a \ in front of it. A \ in front of certain letters also creates special sequences, such as \d for a digit.
- Character Class: A character class (or character set) tells the engine to look for one of a list of characters. It is signified by [ and ] with the characters you are looking for in the middle of the brackets.
- Capture Group: A capture group is signified by opening and closing round parentheses. It lets you group part of a regex so that other regex features, like quantifiers (see below), apply to the whole group, and it saves the text matched by the group for later use.
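A minimal Python sketch of the four definitions above, using the standard `re` module. The sample strings and variable names are invented for illustration.

```python
import re

# Literal character: "r" matches only the letter r.
literal_hits = re.findall(r"r", "mirror")

# Metacharacter: "." matches any single character except a newline.
dot_hits = re.findall(r"r.t", "rat rot r\nt")

# Escaped metacharacter: "\." matches a literal dot.
escaped_hits = re.findall(r"\.", "3.14")

# Character class: [ae] matches either "a" or "e".
class_hits = re.findall(r"gr[ae]y", "gray grey groy")

# Capture group: the + quantifier applies to the whole group "ab".
group_match = re.fullmatch(r"(ab)+", "ababab")

print(literal_hits, dot_hits, escaped_hits, class_hits)
print(group_match.group(1))
```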
Anchors
Anchors match a position before or after other characters.
Syntax | Description | Example pattern | Example matches | Example non-matches |
`^` | Match start of line | `^r` | rabbit, raccoon | parrot, ferret |
`$` | Match end of line | `t$` | rabbit, foot | trap, star |
`\A` | Match start of string | `\Ar` | rabbit, raccoon | parrot, ferret |
`\Z` | Match end of string | `t\Z` | rabbit, foot | trap, star |
`\b` | Match characters at the start or end of a word | `\bfox\b` | the red fox ran, the fox ate | foxtrot, foxskin scarf |
`\B` | Match characters in the middle of other word characters | `\Bee\B` | trees, beef | bee, tree |
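The anchor examples can be tried out with a short Python sketch. It reuses the example words from the table; the patterns and the two-line `text` value are assumptions chosen to fit those examples.

```python
import re

words = ["rabbit", "raccoon", "parrot", "ferret"]
starts_with_r = [w for w in words if re.search(r"^r", w)]
ends_with_t = [w for w in ["rabbit", "foot", "trap", "star"] if re.search(r"t$", w)]

# \b anchors at a word boundary, \B anywhere that is not a word boundary.
whole_word = [s for s in ["the fox ate", "foxtrot"] if re.search(r"\bfox\b", s)]
inside_word = [w for w in ["trees", "beef", "bee", "tree"] if re.search(r"\Bee\B", w)]

# With re.MULTILINE, ^ matches at the start of every line,
# while \A still matches only at the start of the whole string.
text = "parrot\nrabbit"
caret_hit = re.search(r"^r", text, re.MULTILINE) is not None
string_start_hit = re.search(r"\Ar", text, re.MULTILINE) is not None

print(starts_with_r, ends_with_t, whole_word, inside_word)
print(caret_hit, string_start_hit)
```

The last two lines show why a flavor offers both `^` and `\A`: they only differ once multi-line mode is switched on.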
Matching types of character
Rather than matching specific characters, you can match specific types of characters such as letters, numbers, and more.
Syntax | Description | Example pattern | Example matches | Example non-matches |
`.` | Match anything except a line break | `c.e` | clean, cheap | acert, cent |
`\d` | Match a digit | `\d` | 6060-842, `2b|^2b` | two, `**___` |
`\D` | Match a non-digit | `\D` | The 5 cats ate | 52, 10032 |
`\w` | Match a word character | `\wee\w` | trees, bee4 | The bee, eels eat meat |
`\W` | Match a non-word character | `\Wbat` | At bat, Swing the bat fast | wombat, bat53 |
`\s` | Match whitespace | `\sfox\s` | the fox ate, his fox ran | it’s the fox., foxfur |
`\S` | Match non-whitespace | `\See\S` | trees, beef | the bee stung, The tall tree |
`\` followed by a metacharacter | Escape a metacharacter to match the metacharacter itself | `\.`, `\^` | The cat ate., 2^3 | the cat ate, 23 |
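As a rough illustration, the shorthand classes above behave as follows in Python's `re` module. The sample sentence is made up for this sketch.

```python
import re

sample = "Order 66: 3 cats, 12 dogs!"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of word characters
pieces = re.split(r"\s+", sample)     # split on runs of whitespace

# Escaping turns a metacharacter back into a literal character.
literal_dot = re.findall(r"\.", "The cat ate.")
literal_caret = re.search(r"\^", "2^3") is not None

print(digits)
print(words)
print(pieces)
print(literal_dot, literal_caret)
```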
Character classes
Character classes are sets or ranges of characters.
Syntax | Description | Example pattern | Example matches | Example non-matches |
`[xy]` | Match any one of several characters | `gr[ea]y` | gray, grey | green, greek |
`[x-y]` | Match a range of characters | `[a-e]` | amber, brand | fox, join |
`[^xy]` | Match any character except the ones listed | `gre[^y]` | green, greek | gray, grey |
`[\^\-]` | Match metacharacters inside the character class | `4[\^\.\-+*/]\d` | 4^3, 4.2 | 44, 23 |
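A small Python sketch of sets, ranges, and negated sets. The patterns are assumptions chosen to fit the example words above.

```python
import re

# A set: either "e" or "a" between "gr" and "y".
either = re.findall(r"gr[ea]y", "gray grey green")

# A range: any letter from a to e.
in_range = re.findall(r"[a-e]", "amber")

# A negated set: anything that is neither a vowel nor whitespace.
consonants = re.findall(r"[^aeiou\s]", "fox join")

# Metacharacters inside a class: ^ and - are escaped to be safe.
operators = re.findall(r"4[\^.\-+*/]\d", "4^3 4.2 44 23")

print(either, in_range, consonants, operators)
```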
Repetition
Rather than matching single instances of characters, you can match repeated characters.
Syntax | Description | Example pattern | Example matches | Example non-matches |
`x*` | Match zero or more times | `ar*o` | cacao, carrot | arugula, artichoke |
`x+` | Match one or more times | `re+` | green, tree | trap, ruined |
`x?` | Match zero or one times | `ro?a` | roast, rant | root, rear |
`x{m}` | Match m times | `\we{2}\w` | deer, seer | red, enter |
`x{m,}` | Match m or more times | `2{3,}4` | 671-2224, 2222224 | 224, 123 |
`x{m,n}` | Match between m and n times | `12{1,3}3` | 1234, 1222384 | 15335, 1222223 |
`x*?`, `x+?` | Match the minimum number of times (known as a lazy quantifier) | `re+?` | tree, freeeee | trout, roasted |
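The difference between a greedy and a lazy quantifier only shows up in how much text is matched, not in whether a match exists. The following Python sketch illustrates that, along with the counted quantifiers; the HTML snippet is an invented example.

```python
import re

zero_or_more = re.findall(r"ar*o", "cacao carrot")
at_least_three = re.findall(r"2{3,}4", "671-2224 224")
one_to_three = re.findall(r"12{1,3}3", "1234 1222384 1222223")

# Greedy: take as many characters as possible.
greedy = re.search(r"e+", "freeeee").group()
# Lazy: take as few characters as possible.
lazy = re.search(r"e+?", "freeeee").group()

# A common use of lazy matching: stop at the first closing bracket.
greedy_tags = re.findall(r"<.+>", "<b>bold</b>")
lazy_tags = re.findall(r"<.+?>", "<b>bold</b>")

print(zero_or_more, at_least_three, one_to_three)
print(greedy, lazy)
print(greedy_tags, lazy_tags)
```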
Capturing, alternation & backreferences
In order to extract specific parts of a string, you can capture those parts, and even name the parts that you captured.
Syntax | Description | Example pattern | Example matches | Example non-matches |
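`(x)` | Capture a pattern | `(iss)+` | Mississippi, missed | mist, persist |
`(?:x)` | Create a group without capturing | `(?:ab)(cd)` | Match: abcd, Group 1: cd | acbd |
`(?<name>x)` | Create a named capture group (Python writes this as `(?P<name>x)`) | `(?<first>\d)(?<second>\d)\d*` | Match: 1325, first: 1, second: 3 | 2, hello |
`(x|y)` | Match several alternative patterns | `(re|ba)` | red, banter | rant, bear |
`\n` | Reference previous captures, where n is the group index starting at 1 | `(b)(\w*)\1` | blob, bribe | bear, bring |
`\k<name>` | Reference named captures (Python writes this as `(?P=name)`) | `(?<first>5)(\d*)\k<first>` | 51245, 55 | 523, 51 |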
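Here is a minimal sketch of groups and backreferences in Python. Note that Python's `re` module spells named groups `(?P<name>...)` and named backreferences `(?P=name)`; the example strings come from the table above.

```python
import re

# Non-capturing group (?:ab) groups without creating a numbered capture.
m = re.search(r"(?:ab)(cd)", "abcd")

# Named groups can be read back by name.
n = re.search(r"(?P<first>\d)(?P<second>\d)\d*", "1325")

# Alternation: either "re" or "ba".
alternatives = [w for w in ["red", "banter", "rant", "bear"] if re.search(r"re|ba", w)]

# \1 must repeat whatever group 1 captured (here the letter b).
numbered = [w for w in ["blob", "bribe", "bear", "bring"] if re.search(r"(b)(\w*)\1", w)]

# (?P=first) must repeat whatever the group named "first" captured.
named = [s for s in ["51245", "55", "523", "51"]
         if re.search(r"(?P<first>5)\d*(?P=first)", s)]

print(m.group(0), m.groups())
print(n.group("first"), n.group("second"))
print(alternatives, numbered, named)
```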
Lookahead and lookbehind
You can specify that certain characters must, or must not, appear before or after your match, without including those characters in the match.
Syntax | Description | Example pattern | Example matches | Example non-matches |
`(?=x)` | Look ahead at the next characters without using them in the match | `an(?=an)`, `iss(?=ipp)` | banana, Mississippi | band, missed |
`(?!x)` | Look ahead at the next characters to not match on | `ai(?!n)` | fail, brail | faint, train |
`(?<=x)` | Look at the previous characters for a match without using them in the match | `(?<=tr)a` | trail, translate | bear, streak |
`(?<!x)` | Look at the previous characters to not match on | `(?<!tr)a` | bear, translate | trail, strained |
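The key property is that the looked-at characters are checked but not consumed. The Python sketch below shows this by printing the matched text and its span; the price sentence is an invented example.

```python
import re

# Positive lookahead: "an" only when another "an" follows.
ahead = re.findall(r"an(?=an)", "banana")

# Negative lookahead: "ai" only when no "n" follows.
not_ahead = re.findall(r"ai(?!n)", "fail faint")

# Positive lookbehind: the match is just "a"; "tr" is not part of it.
behind = re.search(r"(?<=tr)a", "trail")

# Negative lookbehind: an "a" that is not preceded by "tr".
not_behind = [w for w in ["bear", "trail", "strained"] if re.search(r"(?<!tr)a", w)]

# Practical use: digits that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $25 and 30 items")

print(ahead, not_ahead)
print(behind.group(), behind.span())
print(not_behind, prices)
```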
Literal matches and modifiers
Modifiers are settings that change the way the matching rules work.
Syntax | Description | Example pattern | Example matches | Example non-matches |
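`\Qx\E` | Match everything between `\Q` and `\E` literally, so metacharacters lose their special meaning | `\Qtell \d\E` | tell \d | I’ll tell you this, I have 5 coins |
`(?i)x(?-i)` | Set the regex to case-insensitive | `(?i)te(?-i)` | sTep, tEach | Trench, bear |
`(?x)x(?-x)` | Make the regex ignore whitespace in the pattern | `(?x)t a p(?-x)` | tap, tapdance | c a t, rot a potato |
`(?s)x(?-s)` | Turn on single-line (DOTALL) mode, which makes `.` match newline characters (\n) in addition to everything else |  | first and Second and third | first and second and third |
`(?m)x(?-m)` | Turn on multi-line mode, which makes ^ and $ match at the start and end of each line rather than only at the start and end of the string | `(?m)^eat and sleep$` | eat and sleep\neat and sleep | treat and sleep\neat and sleep. |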
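In Python, the same modifiers are usually passed as flags, and `re.escape` stands in for literal quoting because Python's `re` module has no `\Q...\E`. The sketch below is one way to reproduce the behaviors described above; the test strings are assumptions.

```python
import re

# Literal matching: re.escape neutralizes metacharacters such as "\d".
literal = re.escape(r"\d")
literal_hit = re.search(literal, r"tell \d") is not None
literal_miss = re.search(literal, "I have 5 coins") is None

# Case-insensitive matching.
any_case = re.findall(r"te", "sTep tEach Trench", re.IGNORECASE)

# Verbose mode: whitespace and comments inside the pattern are ignored.
tap = re.compile(r"""
    t a p   # spaces and this comment are not part of the pattern
""", re.VERBOSE)
verbose_hit = tap.search("tapdance") is not None
verbose_miss = tap.search("t a p") is None

# DOTALL: "." also matches a newline.
dot_default = re.search(r"first.second", "first\nsecond") is not None
dot_all = re.search(r"first.second", "first\nsecond", re.DOTALL) is not None

# MULTILINE: ^ and $ match at every line start and end.
lines = "eat and sleep\neat and sleep"
single = re.findall(r"^eat and sleep$", lines)
multi = re.findall(r"^eat and sleep$", lines, re.MULTILINE)

print(literal_hit, literal_miss, any_case)
print(verbose_hit, verbose_miss, dot_default, dot_all)
print(single, multi)
```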
Unicode
Regular expressions can work beyond the Roman alphabet, with things like Chinese characters or emoji.
- Code Points: A code point is the number, usually written in hexadecimal, that represents an abstract character in a system like Unicode.
- Graphemes: A grapheme is what a reader perceives as a single character. Each grapheme is made up of one or more code points in a sequence.
Syntax | Description | Example pattern | Example matches | Example non-matches |
`\X` | Match a grapheme | `\Xgmail` | @gmail, www.email@gmail | gmail, @aol |
`\uXXXX` | Match special characters, like ones with an accent, by code point | `\u00E8` | è | e |
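The difference between code points and graphemes can be seen in a short Python sketch. Python's built-in `re` module understands `\uXXXX` escapes but, as far as I know, not `\X` (the third-party `regex` package offers it), so this example uses `unicodedata` to normalize the text instead.

```python
import re
import unicodedata

composed = "\u00e8"      # è stored as a single code point
decomposed = "e\u0300"   # e followed by a combining grave accent

# Both render as the same grapheme but differ in code point count.
lengths = (len(composed), len(decomposed))

# A code point pattern matches only the composed form.
hits_composed = re.fullmatch(r"\u00e8", composed) is not None
hits_decomposed = re.fullmatch(r"\u00e8", decomposed) is not None

# Normalizing to NFC first makes both forms match.
normalized = unicodedata.normalize("NFC", decomposed)
hits_normalized = re.fullmatch(r"\u00e8", normalized) is not None

# U+0040 is the @ sign.
at_sign = re.search(r"\u0040gmail", "www.email@gmail") is not None

print(lengths, hits_composed, hits_decomposed, hits_normalized, at_sign)
```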