Regular Expressions (RegEx)

Basic introduction to Regular Expressions

Overview

Regular expressions are a way to identify patterns in data. A Regular Expression (RegEx) uses a sequence of characters to specify a search pattern, and it can represent something as simple as "every value containing a zero" or something as complicated as "every value that's between 10 and 12 characters in length and doesn't contain a capital A, a lowercase g, or a zero".
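As a rough illustration, the two patterns described above could be written as follows. This is a minimal sketch in Python, chosen only for demonstration (it is not assumed to be what Data-flo uses internally), and the sample values are made up.

```python
import re

# "every value containing a zero"
contains_zero = re.compile(r"0")

# "between 10 and 12 characters long, with no capital A, lowercase g, or zero"
# [^Ag0] is any character except A, g, or 0; {10,12} repeats it 10 to 12 times;
# ^ and $ anchor the pattern to the whole value.
no_a_g_zero = re.compile(r"^[^Ag0]{10,12}$")

# Made-up sample values
values = ["sample-10", "xyz", "bcdefhijkl", "bcdefghijkl", "Abcdefhijkl"]

with_zero = [v for v in values if contains_zero.search(v)]
strict = [v for v in values if no_a_g_zero.search(v)]

print(with_zero)  # ['sample-10']
print(strict)     # ['bcdefhijkl']
```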

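The backslash \ character is an "escape" character, which signals that whatever comes after it should be treated specially: it gives an ordinary letter a special meaning (\d means any digit) or removes the special meaning of a symbol (\. means a literal full stop). This means . returns different results than \. , as shown in the structures guide below.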
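The difference between . and \. can be seen in a short sketch. Python is used here purely for illustration, and the sample text is invented.

```python
import re

text = "v1.2 v1x2"

# Unescaped: . matches any character, so both "1.2" and "1x2" are found.
any_char = re.findall(r"1.2", text)

# Escaped: \. matches only a literal full stop.
literal_dot = re.findall(r"1\.2", text)

print(any_char)     # ['1.2', '1x2']
print(literal_dot)  # ['1.2']
```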

Regular Expressions can be quite overwhelming at first, but most standard needs in Data-flo will be accomplished using a small number of structures. There are numerous resources online for learning RegEx, including a straightforward tutorial at RegexOne.

Basic RegEx structures guide

The following table shows the structure of a piece of RegEx, what that structure represents, and examples of what it might match. See below for specific examples showing real-world use of these structures in Data-flo adaptor arguments.

RegEx structure  RegEx meaning                                           Examples
abcABC...        letters                                                 Text
123...           digits (numbers)                                        9825
\d               any digit (number)                                      4
\D               any non-digit character                                 B; or _
.                any single character                                    4; or B; or _
\.               full stop (period)                                      .
[abc]            only a, b, or c                                         [gb]et matches get and bet, but doesn't match let or net
[^abc]           not a, b, or c                                          [^ln]et matches get and bet, but doesn't match let or net
[a-z]            characters a to z                                       [a-z]101 matches m101 but not 2101
[0-9]            numbers 0 to 9                                          [0-9]101 matches 2101 but not a101
\w               any alphanumeric character or underscore                a; or T; or 7; or _
\W               any character except alphanumerics and underscore       @; or -
{m}              m repetitions                                           a{3} matches aaa; [wxy]{3} can match www, xxx, wyy, etc.; [0-9]{2} matches any two-digit number
{m,n}            m to n repetitions                                      a{2,4} matches aa or aaa or aaaa; .{2,3} matches any two- or three-character string
*                zero or more repetitions                                AB* matches A or AB or ABBB
+                one or more repetitions                                 AB+ matches AB or ABB or ABBB, but not A or BA
?                optional character                                      ba?123 matches ba123 or b123 but not a123
\s               any whitespace (space, tab, new-line, carriage return)  a\sb matches a b
\S               any non-whitespace character                            a\Sb matches aab but not a b
^...$            start and end anchors (beginning and end of a field)    ^123$ matches 123 but not 1123 or 1233
(...)            capture group
(a(bc))          capture sub-group
(.*)             capture all
(abc|def)        matches abc or def

Note: $ means something different in a reference than in a pattern. In a pattern it anchors the end of the field; in a reference (for example $1) it refers to a specific capture group.
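Several of the structures in the guide can be checked with a few lines of code. The sketch below uses Python's re module as a stand-in for whatever engine Data-flo runs (an assumption); the helper name matches is hypothetical.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Return True if the pattern is found anywhere in the value."""
    return re.search(pattern, value) is not None

# Character sets and ranges
assert matches(r"[gb]et", "get") and not matches(r"[gb]et", "let")
assert matches(r"[^ln]et", "bet") and not matches(r"[^ln]et", "net")
assert matches(r"[a-z]101", "m101") and not matches(r"[a-z]101", "2101")

# Quantifiers
assert matches(r"^a{2,4}$", "aaa") and not matches(r"^a{2,4}$", "a")
assert matches(r"ba?123", "b123") and not matches(r"ba?123", "a123")

# Anchors
assert matches(r"^123$", "123") and not matches(r"^123$", "1123")

# Capture groups: group(1) is the outer group, group(2) the nested sub-group
m = re.search(r"(a(bc))", "xabcx")
outer, inner = m.group(1), m.group(2)
print(outer, inner)  # abc bc

# Alternation: either abc or def
alt = re.findall(r"(abc|def)", "abc-def-ghi")
print(alt)  # ['abc', 'def']
```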

Examples

/^.+-CGPS+-[0-9]{6}/ matches someamountoftext-CGPS-123454, someamountoftext-CGPS-1234540, x-CGPS-0000000000, and x-CGPS-CGPS-000000. In every case, the start of the line ^ is followed by any character . one or more times + , then the literal text -CGP, then the letter S one or more times + (the + applies only to the character immediately before it, so CGPSSS would also match), then a hyphen - , then six {6} digits [0-9]. The pattern doesn't specify what happens after the six digits, so extra digits are allowed. In x-CGPS-CGPS-000000, the leading .+ absorbs x-CGPS, leaving -CGPS-000000 to match the rest of the pattern.

If a dollar sign is added at the end, the six digits must be followed immediately by the end of the line, so /^.+-CGPS+-[0-9]{6}$/ matches someamountoftext-CGPS-123454 but not someamountoftext-CGPS-1234540 .
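The behaviour of both patterns can be checked with a small sketch. Python is used for illustration only; the last two sample values are invented to show that the + after S repeats only the S, and that .+ needs at least one character before -CGPS.

```python
import re

unanchored = re.compile(r"^.+-CGPS+-[0-9]{6}")
anchored = re.compile(r"^.+-CGPS+-[0-9]{6}$")

samples = [
    "someamountoftext-CGPS-123454",
    "someamountoftext-CGPS-1234540",
    "x-CGPS-CGPS-000000",
    "x-CGPSSS-000000",  # the + repeats only the S
    "-CGPS-123454",     # nothing before -CGPS, so .+ has nothing to match
]

# For each sample: (matches without $, matches with $)
results = {
    s: (bool(unanchored.search(s)), bool(anchored.search(s)))
    for s in samples
}

for s, (without_end, with_end) in results.items():
    print(f"{s!r}: without $ = {without_end}, with $ = {with_end}")
```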

Converting lat/long to negative numbers

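/(.+)[W|w|West|WEST|west|S|s|South|south]/ matches 100.67W and 90.7 S and can be used as the pattern when converting latitude and longitude to negative numbers, with the replacement value -$1 turning those values into -100.67 and -90.7 respectively (the second keeps the space that preceded the S). Note that square brackets define a set of single characters, not a list of words, so this pattern really means "any characters followed by one of W, w, e, s, t, E, S, T, o, u, h, or |". It works for single-letter suffixes such as W and S, but it also matches values ending in E and removes only the last letter of a word such as West. To match whole words, one possible alternative is an anchored pattern with a group, such as /^([0-9.]+)\s*(W|w|West|WEST|west|S|s|South|south)$/, used with the same replacement -$1.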
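The replacement can be tried out in a short sketch. Python is used for illustration, and it writes the capture-group reference as \1 where the replacement above uses $1. The second pattern, the helper name to_negative, and the sample values other than 100.67W and 90.7 S are assumptions added here for comparison; they are not taken from Data-flo.

```python
import re

# Pattern as given in the text. The square brackets form a set of single
# characters, so only one trailing character is consumed.
original = re.compile(r"(.+)[W|w|West|WEST|west|S|s|South|south]")

# Hypothetical stricter variant: a group with | alternation matches whole
# words, \s* absorbs an optional space, and ^...$ anchors the whole field.
strict = re.compile(r"^([0-9.]+)\s*(W|w|West|WEST|west|S|s|South|south)$")

def to_negative(pattern: re.Pattern, value: str) -> str:
    """Replace a match with a minus sign followed by capture group 1."""
    return pattern.sub(r"-\1", value)

print(repr(to_negative(original, "100.67W")))    # '-100.67'
print(repr(to_negative(original, "90.7 S")))     # '-90.7 ' (trailing space kept)
print(repr(to_negative(strict, "90.7 S")))       # '-90.7'
print(repr(to_negative(strict, "100.67 West")))  # '-100.67'
print(repr(to_negative(strict, "45.2E")))        # '45.2E' (left unchanged)
```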

Select and reference everything in a field

  • pattern is everything in the field: /^(.+)$/

    • ^ and $ anchor the start and end of the field, () designates a capture group to reference in the replacement, and .+ means any character, one or more times.

  • replacement is #$1 (where $1 means everything in the first capture group, which here is everything you've selected)

  • This example shows the two different uses of the dollar sign $ character. In the pattern, it means the end of the field. In the reference, it signifies a capture group.
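A minimal sketch of this pattern and replacement follows, using Python for illustration. Python writes the reference as \1 rather than $1, and the sample values are invented.

```python
import re

values = ["ABC123", "sample 7", ""]

# ^(.+)$ captures the whole field; the replacement puts # in front of it.
# An empty field is left alone because .+ needs at least one character.
prefixed = [re.sub(r"^(.+)$", r"#\1", v) for v in values]

print(prefixed)  # ['#ABC123', '#sample 7', '']
```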

External Resources (unrelated to CGPS)

RegexOne

To get a step-by-step walk-through of how Regular Expressions work, and more information about the structures involved, visit https://regexone.com/.

Autoregex

This resource (www.autoregex.xyz) allows you to write plain English and return RegEx, which can be a good way to familiarize yourself with the concepts and get started creating a complicated pattern.

RegEx101

Regex101 is a good place to test and debug RegEx functionality, although some users find the interface unintuitive.

RegExr

Another place to build, test, and debug RegEx is https://regexr.com/.
