Regular Expressions (RegEx)
Basic introduction to Regular Expressions
Overview
Regular expressions are a way to identify patterns in data. A Regular Expression (RegEx) uses a sequence of characters to specify a search pattern, and it can represent something as simple as "every value containing a zero" or something as complicated as "every value that's between 10 and 12 characters in length and doesn't contain a capital A, a lowercase g, or a zero".
The backslash \ character is an "escape" character, which signals that whatever comes after it should be treated specially. (This means . returns different results than \. , as shown in the structures guide below.)
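As a rough illustration of the difference between . and \. , here is a minimal sketch using Python's re module as a stand-in engine. The sample values are made up for the illustration, and Data-flo's own RegEx engine may differ in details.

```python
import re

# Hypothetical sample values, chosen only to show the difference.
values = ["3.14", "3x14", "314"]

# An unescaped dot matches any single character, so "3x14" also qualifies.
any_char = [v for v in values if re.search(r"3.14", v)]

# An escaped dot matches only a literal full stop.
literal_dot = [v for v in values if re.search(r"3\.14", v)]

print(any_char)     # ['3.14', '3x14']
print(literal_dot)  # ['3.14']
```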
Regular Expressions can be quite overwhelming at first, but most standard needs in Data-flo will be accomplished using a small number of structures. There are numerous resources online for learning RegEx, including a straightforward tutorial at RegexOne.
Basic RegEx structures guide
The following table shows each RegEx structure, what that structure represents, and examples of what it might match. See below for specific examples showing real-world use of these structures in Data-flo adaptor arguments.
Structure | Represents | Example
abcABC... | letters | Text
123... | digits (numbers) | 9825
\d | any digit (number) | 4
\D | any non-digit character | B; or _
. | any character | 4; or B; or _
\. | full stop (period) | .
[abc] | only a, b, or c | [gb]et matches get and bet, but doesn't match let or net
[^abc] | not a, b, or c | [^ln]et matches get and bet, but doesn't match let or net
[a-z] | characters a to z | [a-z]101 matches m101 but not 2101
[0-9] | numbers 0 to 9 | [0-9]101 matches 2101 but not a101
\w | any alphanumeric character or underscore | a; or T; or 7; or _
\W | any character that is not alphanumeric or an underscore | @; or -
{m} | m repetitions | a{3} matches aaa; [wxy]{3} can match www, xxx, wyy, etc.; [0-9]{2} matches any two-digit number
{m,n} | m to n repetitions | a{2,4} matches aa or aaa or aaaa; .{2,3} matches any two- or three-character string
* | zero or more repetitions | ab*c matches ac or abc or abbbc
+ | one or more repetitions (of the item immediately before it) | AB+ matches AB or ABB or ABBB, but not A or BA
? | optional character | ba?123 matches ba123 or b123 but not a123
\s | any whitespace (space, tab, new-line, carriage return) | a\sb matches a b
\S | any non-whitespace character (anything but space, tab, new-line, carriage return) | a\Sb matches aab but not a b
^...$ | starts and ends (anchors to the beginning and end of a field) | ^123$ matches 123 but not 1123 or 1233
(...) | capture group
(a(bc)) | capture sub-group
(.*) | capture all
(abc|def) | matches abc or def
Note: $ means something different in a reference than in a pattern. In a pattern, $ anchors to the end of the field; in a reference, $ followed by a number (such as $1) refers to a specific capture group.
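To try several of these structures side by side, the following sketch checks a handful of the guide's examples with Python's re module. It assumes the whole value must match the pattern (re.fullmatch); Data-flo's engine may instead search within a value, so treat this as an approximation rather than a description of Data-flo's behaviour.

```python
import re

# (pattern, text, expected) triples based on examples from the guide.
checks = [
    (r"[gb]et", "get", True),
    (r"[gb]et", "let", False),
    (r"[^ln]et", "bet", True),
    (r"[^ln]et", "net", False),
    (r"[a-z]101", "m101", True),
    (r"[a-z]101", "2101", False),
    (r"[0-9]{2}", "42", True),
    (r"a{2,4}", "aaa", True),
    (r"a{2,4}", "a", False),
    (r"AB+", "AB", True),
    (r"AB+", "BA", False),
    (r"ba?123", "b123", True),
    (r"ba?123", "a123", False),
    (r"a\sb", "a b", True),
    (r"a\Sb", "aab", True),
    (r"a\Sb", "a b", False),
    (r"^123$", "123", True),
    (r"^123$", "1123", False),
    (r"(abc|def)", "def", True),
]

# Collect any case where the engine disagrees with the expected outcome.
mismatches = [
    (pattern, text)
    for pattern, text, expected in checks
    if bool(re.fullmatch(pattern, text)) != expected
]
print(mismatches)  # []

# Nested parentheses create a group and a sub-group, numbered left to right.
groups = re.fullmatch(r"(a(bc))", "abc").groups()
print(groups)  # ('abc', 'bc')
```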
Examples
/^.+-CGPS+-[0-9]{6}/
matches someamountoftext-CGPS-123454
and matches someamountoftext-CGPS-1234540
and matches x-CGPS-0000000000
and matches x-CGPS-CGPS-000000
because in all cases, the start of the line ^ is followed by any character . one or more times + , followed by the very specific text -CGPS (the + after it applies only to the S, so one or more S characters are accepted), followed by a hyphen - and any six {6} digits [0-9] (the pattern doesn't specify what happens after the six digits). In x-CGPS-CGPS-000000, the first -CGPS is absorbed by .+ ; repeating the whole text -CGPS would require a group, as in (-CGPS)+ .
If a dollar sign is added at the end, it signifies that there are six digits and that's the end of the line, so /^.+-CGPS+-[0-9]{6}$/ matches someamountoftext-CGPS-123454 but not someamountoftext-CGPS-1234540.
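The effect of the trailing dollar sign can be checked with a short sketch. It uses Python's re module as a stand-in for Data-flo's engine (an assumption), and the sample x-CGPSSS-123456 is a made-up value added to show that the + directly after S accepts repeated S characters.

```python
import re

unanchored = re.compile(r"^.+-CGPS+-[0-9]{6}")
anchored = re.compile(r"^.+-CGPS+-[0-9]{6}$")

samples = [
    "someamountoftext-CGPS-123454",
    "someamountoftext-CGPS-1234540",
    "x-CGPS-0000000000",
    "x-CGPS-CGPS-000000",
    "x-CGPSSS-123456",  # hypothetical value: repeated S is accepted by S+
]

# Map each sample to (matches without $, matches with $).
results = {
    s: (bool(unanchored.search(s)), bool(anchored.search(s)))
    for s in samples
}

for sample, (without_dollar, with_dollar) in results.items():
    print(sample, without_dollar, with_dollar)
```

Every sample matches the unanchored pattern, while the anchored pattern rejects the values that carry more than six digits at the end.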
Converting lat/long to negative numbers
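/^([0-9.]+)\s*(W|w|West|WEST|west|S|s|South|south)$/
matches 100.67W and matches 90.7 S and can be used as the pattern when converting latitude and longitude to negative numbers, with the replacement value -$1 turning those values into -100.67 and -90.7 respectively.
The direction words are listed in parentheses separated by | rather than in square brackets, because square brackets define a set of single characters: [W|w|West] matches one character out of W, |, w, e, s, and t, not the word West. The \s* keeps an optional space before the direction out of the capture group, so it does not end up in the converted value. Inside square brackets, . is a literal full stop and does not need escaping.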
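The conversion can be sketched as follows. This sketch uses a variant of the pattern that lists the directions in a parenthesised alternation and anchors both ends; that variant, the helper name to_negative, and the use of Python's re module are all assumptions made for illustration. Note that Python writes the $1 reference as \1.

```python
import re

# Variant pattern (an assumption): digits and full stops, an optional space,
# then one of the listed west/south markers at the end of the value.
pattern = re.compile(r"^([0-9.]+)\s*(W|w|West|WEST|west|S|s|South|south)$")


def to_negative(value: str) -> str:
    """Return the value as a negative number if it ends in a W/S marker."""
    # \1 is Python's spelling of the $1 capture-group reference.
    return pattern.sub(r"-\1", value)


print(to_negative("100.67W"))    # -100.67
print(to_negative("90.7 S"))     # -90.7
print(to_negative("12.5 West"))  # -12.5
print(to_negative("45.2N"))      # 45.2N (no match, left unchanged)
```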
Select and reference everything in a field
The pattern is everything in the field:
/^(.+)$/
^ and $ anchor the start and end of the field, () designate a capture group to reference in the replacement, and .+ means any characters any number of times (at least once).
The replacement is
#$1
(where $1 means everything in the first capture group, which here is everything you've selected).
This example shows the two different uses of the dollar sign $ character. In the pattern, it means the end of the field. In the reference, it signifies a capture group.
External Resources (unrelated to CGPS)
RegexOne
To get a step-by-step walk-through of how Regular Expressions work, and more information about the structures involved, visit https://regexone.com/.
Autoregex
This resource (www.autoregex.xyz) allows you to write plain English and return RegEx, which can be a good way to familiarize yourself with the concepts and get started creating a complicated pattern.
RegEx101
Regex101 is a good place to test and debug RegEx functionality, although some users find the interface unintuitive.
RegExr
Another place to build, test, and debug RegEx is https://regexr.com/.