Regular Expressions (RegEx)
Basic introduction to Regular Expressions
Overview
Regular expressions are a way to identify patterns in data. A Regular Expression (RegEx) uses a sequence of characters to specify a search pattern. It can represent something as simple as "every value containing a zero" or something as complicated as "every value that's between 10 and 12 characters long and doesn't contain a capital A, a lowercase g, or a zero".
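As a rough illustration, the two patterns described above could be written as shown below. This is only a sketch using Python's re module, chosen for demonstration; the sample values are made up, and the second pattern is one possible reading of the description.

```python
import re

# "Every value containing a zero": the single character 0, anywhere in the value.
contains_zero = re.compile(r"0")

# "Between 10 and 12 characters, with no capital A, lowercase g, or zero":
# a negated character class repeated 10 to 12 times, anchored to the whole value.
no_a_g_zero = re.compile(r"^[^Ag0]{10,12}$")

samples = ["abc012", "abc", "hello-world", "hello0world", "short"]
for value in samples:
    print(value, bool(contains_zero.search(value)), bool(no_a_g_zero.search(value)))
```

Each printed row shows the value, whether it contains a zero, and whether it satisfies the length and excluded-character rule.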
The backslash \ character is an "escape" character, which signals that whatever comes after it should be treated specially. This means that . returns different results than \. (as shown in the structures guide below).
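The difference between . and \. can be seen in a small sketch. Python's re module is used here purely for illustration, and the sample values are invented.

```python
import re

values = ["3.14", "3x14", "314"]

# Unescaped: . matches any single character, so "3.14" and "3x14" both match.
any_char = [v for v in values if re.search(r"3.14", v)]

# Escaped: \. matches only a literal full stop, so only "3.14" matches.
literal_dot = [v for v in values if re.search(r"3\.14", v)]

print(any_char)
print(literal_dot)
```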
Regular Expressions can be quite overwhelming at first, but most standard needs in Data-flo will be accomplished using a small number of structures. There are numerous resources online for learning RegEx, including a straightforward tutorial at RegexOne.
Basic RegEx structures guide
The following table shows the structure of a piece of RegEx, what that structure represents, and examples of what it might match. See below for specific examples showing real-world use of these structures in Data-flo adaptor arguments.
RegEx structure | RegEx meaning | Examples |
---|---|---|
abcABC... | letters | Text |
123... | digits (numbers) | 9825 |
\d | any digit (number) | 4 |
\D | any non-digit character | B; or _ |
. | any character | 4; or B; or _ |
\. | full stop (period) | . |
[abc] | only a, b, or c | [gb]et matches get and bet, but doesn't match let or net |
[^abc] | any character except a, b, or c | [^ln]et matches get and bet, but doesn't match let or net |
[a-z] | characters a to z | [a-z]101 matches m101 but not 2101 |
[0-9] | numbers 0 to 9 | [0-9]101 matches 2101 but not a101 |
\w | any alphanumeric character or underscore | a; or T; or 7; or _ |
\W | any character that is not alphanumeric or an underscore | @; or - |
{m} | m repetitions | a{3} matches aaa; [wxy]{3} can match www, xxx, wyy, etc.; [0-9]{2} matches any two-digit number |
{m,n} | m to n repetitions | a{2,4} matches aa or aaa or aaaa; .{2,3} matches any two- or three-character string |
* | zero or more repetitions | |
+ | one or more repetitions of the preceding character or group | AB+ matches AB or ABB or ABBB, but not A or BA; (AB)+ matches AB or ABAB |
? | optional character | ba?123 matches ba123 or b123 but not a123 |
\s | any whitespace (space, tab, new-line, carriage return) | a\sb matches a b but not ab |
\S | any non-whitespace character (anything but space, tab, new-line, carriage return) | a\Sb matches aab but not a b |
^...$ | starts and ends (anchors to the beginning and end of a field) (Note: $ in a reference is different than in a pattern; in a reference, $ references a specific capture group) | ^123$ matches 123 but not 1123 or 1233 |
(...) | capture group | |
(a(bc)) | capture sub-group | |
(.*) | capture all | |
(abc\|def) | matches abc or def | |
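Several of the table's examples can be checked with a short script. This is a sketch in Python's re module, used only for illustration; the capture-group input string "xabcx" is a made-up value, and Data-flo's own behaviour is assumed rather than verified here.

```python
import re

# (pattern, value, expected) triples taken from the table's examples.
checks = [
    (r"[gb]et", "get", True),
    (r"[gb]et", "let", False),
    (r"[^ln]et", "bet", True),
    (r"[^ln]et", "net", False),
    (r"[a-z]101", "m101", True),
    (r"[a-z]101", "2101", False),
    (r"a{2,4}", "aaa", True),
    (r"ba?123", "b123", True),
    (r"^123$", "1123", False),
]

for pattern, value, expected in checks:
    matched = re.search(pattern, value) is not None
    status = "as expected" if matched == expected else "unexpected"
    print(pattern, value, matched, status)

# Capture groups are numbered by their opening parenthesis, left to right.
nested = re.search(r"(a(bc))", "xabcx")
print(nested.group(1), nested.group(2))
```

In the last line, group 1 is the outer group (abc) and group 2 is the sub-group (bc).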
Examples
/^.+-CGPS+-[0-9]{6}/
matches someamountoftext-CGPS-123454, someamountoftext-CGPS-1234540, x-CGPS-0000000000, and x-CGPS-CGPS-000000.
In all cases, the start of the line ^ is followed by any character . one or more times +, followed by the very specific text -CGPS (the + after it applies only to the S, so one or more S characters are allowed), followed by a hyphen - and any six {6} digits [0-9]. The pattern doesn't specify what happens after the six digits, so extra trailing characters are allowed. In x-CGPS-CGPS-000000, the first -CGPS is absorbed by .+ because . matches any character.
If a dollar sign is added at the end, it signifies that the six digits are the end of the line, so /^.+-CGPS+-[0-9]{6}$/ matches someamountoftext-CGPS-123454 but not someamountoftext-CGPS-1234540.
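The effect of the trailing $ can be checked with a small sketch. Python's re module is used for illustration only. The last sample value is an invented addition that shows the + after -CGPS repeating only the S.

```python
import re

unanchored = re.compile(r"^.+-CGPS+-[0-9]{6}")
anchored = re.compile(r"^.+-CGPS+-[0-9]{6}$")

values = [
    "someamountoftext-CGPS-123454",
    "someamountoftext-CGPS-1234540",
    "x-CGPS-0000000000",
    "x-CGPS-CGPS-000000",
    "x-CGPSSS-123456",  # hypothetical value: S+ accepts repeated S characters
]

# Print each value with its result for the unanchored and anchored patterns.
for value in values:
    print(value, bool(unanchored.search(value)), bool(anchored.search(value)))
```

Every value matches the unanchored pattern, while the anchored pattern rejects the two values with more than six trailing digits.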
Converting lat/long to negative numbers
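/^(.+?)\s*(?:W|w|West|WEST|west|S|s|South|south)$/
matches 100.67W and matches 90.7 S, and can be used as the pattern when converting western longitudes and southern latitudes to negative numbers. With the replacement value -$1, those values become -100.67 and -90.7 respectively.
The direction alternatives must sit inside a group (...) rather than square brackets. Written as [W|w|West|WEST|west|S|s|South|south], they would form a character class that matches only one character from the listed letters (or a literal |), so a value such as 100.67West would keep part of the word in the captured number. The \s* before the group absorbs an optional space so that it is not carried into the result.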
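The conversion can be sketched as follows, using Python's re module for illustration. The pattern here is a grouped form with the direction words in a non-capturing group and optional whitespace before them; treat it as an assumption rather than the exact pattern Data-flo requires. The function name to_negative and the sample values are hypothetical.

```python
import re

# Direction words in a non-capturing group; \s* absorbs an optional space
# before the direction so it is not carried into the captured number.
pattern = re.compile(r"^(.+?)\s*(?:W|w|West|WEST|west|S|s|South|south)$")

def to_negative(value: str) -> str:
    # Python writes the first capture group as \1 where the guide writes $1.
    return pattern.sub(r"-\1", value)

for value in ["100.67W", "90.7 S", "12.5 West", "45.2N"]:
    print(value, "->", to_negative(value))
```

Values without a western or southern direction, such as 45.2N, do not match and are returned unchanged.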
Select and reference everything in a field
The pattern is everything in the field:
/^(.+)$/
^ and $ anchor the start and end of the field, () designates a capture group to reference in the replacement, and .+ means any character one or more times.
The replacement is
#$1
where $1 means everything in the first capture group, which here is everything you've selected.
This example shows the two different uses of the dollar sign $ character. In the pattern, it means the end of the field. In the reference, it signifies a capture group.
External Resources (unrelated to CGPS)
RegexOne
To get a step-by-step walk-through of how Regular Expressions work, and more information about the structures involved, visit https://regexone.com/.
Autoregex
This resource (www.autoregex.xyz) allows you to write plain English and return RegEx, which can be a good way to familiarize yourself with the concepts and get started creating a complicated pattern.
RegEx101
Regex101 is a good place to test and debug RegEx functionality, although some users find the interface unintuitive.
RegExr
Another place to build, test, and debug RegEx is https://regexr.com/.