Regular Expressions Functions and Operators¶
A regular expression is a powerful way to specify a pattern for a complex search. The functions and operators described in this section performs regular expression matching on a string.
ECMAScript Regular Expressions Pattern Syntax¶
To implement regular expression support, CUBRID uses the standard C++ <regex> library, which conforms the ECMA-262 RegExp grammar. The following sub-sections describes supported regular expression grammars with several examples.
Note
Compatibility Considerations
In the prior version of CUBRID 11, CUBRID used Henry Spencer’s implementation of regular expressions. From the CUBRID 12, CUBRID uses C++ <regex> standard library to support regular expression functions and operators.
1. The Henry Spencer’s implementation of regular expressions operates in byte-wise fashion. So the REGEXP and RLIKE were not multibyte safe, they only worked as ASCII encoding without considering the collation of operands.
2. The Spencer library supports the POSIX collating sequence expressions ([.character.]). But it does not support anymore. Also, character equivalents ([=word=]) does not support. CUBRID occurs an error when these collating element syntax is given.
3. The Spencer library matches line-terminator characters for the dot operator (.) But it does not.
4. The word-beginning and word-end boundary ([[:<:]] and [[:>:]]) doesn’t support anymore. Instead, the word boundary notation (\b) can be used.
Note
Multibyte Character Comparision Considerations
C++ <regex> performs multibyte comparision by C++ <locale> standard dependent on system-supplied locales. Therefore, system locale should be installed on your system for locale-sensitive functions.
Special Pattern Characters¶
Special pattern characters are characters (or sequences of characters) that have a special meaning when they appear in a regular expression pattern, either to represent a character that is difficult to express in a string, or to represent a category of characters. Each of these special pattern characters is matched in a string against a single character (unless a quantifier specifies otherwise).
Characters |
Description |
---|---|
. |
Any character except line terminators (LF, CR, LS, PS). |
\t |
A horizontal tab character (same as \u0009). |
\n |
A newline (line feed) character (same as \u000A). |
\v |
A vertical tab character (same as \u000B). |
\f |
A form feed character (same as \u000C). |
\r |
A carriage return character (same as \u000D) |
\cletter |
A control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. |
\xhh |
A a character whose code unit value has an hex value equivalent to the two hex digits hh. |
\uhhhh |
A character whose code unit value has an hex value equivalent to the four hex digits hhhh. |
\0 |
A null character (same as \u0000). |
\num |
The result of the submatch whose opening parenthesis is the num-th. See groups below for more info. |
\d |
A decimal digit character (same as [[:digit:]]). |
\D |
Any character that is not a decimal digit character (same as [^[:digit:]]). |
\s |
A whitespace character (same as [[:space:]]). |
\S |
Any character that is not a whitespace character (same as [^[:space:]]). |
\w |
An alphanumeric or underscore character (same as [_[:alnum:]]). |
\W |
Any character that is not an alphanumeric or underscore character (same as [^_[:alnum:]]). |
\character |
The character character as it is, without interpreting its special meaning within a regex expression.
Any character can be escaped except those which form any of the special character sequences above.
Needed for: ^ $ \ . * + ? ( ) [ ] { } |
|
[class] |
A string is part of the class. see POSIX-based character classes below. |
[^class] |
A string is not part of the class. see POSIX-based character classes below. |
-- .: match any character
SELECT ('cubrid dbms' REGEXP '^c.*$');
('cubrid dbms' regexp '^c.*$')
================================
1
To match special characters such as “\n”, “\t”, “\r”, and “\\”, some must be escaped with the backslash (\) by specifying the value of no_backslash_escapes (default: yes) to no. For details on no_backslash_escapes, see Escape Special Characters.
-- \n : match a special character, when no_backslash_escapes=yes (default)
SELECT ('new\nline' REGEXP 'new\\nline');
('new\nline' REGEXP 'new\\nline');
=====================================
1
-- \n : match a special character, when no_backslash_escapes=no
SELECT ('new\nline' REGEXP 'new
line');
('new
line' regexp 'new
line')
=====================================
1
Quantifiers¶
Quantifiers follow a character or a special pattern character. They can modify the amount of times that character is repeated in the match:
Characters |
Description |
---|---|
* |
The preceding is matched 0 or more times. |
+ |
The preceding is matched 1 or more times. |
? |
The preceding is optional (matched either 0 times or once). |
{num} |
The preceding is matched exactly num times. |
{num,} |
The preceding is matched num or more times. |
{min,max} |
The preceding is matched at least min times, but not more than max. |
-- a+ : match any sequence of one or more a characters. case insensitive.
SELECT ('Aaaapricot' REGEXP '^A+pricot');
('Aaaapricot' regexp '^A+pricot')
================================
1
-- a? : match either zero or one a character.
SELECT ('Apricot' REGEXP '^Aa?pricot');
('Apricot' regexp '^Aa?pricot')
==========================
1
SELECT ('Aapricot' REGEXP '^Aa?pricot');
('Aapricot' regexp '^Aa?pricot')
===========================
1
SELECT ('Aaapricot' REGEXP '^Aa?pricot');
('Aaapricot' regexp '^Aa?pricot')
============================
0
-- (cub)* : match zero or more instances of the sequence abc.
SELECT ('cubcub' REGEXP '^(cub)*$');
('cubcub' regexp '^(cub)*$')
==========================
1
By default, all these quantifiers perform in a greedy way which takes as many characters that meet the condition as possible. And this behavior can be overridden to non-greedy by adding a question mark (?) after the quantifier.
-- (a+), (a+?) : match with quantifiers performs greedy and ungreedy respectively.
SELECT REGEXP_SUBSTR ('aardvark', '(a+)'), REGEXP_SUBSTR ('aardvark', '(a+?)');
regexp_substr('aardvark', '(a+)') regexp_substr('aardvark', '(a+?)')
============================================
'aa' 'a'
Groups¶
Groups allow to apply quantifiers to a sequence of characters (instead of a single character). There are two kinds of groups:
Characters |
Description |
---|---|
(subpattern) |
Group which creates a backreference. |
(?:subpattern) |
Passive group which does not create a backreference. |
-- The captured group can be referenced with $int
SELECT REGEXP_REPLACE ('hello cubrid','([[:alnum:]]+)','$1!');
regexp_replace('hello cubrid','([[:alnum:]]+)','$1!')
==========================
'hello! cubrid!'
When a group creates a backreference, the characters that represent the subpattern in a string are stored as a submatch. Each submatch is numbered after the order of appearance of their opening parenthesis (the first submatch is number 1, the second is number 2, and so on…). These submatches can be used in the regular expression itself to specify that the entire subpattern should appear again somewhere else (see int in the special characters list). They can also be used in the replacement string or retrieved in the match_results object filled by some regex operations.
-- performs regexp_substr without groups. the following is the case that fully matched.
SELECT REGEXP_SUBSTR ('abckabcjabc', '[a-c]{3}k[a-c]{3}j[a-c]{3}');
-- ([a-c]{3}) creates a backreference, \1
SELECT REGEXP_SUBSTR ('abckabcjabc', '([a-c]{3})k\1j\1');
regexp_substr('abckabcjabc', '[a-c]{3}k[a-c]{3}j[a-c]{3}')
======================
'abckabcjabc'
regexp_substr('abckabcjabc', '([a-c]{3})k\1j\1')
======================
'abckabcjabc'
Assertions¶
Assertions are conditions that do not consume characters in a string: they do not describe a character, but a condition that must be fulfilled before or after a character.
Characters |
Description |
---|---|
^ |
The beginning of a string, or follows a line terminator |
$ |
The end of a string, or precedes a line terminator |
\b |
The previous character is a word character and the next is a non-word character (or vice-versa). |
\B |
The previous and next characters are both word characters or both are non-word characters. |
(?=subpattern) |
Positive lookahead. The characters following the charcter must match subpattern, but no characters are consumed. |
(?!subpattern) |
Negative lookahead. The characters following the assertion must not match subpattern, but no characters are consumed. |
-- ^ : match the beginning of a string
SELECT ('cubrid dbms' REGEXP '^cub');
('cubrid dbms' regexp '^cub')
===============================
1
-- $ : match the end of a string
SELECT ('this is cubrid dbms' REGEXP 'dbms$');
('this is cubrid dbms' regexp 'dbms$')
========================================
1
-- (?=subpattern): positive lookahead
SELECT REGEXP_REPLACE ('cubrid dbms cubrid sql cubrid rdbms', 'cubrid(?= sql)', 'CUBRID');
-- (?!subpattern): nagative lookahead
SELECT REGEXP_REPLACE ('cubrid dbms cubrid sql cubrid rdbms', 'cubrid(?! sql)', 'CUBRID');
regexp_replace('cubrid dbms cubrid sql cubrid rdbms', 'cubrid(?= sql)', 'CUBRID')
======================
'cubrid dbms CUBRID sql cubrid rdbms'
regexp_replace('cubrid dbms cubrid sql cubrid rdbms', 'cubrid(?! sql)', 'CUBRID')
======================
'CUBRID dbms cubrid sql CUBRID rdbms'
Alternatives¶
A pattern can include different alternatives:
Characters |
Description |
---|---|
| |
Separates two alternative patterns or subpatterns. |
-- a|b : matches any character that is either a or b.
SELECT ('a' REGEXP 'a|b');
SELECT ('d' REGEXP 'a|b');
('a' regexp 'a|b')
==============================
1
('d' regexp 'a|b')
==============================
0
A regular expression can contain multiple alternative patterns simply by separating them with the separator operator (|): The regular expression will match if any of the alternatives match, and as soon as one does. Subpatterns (in groups or assertions) can also use the separator operator to separate different alternatives.
-- a|b|c : matches any character that is either a, b or c.
SELECT ('a' REGEXP 'a|b|c');
SELECT ('d' REGEXP 'a|b|c');
('a' regexp 'a|b|c')
==============================
1
('d' regexp 'a|b|c')
==============================
0
Character classes¶
Character classes syntax matches one of characters or a category of characters within square brackets.
Individual characters
Any character specified is considered part of the class (except the characters \, [, ]).
-- [abc] : matches any character that is either a, b or c.
SELECT ('a' REGEXP '[abc]');
SELECT ('d' REGEXP '[abc]');
('a' regexp '[abc]')
==============================
1
('d' regexp '[abc]')
==============================
0
Ranges
To represent a range of characters, use the dash character (-) between two valid characters. For example, “[a-z]” matches any alphabet letter whereas “[0-9]” matches any single number.
SELECT ('adf' REGEXP '[a-f]');
SELECT ('adf' REGEXP '[g-z]');
('adf' regexp '[a-f]')
================================
1
('adf' regexp '[g-z]')
================================
0
-- [0-9]+: matches number sequence in a string
SELECT REGEXP_SUBSTR ('aas200gjb', '[0-9]+');
regexp_substr('aas200gjb', '[0-9]+')
======================
'200'
SELECT ('strike' REGEXP '^[^a-dXYZ]+$');
('strike' regexp '^[^a-dXYZ]+$')
================================
1
POSIX-based character classes
The POSIX-based character class ([:classname:]) defines categories of characters as shown below. [:d:], [:w:] and [:s:] are an extension to the ECMAScript grammar.
Class |
Description |
---|---|
[:alnum:] |
Alpha-numerical character |
[:alpha:] |
Alphabetic character |
[:blank:] |
Blank character |
[:cntrl:] |
Control character |
[:digit:] |
Decimal digit character |
[:graph:] |
Character with graphical representation |
[:lower:] |
Lowercase letter |
[:print:] |
Printable character |
[:punct:] |
Punctuation mark character |
[:space:] |
Whitespace character |
[:upper:] |
Uppercase letter |
[:xdigit:] |
Hexadecimal digit character |
[:d:] |
Decimal digit character |
[:w:] |
Word character |
[:s:] |
Whitespace character |
SELECT REGEXP_SUBSTR ('Samseong-ro 86-gil, Gangnam-gu, Seoul 06178', '[[:digit:]]{5}');
regexp_substr('Samseong-ro 86-gil, Gangnam-gu, Seoul 06178', '[[:digit:]]{5}')
================================
'06178'
SET NAMES utf8 COLLATE utf8_ko_cs;
SELECT REGEXP_REPLACE ('가나다 가나 가나다라', '\b[[:alpha:]]{2}\b', '#');
regexp_replace('가나다 가나 가나다라', '\b[[:alpha:]]{2}\b', '#')
======================
'가나다 # 가나다라'
REGEXP, RLIKE¶
The REGEXP and RLIKE are used interchangeably. It performs a regular expression matcinh of a string. In the below syntax, if expression matches pattern, 1 is returned; otherwise, 0 is returned. If either expression or pattern is NULL, NULL is returned. The second syntax has the same meaning as the third syntax, which both syntaxes are using NOT.
expression REGEXP | RLIKE [BINARY] pattern
expression NOT REGEXP | RLIKE pattern
NOT (expression REGEXP | RLIKE pattern)
expression : Column or input expression
pattern : Pattern used in regular expressions
The difference between REGEXP and LIKE are as follows:
The LIKE operator succeeds only if the pattern matches the entire value.
The REGEXP operator succeeds if the pattern matches anywhere in the value. To match the entire value, you should use “^” at the beginning and “$” at the end.
The LIKE operator is case sensitive, but patterns of regular expressions in REGEXP is not case sensitive. To enable case sensitive, you should use REGEXP BINARY statement.
-- [a-dX], [^a-dX] : matches any character that is (or is not, if ^ is used) either a, b, c, d or X.
SELECT ('aXbc' REGEXP '^[a-dXYZ]+');
('aXbc' regexp '^[a-dXYZ]+')
==============================
1
-- When REGEXP is used in SELECT list, enclosing this with parentheses is required.
-- But used in WHERE clause, no need parentheses.
-- case insensitive, except when used with BINARY.
SELECT name FROM athlete where name REGEXP '^[a-d]';
name
======================
'Dziouba Irina'
'Dzieciol Iwona'
'Dzamalutdinov Kamil'
'Crucq Maurits'
'Crosta Daniele'
'Bukovec Brigita'
'Bukic Perica'
'Abdullayev Namik'
REGEXP_COUNT¶
- REGEXP_COUNT(string, pattern_string[, position[, match_type]])¶
The REGEXP_COUNT function returns the number of occurrences of the regular expression pattern, pattern_string, within a given character string, string. If NULL is specified as an argument, NULL is returned.
- Parameters:
string – Specifies the original string. If the value is NULL, NULL is returned.
pattern_string – Specifies the regular expression pattern string to be searched. If the value is NULL, NULL is returned.
position – Specifies the position of the string to start the search. If the value is ommitted, the default value 1 is applied. If the value is negative or zero, an error will be returned. If the value is NULL, NULL is returned
match_type – Specifies the string to change default matching behavior of the function. If the value is ommitted, the default value ‘i’ is applied. If the value is other than ‘c’ or ‘i’, an error will be returned. If the value is NULL, NULL is returned.
- Return type:
INT
-- it returns NULL when an argument is specified with NULL value
SELECT REGEXP_COUNT('ab123ab111a','[a-d]+',NULL);
regexp_count('ab123ab111a','[a-d]+',NULL)
======================
NULL
-- an empty string pattern doesn't match with any string
SELECT REGEXP_COUNT('ab123ab111a','');
regexp_count('ab123ab111a','')
======================
0
SELECT REGEXP_COUNT('ab123Ab111aAA','[a-d]', 3);
regexp_count('ab123Ab111aAA', '[a-d]', 3)
===========================================
5
-- case insensitive ('i') is the default value
SELECT REGEXP_COUNT('ab123Ab111aAA','[a-d]', 3, 'i');
-- If case sensitive ('c') is specified as match_type, A is not matched.
SELECT REGEXP_COUNT('ab123Ab111aAA','[a-d]', 3, 'c');
regexp_count('ab123Ab111aAA', '[a-d]', 3, 'i')
================================================
5
regexp_count('ab123Ab111aAA', '[a-d]', 3, 'c')
================================================
2
SET NAMES utf8 COLLATE utf8_ko_cs;
SELECT REGEXP_COUNT('가나123abc가다abc가가','[가-나]+');
regexp_count('가나123abc가다abc가가','[가-나]+')
======================
2
REGEXP_INSTR¶
- REGEXP_INSTR(string, pattern_string[, position[, occurrence[, return_option[, match_type]]]])¶
The REGEXP_INSTR function returns the beginning or ending position by searching for a regular expression pattern, pattern_string, within a given character string, string, and replaces it with a character string. If NULL is specified as an argument, NULL is returned.
- Parameters:
string – Specifies the original string. If the value is NULL, NULL is returned.
pattern_string – Specifies the regular expression pattern string to be searched. If the value is NULL, NULL is returned.
position – Specifies the position of the string to start the search. If the value is ommitted, the default value 1 is applied. If the value is negative or zero, an error will be returned. If the value is NULL, NULL is returned
occurrence – Specifies the occurrence of replacement. If the value is ommitted, the default value 1 is applied. If the value is negative, an error will be returned. If the value is NULL, NULL is returned.
return_option – Specifies whether to return the position of the match. If the value is 0, the position of the first character of the match is returned. If the value is 1, the position of the character following the match is returned. If the value is ommitted, the default value 0 is applied. If the value is other than 0 or 1, an error will be returned. If the value is NULL, NULL is returned.
match_type – Specifies the string to change default matching behavior of the function. If the value is ommitted, the default value ‘i’ is applied. If the value is other than ‘c’ or ‘i’, an error will be returned. If the value is NULL, NULL is returned.
- Return type:
INT
-- it returns NULL when an argument is specified with NULL value
SELECT REGEXP_INSTR('12345abcdeabcde','[abc]',NULL);
regexp_instr('12345abcdeabcde', '[abc]', null)
======================
NULL
-- an empty string pattern doesn't match with any string
SELECT REGEXP_INSTR('12345abcdeabcde','');
regexp_instr('12345abcdeabcde', '')
======================
0
-- it returns the position of the first character of the match.
SELECT REGEXP_INSTR('12354abc5','[:alpha:]+',1,1,0);
regexp_instr('12354abc5','[:alpha:]+', 1, 1, 0);
======================
6
-- it returns the position of the character following the match.
SELECT REGEXP_INSTR('12354abc5','[:alpha:]+',1,1,1);
regexp_instr('12354abc5','[:alpha:]+', 1, 1, 1);
======================
9
SET NAMES utf8 COLLATE utf8_ko_cs;
SELECT REGEXP_INSTR('12345가나다라마가나다라마바','[가-다]+');
regexp_instr('12345가나다라마가나다라마바','[가-다]+');
======================
6
REGEXP_LIKE¶
- REGEXP_LIKE(string, pattern_string[, match_type])¶
The REGEXP_LIKE function searches for a regular expression pattern, pattern_string, within a given character string, string. If the pattern matched anywhere in the string, 1 is returned. Otherwise, 0 is returned. If NULL is specified as an argument, NULL is returned.
- Parameters:
string – Specifies the original string. If the value is NULL, NULL is returned.
pattern_string – Specifies the regular expression pattern string to be searched. If the value is NULL, NULL is returned.
match_type – Specifies the string to change default matching behavior of the function. If the value is ommitted, the default value ‘i’ is applied. If the value is other than ‘c’ or ‘i’, an error will be returned. If the value is NULL, NULL is returned.
- Return type:
INT
SELECT REGEXP_LIKE('abbbbc','ab+c');
regexp_like('abbbbc', 'ab+c');
======================
1
-- an empty string pattern doesn't match with any string
SELECT REGEXP_LIKE('abbbbc','');
regexp_like('abbbbc', '');
======================
0
SELECT REGEXP_LIKE('abbbbc','AB+C', 'c');
regexp_like('abbbbc', 'AB+C');
======================
0
SET NAMES utf8 COLLATE utf8_ko_cs;
SELECT REGEXP_LIKE('가나다','가나?다');
SELECT REGEXP_LIKE('가나라다','가나?다');
regexp_like('가나다', '가나?다')
===============================
1
regexp_like('가나라다, '가나?다')
================================
0
REGEXP_REPLACE¶
- REGEXP_REPLACE(string, pattern_string, replacement_string[, position[, occurrence[, match_type]]])¶
The REGEXP_REPLACE function searches for a regular expression pattern, pattern_string, within a given character string, string, and replaces it with a character string, replacement_string. If NULL is specified as an argument, NULL is returned.
- Parameters:
string – Specifies the original string. If the value is NULL, NULL is returned.
pattern_string – Specifies the regular expression pattern string to be searched. If the value is NULL, NULL is returned.
replacement_string – Specifies the string to replace the matched string by pattern_string. If the value is NULL, NULL is returned.
position – Specifies the position of the string to start the search. If the value is ommitted, the default value 1 is applied. If the value is negative or zero, an error will be returned. If the value is NULL, NULL is returned
occurrence – Specifies the occurrence of replacement. If the value is ommitted, the default value 0 is applied. If the value is negative, an error will be returned. If the value is NULL, NULL is returned.
match_type – Specifies the string to change default matching behavior of the function. If the value is ommitted, the default value ‘i’ is applied. If the value is other than ‘c’ or ‘i’, an error will be returned. If the value is NULL, NULL is returned.
- Return type:
STRING
-- it returns NULL when an argument is specified with NULL value
SELECT REGEXP_REPLACE('12345abcdeabcde','[a-d]',NULL);
regexp_replace('12345abcdeabcde', '[a-d]', null)
======================
NULL
-- an empty string pattern doesn't match with any string
SELECT REGEXP_REPLACE('12345abcdeabcde','','#');
regexp_replace('12345abcdeabcde', '', '#')
======================
'12345abcdeabcde'
SELECT REGEXP_REPLACE('12345abDEKBcde','[a-d]','#');
regexp_replace('12345abDEKBcde', '[a-d]', '#')
======================
'12345###EK###e'
-- case insensitive ('i') is the default value
SELECT REGEXP_REPLACE('12345abDEKBcde','[a-d]','#', 1, 0, 'i');
-- match_type is specified as case sensitive ('c'). 'B' and 'D' are not matched.
SELECT REGEXP_REPLACE('12345abDEKBcde','[a-d]','#', 1, 0, 'c');
regexp_replace('12345abDEKBcde', '[a-d]', '#', 1, 0, 'i')
======================
'12345###EK###e'
regexp_replace('12345abDEKBcde', '[a-d]', '#', 1, 0, 'c')
======================
'12345##DEKB##e'
SET NAMES utf8 COLLATE utf8_ko_cs;
SELECT REGEXP_REPLACE('a1가b2나다라','[가-다]','#',6);
regexp_replace('a1가b2나다라', '[가-다]', '#', 6);
======================
'a1가b2##라'
REGEXP_SUBSTR¶
- REGEXP_SUBSTR(string, pattern_string[, position[, occurrence[, match_type]]])¶
The REGEXP_SUBSTR function extracts a character string matched for a regular expression pattern, pattern_string, within a given character string, string. If NULL is specified as an argument, NULL is returned.
- Parameters:
string – Specifies the original string. If the value is NULL, NULL is returned.
pattern_string – Specifies the regular expression pattern string to be searched. If the value is NULL, NULL is returned.
position – Specifies the position of the string to start the search. If the value is ommitted, the default value 1 is applied. If the value is negative or zero, an error will be returned. If the value is NULL, NULL is returned
occurrence – Specifies the occurrence of replacement. If the value is ommitted, the default value 0 is applied. If the value is negative, an error will be returned. If the value is NULL, NULL is returned.
match_type – Specifies the string to change default matching behavior of the function. If the value is ommitted, the default value ‘i’ is applied. If the value is other than ‘c’ or ‘i’, an error will be returned. If the value is NULL, NULL is returned.
- Return type:
STRING
-- if pattern is not matched, null is returned
SELECT REGEXP_SUBSTR('12345abcdeabcde','[k-z]+');
regexp_substr('12345abcdeabcde','[k-z]+');
======================
NULL
-- an empty string pattern doesn't match with any string
SELECT REGEXP_SUBSTR('12345abcdeabcde','');
regexp_substr('12345abcdeabcde', '')
======================
NULL
SELECT REGEXP_SUBSTR('Samseong-ro, Gangnam-gu, Seoul',',[^,]+,');
regexp_substr('Samseong-ro, Gangnam-gu, Seoul', ',[^,]+,')
======================
', Gangnam-gu,'
SET NAMES utf8 COLLATE utf8_ko_cs;
SELECT REGEXP_SUBSTR('삼성로, 강남구, 서울특별시','[[:alpha:]]+',1,2);
regexp_substr('삼성로, 강남구, 서울특별시', [[:alpha:]]+', 1, 2);
======================
'강남구'