Originální popis anglicky:
regcomp, regerror, regexec, regfree - regular expression matching
Návod, kniha: POSIX Programmer's Manual
#include <regex.h>
int regcomp(regex_t *restrict
preg, const char
*restrict pattern,
int
cflags );
size_t regerror(int
errcode, const regex_t
*restrict preg,
char *restrict
errbuf , size_t
errbuf_size);
int regexec(const regex_t *restrict
preg, const char
*restrict string,
size_t
nmatch , regmatch_t
pmatch[restrict], int
eflags );
void regfree(regex_t *
preg);
These functions interpret
basic and
extended regular expressions
as described in the Base Definitions volume of
IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.
The
regex_t structure is defined in
<regex.h> and contains
at least the following member:
Member Type |
Member Name |
Description |
size_t |
re_nsub |
Number of parenthesized subexpressions. |
The
regmatch_t structure is defined in
<regex.h> and
contains at least the following members:
Member Type |
Member Name |
Description |
regoff_t |
rm_so |
Byte offset from start of string to start of substring. |
regoff_t |
rm_eo |
Byte offset from start of string of the first character after the
end of substring. |
The
regcomp() function shall compile the regular expression contained in
the string pointed to by the
pattern argument and place the results in
the structure pointed to by
preg. The
cflags argument is the
bitwise-inclusive OR of zero or more of the following flags, which are defined
in the
<regex.h> header:
- REG_EXTENDED
- Use Extended Regular Expressions.
- REG_ICASE
- Ignore case in match. (See the Base Definitions volume of
IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.)
- REG_NOSUB
- Report only success/fail in regexec().
- REG_NEWLINE
- Change the handling of <newline>s, as described in
the text.
The default regular expression type for
pattern is a Basic Regular
Expression. The application can specify Extended Regular Expressions using the
REG_EXTENDED
cflags flag.
If the REG_NOSUB flag was not set in
cflags, then
regcomp() shall
set
re_nsub to the number of parenthesized subexpressions (delimited by
"\(\)" in basic regular expressions or
"()"
in extended regular expressions) found in
pattern.
The
regexec() function compares the null-terminated string specified by
string with the compiled regular expression
preg initialized by
a previous call to
regcomp(). If it finds a match,
regexec()
shall return 0; otherwise, it shall return non-zero indicating either no match
or an error. The
eflags argument is the bitwise-inclusive OR of zero or
more of the following flags, which are defined in the
<regex.h>
header:
- REG_NOTBOL
- The first character of the string pointed to by
string is not the beginning of the line. Therefore, the circumflex
character ( '^' ), when taken as a special character, shall not
match the beginning of string.
- REG_NOTEOL
- The last character of the string pointed to by
string is not the end of the line. Therefore, the dollar sign (
'$' ), when taken as a special character, shall not match the end
of string.
If
nmatch is 0 or REG_NOSUB was set in the
cflags argument to
regcomp(), then
regexec() shall ignore the
pmatch
argument. Otherwise, the application shall ensure that the
pmatch
argument points to an array with at least
nmatch elements, and
regexec() shall fill in the elements of that array with offsets of the
substrings of
string that correspond to the parenthesized
subexpressions of
pattern:
pmatch[
i].
rm_so shall
be the byte offset of the beginning and
pmatch[
i].
rm_eo
shall be one greater than the byte offset of the end of substring
i.
(Subexpression
i begins at the
ith matched open parenthesis,
counting from 1.) Offsets in
pmatch[0] identify the substring that
corresponds to the entire regular expression. Unused elements of
pmatch
up to
pmatch[
nmatch-1] shall be filled with -1. If there are
more than
nmatch subexpressions in
pattern (
pattern
itself counts as a subexpression), then
regexec() shall still do the
match, but shall record only the first
nmatch substrings.
When matching a basic or extended regular expression, any given parenthesized
subexpression of
pattern might participate in the match of several
different substrings of
string, or it might not match any substring
even though the pattern as a whole did match. The following rules shall be
used to determine which substrings to report in
pmatch when matching
regular expressions:
- 1.
- If subexpression i in a regular expression is not
contained within another subexpression, and it participated in the match
several times, then the byte offsets in pmatch[ i] shall
delimit the last such match.
- 2.
- If subexpression i is not contained within another
subexpression, and it did not participate in an otherwise successful
match, the byte offsets in pmatch[ i] shall be -1. A
subexpression does not participate in the match when: '*' or
"\{\}" appears immediately after the subexpression in a
basic regular expression, or '*' , '?' , or
"{}" appears immediately after the subexpression in an
extended regular expression, and the subexpression did not match (matched
0 times)
or:
'|' is used in an extended regular expression to select this
subexpression or another, and the other subexpression matched.
- 3.
- If subexpression i is contained within another
subexpression j, and i is not contained within any other
subexpression that is contained within j, and a match of
subexpression j is reported in pmatch[ j], then the
match or non-match of subexpression i reported in pmatch[
i] shall be as described in 1. and 2. above, but within the
substring reported in pmatch[ j] rather than the whole
string. The offsets in pmatch[ i] are still relative to the
start of string.
- 4.
- If subexpression i is contained in subexpression
j, and the byte offsets in pmatch[ j] are -1, then
the pointers in pmatch[ i] shall also be -1.
- 5.
- If subexpression i matched a zero-length string,
then both byte offsets in pmatch[ i] shall be the byte
offset of the character or null terminator immediately following the
zero-length string.
If, when
regexec() is called, the locale is different from when the
regular expression was compiled, the result is undefined.
If REG_NEWLINE is not set in
cflags, then a <newline> in
pattern or
string shall be treated as an ordinary character. If
REG_NEWLINE is set, then <newline> shall be treated as an ordinary
character except as follows:
- 1.
- A <newline> in string shall not be matched by
a period outside a bracket expression or by any form of a non-matching
list (see the Base Definitions volume of
IEEE Std 1003.1-2001, Chapter 9, Regular Expressions).
- 2.
- A circumflex ( '^' ) in pattern, when used to
specify expression anchoring (see the Base Definitions volume of
IEEE Std 1003.1-2001, Section 9.3.8, BRE Expression
Anchoring), shall match the zero-length string immediately after a
<newline> in string, regardless of the setting of
REG_NOTBOL.
- 3.
- A dollar sign ( '$' ) in pattern, when used
to specify expression anchoring, shall match the zero-length string
immediately before a <newline> in string, regardless of the
setting of REG_NOTEOL.
The
regfree() function frees any memory allocated by
regcomp()
associated with
preg.
The following constants are defined as error return values:
- REG_NOMATCH
- regexec() failed to match.
- REG_BADPAT
- Invalid regular expression.
- REG_ECOLLATE
- Invalid collating element referenced.
- REG_ECTYPE
- Invalid character class type referenced.
- REG_EESCAPE
- Trailing '\' in pattern.
- REG_ESUBREG
- Number in "\digit" invalid or in
error.
- REG_EBRACK
- "[]" imbalance.
- REG_EPAREN
- "\(\)" or "()"
imbalance.
- REG_EBRACE
- "\{\}" imbalance.
- REG_BADBR
- Content of "\{\}" invalid: not a number,
number too large, more than two numbers, first larger than second.
- REG_ERANGE
- Invalid endpoint in range expression.
- REG_ESPACE
- Out of memory.
- REG_BADRPT
- '?' , '*' , or '+' not preceded by
valid regular expression.
The
regerror() function provides a mapping from error codes returned by
regcomp() and
regexec() to unspecified printable strings. It
generates a string corresponding to the value of the
errcode argument,
which the application shall ensure is the last non-zero value returned by
regcomp() or
regexec() with the given value of
preg. If
errcode is not such a value, the content of the generated string is
unspecified.
If
preg is a null pointer, but
errcode is a value returned by a
previous call to
regexec() or
regcomp(), the
regerror()
still generates an error string corresponding to the value of
errcode,
but it might not be as detailed under some implementations.
If the
errbuf_size argument is not 0,
regerror() shall place the
generated string into the buffer of size
errbuf_size bytes pointed to
by
errbuf. If the string (including the terminating null) cannot fit in
the buffer,
regerror() shall truncate the string and null-terminate the
result.
If
errbuf_size is 0,
regerror() shall ignore the
errbuf
argument, and return the size of the buffer needed to hold the generated
string.
If the
preg argument to
regexec() or
regfree() is not a
compiled regular expression returned by
regcomp(), the result is
undefined. A
preg is no longer treated as a compiled regular expression
after it is given to
regfree().
Upon successful completion, the
regcomp() function shall return 0.
Otherwise, it shall return an integer value indicating an error as described
in
<regex.h>, and the content of
preg is undefined. If a
code is returned, the interpretation shall be as given in
<regex.h>.
If
regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
return one of the error codes that more precisely describes the error.
Upon successful completion, the
regexec() function shall return 0.
Otherwise, it shall return REG_NOMATCH to indicate no match.
Upon successful completion, the
regerror() function shall return the
number of bytes needed to hold the entire generated string, including the null
termination. If the return value is greater than
errbuf_size, the
string returned in the buffer pointed to by
errbuf has been truncated.
The
regfree() function shall not return a value.
No errors are defined.
The following sections are informative.
#include <regex.h>
/*
* Match string against the extended regular expression in
* pattern, treating errors as no match.
*
* Return 1 for match, 0 for no match.
*/
int
match(const char *string, char *pattern)
{
int status;
regex_t re;
if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
return(0); /* Report error. */
}
status = regexec(&re, string, (size_t) 0, NULL, 0);
regfree(&re);
if (status != 0) {
return(0); /* Report error. */
}
return(1);
}
The following demonstrates how the REG_NOTBOL flag could be used with
regexec() to find all substrings in a line that match a pattern
supplied by a user. (For simplicity of the example, very little error checking
is done.)
(void) regcomp (&re, pattern, 0);
/* This call to regexec() finds the first match on the line. */
error = regexec (&re, &buffer[0], 1, &pm, 0);
while (error == 0) { /* While matches found. */
/* Substring found between pm.rm_so and pm.rm_eo. */
/* This call to regexec() finds the next match. */
error = regexec (&re, buffer + pm.rm_eo, 1, &pm, REG_NOTBOL);
}
An application could use:
regerror(code,preg,(char *)NULL,(size_t)0)
to find out how big a buffer is needed for the generated string,
malloc()
a buffer to hold the string, and then call
regerror() again to get the
string. Alternatively, it could allocate a fixed, static buffer that is big
enough to hold most strings, and then use
malloc() to allocate a larger
buffer if it finds that this is too small.
To match a pattern as described in the Shell and Utilities volume of
IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation,
use the
fnmatch() function.
The
regexec() function must fill in all
nmatch elements of
pmatch, where
nmatch and
pmatch are supplied by the
application, even if some elements of
pmatch do not correspond to
subexpressions in
pattern. The application writer should note that
there is probably no reason for using a value of
nmatch that is larger
than
preg->
re_nsub+1.
The REG_NEWLINE flag supports a use of RE matching that is needed in some
applications like text editors. In such applications, the user supplies an RE
asking the application to find a line that matches the given expression. An
anchor in such an RE anchors at the beginning or end of any line. Such an
application can pass a sequence of <newline>-separated lines to
regexec() as a single long string and specify REG_NEWLINE to
regcomp() to get the desired behavior. The application must ensure that
there are no explicit <newline>s in
pattern if it wants to ensure
that any match occurs entirely within a single line.
The REG_NEWLINE flag affects the behavior of
regexec(), but it is in the
cflags parameter to
regcomp() to allow flexibility of
implementation. Some implementations will want to generate the same compiled
RE in
regcomp() regardless of the setting of REG_NEWLINE and have
regexec() handle anchors differently based on the setting of the flag.
Other implementations will generate different compiled REs based on the
REG_NEWLINE.
The REG_ICASE flag supports the operations taken by the
grep -i
option and the historical implementations of
ex and
vi.
Including this flag will make it easier for application code to be written
that does the same thing as these utilities.
The substrings reported in
pmatch[] are defined using offsets from the
start of the string rather than pointers. Since this is a new interface, there
should be no impact on historical implementations or applications, and offsets
should be just as easy to use as pointers. The change to offsets was made to
facilitate future extensions in which the string to be searched is presented
to
regexec() in blocks, allowing a string to be searched that is not
all in memory at once.
The type
regoff_t is used for the elements of
pmatch[] to ensure
that the application can represent either the largest possible array in memory
(important for an application conforming to the Shell and Utilities volume of
IEEE Std 1003.1-2001) or the largest possible file (important
for an application using the extension where a file is searched in chunks).
The standard developers rejected the inclusion of a
regsub() function
that would be used to do substitutions for a matched RE. While such a routine
would be useful to some applications, its utility would be much more limited
than the matching function described here. Both RE parsing and substitution
are possible to implement without support other than that required by the
ISO C standard, but matching is much more complex than substituting.
The only difficult part of substitution, given the information supplied by
regexec(), is finding the next character in a string when there can be
multi-byte characters. That is a much larger issue, and one that needs a more
general solution.
The
errno variable has not been used for error returns to avoid filling
the
errno name space for this feature.
The interface is defined so that the matched substrings
rm_sp and
rm_ep are in a separate
regmatch_t structure instead of in
regex_t. This allows a single compiled RE to be used simultaneously in
several contexts; in
main() and a signal handler, perhaps, or in
multiple threads of lightweight processes. (The
preg argument to
regexec() is declared with type
const, so the implementation is
not permitted to use the structure to store intermediate results.) It also
allows an application to request an arbitrary number of substrings from an RE.
The number of subexpressions in the RE is reported in
re_nsub in
preg. With this change to
regexec(), consideration was given to
dropping the REG_NOSUB flag since the user can now specify this with a zero
nmatch argument to
regexec(). However, keeping REG_NOSUB allows
an implementation to use a different (perhaps more efficient) algorithm if it
knows in
regcomp() that no subexpressions need be reported. The
implementation is only required to fill in
pmatch if
nmatch is
not zero and if REG_NOSUB is not specified. Note that the
size_t type,
as defined in the ISO C standard, is unsigned, so the description of
regexec() does not need to address negative values of
nmatch.
REG_NOTBOL was added to allow an application to do repeated searches for the
same pattern in a line. If the pattern contains a circumflex character that
should match the beginning of a line, then the pattern should only match when
matched against the beginning of the line. Without the REG_NOTBOL flag, the
application could rewrite the expression for subsequent matches, but in the
general case this would require parsing the expression. The need for
REG_NOTEOL is not as clear; it was added for symmetry.
The addition of the
regerror() function addresses the historical need for
conforming application programs to have access to error information more than
"Function failed to compile/match your RE for unknown reasons".
This interface provides for two different methods of dealing with error
conditions. The specific error codes (REG_EBRACE, for example), defined in
<regex.h>, allow an application to recover from an error if it is
so able. Many applications, especially those that use patterns supplied by a
user, will not try to deal with specific error cases, but will just use
regerror() to obtain a human-readable error message to present to the
user.
The
regerror() function uses a scheme similar to
confstr() to deal
with the problem of allocating memory to hold the generated string. The scheme
used by
strerror() in the ISO C standard was considered
unacceptable since it creates difficulties for multi-threaded applications.
The
preg argument is provided to
regerror() to allow an
implementation to generate a more descriptive message than would be possible
with
errcode alone. An implementation might, for example, save the
character offset of the offending character of the pattern in a field of
preg, and then include that in the generated message string. The
implementation may also ignore
preg.
A REG_FILENAME flag was considered, but omitted. This flag caused
regexec() to match patterns as described in the Shell and Utilities
volume of IEEE Std 1003.1-2001, Section 2.13, Pattern Matching
Notation instead of REs. This service is now provided by the
fnmatch()
function.
Notice that there is a difference in philosophy between the
ISO POSIX-2:1993 standard and IEEE Std 1003.1-2001 in how
to handle a "bad" regular expression. The ISO POSIX-2:1993
standard says that many bad constructs "produce undefined results",
or that "the interpretation is undefined".
IEEE Std 1003.1-2001, however, says that the interpretation of
such REs is unspecified. The term "undefined" means that the action
by the application is an error, of similar severity to passing a bad pointer
to a function.
The
regcomp() and
regexec() functions are required to accept any
null-terminated string as the
pattern argument. If the meaning of the
string is "undefined", the behavior of the function is
"unspecified". IEEE Std 1003.1-2001 does not specify
how the functions will interpret the pattern; they might return error codes,
or they might do pattern matching in some completely unexpected way, but they
should not do something like abort the process.
None.
fnmatch() ,
glob() , Shell and Utilities volume of
IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation,
Base Definitions volume of IEEE Std 1003.1-2001, Chapter 9,
Regular Expressions,
<regex.h>,
<sys/types.h>
Portions of this text are reprinted and reproduced in electronic form from IEEE
Std 1003.1, 2003 Edition, Standard for Information Technology -- Portable
Operating System Interface (POSIX), The Open Group Base Specifications Issue
6, Copyright (C) 2001-2003 by the Institute of Electrical and Electronics
Engineers, Inc and The Open Group. In the event of any discrepancy between
this version and the original IEEE and The Open Group Standard, the original
IEEE and The Open Group Standard is the referee document. The original
Standard can be obtained online at http://www.opengroup.org/unix/online.html
.