In order to meaningfully filter log files, we need to learn
how to write good regular expressions. A regular expression
(regexp for short) is a pattern that
matches text. It’s similar to, but more advanced than,
wildcards[55], such as
*.pdf. Wildcards are generally only used for
matching filenames and the like, whereas regular expressions are used for
most other matching tasks, and are much more powerful.
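As a rough illustration of the difference, the sketch below applies the wildcard *.pdf and a comparable regular expression to the same list of names. The filenames are invented for the example, and the regular expression `\.pdf$` is only one reasonable way to express the same idea.

```shell
# Hypothetical filenames, used only for illustration.
files='report.pdf
notes.txt
pdf_index.html'

# Shell wildcard: * stands for any run of characters in the name.
glob_matches=$(printf '%s\n' "$files" | while IFS= read -r name; do
    case "$name" in
        *.pdf) printf '%s\n' "$name" ;;
    esac
done)

# Comparable regular expression: an escaped (literal) dot, then pdf,
# anchored to the end of the line.
regex_matches=$(printf '%s\n' "$files" | grep -E '\.pdf$')

printf 'wildcard: %s\n' "$glob_matches"
printf 'regexp:   %s\n' "$regex_matches"
```

Both approaches select only report.pdf. Note that the dot must be escaped in the regular expression, whereas in the wildcard it is an ordinary character.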
Regular expressions consist of various meta-characters which have special meaning. There are a handful of regexp flavours, and so there are some differences between them. Since learning regular expressions can be a bit of work, we’ll stick to reasonably easy expressions, enough to get you through Logcheck effectively.
Regular expressions can be immensely powerful, reaching their most capable form in Perl Compatible Regular Expressions (PCRE), which many other languages support to some degree. It is largely the power of regular expressions that draws many administrators to Perl.
I’ll show you the POSIX Extended regular
expression syntax, since that’s what Logcheck
uses, and it is preferable to the older Basic
syntax. egrep (or grep -E)
is a command that uses the POSIX Extended
syntax, whereas grep uses
POSIX Basic syntax. Logcheck uses
egrep -i internally; the -i
makes the matching case-insensitive.
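The difference between the two syntaxes, and the effect of -i, can be seen with a small experiment. This is a sketch only; the sample text and the patterns are invented for the illustration, and it uses grep -E rather than egrep, which is the same thing.

```shell
# Sample input, invented for this illustration.
sample='snmpd: Connection from host'

# Extended syntax: + means "one or more of the previous item",
# so n+ection matches the "nnection" inside "Connection".
ere=$(printf '%s\n' "$sample" | grep -E -c 'n+ection' || true)

# Basic syntax: an unescaped + is an ordinary character, so this
# pattern looks for a literal "n+ection" and finds nothing.
bre=$(printf '%s\n' "$sample" | grep -c 'n+ection' || true)

# -i makes the match case-insensitive, as Logcheck does.
nocase=$(printf '%s\n' "$sample" | grep -E -i -c 'CONNECTION' || true)

printf 'extended: %s  basic: %s  case-insensitive: %s\n' "$ere" "$bre" "$nocase"
```

The -c option prints the number of matching lines, so the output should read 1 for the extended and case-insensitive runs and 0 for the basic one.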
The following table explains the meta-characters available for use with POSIX Extended regular expression syntax.
In the following examples, a custom tool called
regex_test has been used. It works much the
same as egrep, but its output is more useful
for learning how regular expressions work. In particular, it will
only show the first match. This tool should
be available in the Lab Resources/Regular
Expressions folder. On your workstation, copy the file
regex_test.c from the Lab Resources into your
virtual machine’s shared folder, and from inside your virtual
machine compile it:
$ cd /media/host/
$ make regex_test
cc regex_test.c -o regex_test
$ sudo install -o root -g root -m 0755 regex_test /usr/local/bin/
Table 3. Regular Expression Cheatsheet
| Syntax | Description |
|---|---|
| . | Matches any single character, but not a new-line character. |
| * | Matches zero or more occurrences of the previous item. |
| + | Matches one or more occurrences of the previous item. |
| ? | Matches zero or one of the previous item. In other words, the previous item is optional. |
| ^ | Anchors the match to begin at the start of the line. |
| $ | Anchors the match to finish at the end of the line. |
| […] | Matches a single input character, which must be one of the characters listed between the square brackets. Most characters inside the square brackets lose any special significance they usually have, though some gain special significance, to allow things like ranges and negation. |
| {n,m} and {n} | Bounded repetition. Matches the previous item at least n and at most m times, or exactly n times. |
| \ | Removes (escapes) special meaning from a meta-character. |
| (…) | Grouping construct. You could use it with the repetition qualifiers above. |
| (…\|…) | Alternation inside a group. The group matches any one of the alternatives separated by the vertical bar. |
| [[:alpha:]] | Named character classes can be used to specify things such as an “alphabetic” character, a “lowercase” character, a “digit”, etc. See re_format(7) for further details. What counts as alphabetic depends on the locale. You are not expected to try this, nor are you expected to know how to type in Spanish or Chinese, but it is important to be aware of the differences when matching alphabetic characters.[a] |
[a] Any Unicode character with a “Letter” property is actually what is matched. While explainable, it is likely unexpected in the context of a locale.
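To get a feel for the meta-characters in the table, the following sketch prints the first match of a pattern, much as regex_test is described as doing. The helper first_match is hypothetical and written just for this illustration; it relies on grep's -o option, which GNU and BSD grep provide but POSIX does not require. All sample inputs are invented.

```shell
# first_match PATTERN TEXT
# Print the first substring of TEXT matched by the extended regexp PATTERN.
# Hypothetical helper; assumes a grep that supports -o.
first_match() {
    printf '%s\n' "$2" | grep -E -o -e "$1" | head -n 1
}

first_match 'c.t'          'cat cot'      # . matches any character: cat
first_match 'ab*c'         'ac abbc'      # * allows zero b's: ac
first_match 'ab+c'         'ac abbc'      # + needs at least one b: abbc
first_match 'colou?r'      'color'        # ? makes the u optional: color
first_match '^Feb'         'Feb 17'       # ^ anchors to the start: Feb
first_match '44$'          '13:17:44'     # $ anchors to the end: 44
first_match '[0-9]{2}'     'pid 2978'     # {2} means exactly two: 29
first_match '\[[0-9]+\]'   'snmpd[2978]'  # \ escapes the brackets: [2978]
first_match '(ab|cd)+'     'xcdabx'       # group with alternation: cdab
first_match '[[:alpha:]]+' '10 hosts'     # named class: hosts
```

Each call prints the text shown after the colon in its comment. Changing one meta-character at a time and re-running is a quick way to see what each one contributes.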
Here is a simple procedure for writing a regular expression to match log file entries.
Use the following log entry as an example for the rest of the procedure.
Feb 17 13:17:44 belgarath snmpd[2978]: Connection from 10.18.1.1
Identify the parts that will change, and the parts that
will stay the same. Syslog entries have the date and time
stamp, which will definitely
change. belgarath in this example is the
name of the host that submitted the log. That will stay the
same, and is worth including for a point of
reference[56]. snmpd is
the process name the submitting agent gave. That is a key part
of identifying the log message. The number isn’t important;
it’s the PID and will change. The string
Connection from is important. Together
with the snmpd part it practically
identifies the whole line.
We have an IP address in the line. If
the service is used by many different IP addresses in the same
subnet, you may elect to match only part of it, say
10.18., which can be used for matching
everything in the 10.18.0.0/16 subnet.
Your system’s logcheck rules will likely start with a consistent header to match the timestamp and hostname part that Syslog prepends to the message. Thus, you can replace the timestamp and hostname with whatever other logcheck entries start with:
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd[2978]: Connection from 10.18.1.1
The \w is not documented in
regex(7),
but is mentioned in the GNU “Info” page. It is equivalent to
[_[:alnum:]] and is a shorthand notation
borrowed from Perl regular expressions. Consider it a GNU
extension.
Escape every meta-character in the input you wish to
match by prefixing it with a \ in the
regular expression. Note that not all punctuation characters
are meta-characters. Ignore the header part, which we’ve
already fixed up.
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[2978\]: Connection from 10\.18\.1\.1
Replace varying numerical sequences with
[0-9]+. This is useful for the process
identifier and for the parts of the IP address that vary.
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[[0-9]+\]: ↩
Connection from 10\.18\.[0-9]+\.[0-9]+
Your completed regular expression should look like this. The ↩ marks a line wrapped for display; the expression is a single line.
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[[0-9]+\]: ↩
Connection from 10\.18\.[0-9]+\.[0-9]+
Test the regexp using egrep -i. We suggest that you put the input text into a file, to save typing.
$ echo 'Feb 17 13:17:44 belgarath snmpd[2978]: Connection from 10.18.1.1' \
> > log_message.txt
$ egrep -i 'belgarath snmpd.[0-9]*.: Connection from 10.18.1.1' log_message.txt
Feb 17 13:17:44 belgarath snmpd[2978]: Connection from 10.18.1.1
The line was printed, so it matches.
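It can also be worth checking the completed regular expression against lines that should not match. The sketch below does this with a temporary file; the second and third log lines are invented for the illustration, and the use of \w assumes a grep that supports that GNU extension.

```shell
# Temporary file, so that an existing log_message.txt is not overwritten.
logfile=$(mktemp)

# The sample entry from the procedure, plus two invented lines that should
# not match: one from a different subnet and one from a different daemon.
cat > "$logfile" <<'EOF'
Feb 17 13:17:44 belgarath snmpd[2978]: Connection from 10.18.1.1
Feb 17 13:18:02 belgarath snmpd[2978]: Connection from 192.168.5.9
Feb 17 13:18:10 belgarath sshd[3001]: Connection from 10.18.1.1
EOF

# The completed regular expression, written on a single line.
pattern='^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[[0-9]+\]: Connection from 10\.18\.[0-9]+\.[0-9]+'

# grep -E -i behaves like the egrep -i that Logcheck uses.
matches=$(grep -E -i -e "$pattern" "$logfile" || true)
printf '%s\n' "$matches"

rm -f "$logfile"
```

Only the first line should be printed. If one of the other two appears, the expression is too loose; if nothing appears, it is too strict.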