5. Regular Expressions

To filter log files meaningfully, we need to learn how to write good regular expressions. A regular expression (regexp for short) is a pattern that matches text. It’s similar to, but more advanced than, wildcards[55] such as *.pdf. Wildcards are generally used only for matching filenames, whereas regular expressions are used for most other matching tasks and are much more powerful.

Regular expressions consist of various meta-characters which have special meaning. There are a handful of regexp flavours, and so there are some differences between them. Since learning regular expressions can be a bit of work, we’ll stick to reasonably easy expressions, enough to get you through Logcheck effectively.

Regular expressions can be immensely powerful. They reach their fullest form in Perl Compatible Regular Expressions, which many other languages support to some degree. It is largely the power of regular expressions that draws many administrators to Perl.

I’ll show you the POSIX Extended regular expression syntax, since that’s what Logcheck uses, and it should be preferred over the older Basic syntax. egrep (or grep -E) is a command that uses the POSIX Extended syntax, whereas plain grep uses the POSIX Basic syntax. Logcheck uses egrep -i internally; the -i makes the matching case-insensitive.
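To make the difference between the two syntaxes concrete, here is a minimal sketch, assuming GNU grep; the sample strings are made up for illustration. In Basic syntax an unescaped + is an ordinary character, while in Extended syntax it is a repetition operator.

```shell
# grep exits non-zero when nothing matches; "|| true" keeps that from
# aborting a script that runs under "set -e".

# Basic syntax: + is literal, so 'ab+c' only matches the text "ab+c".
basic=$(echo "abbc" | grep -c 'ab+c' || true)

# Extended syntax: + means "one or more of the previous item".
extended=$(echo "abbc" | grep -E -c 'ab+c' || true)

# -i ignores case, which is how Logcheck runs its patterns.
nocase=$(echo "ABBC" | grep -E -i -c 'ab+c' || true)

echo "basic=$basic extended=$extended nocase=$nocase"
# prints: basic=0 extended=1 nocase=1
```

The -c option prints the number of matching lines instead of the lines themselves, which makes the outcome easy to compare.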

The following table explains the meta-characters available for use with POSIX Extended regular expression syntax.

In the following examples, a custom tool called regex_test has been used. It works much the same as egrep, but its output is more useful for learning how regular expressions work. In particular, it will only show the first match. This tool should be available in the Lab Resources/Regular Expressions folder. On your workstation, copy the file regex_test.c from the Lab Resources into your virtual machine’s shared folder, and from inside your virtual machine compile it:

$ cd /media/host/
$ make regex_test
cc regex_test.c -o regex_test
$ sudo install -o root -g root -m 0755 regex_test /usr/local/bin/

Table 3. Regular Expression Cheatsheet


.

Matches any single character, but not a new-line character.

$ echo "hello" | regex_test '.'

*

Matches zero or more occurrences of the previous item.

$ echo "hello" | regex_test '.*'
$ echo "abba" | regex_test 'a*'
$ echo "ac" | regex_test 'ab*'
$ echo "abba" | regex_test 'b*'
  (successfully matched 0 characters from index 0 to 0)
BEWARE: Leftmost longest match, especially with *
$ echo "abba" | regex_test 'ab*'
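If regex_test is not to hand, the leftmost-longest rule can also be observed with grep -o, which prints only the matched text. This is a rough sketch, assuming GNU grep. Note that grep -o skips empty matches, so it cannot show the zero-length match that regex_test reports for b*.

```shell
# grep -o prints each non-empty match on its own line; head keeps the first.
first=$(echo "abba" | grep -oE 'ab*' | head -n 1)
echo "$first"    # abb: the match starts at the first a and takes both b's

# b* can match zero characters, so it selects lines that contain no b at all.
count=$(echo "xyz" | grep -cE 'b*')
echo "$count"    # 1
```

The second command is the reason a pattern made only of starred items is rarely what you want in a filter: it matches every line.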

+

Matches one or more occurrences of the previous item. a+ is much like aa*, but much more convenient when you have larger things to repeat, as we see with the grouping operator below.

$ echo "ac" | regex_test 'ab+c'
$ echo "abc" | regex_test 'ab+c'
$ echo "abbc" | regex_test 'ab+c'

?

Matches zero or one of the previous item. In other words, the previous item is optional.

$ echo "ac" | regex_test 'ab?c'
$ echo "abc" | regex_test 'ab?c'
$ echo "abbc" | regex_test 'ab?c'

^

Anchors the match to begin at the start of the line.

$ echo "spline" | regex_test 'line'
$ echo "spline" | regex_test '^line'
$ echo "line #1" | regex_test '^line'
line #1

$

Anchors the match to finish at the end of the line.

$ echo "linear" | regex_test 'line'
$ echo "linear" | regex_test 'line$'
$ echo "spline" | regex_test 'line$'
$ echo -e "... in Section 5, where ... two lines of input
> Section 5" | regex_test '^Section 5$'
Section 5

[ ]

Matches a single input character, which must be one of the characters listed between the square brackets. Most characters inside the square brackets lose any special significance they usually have, though some gain special significance, to allow things like ranges and negation.

$ echo "square" | regex_test '[aeiou]+'
$ echo "-123.45" | regex_test '[0-9]+'
- introduces a range
$ echo "-123.45" | regex_test '[0-9.-]+'
A literal - must come first or last inside the brackets. . is no longer a meta-character.
$ echo "0x0800C4DF" | regex_test '0[xX][0-9a-fA-F]+'
$ echo 'some "quoted" text' | regex_test '"[^"]*"'
^ at the start negates the match: any but a "
some "quoted" text
$ echo 'but "there are \"limits\" to regexps" so watch out' \
>   | regex_test '"[^"]*"'
but "there are \"limits\" to regexps"
{n} and {m,n}

Bounded repetition. {n} matches the previous item exactly n times; {m,n} matches it between m and n times. The upper bound may be omitted, keeping the comma, to leave the range unbounded above ({m,}). GNU grep also accepts an omitted lower bound ({,n}) as an extension.
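A few bounded repetitions in action, as a sketch using grep -oE in place of regex_test (GNU grep assumed; the input strings are invented for illustration):

```shell
# Exactly four digits in a row.
year=$(echo "released 2024-05-17" | grep -oE '[0-9]{4}')
echo "$year"     # 2024

# Between two and three a's: leftmost-longest takes three.
run=$(echo "caaaat" | grep -oE 'a{2,3}')
echo "$run"      # aaa

# At least two digits, with no upper bound: the lone 7 is skipped.
long=$(echo "id 7 then 12345" | grep -oE '[0-9]{2,}')
echo "$long"     # 12345
```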


\

Removes (escapes) special meaning from a meta-character.

$ echo '123 1.2 123' | regex_test '[0-9].[0-9]'
123 1.2 123
$ echo '123 1.2 123' | regex_test '[0-9]\.[0-9]'
123 1.2 123

( )

Grouping construct. You can combine it with the repetition operators above.

$ echo '' | regex_test '([0-9]+\.){3}[0-9]+'
$ echo '1.2' | regex_test '([0-9]+\.){3}[0-9]+'
1.2  Trap!: beware what comes after
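One way to read the trap above, offered as a sketch rather than the exact case the original example showed: an unanchored pattern will match the front of a longer string, so it pays to pin down what may come before and after. The input 1.2.3.4.5 is a made-up value, and GNU grep is assumed.

```shell
# Unanchored: the pattern matches the first four fields of a longer string.
loose=$(echo "1.2.3.4.5" | grep -oE '([0-9]+\.){3}[0-9]+')
echo "$loose"     # 1.2.3.4

# Anchored at both ends: the whole line must be exactly four dotted numbers,
# so this input is rejected. "|| true" absorbs grep's non-zero exit status.
strict=$(echo "1.2.3.4.5" | grep -cE '^([0-9]+\.){3}[0-9]+$' || true)
echo "$strict"    # 0
```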

|

Alternation inside a group. For example, the pattern (foo|bar) matches either foo or bar.

$ echo '... Figure 1.2 ...
>  ...
>  ... Table 1.1 ...' | regex_test -i '(figure|table) [0-9.]+'
... Figure 1.2 ...
... Table 1.1 ...

[:alpha:], [:digit:], ...

Named character classes can be used inside square brackets to specify things such as an “alphabetic” character, a “lowercase” character, a “digit”, etc. See re_format(7) for further details.

Constructions such as [a-zA-Z] are not sufficient in a non-English locale. For example, in Spanish you would also consider ‘ñ’ a letter. What is matched depends on the locale: in an English locale, ‘ñ’ would not be expected to match, but unfortunately it probably would[a].

You are not expected to try this. Nor are you expected to know how to type in Spanish or Chinese, but it is important to be aware of the differences.

Match some alphabetic characters
$ echo '¡Español!' | LANG=es_MX.UTF-8 regex_test '[[:alpha:]]+'
Use LANG=C to ensure 7-bit ASCII
$ echo '¡Español!' | LANG=C regex_test '[[:alpha:]]+'
But matching is entirely too inclusive
$ echo '中文!' | LANG=es_MX.utf8 regex_test -i '[[:alpha:]]+'
中文  Chinese characters are not valid Spanish alphabetic characters.

[a] What is actually matched is any Unicode character with a “Letter” property. While explainable, it is likely unexpected in the context of a locale.
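For log filtering, where the input is mostly plain ASCII, named classes behave predictably if the locale is pinned for the command. A minimal sketch, assuming GNU grep; the sample line is modelled on the Syslog entries used later in this section. LC_ALL is used here instead of LANG because it overrides LANG and every other LC_* variable.

```shell
# A sample message fragment (hypothetical, for illustration only).
line='snmpd[2978]: restart'

# First run of ASCII letters on the line.
word=$(echo "$line" | LC_ALL=C grep -oE '[[:alpha:]]+' | head -n 1)

# The only run of digits on the line.
pid=$(echo "$line" | LC_ALL=C grep -oE '[[:digit:]]+')

echo "$word $pid"    # snmpd 2978
```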

Here is a simple procedure to show you how to write a simple regular expression to match log file entries.

  1. Use the following log entry as an example for the rest of the procedure.

    Feb 17 13:17:44 belgarath snmpd[2978]: Connection from
  2. Identify the parts that will change, and the parts that will stay the same. Syslog entries have the date and time stamp, which will definitely change. belgarath in this example is the name of the host that submitted the log. That will stay the same, and is worth including as a point of reference[56]. snmpd is the process name the submitting agent gave. That is a key part of identifying the log message. The number isn’t important; it’s the PID and will change. The string Connection from is important. Together with the snmpd part it practically identifies the whole line.

    We have an IP address in the line. If the service is used by many different IP addresses in the same subnet, you may elect to match only part of it, say 10.18., which can be used for matching everything in the subnet.

  3. Your system’s logcheck rules will likely start with a consistent header that matches the timestamp and hostname that Syslog prepends to the message. Thus, you can replace the timestamp and hostname with whatever the other logcheck entries start with:

    ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd[2978]: Connection from

    The \w is not documented in regex(7), but is mentioned in the GNU “Info” page. It is equivalent to [_[:alnum:]] (a letter, a digit, or an underscore) and is a shorthand notation borrowed from Perl regular expressions. Consider it a GNU extension.

  4. Escape every meta-character in the input you wish to match, by prefixing it with a \ in the regular expression. Note that not all punctuation characters are meta-characters. Ignore the header part which we’ve already fixed up.

    ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[2978\]: Connection from 10\.18\.1\.1
  5. Replace varying numerical sequences with [0-9]+. This is useful for the process identifier.

    ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[[0-9]+\]: 
    Connection from 10\.18\.[0-9]+\.[0-9]+
  6. Your completed regular expression should look like this. It is wrapped here for width, but must be entered as a single line.

    ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[[0-9]+\]: 
    Connection from 10\.18\.[0-9]+\.[0-9]+
  7. Test the regexp using egrep -i. We suggest that you put the input text into a file, to save typing.

    $ echo 'Feb 17 13:17:44 belgarath snmpd[2978]: Connection from' \
    > > log_message.txt
    $ egrep -i 'belgarath snmpd.[0-9]*.: Connection from' log_message.txt
    Feb 17 13:17:44 belgarath snmpd[2978]: Connection from
    The line was printed, so it matches.
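The completed expression from step 6 can be checked the same way. The sketch below assumes GNU grep (for \w) and uses two sample lines whose addresses are assumptions for illustration: 10.18.1.1 is borrowed from step 4, and 192.168.1.1 stands in for a host outside the subnet.

```shell
# The pattern from step 6, written on one line.
pattern='^\w{3} [ :0-9]{11} [._[:alnum:]-]+ snmpd\[[0-9]+\]: Connection from 10\.18\.[0-9]+\.[0-9]+'

# Sample messages (hypothetical addresses).
inside='Feb 17 13:17:44 belgarath snmpd[2978]: Connection from 10.18.1.1'
outside='Feb 17 13:17:44 belgarath snmpd[2978]: Connection from 192.168.1.1'

# Count matching lines; "|| true" absorbs grep's non-zero exit on no match.
hit=$(echo "$inside" | grep -Eic "$pattern" || true)
miss=$(echo "$outside" | grep -Eic "$pattern" || true)

echo "inside=$hit outside=$miss"    # inside=1 outside=0
```

Testing with a line that should not match is as important as testing one that should: a pattern that is too loose would make Logcheck hide messages you still want to see.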

5.1. Assessment


Transform the following two log messages into two suitable egrep (POSIX Extended) style regular expressions for use with Logcheck.

Feb 10 18:32:22 belgarath sshd[2947]: Accepted publickey for theauthor from port 34061 ssh2
Feb 17 06:25:02 belgarath su[20870]: + ??? root:nobody

I have indicated those parts of each message that would change each time or will appear differently in each. Some of it you may wish to keep. For example, you may wish to only match part of the IP address, if you routinely have people logging in from 10.18.2.X; or perhaps you have many users and don’t want to match each user explicitly.

In the second example, the + indicates the beginning of a session. The ??? is for the terminal (typically something like /dev/tty1), but in this case there is no controlling terminal associated with this command. The ? is special to egrep (meaning the preceding item is optional), so you will need to escape it, turning each literal ? into \?. root:nobody indicates that the root user transitioned to the nobody user. Since this is effectively dropping privileges, it’s generally okay to ignore this event.

[55] Referred to as glob patterns.

[56] Syslog can be made to accept remote log messages.