Basic Usage: Tokens

pent understands four kinds of tokens, which match varying types of content. One is an ‘any’ token, which matches an arbitrary span of whitespace and/or non-whitespace content. The other three types are intended to match specific kinds of content within the line of text that are often, but not always, separated from surrounding content by whitespace.

All four kinds of tokens accept a flag that instructs the encapsulating Parser to capture the content matching the token for output. A subset of the tokens accepts a flag that alters how the Parser handles the presence or absence of whitespace following the content matching the token.

Additionally, two of the four token types accept required arguments, which specify with more precision the content that the token should match. These required arguments are explained in the respective sections below.

The ‘Any’ Token: ~

The ‘any’ token will match anything, including a completely blank line. It behaves essentially the same as “.*” in regex.

Currently, the ‘any’ token only accepts the ‘capture’ flag (becoming “~!”). Addition of support for the ‘space-after’ flags is planned (#78).

Note that any content matched by a capturing ‘any’ token will be split at whitespace in Parser output.
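
As a rough illustration, the behavior described above can be imitated with Python's standard re module. This is only an analogy under stated assumptions: the helper name any_token_capture is hypothetical and not part of pent, and pent's internal pattern may differ.

```python
import re


def any_token_capture(line):
    """Imitate a capturing 'any' token (~!) on one line of text.

    Hypothetical helper for illustration only; not part of pent.
    """
    # ".*" matches any span of characters on the line, including none at all
    match = re.fullmatch(r".*", line)
    if match is None:
        # Only reached if the input contains a newline
        raise ValueError("expected a single line of text")
    # The captured content is split at whitespace, as noted above
    return match.group(0).split()
```

Under this sketch a blank line still matches and yields an empty list, while a line such as “alpha  beta” yields two separate pieces.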

The ‘Misc’ Token: &

The ‘misc’ token matches any sequence of non-whitespace characters. Its uses are similar to the ‘any’ token, except that its match is confined to a single whitespace-delimited piece of content. It is mainly intended for non-numerical data whose content is not constant, for which the ‘literal’ token cannot be used.

The ‘misc’ token has one required argument, indicating whether it should match exactly one piece of content (&.) or one-or-more pieces of content (&+). When matching one-or-more, the ‘misc’ token requires whitespace between repetitions.

At this time, the functional difference between “~” and “&+” is minimal.

The ‘misc’ token accepts both the capture flag and the space-after modifier flags.
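
The two quantity modes can be pictured with plain regular expressions. The patterns below are an analogy written for this illustration, assuming Python's re module; the names MISC_ONE, MISC_MANY and matches are hypothetical and do not come from pent.

```python
import re

# Analog of "&.": exactly one whitespace-delimited piece of content
MISC_ONE = r"\S+"

# Analog of "&+": one or more pieces, with whitespace required between them
MISC_MANY = r"\S+(?:\s+\S+)*"


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None
```

In this analogy, MISC_ONE accepts “abc-123” but not “abc 123”, and MISC_MANY accepts both. Unlike “.*”, neither pattern can match an empty line, since at least one non-whitespace character is needed.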

The ‘Literal’ Token: @

The ‘literal’ token matches an exact sequence of one or more whitespace-delimited characters, which is provided as a required argument in the token definition.

Similar to the ‘misc’ token, the ‘literal’ token also has the quantity specifier as a required argument: either “@.” for exactly one match or “@+” for one-or-more matches.

The argument for the string to be matched follows the quantity argument. Thus, to match the text foo exactly once a suitable token might be “@.foo”.

To match a literal string containing a space, enclose the entire token in quotes: “’@.this has spaces’”.

The ‘literal’ token differs from the ‘misc’ and ‘number’ tokens in that when the one-or-more argument is used, it prohibits whitespace between the repetitions. This allows, e.g., a long sequence of hyphens to be represented by a token like “@+-”. Similarly, a long sequence of alternating hyphens and spaces could be represented by “’@+- ’”.

The ‘literal’ token accepts both the capture flag and the space-after modifier flags.
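
The repetition rule for the ‘literal’ token can be sketched with a regular expression in which the literal text repeats back to back. This is an analogy only, assuming Python's re module; literal_pattern is a hypothetical helper and not a pent function.

```python
import re


def literal_pattern(text, one_or_more=False):
    """Build a regex analog of a 'literal' token for the given text.

    Hypothetical helper for illustration only; not part of pent.
    """
    # Escape the text so that every character is matched literally
    escaped = re.escape(text)
    if one_or_more:
        # Repetitions follow each other directly, with no whitespace between
        return f"(?:{escaped})+"
    return escaped
```

With this sketch, the analog of “@+-” matches “-----” but not “- - -”, whereas the analog of the quoted hyphen-plus-space token matches “- - - ” (each hyphen followed by a space).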

The ‘Number’ Token: #

The ‘number’ token allows for selectively matching numbers of varying types in the text being parsed; in particular, matches can be constrained by sign (positive, negative, or either) or by format (integer, decimal, or scientific notation; or, combinations of these).

The ‘number’ token takes three required, single-character arguments:

  1. Quantity:
    #. for exactly one, or
    #+ for one-or-more.

  2. Sign:
    #[.+]+ for positive,
    #[.+]- for negative, or
    #[.+]. for either sign.
     

  3. Number Format:
    #[.+][.-+]i for integer,
    #[.+][.-+]d for decimal,
    #[.+][.-+]s for scientific notation,
    #[.+][.-+]f for float (decimal or scientific notation), or
    #[.+][.-+]g for general (integer or float).
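
To make the three-argument layout concrete, the following sketch decodes a ‘number’ token into the meanings listed above. It is written only from that list: describe_number_token and the lookup tables are hypothetical names, flags are deliberately not handled, and nothing here is pent's own code.

```python
import re

QUANTITY = {".": "exactly one", "+": "one-or-more"}
SIGN = {"+": "positive", "-": "negative", ".": "either sign"}
FORMAT = {
    "i": "integer",
    "d": "decimal",
    "s": "scientific notation",
    "f": "float (decimal or scientific notation)",
    "g": "general (integer or float)",
}

# '#', then one quantity character, one sign character, one format character
NUMBER_TOKEN = re.compile(r"#([.+])([.+-])([idsfg])")


def describe_number_token(token):
    """Return (quantity, sign, format) descriptions for a flag-free token."""
    match = NUMBER_TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"not a flag-free number token: {token!r}")
    quantity, sign, number_format = match.groups()
    return QUANTITY[quantity], SIGN[sign], FORMAT[number_format]
```

For instance, “#..i” decodes as exactly one integer of either sign, and “#+-s” as one-or-more negative numbers in scientific notation.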

The ability to specify different number formats was implemented for this token because numbers printed in different formats often have different semantic significance, and it is thus useful to be able to filter or capture based on that format. This example illustrates a simplified case of this.

As with the ‘misc’ token, when matching in one-or-more quantity mode, the ‘number’ token requires whitespace between repetitions.

The ‘number’ token accepts both the capture flag and the space-after modifier flags.

Token Flags

Currently, two types of flags can be passed to tokens: the capture flag and the space-after modifier flags.

If both flags are used in a given token, the space-after modifier flag must precede the capture flag.
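
The ordering rule can be sketched as a small splitter. It assumes that flags sit directly after the token's type character and before any required arguments, as in “~!” and “@o.foo”; split_flags is a hypothetical helper written for this illustration and is not part of pent.

```python
import re

# Type character, optional space-after flag, optional capture flag, remainder
FLAGGED_TOKEN = re.compile(r"([~&@#])([ox]?)(!?)(.*)", re.DOTALL)


def split_flags(token):
    """Split a token into (type, space_after, capture, remainder)."""
    match = FLAGGED_TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"unknown token type: {token!r}")
    kind, space_after, capture, rest = match.groups()
    if kind == "~":
        # The 'any' token takes no arguments and only the capture flag
        if space_after or rest:
            raise ValueError(f"unsupported 'any' token: {token!r}")
    elif rest[:1] not in (".", "+"):
        # Other tokens need a quantity argument right after the flags;
        # flags given in the wrong order end up failing this check
        raise ValueError(f"bad flags or missing quantity: {token!r}")
    return kind, space_after or None, capture == "!", rest
```

Under these assumptions “@x!.foo” splits cleanly into its parts, while “@!x.foo” is rejected because the capture flag appears before the space-after flag.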

Capture Flag: !

In most cases, not all of the data in a block of text is of interest for downstream processing. Thus, pent provides the token-level ‘capture’ flag, “!”, which marks the content of that token for inclusion in the output of capture_body() and capture_struct(). The ‘capture’ flag is an integral part of all of the tutorial examples.

Space-After Flags: o and x

With no space-after flag provided, all tokens REQUIRE the presence of trailing whitespace (or EOL) in order to match. This is because most content is anticipated to be whitespace-delineated, and thus this default leads to more concise Parser definitions.

However, there are situations where changing this behavior is useful for defining a well-targeted Parser, and some where changing it is necessary in order to compose a functional Parser at all.

As an example, take the following line of text:

The foo is in the foo.

The token “@.foo” would match the first occurrence of the word “foo”, because it has whitespace after it, but it would not match the second occurrence, since it is immediately followed by a period.

In order to match both occurrences, the ‘optional trailing whitespace flag’, “o”, could be added, leading to the token “@o.foo”.

If it were desired only to match the second occurrence, the ‘prohibited trailing whitespace flag’, “x”, could be added, yielding “@x.foo”.
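
The three behaviors can be compared side by side using regex lookaheads as stand-ins. The patterns are analogies chosen for this sketch, assuming Python's re module; they are not pent's internal patterns, and PATTERNS and match_starts are hypothetical names. How the “x” flag treats end-of-line is not covered by the text above, so the sketch checks only for whitespace.

```python
import re

LINE = "The foo is in the foo."

# Regex stand-ins for each token, keyed by the token text used above
PATTERNS = {
    "@.foo": r"foo(?=\s|$)",  # no flag: trailing whitespace or EOL required
    "@o.foo": r"foo\s*",      # 'o' flag: trailing whitespace optional
    "@x.foo": r"foo(?!\s)",   # 'x' flag: trailing whitespace prohibited
}


def match_starts(token, line=LINE):
    """Return the start index of every match of the token's stand-in."""
    return [m.start() for m in re.finditer(PATTERNS[token], line)]
```

Applied to the example line, the stand-in for “@.foo” finds only the first “foo” (index 4), the one for “@o.foo” finds both (indices 4 and 18), and the one for “@x.foo” finds only the second (index 18).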

This tutorial example provides further illustration of the use of these flags in more-realistic situations.