Capturing with a Single Parser

This first example is a modified version of the dataset used in the first half of the project README, drawn from a .hess file generated by ORCA:

>>> from textwrap import dedent
>>> import numpy as np
>>> import pent
>>> text = dedent("""\
... $vibrational_frequencies
... 6
... 0          0.000000
... 1          0.000000
... 2       -194.490162
... 3       -198.587114
... 4        389.931897
... 5        402.713910
... """)

A Minimal Parser Body

Focusing first on the main section of the data, the goal here is to retrieve the floats in the right-hand column; the rest of the content is irrelevant. However, the integers in the left-hand column still have to be represented in the pattern, even if they’re not captured.

So, to represent those leading integers, the first token of the body pattern needs to be a single number (#.) that’s not captured (omit !), with a positive sign (+) and integer format (i), leading to #.+i.

Then, to match the second, decimal value on each line, the second token needs to also be a single number (#.) of decimal format (d). But, since we want these values to be captured in output, it’s necessary to insert ! after #. And, since some of the values in this list are negative and some are positive, the token should allow any sign (.). Thus, the second token should be #!..d.
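For readers who think in regular expressions, the two tokens can be loosely compared to a hand-written pattern. The sketch below uses only the standard library re module; the names POSITIVE_INT, ANY_SIGN_DECIMAL and LINE_RE are made up for illustration, and the pattern is an approximation of what the tokens mean, not the regex that pent actually generates.

```python
import re

# Approximate stdlib analogue of the two tokens described above.
# Illustration only: this is NOT the regex pent builds internally.
POSITIVE_INT = r"\+?\d+"             # roughly "#.+i": one positive integer, not captured
ANY_SIGN_DECIMAL = r"[+-]?\d+\.\d+"  # roughly "#!..d": one decimal of any sign, captured

# One uncaptured integer, whitespace, then one captured decimal per line.
LINE_RE = re.compile(rf"^\s*{POSITIVE_INT}\s+({ANY_SIGN_DECIMAL})\s*$")

sample_lines = [
    "0          0.000000",
    "2       -194.490162",
    "$vibrational_frequencies",
    "6",
]

captured = []
for line in sample_lines:
    m = LINE_RE.match(line)
    if m:
        # Only the decimal sits inside a capturing group.
        captured.append(m.group(1))

print(captured)
```

Only the two data lines contribute a value; the marker line and the lone integer do not fit the two-token shape.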

So, a first stab at the body of the Parser would be:

>>> prs = pent.Parser(body="#.+i #!..d")
>>> prs.capture_body(text)
[[['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']]]

Works nicely! There are two things to note about the data returned here, though:

First, all of the numerical values are returned as strings. pent tries to maximize flexibility by making no assumptions about what needs to be done with the data. Thus, some post-processing will always be required. For example, to convert the captured values into a NumPy array, one could do the following:

>>> arr = np.asarray(prs.capture_body(text), dtype=float).squeeze()
>>> print(arr)
[   0.          0.       -194.490162 -198.587114  389.931897  402.71391 ]
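If NumPy is not available, the same conversion can be sketched with plain list comprehensions. The snippet below starts from a literal copy of the capture_body() output shown above rather than calling pent, so it runs on its own; the variable names are illustrative only.

```python
# Literal copy of the capture_body() result shown above.
captured = [[['0.000000'], ['0.000000'], ['-194.490162'],
             ['-198.587114'], ['389.931897'], ['402.713910']]]

# Convert every string to float while keeping the block/row/column nesting.
as_floats = [[[float(val) for val in row] for row in block] for block in captured]

# Flatten the single 6x1 block into a simple list of frequencies.
freqs = [row[0] for row in as_floats[0]]
print(freqs)
```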

Second, the captured data is always returned as a nested series of lists. In situations like this one, where a single Parser is used, the nesting will be three levels deep. This is because each matching block of data is returned as a matrix (a list of lists), and each of these matrices is then in turn a member of the outermost list.

In this particular instance, since the body captures exactly one value per line of text parsed, the innermost lists are length-one. And, since there are six lines that match the body pattern, the matrix that is returned is of size 6x1 (a list containing six length-one lists).

This means that if there had been a gap in the data, the outermost list would have had length greater than one:

>>> text2 = dedent("""\
... 0      0.000000
... 1      0.000000
...
... 2   -194.490162
... 3   -198.587114
... """)
>>> prs.capture_body(text2)
[[['0.000000'], ['0.000000']], [['-194.490162'], ['-198.587114']]]

There are two blocks of data here, each with two rows of one value each, so the return value from capture_body() is a length-two list, where each item of that list represents a 2x1 matrix.
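The nesting can be checked programmatically. The following is a small sketch that reports the (rows, columns) shape of each block, using a literal copy of the output above; block_shapes is a hypothetical helper, not part of pent, and it assumes every row in a block has the same length.

```python
# Literal copy of the two-block capture_body() result shown above.
captured = [[['0.000000'], ['0.000000']], [['-194.490162'], ['-198.587114']]]

def block_shapes(blocks):
    """Return (rows, columns) for each block in a capture_body()-style result."""
    # Assumes rows within a block all have the same number of captures.
    return [(len(block), len(block[0]) if block else 0) for block in blocks]

shapes = block_shapes(captured)
print(len(captured), shapes)
```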

Capturing Multiple Values per Line

If one wanted to also capture the integer indices in each row, the only change needed would be to add the ! capturing flag to that first token:

>>> pent.Parser(body="#!.+i #!..d").capture_body(text2)
[[['0', '0.000000'], ['1', '0.000000']], [['2', '-194.490162'], ['3', '-198.587114']]]
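With both columns captured, one possible post-processing step is to build a mapping from index to value across all blocks. This is only a sketch working from a literal copy of the output above; freq_by_index is an illustrative name.

```python
# Literal copy of the two-column capture_body() result shown above.
captured = [[['0', '0.000000'], ['1', '0.000000']],
            [['2', '-194.490162'], ['3', '-198.587114']]]

# Each row holds [index, value] as strings; convert and merge all blocks.
freq_by_index = {
    int(idx): float(val)
    for block in captured
    for idx, val in block
}
print(freq_by_index)
```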

Constraining the Parser Match with a head

Suppose, however, that the file contains other datasets in this same format that we don’t want to capture:

>>> text3 = dedent("""\
... $vibrational_frequencies
... 6
... 0          0.000000
... 1          0.000000
... 2       -194.490162
... 3       -198.587114
... 4        389.931897
... 5        402.713910
...
... $unrelated_data
... 3
... 0          3.316
... 1         -4.311
... 2         12.120
... """)

The original Parser will grab both of these blocks of data:

>>> prs.capture_body(text3)
[[['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], [['3.316'], ['-4.311'], ['12.120']]]

The Parser can be constrained to only the data we want by introducing a head pattern:

>>> prs2 = pent.Parser(
...     head=["@.$vibrational_frequencies", "#!.+i"],
...     body="#.+i #!..d"
... )
>>> prs2.capture_body(text3)
[[['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']]]

This use of head introduces two concepts: (1) the ‘literal string’ token, @, combined with the “.” quantity marker, which tells the Parser to match the literal string exactly once; and (2) the pent feature whereby a length-n ordered iterable of pattern strings (here, a length-two list) will match n lines from the data string. In this case, the first string in the list matches the “$vibrational_frequencies” marker in the first line of the header, and the second captures the single positive integer in the second line of the header.
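The idea of a multi-line head gating the body can be imitated with the standard library alone. The sketch below is an approximation under stated assumptions: the head patterns must match consecutive lines, the body rows must follow immediately, and the regexes are hand-written stand-ins rather than what pent generates. HEAD, BODY and capture_after_head are hypothetical names.

```python
import re
from textwrap import dedent

# Hand-written stand-ins for the two head lines and the body line.
HEAD = [
    re.compile(r"^\$vibrational_frequencies\s*$"),  # literal marker line
    re.compile(r"^\s*\+?\d+\s*$"),                  # single positive integer
]
BODY = re.compile(r"^\s*\+?\d+\s+([+-]?\d+\.\d+)\s*$")

def capture_after_head(text):
    """Collect body captures only where the head lines match just before."""
    lines = text.splitlines()
    results = []
    i = 0
    while i + len(HEAD) <= len(lines):
        window = lines[i:i + len(HEAD)]
        if all(p.match(ln) for p, ln in zip(HEAD, window)):
            j = i + len(HEAD)
            rows = []
            # Consume body rows until the first non-matching line.
            while j < len(lines) and (m := BODY.match(lines[j])):
                rows.append([m.group(1)])
                j += 1
            if rows:
                results.append(rows)
                i = j
                continue
        i += 1
    return results

text3 = dedent("""\
    $vibrational_frequencies
    6
    0          0.000000
    1          0.000000
    2       -194.490162
    3       -198.587114
    4        389.931897
    5        402.713910

    $unrelated_data
    3
    0          3.316
    1         -4.311
    2         12.120
    """)

blocks = capture_after_head(text3)
print(blocks)
```

Because the first head pattern requires the literal marker, the rows under $unrelated_data are never reached even though they fit the body pattern.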

Capturing in head and tail with capture_struct()

In the example immediately above, note that even though the “!” capturing flag is specified in the second element of the head, that captured value does not show up in the capture_body() output. Captures in head and tail must be retrieved using capture_struct():

>>> prs2.capture_struct(text3)
[{<ParserField.Head: 'head'>: [['6']], <ParserField.Body: 'body'>: [['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], <ParserField.Tail: 'tail'>: None}]
>>> prs2.capture_struct(text3)[0][pent.ParserField.Head]
[['6']]

The return value from capture_struct() has length equal to the number of times the Parser matched within the text. Here, since the pattern only matched once, the return value is of length one.

As a convenience, the lists returned by capture_struct() are actually of type ThruList, a custom subclass of list, which will silently pass indices/keys through to its sole element if and only if the list is of length one. Thus, the following would also work for prs2 operating on text3:

>>> prs2.capture_struct(text3)[pent.ParserField.Head]
[['6']]

But, it would break for the original prs, where the overall pattern matched twice:

>>> prs.capture_struct(text3)
[{<ParserField.Head: 'head'>: None, <ParserField.Body: 'body'>: [['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], <ParserField.Tail: 'tail'>: None}, {<ParserField.Head: 'head'>: None, <ParserField.Body: 'body'>: [['3.316'], ['-4.311'], ['12.120']], <ParserField.Tail: 'tail'>: None}]
>>> prs.capture_struct(text3)[pent.ParserField.Head]
Traceback (most recent call last):
    ...
pent.errors.ThruListError: Invalid ThruList index: Numeric index required for len != 1
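To make the pass-through behavior concrete, here is a minimal sketch of a list subclass that behaves similarly. It is not pent's ThruList implementation: PassThroughList is a hypothetical class, plain string keys stand in for the pent.ParserField members, and a KeyError stands in for ThruListError.

```python
class PassThroughList(list):
    """Hypothetical stand-in illustrating the pass-through idea."""

    def __getitem__(self, key):
        # Ordinary integer and slice indexing behaves exactly like list.
        if isinstance(key, (int, slice)):
            return super().__getitem__(key)
        # Any other key is forwarded to the sole element, if there is exactly one.
        if len(self) == 1:
            return super().__getitem__(0)[key]
        # Stand-in for the ThruListError shown above.
        raise KeyError(f"Numeric index required for len != 1 (got {key!r})")


# String keys stand in for pent.ParserField members.
single = PassThroughList([{"head": [["6"]], "body": [["0.000000"]]}])
double = PassThroughList([{"head": None}, {"head": None}])

print(single["head"])     # forwarded to single[0]["head"]
print(single[0]["head"])  # explicit indexing still works
```

Indexing double with a non-numeric key raises, since there is no single element to forward to.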

As a final note, consider the difference between the head and tail results for the Parser below: head is defined but contains no capturing tokens, so it yields [[]], whereas tail is not specified at all, so it yields None:

>>> pent.Parser(head="#.+i", body="#.+i #!..d").capture_struct(text)
[{<ParserField.Head: 'head'>: [[]], <ParserField.Body: 'body'>: [['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], <ParserField.Tail: 'tail'>: None}]
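Downstream code may need to tell these cases apart. One way to do so is sketched below, using a plain dict with string keys as a stand-in for one element of the capture_struct() result; describe_field is a hypothetical helper, not a pent function. Note that a bare truthiness test is not enough, since [[]] is a non-empty list.

```python
# Stand-in for one element of a capture_struct() result. String keys are
# used instead of pent.ParserField members so the snippet runs without pent.
match = {
    "head": [[]],
    "body": [["0.000000"], ["0.000000"]],
    "tail": None,
}

def describe_field(value):
    """Classify a head/body/tail entry by whether it exists and captured anything."""
    if value is None:
        return "section not specified"
    if not any(value):
        # Every row is empty, e.g. [[]]: the section matched without captures.
        return "section matched, nothing captured"
    return "section matched, with captures"

summary = {name: describe_field(value) for name, value in match.items()}
print(summary)
```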