The Misc Token

Sometimes, data is laid out in text in a fashion where it cannot be matched using only numerical values. Either some elements of the data of interest are themselves non-numeric, or there are non-numeric portions of content interspersed with the numeric data of interest. pent provides the “misc” token (&) to handle these kinds of situations.

Take the following data, which is an example of the XYZ format for representing the atomic coordinates of a chemical system:

>>> text_xyz = dedent("""
... 5
... Coordinates from MeCl2F_2
...   C      -3.081564      2.283942      0.044943
...   Cl     -1.303141      2.255173      0.064645
...   Cl     -3.706406      3.411601     -1.180577
...   F      -3.541771      2.647036      1.270358
...   H      -3.439068      1.277858     -0.199370
... """)

In this case, pretty much everything in the text block is of interest. The first number indicates how many atoms are present (useful for cross-checking the data import), the line of text is an arbitrary string describing the chemical system, and the data block provides the atomic symbol of each atom and its xyz position in space.

The following Parser will enable capture of the entire contents of the string:

>>> prs_xyz = pent.Parser(
...     head=("#!..i", "~!"),
...     body="&!. #!+.d",
... )

The atomic symbols and coordinates are most easily retrieved with capture_body():

>>> data_atoms = prs_xyz.capture_body(text_xyz)
>>> data_atoms
[[['C', '-3.081564', '2.283942', '0.044943'], ['Cl', '-1.303141', '2.255173', '0.064645'], ['Cl', '-3.706406', '3.411601', '-1.180577'], ['F', '-3.541771', '2.647036', '1.270358'], ['H', '-3.439068', '1.277858', '-0.199370']]]

The atom count and description can be retrieved with capture_struct():

>>> data_struct = prs_xyz.capture_struct(text_xyz)
>>> data_struct[pent.ParserField.Head][0]
['5', 'Coordinates', 'from', 'MeCl2F_2']

Unlike in body, where two-dimensional structure is inferred in captured data, in head and tail all captures are returned as elements of a single, flat list.

Currently, it is not possible to avoid the splitting of all captured content at whitespace, even it it was captured from a single ‘any’ or ‘literal’ token. #26 and/or #62 are planned and will provide mechanism(s) to change this behavior.


As an aside, in this particular case the ‘misc’ token was not strictly necessary in the body, as the capturing ‘any’ token (~!) would also have worked:

>>> prs_any = pent.Parser(
...     head=("#.+i", "~"),
...     body="~! #!+.d",
... )
>>> prs_any.capture_body(text_xyz)
[[['C', '-3.081564', '2.283942', '0.044943'], ['Cl', '-1.303141', '2.255173', '0.064645'], ['Cl', '-3.706406', '3.411601', '-1.180577'], ['F', '-3.541771', '2.647036', '1.270358'], ['H', '-3.439068', '1.277858', '-0.199370']]]

However, there are situations where the ability of the ‘misc’ token to match only a single, arbitrary piece of whitespace-delimited content is useful in order to narrow the specificity of the Parser match.

Another example of the use of the ‘misc’ token is given at *Post-Processing of Captured Data.