Capturing with Nested Parsers
pent is also able to parse and capture higher-dimensional data
stored as free text. Take the following data string:
>>> from textwrap import dedent
>>> import numpy as np
>>> import pent
>>> text = dedent("""\
... $hessian
... 4
... 0 1
... 0 0.473532 0.004379
... 1 0.004785 0.028807
... 2 0.004785 -0.022335
... 3 -0.418007 0.008333
... 2 3
... 0 0.004379 -0.416666
... 1 -0.022335 0.008067
... 2 0.028807 0.008067
... 3 0.008333 0.420926
... """)
text represents a 4x4 matrix, with the first two columns printed in one section,
and the second two columns printed in a separate, following section.
Each row and column is marked with its respective index.
In order to import this data successfully, the body of the main Parser
will have to be set to a different, inner Parser.
Defining the Inner Parser
Each section of data columns starts with a row containing only positive integers, which does not need to be captured. After that leading row are multiple rows with data, each of which leads with a single positive integer, followed by decimal-format data of any sign:
>>> text_inner = dedent("""\
... 0 1
... 0 0.473532 0.004379
... 1 0.004785 0.028807
... 2 0.004785 -0.022335
... 3 -0.418007 0.008333
... """)
One way to construct a Parser for this internal block is as follows:
>>> prs_inner = pent.Parser(
... head="#++i",
... body="#.+i #!+.d",
... )
>>> prs_inner.capture_body(text_inner)
[[['0.473532', '0.004379'], ['0.004785', '0.028807'], ['0.004785', '-0.022335'], ['-0.418007', '0.008333']]]
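For comparison, the same rows can be picked out with a hand-written regular expression. This is only a rough sketch of the matching rule described above (a leading unsigned integer followed by signed decimals); it is not the pattern pent generates internally, and the names row_pat and captured are hypothetical:

```python
import re
from textwrap import dedent

text_inner = dedent("""\
    0 1
    0 0.473532 0.004379
    1 0.004785 0.028807
    2 0.004785 -0.022335
    3 -0.418007 0.008333
    """)

# A data row: optional indentation, an unsigned integer index, then one or
# more whitespace-separated decimals of either sign. The header row "0 1"
# contains no decimal point, so it does not match.
row_pat = re.compile(r"^[ \t]*\d+((?:[ \t]+[-+]?\d+\.\d+)+)[ \t]*$", re.M)

# Keep only the decimal values of each row, split at whitespace.
captured = [m.group(1).split() for m in row_pat.finditer(text_inner)]
print(captured)
```

The result has one nesting level fewer than the capture_body() output, since nothing here wraps the block in an outer list.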
Note that even though the multiple decimal values in each row of the data block
were matched by the single “#!+.d” token in body,
they were reported as separate values in the output.
As currently implemented, pent will always split captured content
at any internal whitespace; a further example of this with the ‘any’ token
is shown elsewhere in the pent documentation.
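The effect of this splitting can be mimicked in plain Python. The snippet below only illustrates the observed behavior, under the assumption that it resembles str.split(); it is not pent's actual implementation, and the variable names are hypothetical:

```python
# One matched span, as a single "#!+.d" token would see it in a data row.
matched = "0.473532 0.004379"

# Splitting at whitespace yields one string per value, mirroring the
# separate entries reported in the captured output.
values = matched.split()
print(values)
```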
Defining the Outer Parser
The outer Parser then makes use of the inner Parser as its body,
with the two header lines defined in head:
>>> prs_outer = pent.Parser(
... head=("@.$hessian", "#.+i"),
... body=prs_inner,
... )
>>> data = prs_outer.capture_body(text)
>>> data
[[[['0.473532', '0.004379'], ['0.004785', '0.028807'], ['0.004785', '-0.022335'], ['-0.418007', '0.008333']], [['0.004379', '-0.416666'], ['-0.022335', '0.008067'], ['0.028807', '0.008067'], ['0.008333', '0.420926']]]]
Structure of the Returned data
The structure of the list returned by capture_body() nests four levels deep:
>>> arr = np.asarray(data, dtype=float)
>>> arr.shape
(1, 2, 4, 2)
This is because:
Each block of data is returned as a matrix (adds two levels);
The body of prs_outer is a Parser (adds one level); and
The capture_body() method wraps everything in a list (adds one level).
So, working from left to right, the (1, 2, 4, 2) shape of the data arises because:
The overall prs_outer matched 1 time;
The inner prs_inner, as the body of prs_outer, matched 2 times; and
Both blocks of data matched by prs_inner have 4 rows and 2 columns
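The nesting depth can also be checked without numpy. The helper below, nested_shape, is a hypothetical name introduced for illustration; it assumes the list is rectangular and simply follows the first element at each level. The data literal is copied from the capture_body() output shown earlier:

```python
def nested_shape(obj):
    """Return the length of each nesting level, following the first element."""
    shape = []
    while isinstance(obj, list):
        shape.append(len(obj))
        if not obj:
            break
        obj = obj[0]
    return tuple(shape)


data = [[[['0.473532', '0.004379'], ['0.004785', '0.028807'],
          ['0.004785', '-0.022335'], ['-0.418007', '0.008333']],
         [['0.004379', '-0.416666'], ['-0.022335', '0.008067'],
          ['0.028807', '0.008067'], ['0.008333', '0.420926']]]]

shape = nested_shape(data)
print(shape)  # (1, 2, 4, 2)
```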
Reassembling the Full 4x4 Matrix
In cases like this, numpy’s column_stack() provides
a simple way to reassemble the full 4x4 matrix of data, though
it is necessary to convert each matrix to an ndarray separately:
>>> np.column_stack([np.asarray(block, dtype=float) for block in data[0]])
array([[ 0.473532, 0.004379, 0.004379, -0.416666],
[ 0.004785, 0.028807, -0.022335, 0.008067],
[ 0.004785, -0.022335, 0.028807, 0.008067],
[-0.418007, 0.008333, 0.008333, 0.420926]])
data[0] is used instead of data in the list comprehension
so that the two inner 4x2 blocks of data are passed separately to asarray().
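To make the column-wise joining explicit, the same reassembly can be sketched in plain Python by pairing up corresponding rows of the two blocks. This is an illustrative equivalent, not how numpy implements column_stack(); blocks stands in for data[0], and full is a hypothetical name:

```python
# The two 4x2 blocks, i.e. data[0] from the outer Parser's capture.
blocks = [
    [['0.473532', '0.004379'], ['0.004785', '0.028807'],
     ['0.004785', '-0.022335'], ['-0.418007', '0.008333']],
    [['0.004379', '-0.416666'], ['-0.022335', '0.008067'],
     ['0.028807', '0.008067'], ['0.008333', '0.420926']],
]

# zip(*blocks) yields row i of every block together; concatenating those
# partial rows rebuilds row i of the full matrix.
full = [
    [float(value) for part in row_parts for value in part]
    for row_parts in zip(*blocks)
]

for row in full:
    print(row)
```

Note that zip() stops at the shortest block, so this approach silently drops rows when the blocks differ in length instead of raising an error.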
Coping with Mismatched Data Block Sizes
Nothing guarantees that the data in a chunk of text will have properly matched
internal dimensions, however. pent will still import the data, but
it may not be possible to pull it directly into a numpy array
as was done above:
>>> text2 = dedent("""\
... $hessian
... 4
... 0 1
... 0 0.473532 0.004379
... 1 0.004785 0.028807
... 2 0.004785 -0.022335
... 3 -0.418007 0.008333
... 2 3
... 0 0.004379 -0.416666
... 1 -0.022335 0.008067
... """)
>>> data2 = prs_outer.capture_body(text2)
>>> data2
[[[['0.473532', '0.004379'], ['0.004785', '0.028807'], ['0.004785', '-0.022335'], ['-0.418007', '0.008333']], [['0.004379', '-0.416666'], ['-0.022335', '0.008067']]]]
>>> np.asarray(data2, dtype=float)
Traceback (most recent call last):
...
ValueError: setting an array element with a sequence.
>>> np.column_stack([np.asarray(block, dtype=float) for block in data2[0]])
Traceback (most recent call last):
...
ValueError: all the input array dimensions except for the concatenation axis must match exactly
In situations like this, the returned data structure either must be processed
with methods that can accommodate the missing data, or the missing data must be explicitly
filled in before conversion to ndarray.
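One possible way to fill in the missing data is to pad each short block with NaN rows before conversion. The sketch below is an assumption about what "filling in" might look like, not a pent feature; pad_blocks is a hypothetical helper, and it assumes every block is non-empty with rows of uniform width. The blocks literal is copied from data2[0] above:

```python
import math


def pad_blocks(blocks, fill=math.nan):
    """Convert each block to floats and pad short blocks with rows of `fill`."""
    n_rows = max(len(block) for block in blocks)
    padded = []
    for block in blocks:
        n_cols = len(block[0])
        rows = [[float(value) for value in row] for row in block]
        # Append as many filler rows as this block is missing.
        rows.extend([fill] * n_cols for _ in range(n_rows - len(rows)))
        padded.append(rows)
    return padded


# The mismatched blocks, i.e. data2[0]: 4 rows in the first, 2 in the second.
blocks = [
    [['0.473532', '0.004379'], ['0.004785', '0.028807'],
     ['0.004785', '-0.022335'], ['-0.418007', '0.008333']],
    [['0.004379', '-0.416666'], ['-0.022335', '0.008067']],
]

padded = pad_blocks(blocks)
print([len(block) for block in padded])  # [4, 4]
```

With every block padded to the same number of rows, the result should be suitable for column_stack(); the NaN entries then mark where data was absent in the source text.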