Parsing newline delimited file
--------------------------------------------------
Track title: CC P Beethoven - Piano Sonata No 2 in A
--
Chapters
00:00 Parsing Newline Delimited File
01:50 Accepted Answer Score 3
03:08 Answer 2 Score 0
03:57 Answer 3 Score 0
05:15 Thank you
--
Full question
https://stackoverflow.com/questions/3044...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #fileparsing
#avk47
ACCEPTED ANSWER
Score 3
Adding a while True after the initial skipping should definitely work. Of course you have to get all the details right.
You could try to extend the approach you already have, with a nested while loop inside the outer loop. But it may be easier to think about it as a single loop. For each line, there are only three things you might have to do:
- If there is no line, because you're at EOF, break out of the loop, making sure to process the old data (the last block in the file) first if there was one.
- If it's a blank line, start a new data list, making sure to process the old data first if there was one.
- Otherwise, append to the existing data list.
So:
with open(loc, 'r') as f:
    # skip the 16-line header
    for i in range(16):
        f.readline()
    data = []
    while True:
        line = f.readline()
        if not line:                 # EOF: process the last block, then stop
            if data:
                function_call(data)
            break
        if line == "\n":             # blank line: the current block is complete
            if data:
                function_call(data)
            data = []
        else:                        # otherwise keep collecting the block
            data.append(line)
There are a couple of ways you could simplify this further:
- Use a for line in f: loop instead of a while loop that repeatedly does f.readline() and checks it (see the sketch after this list).
- Use groupby to transform the iterator of lines into an iterator of blank-line-separated groups of lines.
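For instance, the first simplification might look like the following sketch, reusing the loc and function_call names from the snippet above:

with open(loc, 'r') as f:
    for i in range(16):          # skip the 16-line header as before
        f.readline()
    data = []
    for line in f:               # iterate over the remaining lines directly
        if line == "\n":         # a blank line ends the current block
            if data:
                function_call(data)
            data = []
        else:
            data.append(line)
    if data:                     # process a final block that is not
        function_call(data)      # followed by a blank line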
ANSWER 2
Score 0
In case you are still struggling with this, here is an implementation that reads your sample data using itertools.groupby() and a key function, search():
from itertools import groupby, repeat

def search(d):
    """Key function used to group our dataset"""
    return d[0] == "\n"

def read_data(filename):
    """Read data from filename and return a nicer data structure"""
    data = []
    with open(filename, "r") as f:
        # skip the first 16 lines
        for _ in repeat(None, 16):
            f.readline()
        # iterate through each data block
        for newblock, records in groupby(f, search):
            if newblock:
                # we've found a new block: start a new row of data
                data.append([])
            else:
                # we've found data for the current block:
                # add each record to the last block
                for row in records:
                    row = row.strip().split()
                    data[-1].append(row)
    return data
This will result in a data structure that is a nested list of blocks. Each sublist is separated by the \n grouping in your data file.
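A rough usage sketch (assuming the file is named parsing_test_file.txt and that, as the 16-line skip implies, a blank line separates the header from the first block):

blocks = read_data("parsing_test_file.txt")
print(len(blocks))          # number of blank-line-separated blocks
for row in blocks[0]:       # rows of the first block, already split into fields
    print(row)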
ANSWER 3
Score 0
The pattern of blocks in your file is that they consist of groups of lines terminated by either a blank line or the end of the file. This logic can be encapsulated in a generator function that yields the blocks of lines from your file iteratively, which simplifies the rest of the script.
In the following, getlines() is the generator function. Also note that the first 17 lines of the file are skipped to get to the beginning of the first block.
from pprint import pformat

loc = 'parsing_test_file.txt'

def function(lines):
    print('function called with:\n{}'.format(pformat(lines)))

def getlines(f):
    lines = []
    while True:
        try:
            line = next(f)
            if line != '\n':  # not the end of the block?
                lines.append(line)
            else:
                yield lines
                lines = []
        except StopIteration:  # end of file
            if lines:
                yield lines
            break

with open(loc, 'r') as f:
    for i in range(17):
        next(f)
    for lines in getlines(f):
        function(lines)

print('done')
Output using your test file:
function called with:
['MoreInfo MoreInfo MoreInfo MoreInfo MoreInfo\n',
'MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2\n',
'MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3\n',
'MoreInfo4 MoreInfo4\n',
'FieldName1 0001 0001\n',
'FieldName1 0002 0002\n',
'FieldName1 0003 0003\n',
'FieldName1 0004 0004\n',
'FieldName1 0005 0005\n',
'FieldName2 0001 0001\n',
'FieldName3 0001 0001\n',
'FieldName4 0001 0001\n',
'FieldName5 0001 0001\n',
'FieldName6 0001 0001\n']
function called with:
['MoreInfo MoreInfo MoreInfo MoreInfo MoreInfo\n',
'MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2\n',
'MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3\n',
'MoreInfo4 MoreInfo4\n',
'FieldName1 0001 0001\n',
'FieldName1 0002 0002\n',
'FieldName1 0003 0003\n',
'FieldName1 0004 0004\n',
'FieldName1 0005 0005\n',
'FieldName2 0001 0001\n',
'FieldName3 0001 0001\n',
'FieldName4 0001 0001\n',
'FieldName5 0001 0001\n',
'FieldName6 0001 0001\n']
done