Parsing newline delimited file
--------------------------------------------------
Track title: CC P Beethoven - Piano Sonata No 2 in A
--
Chapters
00:00 Parsing Newline Delimited File
01:50 Accepted Answer Score 3
03:08 Answer 2 Score 0
03:57 Answer 3 Score 0
05:15 Thank you
--
Full question
https://stackoverflow.com/questions/3044...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #fileparsing
#avk47
ACCEPTED ANSWER
Score 3
Adding a while True after the initial skipping should definitely work. Of course you have to get all the details right.
You could try to extend the approach you already have, with a nested while loop inside the outer loop. But it may be easier to think about it as a single loop. For each line, there are only three things you might have to do:
- If there is no line, because you're at EOF, break out of the loop, making sure to process the old data (the last block in the file) first if there was one.
- If it's a blank line, start a new data list, making sure to process the old data first if there was one.
- Otherwise, append to the existing data list.
So:
with open(loc, 'r') as f:
    # skip the 16-line header
    for i in range(16):
        f.readline()
    data = []
    while True:
        line = f.readline()
        if not line:                 # EOF: process the last block, then stop
            if data:
                function_call(data)
            break
        if line == "\n":             # blank line: the current block is complete
            if data:
                function_call(data)
            data = []
        else:                        # otherwise keep collecting the block
            data.append(line)
There are a couple of ways you could simplify this further:
- Use a for line in f: loop instead of a while loop that repeatedly does f.readline() and checks it (see the sketch after this list).
- Use groupby to transform the iterator of lines into an iterator of blank-line-separated groups of lines.
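For instance, the first simplification might look like the following sketch, reusing the loc and function_call names from the snippet above:

with open(loc, 'r') as f:
    for i in range(16):          # skip the 16-line header as before
        f.readline()
    data = []
    for line in f:               # iterate over the remaining lines directly
        if line == "\n":         # a blank line ends the current block
            if data:
                function_call(data)
            data = []
        else:
            data.append(line)
    if data:                     # process a final block that is not
        function_call(data)      # followed by a blank line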
ANSWER 2
Score 0
In case you are still struggling with this, here is an implementation that reads your sample data using itertools.groupby() and a key function, search():
from itertools import groupby, repeat

def search(d):
    """Key function used to group our dataset"""
    return d[0] == "\n"

def read_data(filename):
    """Read data from filename and return a nicer data structure"""
    data = []
    with open(filename, "r") as f:
        # skip the first 16 lines
        for _ in repeat(None, 16):
            f.readline()
        # iterate through each data block
        for newblock, records in groupby(f, search):
            if newblock:
                # we've found a new block: start a new row of data
                data.append([])
            else:
                # we've found data for the current block:
                # add each record to the last block
                for row in records:
                    row = row.strip().split()
                    data[-1].append(row)
    return data
This will result in a data structure that is a nested list of blocks. Each sublist is separated by the \n grouping in your data file.
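A rough usage sketch (assuming the file is named parsing_test_file.txt and that, as the 16-line skip implies, a blank line separates the header from the first block):

blocks = read_data("parsing_test_file.txt")
print(len(blocks))          # number of blank-line-separated blocks
for row in blocks[0]:       # rows of the first block, already split into fields
    print(row)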
ANSWER 3
Score 0
The pattern of blocks in your file is that they consist of groups of lines terminated by either a blank line or the end of the file. This logic can be encapsulated in a generator function that yields the blocks of lines from your file iteratively, which simplifies the rest of the script.
In the following, getlines() is the generator function. Also note that the first 17 lines of the file are skipped to get to the beginning of the first block.
from pprint import pformat

loc = 'parsing_test_file.txt'

def function(lines):
    print('function called with:\n{}'.format(pformat(lines)))

def getlines(f):
    lines = []
    while True:
        try:
            line = next(f)
            if line != '\n':  # not the end of the block?
                lines.append(line)
            else:
                yield lines
                lines = []
        except StopIteration:  # end of file
            if lines:
                yield lines
            break

with open(loc, 'r') as f:
    for i in range(17):
        next(f)
    for lines in getlines(f):
        function(lines)

print('done')
Output using your test file:
function called with:
['MoreInfo MoreInfo MoreInfo MoreInfo MoreInfo\n',
'MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2\n',
'MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3\n',
'MoreInfo4 MoreInfo4\n',
'FieldName1 0001 0001\n',
'FieldName1 0002 0002\n',
'FieldName1 0003 0003\n',
'FieldName1 0004 0004\n',
'FieldName1 0005 0005\n',
'FieldName2 0001 0001\n',
'FieldName3 0001 0001\n',
'FieldName4 0001 0001\n',
'FieldName5 0001 0001\n',
'FieldName6 0001 0001\n']
function called with:
['MoreInfo MoreInfo MoreInfo MoreInfo MoreInfo\n',
'MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2 MoreInfo2\n',
'MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3 MoreInfo3\n',
'MoreInfo4 MoreInfo4\n',
'FieldName1 0001 0001\n',
'FieldName1 0002 0002\n',
'FieldName1 0003 0003\n',
'FieldName1 0004 0004\n',
'FieldName1 0005 0005\n',
'FieldName2 0001 0001\n',
'FieldName3 0001 0001\n',
'FieldName4 0001 0001\n',
'FieldName5 0001 0001\n',
'FieldName6 0001 0001\n']
done