Monday, 13 February 2017

Extracting the longest possible sequence between a set of potential breakpoints


I'm trying to parse a messy itemized document into its relevant items by figuring out where a set of breakpoints should go. If the subgroups are labelled alphabetically, then a regex search of the individual lines might yield a sequence like this:

sequence = ('        SECTION X.X','a','i','v','x','a','               Section','b','c','d','e','f','g','h','i','j','i','k','l','i','m','n','               Section','o','i','A','B','C','        SECTION X.X','a','i','v','b','c','d','e','f','g','h','i','i','j','k','l','m','n','o','                                   SECTION')
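For context, a sequence like this could come from a scan along the following lines. This is only a sketch: the two patterns assume lettered items are written as "(a) ..." and headings start with the word "Section", and the names ITEM, HEADING and candidate are made up for illustration.

```python
import re

# Assumed line formats: "(a) text..." for lettered items, "Section ..." for headings.
ITEM = re.compile(r'^\s*\((\w+)\)')
HEADING = re.compile(r'^\s*section\b.*', re.IGNORECASE)

def candidate(line):
    """Return the label or heading that starts the line, or None if neither matches."""
    item = ITEM.match(line)
    if item:
        return item.group(1)
    heading = HEADING.match(line)
    return heading.group(0) if heading else None

lines = ['    SECTION 1.1', '(a) First item', 'plain text',
         '(i) see Section 2', '(b) Second item']
found = [c for c in map(candidate, lines) if c is not None]
print(found)  # ['    SECTION 1.1', 'a', 'i', 'b']
```

The '(i)' line is exactly the kind of false positive described below: it starts a line but is not part of the a, b, c run.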

The goal is to identify the true sequence of level 1 and level 2 breakpoints, assuming the valid breakpoints are true Section headings and alphabetical runs from a to z. Note that an 'i', 'v', 'B', any other out-of-sequence letter, or a stray Section in this list is actually invalid: each one is either a false positive (i.e. it began the line, but was really just an inline reference to something else) or comes from a subsubsection, which I want to ignore.

What I attempted to do was loop over the items and ask whether the item is first in the index (in this case ascii_lowercase) and whether the previous entry is either non-existent or more than one letter long; if so, I append its location (and that of the preceding heading) to the list. If it doesn't meet these criteria, I append the item if it is next in line (reluctantly, unless it is 'i', in which case greedily).

def prevl(mylist, n, l=1):
    # Return the item l places before position n, or None if there is none
    if n >= l:
        p = mylist[n-l]
    else:
        p = None
    return p

def next_in_line(item, prev_item, item_index, n=1):
    # True if item sits exactly n places after prev_item in item_index
    try:
        if item not in item_index or prev_item not in item_index:
            isnext = False
        else:
            isnext = item_index.index(item) == item_index.index(prev_item) + n
    except (TypeError, ValueError):
        isnext = False
    return isnext

def index_start(item, item_index):
    # True if item is the first entry of item_index
    try:
        if item not in item_index:
            isstart = False
        else:
            isstart = item_index.index(item) == 0
    except (TypeError, ValueError):
        isstart = False
    return isstart

def cps(seq, rindex):
    cp_seq = []
    for n in range(len(seq)):
        if index_start(seq[n], rindex):
            prev = prevl(seq, n)
            # Start a new run only if there is no previous entry or it is more than one letter
            if prev is None or len(prev) > 1:
                if n > 0:
                    cp_seq.append(n-1)    # Keep the heading that precedes the run
                cp_seq.append(n)
        elif len(cp_seq) > 0:
            if next_in_line(seq[n], seq[cp_seq[-1]], rindex):
                cp_seq.append(n)      # If the letter is next in the sequence, append it
            elif seq[cp_seq[-1]] == 'i' and seq[n] == 'i':   # Greedily replace (i) with last appearance of (i) before (j)
                cp_seq.pop()
                cp_seq.append(n)

    return cp_seq

So calling

from string import ascii_lowercase
cps(sequence,ascii_lowercase)

results in [0, 1, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 20, 21, 23, 28, 29, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 46]

While this mostly works, it seems kludgy as hell. Whenever I write a test that looks forwards or backwards, I keep running into IndexErrors or TypeErrors, so I end up writing all of these subfunctions with try blocks that yield the boolean outcome I want. This feels wrong. It also seems weird to have to test the length of a list or a slice value in a nested if statement before asking about a list item condition, instead of doing both in a single branch. I keep screwing up my conditional logic tree when I do that.
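To make the "single branch" idea concrete, here is a minimal sketch, not a settled design. It relies on Python's `and`/`or` short-circuiting, so the existence check and the length check can share one line, and it pairs every item with its predecessor up front so nothing is ever indexed backwards. The names with_previous and starts_after_heading are hypothetical.

```python
from string import ascii_lowercase

def with_previous(seq):
    """Yield (position, previous_item, item); previous_item is None at position 0."""
    previous = None
    for n, item in enumerate(seq):
        yield n, previous, item
        previous = item

def starts_after_heading(previous, item, index=ascii_lowercase):
    """True if item is the first label of index and follows nothing or a multi-character line."""
    # `or` short-circuits, so len() is never called when previous is None
    return item == index[0] and (previous is None or len(previous) > 1)

sample = ('SECTION X.X', 'a', 'i', 'x', 'a', 'Section', 'b')
starts = [n for n, prev, item in with_previous(sample)
          if starts_after_heading(prev, item)]
print(starts)  # [1]
```

The second 'a' (position 4) is rejected because it follows the single letter 'x', which is the same rule the nested if in cps applies.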

Is there a better, more Pythonic way to accomplish this goal?

This function is a prototype for a more general function, where hopefully I can test multiple indices of subsections together (uppercase ASCII, Roman numerals, ranges of section numbers such as "3.1", "3.2", etc.).
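A rough sketch of what that general version might look like, assuming every index can be written down as an ordered list of labels. LABELS, LOOKUPS, make_lookup and follows are hypothetical names, and the short Roman-numeral and "3.x" lists are only sample data.

```python
from string import ascii_lowercase, ascii_uppercase

def make_lookup(labels):
    """Map each label to its position so membership and order are one dict lookup."""
    return {label: pos for pos, label in enumerate(labels)}

# Hypothetical registry of orderings; each value is an ordered list of labels.
LABELS = {
    'lower': list(ascii_lowercase),
    'upper': list(ascii_uppercase),
    'roman': ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x'],
    'numbered': ['3.{}'.format(k) for k in range(1, 10)],
}
LOOKUPS = {name: make_lookup(labels) for name, labels in LABELS.items()}

def follows(item, prev_item, lookup, step=1):
    """True if item comes exactly `step` places after prev_item in the ordering."""
    here, before = lookup.get(item), lookup.get(prev_item)
    # dict.get returns None for unknown labels (including None itself), so no try block is needed
    return here is not None and before is not None and here == before + step

print(follows('b', 'a', LOOKUPS['lower']))           # True
print(follows('iv', 'iii', LOOKUPS['roman']))        # True
print(follows('3.2', '3.1', LOOKUPS['numbered']))    # True
print(follows('i', 'Section', LOOKUPS['lower']))     # False
```

Using a dict per index instead of str.index also sidesteps the substring behaviour of `in` on strings, where a multi-letter item such as 'ab' would count as being "in" ascii_lowercase.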

