Skip to content Skip to sidebar Skip to footer

Mapping A List To A Huffman Tree Whilst Preserving Relative Order

I'm having an issue with a search algorithm over a Huffman tree: for a given probability distribution I need the Huffman tree to be identical regardless of permutations of the inpu

Solution 1:

The usual practice is to use Huffman's algorithm only to generate the code lengths. Then a canonical process is used to generate the codes from the lengths. The tree is discarded. Codes are assigned in order from shorter codes to longer codes, and within a code, the symbols are sorted. This gives the codes you are expecting, a = 00, b = 01, etc. This is called a Canonical Huffman code.

The main reason this is done is to make the transmission of the Huffman code more compact. Instead of sending the code for each symbol along with the compressed data, you only need to send the code length for each symbol. Then the codes can be reconstructed on the other end for decompression.

A Huffman tree is not normally used for decoding either. With a canonical code, simple comparisons to determine the length of the next code, and an index using the code value will take you directly to the symbol. Or a table-driven approach can avoid the search for the length.

As for your tree, there are arbitrary choices being made when there are equal frequencies. In particular, on the second step the first node pulled is c with probability 0.2, and the second node pulled is b with probability 0.25. However it would have been equally valid to pull, instead of b, the node that was made in the first step, (e,d), whose probability is also 0.25. In fact that is what you'd prefer for your desired end state. Alas, you have relinquished the control of that arbitrary choice to the heapq library.

(Note: since you are using floating point values, 0.1 + 0.15 is not necessarily exactly equal to 0.25. Though it turns out it is. As another example, 0.1 + 0.2 is not equal to 0.3. You would be better off using integers for the frequencies if you want to see what happens when sums of frequencies are equal to other frequencies or sums of frequencies. E.g. 6,5,4,3,2.)

Some of the wrong ordering can be fixed by fixing some mistakes: change lo.merge(high) to hi.merge(lo), and reverse the order of the bits to: assign_code(node.left, code+'1') followed by assign_code(node.right, code+'0'). Then at least a gets assigned 00 and d is before e and b is before c. The ordering is then adebc.

Now that I think about it, even if you pick (e,d) over b, e.g by setting the probability of b to 0.251, you still don't get the complete order that you're after. No matter what, the probability of (e,d) (0.25) is greater than the probability of c (0.2). So even in that case, the final ordering would be (with the fixes above) abdec instead of your desired abcde. So it is not possible to get what you want assuming a consistent tree ordering and bit assignment with respect to the probabilities of groups of symbols. E.g., assuming that for each branch the stuff on the left has a greater or equal probability than the stuff on the right, and 0 is always assigned to left and 1 is always assigned to right. You would need to do something different.

The different thing that comes to mind is what I said at the start of the answer. Use the Huffman algorithm just to get the code lengths. Then you can assign the codes to the symbols in whatever order you like, and build a new tree. That would be much easier than trying to come up with some sort of scheme to coerce the original tree to be what you want, and proving that that works in all cases.

Solution 2:

I'll flesh out what Mark Adler said with working code. Everything he said is right :-) The high points:

  1. You must not use floating-point weights, or any other scheme that loses information about weights. Use integers. Simple and correct. If, e.g., you have 3-digit floating probabilities, convert each to an integer via int(round(the_probability * 1000)), then maybe fiddle them to ensure the sum is exactly 1000.
  2. heapq heaps are not "stable": nothing is defined about which item is popped if multiple items have the same minimal weight.
  3. So you can't get what you want while building the tree.
  4. A small variation of "canonical Huffman codes" appears to be what you do want. Constructing a tree for that is a long-winded process, but each step is straightforward enough. The first tree built is thrown away: the only information taken from it is the lengths of the codes assigned to each symbol.

Running:

syms = ['a','b','c','d','e']
weights = [30, 25, 20, 15, 10]
t = encode(syms, weights)
print t

prints this (formatted for readability):

[abcde,,
    ([ab,0,
        ([a,00,(None),(None)]),
        ([b,01,(None),(None)])]),
    ([cde,1,
        ([c,10,(None),(None)]),
        ([de,11,
            ([d,110,(None),(None)]),
            ([e,111,(None),(None)])])])]

Best I understand, that's exactly what you want. Complain if it isn't ;-)

EDIT: there was a bug in the assignment of canonical codes, which didn't show up unless weights were very different; fixed it.

classNode(object):
    def__init__(self, data=None, weight=None,
                       left=None, right=None,
                       code=None):
        self.data = data
        self.weight = weight
        self.left = left
        self.right = right
        self.code = code

    defis_symbol(self):
        return self.left is self.right isNonedef__repr__(self):
        return"[%s,%s,(%s),(%s)]" % (self.data,
                                      self.code,
                                      self.left,
                                      self.right)

    def__cmp__(self, a):
        return cmp(self.weight, a.weight)

defencode(syms, weights):
    from heapq import heapify, heappush, heappop

    tree = [Node(data=s, weight=w)
            for s, w inzip(syms, weights)]
    sym2node = {s.data: s for s in tree}
    heapify(tree)
    whilelen(tree) > 1:
        a, b = heappop(tree), heappop(tree)
        heappush(tree, Node(weight=a.weight + b.weight,
                            left=a, right=b))

    # Compute code lengths for the canonical coding.
    sym2len = {}
    defassign_codelen(node, codelen):
        if node isnotNone:
            if node.is_symbol():
                sym2len[node.data] = codelen
            else:
                assign_codelen(node.left, codelen + 1)
                assign_codelen(node.right, codelen + 1)
    assign_codelen(tree[0], 0)

    # Create canonical codes, but with a twist:  instead# of ordering symbols alphabetically, order them by# their position in the `syms` list.# Construct a list of (codelen, index, symbol) triples.# `index` breaks ties so that symbols with the same# code length retain their original ordering.
    triples = [(sym2len[name], i, name)
                for i, name inenumerate(syms)]
    code = oldcodelen = 0for codelen, _, name insorted(triples):
        if codelen > oldcodelen:
            code <<= (codelen - oldcodelen)
        sym2node[name].code = format(code, "0%db" % codelen)
        code += 1
        oldcodelen = codelen

    # Create a tree corresponding to the new codes.
    tree = Node(code="")
    dir2attr = {"0": "left", "1": "right"}
    for snode in sym2node.values():
        scode = snode.code
        codesofar = ""
        parent = tree
        # Walk the tree creating any needed interior nodes.for d in scode:
            assert parent isnotNone
            codesofar += d
            attr = dir2attr[d]
            child = getattr(parent, attr)
            if codesofar == scode:
                # We're at the leaf position.assert child isNonesetattr(parent, attr, snode)
            elif child isnotNone:
                assert child.code == codesofar
            else:
                child = Node(code=codesofar)
                setattr(parent, attr, child)
            parent = child

    # Finally, paste the `data` attributes together up# the tree.  Why?  Don't know ;-)defpaste(node):
        if node isNone:
            return""elif node.is_symbol():
            return node.data
        else:
            result = paste(node.left) + paste(node.right)
            node.data = result
            return result
    paste(tree)

    return tree

Duplicate symbols

Could I swap the sym2node dict to an ordereddict to deal with repeated 'a'/'b's etc?

No, and for two reasons:

  1. No mapping type supports duplicate keys; and,
  2. The concept of "duplicate symbols" makes no sense for Huffman encoding.

So, if you're determined ;-) to pursue this, first you have to ensure that symbols are unique. Just add this line at the start of the function:

syms = list(enumerate(syms))

For example, if the syms passed in is:

['a', 'b', 'a']

that will change to:

[(0, 'a'), (1, 'b'), (2, 'a')]

All symbols are now 2-tuples, and are obviously unique since each starts with a unique integer. The only thing the algorithm cares about is that symbols can be used as dict keys; it couldn't care less whether they're strings, tuples, or any other hashable type that supports equality testing.

So nothing in the algorithm needs to change. But before the end, we'll want to restore the original symbols. Just insert this before the paste() function:

defrestore_syms(node):
        if node isNone:
            returnelif node.is_symbol():
            node.data = node.data[1]
        else:
            restore_syms(node.left)
            restore_syms(node.right)
    restore_syms(tree)

That simply walks the tree and strips the leading integers off the symbols' .data members. Or, perhaps simpler, just iterate over sym2node.values(), and transform the .data member of each.

Post a Comment for "Mapping A List To A Huffman Tree Whilst Preserving Relative Order"