From a64a1a46e482664dcebdf4fee0745a890d0d70dc Mon Sep 17 00:00:00 2001 From: Jean Chalard Date: Tue, 24 Apr 2012 12:13:22 +0900 Subject: [PATCH] Fix a bug where a node size would be seen as increasing. The core reason for this is quite shrewd. When a word is a bigram of itself, the corresponding chargroup will have a bigram referring to itself. When computing bigram offsets, we use cached addresses of chargroups, but we compute the size of the node as we go. Hence, a discrepancy may happen between the base offset as seen by the bigram (which uses the recomputed value) and the target offset (which uses the cached value). When this happens, the cached node address is too large. The relative offset is negative, which is expected, since it points to this very charnode whose start is a few bytes earlier. But since the cached address is too large, the offset is computed as smaller than it should be. On the next pass, the cache has been refreshed with the newly computed size and the seen offset is now correct (or at least, much closer to correct). The correct value is larger than the previously computed offset, which was too small. If it happens that it crosses the -255 or -65335 boundary, the address will be seen as needing 1 more byte than previously computed. If this is the only change in size of this node, the node will be seen as having a larger size than previously, which is unexpected. Debug code was catching this and crashing the program. So this case is very rare, but in an even rarer occurence, it may happen that in the same node, another chargroup happens to decrease it size by the same amount. In this case, the node may be seen as having not been modified. This is probably extremely rare. If on top of this, it happens that no other node has been modified, then the file may be seen as complete, and the discrepancy left as is in the file, leading to a broken file. The probability that this happens is abyssally low, but the bug exists, and the current debug code would not have caught this. To further catch similar bugs, this change also modifies the test that decides if the node has changed. On grounds that all components of a node may only decrease in size with each successive pass, it's theoritically safe to assume that the same size means the node contents have not changed, but in case of a bug like the bug above where a component wrongly grows while another shrinks and both cancel each other out, the new code will catch this. Also, this change adds a check against the number of passses, to avoid infinite loops in case of a bug in the computation code. This change fixes this bug by updating the cached address of each chargroup as we go. This eliminates the discrepancy and fixes the bug. Bug: 6383103 Change-Id: Ia3f450e22c87c4c193cea8ddb157aebd5f224f01 --- .../latin/makedict/BinaryDictInputOutput.java | 30 +++++++++++++++---- 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/java/src/com/android/inputmethod/latin/makedict/BinaryDictInputOutput.java b/java/src/com/android/inputmethod/latin/makedict/BinaryDictInputOutput.java index 97df98e34..b1bee4e65 100644 --- a/java/src/com/android/inputmethod/latin/makedict/BinaryDictInputOutput.java +++ b/java/src/com/android/inputmethod/latin/makedict/BinaryDictInputOutput.java @@ -174,6 +174,13 @@ public class BinaryDictInputOutput { private static final int MAX_TERMINAL_FREQUENCY = 255; + // Arbitrary limit to how much passes we consider address size compression should + // terminate in. At the time of this writing, our largest dictionary completes + // compression in five passes. + // If the number of passes exceeds this number, makedict bails with an exception on + // suspicion that a bug might be causing an infinite loop. + private static final int MAX_PASSES = 24; + /** * A class grouping utility function for our specific character encoding. */ @@ -510,14 +517,22 @@ public class BinaryDictInputOutput { * Each node stores its tentative address. During dictionary address computing, these * are not final, but they can be used to compute the node size (the node size depends * on the address of the children because the number of bytes necessary to store an - * address depends on its numeric value. + * address depends on its numeric value. The return value indicates whether the node + * contents (as in, any of the addresses stored in the cache fields) have changed with + * respect to their previous value. * * @param node the node to compute the size of. * @param dict the dictionary in which the word/attributes are to be found. + * @return false if none of the cached addresses inside the node changed, true otherwise. */ - private static void computeActualNodeSize(Node node, FusionDictionary dict) { + private static boolean computeActualNodeSize(Node node, FusionDictionary dict) { + boolean changed = false; int size = getGroupCountSize(node); for (CharGroup group : node.mData) { + if (group.mCachedAddress != node.mCachedAddress + size) { + changed = true; + group.mCachedAddress = node.mCachedAddress + size; + } int groupSize = GROUP_FLAGS_SIZE + getGroupCharactersSize(group); if (group.isTerminal()) groupSize += GROUP_FREQUENCY_SIZE; if (null != group.mChildren) { @@ -538,7 +553,11 @@ public class BinaryDictInputOutput { group.mCachedSize = groupSize; size += groupSize; } - node.mCachedSize = size; + if (node.mCachedSize != size) { + node.mCachedSize = size; + changed = true; + } + return changed; } /** @@ -594,13 +613,14 @@ public class BinaryDictInputOutput { changesDone = false; for (Node n : flatNodes) { final int oldNodeSize = n.mCachedSize; - computeActualNodeSize(n, dict); + final boolean changed = computeActualNodeSize(n, dict); final int newNodeSize = n.mCachedSize; if (oldNodeSize < newNodeSize) throw new RuntimeException("Increased size ?!"); - if (oldNodeSize != newNodeSize) changesDone = true; + changesDone |= changed; } stackNodes(flatNodes); ++passes; + if (passes > MAX_PASSES) throw new RuntimeException("Too many passes - probably a bug"); } while (changesDone); final Node lastNode = flatNodes.get(flatNodes.size() - 1);