There are two problems here. The first one is that the tests would
send an invalid Unicode character. Although we might want dicttool
to handle this more gracefully, it's fine for now.
The second problem is much more serious. If a node has more than
128 children, then the Java code will crash trying to read the
dictionary back because of a bug that this change fixes. In
theory, this can happen when we try to load the user history
dictionary back from the disk. Native code is not affected, so
there is no other point that may cause a problem.
In practice, that means you'd need to have 129 words with a
common prefix (including the empty string) that all differ right
after it. It's almost impossible with Google Keyboard since there are
only so many keys on the keyboard that you can make a word out
of, and then you'd have to do it repeatedly until each word
actually enters the user history dictionary, then wait for it to
get saved to the disk.
The bad news is, if you manage to get this far, the keyboard will
crash every time and won't be able to start again until you clear
data for the package.
The good news is, the dictionary itself is not corrupted and only
the reading code is wrong. So updating to a newer version would
even recover from this situation.
All in all, considering how almost-impossible this is to trigger,
I don't think even a single user actually did hit this bug.
Bug: 8583091
Change-Id: Iabb2a7f47cbd9ed3193d2a3487318d280753e071
Both bugs only affect debug mode. In one, equals is called on the
wrong object; in the other, the iteration fails in some cases.
Change-Id: Ie9100d257a3f9e3be340cf3e38116f63417bdc1a
The important bug is in findWordInTree. The problem, which is
not obvious, is that we were calling codePointAt() with the
code point index in the string, instead of the char index.
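
As an illustration of the difference, here is a minimal Java sketch.
The class and helper names are hypothetical and are not taken from
findWordInTree; the sketch only shows one way to turn a code point
index into a char index with String.offsetByCodePoints() before
calling codePointAt().

```java
final class CodePointIndexDemo {
    // Returns the code point found at the given *code point* index.
    // (Hypothetical helper, for illustration only.)
    static int codePointAtCodePointIndex(String s, int codePointIndex) {
        // Convert the code point index into a char (UTF-16 unit) index first.
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // U+1F600 is outside the BMP, so it takes two chars in a String.
        String word = new StringBuilder().appendCodePoint(0x1F600).append("ab").toString();
        // Wrong: uses the code point index 1 as a char index, which lands
        // in the middle of the surrogate pair.
        System.out.println(Integer.toHexString(word.codePointAt(1)));
        // Right: the second code point of the word is 'a'.
        System.out.println(Integer.toHexString(codePointAtCodePointIndex(word, 1)));
    }
}
```

The two indices only differ once the string contains a character
outside the BMP, which is why this kind of bug is easy to miss.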
The other bug this change fixes was harmless in practice,
because it's in the iteration, which is only used for debug and
pretty-printing purposes. It's very similar in that it would
subtract a length in code points from a length in chars and
truncate a StringBuilder at that length, so it would fail in a
quite similar manner. This changes the meaning of the "length"
attribute in Position, but it's clearer this way anyway.
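
The truncation problem can be sketched the same way. The helper
below is hypothetical and is not the code in this change (which
changes what Position stores instead); it only shows that the two
units must not be mixed when shortening a StringBuilder.

```java
final class TruncateDemo {
    // Removes the last `codePointCount` code points from the builder.
    // (Hypothetical helper, for illustration only.)
    static void dropLastCodePoints(StringBuilder sb, int codePointCount) {
        // Walk backwards by code points to find the char index to cut at,
        // rather than subtracting a code point count from a char length.
        int newLength = sb.offsetByCodePoints(sb.length(), -codePointCount);
        sb.setLength(newLength);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("a").appendCodePoint(0x1F600).appendCodePoint(0x1F601);
        dropLastCodePoints(sb, 1);
        // The builder now holds 'a' followed by U+1F600, i.e. 3 chars.
        System.out.println(sb.length());
    }
}
```

With `sb.setLength(sb.length() - 1)` instead, the builder would be
left with 4 chars ending in an unpaired high surrogate.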
Bug: 8450145
Change-Id: If396f883a9e6449de39351553ba83f5be5bd30f0
The dictionary that we passed to the dictionary pack as an
initial value used to be determined from the locale.
This is wrong: it should be read from the dictionary.
This change fixes that.
Bug: 7005813
Change-Id: Ib08ed31dd9c216f6f7b9c6c3174ca514bf96e06f
The "correct" bigram frequency is now returned by the reading
code. However, as the binary format represents the frequency
in a lossy manner, the frequency is not guaranteed to be the
exact same as the one in the source text format - only a close
enough value. It is however the exact same value seen by the
native code.
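
To make the "lossy" part concrete, here is a small Java sketch. The
4-bit width and the shift-based scheme are assumptions made up for
this illustration; they are not the actual binary format. The point
is only that a value that went through the stored form may differ
from the source value, while every reader that decodes the same
stored bits sees the same result.

```java
final class LossyFrequencyDemo {
    // Hypothetical: frequencies in 0..255 are stored on 4 bits.
    static final int DROPPED_BITS = 4;

    // Keeps only the high bits of the frequency.
    static int encode(int frequency) {
        return frequency >> DROPPED_BITS;
    }

    // Rebuilds an approximate frequency from the stored bits.
    static int decode(int stored) {
        return stored << DROPPED_BITS;
    }

    public static void main(String[] args) {
        int source = 100;
        int stored = encode(source);
        // Prints an approximation of 100, not necessarily 100 itself.
        System.out.println(decode(stored));
    }
}
```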
Bug: 7395653
Change-Id: I49199ef18901c671189912b3550623e9643baedd
This change has actually been extracted from a work-in-progress
change, I4fe423834b8131fb122251892c98228a6e08ba25
Change-Id: I52568fa09da2ea22be7f8bfe9676b7cd73c31fa4
This behaves exactly as the old makedict command. Further
changes will redirect the calls to makedict to this, so as
to consolidate similar code.
Groundwork for
Bug: 6429606
Change-Id: Ibeadbf48bec70f988a15ca36ebf5d1ce3b5b54ea
We don't merge tails anyway, and we can't do it any more
because that would break the bigram lookup algorithm.
The speedup is about 20%, and possibly double this if
there are no bigrams.
Bug: 6394357
Change-Id: I9eec11dda9000451706d280f120404a2acbea304
Rename it, rename parameters, and add a parameter that will
be necessary soon.
Also, rescale the bigram frequency as necessary.
Bug: 6313806
Change-Id: I192543cfb6ab6bccda4a1a53c8e67fbf50a257b0
This is not the Right fix; the Right fix would be to read
the file in a buffered way. However this delivers tolerable
performance for a minimal amount of code changes.
We may want to skip submitting this patch, but keep it around
in case we need to use the functionality until we have a good
patch.
Change-Id: I1ba938f82acfd9436c3701d1078ff981afdbea60
The core reason for this is quite subtle. When a word is a bigram
of itself, the corresponding chargroup will have a bigram referring
to itself. When computing bigram offsets, we use cached addresses of
chargroups, but we compute the size of the node as we go. Hence, a
discrepancy may happen between the base offset as seen by the bigram
(which uses the recomputed value) and the target offset (which uses
the cached value).
When this happens, the cached node address is too large. The relative
offset is negative, which is expected, since it points to this very
charnode whose start is a few bytes earlier. But since the cached
address is too large, the offset is computed as smaller than it should
be.
On the next pass, the cache has been refreshed with the newly computed
size and the seen offset is now correct (or at least, much closer to
correct). The correct value is larger than the previously computed
offset, which was too small. If it happens to cross the -255 or
-65535 boundary, the address will be seen as needing 1 more byte than
previously computed. If this is the only change in size of this node,
the node will be seen as having a larger size than previously, which
is unexpected. Debug code was catching this and crashing the program.
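
The boundary effect can be sketched as follows. The helper name is
hypothetical and the real code may encode addresses differently; the
sketch only assumes, as described above, that the number of bytes
depends on whether the magnitude of the offset fits in 255 or 65535.

```java
final class OffsetSizeDemo {
    // Number of bytes needed for a relative offset, based on its magnitude.
    // (Hypothetical helper, for illustration only.)
    static int bytesForOffset(int offset) {
        int magnitude = Math.abs(offset);
        if (magnitude <= 0xFF) return 1;
        if (magnitude <= 0xFFFF) return 2;
        return 3;
    }

    public static void main(String[] args) {
        // A stale cached address makes the offset look smaller than it is...
        System.out.println(bytesForOffset(-254));
        // ...and the corrected offset on the next pass needs one more byte.
        System.out.println(bytesForOffset(-257));
    }
}
```

A difference of a few bytes in the offset is enough to make the
node grow between two passes when it sits right at a boundary.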
This case is very rare, but in an even rarer occurrence, it may
happen that in the same node, another chargroup happens to decrease
its size by the same amount. In this case, the node may be seen as
not having been modified. This is probably extremely rare. If on
top of this, it happens that no other node has been modified, then
the file may be seen as complete, and the discrepancy left as is
in the file, leading to a broken file. The probability that this
happens is vanishingly low, but the bug exists, and the current debug
code would not have caught this.
To catch similar bugs in the future, this change also modifies the
test that decides if the node has changed. Since all components
of a node may only decrease in size with each successive pass, it's
theoretically safe to assume that the same size means the node
contents have not changed. But in the case of a bug like the one
above, where a component wrongly grows while another shrinks and both
cancel each other out, the new code will catch this. Also, this change
adds a check against the number of passes, to avoid infinite loops in
case of a bug in the computation code.
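
A rough Java sketch of both safeguards follows. The names, the pass
limit and the idea of comparing component by component are
assumptions made for illustration; the actual test and limit used by
this change are not spelled out in this message.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

final class PassLoopDemo {
    // Hypothetical cap on the number of passes.
    static final int MAX_PASSES = 24;

    // Compares every component, so a grow/shrink pair that cancels out
    // in the total size is still reported as a change.
    static boolean hasChanged(int[] before, int[] after) {
        return !Arrays.equals(before, after);
    }

    // Repeats the pass until nothing changes, or gives up after MAX_PASSES
    // instead of looping forever.
    static int[] settle(int[] initial, UnaryOperator<int[]> pass) {
        int[] sizes = initial.clone();
        for (int i = 0; i < MAX_PASSES; ++i) {
            int[] next = pass.apply(sizes);
            if (!hasChanged(sizes, next)) return next;
            sizes = next;
        }
        throw new IllegalStateException("Sizes did not settle after " + MAX_PASSES + " passes");
    }
}
```

Comparing only the sum of the sizes would treat {2, 1} and {1, 2}
as identical; the component-wise comparison does not.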
This change fixes the bug by updating the cached address of each
chargroup as we go, which eliminates the discrepancy.
Bug: 6383103
Change-Id: Ia3f450e22c87c4c193cea8ddb157aebd5f224f01