Merge "Switch the AOSP word lists to the combined format."

This commit is contained in:
Jean Chalard 2012-10-31 02:57:25 -07:00 committed by Android (Google) Code Review
commit 63b66b6f39
51 changed files with 38 additions and 17 deletions

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View file

@ -0,0 +1,38 @@
# This is a sample wordlist that can be converted to a binary dictionary
# for use by the Latin IME.
# The file is essentially a CSV file, with indent level denoting nesting.
#
# The file starts with a single CSV line with the header attributes. Whatever
# the content, these are included as is in the binary file. The first attribute
# of the file should be `dictionary'. Usual fields are `locale', `description',
# `date', `version', `options'.
#
# Each word has a `word' entry and at least a `f' argument denoting its
# probability, as an integer between 0 and 255 on a logarithmic scale, with
# 255 meaning 1 and each decrement in 1 dividing probability by 1.15.
# As a special case, a weight of 0 is taken to mean profanity - words that
# should not be considered a typo, but that should never be suggested
# explicitly. An entry may be made not a word by adding a `not_a_word'
# field with a value of `true'. The main reason for putting such entries
# into the dictionary is to add shortcut targets and maybe a whitelist
# replacement.
#
# Each word may or may not have any number of shortcut target lines
# starting with a `shortcut' entry and having at least a `f' frequency
# value between 0 and 14, or the special value `whitelist' which becomes
# 15, which is then taken to be the whitelist target of this word.
#
# Each word may also have any number of bigram lines starting with a
# `bigram' entry containing the following word whose frequency should
# override the unigram frequency when following the word this bigram is
# for.
#
dictionary=main:en,locale=en,description=Sample wordlist,date=1351495318,version=1
word=sample,f=200
bigram=wordlist,f=243
word=wordlist,f=180
word=shortcut,f=176
shortcut=target,f=10
word=witelisted,f=10,not_a_word=true
shortcut=whitelisted,f=whitelist
word=profanity,f=0

View file

@ -1,17 +0,0 @@
<!-- This is a sample wordlist that can be converted to a binary dictionary
for use by the Latin IME.
The format of the word list is a flat list of word entries.
Each entry has a frequency between 255 and 0.
Highest frequency words get more weight in the prediction algorithm. As a
special case, a weight of 0 is taken to mean profanity - words that should
not be considered a typo, but that should never be suggested explicitly.
You can capitalize words that must always be capitalized, such as "January".
You can have a capitalized and a non-capitalized word as separate entries,
such as "robin" and "Robin".
-->
<wordlist>
<w f="255">this</w>
<w f="255">is</w>
<w f="128">sample</w>
<w f="1">wordlist</w>
</wordlist>

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.