11 Commits (b6ea774669137bfdcf4036098f5d65cc0e0576d2)

Author SHA1 Message Date
Adrian Velicu 8dd31a28ae Update dictionaries (possibly_offensive flag)
Encode possibly offensive words with their correct frequency
and with the possibly_offensive flag set.

Continue to encode with zero frequency only distractors and
words that should never come up.

https://paste.googleplex.com/5167060875214848

Bug: 11031090
Change-Id: Ia394b1827f292ff8d4791cc2f3e6e50b5aff4cbe
2014-10-31 14:49:24 +09:00
Jean Chalard 004cec01a9 Update all dicts to version 44.
Bug: 13164302
Change-Id: I8dc1a839c7dcfaa08a53e26cb6600e9f871447ce
2014-02-24 21:27:25 +09:00
Jean Chalard a267ebed5a Update dictionaries
Add KitKat to all dictionaries.
Version
da, fi, pl : 29 → 40
cs, de, hr, it, lt, lv, nb, nl, sl, sr, sv, tr : 35 → 40
es : 36 → 40
en_gb, en_us, en, fr, pt_br, pt_pt : 39 → 40

Bug: 10958192
Change-Id: I14436616285ced5eb3b70b8c44b9243da94eed4f
2013-09-30 07:12:03 +00:00
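
The version table in the entry above groups locales that share the same old and new version. As a rough illustration only, the sketch below expands such lines into one record per locale. The function name `expand_version_bumps` is hypothetical and is not part of the repository's tooling; the line layout (`locales : old → new`) is assumed to be exactly as it appears in the commit message.

```python
def expand_version_bumps(lines):
    """Expand 'a, b, c : OLD → NEW' lines into {locale: (old, new)}.

    Assumes the layout used in the commit message above; this is an
    illustrative helper, not a tool from the repository.
    """
    bumps = {}
    for line in lines:
        # Split the locale list from the version pair.
        locales, _, versions = line.partition(" : ")
        old, _, new = versions.partition("→")
        for locale in locales.split(","):
            bumps[locale.strip()] = (int(old), int(new))
    return bumps


table = [
    "da, fi, pl : 29 → 40",
    "es : 36 → 40",
]
bumps = expand_version_bumps(table)
print(bumps["fi"])
```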
Jean Chalard 50b36e2a4b Update dictionaries
>>> dictionaries/en_GB_wordlist.combined.gz
Header :
  date : 1374721653 <=> 1380099152
  version : 36 <=> 39
Body :
Freq changed: gay 127 -> 10
Added: draft 138

>>> dictionaries/en_US_wordlist.combined.gz
Header :
  date : 1374721654 <=> 1380099152
  version : 36 <=> 39
Body :
Freq changed: gay 127 -> 10

>>> dictionaries/en_wordlist.combined.gz
Header :
  date : 1374721663 <=> 1380099172
  version : 36 <=> 39
Body :
Freq changed: gay 127 -> 10

>>> dictionaries/fr_wordlist.combined.gz
Header :
  date : 1376888819 <=> 1380099153
  version : 37 <=> 39
Body :
Added: septembre 150

>>> dictionaries/pt_BR_wordlist.combined.gz
Header :
  date : 1376884524 <=> 1380099168
  version : 37 <=> 39
Body :
Freq changed: atras 87 -> 0
Not a word: atras false -> true
Shortcut added: atras atrás 15
Shortcut added: cade cadê 15
Shortcut added: cafe café 15
Shortcut added: ferias férias 15
Shortcut added: musica música 15
Shortcut added: musicas músicas 15

>>> dictionaries/pt_PT_wordlist.combined.gz
Header :
  date : 1376884536 <=> 1380099168
  version : 37 <=> 39
Body :
Shortcut added: atras atrás 15
Shortcut added: cade cadê 15
Shortcut added: ferias férias 15
Shortcut added: musica música 15
Shortcut added: musicas músicas 15
Added: cafe 0

>>> java/res/raw/main_en.dict
Header :
  date : 1374721663 <=> 1380099172
  version : 36 <=> 39
Body :
Freq changed: gay 127 -> 10

>>> java/res/raw/main_fr.dict
Header :
  date : 1376888819 <=> 1380099153
  version : 37 <=> 39
Body :
Added: septembre 150

>>> java/res/raw/main_pt_br.dict
Header :
  date : 1376884524 <=> 1380099168
  version : 37 <=> 39
Body :
Freq changed: atras 87 -> 0
Not a word: atras false -> true
Shortcut added: atras atrás 15
Shortcut added: cade cadê 15
Shortcut added: cafe café 15
Shortcut added: ferias férias 15
Shortcut added: musica música 15
Shortcut added: musicas músicas 15

Bug: 10504313
Bug: 10507536
Bug: 10561100
Change-Id: I4267c76cf0de221a703523d5f2dd2befbaf020a0
2013-09-26 08:34:53 +00:00
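
The per-file summaries in the entry above follow a regular layout: a `>>> path` line, a `Header :` block with `old <=> new` pairs, and a `Body :` block with one change per line. A minimal sketch of a parser for that layout follows, assuming the format is exactly as shown in this log. The names `parse_summary`, `FILE_RE`, `HEADER_RE` and `BODY_RES` are hypothetical, and line kinds not listed (for example `Not a word:`) are simply skipped.

```python
import re

# Patterns inferred from the commit messages in this log (an assumption,
# not a documented format).
FILE_RE = re.compile(r"^>>> (\S+)$")
HEADER_RE = re.compile(r"^\s*(\w+) : (\d+) <=> (\d+)$")
BODY_RES = {
    "freq_changed": re.compile(r"^Freq changed: (\S+) (\d+) -> (\d+)$"),
    "added": re.compile(r"^Added: (\S+) (\d+)$"),
    "deleted": re.compile(r"^Deleted: (\S+) (\d+)$"),
    "shortcut_added": re.compile(r"^Shortcut added: (\S+) (\S+) (\d+)$"),
}


def parse_summary(text):
    """Return {path: {"header": {...}, "changes": [...]}} for a summary."""
    files = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        m = FILE_RE.match(line)
        if m:
            # A new ">>> path" line starts a new per-file record.
            current = {"header": {}, "changes": []}
            files[m.group(1)] = current
            continue
        if current is None:
            continue
        m = HEADER_RE.match(line)
        if m:
            # Header fields are stored as (old, new) integer pairs.
            current["header"][m.group(1)] = (int(m.group(2)), int(m.group(3)))
            continue
        for kind, pattern in BODY_RES.items():
            m = pattern.match(line)
            if m:
                # Body changes keep their captured fields as strings.
                current["changes"].append((kind,) + m.groups())
                break
    return files


SAMPLE = """>>> dictionaries/fr_wordlist.combined.gz
Header :
  date : 1376888819 <=> 1380099153
  version : 37 <=> 39
Body :
Added: septembre 150
"""

result = parse_summary(SAMPLE)
print(result["dictionaries/fr_wordlist.combined.gz"]["header"]["version"])
```

Keeping the captured fields as strings avoids guessing at the meaning of each number; callers can convert frequencies to integers where needed.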
Jean Chalard 5937c03f15 Update dictionaries
Bug: 10354668
Bug: 10188528

>>> dictionaries/fr_wordlist.combined.gz
Header :
  date : 1374634549 <=> 1376888819
  version : 36 <=> 37
Body :
Deleted: color 78
Deleted: men 85
Deleted: o 115
Added: nationaux 120

>>> dictionaries/iw_wordlist.combined.gz
Added. New dictionary.

>>> dictionaries/pt_BR_wordlist.combined.gz
Header :
  date : 1374634563 <=> 1376884524
  version : 36 <=> 37
Body :
Deleted: la 152

>>> dictionaries/pt_PT_wordlist.combined.gz
Header :
  date : 1357790930 <=> 1376884536
  version : 30 <=> 37
Body :
Deleted: la 152

>>> dictionaries/ru_wordlist.combined.gz
Header :
  date : 1372393835 <=> 1376897704
  version : 35 <=> 37
Body :
Freq changed: говно 68 -> 0

>>> java/res/raw/main_fr.dict
Header :
  date : 1374634549 <=> 1376888819
  version : 36 <=> 37
Body :
Deleted: color 78
Deleted: men 85
Deleted: o 115
Added: nationaux 120

>>> java/res/raw/main_pt_br.dict
Header :
  date : 1374634563 <=> 1376884524
  version : 36 <=> 37
Body :
Deleted: la 152

>>> java/res/raw/main_ru.dict
Header :
  date : 1372393835 <=> 1376897704
  version : 35 <=> 37
Body :
Freq changed: говно 68 -> 0

Change-Id: I87a85571c61068ff46a32d291aa43becbb75598a
2013-08-19 16:41:09 +09:00
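
The `date` values in the headers above look like Unix timestamps in seconds; that reading is an assumption, though the decoded values fall shortly before the corresponding commit dates. A small sketch for decoding them, with the hypothetical helper name `header_date_to_utc`:

```python
from datetime import datetime, timezone


def header_date_to_utc(seconds):
    """Interpret a header 'date' value as Unix epoch seconds (assumed)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


# New 'date' value of pt_PT_wordlist.combined.gz in the entry above.
print(header_date_to_utc(1376884536))  # 2013-08-19T03:55:36+00:00
```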
Jean Chalard f0046aea26 Update dictionaries
en, en_GB, en_US:
Add "id" -> "I'd" whitelist entry
Reinstate "id" and "ID" in the respective dicts

fr:
Remove many words that are not French
Change "google" to "Google"

pt_BR:
Delete "idéia"

Change-Id: I942266ac7995345580926f60de45d202aa257ae7
2013-07-24 12:10:06 +09:00
Jean Chalard 84f932be73 Add words to Portuguese
>>> dictionaries/pt_BR_wordlist.combined.gz
Header :
  date : 1355802839 <=> 1357790917
  version : 29 <=> 30
Body :
Added: à 30
Added: é 30
Added: ò 30
Added: ô 30

>>> dictionaries/pt_PT_wordlist.combined.gz
Header :
  date : 1355802856 <=> 1357790930
  version : 29 <=> 30
Body :
Added: à 30
Added: é 30
Added: ò 30
Added: ô 30

>>> java/res/raw/main_pt_br.dict
Header :
  date : 1355802839 <=> 1357790917
  version : 29 <=> 30
Body :
Added: à 30
Added: é 30
Added: ò 30
Added: ô 30

Bug: 7966948
Change-Id: I71c0986cf616d67926d0a6a0e53099b04b0427d5
2013-01-10 14:14:17 +09:00
Jean Chalard 21dbe3701c Update dictionaries
cs, da, de, el, es, fi, fr, hr, it, lt, lv, nb, nl, pl,
pt_BR, pt_PT, sl, sr, sv, tr : rescale frequencies to match
spec. This has no large effect in practice, except that the
dictionary becomes stronger relative to the spatial model
(especially for lower-count corpora such as lt, lv, sr)
en* : Small changes (rounding going the other way essentially)
ru : the above rescaling, and remove the following words:
Дре, ОСТа, Планше, легкими, легком, легкому, легкости,
легкую, нелегкие, нелегкий, нелегким, нелегкое, нелегкой,
нелегкую, полулегком and add нелёгкие, нелёгкое, нелёгкую;
other accented forms were already in the dictionary.

Change-Id: I40386c2ebd4d2be38874e822bde89db7cb512ae6
2012-12-18 13:06:48 +09:00
Jean Chalard d5f53710c5 Update dictionaries and fix mistakes
- Combined de dict :
  Remove digraph shortcuts that were in by mistake.
- Combined en dict :
  Set freq of "baton" "batons" "mace" "puff"
  "puffs" and "tasers" to zero. They are offensive
  in en_GB.
- Combined en_GB dict :
  Change freq of "il" to 0 and flag it "not a word". Still
  in the dict as a whitelist entry for "I'll"; for some
  reason it had freq 99.
  Add "milk:122" and "practice:143"
- Combined fr dict :
  Add missing words : "Nostradamus:40" "défendais:30"
  "gmail:50" "générale:140" "hm:0" "hmm:0" "y'en:130"
  "l'apocalypse:31" "m'épuise:30" "recontacter:80"
  "t'annonce:30"
  Set freq of non-word shortcuts for digraphs to 1 instead
  of 0, so that they can be entered by gesture.
- Combined ru dict :
  Remove a lot of two-character non-words.

- Binary de dict :
  Remove the obsolete "options" header, and add the "dictionary"
  header.
- Binary en dict :
  Flag "hoe" "hoes" "il" "shel" as non-words.
  Also drop freq of "il" and "shel" to 0
  Add the "locale" header that was missing.
- Binary es dict :
  Add the "dictionary" header.
- Binary fr dict :
  Add the same words as above. Non-word shortcuts were already
  set to 1.
- Binary it dict :
  Add a "dictionary" header. Also change freq of
  "Šarapova" from 50 to 37; not sure why it was 50.
- Binary pt_BR dict :
  Add a "dictionary" header.
- Binary ru dict :
  Add a "dictionary" header and remove the same words as above.

For all dictionaries : bump the version to 27.

Change-Id: I94fe7f8f42b31fdad223085c00a94115e14d2276
2012-11-21 22:03:24 +09:00
Jean Chalard d0cf96493c Use all Lexiteria sources and update existing directories.
New dictionaries :
- Danish
- Greek
- Finnish
- Lithuanian
- Latvian
- Dutch
- Polish
- Russian
- Slovene
- Serbian
- Swedish
- Turkish

Also, compress those files to reduce the footprint in the
repository.
Also, update and improve English and French dictionaries, and
add the ligatures shortcut into the French dictionary.
Finally, move the Russian binary dictionary here now that it
can at last be open sourced.

Bug: 5587752
Bug: 6775251
Bug: 6995793
Bug: 7149666
Change-Id: Iec9831d4dce425a2b5b0657571e4448436610525
2012-09-21 22:07:23 +09:00
Jean Chalard 383f4d6a69 Fix the name of the resource to lower case
Change-Id: Icbacf10702de20ef1a60d2648ee6440812d13f1d
2012-05-25 15:27:58 +09:00