freedict.eu

converting dictionaries with the freedict-tools

.tei into .index, .dict.dz and .slob, adding phonetics

The dictionaries from download.freedict.org come in three flavours: as .index and .dict.dz for desktop use with programs like GoldenDict, as .slob for mobile use with Aard2 on Android, and as a plain .tei XML file. The dictionaries of the freedict project are created and edited as .tei (TEI stands for Text Encoding Initiative) and then converted into the different formats for desktop and mobile use.

rationale

While the dictionaries of the freedict project can be downloaded in already converted form, you may need to convert a dictionary on your own when you want to use third-party dictionaries like ding (English-German, German-Spanish). The dictconv program promises to convert .tei and some other formats (.bgl, .dct, .ifo) into .index and .dict as well as into the StarDict format .ifo, but the current version 0.2-7 only handles an old .tei format and cannot cope with the current TEI P5. You may thus need to convert with the freedict-tools (github.com/freedict/tools). This is the same toolset the freedict project uses to convert its own dictionaries, which are available at https://github.com/freedict/fd-dictionaries.

how to do it

If you check out the freedict-tools you will find programs like xmltei2xmldict.pl which promise to do the conversion for you. However, this program, for instance, is hopelessly outdated. It uses very old libraries, and once you have managed to install and compile everything needed to run it, you will notice that it does not appear to work on current TEI P5.

What you really need are the Makefiles under mk/. These files are meant to be included by your own Makefiles. If you want to discover on your own how to do it, you can get a first impression by checking out the freedict project's own dictionaries (see last section). Here we want to show you directly how it works so that you can start converting your own dictionaries right away.

Put the .tei file into a folder of its own and make your Makefile look like this:

DISTFILES=deu-eng.tei
FREEDICT_TOOLS=../../tools
include $(FREEDICT_TOOLS)/mk/dicts.mk

Note that the file name without its .tei extension (here: deu-eng) must be the same as the directory name. If you name your dictionary file differently from the directory, make will simply hang indefinitely without displaying any error message. This is due to the following two lines in dicts.mk:

dictname ?= $(shell basename "$(shell pwd)")
…
version1 := $(shell sed -e '100q;/<edition>/!d;s/.*<edition>\(.*\)<\/edition>.*/\1/;q' $(wildcard $(dictname).tei))

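The first line assigns the directory name to dictname. The second line uses $(wildcard …) to look for a file with this base name plus the .tei extension. If no such file exists, the wildcard expands to nothing, so sed is called without a file argument and waits for standard input instead.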
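Because the hang produces no diagnostic, a small pre-flight check before calling make can save time. The following is a minimal sketch and not part of freedict-tools; the function name check_tei_name is our own invention, and the sketch only assumes what is described above, namely that dicts.mk looks for <directory name>.tei inside the dictionary directory.

```shell
# Hypothetical helper (not part of freedict-tools): verify that a
# dictionary directory contains a .tei file named after the directory,
# which is what dicts.mk expects according to the text above.
check_tei_name() {
    dir=${1:-.}
    # Resolve to an absolute path so that basename also works for "."
    abs=$(cd "$dir" && pwd) || return 2
    name=$(basename "$abs")
    if [ -f "$abs/$name.tei" ]; then
        echo "ok: $name.tei"
    else
        echo "missing: $abs/$name.tei (make would wait on standard input)" >&2
        return 1
    fi
}
```

Called as check_tei_name . from inside the dictionary folder, it returns a non-zero status when the names do not match, so it can be chained in front of make with &&.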

You may suspect that there are more pitfalls hidden in freedict-tools, and there are. Not even the file and directory name can be chosen arbitrarily: they need to have the form <source language>-<target language> (here: deu-eng for German-English). For a list of language codes, see the directory structure of the freedict repository (last section). Also note that FREEDICT_TOOLS is not just a convenience variable for our own Makefile but is required by mk/dicts.mk.

source_lang = $(firstword $(subst -, ,$(dictname)))
…
$(TEIADDPHONETICS) --supports-lang $(source_lang) …

The excerpt above shows how dicts.mk uses the teiaddphonetics executable (see our freedict-tools-patches.tar.bz2) to check whether a language code is valid and known. If you read our last section you will better understand the strict naming conventions used by freedict-tools: they were created just to convert the dictionaries of the freedict project, not necessarily your own.
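To experiment with the naming convention outside of make, the split performed by $(firstword $(subst -, ,$(dictname))) can be imitated in plain shell. This is only a sketch: the function names are hypothetical, and the validity check merely tests for the <source>-<target> shape, not whether the language codes are actually known to teiaddphonetics.

```shell
# Hypothetical helpers mirroring $(firstword $(subst -, ,$(dictname))):
# split a dictionary name such as "deu-eng" at the first hyphen.
source_lang() { printf '%s\n' "${1%%-*}"; }
target_lang() { printf '%s\n' "${1#*-}"; }

# Succeeds only if the name contains a hyphen with a non-empty part on
# each side. Whether the codes themselves are supported is NOT checked.
valid_dictname() {
    case $1 in
        -*|*-|"") return 1 ;;
        *-*)      return 0 ;;
        *)        return 1 ;;
    esac
}
```

The result of source_lang could then be handed to teiaddphonetics --supports-lang, as the Makefile excerpt does.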

Since the conversion process can take up to a whole day for a dictionary like ding, we have examined mk/config.mk to find all the requirements in advance. One executable listed there is charlint.pl, a Perl script that canonicalizes UTF-8 text and checks it for well-formedness. Letters like 'ã' or 'â' combine a base letter with a diacritic and can be represented either as a single precomposed Unicode character or as two separate ones. A dictionary lookup must certainly not fail because one or the other representation is used.

You can download this script from https://www.w3.org/International/charlint/. However, it requires UnicodeData.txt to do its work. Currently downloadable versions of this file no longer work with charlint.pl, so we have used a version from Libreboot with two of its 23697 lines removed. It should still work for all the languages supported by the freedict project. Additionally, we have patched charlint.pl to search for UnicodeData.txt not in the current directory but in the directory given by the environment variable $FREEDICT_TOOLS.

You can download freedict-tools-patches.tar.bz2 from download.freedict.eu/tools+patches/. Unpacking this archive directly creates charlint.pl and UnicodeData.txt in the tools/ directory. Since config.mk expects charlint.pl to be on the PATH, we invoke make through this little shell script:

export FREEDICT_TOOLS=../../tools
PATH=$PATH:$FREEDICT_TOOLS make "$@"

Note that you should also install your distribution's dictfmt package and its opensp package, which provides /bin/onsgmls.
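Given how long a conversion can run, it may be worth checking up front that the executables mentioned in this article can actually be found. The following sketch is our own suggestion and not part of freedict-tools; missing_tools is a hypothetical name, and the tool list in the usage comment is merely collected from this article, not from a complete reading of config.mk.

```shell
# Hypothetical pre-flight check: print every command from the argument
# list that cannot be found on PATH. Prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Possible usage, with PATH extended as in the wrapper script above:
#   PATH=$PATH:../../tools missing_tools xsltproc dictfmt dictzip onsgmls charlint.pl
```

An empty result means every listed tool was found; any name that is printed still needs to be installed or put on the PATH.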

adding phonetics

If you have followed the instructions in the previous section you may have noticed the following error message:

> ./mk
../../tools2/mk/dicts.mk:83: Unable to run teiaddphonetics, phoneme generation disabled. Check installed dependencies.
xsltproc --novalid --xinclude --stringparam dictname deu-eng --path /home/sources/freedict-tools/wordbooks2/deu-eng/ ../../tools2/xsl/tei2c5.xsl deu-eng.tei >build/dictd/deu-eng.c5
deu-eng: Processed 5000 of 463244 entries (1%).
…

Afterwards make still converts the dictionary into all the desired formats. The step that takes longest is the creation of the .c5 file by xsltproc. Here the .tei file, which is essentially an XML file, is converted by applying an XSLT stylesheet (XSL: Extensible Stylesheet Language), which is itself written in XML.

Before that, however, make wants to add phonetics to the dictionary. Phonetics are a notation of their own that tells you how to pronounce a word. The teiaddphonetics perl script annotates a dictionary in .tei format with phonetics. It does so by means of espeak-ng, which can also be used as a speech synthesizer. However, currently downloadable versions of the freedict-tools do not include this script. We have nonetheless found it in the Debian package of the freedict-tools. It is hard to say where this file came from, but it apparently was (and is still supposed to be) a part of the freedict-tools. You can also get the file from download.freedict.eu/tools+patches/freedict-tools-patches.tar.bz2.

However, the version we found in Debian did not work out of the box. First of all, mk/dicts.mk tries to call $(TEIADDPHONETICS) -o $@ $< instead of $(TEIADDPHONETICS) -o $@ --infile $<. You need to fix this by applying our patch:

> patch --forward tools/mk/dicts.mk dicts.mk.patch
patching file tools/mk/dicts.mk
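For readers who want to see what this fix amounts to, the change can be sketched as a sed filter. This is only an illustration: it assumes that the recipe line in dicts.mk reads exactly as quoted above, and add_infile_flag is a hypothetical name. The dicts.mk.patch file remains the authoritative fix and may contain more than this one change.

```shell
# Hypothetical filter: rewrite the teiaddphonetics call so that the input
# file is passed via --infile. Reads Makefile text on stdin and writes the
# result to stdout. Lines that already use --infile are left unchanged.
add_infile_flag() {
    sed 's/\$(TEIADDPHONETICS) -o \$@ \$</$(TEIADDPHONETICS) -o $@ --infile $</'
}

# Possible usage (writes a new file instead of editing in place):
#   add_infile_flag < tools/mk/dicts.mk > dicts.mk.new
```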

The patches for charlint.pl and teiaddphonetics normally do not need to be applied any more, since the already patched files are extracted directly into tools/. You can always find the original files in tools.orig/.

teiaddphonetics had to be changed in several ways. The problem was that espeak-ng sometimes generated multiple lines of output for one input line in the source language, so that the teiaddphonetics script no longer knew which source and destination lines belonged together. First of all, the patched version in tools/ now ignores a trailing newline instead of throwing an error. It also buffers line by line via stdbuf, more as a precaution than because this has been verified to be necessary. The main issue was that certain punctuation marks cause espeak-ng to insert a new line, because the phonetic notation has no symbols for them: question mark, exclamation mark, full stop, colon, semicolon and comma. Should you ever run into similar problems, use teiaddphonetics with --espeak-count 1, which tells the script to run espeak-ng on just a single line at a time.
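The punctuation problem can be illustrated with a small filter. The sketch below is not the code of the patched teiaddphonetics script; it merely shows, under the assumption that dropping the six marks listed above is acceptable for phonetic transcription, how text could be cleaned so that each input line stays a single line. The function name strip_espeak_punct is hypothetical.

```shell
# Hypothetical pre-filter: delete the punctuation marks named above
# (? ! . : ; ,) from each input line. The number of lines is preserved
# because newline characters are left untouched.
strip_espeak_punct() {
    tr -d '?!.:;,'
}

# Example: prints "Wie bitte Ja klar"
printf 'Wie bitte? Ja, klar.\n' | strip_espeak_punct
```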

If everything went fine you will find an output .tei file with added phonetics in the build directory, and all the created dictionaries will contain phonetics as well.

additionally converting into .slob

On a default installation make will quit after creating the .index and .dict.dz (gzipped .dict). If you additionally want a .slob for Aard2 on Android, make sure tei2slob is on your PATH. Our article about mobile dictionary creation for offline use shows you how to obtain and install this program including all of its requirements. Finally, a sudo ln -s ~/.local/bin/tei2slob /usr/local/bin/ should do what you want. Then just invoke make again.

Here is the output of a successful conversion:

> ./mk
mkdir -p build/tei/
../../tools/teiaddphonetics -o build/tei/deu-eng-phonetics.tei --infile deu-eng.tei
xsltproc --novalid --xinclude --stringparam dictname deu-eng --path /home/sources/freedict-tools/wordbooks/deu-eng/ ../../tools/xsl/tei2c5.xsl build/tei/deu-eng-phonetics.tei >build/dictd/deu-eng.c5
deu-eng: Processed 5000 of 463244 entries (1%).
…
deu-eng: Processed 460000 of 463244 entries (99%).
Platform dictd supports this dictionary module.
cd build/dictd && \
dictfmt --without-time -t --headword-separator %%% --utf8 deu-eng < deu-eng.c5
464822 headwords
dictzip -k build/dictd/deu-eng.dict
rm -f build/slob/deu-eng-1.8.1-fd0.2.1.slob
tei2slob -w build/slob -o build/slob/deu-eng-1.8.1-fd0.2.1.slob build/tei/deu-eng-phonetics.tei
Adding /home/elm/.local/lib/python3.7/site-packages/tei2slob
SKIPPING (not included): '__init__.py'
SKIPPING (not included): '__pycache__/__init__.cpython-37.pyc'
ADDING: '~/js/styleswitcher.js'
ADDING: '~/css/default.css'
ADDING: '~/css/night.css'
Adding content...
.................................................. 5000
…
.................................................. 460000
................................
Finished adding content in 0:04:24
Finalizing...
Sorting... sorted in 0:00:29
Resolving aliases...
Sorting... sorted in 0:00:29
Resolved aliases in 0:00:29
Finalized in 0:00:58
All done in 0:05:22
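After such a run you may want to confirm that all three formats were really produced. The following sketch is our own suggestion; missing_outputs is a hypothetical name, and the file locations are taken from the log above (dictfmt writes the .index next to the .dict, dictzip produces the .dict.dz). Since the .slob name contains version numbers, it is matched with a glob.

```shell
# Hypothetical post-build check: print the expected output files that are
# missing below ./build for the given dictionary name (e.g. deu-eng).
missing_outputs() {
    name=$1
    for f in "build/dictd/$name.index" "build/dictd/$name.dict.dz"; do
        [ -f "$f" ] || printf '%s\n' "$f"
    done
    # If no .slob exists the glob stays unexpanded and fails the -f test.
    set -- build/slob/"$name"-*.slob
    [ -f "$1" ] || printf '%s\n' "build/slob/$name-*.slob"
}
```

Run from inside the dictionary folder, missing_outputs deu-eng prints nothing when everything is in place.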

checking out the freedict dictionaries

You can view all dictionaries of the freedict project online at https://github.com/freedict/fd-dictionaries.

The following will make a sparse checkout of the freedict dictionaries showing you the directory structure and all the Makefiles. We have chosen to split it into two different checkouts to show you how to subsequently check out more files.

> mkdir freedict-master
> cd freedict-master/
> git init
> git remote add -f origin https://github.com/freedict/fd-dictionaries.git
> git config core.sparseCheckout true
> echo shared >>.git/info/sparse-checkout
> git pull origin master
> echo Makefile >>.git/info/sparse-checkout
> echo /README.md >>.git/info/sparse-checkout
> git read-tree -mu HEAD

Of special interest is the 'shared' directory, which contains all the document type definitions (DTDs) for TEI P5. To list it together with all other entries that are not dictionary folders, use “shopt -s extglob; ls -ld !(*-*);”.
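The !(*-*) pattern requires bash with extglob enabled. For other shells, the same selection can be sketched with a plain loop. This is an illustrative equivalent only; non_dict_entries is a hypothetical name, and unlike ls -ld it prints just the entry names.

```shell
# Portable sketch of "ls -d !(*-*)": print the entries of a directory whose
# names contain no hyphen, i.e. everything that does not look like a
# <source>-<target> dictionary folder.
non_dict_entries() {
    for entry in "${1:-.}"/*; do
        [ -e "$entry" ] || continue    # empty directory, glob did not match
        base=${entry##*/}
        case $base in
            *-*) ;;                     # looks like a dictionary folder, skip
            *)   printf '%s\n' "$base" ;;
        esac
    done
}
```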