This is an attempt at a Slovene-English dictionary, intended for the FreeDict project and similar uses.
NOTE: This project is still under heavy development. I have yet to write the code that converts the current XML files into a TEI dictionary.
The project's main content files are stored inside the `xml` folder. It contains multiple files, each representing one section of the dictionary. For example, `slv_eng-a.xml` holds all entries that start with the letter "a". The files follow the TEI format but are stored as plain XML files for ease of use and editing.
The text below is a crash course rather than a detailed or rigorous reference. To learn more about how TEI files are structured, I suggest this documentation: https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#collocates. Alternatively, the FreeDict project has a wiki that explains a few things as well: https://github.com/freedict/fd-dictionaries/wiki.
NOTE: The values and types listed below are my own restrictions. The lists may be extended or changed.
Each XML file contains entries, which hold information on words and phrases. This is not yet a complete TEI structure, since some information is missing, but that will be added later programmatically.
In XML and similar languages for storing information, data is wrapped in tags. These can also have attributes that define additional information. Tags can contain plain-text data and/or other tags, which have their own data.
Below is an example of a TEI dictionary entry as used in this project:
<entry xml:id="a">
  <form type="lemma">
    <orth>a</orth>
  </form>
  <gramGrp>
    <gram type="pos">conj.</gram>
  </gramGrp>
  <sense xml:id="a.1">
    <usg type="dom">Lit.</usg>
    <cit type="trans">
      <quote>but</quote>
      <quote>however</quote>
    </cit>
    <cit type="example">
      <quote>Iščejo dom, a ga ne najdejo.</quote>
      <cit type="trans">
        <quote>They are searching for home but cannot find it.</quote>
      </cit>
    </cit>
  </sense>
</entry>
The `entry` tag marks a new entry in the dictionary. It usually has only the `xml:id` attribute, which acts as a unique ID for the entry and is usually just the word or phrase itself.
If two entries share the same word, their IDs should carry a `.x` suffix, where x is an integer. For example, there are two entries for the word atika, so the resulting IDs are `atika.1` and `atika.2`.
There are some suggested conventions to follow when writing IDs:
- always use lowercase letters,
- replace any spaces with underscores (example: `abonirati se` -> `abonirati_se`),
- replace non-English letters with their ASCII representations where possible (example: `užaloščen` -> `uzaloscen`).
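The ID conventions above can be sketched in Python. This is only an illustration under stated assumptions: the helper names `make_entry_id` and `add_duplicate_suffixes` are hypothetical and not part of the project, and diacritics are stripped through Unicode decomposition, which leaves letters without a decomposition (such as "đ") unchanged.

```python
import unicodedata
from collections import Counter


def make_entry_id(headword: str) -> str:
    """Hypothetical helper: derive an entry ID from a headword."""
    # Lowercase first, then split accented letters into base letter + mark
    # (for example "ž" becomes "z" followed by a combining caron).
    decomposed = unicodedata.normalize("NFKD", headword.strip().lower())
    # Drop the combining marks; letters with no decomposition stay as they are.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Turn each run of whitespace into a single underscore.
    return "_".join(stripped.split())


def add_duplicate_suffixes(ids):
    """Hypothetical helper: append .1, .2, ... to IDs that occur more than once."""
    totals = Counter(ids)
    seen = Counter()
    result = []
    for entry_id in ids:
        if totals[entry_id] > 1:
            seen[entry_id] += 1
            result.append(f"{entry_id}.{seen[entry_id]}")
        else:
            result.append(entry_id)
    return result


print(make_entry_id("abonirati se"))                     # abonirati_se
print(make_entry_id("užaloščen"))                        # uzaloscen
print(add_duplicate_suffixes(["atika", "a", "atika"]))   # ['atika.1', 'a', 'atika.2']
```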
An entry usually contains `form`, `gramGrp`, and `sense` tags.
The `form` tag contains much of the information on the original word or phrase. Its `type` attribute specifies what kind of information the form contains; see the table below.
A form can contain an `orth` tag, which holds the actual word or phrase, as well as a `gramGrp` group for any additional grammatical properties that hold only for this particular form.
A table of some type values:
Value | Meaning |
---|---|
lemma | The headword - main word that represents the entry |
inflected | A word in a form other than the usual dictionary form |
variant | A variant form |
simple | A single free lexical item |
compound | Word formed from simple lexical items |
derivative | Word derived from headword |
phrase | Multiple-word lexical item |
paradigm | A collection of inflected forms |
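
As a rough illustration of how `form` tags and their `type` values could be read, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The XML is a shortened copy of one of this project's entries; the variable names are my own and nothing here is part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Shortened entry: only the form tags are kept for this illustration.
ENTRY_XML = """
<entry xml:id="aktuar">
  <form type="lemma">
    <orth>aktuar</orth>
  </form>
  <form type="variant">
    <orth>aktuarka</orth>
  </form>
</entry>
"""

entry = ET.fromstring(ENTRY_XML)

# Pair each form's type attribute with the text of its orth tag.
forms = [(form.get("type"), form.findtext("orth")) for form in entry.findall("form")]
print(forms)  # [('lemma', 'aktuar'), ('variant', 'aktuarka')]
```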
The `gramGrp` tag groups together grammatical properties that define the word or phrase in question. The tag can appear directly inside the `entry` (see the example above), in which case it applies to all forms in the entry, or inside a `form` tag, in which case it applies only to that particular form.
A `gramGrp` group contains one or more `gram` tags. Each `gram` tag has a `type` attribute that specifies which grammatical property it holds. The table below lists some of these types and values.
Type | About | Values |
---|---|---|
pos | Defines the part of speech (noun, verb...) | n. (noun), v. (verb), adj. (adjective), conj. (conjunction), adv. (adverb), int. (interjection), prep. (preposition), pron. (pronoun), art. (article), num. (numeral), pref. (prefix) |
case | Defines the case of the word | nom. (nominative), gen. (genitive), dat. (dative), acc. (accusative), loc. (locative), instr. (instrumental) |
gender | Defines the gender of the word | m. (masculine), f. (feminine), n. (neuter) |
mood | Defines the mood of the verb | indic. (indicative), imper. (imperative), condit. (conditional) |
number | Defines the number of the word | sg. (singular), pl. (plural), du. (dual) |
per | Defines the person of the verb | 1st, 2nd, 3rd |
tns | Defines the tense of the verb | Present, Future, Past |
colloc | A collocate - any sequence of words that co-occur with the headword with significant frequency | example: [+ conj.] |
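
The difference between an entry-level and a form-level `gramGrp` can be sketched in Python as follows. The helpers `grams` and `effective_grammar` are hypothetical names, and the merging rule is my assumption: properties from the form's own `gramGrp` take precedence over the entry-level ones for that form. The XML is a shortened copy of one of this project's entries.

```python
import xml.etree.ElementTree as ET

ENTRY_XML = """
<entry xml:id="aktuar">
  <form type="lemma">
    <orth>aktuar</orth>
  </form>
  <form type="variant">
    <orth>aktuarka</orth>
    <gramGrp>
      <gram type="gender">f.</gram>
    </gramGrp>
  </form>
  <gramGrp>
    <gram type="pos">n.</gram>
    <gram type="gender">m.</gram>
    <gram type="number">sg.</gram>
  </gramGrp>
</entry>
"""


def grams(parent):
    """Collect {type: value} from gramGrp tags directly under parent."""
    return {g.get("type"): g.text for g in parent.findall("gramGrp/gram")}


def effective_grammar(entry, form):
    """Entry-level properties, overridden by the form's own (assumed rule)."""
    merged = grams(entry)
    merged.update(grams(form))
    return merged


entry = ET.fromstring(ENTRY_XML)
variant = entry.find("form[@type='variant']")
print(effective_grammar(entry, variant))
# {'pos': 'n.', 'gender': 'f.', 'number': 'sg.'}
```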
The `sense` tag contains information on the English counterpart of the Slovene word or phrase. It has its own ID, which is the entry ID followed by `.x`, where x is an integer.
An entry can contain multiple `sense` tags if the word or phrase has several meanings.
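The sense ID rule could be checked with a few lines of Python. This is a sketch only: `misnamed_sense_ids` is a hypothetical helper, the entry below is reduced to empty `sense` tags, and the ID `oh.1` is deliberately wrong to show what the check reports.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ENTRY_XML = """
<entry xml:id="ah">
  <sense xml:id="ah.1"/>
  <sense xml:id="ah.2"/>
  <sense xml:id="oh.1"/>
</entry>
"""


def misnamed_sense_ids(entry):
    """Return sense IDs that are not the entry ID followed by .<integer>."""
    entry_id = entry.get(XML_ID)
    pattern = re.compile(re.escape(entry_id) + r"\.\d+")
    bad = []
    for sense in entry.findall("sense"):
        sense_id = sense.get(XML_ID, "")  # a missing ID is reported as ""
        if not pattern.fullmatch(sense_id):
            bad.append(sense_id)
    return bad


entry = ET.fromstring(ENTRY_XML)
print(misnamed_sense_ids(entry))  # ['oh.1']
```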
A `sense` can contain the following tags:
Tag | Description |
---|---|
usg | Defines a type of usage, for example where the word is used or in what kind of situation |
cit | Contains either a translation (type `trans`) or an example of usage (type `example`); the text itself is stored in `quote` tags |
quote | Holds the actual text of a translation or example |
def | Holds any definitions of words - can be used for extra explanation of the word or when there is no proper translation |
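
To show how these tags fit together, the following Python sketch pulls the translations and usage examples out of a single `sense`. It uses the standard `xml.etree.ElementTree` module on a copy of the sense from the first example entry; the variable names are my own and this is not the project's conversion code.

```python
import xml.etree.ElementTree as ET

SENSE_XML = """
<sense xml:id="a.1">
  <usg type="dom">Lit.</usg>
  <cit type="trans">
    <quote>but</quote>
    <quote>however</quote>
  </cit>
  <cit type="example">
    <quote>Iščejo dom, a ga ne najdejo.</quote>
    <cit type="trans">
      <quote>They are searching for home but cannot find it.</quote>
    </cit>
  </cit>
</sense>
"""

sense = ET.fromstring(SENSE_XML)

# Translations of the headword: quotes in "trans" citations directly under the sense.
translations = [q.text for q in sense.findall("cit[@type='trans']/quote")]

# Usage examples: the Slovene sentence paired with its nested English translation.
examples = [
    (cit.findtext("quote"), cit.findtext("cit[@type='trans']/quote"))
    for cit in sense.findall("cit[@type='example']")
]

print(translations)  # ['but', 'however']
print(examples)
```

Note that the `trans` citation nested inside the example is not picked up as a translation of the headword, because the path only matches `cit` tags that are direct children of the `sense`.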
Types of usage:
Type | Description | Values |
---|---|---|
dom | Domain | Adm. (administration), Aero. (aeronautics), Agr. (agriculture), Anat. (anatomy), Antr. (anthropology), Arch. (architecture), Archae. (archaeology), Art, Astr. (astronomy), Bibl. (bibliography), Biol. (biology), Bot. (botany), Buil. (building trade), Chem. (chemistry), Chess, Comp. (computation), Craft. (craftsmanship), Econ. (economy), Engin. (engineering), Film, Fin. (finances), For. (forestry), Gast. (gastronomy), Geol. (geology), Geog. (geography), Hist. (history), Hunt. (hunting), Law, Lit. (literature), Ling. (linguistics), Math. (mathematics), Med. (medicine), Meteo. (meteorology), Milit. (military), Mus. (music), Myth. (mythology), Naut. (nautical), Pedag. (pedagogics), Pharm. (pharmacy), Phil. (philosophy), Phys. (physics), Psych. (psychiatry), Rail. (rail transport), Rel. (religion), Sci. (science), Sport, Tech. (technical), Text. (textile), Theat. (theatre), Vet. (veterinary), War, Zoo. (zoology) |
plev | Preference level | rare, occas. (occasional) |
geo | Geographic data | dial. (dialect), Inner Carniola (Notranjska), Upper Carniola (Gorenjska), Lower Carniola (Dolenjska), Littoral Region (Primorje), Styria (Štajerska), Prekmurje, Carinthia (Koroška), White Carniola (Bela krajina) |
time | Usage by time | archaic, old |
register | Register | child. (childlike), slang, lingo, vulgar, formal, casual, affect. (affectionate), colloq. (colloquial), pejor. (pejorative), iron. (ironically) |
style | Style | fig. (figurative), lit. (literal) |
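
Since the usage types are a restricted list, a small check can flag `usg` tags whose `type` is not in the table. The Python sketch below is hedged accordingly: `KNOWN_USG_TYPES` and `unknown_usg_types` are hypothetical names, the set simply mirrors the table above (which may change), and the second `usg` tag in the sample is invented with a deliberately wrong type.

```python
import xml.etree.ElementTree as ET

# Mirrors the "Type" column of the table above; adjust if the list changes.
KNOWN_USG_TYPES = {"dom", "plev", "geo", "time", "register", "style"}

SENSE_XML = """
<sense xml:id="amortizirati_se.1">
  <usg type="dom">Econ.</usg>
  <usg type="domain">Fin.</usg>
  <cit type="trans">
    <quote>to be depreciated</quote>
  </cit>
</sense>
"""


def unknown_usg_types(element):
    """Return (type, value) pairs for usg tags with an unlisted type."""
    return [
        (usg.get("type"), usg.text)
        for usg in element.iter("usg")
        if usg.get("type") not in KNOWN_USG_TYPES
    ]


sense = ET.fromstring(SENSE_XML)
print(unknown_usg_types(sense))  # [('domain', 'Fin.')]
```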
<entry xml:id="ah">
  <form type="lemma">
    <orth>ah</orth>
  </form>
  <gramGrp>
    <gram type="pos">int.</gram>
  </gramGrp>
  <sense xml:id="ah.1">
    <cit type="trans">
      <quote>ah</quote>
      <quote>oh</quote>
    </cit>
    <cit type="example">
      <quote>Ah, seveda!</quote>
      <cit type="trans">
        <quote>Oh, right!</quote>
      </cit>
    </cit>
    <def>Expresses awe, contentment, or when getting an idea or thought.</def>
  </sense>
  <sense xml:id="ah.2">
    <cit type="trans">
      <quote>ah</quote>
      <quote>oh</quote>
    </cit>
    <cit type="example">
      <quote>Ah, ti si.</quote>
      <cit type="trans">
        <quote>Oh, it's you.</quote>
      </cit>
    </cit>
    <def>Expresses regret, tiredness.</def>
  </sense>
</entry>
<entry xml:id="aktuar">
  <form type="lemma">
    <orth>aktuar</orth>
  </form>
  <form type="variant">
    <orth>aktuarka</orth>
    <gramGrp>
      <gram type="gender">f.</gram>
    </gramGrp>
  </form>
  <gramGrp>
    <gram type="pos">n.</gram>
    <gram type="gender">m.</gram>
    <gram type="number">sg.</gram>
  </gramGrp>
  <sense xml:id="aktuar.1">
    <cit type="trans">
      <quote>actuary</quote>
    </cit>
  </sense>
</entry>
<entry xml:id="amortizirati_se">
  <form type="lemma">
    <orth>amortizirati se</orth>
  </form>
  <gramGrp>
    <gram type="pos">v.</gram>
  </gramGrp>
  <sense xml:id="amortizirati_se.1">
    <usg type="dom">Econ.</usg>
    <cit type="trans">
      <quote>to be depreciated</quote>
    </cit>
    <cit type="example">
      <quote>Avto se amortizira v petih letih.</quote>
      <cit type="trans">
        <quote>The car is depreciated in five years.</quote>
      </cit>
    </cit>
  </sense>
</entry>