ldt.helpers package

Submodules

ldt.helpers.exceptions module

LDT exceptions classes.

exception ldt.helpers.exceptions.AuthorizationError(message)[source]

Bases: ldt.helpers.exceptions.Error

Exception raised for non-existing relation types.

expression -- input expression in which the error occurred
message -- explanation of the error
exception ldt.helpers.exceptions.DictError(message)[source]

Bases: ldt.helpers.exceptions.Error

Exception raised for non-existing relation types.

expression -- input expression in which the error occurred
message -- explanation of the error
exception ldt.helpers.exceptions.Error[source]

Bases: Exception

Base class for exceptions in this module.

exception ldt.helpers.exceptions.LanguageError(message)[source]

Bases: ldt.helpers.exceptions.Error

Exception raised for non-existing languages.

expression -- input expression in which the error occurred
message -- explanation of the error
exception ldt.helpers.exceptions.ResourceError(message)[source]

Bases: ldt.helpers.exceptions.Error

Exception raised for non-existing languages.

expression -- input expression in which the error occurred
message -- explanation of the error
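
Because every exception listed above derives from Error, a caller can catch the base class to handle any LDT-specific failure, or catch a subclass first for a more specific message. The sketch below re-declares a reduced copy of the hierarchy locally so that it runs without LDT installed; the class bodies and the describe_failure helper are assumptions for illustration, not the library's source.

```python
class Error(Exception):
    """Local stand-in for ldt.helpers.exceptions.Error."""


class LanguageError(Error):
    """Local stand-in for ldt.helpers.exceptions.LanguageError."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message  # explanation of the error


class ResourceError(Error):
    """Local stand-in for ldt.helpers.exceptions.ResourceError."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


def describe_failure(action):
    """Run action() and report which kind of LDT error, if any, it raised."""
    try:
        action()
    except LanguageError as err:   # most specific handler first
        return f"language problem: {err.message}"
    except Error as err:           # any other exception from the hierarchy
        return f"other LDT problem: {err}"
    return "ok"


def bad_language():
    raise LanguageError("xx is not supported")


def bad_resource():
    raise ResourceError("missing file")
```

Ordering the handlers from most to least specific matters: if the Error clause came first, it would also swallow LanguageError.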

ldt.helpers.formatting module

Text formatting functions

This section includes a few helper functions for formatting different spelling variants.

ldt.helpers.formatting.dash_suffix(suffix)[source]

A helper function for custom derivation dicts.

Some suffixes are mostly spelled with a dash (e.g. tree-like), and some may be spelled with a dash for stylistic reasons (e.g. work-able). This function ensures that both ways are tracked.

Parameters:suffix (str) – suffix to be dashed or not
Returns:a dashed suffix
Return type:suffix (str)
ldt.helpers.formatting.get_spacing_variants(word)[source]

A helper function for get_relations() that, given a space-separated input string, produces a list of spelling variants of this word (e.g. [“good night”, “good-night”, “good_night”]).

Parameters:word (str) – input word
Returns:a list of variants: spaced, dashed and underscored
Return type:(list)
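
The behaviour described above can be approximated in a few lines. This is a minimal sketch based only on the example in the description; the name spacing_variants is hypothetical, and the real function may treat other separators differently.

```python
def spacing_variants(word):
    """Return the spaced, dashed and underscored spellings of a spaced input."""
    return [word, word.replace(" ", "-"), word.replace(" ", "_")]


variants = spacing_variants("good night")
```

In this sketch, an input without spaces produces three identical strings, so a caller that needs unique variants would have to deduplicate the result.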
ldt.helpers.formatting.remove_text_inside_brackets(text, brackets='()[]')[source]

A helper function for get_relations() that removes bracketed text from a string. The code is adapted from an external source.

Parameters:
  • text (str) – the text to clean from brackets
  • brackets – the list of symbols counting as brackets
Returns:cleaned-up text
Return type:(str)
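
One way to implement this kind of clean-up is to track the nesting depth of each bracket kind and keep only the characters seen at depth zero. The sketch below is an independent illustration under that assumption; the name strip_bracketed is hypothetical, and the code is not the implementation LDT ships.

```python
def strip_bracketed(text, brackets="()[]"):
    """Drop every bracketed span; `brackets` lists opener/closer pairs."""
    openers = brackets[0::2]          # "(" and "[" for the default
    closers = brackets[1::2]          # ")" and "]" for the default
    depth = [0] * len(openers)        # current nesting depth per bracket kind
    kept = []
    for char in text:
        if char in openers:
            depth[openers.index(char)] += 1
        elif char in closers and depth[closers.index(char)] > 0:
            depth[closers.index(char)] -= 1
        elif not any(depth):          # outside all brackets: keep the character
            kept.append(char)
    return "".join(kept)
```

A closing bracket with no matching opener is kept as ordinary text, while an opener that is never closed swallows the rest of the string. Surrounding whitespace is not collapsed, so "tree (plant)" becomes "tree " with a trailing space.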

ldt.helpers.formatting.rreplace(word, old_suffix, new_suffix)[source]

Helper for right-replacing suffixes
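
Since the description above is brief, here is a minimal sketch of what right-replacing a suffix can look like. The name replace_suffix is hypothetical, and the handling of words that do not end in old_suffix (returned unchanged here) is an assumption; the real rreplace may behave differently in that case.

```python
def replace_suffix(word, old_suffix, new_suffix):
    """Swap old_suffix for new_suffix, but only at the very end of the word."""
    if old_suffix and word.endswith(old_suffix):
        return word[: -len(old_suffix)] + new_suffix
    return word
```

Using str.replace() instead would also touch earlier occurrences of the same substring, which is why the sketch slices from the right.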

ldt.helpers.formatting.strip_non_alphabetical_characters(word, ignore=None)[source]

Helper function for removing any non-alphabetical character with optional exclusion list.

Wiktionary etymologies are a mess to parse. This function attempts extra clean-up of cases like (-ness or “king+. Optionally, it will return only strings that are determined to be words by noise.is_a_word().

Parameters:
  • word (str) – a potential word string to process
  • ignore (tuple) – the characters not to strip (e.g. “-”)
Returns:cleaned-up string (a potential word)
Return type:str
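
The core of the clean-up described above can be sketched with str.isalpha() and an exclusion tuple. The name keep_alphabetical is hypothetical, and the optional word check via noise.is_a_word() is deliberately left out because its behaviour is not documented here.

```python
def keep_alphabetical(word, ignore=None):
    """Remove every non-alphabetical character except those listed in ignore."""
    ignore = ignore or ()
    return "".join(ch for ch in word if ch.isalpha() or ch in ignore)


suffix = keep_alphabetical("(-ness", ignore=("-",))
stem = keep_alphabetical('"king+')
```

Note that str.isalpha() accepts any Unicode letter, so accented and non-Latin characters survive in this sketch.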

ldt.helpers.loading module

ldt.helpers.loading.get_object_size(obj, seen=None)[source]

A function that recursively finds the size of an object, adapted from https://goshippo.com/blog/measure-real-size-any-python-object/. Object sizes in Python should really not be that hard.

Warning: loading the same file into memory may result in slightly different object sizes.

Parameters:
  • obj – the object for which the size is to be calculated
  • seen – helper variable
Returns:the size of the object in bytes.
Return type:(int)
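
The general technique is to add sys.getsizeof() of the object to the sizes of everything it contains, while remembering object ids so that shared or cyclic references are counted once. The sketch below illustrates that idea under simplified assumptions; deep_size is a hypothetical name, it only descends into built-in containers, and it is not LDT's implementation.

```python
import sys


def deep_size(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and its contents."""
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:            # already counted: skip shared objects and cycles
        return 0
    seen.add(obj_id)
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    return size


nested = {"cat": ["noun", "animal"], "run": ["verb"]}
total = deep_size(nested)
```

The seen set plays the role of the seen helper parameter documented above: callers normally leave it as None and let the recursion fill it in.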

ldt.helpers.loading.load_jsonl_with_filtering(path, wordlist=None)[source]

Loads large jsonl files line by line, optionally storing only the entries that are in a provided wordlist.
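
Reading line by line keeps memory use bounded by the filtered result rather than by the file size. The sketch below shows the pattern, assuming that each line is a JSON object mapping words to their data; that layout, and the name load_jsonl_filtered, are assumptions for illustration and are not confirmed by the description above.

```python
import json
import os
import tempfile


def load_jsonl_filtered(path, wordlist=None):
    """Read a jsonl file line by line, keeping only entries found in wordlist."""
    wanted = set(wordlist) if wordlist is not None else None
    result = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:              # one JSON object per line
            line = line.strip()
            if not line:
                continue
            for word, value in json.loads(line).items():
                if wanted is None or word in wanted:
                    result[word] = value
    return result


# Usage on a throwaway two-line file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.jsonl")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write('{"cat": {"pos": "noun"}}\n{"run": {"pos": "verb"}}\n')
    everything = load_jsonl_filtered(sample)
    only_cat = load_jsonl_filtered(sample, wordlist=["cat"])
```

Converting the wordlist to a set makes each membership test constant-time, which matters when both the file and the wordlist are large.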

ldt.helpers.loading.load_language_file(resources_path, language)[source]
ldt.helpers.loading.load_resource(path, format='infer', lowercasing=True, silent=True, wordlist=None)[source]

A helper function for loading various file formats, optionally lowercasing them, and displaying the sizes of the resulting objects (for monitoring huge resources).

Parameters:
  • path (str) – path to file with the resource
  • format (str) – the format of the file. By default it is inferred from the file extension, but it can also be specified directly. The following formats are supported:

      • freqdict: a tab-separated [Word <tab> Number] file
      • csv_dict: [Word1 <tab> Word2,Word3,Word4…] or [Word1 <tab> Word2]
      • vocab: a one-word-per-line vocab file
      • json: a json dictionary
      • yaml: a yaml dictionary
      • json_freqdict: a json dictionary with frequency dictionaries as entries
      • jsonl: a jsonlines file that will be read line by line and optionally filtered by the provided wordlist
  • wordlist (list of str) – the wordlist by which to filter the contents of a jsonl resource
Returns:a set object for vocab files, a dictionary for everything else
Return type:(set, dict)
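
To make the format descriptions above concrete, here is a rough sketch of extension-based inference and of parsing the two simplest formats. All three function names are hypothetical. The sketch assumes that the file extension equals the format name and that freqdict numbers are integers; neither assumption is confirmed by the text above.

```python
import os


def infer_format(path):
    """Guess the resource format from the file extension (assumed convention)."""
    return os.path.splitext(path)[1].lstrip(".").lower()


def parse_vocab(lines, lowercasing=True):
    """One word per line -> set of words."""
    words = (line.strip() for line in lines)
    return {w.lower() if lowercasing else w for w in words if w}


def parse_freqdict(lines, lowercasing=True):
    """Tab-separated [Word <tab> Number] lines -> dict of word to count."""
    result = {}
    for line in lines:
        if not line.strip():
            continue
        word, number = line.rstrip("\n").split("\t")
        result[word.lower() if lowercasing else word] = int(number)
    return result
```

Both parsers accept any iterable of lines, so they work on an open file handle as well as on a list of strings.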

ldt.helpers.resources module

Various resources

Various resources of LDT are loaded and available for use elsewhere.

ldt.helpers.resources.load_stopwords(language)[source]

A function to load NLTK stopword lists for the supported languages. At the moment, these are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish and turkish.

Parameters:language (str) – the language for which NLTK stopwords should be loaded
Returns:the set of stopwords
Return type:(frozenset)
ldt.helpers.resources.lookup_language_by_code(language, reverse=False)[source]

LDT mainly uses 2-letter language codes for language settings; they are also used in Wiktionary and BabelNet. This function converts canonical language names to codes and vice versa.

Examples

>>> import ldt
>>> ldt.helpers.resources.lookup_language_by_code("en")
'English'
>>> ldt.helpers.resources.lookup_language_by_code("English", reverse=True)
'en'
Parameters:
  • language (str) – the canonical name of the language, or a 2-letter code (https://en.wiktionary.org/wiki/Wiktionary:List_of_languages#Two-letter_codes)
  • reverse (bool) – if True, returns the language code for the language

Returns:the canonical name of that language or its 2-letter code
Return type:(str)
Raises:LanguageError – the language was not found
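
The two-way lookup and the error case described above can be sketched with a small table. The three-entry mapping, the case-insensitive matching and the names _CODES and lookup_language are assumptions for illustration; LDT itself relies on the full Wiktionary list.

```python
class LanguageError(Exception):
    """Local stand-in for ldt.helpers.exceptions.LanguageError."""


# Tiny illustrative table; the real list is far longer.
_CODES = {"en": "English", "de": "German", "fr": "French"}


def lookup_language(language, reverse=False):
    """Map a 2-letter code to a language name, or a name to a code if reverse."""
    if reverse:
        for code, name in _CODES.items():
            if name.lower() == language.lower():
                return code
    elif language.lower() in _CODES:
        return _CODES[language.lower()]
    raise LanguageError(f"{language} was not found")


name = lookup_language("en")
code = lookup_language("English", reverse=True)
```

Raising a dedicated exception rather than returning None forces callers to decide explicitly how to handle an unknown language.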
ldt.helpers.resources.update_dict(dict1, dict2)[source]

Helper for _productive_morphology().

ldt.helpers.wiktionary_cache module

Wiktionary cache of existing pages.

It is possible to use the Wiktionary API directly to find out whether a word has an entry, e.g. http://en.wiktionary.org/w/api.php?action=query&titles=my_word_of_interest. However, that is slow and unkind to the Wiktionary servers when running large-scale experiments. LDT caches a list of entries from the latest dump in the ldt_resources folder and uses that.

Cache files are created in the LDT resources directory, which is set in the LDT config file in the user directory.

The naming convention is YYYY-M-D_language_dictionary.vocab. For example:

  • 2018-7-1_en_wikisaurus.vocab
  • 2018-7-1_en_wiktionary.vocab
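
The naming convention above can be reproduced with an f-string. This is a minimal sketch; the function name timestamped_vocab_name and its day parameter are hypothetical, and the non-zero-padded month and day follow the two example filenames.

```python
import datetime


def timestamped_vocab_name(language, wikisaurus=False, day=None):
    """Build a cache filename of the form YYYY-M-D_language_dictionary.vocab."""
    day = day or datetime.date.today()
    dictionary = "wikisaurus" if wikisaurus else "wiktionary"
    return f"{day.year}-{day.month}-{day.day}_{language}_{dictionary}.vocab"


release = datetime.date(2018, 7, 1)
entries_file = timestamped_vocab_name("en", day=release)
thesaurus_file = timestamped_vocab_name("en", wikisaurus=True, day=release)
```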
ldt.helpers.wiktionary_cache.find_vocab_file(language, path_to_cache, wikisaurus=False)[source]

A helper function for finding the timestamped vocab file for the needed language in the resources folder.

Parameters:
  • language (str) – a 2-letter language code
  • path_to_cache (str) – the path to the cache subfolder of the ldt resources folder specified in config, where the wiktionary cache files are saved
  • wikisaurus (bool) – if False, Wiktionary entry namespace is cached, otherwise Wiktionary thesaurus entries are cached.
Returns:filename if one was present, or “none”
Return type:(str)
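
A lookup of this kind can be sketched by matching the language and dictionary part of the filename and ignoring the timestamp. The name find_cached_vocab is hypothetical, and picking the first match in sorted order is an assumption; the real function may choose differently when several files match.

```python
import os
import tempfile


def find_cached_vocab(language, path_to_cache, wikisaurus=False):
    """Return the name of the cached vocab file for language, or "none"."""
    dictionary = "wikisaurus" if wikisaurus else "wiktionary"
    suffix = f"_{language}_{dictionary}.vocab"
    for name in sorted(os.listdir(path_to_cache)):
        if name.endswith(suffix):
            return name
    return "none"


# Usage on a throwaway cache folder holding a single Wiktionary file.
with tempfile.TemporaryDirectory() as cache:
    open(os.path.join(cache, "2018-7-1_en_wiktionary.vocab"), "w").close()
    found = find_cached_vocab("en", cache)
    missing = find_cached_vocab("en", cache, wikisaurus=True)
```

The string "none" (rather than None) mirrors the documented return value, so callers compare against the string.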

ldt.helpers.wiktionary_cache.get_cache_dir(path_to_cache='/home/docs/checkouts/readthedocs.org/user_builds/ldt/checkouts/latest/ldt/tests/sample_files/')[source]

Helper function that formats the path to cache and creates it, if necessary.

Parameters:
  • path_to_cache (str) – the path to the resource directory. If the “cache” subfolder does not exist, it will be created.
Returns:the path to the cache directory.
Return type:(str)
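
Creating the subfolder on demand is a one-liner with os.makedirs(). The sketch below assumes the subfolder is literally named "cache", as the parameter description suggests; ensure_cache_dir is a hypothetical name, not the library function.

```python
import os
import tempfile


def ensure_cache_dir(path_to_resources):
    """Return the "cache" subfolder of the resources folder, creating it if needed."""
    cache_dir = os.path.join(path_to_resources, "cache")
    os.makedirs(cache_dir, exist_ok=True)   # no error if it already exists
    return cache_dir


with tempfile.TemporaryDirectory() as resources:
    cache_dir = ensure_cache_dir(resources)
    created = os.path.isdir(cache_dir)
    again = ensure_cache_dir(resources)     # second call is a no-op
```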

ldt.helpers.wiktionary_cache.get_timestamped_vocab_filenames(filename, language='English', wikisaurus=False)[source]

A helper function for update_wiktionary_pages() that provides timestamped filenames.

Parameters:
  • filename (str) – an earlier cache file, or “none” if there wasn’t one.
  • language (str) – a 2-letter language code
  • wikisaurus (bool) – if False, Wiktionary entry namespace is cached, otherwise Wiktionary thesaurus entries are cached.
Returns:a dictionary holding the old and new filenames
Return type:(dict)

ldt.helpers.wiktionary_cache.load_wiktionary_cache(language='English', lowercasing=True, path_to_cache='/home/docs/checkouts/readthedocs.org/user_builds/ldt/checkouts/latest/ldt/tests/sample_files/', wikisaurus=False, silent=True)[source]
Parameters:
  • language (str) – a 2-letter language code
  • lowercasing (bool) – if not set, the global config variable is used. True (default) lowercases all vocab.
  • path_to_cache (str) – the path to ldt resources folder specified in config. The cache files are saved in “cache” subfolder.
  • wikisaurus (bool) – if False, Wiktionary entry namespace is cached, otherwise Wiktionary thesaurus entries are cached.
Returns:vocab list for the corresponding language, lowercased or not according to the global or local lowercasing option
Return type:(set)

ldt.helpers.wiktionary_cache.update_wiktionary_cache(language='English', path_to_cache='/home/docs/checkouts/readthedocs.org/user_builds/ldt/checkouts/latest/ldt/tests/sample_files/', wikisaurus=False)[source]

The main wiktionary cache updating function.

Parameters:
  • language (str) – a 2-letter language code
  • path_to_cache (str) – the path to ldt resources folder specified in config. The cache files are saved in “cache” subfolder of this folder.
  • wikisaurus (bool) – if False, Wiktionary entry namespace is cached, otherwise Wiktionary thesaurus entries are cached.
Returns:True if the cache was updated, False otherwise
Return type:(bool)