indic_transliteration.deduplication

Some useful functions for converting and disambiguating between common alternative orthographies (ways of writing) of the same text.

indic_transliteration.deduplication.fix_lazy_anusvaara_itrans(data_in)

indic_transliteration.deduplication.get_approx_deduplicating_key(text, encoding_scheme='devanagari')

Given some devanAgarI Sanskrit text, this function produces a "key" such that:

1] The key should be the same for different observed orthographical forms of the same text. For example:

- "dharmma" vs "dharma"
- "rAmaM gacChati" vs "rAma~N gacChati"
- "kurvan eva" vs "kurvanneva"

2] The key should be different for different texts. For example:

- "stamba" vs "stambha"

This function attempts to succeed at [1] and [2] almost all the time. The longer the text, the lower the probability of failing at [2], while the probability of failing at [1] increases (albeit very slightly).

Sources of orthographically divergent forms:

- Phonetically sensible grammar rules.
- Neglect of sandhi while writing.
- Punctuation, spaces, avagraha-s.
- Regional-language-influenced mistakes (La instead of la).

Some example applications of this function:

- Create a database of quotes or words with minimal duplication.
- Search a database of quotes or words while being robust to optional forms.
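
The two applications above can be sketched as a small index keyed by a deduplicating key. This is only an illustration under stated assumptions: `toy_key`, `build_index` and `lookup` are hypothetical names, and `toy_key` is a deliberately crude stand-in that works on romanized text. It is not the algorithm used by get_approx_deduplicating_key. In practice one would presumably pass the library function as `key_fn` instead.

```python
import re
from collections import defaultdict


def toy_key(text):
    """Hypothetical stand-in key for romanized text; NOT the library's algorithm."""
    # Drop spaces, punctuation and anything else that is not a letter or "~".
    compact = re.sub(r"[^A-Za-z~]", "", text)
    # Collapse immediately repeated characters, e.g. "mm" -> "m", "nn" -> "n".
    return re.sub(r"(.)\1+", r"\1", compact)


def build_index(entries, key_fn=toy_key):
    """Group entries that share a key, so variant spellings land in one bucket."""
    index = defaultdict(list)
    for entry in entries:
        index[key_fn(entry)].append(entry)
    return dict(index)


def lookup(index, query, key_fn=toy_key):
    """Return all stored forms whose key matches the key of the query."""
    return index.get(key_fn(query), [])


index = build_index(
    ["dharmma", "dharma", "kurvan eva", "kurvanneva", "stamba", "stambha"]
)
# Variant spellings of one text share a bucket; distinct texts do not.
matches = lookup(index, "kurvanneva")
```

Here `matches` holds both "kurvan eva" and "kurvanneva", while "stamba" and "stambha" stay in separate buckets. The toy key over-merges in general (for instance, it would also collapse a long vowel written as a doubled letter), which is why a purpose-built key function is preferable for real data.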

Also see the equivalent function in the Scala indic-transliteration package.
