indic_transliteration.deduplication

Some useful functions for converting and disambiguating between common alternative orthographies (ways of writing) of the same text.

indic_transliteration.deduplication.fix_lazy_anusvaara_itrans(data_in)
indic_transliteration.deduplication.get_approx_deduplicating_key(text, encoding_scheme='devanagari')

Given some devanAgarI Sanskrit text, this function produces a “key” such that:

1] The key should be the same for different observed orthographical forms of the same text. For example:

  • “dharmma” vs “dharma”
  • “rAmaM gacChati” vs “rAma~N gacChati”
  • “kurvan eva” vs “kurvanneva”

2] The key should be different for different texts. For example:

  • “stamba” vs “stambha”

This function attempts to succeed at [1] and [2] almost all the time. The longer the text, the lower the probability of failing at [2], while the probability of failing at [1] increases (albeit very slightly).
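The two requirements can be illustrated with a deliberately simplified sketch. The function `toy_dedup_key` below is hypothetical and is not the library's implementation: it assumes ITRANS-like romanized input (as in the examples above) rather than devanAgarI, and applies only three rough rules (unify `~N`/`~n` with the anusvAra `M`, drop spaces and punctuation, collapse doubled consonants).

```python
import re

# Hypothetical toy key function for illustration only.
# It is NOT the algorithm used by get_approx_deduplicating_key.
_DOUBLED_CONSONANT = re.compile(r"([b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z])\1+")


def toy_dedup_key(text: str) -> str:
    """Return a rough deduplication key for ITRANS-like romanized text."""
    # Treat class nasals written as ~N / ~n like the anusvAra M.
    key = text.replace("~N", "M").replace("~n", "M")
    # Drop spaces, punctuation and any other non-letter characters.
    key = re.sub(r"[^A-Za-z]", "", key)
    # Collapse doubled consonants: "dharmma" -> "dharma".
    return _DOUBLED_CONSONANT.sub(r"\1", key)


# Requirement [1]: variant spellings of the same text share a key.
same_1 = toy_dedup_key("dharmma") == toy_dedup_key("dharma")
same_2 = toy_dedup_key("rAmaM gacChati") == toy_dedup_key("rAma~N gacChati")
same_3 = toy_dedup_key("kurvan eva") == toy_dedup_key("kurvanneva")

# Requirement [2]: different texts get different keys.
differ = toy_dedup_key("stamba") != toy_dedup_key("stambha")
```

Because the rules are lossy, a sketch like this can also merge texts that are genuinely different, which is the failure at [2] described above.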

Sources of orthographically divergent forms:

  • Phonetically sensible grammar rules
  • Neglect of sandhi while writing
  • Punctuation, spaces, avagraha-s
  • Regional-language-influenced mistakes (La instead of la)

Some example applications of this function:

  • Create a database of quotes or words with minimal duplication.
  • Search a database of quotes or words while being robust to optional forms.
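Both applications amount to indexing entries by their key. The sketch below is a minimal illustration under stated assumptions: `build_index`, `find` and `placeholder_key` are hypothetical names, and `placeholder_key` (which only ignores whitespace) stands in for a real key function so that the example runs without the library installed. In practice one would presumably pass `get_approx_deduplicating_key` as `key_fn`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

KeyFn = Callable[[str], str]


def placeholder_key(text: str) -> str:
    """Hypothetical stand-in key: ignores whitespace only."""
    return "".join(text.split())


def build_index(entries: Iterable[str], key_fn: KeyFn) -> Dict[str, List[str]]:
    """Group entries that share the same deduplicating key."""
    index: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        index[key_fn(entry)].append(entry)
    return dict(index)


def find(index: Dict[str, List[str]], query: str, key_fn: KeyFn) -> List[str]:
    """Look up all stored forms whose key matches the query's key."""
    return index.get(key_fn(query), [])


quotes = ["kurvan eva", "kurvaneva", "stamba"]
index = build_index(quotes, placeholder_key)

# One representative per key gives a database with minimal duplication.
unique_quotes = [forms[0] for forms in index.values()]

# Searching is robust to the variation the key function absorbs.
matches = find(index, "kurvan  eva", placeholder_key)
```

Keeping all observed forms under each key, rather than only the first, lets a search return every stored spelling of a match.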

Also see the equivalent function in the Scala indic-transliteration package.