indic_transliteration.deduplication
Some useful functions for converting between, and disambiguating, common alternative orthographies (ways of writing) of the same text.
indic_transliteration.deduplication.get_approx_deduplicating_key(text, encoding_scheme='devanagari')
Given some devanAgarI Sanskrit text, this function produces a “key” so that:
1] The key should be the same for different observed orthographical forms of the same text. For example:
- "dharmma" vs "dharma"
- "rAmaM gacChati" vs "rAma~N gacChati"
- "kurvan eva" vs "kurvanneva"
2] The key should be different for different texts. For example:
- “stamba” vs “stambha”
This function attempts to satisfy [1] and [2] almost all the time. The longer the text, the lower the probability of failing at [2], while the probability of failing at [1] increases (albeit very slightly).
Sources of orthographically divergent forms:
- Phonetically sensible grammar rules
- Neglect of sandhi while writing
- Punctuation, spaces, avagraha-s
- Regional-language-influenced mistakes (La instead of la)
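As a rough illustration of how such divergences can be folded into a single key, here is a minimal sketch. It is not the library's algorithm: `toy_dedup_key` is a hypothetical helper, it assumes ITRANS-like Latin transliteration rather than devanAgarI input, and its rules (treating `~N` like `M`, folding `L` to `l`, dropping every non-letter, collapsing doubled consonants) are simplifying assumptions.

```python
import re


def toy_dedup_key(text: str) -> str:
    """Toy approximate key for ITRANS-like Latin text (hypothetical helper).

    Illustration only; the real function's rules are not reproduced here.
    """
    key = text
    # Assumption: a velar nasal written "~N" and an anusvara "M" are
    # treated as alternative spellings of the same sound.
    key = key.replace("~N", "M")
    # Regional-language influence: fold "L" into "l".
    key = key.replace("L", "l")
    # Drop spaces, punctuation and apostrophe-style avagraha marks,
    # i.e. everything that is not an ASCII letter.
    key = re.sub(r"[^A-Za-z]", "", key)
    # Collapse a run of the same consonant into one ("rmm" -> "rm",
    # "nn" -> "n"). Only letters remain, so "not a vowel" means consonant.
    key = re.sub(r"([^aeiouAEIOU])\1+", r"\1", key)
    return key


# Forms that should share a key (goal [1]).
same_1 = toy_dedup_key("dharmma") == toy_dedup_key("dharma")
same_2 = toy_dedup_key("kurvan eva") == toy_dedup_key("kurvanneva")
# Forms that should not share a key (goal [2]).
different = toy_dedup_key("stamba") != toy_dedup_key("stambha")
```

Collapsing every doubled consonant also merges genuinely different words, so this toy fails goal [2] far more often than a careful implementation would.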
Some example applications of this function:
- Create a database of quotes or words with minimal duplication.
- Search a database of quotes or words while being robust to optional forms.
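Both applications can be sketched with a small in-memory index that stores entries under their key. The `QuoteIndex` class and the `whitespace_insensitive_key` stand-in below are hypothetical names introduced for illustration. The stand-in only ignores whitespace so that the example runs without the library; in practice one would presumably pass `get_approx_deduplicating_key` as `key_fn` instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def whitespace_insensitive_key(text: str) -> str:
    """Stand-in key function (assumption): ignores whitespace only."""
    return "".join(text.split())


class QuoteIndex:
    """Hypothetical in-memory store that groups quotes by a dedup key."""

    def __init__(self, key_fn: Callable[[str], str]) -> None:
        self._key_fn = key_fn
        self._by_key: Dict[str, List[str]] = defaultdict(list)

    def add(self, quote: str) -> bool:
        """Record the quote; return True only if its key was not seen before."""
        key = self._key_fn(quote)
        is_new = key not in self._by_key
        self._by_key[key].append(quote)
        return is_new

    def find(self, query: str) -> List[str]:
        """Return every stored form whose key matches the query's key."""
        return list(self._by_key.get(self._key_fn(query), []))


index = QuoteIndex(whitespace_insensitive_key)
index.add("rAmaM gacChati")   # first form: counted as a new entry
index.add("rAmaMgacChati")    # same key: kept as a variant, not a new entry
variants = index.find("rAmaM  gacChati")  # lookup tolerates spacing differences
```

Keeping all observed forms under one key, rather than discarding later ones, lets a search return the original spellings while the number of distinct keys measures the deduplicated size.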
Also see the equivalent function in the Scala indic-transliteration package.