indic_transliteration.detect

Example usage:

from indic_transliteration import detect
detect.detect('pitRRIn') == detect.Scheme.ITRANS
detect.detect('pitRRn') == detect.Scheme.HK

When handling a Sanskrit string, it’s almost always best to explicitly state its transliteration scheme. This avoids embarrassing errors with words like pitRRIn. But most of the time, it’s possible to infer the encoding from the text itself.

detect.py automatically detects a string’s transliteration scheme:

detect('pitRRIn') == Scheme.ITRANS
detect('pitRRn') == Scheme.HK
detect('pitFn') == Scheme.SLP1
detect('पितॄन्') == Scheme.Devanagari
detect('পিতৄন্') == Scheme.Bengali

Supported schemes

All schemes are attributes on the Scheme class. You can also just use the scheme name:

Scheme.IAST == 'IAST'
Scheme.Devanagari == 'Devanagari'

Scripts:

  • Bengali ('Bengali')
  • Devanagari ('Devanagari')
  • Gujarati ('Gujarati')
  • Gurmukhi ('Gurmukhi')
  • Kannada ('Kannada')
  • Malayalam ('Malayalam')
  • Oriya ('Oriya')
  • Tamil ('Tamil')
  • Telugu ('Telugu')

Romanizations:

  • Harvard-Kyoto ('HK')
  • IAST ('IAST')
  • ITRANS ('ITRANS')
  • Kolkata ('Kolkata')
  • SLP1 ('SLP1')
  • Velthuis ('Velthuis')

indic_transliteration.detect.BLOCKS = [('Malayalam', 3328), ('Kannada', 3200), ('Telugu', 3072), ('Tamil', 2944), ('Oriya', 2816), ('Gujarati', 2688), ('Gurmukhi', 2560), ('Bengali', 2432), ('Devanagari', 2304)]

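Script schemes paired with the first code point of their Unicode block, sorted from highest to lowest. Schemes with no Unicode block of their own (the romanizations) are omitted.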

indic_transliteration.detect.BRAHMIC_FIRST_CODE_POINT = 2304

Start of the Devanagari block (U+0900).

indic_transliteration.detect.BRAHMIC_LAST_CODE_POINT = 3455

End of the Malayalam block (U+0D7F).
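The three constants above suggest a simple range lookup. The following is a minimal sketch of how they could be combined to name the script of a character; it is not necessarily how detect.py does it. The helpers script_of and first_brahmic_script are hypothetical names introduced only for this illustration, and the constants are copied here so the snippet runs on its own.

```python
# Constants copied from the values documented above.
BLOCKS = [
    ('Malayalam', 3328), ('Kannada', 3200), ('Telugu', 3072),
    ('Tamil', 2944), ('Oriya', 2816), ('Gujarati', 2688),
    ('Gurmukhi', 2560), ('Bengali', 2432), ('Devanagari', 2304),
]
BRAHMIC_FIRST_CODE_POINT = 2304
BRAHMIC_LAST_CODE_POINT = 3455


def script_of(char):
    """Hypothetical helper: script name for one character, or None."""
    code = ord(char)
    if not BRAHMIC_FIRST_CODE_POINT <= code <= BRAHMIC_LAST_CODE_POINT:
        return None
    # BLOCKS is in descending order, so the first entry whose start
    # is <= code is the block that contains the character.
    for name, start in BLOCKS:
        if start <= code:
            return name
    return None


def first_brahmic_script(text):
    """Hypothetical helper: script of the first Brahmic character in text."""
    for char in text:
        name = script_of(char)
        if name is not None:
            return name
    return None
```

Because the list is sorted from the highest block start downwards, a single linear scan is enough and no block end points need to be stored.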

class indic_transliteration.detect.Regex
IAST_OR_KOLKATA_ONLY = re.compile('[āīūṛṝḷḹēōṃḥṅñṭḍṇśṣḻ]')

Match on special Roman characters

ITRANS_ONLY = re.compile('ee|oo|\\^[iI]|RR[iI]|L[iI]|~N|N\\^|Ch|chh|JN|sh|Sh|\\.a')

Match on ITRANS-only sequences

ITRANS_OR_VELTHUIS_ONLY = re.compile('aa|ii|uu|~n')

Match on chars shared by ITRANS and Velthuis

KOLKATA_ONLY = re.compile('[ēō]')

Match on Kolkata-specific Roman characters

SLP1_ONLY = re.compile('[fFxXEOCYwWqQPB]|kz|Nk|Ng|tT|dD|Sc|Sn|[aAiIuUfFxXeEoO]R|G[yr]|(\\W|^)G')

Match on SLP1-only characters and bigrams

VELTHUIS_ONLY = re.compile('\\.[mhnrltds]|"n|~s')

Match on Velthuis-only characters
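The patterns above can be applied as a cascade, from the most distinctive to the least. The sketch below is one plausible ordering, not a statement of what detect() actually does. Two choices in it are assumptions: text matching only ITRANS_OR_VELTHUIS_ONLY is resolved as ITRANS, and text matching nothing falls back to Harvard-Kyoto. The function guess_roman_scheme is a hypothetical name, and it returns the plain scheme-name strings listed under "Supported schemes".

```python
import re

# Patterns copied from the Regex class documented above.
IAST_OR_KOLKATA_ONLY = re.compile('[āīūṛṝḷḹēōṃḥṅñṭḍṇśṣḻ]')
ITRANS_ONLY = re.compile(
    'ee|oo|\\^[iI]|RR[iI]|L[iI]|~N|N\\^|Ch|chh|JN|sh|Sh|\\.a')
ITRANS_OR_VELTHUIS_ONLY = re.compile('aa|ii|uu|~n')
KOLKATA_ONLY = re.compile('[ēō]')
SLP1_ONLY = re.compile(
    '[fFxXEOCYwWqQPB]|kz|Nk|Ng|tT|dD|Sc|Sn|[aAiIuUfFxXeEoO]R|G[yr]|(\\W|^)G')
VELTHUIS_ONLY = re.compile('\\.[mhnrltds]|"n|~s')


def guess_roman_scheme(text):
    """Hypothetical helper: guess the romanization used by `text`."""
    # Diacritics occur only in IAST and Kolkata; e/o with macron
    # are specific to Kolkata.
    if IAST_OR_KOLKATA_ONLY.search(text):
        return 'Kolkata' if KOLKATA_ONLY.search(text) else 'IAST'
    if ITRANS_ONLY.search(text):
        return 'ITRANS'
    if SLP1_ONLY.search(text):
        return 'SLP1'
    if VELTHUIS_ONLY.search(text):
        return 'Velthuis'
    # Assumption: doubled vowels alone are treated as ITRANS.
    if ITRANS_OR_VELTHUIS_ONLY.search(text):
        return 'ITRANS'
    # Assumption: Harvard-Kyoto is the fallback when nothing matches.
    return 'HK'
```

With this ordering the three romanized examples from the module description come out as documented: 'pitRRIn' matches RR[iI] and is ITRANS, 'pitFn' matches the SLP1-only F, and 'pitRRn' matches nothing and falls through to HK.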

indic_transliteration.detect.Scheme

Enum for Sanskrit schemes.

alias of indic_transliteration.detect.Enum

indic_transliteration.detect.detect(text)

Detect the input’s transliteration scheme.

Parameters: text – the text to inspect, either a unicode string or a str encoded in UTF-8.
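On Python 3, the "str encoded in UTF-8" case corresponds to a bytes object. A caller who wants to be sure a text string is passed can decode first. This is a minimal sketch of that step; to_text is a hypothetical helper, not part of the module.

```python
def to_text(data, encoding='utf-8'):
    """Hypothetical helper: return `data` as text, decoding bytes if needed."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data
```

The result of to_text(raw) can then be handed to detect() regardless of whether raw arrived as bytes or as text.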