Wednesday, January 23, 2008

Removing diacritics (Windows-1250) in Ruby

We received data in Windows-1250 (cp1250) encoding. That is a problem in Rails, which works with UTF-8.
When you save this data to YAML, accented characters come out as escapes such as \xe1 for á (in cp1250), and they show up as strange characters when read as UTF-8. So we need to strip the diacritics and replace these characters with their ASCII equivalents.
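
To make the mismatch concrete, here is a minimal sketch of how the same byte reads under each encoding. It assumes a modern Ruby (2.0 or later) with built-in string encodings, which is newer than what this post was written against; the variable names are only for illustration.

```ruby
# The single byte 0xE1 is "á" in Windows-1250.
raw = "\xE1".b                                   # binary string holding one byte

as_cp1250 = raw.dup.force_encoding("Windows-1250")
as_utf8   = raw.dup.force_encoding("UTF-8")

puts as_cp1250.valid_encoding?                   # true
puts as_utf8.valid_encoding?                     # false: 0xE1 alone is not valid UTF-8

# Transcoding (instead of just relabelling) produces the proper UTF-8 bytes.
utf8 = as_cp1250.encode("UTF-8")
puts utf8.bytes.map { |b| format("%02x", b) }.join(" ")   # c3 a1
```

If the raw cp1250 bytes are still available, transcoding them this way keeps the accents; the table-based approach is for when only the escaped text is left or plain ASCII is required.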

You can use the following code to remove the diacritics:

TABLE1250 = {"e1" => "a", "e4" => "a", "e8" => "c", "ef" => "d", "e9" => "e", "ec" => "e", "ed" => "i", "be" => "l", "e5" => "l", "f2" => "n", "f3" => "o", "f6" => "o", "f5" => "o", "f4" => "o", "f8" => "r", "e0" => "r", "9a" => "s", "9d" => "t", "fa" => "u", "f9" => "u", "fc" => "u", "fb" => "u", "fd" => "y", "9e" => "z", "c1" => "A", "c4" => "A", "c8" => "C", "cf" => "D", "c9" => "E", "cc" => "E", "cd" => "I", "bc" => "L", "c5" => "L", "d2" => "N", "d3" => "O", "d6" => "O", "d5" => "O", "d4" => "O", "d8" => "R", "c0" => "R", "8a" => "S", "8d" => "T", "da" => "U", "d9" => "U", "dc" => "U", "db" => "U", "dd" => "Y", "8e" => "Z"}

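def remove_diacritic(str)
  result = str.dup  # work on a copy so the caller's string is not modified
  # Each escape is 4 characters: a backslash, "x", and two hex digits.
  while (idx = result.index("\\x"))
    code = result[idx + 2, 2].to_s.downcase
    # Escapes missing from TABLE1250 are replaced by an empty string.
    result[idx, 4] = TABLE1250[code].to_s
  end
  result
end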
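
As a rough illustration of the same idea, the sketch below does the replacement in one pass with gsub and shows a sample call. SAMPLE_TABLE and strip_escapes are hypothetical names used only here; the table is an abbreviated copy of TABLE1250 so the snippet runs on its own.

```ruby
# Abbreviated copy of TABLE1250 (hypothetical name, only the entries used below).
SAMPLE_TABLE = { "e1" => "a", "e8" => "c", "9a" => "s", "fd" => "y", "8e" => "Z" }

# Replace every literal "\xNN" escape with its ASCII equivalent in one pass.
# Escapes missing from the table are dropped, as in the loop version.
def strip_escapes(str, table = SAMPLE_TABLE)
  str.gsub(/\\x([0-9a-fA-F]{2})/) { table.fetch($1.downcase, "") }
end

escaped = 'V\xe1clav S\xfdkora'   # single quotes keep the backslashes literal
puts strip_escapes(escaped)       # Vaclav Sykora
```

Unlike the loop version, the regular expression only matches when two hex digits follow the \x, so other text that merely contains a backslash and an x is left alone.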

1 comment:

veny [Václav Sýkora] said...

I have seen a tool for this in the Iconv library. Something like Iconv.new('ASCII//TRANSLIT', 'UTF-8') ...