Wednesday, January 23, 2008

Removing diacritics (Windows-1250) in Ruby

We received data in Windows-1250 (cp1250) encoding. That is a problem in Rails, which works with UTF-8.
When you save this data to YAML, accented characters are written as escapes such as \xe1 for á (its byte value in cp1250), and they show up as strange characters when read as UTF-8. So we need to strip the diacritics and replace the accented characters with plain ASCII letters.
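To make the mismatch concrete, here is a minimal sketch (assuming Ruby 1.9 or newer, where strings carry an encoding) that takes the single byte 0xE1 and interprets it both ways. The variable names are only illustrative.

```ruby
# One raw byte, 0xE1, tagged as binary so no encoding is assumed yet.
byte = "\xE1".b

# The same byte, labeled once as cp1250 and once as UTF-8.
as_cp1250 = byte.dup.force_encoding("Windows-1250")
as_utf8   = byte.dup.force_encoding("UTF-8")

puts as_cp1250.encode("UTF-8")  # prints "á"
puts as_utf8.valid_encoding?    # prints "false"
```

In cp1250 the byte is a complete character, while in UTF-8 it is only the start of a multi-byte sequence, which is why the data looks broken when treated as UTF-8.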

You can use the following code to remove the diacritics:

# Maps cp1250 byte values (as lowercase hex) to their unaccented ASCII letters.
TABLE1250 = {
  # lowercase letters
  "e1" => "a", "e4" => "a", "e8" => "c", "ef" => "d", "e9" => "e", "ec" => "e",
  "ed" => "i", "be" => "l", "e5" => "l", "f2" => "n", "f3" => "o", "f6" => "o",
  "f5" => "o", "f4" => "o", "f8" => "r", "e0" => "r", "9a" => "s", "9d" => "t",
  "fa" => "u", "f9" => "u", "fc" => "u", "fb" => "u", "fd" => "y", "9e" => "z",
  # uppercase letters
  "c1" => "A", "c4" => "A", "c8" => "C", "cf" => "D", "c9" => "E", "cc" => "E",
  "cd" => "I", "bc" => "L", "c5" => "L", "d2" => "N", "d3" => "O", "d6" => "O",
  "d5" => "O", "d4" => "O", "d8" => "R", "c0" => "R", "8a" => "S", "8d" => "T",
  "da" => "U", "d9" => "U", "dc" => "U", "db" => "U", "dd" => "Y", "8e" => "Z"
}

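# Replaces each escaped cp1250 byte (a literal backslash, "x", and two hex
# digits, e.g. \xe1) with its ASCII letter from TABLE1250. Escapes that are
# not in the table are removed. Returns a new string.
def remove_diacritic(str)
  str.gsub(/\\x([0-9a-fA-F]{2})/) { TABLE1250[$1.downcase].to_s }
end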
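If you still have the raw cp1250 bytes rather than the escaped text, a different technique avoids the lookup table altogether: transcode to UTF-8, decompose each accented character, and drop the combining marks. This is only a sketch and not the method used above. It assumes Ruby 2.2 or newer (for String#unicode_normalize), and strip_diacritics_cp1250 is a hypothetical helper name.

```ruby
# Hypothetical helper, not from the original post: takes raw cp1250 bytes,
# transcodes them to UTF-8, splits accented letters into base letter plus
# combining mark (NFD), and removes the combining marks.
def strip_diacritics_cp1250(bytes)
  utf8 = bytes.dup.force_encoding("Windows-1250").encode("UTF-8")
  utf8.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# "Příliš žluťoučký kůň" as raw cp1250 bytes.
raw = "P\xF8\xEDli\x9A \x9Elu\x9Dou\xE8k\xFD k\xF9\xF2".b
puts strip_diacritics_cp1250(raw)  # prints "Prilis zlutoucky kun"
```

Characters that have no Unicode decomposition (for example ł) pass through unchanged, so a table like TABLE1250 may still be the better fit when you need full control over every replacement.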

Comments:

veny [Václav Sýkora] said...

I have seen a tool for this in the Iconv library. Something like Iconv.new('ASCII//TRANSLIT', 'UTF-8') ...

Anonymous said...

Nice information. Thanks for sharing this information.