How to convert the charset of text in an Excel file that contains multiple languages and charsets?

Wu_Nan · September 15, 2008, 10:56am

Hello all,

I want to use Ruby to read an Excel file's content and convert it to
UTF-8.
However, the file contains text in many different languages, such as
Greek, Japanese, Korean, Russian and so on.
So I use Iconv to convert it to UTF-8.
I searched the internet, and some articles said the default charset of Excel
is UTF-16LE.
So I use the code below:

Iconv.conv("UTF-8","UTF-16",$excel.Cells(row,col).value.to_s)

And the contents of the Excel file are (each line is a cell):

(Please wait)
(Veuillez attendre)
(Bitte warten)
(Espere un momento)
(Attendere, prego)
(Even geduld aub)
(Ждите)
(Aguarde)
(잠시만 기다려주십시오)

After I run it, I get an error:
in `conv': ")" (Iconv::InvalidCharacter)

It seems that in UTF-16 the ( is not '('???
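
One plausible reading of this error, sketched here under assumptions the thread does not confirm: the string coming back from the cell is single-byte text rather than UTF-16, so a 13-byte string such as "(Please wait)" leaves a dangling last byte, the ")", when it is decoded two bytes at a time. The sketch uses String#encode and force_encoding from Ruby 1.9 or later rather than Iconv, and the sample string is only illustrative.

```ruby
# Assumption: the cell value arrives as single-byte text, not UTF-16.
single_byte = "(Please wait)"

# Relabel the same 13 bytes as UTF-16LE without converting them.
mislabeled = single_byte.dup.force_encoding(Encoding::UTF_16LE)
odd_length = single_byte.bytesize.odd?
looks_like_utf16 = mislabeled.valid_encoding?

# Genuine UTF-16LE uses two bytes per character here and round-trips to UTF-8.
real_utf16 = single_byte.encode(Encoding::UTF_16LE)
round_trip = real_utf16.encode(Encoding::UTF_8)

puts "bytes: #{single_byte.bytesize}, valid as UTF-16LE: #{looks_like_utf16}"
puts "real UTF-16LE bytes: #{real_utf16.bytesize}, round trip: #{round_trip}"
```

If the relabeled string is not valid UTF-16LE, telling the converter that the input is UTF-16 cannot work, whatever Excel uses internally.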

Then I changed 'UTF-16' to 'GB2312' (the default charset of my
system), but it cannot convert the Korean characters correctly. All the
Korean
characters became ???
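
A minimal sketch of why a Chinese code page would lose only the Korean text, assuming Ruby 2.0 or later (UTF-8 source files, String#encode). GBK is used here as a stand-in for the system charset, which is an assumption: it contains Cyrillic letters but no Hangul, so Hangul has to be replaced with a substitute character.

```ruby
korean  = "잠시만"
russian = "Ждите"

# Assumption: GBK approximates the Chinese system code page.
# Characters missing from the target charset are replaced with "?".
korean_via_gbk  = korean.encode("GBK", undef: :replace, replace: "?")
russian_via_gbk = russian.encode("GBK")

# Convert back to UTF-8 to see what survived the trip.
korean_back  = korean_via_gbk.encode("UTF-8")
russian_back = russian_via_gbk.encode("UTF-8")

puts korean_back
puts russian_back
```

Once the text has been through such a charset, the question marks are real "?" characters and no later conversion can bring the Korean back.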

I use Ruby 1.8.6 on WinXP Sp3.

How can I resolve it?

Many thanks,

Nan

Wu_Nan · September 15, 2008, 11:23am

Dear Nan,

after some searching, I found that there is a dedicated encoding for
Korean characters, EUC-KR.
I managed to convert your Korean text from UTF-8 to EUC-KR, write it to
a file and display it correctly in Firefox once
the right encoding is set in the Preferences (EUC-KR in this case, but I
can also display Korean text in UTF-8).

So I think you'll be successful if you make sure you convert from EUC-KR
to UTF-8 for the Korean, and to UTF-8 for everything else.

Best regards,

Axel

Wu_Nan · September 15, 2008, 11:58am

Hello Axel,

Many thanks for your answer,

I just tested it again: I dumped the original text and displayed it as
integers.
I found that all the Korean characters had already become '???' as soon as
they were read out of Excel.
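
A dump of this kind could look like the sketch below (dump_bytes is a hypothetical helper name). Intact UTF-8 Korean shows up as bytes above 127, while text that was already lost shows up as 63, the code for "?".

```ruby
# Hypothetical helper: show each byte of a string as an integer.
def dump_bytes(text)
  text.unpack("C*")
end

intact_korean = "잠"    # one Hangul syllable, three bytes in UTF-8
already_lost  = "???"   # what the cell value looked like after the read

puts dump_bytes(intact_korean).inspect
puts dump_bytes(already_lost).inspect
```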

I attached the test code and the test Excel file. The Excel file contains
only one text.

Do you have any idea about the reason?
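
One possible reason, offered as an assumption rather than something the thread confirms: the win32ole library converts Excel's strings to the system code page before Ruby sees them. If the installed win32ole provides WIN32OLE.codepage= and WIN32OLE::CP_UTF8, asking for UTF-8 up front may avoid the loss. The helper read_cell_utf8 is hypothetical, and it only makes sense on Windows with Excel installed.

```ruby
# Hypothetical helper; only meaningful on Windows with Excel installed.
# path should be an absolute path to the workbook.
def read_cell_utf8(path, row, col)
  require "win32ole"                     # Windows-only library
  WIN32OLE.codepage = WIN32OLE::CP_UTF8  # request UTF-8 instead of the system code page
  excel = WIN32OLE.new("Excel.Application")
  begin
    book = excel.Workbooks.Open(path)
    book.Worksheets(1).Cells(row, col).Value.to_s
  ensure
    excel.Quit                           # always close the Excel instance
  end
end

# Example call on Windows (not run here):
# puts read_cell_utf8("C:\\path\\to\\test.xls", 1, 1)
```

With this setting the returned strings would already be UTF-8, so no Iconv step would be needed afterwards.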


Wu_Nan · September 15, 2008, 12:51pm

Dear Nan,

right now I am not on Windows, so to check whether the problem
is with Windows or with
Ruby, I'd suggest you try the following (which works on Ubuntu with your
data).

Install the parseexcel gem (http://raa.ruby-lang.org/project/parseexcel/).

Check what the following script gives on your test.xls file (shamelessly
adapted from the website mentioned
above).

require "rubygems"
require "parseexcel"
require "iconv"

# your first step is always reading in the file.
# that gives you a workbook-object, which has one or more worksheets,
# just like in Excel you have the possibility of multiple worksheets.
workbook = Spreadsheet::ParseExcel.parse("/home/axel/Desktop/test.xls")

# usually, you want the first worksheet:
worksheet = workbook.worksheet(0)
p worksheet

# now you can either iterate over all rows, skipping the first number of
# rows (in case you know they just contain column headers)
skip = 0
worksheet.each(skip) { |row|
  # a row is actually just an Array of Cells...
  first_cell = row.at(0)
  p 'first'
  p first_cell

  # how you get data out of the cell depends on what datatype you
  # expect:
  # if you expect a String, you can pass an encoding and (iconv
  # required) the content of the cell will be converted.
  str = row.at(0).to_s('EUC-KR')
  p str
  f = File.open("textexcel.html", "w")
  f.puts str
  f.close
}

I could open the file textexcel.html and the Korean characters displayed
correctly (now in EUC-KR, but you can
convert these to UTF-8, at least on Ubuntu).
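
A sketch of that last conversion at the file level, assuming Ruby 1.9.3 or later. The output name textexcel_utf8.html is hypothetical, and a temporary directory with a short sample string stands in for the file written by the script above.

```ruby
require "tmpdir"

utf8_text = Dir.mktmpdir do |dir|
  euc_path  = File.join(dir, "textexcel.html")
  utf8_path = File.join(dir, "textexcel_utf8.html")  # hypothetical output name

  # Stand-in for the EUC-KR file produced by the script.
  File.binwrite(euc_path, "잠시만".encode("EUC-KR"))

  # "EUC-KR:UTF-8" means: EUC-KR on disk, UTF-8 inside Ruby.
  text = File.read(euc_path, encoding: "EUC-KR:UTF-8")
  File.write(utf8_path, text, encoding: "UTF-8")

  # Read the new file back to confirm it holds UTF-8 text.
  File.read(utf8_path, encoding: "UTF-8")
end

puts utf8_text
```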

Best regards,

Axel

Wu_Nan · September 15, 2008, 1:52pm

Hello Axel,

Thank you very much.

I’ll try it.

BR
Nan
