so maybe it’s optimizing it and when it doesn’t “have to be” UTF-8 it is
leaving it as ASCII?
I’m not clear what you mean by an example other than what I put in the
original note.
I think I’m going to open a bug report – it might not be a bug but I
sure am confused. The “Pick Axe” book describes a third argument but I
can’t get that to work either. “ri” for Ruby 1.9.1 does not describe
the third argument at all – but it does seem to exist at least.
It appears as if, as you pointed out, if the input string happens to be
ASCII, then the regexp encoding is ascii and there doesn’t seem to be
anything you can do about it.
I’m testing on 1.9.1 p243.
But, due to another discussion thread, I think I want to be in 8 bit
binary anyway in my case. I’m not 100% positive my input is UTF-8. It’s
supposed to be but I can’t really trust it.
If I later try to use it on strings of type UTF-8, it can throw an
exception.
I’m not clear what you mean by an example other than what I put in the
original note.
Do you have a small example (like your original) that throws an
exception where you “use it on strings later of type UTF-8” and it
throws an exception?
I think I’m going to open a bug report – it might not be a bug but I
sure am confused.
It’s not a bug(*), and it sure is confusing. My own attempt to document
Ruby 1.9’s encoding rules, which is woefully incomplete but covers about
200 different cases, is at
What you’ve observed is described in section 3.3.
Basically, a Regexp which contains only ASCII characters is given an
encoding of US-ASCII regardless of the original string’s encoding (this
is different to Strings, which might have an encoding of say UTF-8 but
have the ascii_only? property true if they contain only ASCII
characters).
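A minimal sketch of that difference, assuming Ruby 1.9 or later (the variable names are only illustrative):

```ruby
# An ASCII-only pattern built from a string tagged as UTF-8.
source = "abc".dup.force_encoding(Encoding::UTF_8)
regexp = Regexp.new(source)

# The String keeps its UTF-8 tag and merely reports that it is ASCII-only.
puts source.encoding     # UTF-8
puts source.ascii_only?  # true

# The Regexp is tagged US-ASCII, and its encoding is not fixed.
puts regexp.encoding         # US-ASCII
puts regexp.fixed_encoding?  # false

# A pattern containing a non-ASCII character keeps the source encoding.
accented = Regexp.new("caf\u00e9")
puts accented.encoding         # UTF-8
puts accented.fixed_encoding?  # true
```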
However there is a hidden “fixed_encoding” property you can set on a
Regexp:
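One way to see the property, sketched here with the /u literal flag and the Regexp::FIXEDENCODING option, assuming Ruby 1.9 or later (variable names are illustrative):

```ruby
# The /u flag fixes an ASCII-only literal to UTF-8.
fixed_literal = /abc/u
puts fixed_literal.encoding         # UTF-8
puts fixed_literal.fixed_encoding?  # true

# Regexp.new accepts the same flag as an integer option.
source = "abc".dup.force_encoding(Encoding::UTF_8)
fixed_new = Regexp.new(source, Regexp::FIXEDENCODING)
puts fixed_new.encoding         # UTF-8
puts fixed_new.fixed_encoding?  # true

# A fixed-encoding Regexp still matches ASCII-only strings in other
# encodings, but raises on non-ASCII strings in an incompatible one.
ascii_eucjp = "abc".dup.force_encoding("EUC-JP")
puts(fixed_literal =~ ascii_eucjp)  # 0

begin
  fixed_literal =~ "\xa1\xa2".dup.force_encoding("EUC-JP")
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end
```

This is the exception to watch for when a fixed UTF-8 Regexp meets strings that are not UTF-8.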
Thanks. I’m doing more experimenting and I’m also looking at the source
code. I need to drag down the latest. I’m looking at 1.9.1 p243 right
now.
Regexp.new has a third optional argument – it is sorta described in the
Pick Axe book but the code looks wrong. It can be either ‘n’ or ‘xN’
where x can be anything. Perhaps that is gone in the latest code.
But the “fixed encoding” is a key part of the puzzle I was missing.
Also, David, I had not bumped into the ENC_UTF8 constant yet. There are
quite a few constants, and one of them (the 16 that David also pointed
out) is a flag to make the encoding “fixed”.
The latest code that David posted answers exactly what my original
question was. Thanks!
16 is just Regexp::FIXEDENCODING
irb(main):001:0> Regexp::FIXEDENCODING
=> 16
In the 1.9.2 I have here (r24186, 2009-07-18) there is no
Regexp::ENC_UTF8, so it must be relatively new.
irb(main):002:0> Regexp::ENC_UTF8
NameError: uninitialized constant Regexp::ENC_UTF8
from (irb):2
from /usr/local/bin/irb192:12:in `’
irb(main):003:0> Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE, :FIXEDENCODING]
As for the third arg to Regexp.new, I have no idea. Documentation is not
Ruby’s strong point at the best of times, but it’s nonexistent for the
encoding stuff.
But if I use 16 rather than FIXEDENCODING it works as in the examples
in this thread.
Does anyone know what’s going on here? I used to have a pretty good
handle on encodings. This Ruby encoding stuff is something I’ve been
struggling with for 6 months and I think all that I’ve managed to do
is completely corrupt my understanding of encoding. It’s starting to
look like magic. I know that a bunch of things changed between
1.9.1p243 and 1.9.1p376, but, since I think that what I ‘know’ about
encoding might be completely delusional at this point, I suppose I
don’t really know.
It might help you to know that your constants are the same as mine. I
don’t know how David got his.
Unfortunately, I still have not gotten back to my investigation of this.
Looking at the code in re.c helped me a bit.
Aside from that, I think we are all struggling with this. I’m hoping
that there are a few “bugs” in the code… i.e. Mat has a clear idea of
how things should work but there are just a few mistakes that really
hamper our understanding.
If I later try to use it on strings of type UTF-8, it can throw an
exception.
I’m not clear what you mean by an example other than what I put in the
original note.
Do you have a small example (like your original) that throws an
exception where you “use it on strings later of type UTF-8” and it
throws an exception?
No I don’t. I think that I might have had a string that was not
utf-8. I was fetching strings from a file and just doing a
force_encoding because they were supposed to be utf-8 but maybe they were
not.
I’m not sure. Let me see if I can make an example. My trivial examples
so far don’t throw an exception.
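A sketch of how that situation can arise, assuming a hypothetical input line that was really Latin-1. force_encoding only changes the tag, so checking valid_encoding? afterwards shows whether the bytes really are UTF-8:

```ruby
# Hypothetical input: "café" as Latin-1 bytes, read as raw binary.
raw = "caf\xE9".dup.force_encoding(Encoding::BINARY)

# force_encoding relabels the bytes without checking or converting them.
tagged = raw.dup.force_encoding(Encoding::UTF_8)
puts tagged.encoding         # UTF-8
puts tagged.valid_encoding?  # false

# Matching a Regexp against the mislabelled string is where the
# exception shows up (an ArgumentError about an invalid byte sequence).
begin
  tagged =~ /caf/
rescue ArgumentError => e
  puts "#{e.class}: #{e.message}"
end

# Properly encoded UTF-8 passes the same check.
good = "caf\u00e9"
puts good.valid_encoding?  # true
```

Checking valid_encoding? right after force_encoding is one way to catch bad input at the point where it is read rather than at the first match.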
This is exactly the situation I worried about when Matz proposed the
“all encodings” view of Ruby 1.9. Even though many applications won’t
run into this, any that try to deal with >1 encoding at a time will
have a clusterfuck of a time making sure everything fits together. And
this is to say nothing of the implementation effort required, which
still isn’t all there in JRuby (and won’t be until 1.6 or later).
I didn’t read this whole thread, since there’s a lot of “it’s a
bug/it’s not a bug” exploration, but if there’s something we need to
fix in JRuby, please do report it (and try to help fix it, too :)).
re: string1 + string2 + string3 actually working without fear…
One thing that might help would be to set the default encoding, then all
three strings would (might ?) have the same encoding (?)
That depends where the strings came from. If they were returned by a
library function (either Ruby core or 3rd party) you won’t know what
encoding they have unless it is documented what the encoding is or how
it is chosen, and it almost never is.
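A small sketch of why the origin of each string matters, assuming two hypothetical library results in different encodings. Concatenation only fails when both sides contain non-ASCII characters in incompatible encodings:

```ruby
# Hypothetical return values from different libraries.
utf8_name   = "caf\u00e9"                                         # UTF-8
latin1_name = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)  # ISO-8859-1
ascii_label = "name: ".dup.force_encoding(Encoding::US_ASCII)     # US-ASCII

# ASCII-only strings combine with either of the others.
puts((ascii_label + utf8_name).encoding)    # UTF-8
puts((ascii_label + latin1_name).encoding)  # ISO-8859-1

# Two non-ASCII strings in different encodings do not combine.
begin
  utf8_name + latin1_name
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end

# Transcoding one side explicitly makes the concatenation safe.
combined = utf8_name + latin1_name.encode(Encoding::UTF_8)
puts combined.encoding  # UTF-8
```

The transcoding step only helps when the encoding tag on the incoming string is actually correct, which is the part that is hard to know.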
Equally, if you are writing a library for use by other people, then you
really should not touch global state such as Encoding.default_external.
So you are left with Ruby guessing encodings and forcing them if it
guesses wrongly, e.g.
$ ruby19 -e 'puts %x{cat /bin/sh}.encoding'
UTF-8
Of course, if you’re saying that your application handles all strings in
the same encoding, then this whole business of tagging every individual string object with its own encoding is a waste of time and
effort, and is just something which you have to fight against.
But we’re flogging a dead horse here. I hate this stuff; other people
seem to love it.