How does anybody do this? 🡸 That is my question.
My idea to find data in the bodies of mail addressed to a specific account (for a purpose common to the sender and the recipient) makes me learn stuff that I did not want to know in the first place.
I could impose that all the mail-clients involved be configured to send plain-text only. That would be the end of the story and lead to the most idyllic communication environment. But alas …
I suppose now (for the time being) that
- Mail-clients send HTML only
- Mail-clients send multipart messages (mixed/alternative/related)
- multipart/alternative and multipart/related are mixed-up by some clients
- OVH[dot]com exists and does stuff that you have to address if something useful is expected from there (not otherwise)
- The order of content-type and content-transfer-encoding at the beginning of a mime-part is not fixed (and line-breaks have to be expected anywhere).
- Each individual mime-part may declare its own boundary, and this must be taken into consideration.
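To illustrate the point about header order and line-breaks, here is a minimal sketch that uses only the Ruby core library. The method name `parse_part_headers` is hypothetical and not part of my class. It assumes that the string handed over starts with the first header line of a mime-part and that an empty line separates the headers from the content.

```ruby
# Hypothetical helper, not part of the original class.
# Splits one mime-part into its header block and its content, unfolds
# folded header lines and returns the headers as a Hash with lower-cased
# names, so that their order no longer matters.
def parse_part_headers(raw_part)
  # The headers end at the first empty line (LF or CRLF line endings).
  head, content = raw_part.split(/\r?\n\r?\n/, 2)
  # A line that starts with a space or a tab continues the previous header.
  unfolded = head.to_s.gsub(/\r?\n[ \t]+/, ' ')
  headers = {}
  unfolded.each_line do |line|
    name, value = line.chomp.split(':', 2)
    headers[name.strip.downcase] = value.strip if value
  end
  [headers, content.to_s]
end

# Content-Type first, folded over two lines, CRLF line endings.
part_a = "Content-Type: text/plain;\r\n charset=\"utf-8\"\r\n" \
         "Content-Transfer-Encoding: quoted-printable\r\n\r\nHello"
# Content-Transfer-Encoding first, LF line endings.
part_b = "Content-Transfer-Encoding: base64\nContent-Type: text/html\n\nSGk="

headers_a, content_a = parse_part_headers(part_a)
headers_b, content_b = parse_part_headers(part_b)
```

Both parts end up with the same two keys, `content-type` and `content-transfer-encoding`, whatever the order in which the client wrote them.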
What I do now in an attempt to progress, as an outright solution is not in sight:
- I match content-type and content-transfer-encoding in every mime-part
- I ignore boundaries altogether and split up the mail-body at the content-type declarations
- If there is nothing coming out of this, I look at the Content-Type and Content-Transfer-Encoding mail-headers, supposing that this is not a multipart message.
- Finding something definitive (text/plain or text/html), I decode either the plain text or the result of Html2Text::convert, provided I have an encoding (quoted-printable or base64 for the time being).
- And yes, I almost forgot: there is a force_encoding('utf-8') somewhere; but that is just the worm-hole to a full-blown alternative universe, so do not put too much weight on it.
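The decoding step from the list above can be sketched with nothing but the core library. The names `find_encoding` and `decode_content` are hypothetical, and the sketch assumes that the charset declared by the part is known to the caller. `String#unpack1` with the directives `M` and `m` decodes quoted-printable and base64 respectively.

```ruby
# Hypothetical helpers, not taken from the original class.

# Look up a charset name; fall back to UTF-8 if Ruby does not know it.
def find_encoding(name)
  Encoding.find(name.to_s)
rescue ArgumentError
  Encoding::UTF_8
end

# Decode the content of one part according to its
# Content-Transfer-Encoding and return it as valid UTF-8.
def decode_content(content, transfer_encoding, charset = 'UTF-8')
  decoded =
    case transfer_encoding.to_s.strip.downcase
    when 'quoted-printable' then content.unpack1('M')
    when 'base64'           then content.unpack1('m')
    else content.dup # 7bit, 8bit, binary or unknown: bytes stay as they are
    end
  # Label the bytes with the declared charset, then transcode to UTF-8.
  # Bytes that cannot be converted are replaced instead of raising.
  decoded.force_encoding(find_encoding(charset))
         .encode('UTF-8', invalid: :replace, undef: :replace)
         .scrub
end

latin = decode_content('caf=E9', 'quoted-printable', 'ISO-8859-1')
soft  = decode_content("Hello =\nWorld", 'Quoted-Printable')
```

The difference to a bare force_encoding('utf-8') is that the bytes are first labelled with the charset the part declares and only then converted, so a Latin-1 mail does not end up as invalid UTF-8.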
Please comment as you like, but do not tell me that this is how it is supposed to work. This is all bull, and it is so because I am confronted with bull. I can accept, however, that the mail-system is a mess, as this has become quite obvious to me.
UPDATE.
You have to do it all:
- match content-type & content-transfer-encoding in every mail-part, as their order is not determined.
- read the content-type & content-transfer-encoding from the mail-header.
- read the first mime-part and store the next boundary.
- split the remaining mail-body at each individual boundary.
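The last two steps, storing the next boundary and splitting at each individual boundary, can be sketched as a small recursion. This is not the code of my class. The names `boundary_from` and `leaf_parts` are hypothetical, only the core library is used, and the sketch assumes that every part carries at least one header line and that delimiter lines start with `--` followed by the boundary.

```ruby
# Hypothetical helpers, not the code of the original class.

# Extract the boundary from a Content-Type value or a whole header block.
# Accepts boundary="abc" as well as boundary=abc.
def boundary_from(header_text)
  match = header_text.to_s.match(/boundary\s*=\s*(?:"([^"]+)"|([^;\s]+))/i)
  match && (match[1] || match[2])
end

# Return the leaf parts of a multipart body as [header_text, content]
# pairs. Nested multipart parts are split at their own boundary.
def leaf_parts(body, boundary)
  escaped = Regexp.escape(boundary)
  # Drop the epilogue after the closing delimiter, then the preamble.
  inside = body.split(/^--#{escaped}--[ \t]*\r?$/, 2).first.to_s
  chunks = inside.split(/^--#{escaped}[ \t]*\r?\n/).drop(1)
  chunks.flat_map do |chunk|
    # The first empty line separates the headers from the content.
    head, content = chunk.split(/\r?\n\r?\n/, 2)
    head = head.to_s
    content = content.to_s
    if head.match?(/^content-type:\s*multipart\//i) && (nested = boundary_from(head))
      leaf_parts(content, nested)
    else
      # The line-break before the next delimiter belongs to the delimiter.
      [[head, content.chomp]]
    end
  end
end

body = [
  'This preamble is ignored.',
  '--outer',
  'Content-Type: multipart/alternative;',
  ' boundary="inner"',
  '',
  '--inner',
  'Content-Type: text/plain',
  '',
  'Hello',
  '--inner',
  'Content-Transfer-Encoding: base64',
  'Content-Type: text/html',
  '',
  'PHA+SGVsbG88L3A+',
  '--inner--',
  '--outer',
  'Content-Type: application/octet-stream',
  '',
  'AAEC',
  '--outer--',
  'This epilogue is ignored as well.'
].join("\r\n")

parts = leaf_parts(body, boundary_from('multipart/mixed; boundary=outer'))
parts.each { |head, content| puts "#{head.inspect} => #{content.inspect}" }
```

Anchoring the delimiter at the start of a line keeps a boundary from being found in the middle of the content, and it keeps two boundaries apart when one of them is a prefix of the other.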
Mail is the simplest medium to secure with point-to-point and message encryption, thus avoiding any kind of centralized structure.
But they have deliberately broken it. Now The Stupid use http and believe that organizations will protect them. The reasons are obvious: obstruct, lie and praise facilitations that are none.
The latest nightmare-code.
These fragments belong to a class which parses eml-files. The static methods below take a Mail-object (from the Mail gem) as argument. The objective is to retrieve one plain-text version of the mail-body.
In the constants, CT means “Content-Type” and CTR “Content-Transfer-Encoding”.
I am posting this, because I do not like it but am currently unable to do better.
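The constants themselves are not part of the fragments. For readers who want to follow along, the block below shows values that would fit the way the constants are used. All of them are assumptions made for illustration. In particular, QP and B64 are stand-ins built on `String#unpack1`, and the two regular expressions simply match one declaration line.

```ruby
# Assumed values. The original class may define these differently.
CTYPE       = 'Content-Type'
CTRENCODING = 'Content-Transfer-Encoding'
CT_PLAIN    = 'text/plain'
CT_HTML     = 'text/html'
BPREFIX     = 'boundary='
LN          = "\n"
UTF8        = 'utf-8'

# Match one whole declaration line inside a mime-part, whatever its case.
CT_REGEXP   = /^content-type:[^\r\n]*/i
CTRE_REGEXP = /^content-transfer-encoding:[^\r\n]*/i

# Stand-ins for the two decoders, built on the core library.
module QP
  def self.decode(string)
    string.unpack1('M')
  end
end

module B64
  def self.decode(string)
    string.unpack1('m')
  end
end

sample = "Content-Transfer-Encoding: base64\r\n" \
         "Content-Type: text/html; charset=utf-8\r\n\r\nSGk="
type_line     = sample[CT_REGEXP]
encoding_line = sample[CTRE_REGEXP]
```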
# This sub-class represents a part of the message, normally a mime-part with
# content-type, content-transfer-encoding and content.
class MailPart
  attr_reader :encoding, :text

  def initialize(type, encoding, text)
    @type = type
    @encoding = encoding
    @text = text
  end

  # The charset from the content-type declaration or nil. The charset is
  # searched by name, as it need not be the second declaration.
  def charset
    return nil unless @type
    declaration = @type.split(';').map(&:strip).detect { |d| d.downcase.start_with?('charset=') }
    declaration && declaration.split('=', 2)[1].delete('"').strip
  end

  # The media type without any parameters, e.g. text/plain, or nil.
  # Media types are not case-sensitive, hence the downcase.
  def type
    return nil unless @type
    declarations = @type.split(';')
    declarations.empty? ? nil : declarations[0].strip.downcase
  end
end
# end MailPart
# Return the plain text of 1 incoming mail. The plain text may be found after
# converting html to text, if no text/plain part is found in the mail.
# Quoted-Printable and Base64 encodings are handled and utf-8 is enforced, if
# necessary.
def self::plaintext(mail)
  debug('----- start plaintext() -----')
  text = nil
  type = nil
  encoding = nil
  # mail_parts() , see further below for the code of that method.
  parts = mail_parts(mail)
  debug('mail_parts() returns ' << (parts ? parts.size.to_s : 'no') << ' part' << (parts && parts.size > 1 ? 's' : ''))
  if parts && !parts.empty?
    # prefer text/plain, fall back to text/html
    part = parts.detect { |prt| prt.type == CT_PLAIN } ||
           parts.detect { |prt| prt.type == CT_HTML }
    if part
      debug('found ' << part.type)
      type = part.type
      text = part.text
      encoding = part.encoding
    end
    debug('part is encoded ' << (encoding ? encoding.to_s : 'NIL'))
  else
    debug('No mime-parts identified')
    # All decoding is based on the mail-headers.
    ctype = mail.headers.detect { |h| h.name == CTYPE }
    type = ctype.value.split(';').map(&:strip).detect { |v| v.start_with?('text') } if ctype
    # TODO: Verify that there cannot be other things in the Content-Transfer-Encoding header.
    encoding = mail.headers.detect { |h| h.name == CTRENCODING }.to_s.split(';')[0]
    debug("\t Mail #{CTRENCODING}: " << (encoding ? encoding : 'NIL'))
    text = mail.body.to_s
  end
  # Neither text/plain nor text/html was found: there is nothing to return.
  return nil unless text && [CT_PLAIN, CT_HTML].include?(type)
  decoder = case encoding.to_s.strip.downcase
            when 'quoted-printable' then QP
            when 'base64' then B64
            # other encodings stay intact and must be
            # honored by the caller.
            end
  if decoder
    debug("text before decoding #{encoding}\n----------------\n#{text}")
    # Decode the text as a whole (this assumes that QP and B64 accept
    # multi-line input). Decoding line by line turns soft line-breaks into
    # hard ones and inserts line-breaks into Base64 content.
    text = decoder.decode(text)
    debug("-------after----\n#{text}")
  end
  text = text.dup.force_encoding(UTF8)
  # Convert HTML only after decoding, never before.
  text = Html2Text.convert(text) if type == CT_HTML
  text
end
# Retrieve any mail-parts. Returns all parts or nil.
# This looks complicated because it is.
def self::mail_parts(mail)
  header = mail.headers.detect { |h| h.name == CTYPE }
  return nil unless header
  content_type = header.value
  debug 'content-type from header is ' << content_type
  # separate values in the content-type header
  content_type_declarations = content_type.split(';')
  is_multipart = content_type_declarations.any? { |decl| decl.strip.start_with?('multipart') }
  return nil unless is_multipart
  # -------------> recurring actions
  # 1 construct the boundary-value from the boundary-declaration
  boundary_value = lambda { |b| b.strip.delete_prefix(BPREFIX).delete('"').prepend('--') }
  # 2 get the encoding from a Content-Transfer-Encoding declaration
  encoding_value = lambda { |e| e.split(':')[1].strip }
  # 3 split a mail-body at a boundary.
  split_body = lambda { |bdy, bndry| bdy.split(bndry) }
  # 4 get the content-type from a Content-Type declaration
  type_value = lambda { |t| t.split(':')[1].strip }
  # 5 find a boundary-declaration in a header, folded or not
  find_boundary = lambda { |h| h.split(/[;\r\n]/).detect { |decl| decl.strip.start_with?(BPREFIX) } }
  # <---------------
  # find a boundary in the mail-header
  boundary = find_boundary.call(content_type)
  boundary = boundary_value.call(boundary) if boundary
  debug 'boundary from header is ' << (boundary ? boundary : 'NIL')
  return nil unless boundary
  # retrieve all parts
  parts = []
  # the mail-body as string is used repetitively
  body = mail.body.to_s
  # Boundaries which remain to be handled. Nested multiparts add their own.
  boundaries = [boundary]
  handled = []
  until boundaries.empty?
    boundary = boundaries.shift
    next if handled.include?(boundary)
    handled << boundary
    # What precedes the first boundary is not a part.
    split_body.call(body, boundary).drop(1).each do |part|
      # The remainder after a closing boundary starts with '--'.
      next if part.start_with?('--') || part.strip.empty?
      # Drop the rest of the boundary-line, then find the empty line
      # which separates the declarations from the content.
      part_array = part.sub(/\A[ \t]*\r?\n/, '').split(LN)
      empty_index = part_array.index { |l| l.strip.empty? }
      head = (empty_index ? part_array.take(empty_index) : part_array).join(LN)
      # keep what follows the empty line, or all
      text = empty_index ? part_array.drop(empty_index + 1).join(LN) : part
      # a nested multipart declares its own boundary
      p_boundary = find_boundary.call(head)
      if p_boundary
        boundaries << boundary_value.call(p_boundary)
        debug 'boundary from part ' << boundaries.last
        next
      end
      type = head.match(CT_REGEXP)
      type = type_value.call(type[0]) if type
      encoding = head.match(CTRE_REGEXP)
      encoding = encoding_value.call(encoding[0]) if encoding
      parts << MailPart.new(type, encoding, text)
      debug 'latest part ' << parts.last.inspect
    end
  end
  parts.empty? ? nil : parts
end
Ω End of post.