Reading mail messages, finding plaintext, suffering Mime-parts

bleak_gravel · August 29, 2024, 8:37am

How does anybody do this? 🡸 That is my question.

My idea of finding data in the bodies of mail addressed to a specific account (for a purpose common to the sender and the recipient) makes me learn stuff that I did not want to know in the first place.

I could require that all the mail-clients involved be configured to send plain-text only. That would be the end of the story and lead to the most idyllic communication environment. But alas …

I suppose now (for the time being) that

Mail-clients send HTML only
Mail-clients send multipart messages (mixed/alternative/related)
multipart/alternative and multipart/related are mixed-up by some clients
OVH[dot]com exists and does stuff that you have to address if something useful is expected from there (not otherwise)
The order of Content-Type and Content-Transfer-Encoding at the beginning of a mime-part is not fixed (and line-breaks have to be expected anywhere).
A mime-part may declare its own boundary, and this must be taken into consideration.
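
The point about header order and line-breaks can be sketched with the standard library alone. This is only an illustration: parse_header_block is a hypothetical helper, not something the Mail gem provides, and it assumes that a line starting with a space or tab continues the previous header line.

```ruby
# Hypothetical helper: turn the header block of a message or mime-part into
# a Hash keyed by lower-cased field names, so the order no longer matters.
def parse_header_block(block)
  # Unfold: a line-break followed by a space or tab continues the previous line.
  unfolded = block.gsub(/\r?\n[ \t]+/, ' ')
  unfolded.split(/\r?\n/).each_with_object({}) do |line, fields|
    name, value = line.split(':', 2)
    next if value.nil? # not a "Name: value" line
    fields[name.strip.downcase] = value.strip
  end
end

sample = "Content-Transfer-Encoding: quoted-printable\r\n" \
         "Content-Type: text/plain;\r\n" \
         "\tcharset=\"utf-8\"\r\n"
fields = parse_header_block(sample)
puts fields['content-type']               # text/plain; charset="utf-8"
puts fields['content-transfer-encoding']  # quoted-printable
```

Looking fields up by name in a Hash sidesteps the ordering question, and unfolding first means a folded Content-Type is seen as one value.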

What I do now in an attempt to progress – as a definitive solution is not in sight:

I match content-type and content-transfer-encoding in every mime-part
I ignore boundaries altogether and split up the mail-body at the content-type declarations
If there is nothing coming out of this, I look at the Content-Type and Content-Transfer-Encoding mail-headers, supposing that this is not a multipart message.
Finding something definitive (text/plain or text/html), I decode either the plaintext or the result of Html2Text::convert, provided I have an encoding (quoted-printable or base64 for the time being).
and yes – I almost forgot – there is a force_encoding('utf-8') somewhere; but that is just the worm-hole to a full-blown alternative universe, so do not put too much weight on it.
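
For the decoding step, a minimal sketch that uses only the Ruby standard library: String#unpack1 undoes quoted-printable with 'M' and base64 with 'm'. decode_body is a hypothetical helper (it is not part of the Mail gem or Html2Text), and the charset default of utf-8 is an assumption. It decodes the whole body at once and only then converts it to UTF-8.

```ruby
# Hypothetical helper: undo the transfer-encoding of a complete part body,
# then convert the bytes from the declared charset to UTF-8.
def decode_body(raw, transfer_encoding, charset = 'utf-8')
  bytes =
    case transfer_encoding.to_s.strip.downcase
    when 'quoted-printable' then raw.unpack1('M') # also joins soft line-breaks
    when 'base64'           then raw.unpack1('m')
    else raw.dup # 7bit, 8bit, binary: nothing to undo
    end

  source =
    begin
      Encoding.find(charset.to_s.strip.empty? ? 'utf-8' : charset.to_s.strip)
    rescue ArgumentError
      Encoding::UTF_8 # unknown charset name: assume UTF-8
    end

  # Tag the bytes with their real charset, transcode, and replace whatever
  # is still invalid afterwards.
  bytes.force_encoding(source)
       .encode('UTF-8', invalid: :replace, undef: :replace)
       .scrub
end

puts decode_body("Gr=C3=BC=C3=9Fe =\naus K=C3=B6ln", 'quoted-printable')
# prints: Grüße aus Köln
puts decode_body('K=F6ln', 'quoted-printable', 'iso-8859-1')
# prints: Köln
```

With this ordering, force_encoding only labels the bytes with the charset the part declared, and encode does the actual conversion. An HTML part would have to be decoded this way before it is handed to an HTML-to-text converter.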

Please comment as you like, but do not tell me that this is how it is supposed to work. This is all bull, and it is so because I am confronted with bull. I can accept, however, that the mail-system is a mess, as this has become quite obvious to me.

bleak_gravel · August 30, 2024, 12:22pm

UPDATE.
You have to do it all:

match content-type & content-transfer-encoding in every mail-part, as their order is not determined.
read the content-type & content-transfer-encoding from the mail-header.
read the first mime-part and store the next boundary.
split the remaining mail-body at each individual boundary.
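
The four steps above can be sketched as one recursive routine. This is a rough outline using only the standard library, not a full MIME parser: Leaf, split_entity and leaf_parts are hypothetical names, and the sketch assumes that a delimiter line is "--" plus the boundary and that the closing line has a further "--" appended.

```ruby
# Hypothetical container for a part that is not itself a multipart.
Leaf = Struct.new(:content_type, :transfer_encoding, :body)

# Split "headers, empty line, body"; returns [fields, body].
def split_entity(raw)
  head, body =
    if raw.start_with?("\n", "\r\n") # no headers at all
      ['', raw.sub(/\A\r?\n/, '')]
    else
      raw.split(/\r?\n\r?\n/, 2)
    end
  fields = {}
  # Unfold continuation lines, then index the fields by lower-cased name.
  head.to_s.gsub(/\r?\n[ \t]+/, ' ').split(/\r?\n/).each do |line|
    name, value = line.split(':', 2)
    fields[name.strip.downcase] = value.strip if value
  end
  [fields, body.to_s]
end

# Return all leaf parts, descending into nested multiparts, each of which
# declares its own boundary in its own Content-Type.
def leaf_parts(raw)
  fields, body = split_entity(raw)
  ctype = fields.fetch('content-type', 'text/plain')
  boundary = ctype[/boundary\s*=\s*"?([^";]+)"?/i, 1]&.strip
  unless boundary && ctype.strip.downcase.start_with?('multipart/')
    return [Leaf.new(ctype, fields['content-transfer-encoding'], body)]
  end

  chunks = []
  current = nil # stays nil while we are still in the preamble
  body.each_line do |line|
    marker = line.rstrip
    break if marker == "--#{boundary}--" # closing delimiter, skip epilogue
    if marker == "--#{boundary}"
      current = +''
      chunks << current
    elsif current
      current << line
    end
  end
  chunks.flat_map { |chunk| leaf_parts(chunk) }
end

message = "Content-Type: multipart/alternative;\n boundary=\"b\"\n\n" \
          "--b\nContent-Transfer-Encoding: base64\nContent-Type: text/plain\n\nSGk=\n" \
          "--b\nContent-Type: text/html\n\n<b>Hi</b>\n--b--\n"
leaf_parts(message).each { |part| puts part.content_type }
```

Walking the body line by line avoids String#split on the bare boundary string, which would also cut at the closing delimiter and inside any line that merely contains the boundary text.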

Mail is the simplest medium to secure with point-to-point and message encryption, thus avoiding any kind of centralized structure.

But they have deliberately broken it. Now The Stupid use http and believe that organizations will protect them. The reasons are obvious: obstruct, lie and praise conveniences that are none.

bleak_gravel · September 3, 2024, 5:49am

The latest nightmare-code.
These fragments belong to a class which parses eml-files. The static methods below take a Mail-object (from the Mail gem) as argument. The objective is to retrieve one plain-text version of the mail-body.

In the constants, CT means “Content-Type” and CTR “Content-Transfer-Encoding”.

I am posting this, because I do not like it but am currently unable to do better.

# This sub-class represents a part of the message, normally a mime-part with 
# content-type, content-transfer-encoding and content.
  
  class MailPart
    def initialize(type, encoding, text)
      @type = type
      @encoding = encoding
      @text = text
    end

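    def charset
      return nil unless @type
      # charset can be any of the parameters, not necessarily the first one
      declaration = @type.split(';').detect {|d| d.strip.downcase.start_with?('charset')}
      return declaration ? declaration.split('=', 2)[1].to_s.strip.delete('"') : nil
    end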

    def type
      declarations = @type.split(';') if @type
      if declarations && !declarations.empty?
        return declarations[0].strip
      else
        return nil
      end
    end

    attr_reader :encoding, :text
  end
  # end MailPart

  # return plain text of 1 incoming mail. The plain text may be found after
  # converting html to text, if no text/plain part is found in the mail.
  # Quoted-Printable and Base64 encodings are handled and utf-8 is enforced, if
  # necessary.
  def self::plaintext(mail)
    debug('----- start plaintext() -----')
    text = nil
    encoding = nil
    # mail_parts(), see further below for the code of that method.
    parts = mail_parts(mail)
    debug('mail_parts() returns ' << (parts ? parts.size.to_s : 'no') << ' part' << (parts && parts.size > 1 ? 's' : '') )  
    if parts && !parts.empty?
      part = parts.detect {|p| p if p.type == CT_PLAIN}
      if part 
        debug 'found ' << CT_PLAIN
        text = part.text
      else
        part = parts.detect {|p| p if p.type == CT_HTML}
        text = Html2Text.convert(part.text) if part
        debug('converted from ' << CT_HTML << "\n ---------- \n" << (text ? text : ' NIL ') )
      end
      encoding = part.encoding if part
      debug('part is encoded ' << (encoding ? encoding.to_s : 'NIL') ) 
    else
      debug('No mime-parts identified')
      # All decoding is based on the mail-headers.
      # strip each declaration, so that the comparison with CT_PLAIN / CT_HTML below can succeed
      type = mail.headers.detect{|h| h.name == CTYPE}.value.split(';').map(&:strip).detect{|v| v.start_with?('text')}
      # TODO: Verify that there cannot be other things in the Content-Transfer-Encoding header.
      encoding = mail.headers.detect{|h| h if h.name == CTRENCODING}.to_s.split(';')[0]
      debug("\t Mail #{CTRENCODING}: " << (encoding ? encoding : 'NIL') )
      # decode HTML if need be
      text = mail.body.to_s if type == CT_PLAIN
      text = Html2Text.convert(mail.body.to_s.force_encoding(UTF8)) if type == CT_HTML

    end
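    # nothing usable was found (neither text/plain nor text/html)
    return nil unless text

    if encoding 
      debug('mail body is encoded ' << encoding.to_s) 
      decoder = nil
      # header values are case-insensitive
      case encoding.to_s.strip.downcase
      when 'quoted-printable'
        decoder = QP
      when 'base64'
        decoder = B64
      # other encodings stay intact and must be
      # honored by the caller.
      end
      debug('text before decoding ' << encoding)
      debug("----------------\n" << text)
      if decoder == QP
        # A '=' at the end of a line is a soft line-break: the line continues
        # on the next one. Join such lines first, then decode line by line.
        text = text.gsub(/=\r?\n/, '').split(LN).collect {|l| decoder.decode(l) }.join(LN)
      elsif decoder == B64
        # Line-breaks carry no meaning in base64; decode the block as a whole.
        text = decoder.decode(text.split(LN).collect {|l| l.strip }.join)
      end
      #text = text.force_encoding('utf-8')

      debug("-------after----\n" << text)
      return text
    else
      return text.force_encoding('utf-8')
    end
  end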

  # Retrieve any mail-parts. Returns all parts or nil.
  # This looks complicated because it is.
  def self::mail_parts(mail)
    content_type = mail.headers.detect{|h| h.name == CTYPE}.value
    debug 'content-type from header is ' << content_type
    # separate values in the content-type header
    content_type_declarations = content_type.split(';')
    is_multipart = content_type_declarations.any?{|decl| decl.strip.start_with?('multipart') }
    if(is_multipart)

      # -------------> recurring actions

      # 1 construct the boundary-value from the boundary-declaration
      boundary_value = lambda {|b| b.strip.delete_prefix(BPREFIX).delete("\"").prepend('--') }
      # 2 get the encoding from a Content-Transfer-Encoding declaration
      encoding_value = lambda {|e| e.split(':')[1].strip }
      # 3 split a mail-body at a boundary.
      split_body = lambda {|bdy, bndry| bdy.split(bndry)}
      # 4 get the content-type from a Content-Type declaration
      type_value = lambda {|t| t.split(':')[1].strip}

      # <---------------

      # find a boundary in the mail-header
      boundary = content_type_declarations.detect{|decl| decl.strip.start_with?(BPREFIX)}
      boundary = boundary_value.call(boundary) if boundary
      debug 'boundary from header is ' << (boundary ? boundary : 'NIL')
      # retrieve all parts
      parts = Array.new
      if boundary
        # the mail-body as string is used repetitively
        body = mail.body.to_s
        # first match of a boundary
        split_content = split_body.call(body, boundary)[1]
        # ... and all the remaining
        while split_content && !split_content.strip.empty?
          p_boundary = split_content.split(LN).detect{|l| l.strip.start_with?(BPREFIX )}
          if p_boundary
            boundary = boundary_value.call(p_boundary)
            debug 'boundary from part ' << (boundary ? boundary : 'NIL')
          end
          # get the portion following the boundary.
          part = split_body.call(body, boundary)[1]
          if part
            type = part.match(CT_REGEXP)
            type = type_value.call(type[0]) if type 
            encoding = part.match(CTRE_REGEXP)
            encoding = encoding_value.call(encoding[0]) if encoding 

            part_array = part.split(LN)
            # Cut off content-type and content-transfer-encoding.
            # Find the first empty line. Index 0 is skipped, as it holds only
            # the remainder of the boundary-line; strip copes with a trailing "\r".
            empty_index = part_array.each_index.detect{|i| i > 0 && part_array[i].strip.empty?}

            # keep what follows the empty line
            if empty_index
              text = part_array[(empty_index + 1)..-1].join(LN) 
            else
              # or all
              text = part
            end

            parts << MailPart.new(type, encoding, text)
            debug 'latest part ' << parts.last.inspect
            # continue with more parts only, if a boundary was found in the
            # current part
            if p_boundary
              split_content = split_body.call(body, boundary)[2]
            else
              debug 'no new boundary found'
              split_content = nil
            end
          end
        end
      end
      return (parts && !parts.empty? ? parts : nil)
    else
      return nil
    end
  end
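
The type and charset methods of MailPart depend on the position of each declaration. A possible alternative, sketched here as an assumption rather than a tested replacement: parse_content_type is a hypothetical helper that returns the media type together with a Hash of all parameters, whatever their order. It splits naively at ';', so a quoted value that itself contains a ';' would not survive.

```ruby
# Hypothetical helper: split a Content-Type value into the media type and a
# Hash of its parameters (charset, boundary, ...), independent of their order.
def parse_content_type(value)
  type, *params = value.to_s.split(';')
  attributes = params.each_with_object({}) do |param, memo|
    # Limit the split to 2: boundary values often contain '=' themselves.
    key, val = param.split('=', 2)
    next if val.nil?
    memo[key.strip.downcase] = val.strip.delete_prefix('"').delete_suffix('"')
  end
  [type.to_s.strip.downcase, attributes]
end

type, attributes = parse_content_type('multipart/alternative; boundary="----=_Part_1"; charset=UTF-8')
puts type                    # multipart/alternative
puts attributes['boundary']  # ----=_Part_1
puts attributes['charset']   # UTF-8
```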

Ω End of post.