Adam.S

Posted on Feb 1, 2021 • Edited on Jun 28, 2021 • Originally published at bas-man.dev

Searching for and Getting Emails

#google #python #api #email

Part 3 in a series of articles on implementing a notification system using Gmail and Line Bot

Hi. In this article I will work through how to get a list of message ids and how to fetch the email associated with each id.

I have done quite a bit of work processing email in a past life, working in Perl. I worked on a project which I am pretty sure predates Mailchimp, so I have a solid understanding of email and its standards.

But this is my first venture into processing emails with Python, so I did some research and found a couple of guides. Neither was particularly great, but they at least pointed me in the right direction. This will be a distilled version of what I gleaned.
If you want to see one of the sources, then I refer you to this. The flow is not the best. But he gets there.

For my purposes, I need to get a list of emails with the following conditions:

Arrived within the last 5 minutes (not possible so need a label)
Already labeled using Gmail filters.

I generally label key email with special labels. In my case I have three emails which already have labels applied when they come into my email account, so I will create a search using these labels.

This means creating a Gmail search string which does the following.

Gets all emails that have the labels I am interested in. Emails only need to have one of these labels.
The email must be at most 1 day old. (I cannot limit the search to anything newer.)
Excludes emails that have a label that will be added after processing. (This prevents re-processing, since processing runs once every 5 minutes.)

(label:labela OR label:labelb OR label:labelc)
- This gives us all emails that have any of these labels attached.
newer_than:1d
- Limits the search to emails from the last 24 hours.
-label:processed
- Excludes emails with this label.

The final search string looks like this:

((label:labela OR label:labelb OR label:labelc) AND -label:processed) AND newer_than:1d

So we want to add this as a CONSTANT that can be referred to later.

SEARCH_STRING = '((label:labela OR label:labelb OR label:labelc) AND -label:processed) AND newer_than:1d'
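If the set of labels changes over time, the search string could be assembled from a list rather than typed by hand. The sketch below is only an illustration: build_search_string and its parameters are hypothetical names of my own, not part of the Gmail API.

```python
def build_search_string(labels, processed_label="processed", newer_than="1d"):
    """Build a Gmail search query matching any of `labels`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if not labels:
        raise ValueError("at least one label is required")
    # Any one of the labels is enough, so join them with OR.
    any_label = " OR ".join(f"label:{label}" for label in labels)
    # Exclude already-processed mail and limit the age of the results.
    return (f"(({any_label}) AND -label:{processed_label}) "
            f"AND newer_than:{newer_than}")


SEARCH_STRING = build_search_string(["labela", "labelb", "labelc"])
```

Called with the three labels above, this produces the same search string as the hand-written constant.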

How do I get a list of emails that match this search condition?

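from googleapiclient import errors

def get_message_ids(service, search_string):
    try:
        search = service.users().messages().list(userId='me',
                                                 q=search_string).execute()
    except errors.HttpError as error:
        # Report the failure and fall back to an empty result so that
        # callers can still check resultSizeEstimate.
        print(f'An error occurred: {error}')
        return {'resultSizeEstimate': 0}
    return search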

This returns a dictionary which, when there are matches, contains two keys.

messages -> list of dicts, each with two keys: 'id' and 'threadId'
resultSizeEstimate -> estimated total number of results

{
'messages': [
    {'id': '1775d10a91ba4249', 'threadId': '1775c1ffe59cda8f'},
    {'id': '1775c1ffe59cda8f', 'threadId': '1775c1ffe59cda8f'}
    ], 
'resultSizeEstimate': 2
}

id is the individual email id and threadId is the email thread the id belongs to.
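One thing the sample response does not show: as far as I understand the Gmail API, a list call returns a single page of results, and when more are available the response carries a nextPageToken that can be passed back as pageToken. With a 1-day window and a processed label this may never matter, but a sketch that follows the pages could look like this. get_all_message_ids is a hypothetical name, and service is assumed to be the object from get_service().

```python
def get_all_message_ids(service, search_string):
    """Collect message ids from every page of a Gmail search.

    Hypothetical helper; `service` is assumed to be the object
    returned by get_service() earlier in this series.
    """
    ids = []
    page_token = None
    while True:
        kwargs = {'userId': 'me', 'q': search_string}
        if page_token:
            # Only send pageToken once we actually have one.
            kwargs['pageToken'] = page_token
        response = service.users().messages().list(**kwargs).execute()
        # 'messages' is assumed to be absent when a page has no results.
        ids.extend(m['id'] for m in response.get('messages', []))
        page_token = response.get('nextPageToken')
        if not page_token:
            return ids
```

Unlike get_message_ids(), this returns a flat list of ids rather than the raw response dictionary.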

Calling this using:

message_ids = get_message_ids(service, SEARCH_STRING)

Remember that service comes from service = get_service()

We can use resultSizeEstimate to determine if there are no matching messages.
Keeping in mind that any non-zero integer is considered True, we can write this function, which will return True or False:

def found_messages(message_ids):
    return bool(message_ids['resultSizeEstimate'])

This will return False when resultSizeEstimate equals zero, or True if it is greater than zero.

I am not interested in threads, so I am going to get just a list of ids.

def get_only_message_ids(message_ids):
    # Keep only the 'id' of each entry; the 'threadId' is not needed.
    return [message['id'] for message in message_ids['messages']]

Here I am looping over the list stored under the messages key to get just each message's id.
This will give me something like:

[
  '1775d10a91ba4249',
  '1775c1ffe59cda8f'
]

These are the individual ids of the emails that were found with my search string.
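One caveat with the two helpers above: as far as I can tell, when a search matches nothing the response contains resultSizeEstimate but no messages key, so get_only_message_ids() would raise a KeyError if it were called without checking found_messages() first. A defensive variant, offered as a sketch (safe_message_ids is a hypothetical name):

```python
def safe_message_ids(message_ids):
    """Return the ids in a list() response, or [] if nothing matched."""
    # .get() avoids a KeyError when the 'messages' key is missing.
    return [message['id'] for message in message_ids.get('messages', [])]


# The sample response from earlier in this article.
sample = {
    'messages': [
        {'id': '1775d10a91ba4249', 'threadId': '1775c1ffe59cda8f'},
        {'id': '1775c1ffe59cda8f', 'threadId': '1775c1ffe59cda8f'},
    ],
    'resultSizeEstimate': 2,
}
empty = {'resultSizeEstimate': 0}  # assumed shape of an empty result

ids = safe_message_ids(sample)
no_ids = safe_message_ids(empty)
```

With this version, an empty search simply yields an empty list, so the loop over the ids does nothing.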

At some point I will need to loop through these ids to process each message. But these are just the ids. We need to get the actual email referenced using the ids we have.

Let's get an actual email.

At this point we also need to add some more modules to our script.

import base64
import email
from email import parser
from email import policy

def get_message(service, msg_id):
    msg = service.users().messages().get(userId='me',
                                         id=msg_id,
                                         format='raw').execute()
    msg_in_bytes = base64.urlsafe_b64decode(msg['raw'].encode('ASCII'))
    email_tmp = email.message_from_bytes(msg_in_bytes,
                                         policy=policy.default)
    emailParser = parser.Parser(policy=policy.default)
    resulting_email = emailParser.parsestr(email_tmp.as_string())
    return resulting_email

What are we doing here?

Getting the message, a dictionary object. This gives us:

{ 'id': '1775d10a91ba4249',
  'threadId': '1775c1ffe59cda8f',
  'labelIds': [
      'Label_18',
      'CATEGORY_PERSONAL'
      ],
  'snippet': 'REDACTED REDACTED 様の入退室情報をお知らせします。 【セーフティメール情報】 2021-02-01 19:08:26 に退室しました。 ※なお、このメールに返信することはできませんのでご注意ください。',
  'sizeEstimate': 3448,
  'raw':
   'RGVsaXZlcmVkLVRvOiBiYXNwYW5uQGdtYWlsLmNvbQ0KUmVjZWl2ZWQ6IGJ5IDIwMDI6YWRm
   Ojk1MDY6MDowOjA6MDowIHdpdGggU01UUCBpZCA2Y3NwMzg1MzYyd3JzOw0KICAgICAgICBNb2
   ....snip...
   6MDg6MjYgGyRCJEtCYDw8JDckXiQ3JD8hIxsoQg0KDQobJEIiKCRKJCohIiQzJE4lYSE8JWskS
   0pWPy4kOSRrJDMkSCRPJEckLSReJDskcyROJEckNENtMFUkLyRAJDUkJCEjGyhCDQoNCg==',
  'historyId': '3086929',
  'internalDate': '1612174106000'
}

That is not so useful on its own. We could probably do something with the snippet, but I have read that it is not provided by the API in all languages. Your mileage may vary.
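The internalDate field is worth a second look. It appears to be the time Gmail received the message, given as milliseconds since the Unix epoch in a string, which suggests a way to approximate the "last 5 minutes" condition from the start of this article on the client side. This is only a sketch and not what the rest of this series does; received_at and is_recent are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone


def received_at(message):
    """Convert a message's internalDate (epoch milliseconds) to UTC."""
    return datetime.fromtimestamp(int(message['internalDate']) / 1000,
                                  tz=timezone.utc)


def is_recent(message, now=None, window=timedelta(minutes=5)):
    """Return True if the message arrived within `window` of `now`."""
    now = now or datetime.now(timezone.utc)
    return now - received_at(message) <= window


message = {'internalDate': '1612174106000'}  # value from the example above
arrived = received_at(message)
```

For the sample message, received_at() gives 2021-02-01 10:08:26 UTC, which lines up with the 19:08:26 Japan time mentioned in the snippet.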

Then access the raw key, whose value is base64 encoded (URL-safe variant). Decoding it gives us the byte string of the entire email, including headers.

b'Delivered-To: redact@example.com\r\nReceived: by 2002:adf:9506:0:0:0:0:0 
  with SMTP id 6csp385362wrs;\r\n        Mon, 1 Feb 2021 02:08:29 -0800 (PST)
  \r\nX-Googl..snip....
  \r\n\r\n\x1b$B"($J$*!"$3$N%a!<%k$KJV?.$9$k$3$H$O$G$-$^$;$s$N$G$4Cm0U$/$@$5$$!#\x1b(B
  \r\n\r\n'

This is a little more useful, but in this case the body is still encoded (ISO-2022-JP) because of the character set used in the email.

Create an email parser object to process the byte string email.
Read this byte string and create an email.message.EmailMessage object.

The last two steps get us closer. The email is still character encoded. But it's ready for us to use in the next step.

Comment

It is my understanding that email was developed to support only the ASCII character set. As a result, encodings have been added to support other languages. This means that a lot of emails are not really human readable by default.

If you are interested, you can take a peek at RFC 2045.

Note:

I am using:

email_tmp = email.message_from_bytes(msg_in_bytes,
                                     policy=policy.default)
emailParser = parser.Parser(policy=policy.default)

policy.default is relatively new. Processing email this way means I don't have to check the encoding of the contents; the parser handles that for me. I can skip checking whether the string is UTF-8 or ISO-2022-JP. I am doing this because the emails I am dealing with are in Japanese, as I live in Tokyo. If you are dealing only with English emails that are ASCII encoded, you can simplify the get_message() function.
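To see what policy.default buys us without calling the Gmail API, here is a small self-contained sketch. It builds a Japanese test message locally, encodes it the way the raw field is encoded (URL-safe base64), and then decodes and parses it. The addresses and text are made up for illustration, and this is a local experiment rather than the get_message() function itself.

```python
import base64
import email
from email import policy
from email.message import EmailMessage

# Build a made-up ISO-2022-JP message to stand in for a Gmail 'raw' value.
original = EmailMessage()
original['Subject'] = 'テスト'
original['From'] = 'sender@example.com'
original['To'] = 'me@example.com'
original.set_content('こんにちは', charset='iso-2022-jp')
raw = base64.urlsafe_b64encode(original.as_bytes()).decode('ascii')

# The same decoding steps used in get_message().
msg_in_bytes = base64.urlsafe_b64decode(raw.encode('ASCII'))
parsed = email.message_from_bytes(msg_in_bytes, policy=policy.default)

# With policy.default, headers and body come back as ordinary str values;
# there is no manual charset handling here.
subject = parsed['Subject']
body = parsed.get_content()
```

Here subject and body are plain Python strings, even though the bytes on the wire were ISO-2022-JP with an encoded Subject header.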

In the next section I will look at processing a single email to get the information I am interested in.
