Part 3 in a series of articles on implementing a notification system using Gmail and Line Bot
Hi. In this article I will be working through how to get a list of message_ids and to get the email associated with an id.
I have done quite a bit of work with processing email in a past life, working in Perl. I worked on a project which I am pretty sure predates Mailchimp. So I have a solid understanding of email and its standards.
But this is my first venture into processing emails with Python. So I did some research and found a couple of guides. Neither was particularly great, but they at least pointed me in the right direction. This will be a distilled version of what I gleaned.
If you want to see one of the sources, then I refer you to this. The flow is not the best. But he gets there.
For my purposes, I need to get a list of emails with the following conditions:
- Arrived within the last 5 minutes (not possible so need a label)
- Already labeled using Gmail filters.
I generally label key email with special labels. In my case I have three emails which already have labels applied when they come into my email account, so I will create a search using these labels.
This means creating a Gmail search string which does the following.
- Gets all emails that have the labels I am interested in. Emails only need to have one of these labels.
- The email must be at most 1 day old. (I cannot limit the search to anything more recent than that.)
- Exclude emails that have a label that will be added after processing. (This prevents re-processing, since processing runs once every 5 minutes.)
- (label:labela OR label:labelb OR label:labelc)
- This gives us all emails that have any of these labels attached.
- newer_than:1d
- Limit to emails that are at most 24 hours old.
- -label:processed
- Exclude emails with this label.
The final search string looks like this:
((label:labela OR label:labelb OR label:labelc) AND -label:processed) AND newer_than:1d
So we want to add this as a CONSTANT that can be referred to later.
SEARCH_STRING = '((label:labela OR label:labelb OR label:labelc) AND -label:processed) AND newer_than:1d'
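If the labels change often, the search string could also be built from a list instead of being typed by hand. This is only a minimal sketch: the helper name build_search_string and its parameters are my own invention for illustration, not part of the Gmail API.

```python
def build_search_string(labels, processed_label='processed', newer_than='1d'):
    """Build a Gmail search string from a list of label names (hypothetical helper)."""
    # Any one of the labels is enough, so join them with OR.
    any_label = ' OR '.join(f'label:{name}' for name in labels)
    return (f'(({any_label}) AND -label:{processed_label}) '
            f'AND newer_than:{newer_than}')


SEARCH_STRING = build_search_string(['labela', 'labelb', 'labelc'])
print(SEARCH_STRING)
```

With the three example labels this produces the same kind of string as the constant above, with the labels joined by OR.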
How do I get a list of emails that match this search condition?
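from googleapiclient import errors

def get_message_ids(service, search_string):
    """Return the Gmail search result for search_string."""
    try:
        search = service.users().messages().list(userId='me',
                                                 q=search_string).execute()
    except errors.HttpError as error:
        # Report the failure and return an empty result so callers still work.
        print(f'An error occurred: {error}')
        return {'messages': [], 'resultSizeEstimate': 0}
    return search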
This returns a dictionary which contains two keys. (When nothing matches, the messages key is missing.)
- messages -> List of dict() with two keys: 'id' and 'threadId'
- resultSizeEstimate -> estimated total number of results
{
    'messages': [
        {'id': '1775d10a91ba4249', 'threadId': '1775c1ffe59cda8f'},
        {'id': '1775c1ffe59cda8f', 'threadId': '1775c1ffe59cda8f'}
    ],
    'resultSizeEstimate': 2
}
id is the individual email id and threadId is the email thread the id belongs to.
Calling this using:
message_ids = get_message_ids(service, SEARCH_STRING)
Remember that service comes from service = get_service()
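One thing get_message_ids() does not handle: the list call returns results a page at a time, and the response carries a nextPageToken key when more pages exist. With a 1 day window that is unlikely to matter for me, but here is a hedged sketch of how all pages could be collected. get_all_message_ids and FakeService are hypothetical names; FakeService only exists so the loop can be tried without a real Gmail connection.

```python
def get_all_message_ids(service, search_string):
    """Collect message ids from every page of a Gmail search result."""
    ids = []
    page_token = None
    while True:
        kwargs = {'userId': 'me', 'q': search_string}
        if page_token:
            kwargs['pageToken'] = page_token
        response = service.users().messages().list(**kwargs).execute()
        # 'messages' is missing when a page has no matches.
        ids.extend(m['id'] for m in response.get('messages', []))
        page_token = response.get('nextPageToken')
        if not page_token:
            return ids


class FakeService:
    """Hypothetical stand-in for the Gmail service, for offline testing only."""

    def __init__(self, pages):
        self._pages = pages  # maps a page token (or None) to a response dict
        self._current = None

    def users(self):
        return self

    def messages(self):
        return self

    def list(self, userId, q, pageToken=None):
        self._current = self._pages[pageToken]
        return self

    def execute(self):
        return self._current


pages = {
    None: {'messages': [{'id': 'a', 'threadId': 't1'}],
           'nextPageToken': 'p2', 'resultSizeEstimate': 2},
    'p2': {'messages': [{'id': 'b', 'threadId': 't1'}],
           'resultSizeEstimate': 2},
}
all_ids = get_all_message_ids(FakeService(pages), 'label:labela')
print(all_ids)
```

With the real service you would call get_all_message_ids(service, SEARCH_STRING) instead of using the fake.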
We can use resultSizeEstimate to determine if there are no matching messages. Keeping in mind that any non-zero integer is considered True, we can make this function, which will return True or False:
def found_messages(message_ids):
    return bool(message_ids['resultSizeEstimate'])
This will return False when resultSizeEstimate equals zero, or True if it is greater than zero.
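As the name says, resultSizeEstimate is an estimate. A more defensive variant might look at the messages list itself. This is just a sketch, assuming an empty result simply leaves out the messages key; has_messages is a hypothetical name.

```python
def has_messages(search_result):
    """Return True when the search result lists at least one message."""
    # .get() returns None when the 'messages' key is missing.
    return bool(search_result.get('messages'))


empty = {'resultSizeEstimate': 0}
found = {
    'messages': [{'id': '1775d10a91ba4249', 'threadId': '1775c1ffe59cda8f'}],
    'resultSizeEstimate': 1,
}
print(has_messages(empty))  # False
print(has_messages(found))  # True
```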
I am not interested in threads. So I am going to get just a list of ids.
def get_only_message_ids(message_ids):
    ids = []
    # .get() avoids a KeyError when the search found nothing.
    for anId in message_ids.get('messages', []):
        ids.append(anId['id'])
    return ids
Here I am looping over the messages list to get just each message's id.
This will give me something like:
[
    '1775d10a91ba4249',
    '1775c1ffe59cda8f'
]
These are the individual ids for each email that were found with my search string.
At some point I will need to loop through these ids to process each message. But these are just the ids. We need to get the actual email referenced using the ids we have.
Let's get an actual email.
At this point we also need to add some more modules to our script.
import base64
import email
from email import parser
from email import policy
def get_message(service, msg_id):
    msg = service.users().messages().get(userId='me',
                                         id=msg_id,
                                         format='raw').execute()
    msg_in_bytes = base64.urlsafe_b64decode(msg['raw'].encode('ASCII'))
    email_tmp = email.message_from_bytes(msg_in_bytes,
                                         policy=policy.default)
    emailParser = parser.Parser(policy=policy.default)
    resulting_email = emailParser.parsestr(email_tmp.as_string())
    return resulting_email
What are we doing here?
- Getting the message, a dictionary object. This gives us:
{
    'id': '1775d10a91ba4249',
    'threadId': '1775c1ffe59cda8f',
    'labelIds': [
        'Label_18',
        'CATEGORY_PERSONAL'
    ],
    'snippet': 'REDACTED REDACTED 様の入退室情報をお知らせします。 【セーフティメール情報】 2021-02-01 19:08:26 に退室しました。 ※なお、このメールに返信することはできませんのでご注意ください。',
    'sizeEstimate': 3448,
    'raw':
        'RGVsaXZlcmVkLVRvOiBiYXNwYW5uQGdtYWlsLmNvbQ0KUmVjZWl2ZWQ6IGJ5IDIwMDI6YWRm
        Ojk1MDY6MDowOjA6MDowIHdpdGggU01UUCBpZCA2Y3NwMzg1MzYyd3JzOw0KICAgICAgICBNb2
        ....snip...
        6MDg6MjYgGyRCJEtCYDw8JDckXiQ3JD8hIxsoQg0KDQobJEIiKCRKJCohIiQzJE4lYSE8JWskS
        0pWPy4kOSRrJDMkSCRPJEckLSReJDskcyROJEckNENtMFUkLyRAJDUkJCEjGyhCDQoNCg==',
    'historyId': '3086929',
    'internalDate': '1612174106000'
}
Not so useful. We could probably do something with the snippet, but I have read this is not provided by the API for all languages. Your mileage may vary.
- Then access the raw key, which is base64 encoded (the URL-safe variant). Decoding it gives us the byte string of the entire email, including headers.
b'Delivered-To: redact@example.com\r\nReceived: by 2002:adf:9506:0:0:0:0:0
with SMTP id 6csp385362wrs;\r\n Mon, 1 Feb 2021 02:08:29 -0800 (PST)
\r\nX-Googl..snip....
\r\n\r\n\x1b$B"($J$*!"$3$N%a!<%k$KJV?.$9$k$3$H$O$G$-$^$;$s$N$G$4Cm0U$/$@$5$$!#\x1b(B
\r\n\r\n'
This is a little more useful, but again in this case it's been encoded due to the character set that was used in the email.
- Read this byte string and create an email.message.EmailMessage object.
- Create an email Parser object and use it to parse the message again from its string form.
The last two steps get us closer. The email is still character encoded. But it's ready for us to use in the next step.
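The decode and parse steps can be tried without touching Gmail at all. The sketch below builds a made-up Japanese message, base64 encodes it the way the raw key would be, and then decodes and parses it again. The addresses, subject and body are invented for the example, decode_raw is a hypothetical helper name, and it stops after message_from_bytes rather than repeating every step of get_message().

```python
import base64
import email
from email import policy
from email.message import EmailMessage


def decode_raw(raw):
    """Turn the base64 (URL-safe) string from the 'raw' key into an EmailMessage."""
    msg_in_bytes = base64.urlsafe_b64decode(raw.encode('ASCII'))
    return email.message_from_bytes(msg_in_bytes, policy=policy.default)


# Build a small Japanese test message (addresses and text are made up).
original = EmailMessage()
original['From'] = 'sender@example.com'
original['To'] = 'redact@example.com'
original['Subject'] = 'お知らせ'
original.set_content('退室しました。', charset='iso-2022-jp')

# Mimic the API: the whole message, base64 encoded, as a str.
raw = base64.urlsafe_b64encode(original.as_bytes()).decode('ASCII')

parsed = decode_raw(raw)
print(parsed['Subject'])              # header, decoded for us
print(parsed.get_content_charset())   # charset declared by the message
print(parsed.get_content())           # body, decoded using that charset
```

The point of policy.default here is that the header and body come back as normal Python strings, without me decoding ISO-2022-JP by hand.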
Comment
It is my understanding that email was developed to only support the ASCII character set. As a result, encodings have been added to support other languages. This means that a lot of emails are not really human readable by default.
If you are interested you can take a peek at RFC 2045.
Note:
I am using:
email_tmp = email.message_from_bytes(msg_in_bytes,
                                     policy=policy.default)
emailParser = parser.Parser(policy=policy.default)
The policy.default is relatively new. Processing email this way means I don't have to check the encoding of the contents. The parser will handle that for me. I can skip checking if the string is UTF-8 or ISO-2022-JP. I am doing this because the emails I am dealing with are in Japanese, as I live in Tokyo. If you are dealing only with English emails that are ASCII encoded, you can simplify the get_message() function.
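For the ASCII-only case, one possible simplification might look like the sketch below. This is my guess at a reasonable shortcut, not a tested replacement: get_message_simple is a hypothetical name, it uses the email module's default (older) policy, and FakeService is only a stand-in so the function can be run without a Gmail connection.

```python
import base64
import email


def get_message_simple(service, msg_id):
    """Fetch one message and parse it in a single step (ASCII-only emails)."""
    msg = service.users().messages().get(userId='me',
                                         id=msg_id,
                                         format='raw').execute()
    # urlsafe_b64decode accepts the ASCII str from the API directly.
    return email.message_from_bytes(base64.urlsafe_b64decode(msg['raw']))


class FakeService:
    """Hypothetical stand-in for the Gmail service, for offline testing only."""

    def __init__(self, raw):
        self._raw = raw

    def users(self):
        return self

    def messages(self):
        return self

    def get(self, userId, id, format):
        return self

    def execute(self):
        return {'id': '1', 'raw': self._raw}


plain = b'From: a@example.com\r\nSubject: Hello\r\n\r\nPlain body\r\n'
fake = FakeService(base64.urlsafe_b64encode(plain).decode('ASCII'))
simple_email = get_message_simple(fake, '1')
print(simple_email['Subject'])
print(simple_email.get_payload().strip())
```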
In the next section I will look at processing a single email to get the information I am interested in.