This article lists regular expressions for currency (e.g., US$100, £0.12, or HK$54), time, and date values, ready for quick copy and paste. They’re all battle-tested. Each regex has its limitations, so we’ve included notes on them along with customization tips.
We hope you’ll check out the interactive code snippets to get a better idea of how the regexes work!
Currency Regex
Note that currency signs other than “$” are dropped, but the value itself is still matched. For example, “£112.00” in the first item of the test array is returned as “112.00”. Bare numbers such as “14” are matched too, and leading zeros are skipped.
import re
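test = [
    "$9876 £112.00",
    "asdf$1234",
    "$12.00 14",
    "$3000000000000",
    "$00000000000001",
    "$00000000000000",
    "asdf",
    "one hundred forty two dollars"
]
# Optional "$", then a number with thousands separators, a plain integer, or 0,
# then an optional one- or two-digit decimal part; no further digit may follow.
regex = re.compile(
    r'\$?(?:(?:[1-9][0-9]{0,2})(?:,[0-9]{3})+|[1-9][0-9]*|0)(?:[\.,][0-9][0-9]?)?(?![0-9]+)'
)
# Flatten the per-string match lists into one list.
print(sum([regex.findall(x) for x in test], []))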
Results should be:
['$9876', '112.00', '$1234', '$12.00', '14', '$3000000000000', '1', '0']
Interactive code snippets available here
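If the currency sign matters for your use case, one possible customization is to widen the optional “$” prefix. The sketch below is a variant under stated assumptions, not the original regex: it accepts a two-letter prefix such as US or HK before “$”, or a bare £ or € sign. The names CURRENCY and find_amounts are made up for illustration.

```python
import re

# Assumption: the sign is either "$" with an optional two-letter prefix (US$, HK$),
# or a bare "£" / "€". Extend the prefix group for other currencies.
CURRENCY = re.compile(
    r'(?:(?:[A-Z]{2})?\$|[£€])?'                        # optional currency sign
    r'(?:[1-9][0-9]{0,2}(?:,[0-9]{3})+|[1-9][0-9]*|0)'  # integer part
    r'(?:[.,][0-9]{1,2})?'                              # optional decimals
    r'(?![0-9])'                                        # no digit right after
)

def find_amounts(text):
    """Return every amount in text, keeping its currency sign when present."""
    return CURRENCY.findall(text)

print(find_amounts("US$100, £0.12, or HK$54"))
# → ['US$100', '£0.12', 'HK$54']
```

Because every group is non-capturing, findall returns the whole match, sign included.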
Time Regex
import re
test = [
    "00:00:00", "23:59:59",
    "00 00 00", "23 59 59",
    "00.00.00", "23.59.59",
    "00:00.00", "23.59:59",
    "9:00pm", "9:00am", "10:00:00 am",
    # "13 pm" is not matched; "13:00:12 am" still is, because the regex
    # does not check the hour against the am/pm suffix
    "13:00:12 am", "13 pm"
]
# The lookahead lets matches overlap, which is why "00 00 00" also yields ' 00 00'.
regex = re.compile(
    r'(?=((?: |^)[0-2]?\d[:. ]?[0-5]\d(?:[:. ]?[0-5]\d)?(?:[ ]?[ap]\.?m?\.?)?(?: |$)))'
)
print(sum([regex.findall(x) for x in test], []))
Results should be:
['00:00:00', '23:59:59', '00 00 00', ' 00 00', '23 59 59', '00.00.00', '23.59.59', '00:00.00', '23.59:59', '9:00pm', '9:00am', '10:00:00 am', '13:00:12 am']
Interactive code snippets available here
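The time regex accepts some strings that are not real times, such as “13:00:12 am”, because the hour pattern does not depend on the am/pm suffix. One way to tighten this is to validate each candidate after matching. The following is a minimal sketch under our own assumptions: TIME is a simplified, anchored variant of the regex above with named groups, and is_valid_time is a hypothetical helper, not something from the original post.

```python
import re

# Anchored variant of the time pattern with named groups for hour and am/pm.
TIME = re.compile(
    r'^(?P<h>[0-2]?\d)[:. ]?(?P<m>[0-5]\d)(?:[:. ]?(?P<s>[0-5]\d))?'
    r'(?:[ ]?(?P<ap>[ap])\.?m?\.?)?$'
)

def is_valid_time(text):
    """Check the hour range: 1-12 with an am/pm suffix, 0-23 without one."""
    match = TIME.match(text.strip().lower())
    if match is None:
        return False
    hour = int(match.group('h'))
    if match.group('ap'):
        return 1 <= hour <= 12
    return hour <= 23

for candidate in ["23:59:59", "9:00pm", "13:00:12 am", "25:00", "13 pm"]:
    print(candidate, is_valid_time(candidate))
# → 23:59:59 True
# → 9:00pm True
# → 13:00:12 am False
# → 25:00 False
# → 13 pm False
```

Keeping the range check in plain Python avoids making the regex itself harder to read.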
Regex Date with months in English (YYYY/MMMM/dd)
import re
test = [
    "2020-jan-1",
    "2012-jan-12",
    "1920-feb-22",
    # A space isn't a valid delimiter here, though you can add it to the regex
    "2020 mar 1",
    # Four-digit years must start with 19 or 20; for other centuries only the
    # last two digits are captured (here '40-jun-12'), so add prefixes as needed
    "1840-jun-12",
    # The month must be an English name or abbreviation, not a number
    "2020-01-01"
]
# Raw strings keep backslashes such as \d and \- intact for the regex engine.
# The four alternatives cover day-month-year, year-month-day,
# month-day-year (four-digit year), and month-day-year (two-digit year).
regex = re.compile(
    r'(?=((?:(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])[/\-,.]?(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:19|20)?\d{2}(?!\:)|'
    r'(?:19|20)?\d{2}[/\-,.]?(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])|'
    r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])[/\-,.]?(?:19|20)\d{2}(?!\:)|'
    r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])[/\-,.]?\d{2})))'
)
print(sum([regex.findall(x) for x in test], []))
Results should be:
['2020-jan-1', '20-jan-1', '2012-jan-12', '12-jan-12', '2-jan-12', '1920-feb-22', '20-feb-22', '40-jun-12']
Interactive code snippets available here
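Because the date regex is wrapped in a lookahead, matches can overlap, so a single date such as “2012-jan-12” also produces shorter results like “12-jan-12”. If you only want the longest reading, one option is to drop any match that sits inside an earlier one. Here is a rough sketch of that idea. The longest_matches helper is hypothetical, and demo is a deliberately simplified stand-in for the full date regex (year-month-day order and three months only) to keep the example short.

```python
import re

def longest_matches(regex, text):
    """Return group-1 captures, skipping any that lie inside an earlier match."""
    kept = []
    covered_until = 0  # end index of the furthest-reaching match kept so far
    for match in regex.finditer(text):
        start, end = match.span(1)
        if end > covered_until:
            kept.append(match.group(1))
            covered_until = end
    return kept

# Simplified stand-in for the date regex, using the same lookahead-plus-group shape.
demo = re.compile(
    r'(?=((?:19|20)?\d{2}-(?:jan|feb|mar)[a-z]*-(?:[12]\d|3[01]|0?[1-9])))'
)

print(demo.findall("2012-jan-12"))
# → ['2012-jan-12', '12-jan-12']
print(longest_matches(demo, "2012-jan-12"))
# → ['2012-jan-12']
```

The helper only relies on the pattern exposing its match as group 1 inside a lookahead, so it should work with the full date regex as well.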
Check out the Original Post for More Details
This is an excerpt from our original blog post, which provides more regexes and explanations. That article also discusses more accurate ways to extract data and proposes solutions. We'd love for you to check it out and share your thoughts. Happy coding, cheers!