Advanced Machine Learning, Data Mining, and Online Advertising Services
One interesting problem that we worked in the past was understanding how people generate their email addresses. This problem has applications in different areas like fraud detection and generating emails for cold email reachout.
Let's asume you have a person's first name (fn), last name (ln) and email address. One way to think about this problem is to formalate a function f(.) taking "fn" and "ln" and generate email as output: email = f(fn, ln). Function f() is a language function transforming "fn" and "ln" to an email address.
If we express f(.) programmatically, we can approach the email language modeling as a pattern mining problem in which we generate an f(.) for each tuple (fn, ln, email) as email pattern. Next, we merge similar patterns and report the overall count as the pattern frequency.
Let's take my gmail address for example. My tuple is (fn, ln, em)=(kazem, jahanbakhsh, k.jahanbakhsh) where "k.jahanbakhsh" is the local part of my gmail. So, my email can be expressed as follows:
em = f(fn, ln) = fn[1].ln
In words, my email is composed of the first character of "fn" appended by "dot" and "ln".
One simple (this is not exhaustive) way to approach this problem is to rely on our intuition. People are likely to pick some part of their "fn" and "ln" and sometimes a number (e.g. birthday) and compose those pieces to create their email address. So, one way to express f(.)'s is by pre-defining likely patterns using the regular expression. Say we guess many people would take their "fn" and "ln" and append them to create their email. So, we can express such an f(.) using the following regex:
re.match(value["fn"] + value["ln"] + '$', value["local_part"])
We came up with 12 predefined regex rules to mine one of our datasets and count frequent patterns. The list below shows some of the most frequent patterns we found in the dataset we mined:
fn.ln --> 68
fnln --> 52
fn --> 47
fnln\d+ --> 34
fn[0]ln --> 30
"fn.ln" and i"fnln" appear as the most frequent patterns for emails. The pros for above approach is its simpilicity. However, the issue is its limitation of not being able to automatically generate all possible patterns.
In our second approach, we want to automate generating patterns. People are likely to generate their email addresses by picking some characters from their "fn" and "ln" and append them with some especial characters such as: ".", "_" and digits. Also, due to sequence in human memory people will pick characters from their "fn"/"ln" from left to right.
Based on above observations, we can come up with a simple greedy search strategy scanning "fn" and "ln" from left to right and try to generate the email address. Below you see a sample Python code for searching and generating a pattern by analyzing fn, ln, and email:
#generate pattern
def gen_pattern(fn, ln, em):
#first drop all especial characters and numbers
pattern = re.compile("[^a-z]")
clean_em = pattern.sub('', em)
clean_fn = pattern.sub('', fn)
clean_ln = pattern.sub('', ln)
flag = False
for fn_len in xrange(len(clean_fn) + 1):
for ln_len in xrange(len(clean_ln) + 1):
if clean_fn[:fn_len] + clean_ln[:ln_len] == clean_em:
extract "p1", "inter", "p2", "end" parts from (fn, ln, em)
flag = True
elif clean_ln[:ln_len] + clean_fn[:fn_len] == clean_em:
extract "p2", "inter", "p1", "end" parts from (fn, ln, em)
flag = True
if flag:
em_pattern = p1 + inter + p2 + end
return em_pattern
After generating patterns, we combined and counted the frequency of each pattern. Below, see the list of most frequent patterns generated by a greedy search technique:
('fnln\d{4}', 4)
('fn.ln.', 4)
('fn\d{4}', 4)
('fnln\d{1}', 5)
('ln.fn', 5)
('ln', 6)
('fn[1]ln\d{3}', 6)
('fn[3]ln', 6)
('fn[1]ln\d{4}', 6)
('fn_ln', 7)
('fn.ln.\d{1}', 7)
('fn[1]ln\d{2}', 14)
('fnln\d{2}', 21)
('fn[1]ln', 31)
('fn', 50)
('fnln', 52)
('fn.ln', 70)
Above results show extracted patterns with their frequencies. This approach is more exhaustive than the first one. The most frequent pattern is 'fn.ln'. However, it's interesting to see fnln\d{2}' and 'fn[1]ln\d{4}' patterns when people probably attach the last two digits or their birthdate to their "fn" and "ln".
In this post, we described the interesting problem of exploring different ways that people employ to generate their email assresses by formulating the problem as a simple data mining problem. You can also apply the same technique to mine people social media usernames such as Facebook or Twitter.
Feel free to leave a comment below if have any suggestions/questions. You can also reach us at info@AIOptify.com .