Programming Languages

E-mail Database

Generalities

This is an optional individual activity that consists of two parts:

A 5 point extra credit in the final grade will be awarded to those students who fulfill both parts.

Problem

The following problem was taken from the ACM International Collegiate Programming Contest 2018, Latin American Regionals Contest.

Original problem name: Problem D — Database of Clients
Author: Paulo Cezar Pereira Costa, Brazil.

Nowadays there are billions of email users. A little-known fact is that some email providers offer awy more than the usual username@provider.com email address.

Some providers simply ignore dots in usernames. Thus, if John owns the username johnsmith, he could tell people that his email address is johnsmith@provider.com, john.smith@provider.com, or john.s.mith@provider.com, among others. Emails sent to any of these addresses would end up on his mailbox.

Other providers allow appending the character “+” followed by any combinations of letters and/or digits after the username. With this feature, by registering the username johnsmith, John would also be able to use johnsmith+friends@provider.com and johnsmith+2x3is6@provider.com.

Sometimes both features are available at once and in those cases john.smith+icpc@provider.com and john.smith+wants.2.eat.lemon.3.14@provider.com are valid addresses that John could use.

This is quite useful for users, who can manage different addresses to help organize their mailboxes and easily filter the newsletters eventually sent after registering on a new website. Unfortunately, this also opens up space for abuse. Some websites rely upon the fact that each email address identifies a single user. However, a misbehaving user might easily create multiples accounts by taking advantage of the multiple addresses allowed by the email provider.

After learning all of this, your boss got really worried. What if the number of usique users that has been reported to the shareholders is not accurate, bloated by duplicate accounts instead? That brings you to the task at hand: given the list of all email addresses form the users database of the company, you must determine the real number of unique users, assuming that all email providers have both described features available.

Write a Clojure function called count-different-emails that accepts a list of strings, where each string represents an email address from the database. Each email address has the form localpart@provider where localpart is a non-empty sequence of labels with a “.” (dot) or a “+” (plus sign) between each pair of consecutive labels, and provider is a non-empty sequence of labels always with a “.” (dot) between each pair of consecutive labels. A label is a non-empty sequence of lowercase letters and/or digits. The character “+” (plus sign) appears at most once in each email address.

The function must return an integer indicating the number of unique users in the database.

Examples:

(count-different-emails ["two.different.providers@now.here"
                         "two.different.providers@nowhere"])
=> 2

(count-different-emails ["1.2.3@testing"
                         "testing@1.2.3"])
=> 2

(count-different-emails ["alice@e.mail"
                         "eve@another.mail"
                         "bob@e.mail"
                         "joe90@e.mail"
                         "b.o.b@e.mail"
                         "bob+new@e.mail"
                         "bob@another.provider"])
=> 5