This repository was archived by the owner on Aug 24, 2022. It is now read-only.
Improve e-mail-address processing #34
Open
Description
Problem
Codeface currently mishandles the following two From lines from mbox files:
From: ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)
From: Hans Huber <huber at hubercorp.com>
While the string ' at ' is replaced by @ in the first case, it is not in the second.
In the first case, the name is also not parsed properly: it should be Zsbán Ambrus, but the encoded string is stored as is in the database.
mysql> select name, email1 from person;
+-------------------------------+------------------------+
| name                          | email1                 |
+-------------------------------+------------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu     |
| Hans Huber                    | huber at hubercorp.com |
+-------------------------------+------------------------+
Fix for case two
The following patch by @wolfgangmauerer (taken from the mailing list, tested by me) makes the handling of e-mail addresses more robust and transforms the second case correctly.
diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
## Trim trailing and leading whitespace
author <- str_trim(author)
+ ## Replace textual ' at ' with @, sometimes
+ ## we can recover an email
+ author <- sub(' at ', '@', author)
+ author <- sub(' AT ', '@', author)
+
## Check if email exists
email.exists <- grepl("<.+>", author, TRUE)
@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
"<xxxyyy@abc.tld>); attempting to recover from: ", author)
logdevinfo(msg, logger="ml.analysis")
- ## Replace textual ' at ' with @, sometimes
- ## we can recover an email
- author <- sub(' at ', '@', author)
- author <- sub(' AT ', '@', author)
-
## Check for @ symbol
r <- regexpr("\\S+@\\S+", author, TRUE)
email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
## string minus the new email part as name, and construct
## a valid name/email combination
name <- sub(email, "", author, fixed=TRUE)
- name <- str_trim(name)
+ name <- fix.name(name)
}
## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
author <- paste(name, ' <', email, '>', sep="")
}
else {
- ## Verify that the order is correct
+ ## There is a correct email address. Ensure that the order is correct
+ ## and fix cases like "<hans.huber@hubercorp.com> Hans Huber"
+
## Get email and name parts
r <- regexpr("<.+>", author, TRUE)
if(r[[1]] == 1) {
email <- substr(author, r, r + attr(r,"match.length")-1)
name <- sub(email, "", author, fixed=TRUE)
- name <- str_trim(name)
+ name <- fix.name(name)
email <- str_trim(email)
author <- paste(name,email)
}
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index 963cd2d..f596829 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -445,3 +445,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {
return(global.id)
}
+
+## Given a name with leading and pending whitespace that is possibly
+## surrounded by braces, return the name proper.
+fix.name <- function(name) {
+ name <- str_trim(name)
+ if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+ str_length(name)) == ")") {
+ name <- substr(name, 2, str_length(name)-1)
+ }
+
+ return (name)
+}
After applying the patch, the database contains:
mysql> select name, email1 from person;
+-------------------------------+----------------------+
| name                          | email1               |
+-------------------------------+----------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu   |
| Hans Huber                    | huber@hubercorp.com  |
+-------------------------------+----------------------+
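The effect of moving the ' at ' replacement ahead of the `<.+>` check can be reproduced outside Codeface with a minimal base-R sketch. `recover.author` is a hypothetical helper written only for this illustration, `fix.name` is re-implemented with `trimws` instead of `str_trim` so the snippet is self-contained, and the second input assumes the address was given in angle brackets.

```r
## Base-R stand-in for the patch's fix.name(): trim whitespace and drop
## one pair of surrounding parentheses, if present.
fix.name <- function(name) {
  name <- trimws(name)
  n <- nchar(name)
  if (n >= 2 && substr(name, 1, 1) == "(" && substr(name, n, n) == ")") {
    name <- substr(name, 2, n - 1)
  }
  return(name)
}

## Hypothetical helper (not part of Codeface) that mirrors the patched
## order of operations on a single From value.
recover.author <- function(author) {
  author <- trimws(author)
  ## Replace the first textual ' at ' / ' AT ' BEFORE checking for <...>
  author <- sub(" at ", "@", author, fixed=TRUE)
  author <- sub(" AT ", "@", author, fixed=TRUE)

  ## An address in angle brackets is already present: nothing to recover
  if (grepl("<.+>", author)) {
    return(author)
  }

  ## Otherwise look for something that resembles an address
  r <- regexpr("\\S+@\\S+", author)
  if (r == -1) {
    return(author)
  }
  email <- substr(author, r, r + attr(r, "match.length") - 1)
  name <- fix.name(sub(email, "", author, fixed=TRUE))
  return(paste(name, " <", email, ">", sep=""))
}

case1 <- recover.author("ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)")
case2 <- recover.author("Hans Huber <huber at hubercorp.com>")
```

With the replacement placed inside the recovery branch, as before the patch, the second input would skip it entirely because the `<.+>` check already succeeds.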
Things to do
- Incorporate the patch into Codeface core
- Fix the encoding problem: the name in the first case is still stored undecoded
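For the encoding item, the stored name has the shape of an RFC 2047 encoded word (`=?charset?Q?text?=`). One possible direction, shown here only as a sketch in base R, is to decode such words before the name is stored. `decode.q.word` is a hypothetical helper, not an existing Codeface function; it assumes a single word in Q encoding and leaves anything else (B encoding, several adjacent words, plain names) untouched.

```r
## Hypothetical helper (not part of Codeface): decode one RFC 2047
## Q-encoded word such as "=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=".
decode.q.word <- function(x) {
  m <- regmatches(x, regexec("^=\\?([^?]+)\\?[Qq]\\?([^?]*)\\?=$", x))[[1]]
  if (length(m) == 0) {
    ## Not a single Q-encoded word: return the input unchanged
    return(x)
  }
  charset <- m[2]

  ## In Q encoding, "_" stands for a space
  payload <- gsub("_", " ", m[3], fixed=TRUE)

  ## Split into "=XX" hex escapes and literal runs, then rebuild the bytes
  tokens <- regmatches(payload,
                       gregexpr("=[0-9A-Fa-f]{2}|[^=]+|=", payload))[[1]]
  bytes <- lapply(tokens, function(tok) {
    if (grepl("^=[0-9A-Fa-f]{2}$", tok)) {
      as.raw(strtoi(substr(tok, 2, 3), 16L))
    } else {
      charToRaw(tok)
    }
  })
  decoded <- rawToChar(as.raw(unlist(bytes)))

  ## Convert from the declared charset to UTF-8
  return(iconv(decoded, from=charset, to="UTF-8"))
}

decoded.name <- decode.q.word("=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=")
plain.name <- decode.q.word("Hans Huber")
```

In the flow of the patch, a call like this would most naturally sit next to `fix.name`, after the surrounding parentheses have been stripped.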