Cli tool to sanitize and enrich addresses written in Clojure.
Addresses have a clear structure that can be more or less detailed. It is simple to model an address in a relational database, but very often you get a CSV with infromation about companies or people including addresses that are not correctly formatted.
Unlike email addresses and phone numbers, addresses cannot easly be parsed or validated though regex. In addition people make typos, abbreviate some words and only include part of the addres, turning something like
Geissbergstrasse 3, 8302 Kloten Zurich CH
into
Geisbergstr. 3 Kloten
For programmatic use of the data, especially in conjuction with other, external Services it is mendatory to have the data automatically corrected and normalized.
This solution proposes to use an external service like Open Street Maps, PTV Maps or Google Maps, to sanitize and enrich the incorrect or incomplete addresses.
It should be flexible enough to take a CSV or text file and list of the address fields that you want and return a file that contains the corrected and enriched addresses. If the address is not found in the service provider than it should return the original address in the field fallback.
In order to make this tool usable in a more versatile and langauge agnostic way, a REST API should be implemented.
Before first using the tool, it is necessary to create an uberjar:
lein uberjar
The created uberjar will be located here
target/address-sanitizer-clojure-0.1.0-SNAPSHOT-standalone.jar
Once the file is compiled, you can use it with Java:
java -jar target/address-sanitizer-clojure-0.1.0-SNAPSHOT-standalone.jar
Usage: program-name [options] filepath
Options:
-c, --chunk CHUNK_SIZE 100 Chunk size
-o, --output FILE_PATH /tmp/results.csv Output file
-f, --format OUTPUT_FORMAT display_name Output format
-h, --help
An example case can be started with the provided dummy data:
java -jar target/address-sanitizer-clojure-0.1.0-SNAPSHOT-standalone.jar dummy_data/address_list.txt -c 6 -o /tmp/results.csv -f 'display_name lat lon fallback'
The several options for the output format can be found at: Nominatim
Copyright © 2019 Miguel Soares
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.
