Web log anonymizer
I recently needed to anonymize the IP addresses in an Apache access log. It seemed like a simple task, but I couldn’t find any good code samples for it, so I figured I’d post mine here for others to use. The only real requirements were that it process large logs quickly and that it map each original IP address to the same replacement every time it appears, so the traffic data still holds together at the session level. With a little more work, I’m sure it could pick random addresses in the same geographic region as the original. As written, it draws each replacement at random from the whole IPv4 space, so the results will probably be spread across the globe, skewed by who actually owns which ranges.
So here are the few lines of Perl that got the job done:
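#!/usr/bin/perl
use strict;
use warnings;

# Require at least one log file on the command line.
if (@ARGV < 1) {
    print "\n\tUsage:\n";
    print "\t------\n\n";
    print "\tperl log_anonymize.pl file1 [file2 [file3 [...]]]\n\n";
    die "Please specify at least one file to use this script.\n\n";
}

my %forward = ();   # original IP -> anonymized IP
my %reverse = ();   # anonymized IP -> original IP (used to avoid collisions)

foreach my $file (@ARGV) {
    open(my $orig, "<", $file)
        or die "Failed to open input file '$file' for reading: $!\n";
    open(my $anon, ">", "$file.anon")
        or die "Failed to open destination file '$file.anon' for writing: $!\n";
    while (my $line = <$orig>) {
        # Only the first IPv4-looking string on each line is remapped.
        if ($line =~ /([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/) {
            my $ip = $1;
            if (!exists $forward{$ip}) {
                my $newIp = getNewIp();
                # Draw again if this random address is already taken.
                while (exists $reverse{$newIp}) {
                    $newIp = getNewIp();
                }
                print "New mapping created: $ip -> $newIp\n";
                $forward{$ip} = $newIp;
                $reverse{$newIp} = $ip;
            }
            my $repl = $forward{$ip};
            # \Q...\E makes the dots in the address match literally.
            $line =~ s/\Q$ip\E/$repl/;
        }
        print $anon $line;
    }
    close($orig);
    close($anon);
}
exit 0;

# Build a random dotted-quad address; each octet is 0-255.
sub getNewIp {
    return join(".", map { int(rand(256)) } 1 .. 4);
}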
It is fairly straightforward. You invoke the Perl script with one or more arguments, each of which should be a path to an access log. For each file, it creates a new file with the same name plus “.anon” appended. Across all those files, the script keeps an internal hash of the IPs it has already mapped to new, random addresses and re-uses those mappings whenever the same IP shows up again. Only the first IP-looking string on each line is replaced, which in the default Apache log formats is the client address. It prints a short message each time a new mapping is created, so you can pipe the output through ‘wc -l’ or something similar to see how many unique addresses you had, or you could have it print a count at the end. Either is pretty simple to do.
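For the count-at-the-end option, here is a minimal sketch of how it might look. It is not the script itself, just the same mapping idea boiled down: the sample log lines are made up for illustration, and anonymize_line is a hypothetical helper name I picked for the example. The count is simply the number of keys in the forward hash once everything has been processed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines standing in for a real access log (assumption).
my @lines = (
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 404 0',
);

my %forward;   # original IP -> anonymized IP
my %reverse;   # anonymized IP -> original IP

# Random dotted-quad address, each octet 0-255.
sub getNewIp {
    return join(".", map { int(rand(256)) } 1 .. 4);
}

# Hypothetical helper: remap the first IP on a line, creating a
# mapping the first time an address is seen.
sub anonymize_line {
    my ($line) = @_;
    if ($line =~ /([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/) {
        my $ip = $1;
        if (!exists $forward{$ip}) {
            my $newIp = getNewIp();
            $newIp = getNewIp() while exists $reverse{$newIp};
            $forward{$ip}    = $newIp;
            $reverse{$newIp} = $ip;
        }
        my $repl = $forward{$ip};
        $line =~ s/\Q$ip\E/$repl/;
    }
    return $line;
}

my @anonymized = map { anonymize_line($_) } @lines;

# The number of keys in %forward is the number of unique addresses seen.
my $count = scalar keys %forward;
print "$count unique addresses remapped\n";
```

In the real script, the last two lines would go just before the exit, after the loop over the files has finished.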
So that’s it, easy web log anonymizing via random IP address remapping.