Tuesday, June 26, 2007

Ruby script: find duplicate files

A quick Google search for a script that finds duplicate files by name in a directory tree turned up two promising techniques: a Ruby script posted to OnJava by Bill Siggelkow, and a bash script using common Unix tools.

Here's my attempt to reproduce the bash results in Ruby:

#!/usr/bin/env ruby
require 'find'

files = {}
found = {}

# walk each root directory given on the command line
ARGV.each do |arg|
  Find.find(arg) do |f|
    if File.file?(f)
      # record each file's base name, keyed by its full path
      files[f] = File.basename(f)
    end
  end
end

# count up the number of each file name
files.each_value do |base|
  # Ruby doesn't allow this Perl idiom: found[base]++
  found[base] = 0 if !found[base]
  found[base] += 1
end

# print the path of each file found more than once,
# prepended with rm command commented out
found.each do |name, count|
  if count > 1
    files.each do |path, filename|
      if name == filename
        puts "# rm #{path}"
      end
    end
  end
end


Given a directory structure containing files with duplicate names in different directories, the output looks something like this:

# rm /market/fruits/tomato.txt
# rm /market/vegetables/tomato.txt
# rm /market/fruits/pea.txt
# rm /market/vegetables/pea.txt

The output could be redirected to a file and used as a shell script, in which you'd uncomment the "rm" statements for the files that should be deleted (if that's what you want).

This is all a bit clunky. If you've found a better or more Rubyesque way to do this, let me know!
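For what it's worth, here is one possibly more compact take on the same idea: collect the paths once, then let group_by bucket them by base name. This is only a sketch, not a tested replacement for the script above. The method name duplicate_names is made up for illustration, and transform_values assumes Ruby 2.4 or later.

```ruby
require 'find'

# Hypothetical helper (not part of the original script): returns a hash of
# basename => sorted paths for every basename seen more than once
# under the given root directories.
def duplicate_names(roots)
  paths = []
  roots.each do |root|
    Find.find(root) { |f| paths << f if File.file?(f) }
  end

  paths.group_by { |f| File.basename(f) }         # basename => [paths]
       .select { |_name, group| group.size > 1 }  # keep only repeated names
       .transform_values(&:sort)                  # stable order within a name
end

if __FILE__ == $PROGRAM_NAME
  # Same output format as the original: commented-out rm commands.
  duplicate_names(ARGV).each_value do |group|
    group.each { |path| puts "# rm #{path}" }
  end
end
```

Grouping keeps all paths for a given name together, so the nested loop over the whole files hash is no longer needed.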

1 comment:

  1. I wonder how to add an ignore/exclude list with such things as: .svn, Makefile, build.xml, etc.

    By the way, it seems to run quite well.
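
One way an exclude list like that might be sketched is with Find.prune, which skips the current entry and, for a directory, everything beneath it. The EXCLUDE constant and the each_wanted_file method are hypothetical names chosen for illustration; the list simply reuses the names mentioned in the comment.

```ruby
require 'find'

# Hypothetical exclude list, using the names mentioned in the comment.
EXCLUDE = %w[.svn Makefile build.xml].freeze

# Hypothetical helper: yields each regular file under root,
# skipping any entry whose base name is in the exclude list.
def each_wanted_file(root, exclude = EXCLUDE)
  Find.find(root) do |f|
    if exclude.include?(File.basename(f))
      # Skip this entry; if it is a directory, its contents are skipped too.
      Find.prune
    elsif File.file?(f)
      yield f
    end
  end
end
```

In the script above, the body of the ARGV loop could then call each_wanted_file(arg) { |f| files[f] = File.basename(f) } instead of Find.find directly.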
