Philip Steiner: Ruby script: find duplicate files

Tuesday, June 26, 2007

Ruby script: find duplicate files

A quick Google search for a script that finds duplicate files by name in a directory tree turned up two promising techniques: a Ruby script posted to OnJava by Bill Siggelkow, and a bash script using common Unix tools.

Here's my attempt to reproduce the bash results in Ruby:

#!/usr/bin/env ruby
require 'find'

files = {}
found = {}

# walk each root directory given on the command line
ARGV.each do |arg|
  Find.find(arg) do |f|
    if File.file?(f) 
      # accumulate the file names
      files[f] = File.basename(f)
    end
  end
end

# count up the number of each file name 
files.each_value do |base|
  # Ruby doesn't allow this Perl idiom: found[base]++
  found[base] = 0 if !found[base]
  found[base] += 1
end

# print the path of each file found more than once,
# prepended with rm command commented out
found.each do |name,count|
  if count > 1
    files.each do |path,filename|
      if name == filename
        puts "# rm #{path}"
      end
    end
  end
end

Given a directory structure containing files with duplicate names in different directories, the output looks something like this:

# rm /market/fruits/tomato.txt
# rm /market/vegetables/tomato.txt
# rm /market/fruits/pea.txt
# rm /market/vegetables/pea.txt

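The output could be redirected to a file and run as a shell script, after you uncomment the "rm" lines for the files that should be deleted (if that's what you want). Note that every copy of a duplicated name is listed, so uncommenting all of the lines would delete all of the copies.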

This is all a bit clunky. If you've found a better or more Rubyesque way to do this, let me know!
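One possible tightening, offered as a sketch rather than the definitive answer: Enumerable#group_by can bucket the paths by basename in a single pass, which replaces both the counting hash and the nested loop. The helper name duplicate_names is made up for illustration, and the sketch assumes Ruby 1.9 or later, where Find.find returns an Enumerator when called without a block.

```ruby
#!/usr/bin/env ruby
require 'find'

# Return a hash of basename => [paths] for every file name that
# occurs more than once under the given root directories.
# (duplicate_names is a hypothetical helper, not part of the original script.)
def duplicate_names(roots)
  paths = roots.flat_map do |root|
    # Without a block, Find.find returns an Enumerator over every path.
    Find.find(root).select { |f| File.file?(f) }
  end
  paths.group_by { |f| File.basename(f) }
       .select { |_name, group| group.size > 1 }
end

if __FILE__ == $PROGRAM_NAME
  # Same output format as the script above: commented-out rm commands.
  duplicate_names(ARGV).each_value do |group|
    group.each { |path| puts "# rm #{path}" }
  end
end
```

Like the original, this compares names only, not file contents, so two files listed together may still differ.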

1 comment:

George, June 19, 2008 at 8:55 AM
I wonder how to add an ignore/exclude list with such things as: .svn, Makefile, build.xml, etc.

By the way, it seems to run quite well.
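
One possible approach to an exclude list, as a sketch only: check each basename against a list of ignored names, and call Find.prune on ignored directories so that Find.find does not descend into them. The IGNORED constant and the files_excluding helper are made-up names for illustration and are not part of the original script.

```ruby
require 'find'

# Hypothetical ignore list, taken from the examples in the comment above.
IGNORED = %w[.svn Makefile build.xml].freeze

# Collect regular files under root, skipping any entry whose basename
# is in the ignore list. (files_excluding is a hypothetical helper.)
def files_excluding(root, ignored = IGNORED)
  files = []
  Find.find(root) do |f|
    if ignored.include?(File.basename(f))
      # Find.prune skips the rest of an ignored directory such as .svn.
      Find.prune if File.directory?(f)
      next
    end
    files << f if File.file?(f)
  end
  files
end
```

The resulting list could then be fed into the name-counting step in place of the unfiltered walk.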