DupeFinder.java—Find duplicate files
Note: the explanations on how this program works are still missing.
The DupeFinder program builds a list of files from its command-line
arguments and finds all files with the same content.
The program first throws out all files with a unique size,
then creates CRC32 checksums on all remaining files and prints all
files which share the same checksum and size to standard output.
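The two-stage filter described above can be sketched roughly as follows. This is a minimal illustration, not the code from DupeFinder.java: the class name DupeSketch and the methods crc32Of and findDuplicates are made up for this example, and it uses java.nio.file, which the original program may not use. Keep in mind that an equal size and an equal CRC32 checksum make identical content very likely but do not prove it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch; the names below do not come from DupeFinder.java.
class DupeSketch {

    // Reads the file in chunks and returns its CRC32 checksum.
    static long crc32Of(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return crc.getValue();
    }

    // Returns groups of files that share both size and CRC32 checksum.
    static List<List<Path>> findDuplicates(List<Path> files) throws IOException {
        // Stage 1: group by size, which is cheap because no content is read.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path f : files) {
            bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
        }
        List<List<Path>> result = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue; // a unique size cannot have a duplicate
            }
            // Stage 2: checksum only the files that survived stage 1.
            Map<Long, List<Path>> byCrc = new HashMap<>();
            for (Path f : sameSize) {
                byCrc.computeIfAbsent(crc32Of(f), k -> new ArrayList<>()).add(f);
            }
            for (List<Path> group : byCrc.values()) {
                if (group.size() > 1) {
                    result.add(group);
                }
            }
        }
        return result;
    }
}
```

Grouping by size first means that files with a unique size are never opened at all, which is where most of the time is saved.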
This program covers the following topics:
- recursively scanning directory trees for file listings,
- creating checksums (also see CopyFile for that topic),
- processing lists and
- sorting with user-defined Comparators.
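As a rough illustration of the first and last topics, the following sketch collects files recursively and sorts them with a user-defined Comparator. The names ScanSketch, collect and BY_SIZE_THEN_PATH are hypothetical and chosen only for this example; the actual program may organize this differently.

```java
import java.io.File;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the names below do not come from DupeFinder.java.
class ScanSketch {

    // Adds f to out if it is a regular file, or descends into it if it is a directory.
    static void collect(File f, List<File> out) {
        if (f.isDirectory()) {
            File[] children = f.listFiles(); // null if the directory cannot be read
            if (children != null) {
                for (File child : children) {
                    collect(child, out);
                }
            }
        } else if (f.isFile()) {
            out.add(f);
        }
    }

    // Orders files by size first and by path second, so that
    // files of equal size end up next to each other.
    static final Comparator<File> BY_SIZE_THEN_PATH = new Comparator<File>() {
        @Override
        public int compare(File a, File b) {
            int bySize = Long.compare(a.length(), b.length());
            return bySize != 0 ? bySize : a.getPath().compareTo(b.getPath());
        }
    };
}
```

After calling collect on each command-line argument, sorting the resulting list with BY_SIZE_THEN_PATH puts candidates for duplicates into adjacent positions.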
Note that this program turned out a bit larger and more complex than I expected.
You may want to try the smaller examples first.
If you do master the program's complexity, there are some
tips on extending the program as a student project.
Compiling and running the program
These instructions are hopefully beginner-friendly.
That's why they are a bit verbose.
-
Save the source code in a file named DupeFinder.java
(the file name is case-sensitive).
-
Open a prompt (shell), change to the directory where you have saved the file
and compile it:
javac DupeFinder.java
Now you should have two new files DupeFinder.class and FileInfo.class
in the same directory.
Explanation for the second class file: the source code file contains two class declarations.
-
Run the program with this command:
java DupeFinder FILE1 FILE2 ... DIR1 DIR2 ...
where the FILEs and DIRs are file and directory names which you can
add in an arbitrary order.
Explanation
TODO.
If you have studied the program and find it interesting,
here are some suggestions on how you could enhance it.
-
Right now, the program only prints duplicates to standard output.
Make the program delete all but the oldest file in a set of duplicates to save space.
-
Implement a more sophisticated algorithm to quickly find differences in files.
Instead of creating a checksum on a complete file,
work on blocks of data from all files of the same size in parallel.
Whenever a file has a unique checksum on a block, its content is unique and
you can abort reading and creating checksums on other blocks,
saving CPU cycles.
-
Try to determine whether there are complete directories or even
directory trees which are equal to others.
This will make it easier for the user to regain disk space by deleting
more sizeable portions of the file system.
-
Add a persistence layer on the file information.
Creating those checksums is time-consuming,
so store that information in a flat file or a database.
This is a bit tricky because you'll have to update this
list with the current file system state, recognizing added,
modified and deleted files.
You'll also have to include all files of a directory tree,
because this program currently throws away all files with a unique size
before further processing.
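To give an idea of the block-wise suggestion above, here is one possible sketch. It is only an illustration under several assumptions: the names BlockwiseSketch, readBlock and groupByBlocks are made up, the input paths are expected to be distinct files of the same size, and all files are kept open at once, which will not scale to very large groups. As with whole-file checksums, matching block checksums make equal content very likely but do not prove it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch of the block-wise idea; not part of DupeFinder.java.
class BlockwiseSketch {

    // Fills buf as far as possible; returns the bytes read (0 at end of file).
    static int readBlock(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }

    // Expects distinct files of equal size. Returns the groups of files
    // whose checksums matched on every block.
    static List<List<Path>> groupByBlocks(List<Path> sameSize, int blockSize)
            throws IOException {
        Map<Path, InputStream> streams = new HashMap<>();
        try {
            for (Path p : sameSize) {
                streams.put(p, Files.newInputStream(p));
            }
            List<List<Path>> groups = new ArrayList<>();
            if (sameSize.size() > 1) {
                groups.add(new ArrayList<>(sameSize));
            }
            byte[] buf = new byte[blockSize];
            boolean more = true;
            while (more && !groups.isEmpty()) {
                List<List<Path>> next = new ArrayList<>();
                for (List<Path> group : groups) {
                    // Split the group by the checksum of the current block.
                    Map<Long, List<Path>> byCrc = new HashMap<>();
                    for (Path p : group) {
                        int n = readBlock(streams.get(p), buf);
                        if (n == 0) {
                            more = false; // equal sizes: all files end together
                        }
                        CRC32 crc = new CRC32();
                        crc.update(buf, 0, n);
                        byCrc.computeIfAbsent(crc.getValue(), k -> new ArrayList<>()).add(p);
                    }
                    // A file alone in its subgroup is unique and is never read again.
                    for (List<Path> g : byCrc.values()) {
                        if (g.size() > 1) {
                            next.add(g);
                        }
                    }
                }
                groups = next;
            }
            return groups;
        } finally {
            for (InputStream in : streams.values()) {
                in.close();
            }
        }
    }
}
```

A file that differs from all others in an early block drops out after that block, so the rest of it is never read or checksummed.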
DupeFinder.java