How to resolve the algorithm Find duplicate files step by step in the Java programming language

Published on 12 May 2024 09:40 PM

How to resolve the algorithm Find duplicate files step by step in the Java programming language

Table of Contents

Problem Statement

In a large directory structure it is easy to inadvertently leave unnecessary copies of files around, which can use considerable disk space and create confusion.

Create a program which, given a minimum size and a folder/directory, will find all files of at least size bytes with duplicate contents under the directory and output or show the sets of duplicate files in order of decreasing size. The program may be command-line or graphical, and duplicate content may be determined by direct comparison or by calculating a hash of the data. Specify which filesystems or operating systems your program works with if it has any filesystem- or OS-specific requirements. Identify hard links (filenames referencing the same content) in the output if applicable for the filesystem. For extra points, detect when whole directory sub-trees are identical, or optionally remove or link identical files.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Find duplicate files step by step in the Java programming language

High Level Overview

The DuplicateFiles program in Java aims to find and group duplicate files within a specified directory that meet a specified minimum file size. It uses a combination of file size and MD5 checksum comparisons to identify potential duplicates.

Detailed Explanation

Main Method (line 11-22):

  • Validates that the command-line arguments include a directory name and minimum file size (line 13).
  • Calls findDuplicateFiles to perform the duplicate file search (line 20).

findDuplicateFiles Method (line 26-66):

  • Receives a directory path and minimum file size as parameters (line 27).
  • Initializes a FileVisitor to traverse the file system (line 36) and a map to store duplicate file information keyed by a FileKey (line 37).
  • Uses the Files API to recursively walk the file tree (line 39).
  • Outputs information about duplicate filesets that have the same size and checksum (lines 57-60).

FileVisitor (line 72-105):

  • Extends the SimpleFileVisitor class to provide custom file processing logic.
  • Calculates MD5 checksums for files that meet the minimum size requirement (line 88).
  • Stores potential duplicate file paths and file key information in the fileMap_ map (lines 90-97).

FileKey (line 109-124):

  • Represents a key for grouping files based on size and MD5 checksum.
  • Compares FileKey instances based on size and checksum (line 121).

StringListComparator (line 69-71):

  • Custom comparator for sorting lists of strings (used to display duplicate file paths).

Additional Methods (line 127-136):

  • containsDuplicates: Checks if a map contains any key with more than one value (indicating potential duplicates).
  • getMD5Sum: Calculates the MD5 checksum for a file (line 101).
  • hashCompare: Compares two byte arrays representing MD5 checksums (line 134).

Source code in the java programming language

import java.io.*;
import java.nio.*;
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.security.*;
import java.util.*;

public class DuplicateFiles {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Directory name and minimum file size are required.");
            System.exit(1);
        }
        try {
            findDuplicateFiles(args[0], Long.parseLong(args[1]));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void findDuplicateFiles(String directory, long minimumSize)
        throws IOException, NoSuchAlgorithmException {
        System.out.println("Directory: '" + directory + "', minimum size: " + minimumSize + " bytes.");
        Path path = FileSystems.getDefault().getPath(directory);
        FileVisitor visitor = new FileVisitor(path, minimumSize);
        Files.walkFileTree(path, visitor);
        System.out.println("The following sets of files have the same size and checksum:");
        for (Map.Entry<FileKey, Map<Object, List<String>>> e : visitor.fileMap_.entrySet()) {
            Map<Object, List<String>> map = e.getValue();
            if (!containsDuplicates(map))
                continue;
            List<List<String>> fileSets = new ArrayList<>(map.values());
            for (List<String> files : fileSets)
                Collections.sort(files);
            Collections.sort(fileSets, new StringListComparator());
            FileKey key = e.getKey();
            System.out.println();
            System.out.println("Size: " + key.size_ + " bytes");
            for (List<String> files : fileSets) {
                for (int i = 0, n = files.size(); i < n; ++i) {
                    if (i > 0)
                        System.out.print(" = ");
                    System.out.print(files.get(i));
                }
                System.out.println();
            }
        }
    }

    private static class StringListComparator implements Comparator<List<String>> {
        public int compare(List<String> a, List<String> b) {
            int len1 = a.size(), len2 = b.size();
            for (int i = 0; i < len1 && i < len2; ++i) {
                int c = a.get(i).compareTo(b.get(i));
                if (c != 0)
                    return c;
            }
            return Integer.compare(len1, len2);
        }
    }

    private static boolean containsDuplicates(Map<Object, List<String>> map) {
        if (map.size() > 1)
            return true;
        for (List<String> files : map.values()) {
            if (files.size() > 1)
                return true;
        }
        return false;
    }

    private static class FileVisitor extends SimpleFileVisitor<Path> {
        private MessageDigest digest_;
        private Path directory_;
        private long minimumSize_;
        private Map<FileKey, Map<Object, List<String>>> fileMap_ = new TreeMap<>();

        private FileVisitor(Path directory, long minimumSize) throws NoSuchAlgorithmException {
            directory_ = directory;
            minimumSize_ = minimumSize;
            digest_ = MessageDigest.getInstance("MD5");
        }

        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            if (attrs.size() >= minimumSize_) {
                FileKey key = new FileKey(file, attrs, getMD5Sum(file));
                Map<Object, List<String>> map = fileMap_.get(key);
                if (map == null)
                    fileMap_.put(key, map = new HashMap<>());
                List<String> files = map.get(attrs.fileKey());
                if (files == null)
                    map.put(attrs.fileKey(), files = new ArrayList<>());
                Path relative = directory_.relativize(file);
                files.add(relative.toString());
            }
            return FileVisitResult.CONTINUE;
        }

        private byte[] getMD5Sum(Path file) throws IOException {
            digest_.reset();
            try (InputStream in = new FileInputStream(file.toString())) {
                byte[] buffer = new byte[8192];
                int bytes;
                while ((bytes = in.read(buffer)) != -1) {
                    digest_.update(buffer, 0, bytes);
                }
            }
            return digest_.digest();
        }
    }

    private static class FileKey implements Comparable<FileKey> {
        private byte[] hash_;
        private long size_;

        private FileKey(Path file, BasicFileAttributes attrs, byte[] hash) throws IOException {
            size_ = attrs.size();
            hash_ = hash;
        }

        public int compareTo(FileKey other) {
            int c = Long.compare(other.size_, size_);
            if (c == 0)
                c = hashCompare(hash_, other.hash_);
            return c;
        }
    }

    private static int hashCompare(byte[] a, byte[] b) {
        int len1 = a.length, len2 = b.length;
        for (int i = 0; i < len1 && i < len2; ++i) {
            int c = Byte.compare(a[i], b[i]);
            if (c != 0)
                return c;
        }
        return Integer.compare(len1, len2);
    }
}


  

You may also check:How to resolve the algorithm Factorial step by step in the PL/0 programming language
You may also check:How to resolve the algorithm Find limit of recursion step by step in the PHP programming language
You may also check:How to resolve the algorithm Convert decimal number to rational step by step in the Seed7 programming language
You may also check:How to resolve the algorithm Ackermann function step by step in the MAXScript programming language
You may also check:How to resolve the algorithm Rhonda numbers step by step in the C++ programming language