Last week I was working on two separate problems. I needed a Java library that I could use for some machine learning use cases. I also needed a way to identify a lot of similar images sitting in six of my directories. When I came across the BoofCV library, I thought I would see what it offers and what I could do with it.
The first thing I wanted to learn was how to load images so that I could process them. So I wrote a program that loads images from all the directories, calculates the Euclidean distance between them, and identifies which images are similar.
Of course this does not have much to do with machine learning, but it was a good starting point for learning BoofCV.
Problem Statement
I used to be an avid photographer. Now, because of a time crunch, photography has taken a back seat, but I still take a lot of photographs on my phone. Over time, I keep downloading the same images into different directories. This creates a lot of duplicate files on my file system. Of course there are programs to detect them, and in this case the filenames will also be the same. But it is more fun to build one yourself, and along the way I also get to test a Java library for computer vision.
For this example I just downloaded some random images from the scopio and unlimphotoes websites and put them in two separate directories. I also made sure some images are duplicated across the directories.
This is what my directories look like in Windows Explorer.

This is the first set of images. I copied over some images here.

This is the second set of images. I copied a different set over here. If you look closely, quite a number of images are intentionally duplicated. I want to extract the duplicate images and put them into a different directory, merged side by side for viewing. I also want to dump a list of the files that are the same.
Plan of Attack
This did not need too much thought. I decided to just use the squared Euclidean distance to measure how far apart two images are.
Euclidean distance is a measure used in machine learning for the distance between two points in Euclidean space. One of its prime uses is grouping and comparing data points (it is used extensively in algorithms such as k-nearest neighbors).

Suppose we have two points defined by (x1, y1) and (x2, y2). We first square the difference along each coordinate, (x2 - x1)^2 and (y2 - y1)^2, and add the results, which is the Pythagorean theorem applied in a plane. The square root of this sum is the Euclidean distance.
In most cases we do not need the real value of the formula. To speed up the process, we avoid taking the final square root and save some computing cycles. The result is known as the squared Euclidean distance.
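Written out for two n-dimensional points p and q (the symbols are notation chosen here for illustration), the two quantities are:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\qquad
d^2(p, q) = \sum_{i=1}^{n} (p_i - q_i)^2
```

Because the square root is monotonic, sorting or thresholding by the squared value gives the same ordering as the true distance, and the squared value is 0 exactly when the distance is 0.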
We will now concentrate on using the discussion above and finding a solution using BoofCV.
Creating BoofCV project
BoofCV can be pulled in through Maven, so we will use Maven to create this project. There is just one dependency to add in this case. I am not planning to use the GUI extension for this.

    <dependency>
        <groupId>org.boofcv</groupId>
        <artifactId>boofcv-core</artifactId>
        <version>0.42</version>
    </dependency>
We just use the core library, but depending on requirements, additional Maven dependencies can be added:

    boofcv-android: Add support for Android
    boofcv-WebcamCapture: Add support for capturing webcam
    boofcv-javacv: Add support to work with OpenCV files
    boofcv-ffmpeg: Add support to work with videos
    boofcv-swing: Add support for GUI/visualizations
    boofcv-jcodec: Uses Java libraries for working with videos
    boofcv-all: Add all of the additional modules to core
Finding same images
We will start by loading an image into a BufferedImage object.
    private BufferedImage loadColorImage(final String filePath) {
        final BufferedImage image = UtilImageIO.loadImage(filePath);
        return image;
    }
Calculate Histogram
To calculate the distance, we first calculate a histogram for each image. This generates an array of values that summarizes how the pixels of the entire image are distributed across hue and saturation bins.
    public double[] getHsvHist(final BufferedImage img) {
        // Convert the BufferedImage into BoofCV's planar (one band per channel) format
        final Planar<GrayF32> rgb = new Planar<>(
                GrayF32.class, img.getWidth(), img.getHeight(), 3);
        ConvertBufferedImage.convertFrom(img, rgb, true);

        // Convert RGB to HSV and keep only the hue and saturation bands
        final Planar<GrayF32> hsv = new Planar<>(
                GrayF32.class, img.getWidth(), img.getHeight(), 3);
        ColorHsv.rgbToHsv(rgb, hsv);
        final Planar<GrayF32> hs = hsv.partialSpectrum(0, 1);

        // 12 x 12 bins: hue spans 0 to 2*pi, saturation spans 0 to 1
        final Histogram_F64 histogram = new Histogram_F64(12, 12);
        histogram.setRange(0, 0, 2.0 * Math.PI);
        histogram.setRange(1, 0, 1.0);

        // Compute the histogram and normalize it
        GHistogramFeatureOps.histogram(hs, histogram);
        UtilFeature.normalizeL2(histogram);
        return histogram.data;
    }
BoofCV works with its own image data structures, so the first thing we do is convert the BufferedImage into a Planar object with ConvertBufferedImage.convertFrom. Instead of using RGB pixel values, we compute the histogram on HSV (hue, saturation, value), keeping only the hue and saturation bands. This gives a better result. ColorHsv.rgbToHsv does the conversion to HSV. We then compute the histogram with GHistogramFeatureOps.histogram, and finally we normalize it with UtilFeature.normalizeL2 and return it.
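To make the idea of a hue/saturation histogram concrete without any BoofCV types, here is a minimal plain-Java sketch. It is an illustration under my own assumptions, not what BoofCV does internally: it uses java.awt.Color.RGBtoHSB (which reports hue and saturation in the range 0 to 1 rather than in radians), and the class name, method name and bin layout are made up for this example.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

/** Plain-Java illustration of a hue/saturation histogram; not BoofCV code. */
final class HueSatHistogramSketch {
    // Bins per channel, mirroring the 12 x 12 histogram used in the article
    static final int BINS = 12;

    /** Returns a BINS*BINS histogram over (hue, saturation), scaled to unit L2 length. */
    static double[] hueSatHistogram(BufferedImage img) {
        double[] hist = new double[BINS * BINS];
        float[] hsb = new float[3];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                // Fills hsb with hue, saturation, brightness, each in [0, 1]
                Color.RGBtoHSB((rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF, hsb);
                int hueBin = Math.min((int) (hsb[0] * BINS), BINS - 1);
                int satBin = Math.min((int) (hsb[1] * BINS), BINS - 1);
                // Assumed layout: row = hue bin, column = saturation bin
                hist[hueBin * BINS + satBin] += 1.0;
            }
        }
        // L2 normalization, so images of different sizes stay comparable
        double sumSq = 0.0;
        for (double v : hist) {
            sumSq += v * v;
        }
        double norm = Math.sqrt(sumSq);
        if (norm > 0) {
            for (int i = 0; i < hist.length; i++) {
                hist[i] /= norm;
            }
        }
        return hist;
    }
}
```

Brightness is dropped on purpose here, just as the value band is dropped in getHsvHist, so two copies of an image that differ only in overall brightness would still land in similar bins.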
Reading all Files
I also created a small bean to hold the histogram of each image.

    @Data
    public class ImageProps {
        private int height;
        private int width;
        private TupleDesc_F64 histPoints;
    }

I wanted to be able to pass in any number of directories to scan, so I gave my main function a varargs signature. It starts like this:

    public void runIt(final String outDir, final String... fileDirs) {
        final List<String> files = new ArrayList<>();
        for (final String fileDir : fileDirs) {
            final List<String> images = UtilIO.listImages(fileDir, true);
            files.addAll(images);
        }
        Collections.sort(files);

The thing to note here is the call to UtilIO.listImages, a BoofCV function that lists all the images in a directory.
Populating Histogram for Files
Next I loop through every file we listed earlier, populate its details, and store them in a List. I created a custom function to load the file details, which looks like this.
    public ImageProps getImageProps(final String file) {
        final ImageProps props = new ImageProps();
        log.debug("Processing file: {}", file);
        final BufferedImage img = loadColorImage(file);
        final int srcWidth = img.getWidth();
        final int srcHeight = img.getHeight();
        props.setHeight(srcHeight);
        props.setWidth(srcWidth);
        final double[] hist = getHsvHist(img);
        final TupleDesc_F64 tuple = new TupleDesc_F64(hist);
        props.setHistPoints(tuple);
        return props;
    }

    public FileDetails getDetail(final String filePath) {
        final FileDetails fd = new FileDetails();
        final Path path = Paths.get(filePath);
        fd.setDirectory(path.getParent().toString());
        fd.setFileName(path.getFileName().toString());
        final ImageProps props = getImageProps(filePath);
        fd.setProps(props);
        return fd;
    }
Here I am using a simple bean class to store the values.
    @Data
    public class FileDetails {
        private String directory;
        private String fileName;
        private ImageProps props;
    }
Calculating Euclidean distance
Finally, we have everything we need to start calculating the squared Euclidean distance between all the files we found. In BoofCV, the squared Euclidean distance is calculated as shown below.

    public double getEuclideanDistance(final TupleDesc_F64 src, final TupleDesc_F64 dst) {
        return DescriptorDistance.euclideanSq(src, dst);
    }
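For readers who want to see what a squared Euclidean distance over two histogram arrays amounts to, here is a minimal plain-Java sketch of the formula. It is only an illustration and not BoofCV's implementation; the class and method names are hypothetical.

```java
/** Plain-Java illustration of the squared Euclidean distance; not BoofCV code. */
final class SquaredEuclideanSketch {

    /** Sums the squared per-element differences and skips the final square root. */
    static double squaredEuclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

For the points (0, 0) and (3, 4) this returns 25.0, whereas the true Euclidean distance would be 5.0. Two identical histograms always give exactly 0.0.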
To store all the differences, I created a bean class and defined a custom equals on it. Two entries count as equal when they name the same two files in either order, which lets me avoid adding the same pair of files twice.
    @Data
    public class HistDiff implements Comparable<HistDiff> {
        private String fileName1;
        private String fileName2;
        private double euclidDiff;

        @Override
        public int compareTo(HistDiff diff) {
            return this.fileName1.compareTo(diff.fileName1);
        }

        @Override
        public boolean equals(Object obj) {
            if (obj == this) {
                return true;
            }
            if (!(obj instanceof HistDiff)) {
                return false;
            }
            final HistDiff hd = (HistDiff) obj;
            if ((fileName1.equals(hd.fileName1) && fileName2.equals(hd.fileName2))
                    || (fileName1.equals(hd.fileName2) && fileName2.equals(hd.fileName1))) {
                return true;
            }
            return false;
        }

        @Override
        public int hashCode() {
            return fileName1.hashCode() * fileName2.hashCode();
        }
    }
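The order-independent equals only pays off when something actually consults it, for example a contains check or a Set. The following is a minimal, self-contained sketch of that idea; FilePair and countUnique are hypothetical names for this illustration, and a sum is used for the hash code here simply because it is also order-independent.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustration of an unordered pair of file names; not part of the original program. */
final class FilePair {
    private final String first;
    private final String second;

    FilePair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == this) {
            return true;
        }
        if (!(obj instanceof FilePair)) {
            return false;
        }
        FilePair other = (FilePair) obj;
        // Equal when both names match in either order
        return (first.equals(other.first) && second.equals(other.second))
                || (first.equals(other.second) && second.equals(other.first));
    }

    @Override
    public int hashCode() {
        // Addition ignores order, so (a, b) and (b, a) land in the same bucket
        return first.hashCode() + second.hashCode();
    }

    /** Counts distinct pairs, treating (a, b) and (b, a) as the same pair. */
    static int countUnique(String[][] pairs) {
        Set<FilePair> seen = new HashSet<>();
        for (String[] pair : pairs) {
            seen.add(new FilePair(pair[0], pair[1]));
        }
        return seen.size();
    }
}
```

Passing the pairs (a, b), (b, a) and (a, c) to countUnique gives 2, because the first two collapse into one entry.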
Next we loop through each file, calculate the Euclidean difference between it and every other file, and store the results in a List. The function that collects them looks as follows:
    public List<HistDiff> getHistograms(
            final FileDetails fdSrc, final List<FileDetails> fds) {
        final List<HistDiff> simList = new ArrayList<>();
        fds.forEach(fl -> {
            // Compare strings with equals(); != only compares object references.
            // Skip only the source file itself: duplicates in other directories
            // usually share its file name, so they must not be filtered out.
            final boolean sameFile = fl.getFileName().equals(fdSrc.getFileName())
                    && fl.getDirectory().equals(fdSrc.getDirectory());
            if (!sameFile) {
                final double eclDist = getEuclideanDistance(
                        fdSrc.getProps().getHistPoints(),
                        fl.getProps().getHistPoints());
                final HistDiff hist = new HistDiff();
                hist.setEuclidDiff(eclDist);
                // Note: the "\\" separator is Windows-specific
                hist.setFileName1(String.format("%s\\%s",
                        fl.getDirectory(), fl.getFileName()));
                hist.setFileName2(String.format("%s\\%s",
                        fdSrc.getDirectory(), fdSrc.getFileName()));
                simList.add(hist);
            }
        });
        return simList;
    }
Write Final Output
With that out of the way, we check whether the Euclidean distance is 0. If it is, we create a merged image and dump it in the output directory. Since we have not included the swing library, we just create a custom function to combine two images (for reference only).
    // Needs java.awt.Graphics2D in addition to java.awt.image.BufferedImage
    public BufferedImage joinImage(final String file1, final String file2) {
        final BufferedImage img1 = loadColorImage(file1);
        final BufferedImage img2 = loadColorImage(file2);

        // Side by side: the widths add up, the height is that of the taller image
        final int newWidth = img1.getWidth() + img2.getWidth();
        final int newHeight = Math.max(img1.getHeight(), img2.getHeight());
        final BufferedImage newImg = new BufferedImage(
                newWidth, newHeight, BufferedImage.TYPE_INT_RGB);

        // Draw both images through a single Graphics2D and release it afterwards
        final Graphics2D g = newImg.createGraphics();
        try {
            g.drawImage(img1, 0, 0, null);
            g.drawImage(img2, img1.getWidth(), 0, null);
        } finally {
            g.dispose();
        }
        return newImg;
    }
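The step that decides which pairs count as duplicates, and that produces the list of matching files, can be sketched in plain Java as follows. Byte-identical copies produce identical histograms and therefore a distance of exactly 0, but a tiny tolerance may be a safer comparison for floating-point values. The EPSILON value, the class name and the method names below are assumptions of mine, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: picks out pairs whose squared distance is (nearly) zero. */
final class DuplicateFilterSketch {
    // Assumed tolerance; not a value taken from the original program
    static final double EPSILON = 1e-9;

    static boolean isDuplicate(double squaredDistance) {
        return squaredDistance <= EPSILON;
    }

    /** Builds one report line per duplicate pair; the three arrays are parallel. */
    static List<String> reportLines(String[] first, String[] second, double[] squaredDistances) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < squaredDistances.length; i++) {
            if (isDuplicate(squaredDistances[i])) {
                lines.add(first[i] + " == " + second[i]);
            }
        }
        return lines;
    }
}
```

The resulting lines could then be written to a text file, for example with java.nio.file.Files.write, while each matching pair is passed to joinImage for the side-by-side picture.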
Now we can process our duplicates in any way we want. The merged images that were dumped confirm that the identified files were indeed duplicates. You can see the output below.

Those seem accurate to me.
Conclusion
Even though this was a simple project, I had fun going through the BoofCV APIs. I intend to try out more of the computer vision features of the library next. Hope you found this useful. Ciao for now.