macOS Photos, Clustering & Dimension Reduction

Daryl Felix
6 min read · Jan 9, 2022

I am always looking for new ideas and ways to practice my Machine Learning and Python programming skills. Recently, I changed my computer and moved to a Mac Mini M1. Of course, when doing that you always have to (re)synchronise all your data via cloud storage or backups. The most complicated part was dealing with multiple Photos libraries.

Photo by Annie Spratt on Unsplash

Photos

On my backups and cloud storage I found multiple versions of my Photos library (Apple) and, of course, I didn’t want to lose a single important image, so I merged everything into a new one!

Is there a Python library for that?

Python and its ecosystem of libraries are an endless journey. You can find tools and libraries for anything. So, a few Google searches later, I found “osxphotos”, a Python library for working with Photos. I would say, “a very cool library”.

I will let you explore on your own the many features and options you can play with on the Photos library stored on your Mac.
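As a small taste, here is a minimal sketch of the library’s Python API, assuming a recent osxphotos version; it opens the system Photos library and prints a few attributes of the first photos.

import osxphotos

# open the system Photos library (read-only)
photosdb = osxphotos.PhotosDB()

# print basic metadata for the first five photos
for photo in photosdb.photos()[:5]:
    print(photo.original_filename, photo.date, photo.keywords)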

So, what to do with this new toy?

I decided to crawl my library (more than 28K images/photos/movies) and apply an unsupervised learning algorithm, i.e. an algorithm that learns patterns from untagged data. Of course, with the osxphotos library I could have extracted, for example, keywords and tagged my resources with them, but I chose the unsupervised method.

Challenge

So the challenge was to dump images from the library, create a dataset for a clustering algorithm, and visualise the result.

  • use osxphotos to download the photos to my hard drive (I excluded movies)
  • create an algorithm to extract vectors from the photos (X, Y, 3) and store them in an array
  • fit a clustering estimator (I chose K-Means from Scikit-Learn)
  • predict the outcome, i.e. classify the images into the number of categories I chose
  • output the images into another folder (for example, to process this output and re-import it into a new Photos library)
  • apply PCA (Principal Component Analysis), another unsupervised algorithm, for dimensionality reduction
  • plot a scatter chart of my Photos library (just for the fun)!

First steps with osxphotos

You need to install the library using the usual pip instruction, shown below. Then create a Notebook and start exploring this library.
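pip install osxphotos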

The code below creates a dump of the Photos library (on the current computer) and saves it in a Pandas DataFrame.

import osxphotos
import pyforest
import os

# magic line to create a dump of the library information
slist = !osxphotos dump

# each line of the dump is comma-separated; skip the header line
list_photos = []
for i in range(1, len(slist)):
    list_photos.append(slist[i].split(','))

pd.DataFrame(list_photos).to_csv('inventory.csv', header=None)
photos = pd.read_csv('inventory.csv')
photos.shape
4 random rows of the DataFrame created with the dump.

You can now manipulate that DataFrame to search, sort, and extract useful information about your library. You have no limits!
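For instance, two quick, generic explorations with Pandas (the exact column layout depends on the osxphotos version, so adjust as needed):

photos.head()                   # peek at the first rows
photos.describe(include='all')  # quick summary of every column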

Export photos and fit a clustering estimator

First, you need to import some libraries.

# importing useful libraries
import requests
import os
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import osxphotos
import pyforest
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from shutil import copyfile
from tqdm import tqdm

When converting a photo (image) to a vector, we must crop and shrink the image to reduce the data volume. The vector for one image is [X (pixels), Y (pixels), Z (number of channels)]. I decided to work with images of 200x200 pixels and, as the images are in colour, Z is 3 for RGB, which gives 200 × 200 × 3 = 120,000 values per image.

The function below converts a file into a cropped image that can be turned into a vector, appended to an array, and loaded into the estimator’s fit() function (and predict() as well).

def center_crop(image_path, size):
    # resize slightly above the target size, then crop a centered
    # square of (size x size) pixels
    try:
        img = Image.open(image_path)
        img = img.resize((size + 1, size + 1))
        x_center = img.width / 2
        y_center = img.height / 2
        size = size / 2
        cr = img.crop((x_center - size, y_center - size, x_center + size, y_center + size))
    except Exception:
        print('Unable to open', image_path)
        cr = None
    return cr
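A quick sanity check of the helper; the file path below is just a hypothetical example:

img = center_crop("bibliotek/2021/IMG_0001.jpg", 200)  # hypothetical path
print(np.array(img).shape)  # expected: (200, 200, 3) for an RGB photo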

Then, another magic command exports all photos from the Photos library to the hard drive. I decided to take only photos that are neither shared nor hidden in the Photos software, and to store them in a folder named bibliotek, with a folder structure based on each photo’s creation year.

!osxphotos export --not-shared --not-hidden --from-date "2000-01-01T12:00:00" --jpeg-ext jpg --only-photos ./bibliotek --convert-to-jpeg --directory "{created.year}"
#the job can take several minutes depending on the size of the library
#and the processor of your computer
photos_by_year = {}
for y in range(2000, 2023):
    try:
        photos_by_year[str(y)] = os.listdir(f"bibliotek/{y}")
    except FileNotFoundError:
        continue

Then, I created a dictionary of my photos.

Based on this dictionary, I created a list of all files, which will be used to crop the images and then feed the estimator.

all_photos = []
# create the dataset with all photos
for key in photos_by_year:
    list_of_year = photos_by_year[key]
    for i in list_of_year:
        # skip movies, edited versions and duplicates
        if 'mov' not in i and 'edited' not in i and '(1)' not in i:
            all_photos.append(f"bibliotek/{key}/{i}")

train_list = all_photos

To feed the estimator, I must have a vector for each of my photos following the template [X, Y, Z]. This is created by the block below.

l = len(train_list)

# decide the crop size of the photos:
# the larger the value, the more pixels the estimator will receive
pixels = 200

train_data = np.zeros((l, pixels * pixels * 3))
i = 0
for key in tqdm(train_list):
    image_name = key
    image_path = f"{image_name}"
    try:
        crp_img = center_crop(image_path, pixels)
        crp_arr = np.array(crp_img).reshape(-1)
        train_data[i] = crp_arr
    except ValueError:
        print(f"unable to open {image_name} — ValueError")
        continue
    i = i + 1

Cool, I now have a vector for each of my photos, so I can fit an estimator. I chose to work with KMeans from Scikit-Learn and arbitrarily picked 35 as n_clusters (the number of categories). The algorithm scans all the photo vectors and classifies each one into one of the 35 categories.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=35, random_state=0).fit(train_data)
predict = kmeans.predict(train_data)

# display the distribution of images by category
pd.DataFrame(predict).value_counts()
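As a side note, n_clusters=35 is an arbitrary choice. A common way to sanity-check it is the “elbow” heuristic: compare the model inertia for a few cluster counts and look for the point where the curve flattens. The sketch below is an illustration, not part of the original pipeline.

# compare K-Means inertia for several cluster counts
for k in (10, 20, 35, 50):
    km = KMeans(n_clusters=k, random_state=0).fit(train_data)
    print(k, km.inertia_)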

When the job completed, I copied the images from the original bibliotek folder into a cluster folder organised by my 35 categories. This allows me to print random images by category, or do whatever else I want with them.

# create folders to store the classified images
for c in range(1, 36):
    new_folder = f"cluster/{c}"
    os.mkdir(new_folder)

idx = 0
for file in tqdm(train_list):
    filename = file.split('/')[2]
    dest_file = f'cluster/{predict[idx]+1}/{filename}'
    copyfile(file, dest_file)
    idx += 1
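For example, here is a small sketch to display a few random photos from one cluster; the cluster number is an arbitrary pick:

import random

cluster_id = 7  # arbitrary pick between 1 and 35
files = os.listdir(f"cluster/{cluster_id}")
for f in random.sample(files, min(4, len(files))):
    Image.open(f"cluster/{cluster_id}/{f}").show()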

Plot the library vectors and predictions

Finally, just for the fun, I plotted the library vectors after applying a dimensionality reduction (n_components=2) to get a tuple of [x, y] values for each photo.

# standardize the features
sc = StandardScaler()
X_train_std = sc.fit_transform(train_data)

# project the photos onto the first two principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_

# split the components into x and y coordinates
x_array = X_train_pca[:, 0]
y_array = X_train_pca[:, 1]

# plot the result, coloured by K-Means cluster
sns.scatterplot(x=x_array, y=y_array, hue=predict, palette="deep")
Library Cluster after Dimensionality reduction
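If you want to keep the chart, a standard Matplotlib call can save it to disk (the file name is my own choice):

# optionally save the scatter chart next to the notebook
plt.savefig("library_clusters.png", dpi=150, bbox_inches="tight")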

Conclusion

Never stop exploring the web for new libraries. With a bit of luck you will find one that gives you a lot of fun :-)

  • osxphotos has about ten commands (dump, export, …) and a lot of options to play with; just explore
  • osxphotos is read-only, so there is no danger for your library; a few options of the “export” command can create albums, but the original photos won’t be edited
  • I didn’t use any NLP techniques, but they could be useful to deal with all the comments, names and other text information related to the photos
  • why not create a backup script to dump the photos to an SSD drive and keep them safe from any program (the export command can convert all formats to full-quality JPEG)
  • you must have some free space on your hard drive if you want to download and run the script several times (I did not implement a cleanup of the folders at the beginning of the Notebook, but I should have!)
  • you can go much deeper with the K-Means estimator and tune its hyperparameters (for a next session)


Daryl Felix

Passionate programmer with more than 30 years of experience, from my first COBOL programs to Python.