Recreating “The Clock” with Machine Learning and Web Scraping

“The Clock” is a 2010 art installation by Christian Marclay. It features over 12,000 individual scenes from movies and TV shows, each featuring a clock, and runs for 24 hours. The footage of clocks in films is cut together sequentially, following the current time of day while the film is shown.

The film itself becomes a clock, allowing the viewer to keep track of the time by watching the film.

When I first saw The Clock at the Tate Modern in London, I was blown away. How was it made? Did somebody really sit through thousands of movies, waiting for the times when clocks were shown? Or did someone use recent advances in computer vision to sift through thousands of movies instead?

In today’s post, we’ll try to recreate “The Clock” using videos found on the web and machine learning. It’s the first part of a three-part series about how to build and deploy machine learning APIs on Kubernetes.

Along the way, I’ll try to describe the thought process behind how I approach a problem like this, and how to deal with the issues that come up during development.

We’ll also learn a bit about Kubernetes, scraping YouTube, and what real world machine learning APIs might look like.

How “The Clock” Was Originally Built

The Movies of The Clock

When I got back home, I found out Christian Marclay originally used clips from over a thousand films, and a team of six assistants, to build his supercut. They spent three years assembling the pieces into the finished film.

The final result is hypnotizing. There’s something about the pop culture references, seeing and spotting movies you know, along with the time within them. It points to the circles of time within our own lives, spent watching other worlds in movies.

But just a few years after the team painstakingly built it by hand, deep learning techniques like YOLO now allow us to automatically detect when clocks appear in images and videos.

By finding the timepoints featuring clocks, we can then write a basic analog clock reading algorithm, and start to build up our own catalog of times within public videos.

Of course, I don’t have access to the video catalog Christian Marclay did, or a budget of hundreds of thousands of dollars to pay for years of work.

Instead, I’ve got access to a home computer, with a decent GPU for deep learning inference, and the ability to scrape YouTube. So let’s look at how this might be accomplished with these tools instead.

Step One: Gather Input Videos

In the early 2000s, the internet was a much more open place. Two students were able to scrape the entire web by borrowing their university’s servers for a while, and extract a graph of all the links between sites.

Student Search Engine

2019 is an entirely different story. Large platforms go out of their way to hide content from people who would want to analyze their data. Places like Instagram, Facebook, and YouTube are all purposefully difficult to scrape, and would prefer it if you didn’t take up their bandwidth scraping their content with tons of spiders.

So we need to be a bit more careful about how we grab our input videos from the web. For YouTube, the easiest approach I found was to run an instance of Firefox in a Docker container, load the site itself, and scroll down through the results.

This is because YouTube nowadays only loads more results as you scroll the page.

The code to do this is relatively straightforward. In our case, we’ll use Splinter to open and control our web browser. From there, we’ll scroll the page manually using JavaScript. We’ll gather all the YouTube links with BeautifulSoup, and then use pytube to grab the actual video files as mp4s and webms.

The code to do that looks like this:

from bs4 import BeautifulSoup as bs
from pytube import YouTube
from splinter import Browser

import time

browser = Browser('firefox', headless=False)  # set headless to False to see browser
browser.visit('https://www.youtube.com/results?search_query=clocks')  # change query to your search

num_pages = 5
# sleeps to wait for page loads
time.sleep(2)
for i in range(num_pages):
    time.sleep(5.0)

    browser.execute_script("window.scrollBy(0, window.innerHeight * 2);")

page = browser.html.encode()
soup = bs(page, 'html.parser')
vids = soup.find_all('a', attrs={'class': 'yt-simple-endpoint style-scope ytd-video-renderer'})
browser.quit()

columns = ['id', 'url', 'title']
videolist = []
for v in vids:
    try:
        tmp = {'id': v['href'].split('?v=')[1],
               'url': 'https://www.youtube.com' + v['href'],
               'title': v['title']}
        videolist.append(tmp)
    except (KeyError, IndexError):
        # skip links without an href, a title, or a ?v= video id
        continue

for item in videolist:
    tube = YouTube(item['url'])
    filepath = tube.streams.first().download()
    print(filepath)

But with this script, we don’t yet have a way to take in arbitrary search terms, or to scroll an arbitrary number of pages.

So to do that, I converted this script into a command line process. With the basic command line version in place, I then started thinking about how to keep all my code working together.

And so I created a basic web app in Flask, allowing me to call subprocesses and giving me a simple user interface while I built out the rest of my tools.
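A minimal sketch of that wrapper is below. The route, the scrape_videos.py script name, and its --query/--pages flags are my own placeholders for the command line version of the scraper described above, not a fixed interface:

# app.py - a minimal sketch of the scraper interface
# scrape_videos.py, --query, and --pages are hypothetical names for the
# command line version of the scraping script shown above
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/scrape', methods=['POST'])
def scrape():
    params = request.get_json()
    # kick off the scraper as a background shell process
    subprocess.Popen(['python3', 'scrape_videos.py',
                      '--query', params['query'],
                      '--pages', str(params.get('pages', 5))])
    return jsonify({'message': 'scrape started'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)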

Video Search Interface

Once we’ve got an interface, and a way to grab our videos, it’s time to start thinking about how we’ll add and call our video inference process.

Running Inference on Our Dataset with PyTorch and YOLOv3

Ideally, we’d have a pretrained model that can detect clocks in images, and tell us their location, along with the time they’re showing.

Since there are (at least) two different types of clocks, analog and digital, the model would need to be able to tell the two apart, and come back with a time for each.

But as of right now, I’m not aware of a pretrained model to do all three that’s publicly available. So we’ll need to see how close we can get to that ideal.

There is the COCO dataset, along with a few models that have been trained to detect the objects within it. Luckily, clocks are one of the classes in that dataset, allowing us to get both detection and location of clocks in one pass.

As mentioned before, I’ve decided to use the YOLO model to first detect our clocks. We can run it on a video by extracting frames one at a time and running inference on each image as the video plays through.
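As a rough sketch, that frame-by-frame loop might look like the following. Here detect_objects() stands in for whatever detector you load (YOLOv3 in our case) and is assumed to return COCO class names; this shows the shape of the loop rather than the exact pytorch-yolo-v3 API:

import cv2

def find_clock_frames(video_path, detect_objects, sample_every=5):
    """Run a detector over a video and return timestamps (in seconds)
    of frames containing clocks. detect_objects(frame) is assumed to
    return a list of (class_name, confidence, box) tuples."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if fps is unreadable
    clock_times = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # only run inference on every Nth frame to save time
        if frame_idx % sample_every == 0:
            detections = detect_objects(frame)
            if any(name == 'clock' for name, conf, box in detections):
                clock_times.append(frame_idx / fps)
        frame_idx += 1
    cap.release()
    return clock_times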

But because we’ll be extracting and running inference on video, it makes the most sense to have hardware accelerated inference wherever possible. In my case, I prefer to use my GPU, as it can give a massive speedup compared to CPU inference and video decoding / encoding.

Building an NVIDIA Docker Image for Hardware Acceleration

NVIDIA GPU Cloud

Luckily, NVIDIA has started releasing prebuilt Docker images for deep learning.

We can use a base image with PyTorch already installed and hardware acceleration built in. The only remaining things to do are to install OpenCV (for later attempts at reading clock times), along with a hardware accelerated ffmpeg (to decode and re-encode the videos we’ve scraped).

Writing a Dockerfile for this is relatively straightforward. It’s mostly installing packages and building our dependencies from source:

FROM nvcr.io/nvidia/pytorch:19.02-py3
RUN apt-get update && apt-get -y install autoconf automake build-essential libass-dev libtool  pkg-config texinfo zlib1g-dev cmake mercurial libjpeg-dev libpng-dev libtiff-dev libavcodec-dev libavformat-dev libswscale-dev libv4l-dev  libxvidcore-dev libx264-dev libx265-dev libnuma-dev libatlas-base-dev libopus-dev libvpx-dev gfortran unzip 
RUN git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
RUN cd nv-codec-headers && make && make install
RUN git clone https://git.ffmpeg.org/ffmpeg.git
RUN cd ffmpeg &&  ./configure  --enable-shared --disable-static --enable-cuda --enable-cuvid --enable-libnpp --enable-libvpx --enable-libopus --enable-libx264  --enable-gpl --enable-pic --enable-libass --enable-nvenc --enable-nonfree --enable-libx265 --extra-cflags="-I/usr/local/cuda/include/ -fPIC" --extra-ldflags="-L/usr/local/cuda/lib64/ -Wl,-Bsymbolic"  && make -j4  && make install
RUN wget -O opencv.zip https://github.com/opencv/opencv/archive/4.0.0.zip && unzip opencv.zip && mv opencv-4.0.0 opencv
RUN wget -O opencv_contrib.zip https://github.com/opencv/opencv_contrib/archive/4.0.0.zip && unzip opencv_contrib.zip && mv opencv_contrib-4.0.0 opencv_contrib
RUN cd opencv && mkdir build && cd build && cmake -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr/local \
    -D INSTALL_PYTHON_EXAMPLES=ON \
    -D INSTALL_C_EXAMPLES=OFF \
    -D OPENCV_ENABLE_NONFREE=ON \
    -D OPENCV_EXTRA_MODULES_PATH=/workspace/opencv_contrib/modules \
    -D OPENCV_SKIP_PYTHON_LOADER=ON \
    -D PYTHON_LIBRARY=/opt/conda/lib/python3.6 \
    -D PYTHON_EXECUTABLE=/opt/conda/bin/python3 \
    -D PYTHON2_EXECUTABLE=/usr/bin/python2 \
    -D WITH_CUDA=ON \
    -D ENABLE_FAST_MATH=1 \
    -D CUDA_FAST_MATH=1 \
    -D WITH_CUBLAS=1 \
    -D WITH_FFMPEG=ON \
    -D BUILD_EXAMPLES=ON .. && make -j8 && make install
RUN cd /workspace/opencv/build/modules/python3 && make && make install
RUN ln -s /usr/local/python/python-3.6/cv2.cpython-36m-x86_64-linux-gnu.so /opt/conda/lib/python3.6/site-packages/cv2.so
RUN pip install pandas jupyter ipywidgets
RUN jupyter nbextension enable --py widgetsnbextension
RUN git clone https://github.com/ayooshkathuria/pytorch-yolo-v3 && cd pytorch-yolo-v3 && wget https://pjreddie.com/media/files/yolov3.weights

Reading the above Dockerfile, we’re first installing our build dependencies, then adding the NVIDIA headers for ffmpeg hardware acceleration.

We then build ffmpeg and OpenCV with that hardware acceleration linked in. Finally, we install a few extra Python packages and Jupyter, and clone pytorch-yolo-v3 along with the pretrained YOLOv3 weights.
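As a usage example, building and running the resulting image is the usual Docker workflow. The image name, shared volume, and published port below are placeholders, and depending on your Docker and nvidia-docker versions you may need --runtime=nvidia instead of --gpus all:

$ docker build -t clock-inference .
$ docker run --gpus all -v /downloads:/downloads -p 5000:5000 clock-inference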

With this, we’re finally ready to do our accelerated encoding, inference, and decoding from within a repeatable container! Now let’s turn that into an API.

Building a Deep Learning API in Flask

Recreating "The Clock" with Machine Learning

With our containers in order, we can finally start thinking about how our deep learning API will look.

If we’re choosing to orchestrate these containers with Kubernetes, we have a few options here, most notably Kubeflow or something like the TensorRT Inference Server. We’d even get benefits like monitoring and work graphs built in.

But neither of these is particularly well suited to longer running inference like what we have with our videos. They assume a gRPC or HTTP call that can return in a reasonable time, so we’d have to send requests across image by image. And for now, setting them up adds even more complexity.

So instead, for now I’ve fallen back to plain old Flask and shell processes. This isn’t a scalable approach, but it will do to get our first prototype system built. Later, once we’ve figured out the proper flow, we can go back and hook up TensorRT or Kubeflow.

For simplicity’s sake, we’ll share the volume of the downloaded videos with this container, and pass back and forth web requests with filenames to follow the graph of computation:

# In our Flask app:
# (assumes the usual imports: os, subprocess, and
#  from flask import request, jsonify; tracer and USER_KEEP
#  come from the tracing setup, e.g. Datadog's ddtrace)

@app.route('/video-inference', methods=['POST'])
def video_inference():
    params = request.get_json()
    os.chdir('/workspace/pytorch-yolo-v3/')
    # grab the current trace so the inference process can continue it
    span = tracer.current_span()
    app.logger.info('Span ID and trace ID: %s %s' % (span.context.span_id, span.context.trace_id))

    # kick off inference as a background shell process and return immediately
    subprocess.Popen(['python3',
                     '/workspace/pytorch-yolo-v3/video-to-json.py',
                     '--video',
                     params['filename'],
                     '--post-url',
                     params['postback_url'],
                     '--trace-id',
                     str(span.context.trace_id),
                     '--parent-id',
                     str(span.context.span_id),
                     '--sampling-priority',
                     str(USER_KEEP)])

    return jsonify({'message': 'received'})

Our Python shell process takes in the video filename, and then posts the JSON inference for each frame back to our original Scraper API. From there, we can save the inference, and check for snippets featuring clocks.
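To make that postback concrete, here’s roughly the kind of payload I mean. The endpoint URL and the field names ('frames', 'clocks', and so on) are placeholders for illustration, not the actual schema of the Scraper API:

import requests

# hypothetical postback endpoint on the Scraper API (passed in as --post-url)
postback_url = 'http://scraperapp:5000/inference-results'

# example of the kind of per-frame payload the inference script might send back
results = {
    'filename': '/downloads/some_video.mp4',
    'frames': [
        {'frame': 120, 'time': 4.0, 'clocks': [[220, 80, 410, 270]]},  # one clock bounding box
        {'frame': 125, 'time': 4.17, 'clocks': []},                    # no clocks in this frame
    ],
}

resp = requests.post(postback_url, json=results)
resp.raise_for_status()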

Extracting and Joining Clock Snippets

Once we’ve got a JSON list of all the video frames, and whether or not they contain clocks, the only thing remaining is to slice the input videos into snippets, allowing us to rejoin the video later.

ffmpeg makes this easy enough, and with yet another subprocess call, we can split them up into shorter videos, using hardware decoding and encoding flags:

# different decoder flags for webm vs mp4
if inference['filename'][-4:] == 'webm':
    command = f"ffmpeg -hwaccel cuvid -c:v vp8_cuvid -i '{inference['filename']}' -ss {inference['clock_segments'][i]['start']} -t {length} -f mp4 -filter:v scale_npp=w=1280:h=720:format=yuv420p:interp_algo=lanczos -c:v h264_nvenc -preset slow -acodec aac -r 30 /downloads/slices/{filename}.mp4 -hide_banner"
else:
    command = f"ffmpeg -hwaccel cuvid -c:v h264_cuvid -i '{inference['filename']}' -ss {inference['clock_segments'][i]['start']} -t {length} -f mp4 -filter:v scale_npp=w=1280:h=720:format=yuv420p:interp_algo=lanczos -c:v h264_nvenc -preset slow -acodec aac -r 30 /downloads/slices/{filename}.mp4 -hide_banner"

subprocess.call(shlex.split(command))
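As for the clock_segments themselves, one straightforward way to build them is to group the timestamps of clock frames into contiguous runs, and turn each run into a start time and length in seconds. This isn’t necessarily how splitter.py does it, but a minimal sketch might look like:

def frames_to_segments(clock_times, min_length=1.0, max_gap=0.5):
    """Group clock timestamps (in seconds) into contiguous segments.
    Frames closer together than max_gap get merged into one run, and
    runs shorter than min_length are dropped."""
    segments = []
    start = prev = None
    for t in sorted(clock_times):
        if start is None:
            start = prev = t
        elif t - prev <= max_gap:
            prev = t  # extend the current run
        else:
            # close out the previous run and start a new one
            if prev - start >= min_length:
                segments.append({'start': start, 'length': prev - start})
            start = prev = t
    if start is not None and prev - start >= min_length:
        segments.append({'start': start, 'length': prev - start})
    return segments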

The rest of the code lives within the splitter.py file, including calls to the Scraper API to grab inferences and to post the filenames of finished snippets for tracking.

With a bunch of snippets in the same filetype, we can finally use ffmpeg’s concat demuxer to join our videos into one file, passing in a list of files to join together.

The code to generate that list of files looks like this:

import subprocess
import requests
import random
import os

scraperURL = 'http://' + os.environ['SCRAPERAPP_SERVICE_HOST'] + ':' + os.environ['SCRAPERAPP_SERVICE_PORT']

req = requests.get(f'{scraperURL}/snippets')
snippets = req.json()['snippets']

print(snippets)

def getUniqueVideo(vidlist, snippets):
    while True:
        vidCandidate = random.choice(snippets)
        if vidCandidate['video'] not in vidlist:
            return vidCandidate

with open('/downloads/slices/filelist.txt', 'w') as f:
    videoList = []

    # need at least 50 videos or we loop forever :)
    for i in range(50):
        video = getUniqueVideo(videoList, snippets)
        videoList.append(video['video'])
        f.write(f"file {video['filename']}\n")

This outputs a filelist.txt file, and we can then create our video with a command like this:

$ ffmpeg -f concat -i filelist.txt -c copy out.mp4

With this, we can finally see the results of our work, even before adding the ability to read the times featured in the videos:

Going Further

If you look at the Scraper and Inference API repos, you’ll see that they both mention Kubernetes, and indeed, both have accompanying YAML files to run each service.

In the next post, we’ll see how to deploy and monitor our machine learning APIs in Kubernetes. If you’d like to be notified when that goes live, I suggest subscribing below.

If you’re still learning Python and Pygame, or you want a visual introduction to programming, check out my book, Make Art with Python. The first three chapters are free and online here.

Finally, feel free to share this post with your friends. It helps me to continue making these tutorials.
