Using AI to find Fencing Courses in London

Our club (Morley Blades) gets a lot of queries from people who want to learn to fence, but we unfortunately have limited space and can only run so many courses. So I wanted to put together a list of places they could look for alternatives. Having been looking for a case to explore AI-assisted scraping, and wanting to set up an automated pipeline for generating up-to-date information, I opened a Jupyter notebook and went with the following flow.

💡
I do acknowledge that for the $20 in AI costs and 6 hours of time this could have been done more cheaply, but this is mostly an R&D piece.

Fetching the initial data

Unless you're mapping the internet every day, you're going to need a starting set of data to work with. Ours came from a list available on the London Regional Fencing site.

We have a list of sites, so the first step is to check if they're online. For each one that responds cheerfully to a quick GET request, we create a directory with the name of the site.

Then, for each directory we now have, we GET the HTML, parse out all the URLs, then go and check those, and rinse and repeat until we have a neat list. Finally, we put all the URLs into a pages.json file inside each directory.
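
A rough sketch of this crawl step, using requests and BeautifulSoup (the club URL and output path here are placeholders, not the real list):

import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl_site(start_url, max_pages=50):
    """Breadth-first crawl of a single club site, keeping only text/html pages."""
    seen, queue, pages = set(), [start_url], []
    domain = urlparse(start_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        # Only keep text/html -- skip images, videos, PDFs, etc.
        if "text/html" not in response.headers.get("content-type", ""):
            continue
        pages.append(url)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).netloc == domain:
                queue.append(absolute.split("#")[0])
    return pages

# One pages.json per site directory
with open("example-club/pages.json", "w") as f:
    json.dump(crawl_site("https://example-fencing-club.co.uk"), f, indent=2)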

Problems
  • We had to check the content-type response header, given we're only interested in text/html and not images, videos, etc.
  • None of the sites we targeted required JS, but if they had we'd have needed to use a browser to visit them.

Scrape the sites

I once listened to an accessibility talk at TfL on how to make websites. In it they shared the most obvious and important lesson of web development: please stop complicating websites. Having explored just 26 websites for this project, I wholeheartedly agree more than ever.

Initially I asked GPT-4 to get some timetabling information from the HTML alone, which it completely failed at. It couldn't wrap its head around the absurd position:absolute; left:-20 hacks that coders weave together.

So, instead we have to put in an expensive step. Using Playwright to drive a Chromium instance, we visit each of the pages we listed, then render the whole page to a PNG. The result: a directory full of screenshots of web pages.
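
A sketch of that step using Playwright's sync API; the viewport size, image-blocking pattern and cookie-banner selector are illustrative rather than the exact code I ran:

from pathlib import Path
from playwright.sync_api import sync_playwright

def screenshot_pages(urls, out_dir):
    """Render each crawled page in headless Chromium and save a full-page PNG."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        # Block images -- they mostly confused the model (see Problems below)
        page.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda route: route.abort())
        for i, url in enumerate(urls):
            page.goto(url, wait_until="networkidle")
            # Best-effort removal of cookie banners and other overlays
            page.evaluate(
                "document.querySelectorAll('[class*=cookie], [id*=cookie]')"
                ".forEach(el => el.remove())"
            )
            page.screenshot(path=f"{out_dir}/{i:03d}.png", full_page=True)
        browser.close()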

Problems
  • I had to do a lot of hackery, disabling every cookie banner, every newsletter box and every other interruption web devs are forced to implement.
  • I disabled images, because 9 times out of 10 they were of people, or were backgrounds that confused the AI.
  • Multiple URLs from the crawl resolved to the same page, whether due to query params or multiple routes, so I had to deduplicate the images afterwards (see the sketch after this list).
  • Scrolling is complicated math when there are headers to worry about, so I tried to remove all headers and footers. Again, semantic HTML would have been so helpful here if any of the sites followed it.
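
The deduplication mentioned above can be as simple as hashing the rendered PNGs; a minimal version (near-duplicates would really want a perceptual hash) looks like:

import hashlib
from pathlib import Path

def dedupe_screenshots(directory):
    """Drop byte-identical screenshots that came from different URLs."""
    seen = {}
    for path in sorted(Path(directory).glob("*.png")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()  # same rendered page reached via different query params/routes
        else:
            seen[digest] = path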

Describing the scraped pages

Next step was to fire each image off to GPT-4 with a simple enough prompt. This was by far the most expensive process, given GPT-Vision is still mostly in beta. The results, however, were incredible, offering appropriately detailed information with seemingly no noticeable size limit.

You are an assistant designed to look at screenshots of a webpage and summarize as much of the content as you can. Skip any wording about user privacy, cookies or staying updated via communication methods like email. Do not describe the layout of the page sections or navigation elements, just extract facts. Here is a screenshot of a webpage.

Followed by a base64 encoding of the image. The response is then dumped into a .png.txt file.
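
The call itself looked roughly like the sketch below, built on the OpenAI Python client; the vision model name is an assumption (whichever preview model is current), and the paths are illustrative:

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

DESCRIBE_PROMPT = (
    "You are an assistant designed to look at screenshots of a webpage and "
    "summarize as much of the content as you can. ..."  # the full prompt shown above
)

def describe_screenshot(png_path):
    """Ask the vision model for a plain-text description of one screenshot."""
    b64 = base64.b64encode(Path(png_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": DESCRIBE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    description = response.choices[0].message.content
    Path(str(png_path) + ".txt").write_text(description)  # e.g. 001.png.txt
    return description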

Problems
  • I did have to add quite a lot to the prompt to stop it directly describing the site with all its layout, footers, etc.

Generate embeddings for the text files

So now we have essentially described the website on a page-by-page basis. But which parts of it are relevant? Most of it is about things like championships, privacy policies and club news, and with OpenAI charging by the token we have to be cautious about what we use. Plus I found that GPT can get distracted by lots of text quite easily.

Embeddings can be used to measure the similarity between phrases. So the embedding of a page that describes courses, lessons, or instructors will be measurably closer to the embedding of "beginner courses" than one that talks about competitions, news or privacy policies.

So we grab ChromaDB (an embedding database) and then run all our image descriptions through it.

Problems
  • I ended up using the OpenAI API directly to do the embedding manually, given this let me both set the model and add some basic caching (see the sketch below).
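
A sketch of that manual embedding step with a naive on-disk cache; the embedding model, paths and collection name are assumptions for illustration:

import hashlib
import json
from pathlib import Path

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="chroma")
collection = chroma.get_or_create_collection("page-descriptions")

CACHE_DIR = Path("embedding-cache")
CACHE_DIR.mkdir(exist_ok=True)

def embed(text, model="text-embedding-3-small"):
    """Embed via the OpenAI API, caching results on disk by content hash."""
    key = hashlib.sha256((model + text).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    vector = client.embeddings.create(model=model, input=text).data[0].embedding
    cache_file.write_text(json.dumps(vector))
    return vector

# Index every .png.txt description with its precomputed embedding
for txt in Path("screenshots").rglob("*.png.txt"):
    text = txt.read_text()
    collection.add(ids=[str(txt)], documents=[text], embeddings=[embed(text)])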

Generate the JSON data

Finally getting to the fun part: I use a trick I discovered whereby you can describe the output you want as Python classes using Pydantic, as follows:

from typing import Optional

from pydantic import BaseModel, Field

class CourseSession(BaseModel):
    """
    Represents an upcoming planned session of a fencing course
    """

    # All fields default to None so missing information doesn't fail validation
    year_start: Optional[str] = Field(
        default=None,
        description="The year the course starts",
    )

    month_start: Optional[str] = Field(
        default=None,
        description="The month the course starts",
    )

    day_of_week: Optional[str] = Field(
        default=None,
        description="The day of the week the course runs on",
    )

    time_start: Optional[str] = Field(
        default=None,
        description="The start time of the course",
    )

    time_end: Optional[str] = Field(
        default=None,
        description="The end time of the course",
    )
This can be converted into our desired JSON Schema with a single line: json.dumps(CourseSession.model_json_schema()).

Next we want to get the context, that is, the data the AI should consider when generating the output. So we query ChromaDB with a suitable query like "beginner course sessions", and get back the five most relevant matches from our webpage screenshot descriptions.
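
Reusing the collection and embed helper from the sketch above, the context lookup is roughly:

# Pull the five most relevant page descriptions to use as context
results = collection.query(
    query_embeddings=[embed("beginner course sessions")],
    n_results=5,
)
context = "\n\n".join(results["documents"][0])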

All this comes together to build the prompt:

system_string = f"You are tasked with extracting information from a website regarding \"{search_topic}\". You will be provided with relevant information to extract. Ensure that your output is in JSON format, adhering to the following schema:\n\n```\n{schema}\n```\n\n. You MUST not output a JSON schema, you must output JSON data. The JSON output should comprehensively outline the details about: {search_topic}."

A run of this costs roughly $0.02 and spits out mostly valid JSON data.
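
Pieced together, the generation call looks roughly like the following; system_string is the f-string above, context comes from the ChromaDB query, and the model name and JSON-mode flag are assumptions on my part:

import json

# system_string and context come from the steps above; schema and
# search_topic are interpolated into system_string before this point
completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",             # assumed; any JSON-mode capable model works
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": system_string},
        {"role": "user", "content": context},
    ],
)
course_data = json.loads(completion.choices[0].message.content)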

Problems
  • A serious fault here is that occasionally it'll output the schema itself, which I tried to prompt around but never truly solved.
  • It's quite "lazy" when matching the schema, so it'd merge course sessions over multiple days into one, with a day_of_week of "Monday, Tuesday". So these fields have to be set to an enum, which it'll mostly follow (see the sketch after this list).
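
As an illustration, constraining day_of_week with an enum on the earlier model looks something like this, and the generated schema then lists the allowed values explicitly:

from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class DayOfWeek(str, Enum):
    MONDAY = "Monday"
    TUESDAY = "Tuesday"
    WEDNESDAY = "Wednesday"
    THURSDAY = "Thursday"
    FRIDAY = "Friday"
    SATURDAY = "Saturday"
    SUNDAY = "Sunday"

class CourseSession(BaseModel):
    # ... other fields as before ...
    day_of_week: Optional[DayOfWeek] = Field(
        default=None,
        description="The day of the week the course runs on",
    )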

Results

After all this, I did have a JSON file with the AI's best attempt at filling it in. I put together a quick website that you can view here to navigate the data.

In putting the site together, I did however notice quite a few bugs in the data. The AI is... frustratingly human in these moments? Each time I found an issue, 9 times out of 10 it'd be the original website that was confusing, and GPT just struggled the same way I likely would have.

Take the following example of one club's pricing model. It's lucky I suppose that GPT isn't colour-blind like myself.

I also discovered that my schema was too loose. Careful use of descriptions and restrictions is required to force it into shape, though it should be noted that all of this increases the token cost.