For those that mountain bike, hike trails, climb rocks, ride (pilot? steer?...I’m definitely not an equestrian) horse trails… you’ve been in my shoes.
“I’m new to the area…where do I start?”
“If only I knew a local who could pick a trail for me.”
“Is this trail over my head/ability/skillset?”
So what do we do?
Well, fortunately, these days there are comprehensive apps available to subscribers that can aid us in these choices. We can select trails based on specified attributes and usually be recommended several others that the app ‘thinks’ we’d like as well. But how do these apps accomplish this? And is there a basic tool I could develop that would provide similar results (and some bonus ‘neato’ points from friends)?
Enter, Recommender Engines!
There are a ton of ways to develop a recommender engine (and many types), but for this post, and what is most relevant to my goals, I’m going to discuss a Content-Based Recommender.
So, let’s start with the data!
For this project, I pulled trail statistics from MTBProject.com on 2,000 mountain bike trails throughout Arizona and Utah. My scraper is currently running to pull relevant data for the remainder of the country. But, for this project, we’ll focus on two states.
For each trail pulled, 20 features were used, including elevation, difficulty, location data, trail management group, total climb/descent, dog policy, and others.
My plan was to utilize cosine similarity to determine which trails share the most in common. If you’re interested in the theory behind this, Wikipedia does a good job walking through it.
There are a number of steps that need to be taken before calculating similarity. Primarily… data cleaning. Data scraped from the web will rarely be ready to import into a model. For this case, the data needed a lot of work. There were special characters, extra spaces, inconsistent suffixes, different scales, and a lot of duplicate data. A function was developed and made quick work of these modifications.
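To give a feel for what that cleaning function looked like, here’s a minimal sketch (the column values and regex patterns are illustrative, not my exact production code) that handles special characters, extra spaces, and inconsistent suffixes in one pass:

```python
import pandas as pd

def clean_column(series: pd.Series) -> pd.Series:
    """Strip special characters, collapse extra spaces, and drop
    inconsistent suffixes from a text column (illustrative version)."""
    return (series.astype(str)
                  .str.replace(r"[^\w\s./-]", "", regex=True)   # drop special characters
                  .str.replace(r"\s+", " ", regex=True)         # collapse extra spaces
                  .str.strip()                                  # trim leading/trailing spaces
                  .str.replace(r"\s*(ft|feet)$", "", regex=True))  # normalize suffixes

raw = pd.Series(["1,250 ft ", "  980ft", "1100  feet"])
print(clean_column(raw).tolist())  # → ['1250', '980', '1100']
```

Duplicate rows are a one-liner on top of this with `df.drop_duplicates()`.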
Nulls (missing data) were another issue that needed to be dealt with. Unfortunately, it’s not usually appropriate to fill all missing data with the same value. For categorical columns, ‘Unknown’ was entered for the nulls; ‘Unknown’ was already a field within these categories, and aiming to be conservative, I used it for imputation. Other missing data required a more robust imputation method. For trail-specific stats (total climb/descent, elevations, trail grades, etc.), I used KNN imputation. This information is assumed ‘Missing at Random’: the missing values aren’t systematically different from the observed ones; more likely, the user-generated data just hasn’t been collected yet. Under that assumption, KNN imputation should do a great job replacing missing data. It finds the trails closest to the one with a missing value and fills the null with the average of those neighbors (I used five neighbors in the calculations). Once a few irrelevant columns were removed, I was done with cleaning and ready to move on to preprocessing and modeling!
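The KNN step is available off the shelf in scikit-learn. A toy sketch (the column names and numbers are made up, and the toy frame only has four rows, so I use two neighbors here instead of the five from the real project):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy trail stats with missing values (columns are illustrative)
stats = pd.DataFrame({
    "total_climb":   [1200.0, np.nan, 1150.0, 300.0],
    "total_descent": [1180.0, 1160.0, np.nan, 310.0],
    "elevation_max": [7400.0, 7350.0, 7300.0, 2100.0],
})

# Each null is replaced by the mean of its nearest neighbors,
# measured on the columns that ARE observed for that row
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(stats), columns=stats.columns)
print(filled)
```

`KNNImputer` uses a NaN-aware Euclidean distance, so rows with different missing columns can still be compared.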
Now that our data were cleaned, sorted, and scaled, preprocessing could commence.
Three steps remained. First, the cleaned data would be converted into a 2-D sparse matrix. Once all 40,000 trails are pulled from the web and the categorical columns one-hot encoded, this would leave a lot of “0’s” in my data frame. A sparse matrix does an excellent job compressing all of those zeros and speeding up the processing of the data! Once my matrix was built, I converted it to pairwise distances (this is where the cosine similarity comes into play!). There are several different distance ‘metrics’ that can be used in this function, but I was looking for a value bounded between 0 and 1, and cosine fit the bill.
The resulting matrix was then built into a data frame with each trail acting as both an index label and a column label. With this representation, to find similarities between trails, all we needed to do was reference the column that held its title (trail name, in this case) and look down the rows! Each trail-name-labeled row holds a value between 0 and 1 representing that trail’s distance from the column-named trail, where 0 means most similar.
In the above example, the input returns a sorted list of values where ‘0’ represents a trail most similar to the one entered into the code, and ‘1’ represents a trail most dissimilar.
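A lookup like that boils down to a one-line sort on the distance data frame. A sketch with hypothetical trail names and hand-written distances (not real MTBProject data):

```python
import pandas as pd

# Hypothetical distance DataFrame: trail names as both index and columns,
# 0 = most similar, 1 = most dissimilar
dist_df = pd.DataFrame(
    [[0.00, 0.05, 0.90],
     [0.05, 0.00, 0.85],
     [0.90, 0.85, 0.00]],
    index=["Hangover", "Hiline", "Bentley Loop"],
    columns=["Hangover", "Hiline", "Bentley Loop"],
)

def recommend(trail: str, n: int = 2) -> pd.Series:
    """Return the n trails closest to `trail`, most similar first."""
    return dist_df[trail].drop(trail).sort_values().head(n)

print(recommend("Hangover"))
```

Dropping the trail itself matters: every trail is at distance 0 from itself and would otherwise always top its own list.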
Now that the recommender is working, a user interface can be developed along with a trail dashboard and deployed onto the web for others to utilize! I’ll get more into that in my next post!