Hosted by TiE Seattle, Madrona Venture Labs and KRNL Labs


Fundable Ideas

Real data

Expert Mentors

VC Access

Welcome to the RESOURCES area for the GoVertical ML Startup Workshop! To help you make the most of your time during the event weekend, you'll find key resources and information here to prepare. These include:

  • Machine learning and data science educational material. We encourage all participants to learn about machine learning and data science during the workshop. Several resources are listed below that target Non-Technical, Developer and Data Scientist participant types.

  • Data sets on which to base your ML company idea. Machine-learning algorithms are powered by data! Part of the criteria you'll be judged on is how well you articulate a plan to leverage real data in your solution.

  • Be Prepared! Start thinking through what types of data could power your business and product ideas. Often a combination of multiple, disparate data sets can yield the most ingenious ideas and solutions!

Machine Learning and Data Science Educational Material

The list below contains a wide variety of machine learning educational resources aimed at data scientists, software engineers, and even non-technical business and product people. Take some time to peruse the books, articles and videos that appear most appropriate for your background. The more prepared you are coming into the event, the more you'll get out of it!

"...If data-ism is today's philosophy, this book is its bible..."
Participant type: All

This best-selling book, authored by one of our event's mentors, Joel Grus, shows how many of the most fundamental data science tools and algorithms work by implementing them from scratch.
Participant type: Developer

"...this microsite is intended to help newcomers (both non-technical and technical) begin exploring what's possible with AI"
Participant type: Non-Technical

Blog articles with resources for non-technical folks about ML/AI.
Participant type: Non-Technical

"...This (KDNuggets) post, the first in a series of ML tutorials, aims to make machine learning accessible to anyone willing to learn."
Participant type: Non-Technical, Developer

KDNuggets article by best-selling author Sebastian Raschka. Pointers to videos and other resources for an introductory high-level overview of ML and data science.
Participant type: Non-Technical, Developer

Great curated list of FREE, online and foundational machine learning books
Participant type: Developer, Data Scientist

List of the most popular recent Machine Learning videos worth watching, curated by KDNuggets
Participant type: All

"bare bones take on learning machine learning with Python, a complete course for the quick study hacker with no time (or patience) to spare"
Participant type: Developer, Data Scientist

"Master machine learning by using it on real life applications, even if you’re starting from scratch." Great site for software engineers who want to understand machine learning through code.
Participant type: Developer

A cool and interactive visualization that explains some fundamentals behind machine learning.
Participant type: Non-Technical, Developer

Highly visual YouTube video explaining fundamentals of Deep Learning.
Participant type: All

"This book has been written in layman’s terms as a gentle introduction to data science and its algorithms. Each algorithm has its own dedicated chapter that explains how it works, and shows an example of a real-world application."
Participant type: Non-Technical, Developer

KDNuggets article describing a deep learning approach to extract knowledge from a large amount of data from the recruitment space (see data sets, below)
Participant type: Data Scientist

"...Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data..."
Participant type: Non-Technical, Developer
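
To make the "curve fitting" and "learn from data" parts of that description concrete, here is a minimal sketch of a model that learns a straight line from example points and then predicts a value it has not seen. The data is made up for illustration, and the function name fit_line is hypothetical; real projects would typically use an ML library rather than hand-written code.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of x around its mean; zero means all x values are identical.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    # How x and y vary together around their means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up training data that lies exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)   # "learning" from the data
prediction = slope * 10 + intercept   # predicting for an unseen x = 10
print(slope, intercept, prediction)
```

The "learning" here is nothing more than choosing the slope and intercept that best match the examples; more sophisticated algorithms follow the same pattern with more flexible models.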

Data Sets

Your novel business idea should be grounded in real-world data with plausible machine-learning/analytics on top. We've compiled a list of over 40 data sets from which to gain inspiration. Note that you are not restricted to basing your idea on the data sets below. You may discover other open source data sets that inspire your creativity or you may bring your own proprietary data sets if you wish.

Many of the data sets below are from Kaggle, Crowdflower, Data.World, etc. The advantage of these data sets is that many have been cleaned and normalized and are ready to be explored with ML and data science tools. Note that these data sets are often intended for research purposes only. Be sure to read any associated license agreements to understand whether there are commercial restrictions if you plan to continue using the data after the workshop is over.

We included some sample ideas for each category to show how you might utilize these data sets.

A web site dedicated to providing advanced NFL statistics in a simple-to-use interface
Idea: How can this play-by-play data be used to tell fantasy football players which of their players they should start on a week-by-week basis?

Consolidated draft data for all drafts from 1985 to 2015.
Idea: Build a service assigning expected value to future NFL players based on college statistics, height, weight, speed, draft position, etc.

Contributors were presented a football scenario and asked to note what the best coaching decision would be. (originating page found here)
Idea: How can this data be used to build enticing products that give fantasy sports league players an edge over their competitors?

A complete history of major league baseball stats from 1871 to 2015

25k+ matches, players & teams attributes for European Professional Football
Idea: What soccer player/team attributes actually lead to wins?

NBA Salaries - 1990 to 2016

A variety of crime statistics from major US cities
Idea: Build a model for predicting crime probabilities at given intersections over the course of a night/day, thus allowing police to better position assets and prevent crime.

Homicide Reports, 1980-2014

Civilians shot and killed by on-duty police officers in United States

Archive of U.S. gun violence incidents collected from over 2,000 sources
Idea: Study how crime patterns are correlated with socio-economic trends longitudinally, perhaps utilizing census data as well. Can you predict where there are opportunities for urban renewal and gentrification?

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.
Idea: Design a smart grocery list application that automatically suggests other relevant grocery items depending on what items the user has entered in their list and what time they are planning to go shopping.
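
As one possible starting point for this idea, the sketch below suggests items that were frequently bought together with what is already on the list. Everything here is an assumption for illustration: the orders are made up (not taken from the Instacart data), the function names are hypothetical, and simple co-occurrence counting stands in for whatever model you would actually build. It ignores the time-of-shopping signal entirely.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(orders):
    """Count how often each pair of items appears in the same order."""
    co = defaultdict(Counter)
    for order in orders:
        # set() so an item repeated within one order is counted once.
        for a, b in permutations(set(order), 2):
            co[a][b] += 1
    return co


def suggest(co, shopping_list, k=3):
    """Return up to k items most often bought with items on the list."""
    have = set(shopping_list)
    scores = Counter()
    for item in have:
        for other, count in co.get(item, {}).items():
            if other not in have:
                scores[other] += count
    # Sort by score (highest first), breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Made-up past orders, purely for illustration.
orders = [
    ["milk", "cereal", "bananas"],
    ["milk", "cereal"],
    ["pasta", "tomato sauce", "parmesan"],
    ["pasta", "tomato sauce"],
    ["milk", "bananas", "pasta"],
]

co = build_cooccurrence(orders)
print(suggest(co, ["pasta"], k=2))
print(suggest(co, ["milk", "cereal"], k=1))
```

A real product would likely need to normalize for item popularity (otherwise bananas get suggested to everyone) and to fold in context such as time of day.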

The FDIC is often appointed as receiver for failed banks. This list includes banks which have failed since October 1, 2000
Idea: Build a model that identifies key markers and points to banks that are likely to fail over the next 3 months, thus allowing regulators to better target their resources.

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014
Idea: Can you design a model that learns what a high-quality, specific, helpful review looks like, and give you real-time suggestions as you write reviews?
Idea 2: Can you mine the review data to pull out the three things people most like and the three things people most dislike about various products?

A large collection of Amazon and Yelp reviews, plus Yahoo Answers data.
Idea: Design a review summarizer that summarizes the positive and negative reviews for a product to allow users to quickly understand overall review sentiment from users.

Labeled tweets about multiple brands and products. (originating page found here)

Using 8 years daily news headlines to predict stock market movement

All Ethereum data from the start to May 2017

Detailed purchase records of vehicles licensed in WA state. Includes VIN, purchase data, price, odometer reading, city. - sample only - full set will be provided at the event
Idea: Build a product that predicts the optimal price for a used car. How would you monetize the predictions?

300k medical appointments with 15 variables each, including whether the patient showed up or not.
Idea: Design a product for doctors' offices that predicts whether a patient will show up, or whether there is a particular time slot that they are more likely to show up for. How would you integrate your technology into existing scheduling systems? What would you have to offer to displace a legacy scheduling system?

National Cancer Institute - Cancer Statistics Query Tool

WONDER online databases utilize a rich ad-hoc query system for the analysis of public health data.

Survey on Mental Health in the Tech Workplace in 2014

Predicting doctor attributes from prescription behavior

How inpatient hospital charges can differ among different providers in the US
Idea: A service that allows you to input your medical treatment received during a hospital stay and determine whether the bill you get is within an acceptable range.
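
One simple way to define an "acceptable range" is sketched below, using the interquartile-range rule of thumb (flag anything more than 1.5 IQRs outside the middle half of the data). This is an assumed technique chosen for illustration, not something the data set prescribes; the charges are made up, and the function names are hypothetical.

```python
from statistics import quantiles


def acceptable_range(charges, k=1.5):
    """Return (low, high) bounds using the IQR rule of thumb."""
    q1, _, q3 = quantiles(charges, n=4)  # quartiles of the charges
    iqr = q3 - q1
    # A bill cannot be negative, so clamp the lower bound at zero.
    return max(0.0, q1 - k * iqr), q3 + k * iqr


def is_bill_acceptable(bill, charges, k=1.5):
    """True if the bill falls inside the range implied by past charges."""
    low, high = acceptable_range(charges, k)
    return low <= bill <= high


# Made-up charges (in dollars) from different providers for one procedure.
charges = [9800, 10200, 10500, 11000, 11400, 12000, 12500, 13100]

print(acceptable_range(charges))
print(is_bill_acceptable(11000, charges))
print(is_bill_acceptable(25000, charges))
```

A real service would compare against charges for the same procedure and region, and would probably want a model that accounts for complications and length of stay.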

United States Mortality Rates by County 1980-2014

What does your exercise pattern fall into?
Idea: Design a product that analyzes not just how much people do a particular activity, but how well they do it. Are they consistent? Are they in danger of hurting themselves?
Idea 2: Is there a way to connect exercise patterns with improved health outcomes or longevity? Can you design the most efficient workout for someone given a desired goal such as weight loss, lower blood pressure, or lower cholesterol?

This data set contains over fifteen thousand sentiment-scored images. Originating page can be found here
Idea: Can you train a model on this image sentiment dataset and combine it with NLP to create a service that suggests a photo from my camera roll to post with text I have written, or suggest text to go with a photo I have chosen?

Surfacing the Hidden Beauty of Flickr Pictures
Idea: Can you design an app that predicts how many likes a picture will get on social media, thus helping users to select which photo they should upload to gain maximum exposure?

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Lyrics for 55000+ songs in English from LyricsFreak

Pitchfork music reviews from Jan 5, 1999 to Jan 8, 2017
Idea: Can you combine a number of these data sources and build a model that predicts the likelihood of success of an unreleased track or album? Assuming your model is accurate, what do the features with the most predictive power tell you about how record labels should deploy marketing resources?

Over 20 Million Movie Ratings and Tagging Activities Since 1995

Find sounds with text-queries based on their acoustic content

Over 250,000 Jeopardy questions scraped from Reddit (83% of all the Jeopardy questions)
Idea: How can this type of data be used to build a knowledge graph that an AI can leverage for context in other domains?
Idea 2: Can you design an interactive AI assistant that uses this dataset to help people efficiently and effectively train for Jeopardy/general trivia?

Which city has the highest median price or price per square foot?

Housing market data for metropolitan areas, cities, neighborhoods and zip codes across the nation
Idea: Can you build a model that analyzes past trends to determine which local real estate markets are about to heat up and which are likely to cool down?

Download a single file with all Zillow metrics

National and regional data on the number of new single-family houses sold and for sale

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015

How is Airbnb really being used in and affecting the neighborhoods of your city?
Idea: How can all of this data be used to build a product that helps real-estate developers choose locations and building types that optimize profits/risk?
Idea 2: How can this data be used to create an add-on/extension for Airbnb hosts that helps them make sure they are maximizing profit?

The world’s richest open dataset on politicians

GovTrack is here to help you track legislation being debated in the United States Congress
Idea: Can legislative patterns be used to illuminate where future economic upturns or downturns are likely to occur?

Collection of data from Donald Trump's 2016 presidential campaign including speeches and tweets

Thousands of social media messages from US Senators and other American politicians. Originating page can be found here

56 Major Speeches by Donald Trump (June 2015 - November 2016)

Land and ocean temperature anomalies
Idea: Can this type of data be used to predict and adjust agricultural/planting strategies? How might this data be combined with aerial imagery to show the effects of global warming on local regions and the resultant economic effects?

Why are our best and most experienced employees leaving prematurely?
Idea: Design a product that uses HR data to predict who is likely to leave. What can the data say about why employees "churn"? How can these patterns be interrupted?

Treasure trove of employment statistics

Percent change of employment from the same quarter a year ago
Idea: Discover the real patterns of why the manufacturing sector is losing jobs (not what the politicians say). Can those patterns be used to target groups of workers who might be more successful in retraining programs?

Thousands of job postings. - sample only - full set will be provided at the event
Idea: Discover related keywords, phrases and other hidden linguistic information to help recruiters do a better job of matching candidates to employer preferences and hiring patterns.
Idea 2: Can you use the job posting data and data from LinkedIn to pattern match candidates who are likely to be interested in a job where there would be mutual interest from an employer?