Generalized neural net
Unit tests in our shared game implementation
b. Do computation & modeling problem 86 -- You did this problem before using your custom KNN, but do it once more using sklearn's KNN implementation. You can re-use code; I just want to make sure you know how to use sklearn's KNN.
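For reference, here's a generic sketch of sklearn's KNeighborsClassifier with placeholder data; adapt it to problem 86's dataset.
from sklearn.neighbors import KNeighborsClassifier

X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.1], [0.9, 0.8]]  # placeholder feature rows
y = [0, 0, 1, 1]                                      # placeholder class labels

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X, y)
print(knn.predict([[0.4, 0.3]]))           # predicted class for a new point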
Extra Credit
You can get 100 points worth of extra credit on this assignment (i.e. a full assignment's worth of extra credit) if you create your own heuristic scoring technique and demonstrate that it performs better than the basic scoring technique shown in this assignment.
To demonstrate that it performs better, you'll need to run the same experiments for it and plot it on the same graph as the basic scoring technique.
Watch Prof. Wierman's videos:
Paul Rothemund https://www.youtube.com/watch?v=WhGG__boRxU
Answer the following questions:
Fill in the blank: According to Paul Rothemund, synthetic biologists are $\_\_\_\_\_$-oriented while molecular programmers are $\_\_\_\_\_$-oriented.
In what kind of organism can you find a single-stranded loop of DNA?
What is the point of making smiley faces and other pictures out of DNA? Why are scientists researching this?
In computer science lingo, what can DNA tiles be used to represent?
What kind of unnatural DNA structure did Rothemund design in 2006? How big is the structure relative to the size of a bacterium?
In Qian's DNA neural network, how does the network determine whether a given array represents an "L" or a "T"?
In Qian's DNA neural network example, if the network were given an "I", what would the weighted sum be using the "L" weight, what would the weighted sum be using the "T" weight, and what would the network classify the "I" as?
If you have any missing assignments... in particular, titanic analysis assignments... start catching up on those.
Lastly, the last meeting with Prof. Wierman may need to be rescheduled, possibly Friday 6/4.
The final is supposed to take place on Wednesday 6/2 from 11am-1pm, but I hear a lot of you guys have AP tests then. If you have an AP test that day, then I'll just leave the time window open on Canvas so you can take it any time between Tuesday 6/1 and Thursday 6/3.
Any topic that appeared on an assignment this semester is fair game.
Here are the notes from class. (I'll update this with more notes as we do more review.)
Here is a list of topics to help you focus your studying.
Create an elbow curve for k-means clustering on the titanic dataset, using min-max normalization.
Remember that the titanic dataset is provided here:
In your clustering, use all the rows of the data set, but only these columns:
["Sex", "Pclass", "Fare", "Age", "SibSp"]
The first few rows of the normalized data set should be as follows:
["Sex", "Pclass", "Fare", "Age", "SibSp"]
[0, 1, 0.01415106, 0.27117366, 0.125]
[1, 0, 0.13913574, 0.4722292, 0.125]
[1, 1, 0.01546857, 0.32143755, 0]
Then, just as before, make a plot of sum squared distance to cluster centers vs $k$ for k=[1,2,3,...,25].
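Here's a minimal pandas sketch of the min-max step, assuming the dataset has already been loaded and Sex has already been encoded as 0/1 (as in the processed file); the filename is a placeholder.
import pandas as pd

df = pd.read_csv("titanic.csv")  # placeholder filename for the provided dataset
cols = ["Sex", "Pclass", "Fare", "Age", "SibSp"]
df = df[cols]

# min-max normalization: scale each column to the range [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized.head(3))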
Choose k to be at the elbow of the graph (looks like k=4). Then, fit a k-means model with k=4, add the cluster label as a column in your data set, and find the column averages.
Tip: Use groupby: df.groupby(['cluster']).mean()
Here is an example of the format for your output. Your numbers might be different.
Sex Pclass Fare Age SibSp
cluster
0 1.000000 2.183908 38.759867 28.815940 0.000000
1 0.502110 2.092827 45.046011 29.253985 1.118143
2 0.456522 2.847826 52.115039 14.601963 4.369565
3 0.000000 2.419355 20.452848 31.896441 0.000000
To help us interpret the clusters, add a column for Survived (the mean survival rate in each cluster) and add a column for count (i.e. the number of data points in each cluster).
Note: We only include Survived AFTER the clustering. Later, we'll want to incorporate clustering into our predictive model, and we don't know the Survived values for the passengers we're trying to predict.
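Here's a minimal pandas sketch, assuming df holds the normalized features plus a 'cluster' column and survived is the original Survived column in the same row order.
# attach the original Survived values only after clustering, then group by cluster
df['Survived'] = survived
summary = df.groupby('cluster').mean()           # column means, incl. survival rate
summary['count'] = df.groupby('cluster').size()  # number of passengers per cluster
print(summary)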
Here is an example of the format for your output. Your numbers might be different.
Sex Pclass Fare Age SibSp Survived count
cluster
0 1.000000 2.183908 38.759867 28.815940 0.000000 0.787356 174.0
1 0.502110 2.092827 45.046011 29.253985 1.118143 0.527426 237.0
2 0.456522 2.847826 52.115039 14.601963 4.369565 0.152174 46.0
3 0.000000 2.419355 20.452848 31.896441 0.000000 0.168203 434.0
Then, interpret the clusters. Write down, roughly, what kind of passengers each cluster represents.
Code that generates the plot and prints out the mean data grouped by cluster
Overleaf doc with the grouped data as a table, and your interpretation of what each cluster means
Generate an elbow graph for the same data set as in the previous assignment, except using scikit-learn's k-means implementation. This problem will mainly be an exercise in looking up and using documentation.
It's possible that the sum squared error values may come out a bit different due to scikit-learn using a different method to assign initial clusters. That's okay. Just check that the elbow of the graph still occurs at k=3.
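Here's a sketch of how this might look with scikit-learn, assuming data is the list of rows from the previous assignment; the range of k values is illustrative.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

errors = []
k_values = range(1, 11)  # adjust the range to match the assignment
for k in k_values:
    model = KMeans(n_clusters=k, random_state=0).fit(data)
    errors.append(model.inertia_)  # inertia_ = sum of squared distances to cluster centers

plt.plot(k_values, errors)
plt.xlabel('k')
plt.ylabel('sum squared distance to cluster centers')
plt.show()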
Submission: Code that generates the elbow plot using scikit-learn's implementation.
Note: For this problem, put your code in a separate file (don't just overwrite the file from the previous assignment). This way, when I grade assignments, I can still run the code from the previous assignment.
Since AP tests are starting this week, the assignments will be shorter, starting with this assignment.
When clustering data, we often don't know how many clusters are in the data to begin with.
A common way to determine the number of clusters is using the "elbow method", which involves plotting the total "squared error" and then finding where the graph has an "elbow", i.e. goes from sharply decreasing to gradually decreasing.
Here, the "squared error" associated with any data point is its distance from its cluster center. If a data point $(1.1,1.8,3.5)$ is assigned to a cluster whose center is $(1,2,3),$ then the squared error associated with that data point would be
$$ (1.1-1)^2 + (1.8-2)^2 + (3.5-3)^2 = 0.3. $$The total squared error is just the sum of squared error associated with all the data points.
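For example, in code:
point  = [1.1, 1.8, 3.5]
center = [1, 2, 3]
squared_error = sum((p - c) ** 2 for p, c in zip(point, center))
print(squared_error)  # 0.3 (up to floating-point rounding)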
Watch the following video to learn about the elbow method:
Recall the following dataset of cookie ingredients:
columns = ['Portion Eggs',
'Portion Butter',
'Portion Sugar',
'Portion Flour']
data = [[0.14, 0.14, 0.28, 0.44],
[0.22, 0.1, 0.45, 0.33],
[0.1, 0.19, 0.25, 0.4],
[0.02, 0.08, 0.43, 0.45],
[0.16, 0.08, 0.35, 0.3],
[0.14, 0.17, 0.31, 0.38],
[0.05, 0.14, 0.35, 0.5],
[0.1, 0.21, 0.28, 0.44],
[0.04, 0.08, 0.35, 0.47],
[0.11, 0.13, 0.28, 0.45],
[0.0, 0.07, 0.34, 0.65],
[0.2, 0.05, 0.4, 0.37],
[0.12, 0.15, 0.33, 0.45],
[0.25, 0.1, 0.3, 0.35],
[0.0, 0.1, 0.4, 0.5],
[0.15, 0.2, 0.3, 0.37],
[0.0, 0.13, 0.4, 0.49],
[0.22, 0.07, 0.4, 0.38],
[0.2, 0.18, 0.3, 0.4]]
Use the elbow method to construct a graph of error vs k. For each value of k, you should do the following:
To initialize the clusters, assign the first row in the dataset to the first cluster, the second row to the second cluster, and so on, looping back to the first cluster after you assign a row to the $k$th cluster. So the cluster assignments will look like this:
{
1: [0, k, 2k, ...],
2: [1, k+1, 2k+1, ...],
3: [2, k+2, 2k+2, ...],
...
k: [k-1, 2k-1, 3k-1, ...]
}
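Here's a minimal sketch of that round-robin initialization:
# row i goes to cluster (i % k) + 1
def initial_clusters(data, k):
    clusters = {c: [] for c in range(1, k + 1)}
    for i in range(len(data)):
        clusters[(i % k) + 1].append(i)
    return clusters

# e.g. initial_clusters(data, 3) -> {1: [0, 3, 6, ...], 2: [1, 4, 7, ...], 3: [2, 5, 8, ...]}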
Check the logs if you need some more concrete examples.
For each value of k, you should run the k-means algorithm until it converges, and then compute the squared error.
You should get the following result:
Then, estimate the number of clusters in the data by finding the "elbow" in the graph.
Note: Here is a log to help you debug.
Link to repl.it code that generates the plot
Github commit to machine-learning repository
In your submission, write down your estimated number of clusters in the data set.
Previously, we ran into the issue that the Gobble game tree is too big to work with. So, what we can do instead is repeatedly generate a smaller local tree, and use that instead.
Minimax Algorithm Using Local Trees
Each time it's your player's turn to move, you can build a local tree as follows:
Use the current game state as the root node
Generate more nodes corresponding to $N$ turns of the game
$N=1$ would mean you stop after generating the child nodes.
$N=2$ would mean you stop after generating the grandchild nodes.
and so on...
Assign scores to the leaf nodes of the local tree (I'll explain this more after these bullet points).
Propagate the scores up the tree using the standard minimax approach.
Choose your action in accordance with the standard minimax strategy (i.e. choose the action which takes you to the highest-score child).
Scoring Non-Terminal States
How do we assign scores to the leaf nodes of the local tree? The local tree only tracks the possibilities of the game $N$ turns into the future, and at that point, it's unlikely that either player has won the game.
What we can do is use a heuristic technique to assign scores. The word "heuristic" refers to a technique that is intuitive and practical, though not necessarily optimal.
In our case, a good heuristic technique is to create a score that gets higher when you're in a better position to win (and is highest when you have won).
For tic-tac-toe-like games, we can create a heuristic score like this:
Start with score=0
ADD 100 for each row, column, or diagonal that contains 3 of YOUR OWN pieces.
ADD 10 for each row, column, or diagonal that contains 2 of YOUR OWN pieces, and where the remaining spot has nobody's piece in it.
SUBTRACT 100 for each row, column, or diagonal that contains 3 of your OPPONENT'S pieces.
SUBTRACT 10 for each row, column, or diagonal that contains 2 of YOUR OPPONENT'S pieces, and where the remaining spot has nobody's piece in it.
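Here's a minimal sketch of that heuristic, assuming the board's rows, columns, and diagonals are available as a list of 3-entry lines, where each entry is your symbol, the opponent's symbol, or None for an empty spot (the function name and representation are just illustrative).
def heuristic_score(lines, me, opponent):
    score = 0
    for line in lines:
        mine   = line.count(me)
        theirs = line.count(opponent)
        empty  = line.count(None)
        if mine == 3:
            score += 100
        elif mine == 2 and empty == 1:
            score += 10
        if theirs == 3:
            score -= 100
        elif theirs == 2 and empty == 1:
            score -= 10
    return score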
Your Task
Experiment: create a Gobble player that uses a local game tree approach, and match it up against a random player for 200 games (alternating who goes first).
Repeat the above experiment for $N=1,2,3,$ and so on, stopping at the value of $N$ for which the experiment takes more than 3 minutes to run.
Make a table of win rate & loss rate vs N in an Overleaf doc and submit it along with a replit link and github commit.
Note: This heuristic scoring technique is pretty basic so it might not perform super well. But I think it should at least do a bit better than the random player.
Riley -- once you've cleaned up your code, pull your Gobble game to the shared repo. You can accept your own pull request for this. Please do this today (Wednesday) so that everyone can use it for this assignment.
George -- be ready to present your Gobble implementation on Friday. If you're stuck, that's okay; just present where you got stuck and what you tried to get around it.
Everyone -- create a branch of our shared repo called your-name-game-tree. Put your game tree in there, and then create a minimax player and test it using our shared repo. Have it play 200 games against a random player (100 as first player, 100 as second player) and post its win rate on Slack.
Link to your branch with the minimax player
Clustering in General
"Clustering" is the act of finding "groups" of similar records within data.
Watch this video to get a general sense of what clustering is and why we care about it. (Best to play it at 1.5 or 1.75x speed to save time)
K-Means Clustering
Your task will be to implement a basic clustering technique called "k-means clustering". Here is a video describing k-means clustering:
Here is a summary of k-means clustering:
Initialize the clusters
Randomly divide the data into k parts. Each part represents an initial "cluster".
Compute the mean of each part. Each mean represents an initial cluster center.
Update the clusters
Re-assign each record to the cluster with the nearest center (using Euclidean distance).
Compute the new cluster centers by taking the mean of the records in each cluster.
Keep repeating step 2 until the clusters don't change after the update.
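Here's a minimal sketch of one update step, in case it helps structure your class (your interface may differ).
def update_clusters_once(data, clusters):
    # compute the mean of each cluster (the cluster centers)
    centers = {}
    for c, indices in clusters.items():
        rows = [data[i] for i in indices]
        centers[c] = [sum(col) / len(rows) for col in zip(*rows)]

    # reassign each record to the nearest center (squared Euclidean distance)
    new_clusters = {c: [] for c in clusters}
    for i, row in enumerate(data):
        distances = {c: sum((x - m) ** 2 for x, m in zip(row, center))
                     for c, center in centers.items()}
        nearest = min(distances, key=distances.get)
        new_clusters[nearest].append(i)
    return new_clusters, centers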
Your Task
Write a KMeans clustering class and use it to classify the following data.
# these column labels aren't necessary to use
# in the problem, but they make the problem more
# concrete when you're thinking about what the data
# means.
columns = ['Portion Eggs',
'Portion Butter',
'Portion Sugar',
'Portion Flour']
data = [[0.14, 0.14, 0.28, 0.44],
[0.22, 0.1, 0.45, 0.33],
[0.1, 0.19, 0.25, 0.4],
[0.02, 0.08, 0.43, 0.45],
[0.16, 0.08, 0.35, 0.3],
[0.14, 0.17, 0.31, 0.38],
[0.05, 0.14, 0.35, 0.5],
[0.1, 0.21, 0.28, 0.44],
[0.04, 0.08, 0.35, 0.47],
[0.11, 0.13, 0.28, 0.45],
[0.0, 0.07, 0.34, 0.65],
[0.2, 0.05, 0.4, 0.37],
[0.12, 0.15, 0.33, 0.45],
[0.25, 0.1, 0.3, 0.35],
[0.0, 0.1, 0.4, 0.5],
[0.15, 0.2, 0.3, 0.37],
[0.0, 0.13, 0.4, 0.49],
[0.22, 0.07, 0.4, 0.38],
[0.2, 0.18, 0.3, 0.4]]
# we usually don't know the classes of the
# data we're trying to cluster, but I'm providing
# them here so that you can actually see that the
# k-means algorithm succeeds.
classes = ['Shortbread',
'Fortune',
'Shortbread',
'Sugar',
'Fortune',
'Shortbread',
'Sugar',
'Shortbread',
'Sugar',
'Shortbread',
'Sugar',
'Fortune',
'Shortbread',
'Fortune',
'Sugar',
'Shortbread',
'Sugar',
'Fortune',
'Shortbread']
Make sure your class passes the following test:
# initial_clusters is a dictionary where the key
# represents the cluster number and the value is
# a list of indices (i.e. row numbers in the data set)
# of records that are said to be in that cluster
>>> initial_clusters = {
1: [0,3,6,9,12,15,18],
2: [1,4,7,10,13,16],
3: [2,5,8,11,14,17]
}
>>> kmeans = KMeans(initial_clusters, data)
>>> kmeans.run()
>>> kmeans.clusters
{
1: [0, 2, 5, 7, 9, 12, 15, 18],
2: [3, 6, 8, 10, 14, 16],
3: [1, 4, 11, 13, 17]
}
Here are some step-by-step tests to help you along:
>>> initial_clusters = {
1: [0,3,6,9,12,15,18],
2: [1,4,7,10,13,16],
3: [2,5,8,11,14,17]
}
>>> kmeans = KMeans(initial_clusters, data)
### ITERATION 1
>>> kmeans.update_clusters_once()
>>> kmeans.clusters
{
1: [0, 3, 6, 9, 12, 15, 18],
2: [1, 4, 7, 10, 13, 16],
3: [2, 5, 8, 11, 14, 17]
}
>>> kmeans.centers
{
1: [0.113, 0.146, 0.324, 0.437],
2: [0.122, 0.115, 0.353, 0.427],
3: [0.117, 0.11, 0.352, 0.417]
}
>>> {cluster_number: [classes[i] for i in cluster_indices] \
for cluster_number, cluster_indices in kmeans.clusters.items()}
{
1: ['Shortbread', 'Sugar', 'Sugar', 'Shortbread', 'Shortbread', 'Shortbread', 'Shortbread'],
2: ['Fortune', 'Fortune', 'Shortbread', 'Sugar', 'Fortune', 'Sugar'],
3: ['Shortbread', 'Shortbread', 'Sugar', 'Fortune', 'Sugar', 'Fortune']
}
### ITERATION 2
>>> kmeans.update_clusters_once()
>>> kmeans.clusters
{
1: [0, 2, 5, 6, 7, 9, 10, 12, 15, 18],
2: [14, 16],
3: [1, 3, 4, 8, 11, 13, 17]
}
>>> kmeans.centers
{
1: [0.111, 0.158, 0.302, 0.448],
2: [0.0, 0.115, 0.4, 0.495],
3: [0.159, 0.08, 0.383, 0.379]
}
>>> {cluster_number: [classes[i] for i in cluster_indices] \
for cluster_number, cluster_indices in kmeans.clusters.items()}
{
1: ['Shortbread', 'Shortbread', 'Shortbread', 'Sugar', 'Shortbread', 'Shortbread', 'Sugar', 'Shortbread', 'Shortbread', 'Shortbread'],
2: ['Sugar', 'Sugar'],
3: ['Fortune', 'Sugar', 'Fortune', 'Sugar', 'Fortune', 'Fortune', 'Fortune']
}
### ITERATION 3
>>> kmeans.update_clusters_once()
>>> kmeans.clusters
{
0: [0, 2, 5, 7, 9, 12, 15, 18],
1: [3, 6, 8, 10, 14, 16],
2: [1, 4, 11, 13, 17]
}
>>> kmeans.centers
{
0: [0.133, 0.171, 0.291, 0.416],
1: [0.018, 0.1, 0.378, 0.51],
2: [0.21, 0.08, 0.38, 0.346]
}
>>> {cluster_number: [classes[i] for i in cluster_indices] \
for cluster_number, cluster_indices in kmeans.clusters.items()}
{
0: ['Shortbread', 'Shortbread', 'Shortbread', 'Shortbread', 'Shortbread', 'Shortbread', 'Shortbread', 'Shortbread'],
1: ['Sugar', 'Sugar', 'Sugar', 'Sugar', 'Sugar', 'Sugar'],
2: ['Fortune', 'Fortune', 'Fortune', 'Fortune', 'Fortune']
}
Repl.it link to your k-means tests (and your github commit)
Using our shared tic-tac-toe implementation as a starting point, implement the "Gobble" game that was described during class.
3x3 board, just like tic-tac-toe. Player wins when they have 3 pieces in a row.
Pieces of 3 sizes: 1, 2, 3. You can use a larger-size piece to cover a smaller-size piece.
Each player has $k$ pieces of each size. This is a parameter that we may want to vary.
I already copied the tic-tac-toe implementation into a "gobble" folder, so you just need to create a branch and modify the existing code to implement Gobble.
Next class, be ready to present your Gobble implementation (i.e. what you changed in the existing tic-tac-toe implementation).
Write some code to create game trees and answer the following questions:
a. How many nodes are in a full tic-tac-toe game tree, and how long does it take to construct?
b. How many nodes are in a full Gobble game tree with $k=2,$ and how long does it take to construct?
c. How many nodes are in a full Gobble game tree with $k=3,$ and how long does it take to construct?
d. How many nodes are in a full Gobble game tree with $k=4,$ and how long does it take to construct?
e. How many nodes are in a full Gobble game tree with $k=5,$ and how long does it take to construct?
Link to gobble code
Link to overleaf doc with your answers to the game tree analysis questions
a. Take your code from the previous problem and run it again, this time on the titanic dataset.
Remember that the titanic dataset is provided here:
Filter the above dataset down to the first 100 rows, and only these columns:
["Survived", "Sex", "Pclass", "Fare", "Age","SibSp"]
Then, just as before, make a plot of leave-one-out accuracy vs $k$ for k=[1,3,5,7,...,99]. Overlay the 4 resulting plots: "unscaled", "simple scaling", "min-max", "z-score". You should get the following result:
b. Compute the relative speed at which your code runs (relative to mine). The way you can do this is to run this code snippet 5 times and take the average time:
import time
start = time.time()
counter = 0
for _ in range(1000000):
    counter += 1
end = time.time()
print(end - start)
When I do this, I get an average time of about 0.15 seconds. So to find your relative speed, divide your result by mine.
c. Speed up your code in part (a) so it runs in (your relative speed) * 45 seconds or less. I took a deeper dive into some code that was running slow for students, and it turns out the code just needs to be written more efficiently.
To make the code more efficient, you need to avoid unnecessarily repeating expensive operations. Anything involving a dataset transformation is usually expensive.
The very first thing you do should be processing all of your data and splitting it into your X and y arrays. DON'T do this every time you fit a model -- just do it once at the beginning.
In general, avoid repeatedly processing the data set. If there's something you're doing to the data set over and over again, just do it once at the beginning.
You can time your code using the following setup:
import time
begin_time = time.time()
(your code here)
end_time = time.time()
print('time taken:', end_time - begin_time)
REALLY IMPORTANT:
While you make your code more efficient, you'll need to repeatedly run it to see if your actions are actually decreasing the time it takes to run. Instead of running the full analysis each time, just run a couple values of $k$. That way, you're not waiting a long time for your code to run each time. Once you've decreased this partial run time by a lot, you can run your entire analysis again.
If you get stuck for more than 10 minutes without making progress, ping me on Slack so that I can take a look at your code and let you know if there's anything else that's making it slow.
d. Complete quiz corrections for any problems you missed. (I'll have the quizzes graded by tonight, 5/5.) That will either involve revising your free response answers or revising your code and sending me the revised version.
Link to KNN code that runs in (your relative speed) * 45 seconds or less. When I run your code, it should print out the total time it took to run.
Quiz corrections
Before fitting a k-nearest neighbors model, it's common to "normalize" the data so that all the features lie within the same range. Otherwise, variables with larger ranges are given greater distance contributions (which is usually not what we want).
The following video explains 3 different normalization techniques: simple scaling, min-max scaling, and z-scoring.
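As a quick reference, here's a minimal pandas sketch of the three techniques, assuming the numeric feature columns are in a DataFrame df.
simple_scaled = df / df.max()                            # simple scaling: divide by the max
min_max       = (df - df.min()) / (df.max() - df.min())  # min-max: scale to [0, 1]
z_scored      = (df - df.mean()) / df.std()              # z-score: mean 0, std 1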
Consider the following dataset. The goal is to use the features to predict the book type (children's book vs adult book).
First, read in this dataset and change the "book type" column to be numeric (1 if adult book, 0 if children's book).
a. Create a "leave-one-out accuracy vs k" curve for k=[1,3,5,...,99]
.
b. Repeat (a), but this time normalize the data using simple scaling beforehand.
c. Repeat (a), but this time normalize the data using min-max scaling beforehand.
d. Repeat (a), but this time normalize the data using z-scoring beforehand.
e. Overlay all 4 plots on the same graph. Be sure to include a legend that labels the plots as "unscaled", "simple scaling", "min-max", "z-score".
You should get the following result:
f. Answer the big question: why does normalization improve the accuracy? (Or equivalently, why did the model perform worse on the unnormalized data?)
Overleaf doc with plot and explanation, as well as a link to the code that you wrote to generate the plot.
Note: Previously, this problem had consisted of a KNN model on the full titanic dataset along with normalization techniques. The analysis was taking too long on chromebooks, so I've reduced the size of the dataset. Also, the normalization techniques weren't having an effect on the result, so I took that off this assignment but will revise the normalization task and put it on the next assignment. Any code you wrote for the normalization techniques will be useful in the next assignment.
In this problem, your task is to use scikit-learn's k-nearest neighbors implementation to predict survival in a portion of the titanic survival modeling dataset.
Remember that the fully-processed dataset is here:
Take that fully-processed dataset and filter it down to the first 100 rows, and only these columns:
[
"Survived",
"Sex",
"Pclass",
"Fare",
"Age",
"SibSp"
]
Then, create a plot of leave-one-out accuracy vs $k$ for the following values of $k$:
[1,3,5,10,15,20,30,40,50,75]
You should get the following result:
K-fold cross validation is similar to leave-one-out cross validation, except that instead of repeatedly leaving out one record, we split the dataset into $k$ sections or "folds" and repeatedly leave out one of those folds.
This video explains it pretty well, with a really good visual at the end:
Answer the following questions:
If we had a dataset with 800 records and we used 2-fold cross validation, how many models would we fit, how many records would each model be trained on, and how many records would each model be validated (i.e. tested) on?
If we had a dataset with 800 records and we used 8-fold cross validation, how many models would we fit, how many records would each model be trained on, and how many records would each model be validated (i.e. tested) on?
If we had a dataset with 800 records, for what value of $k$ would $k$-fold cross validation be equivalent to leave-one-out cross validation?
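For reference, here's a sketch of k-fold cross validation with scikit-learn's KFold, assuming X and y are numpy arrays (the model choice is illustrative).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

accuracies = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])                  # train on 4 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # validate on the held-out fold

print(np.mean(accuracies))  # average accuracy across the 5 folds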
Link to your code that generates the plot
Overleaf doc with the plot and the answers to the 3 questions
a. Implement a minimax player for your tic-tac-toe game.
Remember that the minimax strategy works as follows:
Assign scores to the terminal nodes (win, loss, or tie), then repeatedly propagate those scores up the tree to parent nodes.
If the game state of the parent node implies that it's your turn, then the score of that node is the maximum value of the child scores (since you want to maximize your score).
If the game state of the parent node implies that it's the opponent's turn, then the score of that node is the minimum value of the child scores (since your opponent wants to minimize your score).
Always make the move that takes you to the highest-score child state. (If there are ties, then you can choose randomly.)
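Here's a minimal sketch of the propagation step, assuming each node object exposes children, score (already set on terminal nodes), and a my_turn flag; the names are illustrative.
def minimax_score(node):
    if not node.children:          # terminal node: score already assigned (+1, -1, or 0)
        return node.score
    child_scores = [minimax_score(child) for child in node.children]
    # maximize on your turn, minimize on the opponent's turn
    node.score = max(child_scores) if node.my_turn else min(child_scores)
    return node.score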
b. Check that your minimax strategy usually beats a random strategy. Run as many minimax vs random matchups as you can in 3 minutes, alternating who goes first. What percentage of the time does minimax win? Post your win percentage on Slack.
Heads up that next class, we'll discuss whether to use Riley's tic-tac-toe implementation or Colby's implementation. If you have strong feelings one way or the other, prep for it by creating a pro con list that you can reference during the discussion.
Github commit for minimax code, and post your minimax win percentage on Slack.
Remember, quiz Friday! See the previous assignment for information on what's on it.
I was going to have us create tic-tac-toe playing agents, but then I realized that creating the tic-tac-toe game is enough work for one assignment. So that will be the goal of this assignment.
I invited everyone to a github team called cohort-1. Accept the invite and you will be granted write access to the following repository:
In that repository, create a folder tic-tac-toe and create a basic tic-tac-toe game in there. There should be a Game class that accepts two Strategy classes, similar to how space-empires works. (You can make additional classes as you see fit.)
You should also include some basic tests to demonstrate that the game works properly.
Prepare a 3-5 minute presentation about your implementation for Wednesday. Don't exceed 5 minutes. As usual, the things to address are the following:
You can show parts of your code, but DON'T go through it line-by-line. This is supposed to be a quick elevator pitch of your implementation.
(There's no submission for this assignment; your grade will be based on your presentation during class.)
Forward/backward selection, basic manipulations with pandas / numpy / sklearn. Also, the videos that Prof. Wierman assigned:
Thomas Vidick https://www.youtube.com/watch?v=Cwz_tMjzavc
There won't be any game tree stuff (we'll wait until we're further along that path).
Overview: There are 2 parts to this assignment: backward selection and space empires.
In this assignment, you'll do "backward selection", which is very similar to forward selection except that we start with all features and remove features that don't improve the accuracy.
One key difference is that with backward selection, we'll just loop through all the features once and remove any features that don't improve the accuracy. This is different from forward selection (in forward selection, we looped through all the features repeatedly).
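Here's a minimal sketch of that single pass, where fit_and_score(features) is a placeholder for fitting the logistic regression on those features and returning the testing accuracy.
features = list(all_features)            # start with every feature
best_accuracy = fit_and_score(features)

for feature in list(features):           # loop through all the features once
    candidate = [f for f in features if f != feature]
    accuracy = fit_and_score(candidate)
    if accuracy >= best_accuracy:        # removing the feature didn't hurt, so drop it
        features = candidate
        best_accuracy = accuracy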
A couple notes:
Use 100 iterations and set random_state=0 (it's a parameter in the logistic regressor; check out the documentation for more info).
100 iterations isn't enough for the regressor to converge, but since things run slow on the chromebooks, we'll just do this exercise with 100 iterations regardless. To suppress convergence warnings, set the following:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
Results
Initially, using all the features, testing accuracy should be about 0.788
Then, after backward selection, testing accuracy should have increased to 0.831
For your ease of debugging, all the features along with information about each iteration of backward selection are shown in the log below.
In class, we decided on using George's implementation, with the following tweaks:
allow it to run all tests at once if we want (like how Colby did)
get the phase from the game state (like how David did)
Elijah - your implementation was clever, but George's seemed simpler for everyone to build off of.
This weekend:
George - merge your pull request (do this ASAP, definitely by Saturday, because Colby and David's tasks depend on this). There are probably merge conflicts that you'll have to resolve.
Colby - after George has merged his pull request, create a new branch to include the capability to run all unit tests at once. Then create a pull request and merge the code.
David - after George has merged his pull request, create a new branch and tweak the code to get the phase from the game state. Then create a pull request and merge the code.
Riley - create a page on the wiki called "Unit Test Descriptions" and write a brief description for each of the existing unit tests. This way, we can look to this page to understand what we do and don't have unit tests for already.
Elijah - everyone was having an issue translating between the native game state and the standard game state, so you'll need to either
write documentation for the native game state and write functions for translating to and from the standard game state, or
update the standard game state (and its documentation) so that we no longer need a native game state
Since you're the one who knows the most about the native game state, this can be your call.
If there were any problems you didn't get right, fix them and show all your work (or all your code).
To introduce the idea of how one can design intelligent agents, we'll implement an intelligent agent that solves tic-tac-toe using the minimax algorithm. But before we actually implement it, we need to understand it at a high level.
Watch the first 8 minutes of the following video that explains the minimax algorithm. (You can probably set it to 1.5x speed)
Then, answer the following questions:
What does the root of the game tree represent?
What does each edge of the game tree represent?
What are the scores of a win, a loss, a tie? (3 answers)
Is your opponent the maximizing player or the minimizing player?
If a node has a child with score +1 and a child with score -1, then what is the score of the node?
If a node has two children with score +1, one child with score 0, and one child with score -1, then what is the score of the node?
Draw the full game tree proceeding from the following root node, and label each node with its score according to the minimax algorithm. There should be 12 nodes in total.
X | O | X
---------
| O | O
---------
| | X
You can do the drawing on paper, take a picture, and put that in your Overleaf doc.
On Friday, everyone will give a 5-minute presentation of their unit testing framework, and based on that, we will decide what kind of framework to use for our shared implementation.
Before class, run through your 5-minute presentation and make sure that you are able to do the following in 5 minutes:
Explain how your framework works (at a high level).
Show off key pieces of code. If there are any really elegant pieces, show them off. If there are any messy pieces, be forthcoming about it.
Show how you run your testing framework on the existing unit tests and show the output.
You don't have to make slides or anything super formal. You just need to describe things clearly and concisely.
Overleaf doc with quiz corrections & minimax answers
This is a catch-up assignment. Please prioritize problem 114 -- it's important to have that done by Wednesday's class.
a. If there's anything that you find confusing about our game implementation, post on Slack for discussion.
In particular, George -- you should ask what you were wondering about the game state and include what you printed out for the game state.
This is now our shared implementation, so it's everyone's responsibility to maintain it. If there's anything that you find confusing, then take initiative to ask about what it means on Slack. If there's any part of the code that you don't like, kick off a discussion about changing it. It's everyone's code now.
b. We currently have 4 unit tests in the unit_tests folder: movement test 1, economic tests 1 and 2, and combat test 1. Create a file that executes these unit tests.
You DON'T have to debug the game so that the tests pass. I just want you to create a unit test framework that runs each test and says whether it passes or not.
You can change the structure of the test files if you want, e.g. structuring the test description in a more standard format, or whatever you want to do to make it easier to run the unit tests.
Then, next week, we can compare the tradeoffs of everyone's frameworks and agree upon a format to use going forward.
Develop your unit testing file on your own separate branch. But you don't have to make a pull request (we'll decide which branch to pull in during the next class).
A link to your unit tests, and a link to the commit on your branch.
If you haven't already, get your strategy working in our shared game implementation and create a pull request so we can merge in class on Friday.
This is important. If you're confused by any part in the code, DON'T use that as an excuse for not having this done. Post on Slack and we'll clear up any confusions in the code.
Lastly, don't worry if the game doesn't work exactly as intended right now. We'll start with unit tests on Friday. I just want everyone's strategy to run on our shared game without giving errors.
Previously, you built a logistic model with 167 features, and got the following results using max_iter=10,000:
training: 0.848
testing: 0.811
It turned out that running that many iterations was taking a while (5 minutes) for some students, so let's use max_iter=1,000 instead. The logistic regressor might not fully converge, which means the model will probably be slightly worse, but that's okay because right now we're just going through this modeling process for educational purposes.
Using max_iter=1,000, I get the following results:
training: 0.846
testing: 0.808
Yours should be pretty similar.
Now, you'll notice that the training accuracy is quite a bit higher than the testing accuracy. This is because we now have a LOT of features in our dataset, and not all of them are useful, which means it's harder for the model to figure out what is useful. The model ends up fitting to some "noise" in the data (see https://en.wikipedia.org/wiki/Noisy_data) and that causes it to pick up on some random patterns that aren't actually meaningful. The model becomes paranoid!
To fix this issue, we need to carry out feature selection, in which we attempt to select only the features that are actually useful to the model.
One type of feature selection method is forward selection, in which we begin with an empty model and add in variables one by one. In each forward step, you add the one variable that gives the single best improvement to your model.
Your task is to carry out forward selection on those 167 features.
Initially, you'll assume a model with no features. You don't actually build this model, but you assume its accuracy is 0.
Each forward step, you'll need to create a new model for each possible feature you might add next.
The next feature should always be the feature that gives you the largest accuracy when included in your model.
Stopping Criterion: If the feature that gives the largest accuracy doesn't actually improve the accuracy of the model, then stop.
In general, in the $n$th step of forward selection, you should be testing out models with $n$ features, $n-1$ of which are the same across all the models.
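Here's a minimal sketch of the loop, where fit_and_score(features) is a placeholder for fitting a model on those features and returning its accuracy.
selected = []
best_accuracy = 0          # the "empty" model is assumed to have accuracy 0

while True:
    remaining = [f for f in all_features if f not in selected]
    if not remaining:
        break
    # try adding each remaining feature and keep the one that scores best
    scores = {f: fit_and_score(selected + [f]) for f in remaining}
    best_feature = max(scores, key=scores.get)
    if scores[best_feature] <= best_accuracy:
        break              # stopping criterion: no improvement from the best candidate
    selected.append(best_feature)
    best_accuracy = scores[best_feature]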
Put this problem in a separate file. I'll give you the processed data set so that you can be sure you're using the right starting point (it should match up with yours, but just in case it doesn't, you can still do this problem without having to go down the rabbit hole of debugging your data processing).
Your task is to take the processed data set and carry out forward selection. You should end up with the features and accuracies shown below.
['Sex', 'Pclass * SibSp', 'Pclass * Fare', 'Pclass * CabinType=E', 'Fare * CabinType=D', 'SibSp * CabinType=B', 'SibSp>0', 'Fare * CabinType=A']
training: 0.818
testing: 0.806
Print out a log like that given in the file below. This log is given to help you debug.
IMPORTANT: While initially writing your code, change max_iter to a small number like 10 so that you're not waiting around for your log to generate each time. Once your code seems like it's working as intended, THEN update the iterations to 1000 and check that your results match up with those given in the log above.
You'll notice that we were able to remove a TON of the features, and get nearly the same testing accuracy. The training accuracy also got closer to the testing accuracy. That's good.
However, the testing accuracy didn't increase. It actually went down a bit. In a future assignment, we'll talk about another feature selection method that solves this issue.
Just the repl.it link to your file and the commit link for Github.
Also, remember that there's a quiz on Friday (as outlined on the previous assignment).
Put your code for this problem in the file that you've been using to do the titanic survival prediction using pandas, numpy, and sklearn.
Previously, we left off using a logistic regression with the following features:
['Sex', 'Pclass', 'Fare', 'Age', 'SibSp', 'SibSp>0', 'Parch>0', 'Embarked=C', 'Embarked=None', 'Embarked=Q', 'Embarked=S', 'CabinType=A', 'CabinType=B', 'CabinType=C', 'CabinType=D', 'CabinType=E', 'CabinType=F', 'CabinType=G', 'CabinType=None', 'CabinType=T']
We got the following accuracy:
training accuracy: 0.8260
testing accuracy: 0.7903
Now, let's introduce some interaction terms. You'll need to create another column for each non-redundant interaction between features. An interaction is redundant if the two features are derived from the same original feature.
SibSp and SibSp>0 are redundant
All the features that start with Embarked= are redundant with each other
All the features that start with CabinType= are redundant with each other
I can't give you a list of all these features because then you could just copy over that list and use it as a starting point. But I can tell you that there will be 167 features in total, not including Survived (which is not actually a feature, since that's what we're trying to predict). There are 20 non-interaction features and 147 interaction features for a total of 167 features.
There are many ways to accomplish this. My suggestion is to first just create a list of all the names of interaction terms between non-redundant features,
['Sex * Pclass', 'Sex * Fare', ...]
and then loop through that list to create the actual column in your dataframe for each interaction feature.
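Here's a minimal sketch of that loop, assuming df is your pandas DataFrame and interaction_names is the list described above.
# build each interaction column from a name like 'Sex * Pclass'
for name in interaction_names:
    left, right = [part.strip() for part in name.split('*')]
    df[name] = df[left] * df[right]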
If you fit your regressor using all 167 features with max_iter=10000, you should get the following result (rounded to 3 decimal places):
training: 0.848
testing: 0.811
Note that at this point, our model is probably overfitting a bit. In a future assignment, we'll fix that by introducing some basic "feature selection" methods.
Just submit the repl.it link to your file along with the Github commit to your kaggle repository. Your file should print out your training and testing accuracy, which should match up with the given result.
We'll have a quiz on Friday on the following topics:
logistic regression (pseudoinverse & gradient descent)
basic data processing / model fitting with pandas / numpy / sklearn
Note that in class today, we reviewed the logistic regression part, but the questions I ask on the quiz aren't going to be exactly the same as the ones we went over in the review. The quiz will check whether you've developed intuition from really understanding the answers to those questions, and the intuition should carry over to similar but slightly different questions.
I may ask you to do some computations by hand, so make sure you're able to do that too (I'd suggest to work out the first iteration in problem 76 by hand and make sure that the gradient & updated weights you get match up with what's in the log).
a. Get your level 3 strategy working against NumbersBerserker. Work on a separate branch and create a pull request when you're done. Also, post your win rate on Slack.
The strategies are in space-empires-cohort-1/src/strategies/level_3
https://github.com/eurisko-us/space-empires-cohort-1/tree/main/src/strategies
b. Watch the videos that Prof. Wierman assigned during the last meeting. Make sure you're watching them closely enough to talk about them afterwards.
(This is a short ~30 minute assignment since we have Wednesday off.)
Now that you've built a logistic regressor that uses gradient descent, you've "unlocked" the privilege to use sklearn's LogisticRegressor.
Previously, you carried out a Titanic prediction problem using sklearn's linear regressor. For this problem, just tweak the code you wrote to use the logistic regressor instead.
After you replace LinearRegressor with LogisticRegressor in your code, you'll have to
tweak a parameter of the regressor to get it to run long enough to converge
update your code to support the format in which the logistic regressor returns information
I'm not going to tell you exactly how to fix those issues, because the point of this problem is to give you practice debugging and reading documentation.
Tip: To find the official documentation on sklearn's logistic regressor, do a google search with the query "sklearn logistic regression".
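For reference, here's a generic usage sketch with placeholder arrays; it is not the specific fix for the issues above, just the usual shape of sklearn's API.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=10000)   # raise max_iter if the solver doesn't converge
model.fit(X_train, y_train)                  # X_train, y_train are placeholder arrays
predictions = model.predict(X_test)          # returns 0/1 class labels directly
print(model.score(X_test, y_test))           # testing accuracy
print(model.coef_, model.intercept_)         # coefficients come back as arrays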
You should get the output below. The predictions with the logistic regressor turn out to be a little bit better than those with the linear regressor.
features: [
'Sex',
'Pclass',
'Fare',
'Age',
'SibSp', 'SibSp>0',
'Parch>0',
'Embarked=C', 'Embarked=None', 'Embarked=Q', 'Embarked=S',
'CabinType=A', 'CabinType=B', 'CabinType=C', 'CabinType=D', 'CabinType=E', 'CabinType=F', 'CabinType=G', 'CabinType=None', 'CabinType=T']
training accuracy: 0.8260
testing accuracy: 0.7903
coefficients:
{
'Constant': 1.894,
'Sex': 2.5874,
'Pclass': -0.6511,
'Fare': -0.0001,
'Age': -0.0398,
'SibSp': -0.545,
'SibSp>0': 0.4958,
'Parch>0': 0.0499,
'Embarked=C': -0.2078, 'Embarked=None': 0.0867, 'Embarked=Q': 0.479, 'Embarked=S': -0.3519,
'CabinType=A': -0.0498, 'CabinType=B': 0.0732, 'CabinType=C': -0.2125, 'CabinType=D': 0.7214, 'CabinType=E': 0.4258, 'CabinType=F': 0.6531, 'CabinType=G': -0.7694, 'CabinType=None': -0.5863, 'CabinType=T': -0.2496
}
Just submit the repl.it link to your code. When I run it, it should print out the information above.
Previously, we built a LogisticRegressor that worked by reducing the regression task down to the task of finding the least-squares solution to a linear system.
More precisely, the task of fitting the logistic function
$$y=\dfrac{1}{1+e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}$$was reduced to the task of fitting the linear regression
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n = \ln \left( \dfrac{1}{y} - 1 \right).$$Although this is a slick way to solve the problem, it suffers from the fact that we have to do something "hacky" in order to fit any data points with $y=0$ or $y=1.$
In such cases, we can't just run the model as usual, because the $\ln \left( \dfrac{1}{y}-1 \right)$ term blows up -- so our "hack" has been to
change any instances of $y=0$ to a small decimal like $y=0.1$ or $y=0.001,$ and
change any instances of $y=1$ to $1$ minus the small decimal, like $y=0.9$ or $y=0.999,$
depending on the context of the problem.
But this isn't a great way to deal with the issue, because the resulting logistic function can change significantly depending on what small decimal we use. The difference between small decimals may seem like such a minor difference, but when we plug these values in the $\ln \left( \dfrac{1}{y} - 1 \right)$ term, we get wildly different results, which leads to quite different fits.
PART A. To illustrate the quite different fits, fit 4 instances of your current LogisticRegressor to the following dataset:
one instance where you change all instances of y=0 to y=0.1 and all instances of y=1 to y=0.9
another instance where you change all instances of y=0 to y=0.01 and all instances of y=1 to y=0.99
another instance where you change all instances of y=0 to y=0.001 and all instances of y=1 to y=0.999
another instance where you change all instances of y=0 to y=0.0001 and all instances of y=1 to y=0.9999
df = DataFrame(
[[1,0],
[2,0],
[3,0],
[2,1],
[3,1],
[4,1]],
columns = ['x', 'y'])
Put these all on the same plot, along with the data, and put them in an Overleaf doc. Be sure to label each curve with 0.1, 0.01, 0.001, or 0.0001 as appropriate.
If you need a refresher on plotting / labeling curves, see here:
https://www.eurisko.us/files/assignment_problems_cohort_2_10th.html#Problem-10-1
If you need a refresher on including data in plots, see here:
https://www.eurisko.us/files/assignment_problems_cohort_2_10th.html#Problem-33-1
Explain: How does the plot change as the small decimal is varied?
Instead, we can use gradient descent to fit our logistic function. We want to choose the coefficients that minimize the sum of squared error (the RSS).
PART B. In your LogisticRegressor class, write the following methods:
calc_rss() - calculates the sum of squared error for the regressor
set_coefficients(coeffs) - allows you to manually set the coefficients of your regressor by passing in a dictionary of coefficients
calc_gradient(delta) - computes the partial derivatives of the RSS with respect to each coefficient
gradient_descent(alpha, delta, num_steps, debug_mode=False) - carries out a given number of steps of gradient descent. If debug_mode=True, then print out every step of the way.
Note that we wrote a gradient descent optimizer a while back:
https://www.eurisko.us/files/assignment_problems_cohort_2_10th.html#Problem-34-2
You can use this as a refresher on how to code up gradient descent, and you might be able to copy/paste some code from here.
Tip: LogisticRegressor stores its coefficients in a dictionary.
Note that we will use the central difference approximation
$$ f'(x) \approx \dfrac{f(x+\delta) - f(x-\delta)}{2\delta}. $$Here is a test case:
df = DataFrame.from_array(
[[1,0],
[2,0],
[3,0],
[2,1],
[3,1],
[4,1]],
columns = ['x', 'y'])
reg = LogisticRegressor(df, dependent_variable='y')
reg.set_coefficients({'constant': 0.5, 'x': 0.5})
alpha = 0.01
delta = 0.01
num_steps = 20000
reg.gradient_descent(alpha, delta, num_steps)
reg.coefficients
{'constant': 2.7911, 'x': -1.1165}
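Here's a minimal sketch of how calc_gradient might use the central difference approximation, assuming the regressor exposes a coefficients dictionary and a calc_rss() method (written here as a standalone helper for clarity).
def calc_gradient(regressor, delta):
    # central-difference estimate of d(RSS)/d(coefficient) for each coefficient
    gradient = {}
    for name in regressor.coefficients:
        original = regressor.coefficients[name]
        regressor.coefficients[name] = original + delta
        rss_plus = regressor.calc_rss()
        regressor.coefficients[name] = original - delta
        rss_minus = regressor.calc_rss()
        regressor.coefficients[name] = original          # restore the coefficient
        gradient[name] = (rss_plus - rss_minus) / (2 * delta)
    return gradient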
Here are logs for every step of the way:
Make a plot of the resulting logistic curve, along with the data, and put it in an Overleaf doc. Be sure to label your curve with "gradient descent".
link to Overleaf doc (just contains 2 plots and the explanation of the first plot): ____
repl.it link to code that generated the plots: _____
commit link (machine-learning): ____
Going forward, we need to start using models from an external machine learning library after you build the initial versions of the corresponding models. Most of the learning comes from building the first version, and debugging these subtle issues takes up too much time. Plus, it's good to know how to work with external libraries.
So instead of "build everything from scratch and maintain it forever", our motto will be "build the first version from scratch and then switch to a popular library".
If you're behind on any machine learning problems, don't worry about catching up. Just start off with this problem. This problem doesn't depend on anything you've written previously.
Create a new repository called kaggle. Create a folder titanic, and put your dataset and analysis file in there. Remember that the dataset is here:
In this assignment, you will create an analysis.py file that carries out an analysis similar to that described in problem 107, using the libraries numpy, pandas, and sklearn. You should follow along with the relevant parts of the walkthrough in the class recording:
Here are the relevant parts. (But read the rest of the assignment before starting.)
[0:35-0:42] Set up the environment & read in the dataframe
[0:42-0:50] Process Sex by changing male to 0 and female to 1
[0:56-1:02] Process Age by replacing all NaNs with the mean age
[1:02-1:09] Process SibSp and Parch. Keep SibSp, but also add the indicator variable (i.e. dummy variable) SibSp>0. Add the indicator variable Parch>0 as well, and get rid of Parch.
[1:17-1:42] Split into train/test, fit the regressor, get the predictions, compute training/testing accuracy. (At this point, don't worry about checking your numbers match up with mine, since I wasn't showing exactly which columns were being used in the regressor.)
[1:42-1:46] State the columns to be used in the regressor. (Here, your numbers should match up with mine, since I show exactly which columns are being used in the regressor.)
[1:46-1:56] Process Cabin into CabinType and create the corresponding indicator variables. Also, create the corresponding indicator variables for Embarked. Make sure to delete Cabin, CabinType, and Embarked afterwards.
[2:00-2:02] Run the final model. Your numbers should match up with mine.
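Here's a rough pandas sketch of a few of these steps (the filename is a placeholder); the walkthrough in the recording is the authoritative version.
import pandas as pd

df = pd.read_csv('train.csv')                           # placeholder filename
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})     # [0:42-0:50]
df['Age'] = df['Age'].fillna(df['Age'].mean())          # [0:56-1:02]
df['SibSp>0'] = (df['SibSp'] > 0).astype(int)           # [1:02-1:09]
df['Parch>0'] = (df['Parch'] > 0).astype(int)
df = df.drop(columns=['Parch'])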
You can just follow along with the walkthrough in the class recording and turn in the code you write as you followed along.
Note that watching me type and speak at normal (slow) pace is a waste of time, so play the video on 2x speed. You can access the speed controls by clicking on the gear icon in the bottom-right of the video.
I think this is a 90-minute problem. The relevant parts of the recording take up 70 minutes, and if you play at 2x speed, it's only 35 minutes. If we budget between an equal and double amount of time for you to write the code as you follow along, then we're up to about 90 minutes. But if you find yourself taking longer or getting stuck anywhere, please let me know.
Here is the documentation for sklearn's LinearRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
At the end, your code should print out the following (where numbers are rounded to 4 decimal places):
features: [
'Sex',
'Pclass',
'Fare',
'Age',
'SibSp', 'SibSp>0',
'Parch>0',
'Embarked=C', 'Embarked=None', 'Embarked=Q', 'Embarked=S',
'CabinType=A', 'CabinType=B', 'CabinType=C', 'CabinType=D', 'CabinType=E', 'CabinType=F', 'CabinType=G', 'CabinType=None', 'CabinType=T']
training accuracy: 0.81
testing accuracy: 0.7749
coefficients:
{
'Constant': 0.696,
'Sex': 0.5283,
'Pclass': -0.0978,
'Fare': 0.0,
'Age': -0.0058,
'SibSp': -0.0585, 'SibSp>0': 0.0422,
'Parch>0': 0.0097,
'Embarked=C': -0.0547, 'Embarked=None': 0.052, 'Embarked=Q': 0.0709, 'Embarked=S': -0.0682,
'CabinType=A': 0.0447, 'CabinType=B': 0.0371, 'CabinType=C': -0.0124, 'CabinType=D': 0.1818, 'CabinType=E': 0.1088, 'CabinType=F': 0.2593, 'CabinType=G': -0.2797, 'CabinType=None': -0.0677, 'CabinType=T': -0.2717
}
Just submit 2 things:
kaggle/titanic/analysis.py
We're going to cut down on Eurisko assignment durations by a third. We've made a lot of progress, and most of you have AP tests coming up, so we're going to ease off the gas pedal a bit. We're going to hit the brakes on Haskell, C++, and code review, since you've had some basic exposure to those things and pursuing them further isn't going to be as valuable to the goals of the class as the space empires and machine learning stuff. Each assignment will consist of a single problem in one of the following areas:
For this problem, you'll need to turn in both your analysis code and an Overleaf writeup. The code should print out all the checks that are provided to you in this problem.
Note: after this problem was released, I realized I forgot to include a Constant column, as we should normally do for linear regression. However, the main things to be learned on this assignment don't really depend on the constant, so carry on without it.
a. Continue processing your data as follows:
Sex - replace "male" with 0 and "female" with 1
Age - replace any instances of None with the mean age (which should be about 29.699)
SibSp - this was one of the variables that didn't have a clear positive or negative association with Survival. When SibSp=0, survival was low; when SibSp>=1, survival started higher but then decreased as SibSp increased.
So, what we can do is create a dummy variable SibSp=0 that equals 1 when SibSp is equal to 0 (and 0 otherwise). And we'll keep SibSp as well. This way, the variable SibSp=0 can be given a negative coefficient that offsets the coefficient of SibSp in the case when SibSp equals 0.
Parch - we'll replace this with a dummy variable Parch=0, because the only significant difference in the data is whether or not Parch is equal to 0. Among passengers who had Parch greater than 0, it doesn't look like there's much variation in survival.
CabinType - replace this with dummy variables of the form CabinType=A, CabinType=B, CabinType=C, CabinType=D, CabinType=E, CabinType=F, CabinType=G, CabinType=None, CabinType=T.
Embarked - replace this with dummy variables of the form Embarked=C, Embarked=None, Embarked=Q, Embarked=S.
Now, your data should all be numeric, and we can put it into a linear regressor.
Note: To get predictions out of the linear regressor, we'll interpret the linear regression's output in the following way.
if the linear regressor predicts a value less than 0.5, then it predicts the passenger did not survive (i.e. it predicts survival=0)
if the linear regressor predicts a value greater than or equal to 0.5, then it predicts the passenger survived (i.e. it predicts survival=1)
b. Create train and test datasets. Use the first 500 records for training, and the rest for testing. Start out just training a model which uses Sex as the only feature. This will be our baseline.
train accuracy: 0.8
test accuracy: 0.7698
{'Sex': 0.7420}
Note that accuracy is just the number of correct classifications divided by the total number of classifications.
c. Now, introduce Pclass. Uh oh! Why didn't our test accuracy get any better? Write your explanation in an Overleaf doc.
train accuracy: 0.8
test accuracy: 0.7698
{'Sex': 0.6514, 'Pclass': 0.0419}
Hint: Look at the Sex coefficient.
d. Bring in some more features: Fare, Age, SibSp, SibSp=0, Parch=0. The test accuracy still hasn't gotten any better. Why?
train accuracy: 0.796
test accuracy: 0.7698
{
'Sex': 0.5833,
'Pclass': -0.0123,
'Fare': 0.0012,
'Age': 0.0008,
'SibSp': -0.0152,
'SibSp=0': 0.0478,
'Parch=0': 0.0962
}
e. Bring in some more features: Embarked=C, Embarked=None, Embarked=Q, Embarked=S. Now the model actually got better. Why is the model more accurate now?
train accuracy: 0.806
test accuracy: 0.7903
{
'Sex': 0.4862,
'Pclass': -0.1684,
'Fare': 0.0002,
'Age': -0.0056,
'SibSp': -0.0719,
'SibSp=0': -0.0784,
'Parch=0': -0.0269,
'Embarked=C': 0.9179,
'Embarked=None': 1.0522,
'Embarked=Q': 0.9282,
'Embarked=S': 0.8544
}
f. Bring in some more features: CabinType=A, CabinType=B, CabinType=C, CabinType=D, CabinType=E, CabinType=F, CabinType=G, CabinType=None. The model is continuing to get better.
train accuracy: 0.816
test accuracy: 0.8005
{
'Sex': 0.4840,
'Pclass': -0.1313,
'Fare': 0.0003,
'Age': -0.0058,
'SibSp': -0.0724,
'SibSp=0': -0.0823,
'Parch=0': -0.0187,
'Embarked=C': 0.5446,
'Embarked=None': 0.6773,
'Embarked=Q': 0.5522,
'Embarked=S': 0.4829,
'CabinType=A': 0.3830,
'CabinType=B': 0.3360,
'CabinType=C': 0.2686,
'CabinType=D': 0.4311,
'CabinType=E': 0.4973,
'CabinType=F': 0.4679,
'CabinType=G': 0.0858,
'CabinType=None': 0.2634
}
g. Now, introduce CabinType=T. You'll probably see the accuracy go down. I won't include a check because different people will get different results for this one. Why did the accuracy go down?
This is subtle, so I'll give a hint. Look at the entries of $X^TX$ and compare to what the entries looked like before you introduced CabinType=T. The entries get extremely large/small.
So, there are really two questions:
Our shared game implementation is here:
Here is a high-level guide of the process for making changes to our shared repository:
To check out a new branch:
>>> git checkout -b justin-comment
Switched to a new branch 'justin-comment'
Add a comment to yourname-comment.txt
>>> git status
On branch justin-comment
Untracked files:
(use "git add <file>..." to include in what will be committed)
justin-comment.txt
nothing added to commit but untracked files present (use "git add" to track)
Add your changes and commit to your branch
>>> git add justin-comment.txt
>>> git commit -m "create Justin's comment"
[justin-comment 542f30e] create Justin's comment
1 file changed, 1 insertion(+)
create mode 100644 justin-comment.txt
Push to your branch
>>> git push origin justin-comment
Username for 'https://github.com': jpskycak
Password for 'https://jpskycak@github.com':
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 309 bytes | 309.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote:
remote: Create a pull request for 'justin-comment' on GitHub by visiting:
remote: https://github.com/eurisko-us/space-empires-cohort-1/pull/new/justin-comment
remote:
To https://github.com/eurisko-us/space-empires-cohort-1.git
* [new branch] justin-comment -> justin-comment
On GitHub, it will show that your branch is a commit ahead, and possibly even commits behind (if other people have made commits in the time since you first created your branch).
Click "Pull request", and create the pull request. Don't merge it yet, though. We'll do that during class.
For your submission, copy and paste your links into the following template:
overleaf link to explanations: _____
repl.it link to file that prints out
the results of your model (it should
match up with the checks in the
assignment): _____
commit link (machine-learning): ____
I had a chat with Jason this morning about our approach to space empires. We've been building the games separately so that each person gets a maximum learning experience, but now we're at a point where this method of development is becoming so time consuming that it keeps us from making progress down other avenues (neural nets, sql parser, etc). Not to mention, we have a deadline: we need to have level 3 working in 3 weeks for our next meeting with Prof. Wierman.
So, he's ok with the idea of everyone in the class working on the same game implementation. We'll discuss that more during the next class, but for now, hit the brakes on space empires and focus on getting the problems on this current assignment done.
Now that you've had plenty of practice computing weight gradients, let's go back to implementations.
Consider the following dataset, whose points follow the function $y=A \sin (Bx)$ for some constants $A,B.$
[(0, 0.0),
(1, 1.44),
(2, 2.52),
(3, 2.99),
(4, 2.73),
(5, 1.8),
(6, 0.42),
(7, -1.05),
(8, -2.27),
(9, -2.93),
(10, -2.88),
(11, -2.12),
(12, -0.84),
(13, 0.65),
(14, 1.97),
(15, 2.81),
(16, 2.97),
(17, 2.4),
(18, 1.24),
(19, -0.23)]
Consider the following neural network:
$$ \begin{matrix} & & n_2 \\ & & \uparrow \\ & & n_1 \\ & & \uparrow \\ & & n_0 \\ \end{matrix} $$Let the activation functions be as follows: $f_0(x) = x,$ $f_1(x) = \sin(x),$ $f_2(x) = x.$
Then $a_2 = w_{12} \sin( w_{01} i_0 ),$ so we can use this network to fit our function $y=A \sin (Bx).$
Use this neural network to fit the dataset, starting with $w_{01} = w_{12} = 1$ and using a learning rate of $0.001.$ Loop through the dataset $1000$ times, applying a gradient descent update at each point (i.e. $20$ gradient descent updates per loop). So, there will be $20\,000$ gradient descent updates in total.
Your final weights should be $w_{01} = 0.42, w_{12} = 2.83$ rounded to $2$ decimal places.
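If it helps to see the structure of the loop, here's a minimal sketch in plain Python. It assumes squared error $E = (a_2 - y)^2$ (the same convention as our gradient exercises); the variable names are just for illustration, not a required design.

import math

# the dataset from above
data = [(0, 0.0), (1, 1.44), (2, 2.52), (3, 2.99), (4, 2.73), (5, 1.8),
        (6, 0.42), (7, -1.05), (8, -2.27), (9, -2.93), (10, -2.88),
        (11, -2.12), (12, -0.84), (13, 0.65), (14, 1.97), (15, 2.81),
        (16, 2.97), (17, 2.4), (18, 1.24), (19, -0.23)]

w01, w12 = 1.0, 1.0        # initial weights
learning_rate = 0.001

for _ in range(1000):                  # 1000 passes through the dataset
    for x, y in data:                  # one gradient descent update per point
        a2 = w12 * math.sin(w01 * x)             # network output
        dE_da2 = 2 * (a2 - y)                    # derivative of the squared error
        dE_dw12 = dE_da2 * math.sin(w01 * x)
        dE_dw01 = dE_da2 * w12 * math.cos(w01 * x) * x
        w01 -= learning_rate * dE_dw01
        w12 -= learning_rate * dE_dw12

print(round(w01, 2), round(w12, 2))    # should land near 0.42 and 2.83 per the check above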
Here is a log to help you debug. The numbers are rounded to 4 decimal places.
Here are the weight updates worked out for the second data point:
In the Titanic dataset, let's get a sense of how the continuous variables (Age and Fare) relate to Survived.
a. For Age, filter the records down to age categories (0-10, 10-20, 20-30, ..., 70-80) and compute the survival rate (i.e. mean survival) in each category. Exclude any Nones from the analysis.
Put a table in an overleaf document. Round the survival rate to $2$ decimal places (otherwise it's difficult to read.)
In the table, include the counts in parentheses. So each table entry should look like survivalRate (count). So if the survival rate were 0.13 and the count were 27 people, then you'd put 0.13 (27).
What does the table tell you about the relationship between age and survival?
Give a plausible explanation for why this is.
b. For Fare, filter the records down to fare categories (0-5, 5-10, 10-20, 20-50, 50-100, 100-200, 200+) and compute the survival rate (i.e. mean survival) in each category. Exclude any Nones from the analysis.
Update your query method to support ORDER BY. The query
df.query("SELECT selectColname1, selectColname2, selectColname3 ORDER BY orderColname1 order1, orderColname2 order2, orderColname3 order3")
should be parsed and read into the following primitive operations:
df.order_by(orderColname3, order3)
.order_by(orderColname2, order2)
.order_by(orderColname1, order1)
.select([selectColname1, selectColname2, selectColname3])
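Here's one possible parsing sketch (the helper name and return format are just for illustration, not a required design). It pulls out the select columns and the (column, direction) pairs; your query method would then chain the order_by calls in reverse order, followed by select, as shown above.

def parse_select_order_by(query_string):
    # hypothetical helper: "SELECT a, b ORDER BY a ASC, b DESC"
    # -> (['a', 'b'], [('a', 'ASC'), ('b', 'DESC')])
    select_part, order_part = query_string[len("SELECT "):].split(" ORDER BY ")
    select_cols = [col.strip() for col in select_part.split(",")]
    order_terms = []
    for term in order_part.split(","):
        colname, direction = term.strip().split()
        order_terms.append((colname, direction))
    return select_cols, order_terms

parse_select_order_by("SELECT lastname, firstname, age ORDER BY age ASC, firstname DESC")
# (['lastname', 'firstname', 'age'], [('age', 'ASC'), ('firstname', 'DESC')])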
Assert that your method passes the following tests:
>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)
>>> df.query("SELECT lastname, firstname, age ORDER BY age DESC").to_array()
[['Trapp', 'Charles', 17],
['Smith', 'Anna', 13],
['Mendez', 'Sylvia', 9],
['Fray', 'Kevin', 5]]
>>> df.query("SELECT firstname ORDER BY lastname ASC").to_array()
[['Kevin'],
['Sylvia'],
['Anna'],
['Charles']]
Assert that your method passes these tests as well:
>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Melvin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Carl', 'Trapp', 17],
['Anna', 'Smith', 13],
['Hannah', 'Smith', 13],
['Sylvia', 'Mendez', 9],
['Cynthia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)
>>> df.query("SELECT lastname, firstname, age ORDER BY age ASC, firstname DESC").to_array()
[['Fray', 'Melvin', 5],
['Fray', 'Kevin', 5],
['Mendez', 'Sylvia', 9],
['Mendez', 'Cynthia', 9],
['Smith', 'Hannah', 13],
['Smith', 'Anna', 13],
['Trapp', 'Charles', 17],
['Trapp', 'Carl', 17]]
Commit your code to Github.
Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)
For your submission, copy and paste your links into the following template:
repl.it link to neural net implementation that prints out the final weights: _____
overleaf link to titanic analysis: _____
repl.it link to sql parser: _____
link to resolved issue: ____
Commit links (machine-learning): ____
This will be a "consolidation problem." Your task is to make sure that you have Problem 104-1 completed by the end of the weekend, with the exception that you don't have to run your classmates' unit tests. You just have to get movement test 1 working and write your own unit test as assigned in Problem 104-1.
Remember that to initialize your game, you may need to loop through your game state to initialize some Player and Unit objects accordingly. If you get stuck or confused, please post on Slack.
Remember that to push your unit tests up to Github, you'll need to clone the repo, make your changes, and commit and push your changes. Here is how to do this:
Clone the repo:
>>> git clone https://github.com/eurisko-us/space-empires-cohort-1.git
Create your new unit tests. A fast way to make the necessary files is to cd into the desired location and then touch some files, like this:
>>> ls
space-empires-cohort-1
>>> cd space-empires-cohort-1/
>>> ls
README.md slinky_development unit_tests
>>> cd unit_tests/
>>> ls
movement_test_1
>>> mkdir combat_test_1
>>> cd combat_test_1
>>> touch description.txt initial_state.py final_state.py strategies.py
>>> ls
description.txt initial_state.py final_state.py strategies.py
Commit and push your unit tests.
>>> git status
(will show the files you modified)
>>> git add *
(add all the files you modified)
>>> git commit -m "add combat test 1"
>>> git push origin
Check that the repo was updated successfully. Go to https://github.com/eurisko-us/space-empires-cohort-1 and make sure your unit tests are there.
Correct any errors on your quiz (if you got a score under 100%). You can just submit corrected code and/or explanations (you don't have to explain why you got it wrong in the first place).
Remember that we went through the quiz during class, so if you have any questions or need any help, look at the recording first.
Write a C++ program that creates an array {11, 12, 13, 14}
and prints out the memory address of the array and of each element.
Format your output like this:
array has address 0x7fff58f44160
index 0 has value 11 and address 0x7fff58f44160
index 1 has value 12 and address 0x7fff58f44164
index 2 has value 13 and address 0x7fff58f44168
index 3 has value 14 and address 0x7fff58f4416c
Note that your memory addresses will not be the same as those above. (Each time you run the program, the memory addresses will be different.)
Note: If you're having trouble figuring out where to start, remember that we've answered conceptual questions about pointers and the syntax of pointers using this resource:
https://www.learncpp.com/cpp-tutorial/introduction-to-pointers/
Commit your code to Github.
Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)
For your submission, copy and paste your links into the following template:
github link to space empires unit test that you created: ____
link to repl.it file in which you run movement test 1: ____
link to quiz corrections (if applicable): _____
link to c++ problem: _____
link to resolved issue: ____
Commit links (space-empires, assignment-problems): ____
This is the new repo where we'll store our logs, unit tests, and wiki pages:
You should all have write access to the repo.
Each person will create 1 unit test. You can use Colby's sheet for inspiration, or make up your own unit test.
Before you write your unit test, though, check in with the other person who's doing a test for the same phase to make sure that your test is different from theirs.
David: create movement test 2
George: create combat test 1
Colby: create combat test 2
Elijah: create economic test 1
Riley: create economic test 2
Post on slack if you run into any trouble pushing your tests up to the repo.
Clone eurisko-us/space-empires-cohort-1
Create a file to run all the unit tests. You can start making progress on this right away, since movement test 1 already exists.
Once your classmates push their tests, you can run them.
We're going to write a method in our DataFrame called query that will take a string with SQL-like syntax as input and execute the corresponding operations on our dataframe.
Let's start off simple, with the select statement only.
Write a function query that takes a select query of the form
df.query("SELECT colname1, colname2, colname3")
and returns a dataframe with the appropriate select statement applied:
df.select([selectColname1, selectColname2, selectColname3])
Here is a concrete example that you should write a test for:
>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)
>>> df.query('SELECT firstname, age').to_array()
[['Kevin', 5],
['Charles', 17],
['Anna', 13],
['Sylvia', 9]]
Make sure your function is general (it should not be tailored to a specific number of columns).
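A minimal sketch of the parsing step (the column count is not hard-coded; the standalone function name is just for illustration):

def parse_select(query_string):
    # "SELECT firstname, age" -> ['firstname', 'age']
    column_part = query_string[len("SELECT "):]
    return [col.strip() for col in column_part.split(",")]

parse_select("SELECT firstname, age")   # ['firstname', 'age']
# inside DataFrame.query, roughly: return self.select(parse_select(query_string))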
Now that we are able to use our group_by and aggregate methods in our dataframes, let's return to the Titanic dataset.
We now have the following columns in our dataframe, and our current task is to figure out how each of these columns are related to survival (if at all).
[
"Pclass",
"Surname",
"Sex",
"Age",
"SibSp",
"Parch",
"TicketType",
"TicketNumber",
"Fare",
"CabinType",
"CabinNumber",
"Embarked"
]
Let's start with the columns that consist of few categories and are therefore relatively easy to analyze.
Put your answers to the following questions in an overleaf doc. Include a table for each answer, and be sure to explain what the data tells you about how that variable is related to survival (if anything), as well as why you think that relationship happens.
Note that there is not always a single correct answer regarding why the relationship happens, but you should try to come up with a plausible explanation.
To look up what a variable actually represents, check the data dictionary here: https://www.kaggle.com/c/titanic/data
a. Group your dataframe by Pclass and find the survival rate (i.e. the mean of the survival variable) and the count of records for each Pclass.
You should get the following result. What does this result tell you about how Pclass is related to survival? Why do you think this is?
Pclass meanSurvival count
1 0.629630 216
2 0.472826 184
3 0.242363 491
b. Group your dataframe by Sex and find the survival rate and count of records for each sex.
You should get the following result. What does this result tell you about how Sex is related to survival? Why do you think this is?
Sex meanSurvival count
female 0.742038 314
male 0.188908 577
c. Continuing the same analysis method as in parts (a) and (b): what is the table for SibSp, what does it tell you about how SibSp is related to survival, and why do you think this is?
d. Continuing the same analysis method: what is the table for Parch, what does it tell you about how Parch is related to survival, and why do you think this is?
e. Continuing the same analysis method: what is the table for CabinType, what does it tell you about how CabinType is related to survival, and why do you think this is?
f. Continuing the same analysis method: what is the table for Embarked, what does it tell you about how Embarked is related to survival, and why do you think this is?
In case you're interested, here is what we'll be doing in future assignments:
analyzing the continuous variables (Age and Fare)
bringing in Surname, TicketType, etc and seeing if it improves our models
Commit your code to Github.
Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)
For your submission, copy and paste your links into the following template:
github link to space empires unit test that you created: ____
link to repl.it file in which you run the unit tests: ____
link to DataFrame.query test: ____
overleaf writeup for titanic survival exploration: _____
link to resolved issue: ____
Commit links (space-empires, machine-learning): ____
The current primary problem can be to finish up the slinky development thing from last time, and also write a method that initializes the game with a given game state.
And then on Wednesday's assignment, we can write unit tests that @Colby put in the doc (as well as any other unit tests you guys want).
And then after we've got those unit tests working, we can do a couple rounds of slinky development to scale up to the level 3 game that we were trying to implement before. I'm sure we can succeed in getting it done before the next meeting with Prof. Wierman.
Also, remember to watch the lectures that Prof. Wierman put in the chat sometime before our next meeting.
You guys can have 2 more days to work on getting your game to match up with the logs.
Important! The logs have been updated! As of Sunday, the combat shown in the logs was screwed up.
The rule I implemented was that you score a hit if
die roll >= (attack strength) + (attack technology) - (defense strength) - (defense technology)
But actually, it should be that you get a hit if
die roll = 1
or die roll <= (attack strength) + (attack technology) - (defense strength) - (defense technology)
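In code, the corrected rule might look something like this (the function and variable names are just illustrative):

def scores_hit(die_roll, attack_strength, attack_tech, defense_strength, defense_tech):
    # a natural 1 always hits; otherwise the roll must be at most the net attack advantage
    threshold = attack_strength + attack_tech - defense_strength - defense_tech
    return die_roll == 1 or die_roll <= threshold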
I also implemented some additional information suggested by George, such as survivors after combat, and removing Homeworld from the combat order.
So, get your logs to match up with mine from Problem 102-1, and submit your diffs. The logs are in cohort-1/102-pre-level-3-game of the slinky-development repo:
We can do unit tests without too much additional infrastructure. We can just make a method in our game that initializes the game with a given game state, and then we can run the appropriate phase (movement, combat, economic) and make sure the game state afterwards is as we expect it to be.
For this assignment, just make the method that initializes your game with a given game state, and make sure that you will be able to update that state incrementally (i.e. by running the appropriate phase).
The next thing we need to do in our titanic prediction modeling is to determine which features are useful for predicting survival. However, this will involve some extensive data processing, and it will be much easier to do this if we first build some SQL primitives.
You should already have methods select, where, and order_by implemented in your DataFrame class. Check to make sure you have these methods and that they pass the following tests. (If you previously named these methods select_columns and select_rows_where, rename select_columns to just select, and select_rows_where to just where.)
>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)
>>> df.select(['firstname','age']).to_array()
[['Kevin', 5],
['Charles', 17],
['Anna', 13],
['Sylvia', 9]]
>>> df.where(lambda row: row['age'] > 10).to_array()
[['Charles', 'Trapp', 17],
['Anna', 'Smith', 13]]
>>> df.order_by('firstname').to_array()
[['Anna', 'Smith', 13],
['Charles', 'Trapp', 17],
['Kevin', 'Fray', 5],
['Sylvia', 'Mendez', 9]]
>>> df.order_by('firstname', ascending=False).to_array()
[['Sylvia', 'Mendez', 9],
['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13]]
>>> df.select(['firstname','age']).where(lambda row: row['age'] > 10).order_by('age').to_array()
[['Anna', 13],
['Charles', 17]]
At this point, writing a "select-where-order" SQL statement in terms of the primitives seems obvious. Just apply the select, where, and order_by primitives in that order. Right?
Not exactly. The intuitive order only works when the columns referenced in where and order_by also appear in the select statement. So, to carry out a "select-where-order" SQL statement, we really need to apply the primitives in the order where, order_by, select.
A concrete example is shown below.
# this query FAILS because we filtered out the 'age' column
# before applying the where condition, and the where condition
# references the 'age' column
>>> df.select(['firstname']).where(lambda row: row['age'] > 10).order_by('age').to_array()
ERROR
# this query SUCCEEDS because we apply the where condition
# before filtering out the 'age' column
>>> df.where(lambda row: row['age'] > 10).order_by('age').select(['firstname']).to_array()
[['Anna'],
['Charles']]
Your task on this problem is to implement another primitive we will need: group_by. Make sure your implementation passes the test below.
>>> df = DataFrame.from_array(
[
['Kevin Fray', 52, 100],
['Charles Trapp', 52, 75],
['Anna Smith', 52, 50],
['Sylvia Mendez', 52, 100],
['Kevin Fray', 53, 80],
['Charles Trapp', 53, 95],
['Anna Smith', 53, 70],
['Sylvia Mendez', 53, 90],
['Anna Smith', 54, 90],
['Sylvia Mendez', 54, 80],
],
columns = ['name', 'assignmentId', 'score']
)
>>> df.group_by('name').to_array()
[
['Kevin Fray', [52, 53], [100, 80]],
['Charles Trapp', [52, 53], [75, 95]],
['Anna Smith', [52, 53, 54], [50, 70, 90]],
['Sylvia Mendez', [52, 53, 54], [100, 90, 80]],
]
Also, implement a method called aggregate(colname, how) that aggregates colname according to the way that is specified in how (count, max, min, sum, avg). Make sure your implementation passes the tests below.
>>> df.group_by('name').aggregate('score', 'count').to_array()
[
['Kevin Fray', [52, 53], 2],
['Charles Trapp', [52, 53], 2],
['Anna Smith', [52, 53, 54], 3],
['Sylvia Mendez', [52, 53, 54], 3],
]
>>> df.group_by('name').aggregate('score', 'max').to_array()
[
['Kevin Fray', [52, 53], 100],
['Charles Trapp', [52, 53], 95],
['Anna Smith', [52, 53, 54], 90],
['Sylvia Mendez', [52, 53, 54], 100],
]
>>> df.group_by('name').aggregate('score', 'min').to_array()
[
['Kevin Fray', [52, 53], 80],
['Charles Trapp', [52, 53], 75],
['Anna Smith', [52, 53, 54], 50],
['Sylvia Mendez', [52, 53, 54], 80],
]
>>> df.group_by('name').aggregate('score', 'sum').to_array()
[
['Kevin Fray', [52, 53], 180],
['Charles Trapp', [52, 53], 170],
['Anna Smith', [52, 53, 54], 210],
['Sylvia Mendez', [52, 53, 54], 270],
]
>>> df.group_by('name').aggregate('score', 'avg').to_array()
[
['Kevin Fray', [52, 53], 90],
['Charles Trapp', [52, 53], 85],
['Anna Smith', [52, 53, 54], 70],
['Sylvia Mendez', [52, 53, 54], 90],
]
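If you're stuck, here's a sketch of the underlying logic, written over plain arrays rather than as DataFrame methods (the names are just for illustration; adapt it to your own class). It relies on Python dicts preserving insertion order, which gives the row order shown in the tests above.

AGG_FUNCS = {
    'count': len,
    'max': max,
    'min': min,
    'sum': sum,
    'avg': lambda values: sum(values) / len(values),
}

def group_by(rows, columns, group_col):
    # collapse rows with the same group_col value; every other column becomes a list
    group_index = columns.index(group_col)
    other_indices = [i for i in range(len(columns)) if i != group_index]
    grouped = {}
    for row in rows:
        key = row[group_index]
        if key not in grouped:
            grouped[key] = [[] for _ in other_indices]
        for slot, i in enumerate(other_indices):
            grouped[key][slot].append(row[i])
    grouped_columns = [group_col] + [columns[i] for i in other_indices]
    grouped_rows = [[key] + value_lists for key, value_lists in grouped.items()]
    return grouped_rows, grouped_columns

def aggregate(grouped_rows, grouped_columns, colname, how):
    # replace the list in column `colname` with a single aggregated value
    col_index = grouped_columns.index(colname)
    result = []
    for row in grouped_rows:
        new_row = list(row)
        new_row[col_index] = AGG_FUNCS[how](row[col_index])
        result.append(new_row)
    return result

rows, cols = group_by(
    [['Kevin Fray', 52, 100], ['Kevin Fray', 53, 80], ['Anna Smith', 52, 50]],
    ['name', 'assignmentId', 'score'], 'name')
aggregate(rows, cols, 'score', 'avg')
# [['Kevin Fray', [52, 53], 90.0], ['Anna Smith', [52], 50.0]]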
The goal of this problem is to find the number of missing assignments for each student (across all classes) for the following data:
https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/sql-tables/4.sql
This problem will involve the use of subqueries. Since this is our first problem involving subqueries (other than some simple stuff on SQL Zoo), I've scaffolded it a bit for you.
First, write a query to get the number of assignments that were assigned in each class. Let's call this Query 1. (Tip: use "count distinct")
classId numAssigned
2307 3
3110 2
4990 3
Then, get the number of assignments that each student has completed in each class. Let's call this Query 2. (Tip: group by both studentId and classId)
studentId classId numCompleted
1 2307 3
1 3110 2
1 4990 2
2 2307 2
2 3110 2
2 4990 3
3 2307 1
3 3110 2
3 4990 1
4 2307 3
4 3110 1
4 4990 3
5 2307 1
5 3110 2
5 4990 3
Join the results of queries 1 and 2 so that you can compute each student's number of missing assignments. (Tip: use queries 1 and 2 as subqueries)
studentId classId numMissing
1 2307 0
1 3110 0
1 4990 1
2 2307 1
2 3110 0
2 4990 0
3 2307 2
3 3110 0
3 4990 2
4 2307 0
4 3110 1
4 4990 0
5 2307 2
5 3110 0
5 4990 0
Then, use the previous query to find the total number of missing assignments.
name totalNumMissing
Franklin Walton 1
Sylvia Sanchez 1
Harry Ng 4
Ishmael Smith 1
Kinga Shenko 2
Commit your code to Github.
Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)
For your submission, copy and paste your links into the following template:
Repl.it link to code that generates your space-empires logs: ____
Link to diff that shows your logs are the same as the given logs: ____
Repl.it link to group_by and aggregate tests: ____
sqltest.net link: ____
Resolved issue: _____
Commit links (space-empires, machine-learning): ____
Note: We have a meeting with Prof. Wierman on Monday from 11:30am-12:30pm. Put this on your calendar and set some kind of alarm so you don't forget. I'll paste the meeting link in Slack when it's time. Also, please prepare to turn on your video for this meeting (it's a more formal setting).
The Space Empires game is posing a bit of a challenge in that we need a "log of truth" to match up against when we're reconciling games. It's inefficient when we all try to reconcile at the same time, because that's an $n \times n$ problem. It's also not feasible for me to create the log of truth due to the magnitude of the additional time commitment that would be needed to code up the game.
I talked to Jason last night about this, and he had a pretty good idea. What we can do instead is repeatedly have 2 people create logs for some strategy matchup, reconcile their logs to form the log of truth, and then have the rest of the class reconcile against that log of truth. Then, 2 new people will create the next log of truth, resulting in a slinky-like effect. (So we'll refer to this as "slinky development".)
I'll start off the first round. Check out cohort-1/102-pre-level-3-game of the slinky-development repo:
The rules for the game are in rules.txt and the logs show the simulation results for several random seeds.
Your task is to replicate these logs with your game, exactly the way they appear in the repo, using the exact strategies in strategies.py. Note that a tab is written as \t. (But be sure to post on Slack if you think there are any errors.)
Then, copy your logs into https://www.diffchecker.com/ to verify that your logs match up perfectly with the logs in the repo. You'll save/submit the link to your diffs (example: https://www.diffchecker.com/57HDK3vO) along with a link to the code that you used to generate your logs.
The first step towards building our models is deciding which independent variables to include in our model (i.e. which variables might be useful for predicting survival?). There is a data dictionary at https://www.kaggle.com/c/titanic/data that describes what each variable means. Here are the first couple rows, for reference:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
Some variables will not be useful in our model:
PassengerId is just the row number of the dataset. It has nothing to do with the actual properties of passengers. We can discard it.
Other variables may not be useful as-is, but they may be useful after some additional processing:
Name has too many categories to be useful in its entirety. However, the surname alone may be useful, given that passengers in the same family likely stuck together and took similar paths leading to survival or death.
Ticket appears to be formatted as a ticket type and ticket number. If we split those up into two variables (ticket type and ticket number), then we may be able to find some use in those.
Cabin appears to be formatted as a cabin type and cabin number. If we split those up into two variables, then we may be able to find some use in those.
Other variables seem like they may be useful with minimal processing: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked.
Your task is to split Name, Ticket, and Cabin into the sub-variables mentioned above (Surname, TicketType, TicketNumber, CabinType, CabinNumber). Next time, we'll analyze all the variables to determine how much they tell us about survival, but for now, let's just worry about creating those sub-variables that we want to investigate.
(We'll also keep Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked, but these variables won't need to be split like Name, Ticket, and Cabin do, so we don't need to worry about them right now.)
Note: In the following problems, your dataframe method apply will be useful (see problem 28-2), and so will Python's split method (https://www.geeksforgeeks.org/python-string-split/).
a. Get the Surname from Name. In the way the names are formatted, it appears that the surname always consists of the characters preceding the first comma.
b. Split Cabin into CabinType and CabinNumber, e.g. the cabin B51 has type B and number 51.
If you look at the dataset, you'll see that Cabin sometimes has multiple cabin numbers, e.g. B51 B53 B55. The cabin types appear to all be the same, while the cabin number is incremented by a small amount for each cabin. So, we can get a decent approximation by just considering the first entry (in the case of B51 B53 B55, we'll just consider B51).
Keep CabinType as a string but set CabinNumber to be an integer. (You may wish to write a method in your DataFrame that converts a column to a desired type.)
c. Split Ticket into TicketType and TicketNumber, e.g. the ticket SOTON/O.Q. 3101312 has type SOTON/O.Q. and number 3101312.
Watch out! Some tickets don't have a type, so it would be None. For example, the ticket 19877 would have type None and number 19877.
Keep TicketType as a string but set TicketNumber to be an integer.
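Here's a sketch of per-value helpers you could pass to your apply method (the names are made up for illustration; the splitting conventions follow the example output shown below, e.g. the ticket type is everything before the final space):

def get_surname(name):
    # '"Braund, Mr. Owen Harris"' -> 'Braund'
    return name.strip('"').split(",")[0]

def split_cabin(cabin):
    # 'B51 B53 B55' -> ('B', 51); 'C85' -> ('C', 85); None -> (None, None)
    if cabin is None:
        return None, None
    first = cabin.split(" ")[0]                      # keep only the first cabin listed
    cabin_type = first[0]
    cabin_number = int(first[1:]) if len(first) > 1 else None
    return cabin_type, cabin_number

def split_ticket(ticket):
    # 'A/5 21171' -> ('A/5', 21171); '19877' -> (None, 19877)
    parts = ticket.rsplit(" ", 1)                    # split at the last space, if any
    if len(parts) == 1:
        return None, int(parts[0])
    return parts[0], int(parts[1])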
Here's an example of what the output should look like. First, read in the data as usual:
>>> from somefile import parse_line
>>> data_types = {
"PassengerId": int,
"Survived": int,
"Pclass": int,
"Name": str,
"Sex": str,
"Age": float,
"SibSp": int,
"Parch": int,
"Ticket": str,
"Fare": float,
"Cabin": str,
"Embarked": str
}
>>> df = DataFrame.from_csv("data/dataset_of_knowns.csv", data_types=data_types, parser=parse_line)
>>> df.columns
["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
>>> df.to_array()[:5]
[[1, 0, 3, '"Braund, Mr. Owen Harris"', "male", 22.0, 1, 0, "A/5 21171", 7.25, None, "S"],
[2, 1, 1, '"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"', "female", 38.0, 1, 0, "PC 17599", 71.2833, "C85", "C"],
[3, 1, 3, '"Heikkinen, Miss. Laina"', "female", 26.0, 0, 0, "STON/O2. 3101282", 7.925, None, "S"]
[4, 1, 1, '"Futrelle, Mrs. Jacques Heath (Lily May Peel)"', "female", 35.0, 1, 0, "113803", 53.1, "C123", "S"]
[5, 0, 3, '"Allen, Mr. William Henry"', "male", 35.0, 0, 0, "373450", 8.05, None, "S"]]
Then, process your df
. You don't have to write generalized code for this part. This can be a one-off thing.
After processing, your dataframe should look like this:
>>> df.columns
["PassengerId", "Survived", "Pclass", "Surname", "Sex", "Age", "SibSp", "Parch", "TicketType", "TicketNumber", "Fare", "CabinType", "CabinNumber", "Embarked"]
>>> df.to_array()[:5]
[[1, 0, 3, "Braund", "male", 22.0, 1, 0, "A/5", 21171, 7.25, None, None, "S"],
[2, 1, 1, "Cumings", "female", 38.0, 1, 0, "PC", 17599, 71.2833, "C", 85, "C"],
[3, 1, 3, "Heikkinen", "female", 26.0, 0, 0, "STON/O2.", 3101282, 7.925, None, None, "S"]
[4, 1, 1, "Futrelle", "female", 35.0, 1, 0, None, 113803, 53.1, "C", 123, "S"]
[5, 0, 3, "Allen", "male", 35.0, 0, 0, None, 373450, 8.05, None, None, "S"]]
Commit your code to Github.
Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)
For your submission, copy and paste your links into the following template:
Repl.it link to code that generates your space-empires logs: ____
Link to diff that shows your logs are the same as the given logs: ____
Repl.it link to titanic dataset processing: ____
Resolved issue: _____
Commit links (space-empires, machine-learning): ____
Per our discussion in class, we'll refactor our level 3 games before returning to resolving discrepancies.
I've updated the wiki
https://github.com/eurisko-us/eurisko-us.github.io/wiki/Space-Empires-Rules-(Cohort-1,-Level-3)
with the following changes:
Eliminate hidden_game_state_for_combat per George's suggestion -- put all that information in combat_state instead.
In the game state:
instead of using an array of players, make it a dictionary where the key is the player number.
store turn_created for all units, as this might make debugging easier
Store a num for each ship. In the game, the units are identified in the form type-number, like scout-2, where the number is only unique to a particular type of ship, i.e. there can be a scout-2 and a destroyer-2. So we should probably take care of that now. And that allows us to keep our units in the form of an array, which is what we had before.
Due to the change in the game state, there are some resulting changes in the format of outputs in the strategy template
If you have any more refactoring ideas or disagree with any of the above refactorings, post on #machine-learning and we can discuss
Also, in the wiki, there's now a section called "Gotchas" where you can write down any subtle rules that you've encountered, that you think others may not have implemented. For example:
During combat, you can't attack a colony until all ships have been destroyed
In Problem 94-1, you needed to create a logistic regressor neural network. Previously, this was a bit difficult because we hadn't had enough practice computing weight gradients. But now, we've had much more practice, so I think it should be within reach.
Make sure this logistic regressor neural net is working. If you managed to get it working on assignment 94, then you can just submit the code that you already wrote. Otherwise, if you didn't manage to get it working before, then your task is to get it working now.
Location: machine-learning/kaggle/titanic/data_loading.py
a. Make an account on Kaggle.com so that we can walk through a Titanic prediction task.
Go to https://www.kaggle.com/c/titanic/data, scroll down to the bottom, and click "download all". You'll get a zip file called titanic.zip.
Upload titanic.zip into machine-learning/kaggle/titanic/data. Then, run unzip machine-learning/kaggle/titanic/data/titanic.zip in the command line to unzip the file.
This gives us 3 files: train.csv, test.csv, and gender_submission.csv. The file train.csv contains data about a bunch of passengers along with whether or not they survived. Our goal is to use train.csv to build a model that will predict the outcome of passengers in test.csv (for which the survival data is not given).
Rename train.csv to dataset_of_knowns.csv, rename test.csv to unknowns_to_predict.csv, and rename gender_submission.csv to predictions_from_gender_model.csv.
b. In your DataFrame, update your method read_csv so that it accepts the following (optional) arguments:
a line parser
a dictionary of data types
If you encounter any empty strings, then save those as None rather than the type given in the dictionary of data types.
>>> from somefile import parse_line
>>> data_types = {
"PassengerId": int,
"Survived": int,
"Pclass": int,
"Name": str,
"Sex": str,
"Age": float,
"SibSp": int,
"Parch": int,
"Ticket": str,
"Fare": float,
"Cabin": str,
"Embarked": str
}
>>> df = DataFrame.from_csv("data/dataset_of_knowns.csv", data_types=data_types, parser=parse_line)
>>> df.columns
["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
>>> df.to_array()[:5]
[[1, 0, 3, '"Braund, Mr. Owen Harris"', "male", 22.0, 1, 0, "A/5 21171", 7.25, None, "S"],
[2, 1, 1, '"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"', "female", 38.0, 1, 0, "PC 17599", 71.2833, "C85", "C"],
[3, 1, 3, '"Heikkinen, Miss. Laina"', "female", 26.0, 0, 0, "STON/O2. 3101282", 7.925, None, "S"]
[4, 1, 1, '"Futrelle, Mrs. Jacques Heath (Lily May Peel)"', "female", 35.0, 1, 0, "113803", 53.1, "C123", "S"]
[5, 0, 3, '"Allen, Mr. William Henry"', "male", 35.0, 0, 0, "373450", 8.05, None, "S"]]
(You don't have to make or resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
repl.it link to logistic neural net: _____
repl.it link titanic data loading: _____
commits: _____
(machine-learning, space-empires)
Announcement: There will be a quiz on Friday. Topics will include SQL, C++, and neural net gradient computations.
Note: I put some information on the wiki here:
https://github.com/eurisko-us/eurisko-us.github.io/wiki/Space-Empires-Rules-(Cohort-1,-Level-3)
If you see any mistakes or any information that you think should be added, post about it on Slack. If your classmates agree, then you can go ahead and edit the wiki entry with your updates.
a. If your strategy assumes that unit arrays are ordered in any particular way, refactor your strategy so that it doesn't. Then, send it to me again so that I can upload it into the submissions folder.
For example, David's strategy intends to take some action with half of its scouts by checking if ship_index % 2 == 1. But there is no guarantee that this will be true for any scouts.
For example, the unit array in one person's game might be
[Scout, Shipyard, Scout, Shipyard, Scout, Shipyard, Scout, Colony]
which would result in no scouts taking the desired action.
On the other hand, the unit array in another person's game might be
[Shipyard, Scout, Shipyard, Scout, Shipyard, Scout, Colony, Scout]
which would result in all scouts taking the desired action.
As our games work right now, we can't assume that the index tells us anything about what type of ship it is. Rather, to check what type of ship it is, we need to look at game_state['players'][self.player_index]['units'][ship_index] to check if it's actually a scout.
If you wanted to send half of your scouts to the enemy, this is how you could do it:
units = game_state['players'][self.player_index]['units']
# indices of all units that are scouts
scout_indices = [i for i, unit in enumerate(units) if unit['type'] == 'Scout']
# keep roughly half of them (the scouts whose unit index is odd)
half_scout_indices = [item for item in scout_indices if item % 2 == 1]
if ship_index in half_scout_indices:
    unit = units[ship_index]
    translation_towards_enemy = get_translation_towards_enemy(unit)
    return translation_towards_enemy
else:
    return (0,0)
b. Make sure you have two different game modes:
"debug mode"
- stop the game whenever a player makes an invalid decision, such as moving out of bounds or trying to buy something that they can't afford.
"competition mode"
- if a player makes an invalid decision, ignore it and move on.
When you're building your strategy, you should use "debug mode"
to make sure your strategy is doing what you intend for it to do.
But when we run the matchups, we should use "competition mode"
. This way, as long as everyone's games are effectively the same, we won't have any discrepancies to debug, and we can spend more time moving the needle forward instead of debugging people's strategies (which isn't a great use of time).
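One lightweight way to support both modes is to route every invalid decision through a single helper, so switching modes is just a flag. This is only a sketch; the class and method names here are hypothetical, not a required design.

class Game:
    def __init__(self, mode='debug'):
        self.mode = mode   # 'debug' or 'competition'

    def handle_invalid_decision(self, message):
        if self.mode == 'debug':
            # stop immediately so the strategy author sees the problem
            raise ValueError(message)
        # in competition mode, silently ignore the invalid decision and move on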
c. Continue debugging level 3. If you're not able to get the debugging done due to it taking a long time, then come to the next class with some ideas for how to speed up the process.
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Note: Next time we do neural networks, we'll switch back to implementing them in code.
Compute $\dfrac{\partial E}{\partial w_{47}},$ $\dfrac{\partial E}{\partial w_{14}},$ and $\dfrac{\partial E}{\partial w_{01}}.$
To check your answer, assume that
$y_\textrm{actual}=1,$
$a_k=k+11$ and $f'_k(i_k) = k+1$ for all $k,$
$w_{ab} = a+b$ for all $a,b.$
You should get the following:
$$\begin{align*} \dfrac{\partial E}{\partial w_{47}} &= 897,600 \\[5pt] \dfrac{\partial E}{\partial w_{14}} &= 156,024,000 \\[5pt] \dfrac{\partial E}{\partial w_{01}} &= 6,925,962,560 \end{align*}$$
Location: machine-learning/kaggle/titanic/parse_line.py
Write a function parse_line that parses a comma-delimited line into its respective entries. For now, return all the entries as strings.
There are a couple "gotchas" to be aware of:
If two commas appear in sequence, it means that the entry between them is empty. So, the line "7.25,,S" would be read as three entries, ['7.25', '', 'S'].
If a comma appears within quotes, then the comma is part of that entry. For example:
the line "'Braund', 'Mr. Owen Harris', male" would be three entries: ["'Braund'", "'Mr. Owen Harris'", 'male']
the line "'Braund, Mr. Owen Harris', male" would be two entries: ["'Braund, Mr. Owen Harris'", 'male']
Here is a template for the recommended implementation:
def parse_line(line):
entries = [] # will be our final output
entry_str = "" # stores the string of the current entry
# that we're building up
inside_quotes = False # true if we're inside quotes
quote_symbol = None # stores the type of quotes we're inside,
# i.e. single quotes "'" or
# double quotes '"'
for char in line:
# if we're at a comma that's not inside quotes,
# store the current entry string. In other words,
# append entry_str to our list of entries and reset
# the value of entry_str
# otherwise, if we're not at a comma or we're at a
# comma that's inside quotes, then keep building up
# the entry string (i.e. append char to entry_str)
# if the char is a single or double quote, and is equal
# to the quote symbol or there is no quote symbol,
# then flip the truth value of inside_quotes and
# change the quote symbol to the current character
# append the current entry string to entries and return entries
Here are some tests:
>>> line_1 = "1,0,3,'Braund, Mr. Owen Harris',male,22,1,0,A/5 21171,7.25,,S"
>>> parse_line(line_1)
['1', '0', '3', "'Braund, Mr. Owen Harris'", 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S']
>>> line_2 = '102,0,3,"Petroff, Mr. Pastcho (""Pentcho"")",male,,0,0,349215,7.8958,,S'
>>> parse_line(line_2)
['102', '0', '3', '"Petroff, Mr. Pastcho (""Pentcho"")"', 'male', '', '0', '0', '349215', '7.8958', '', 'S']
>>> line_3 = '187,1,3,"O\'Brien, Mrs. Thomas (Johanna ""Hannah"" Godfrey)",female,,1,0,370365,15.5,,Q'
>>> parse_line(line_3)
['187', '1', '3', '"O\'Brien, Mrs. Thomas (Johanna ""Hannah"" Godfrey)"', 'female', '', '1', '0', '370365', '15.5', '', 'Q']
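For reference, here's one way to fill in the template above; it follows the comments and passes the three tests. (The one small deviation from the template's wording: the quote symbol is reset to None when we leave quotes, so a later single-quoted entry isn't ignored after a double-quoted one.)

def parse_line(line):
    entries = []            # final output
    entry_str = ""          # current entry being built up
    inside_quotes = False
    quote_symbol = None     # "'" or '"' while inside quotes, else None
    for char in line:
        if char == "," and not inside_quotes:
            # a comma outside quotes ends the current entry
            entries.append(entry_str)
            entry_str = ""
        else:
            # otherwise keep building up the entry (quotes are kept in the entry)
            entry_str += char
        if char in ("'", '"') and (quote_symbol is None or char == quote_symbol):
            inside_quotes = not inside_quotes
            quote_symbol = char if inside_quotes else None
    entries.append(entry_str)   # don't forget the last entry
    return entries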
Read the following:
https://www.learncpp.com/cpp-tutorial/dynamic-memory-allocation-with-new-and-delete/
Then, answer the following questions in an overleaf doc:
What are the differences between static memory allocation, automatic memory allocation, and dynamic memory allocation?
The following statement is false. Correct it.
To dynamically allocate an integer and assign the address to a pointer so we can access it later, we use the syntax
int *ptr{ new int };
. This tells our program to download some new memory from the internet and store a pointer to the new memory.
The following statement is false. Correct it.
The syntax destroy ptr; destroys the dynamically allocated memory that was accessible through ptr. Because it was destroyed, this memory address can no longer be used by the computer in the future.
What does a bad_alloc exception mean?
What is a null pointer? What makes it different from a normal pointer? What can we use it for, that we can't use a normal pointer for?
What is a memory leak, and why are memory leaks bad?
Does the following bit of code cause a memory leak? If so, why?
int value = 5;
int *ptr{ new int{} };
ptr = &value;
Does the following bit of code cause a memory leak? If so, why?
int value{ 5 };
int *ptr{ new int{} };
delete ptr;
ptr = &value;
Does the following bit of code cause a memory leak? If so, why?
int *ptr{ new int{} };
ptr = new int{};
Does the following bit of code cause a memory leak? If so, why?
int *ptr{ new int{} };
delete ptr;
ptr = new int{};
(You don't have to make or resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
Neural net overleaf: _____
repl.it link to parser: _____
C++ overleaf link: _____
commits: _____
(machine-learning, space-empires)
If you haven't already, turn in your Titanic prediction writeup on Canvas. It's important.
Meet with each classmate over the weekend to resolve level 3 discrepancies.
Once you've resolved discrepancies with a classmate, check the corresponding box in the spreadsheet.
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Note: We've been using the symbol $\textrm d$ for our derivative, i.e. $\dfrac{\textrm dE}{\textrm dw_{ij}}.$ However, it would be more clear to write this as a partial derivative, since the error $E$ depends on all of our weights (not just one weight). So we will use the convention $\dfrac{\partial E}{\partial w_{ij}}$ going forward.
Your task: Compute $\dfrac{\partial E}{\partial w_{35}},$ $\dfrac{\partial E}{\partial w_{45}},$ $\dfrac{\partial E}{\partial w_{13}},$ $\dfrac{\partial E}{\partial w_{23}},$ $\dfrac{\partial E}{\partial w_{14}},$ $\dfrac{\partial E}{\partial w_{24}},$ $\dfrac{\partial E}{\partial w_{01}},$ and $\dfrac{\partial E}{\partial w_{02}}$ for the following network. (It's easiest to do it in that order.) Put your work in an Overleaf doc.
$$ \begin{matrix} & n_5 \\ & \nearrow \hspace{1.25cm} \nwarrow \\ n_3 & & n_4 \\ \uparrow & \nwarrow \hspace{1cm} \nearrow & \uparrow \\[-10pt] | & \diagdown \diagup & | \\[-10pt] | & \diagup \diagdown & | \\[-10pt] | & \diagup \hspace{1cm} \diagdown & | \\ n_1 & & n_2\\ & \nwarrow \hspace{1.25cm} \nearrow \\ & n_0 \\ \end{matrix} $$Show ALL your work! (If some work is the same as what you've already wrote down for a previous gradient computation, you can just put dot-dot-dot. But if you get stuck, then go back and write down all intermediate steps.) Also, make sure to use the simplest notation possible (for example, instead of writing $f_k(i_k),$ write $a_k$)
Check your answer by substituting the following values:
$$ y_\textrm{actual}=1 \qquad \begin{matrix} a_0 = 2 \\ a_1 = 3 \\ a_2 = 4 \\ a_3 = 5 \\ a_4 = 6 \\ a_5 = 7 \end{matrix} \qquad \begin{matrix} f_0'(i_0) = 8 \\ f_1'(i_1) = 9 \\ f_2'(i_2) = 10 \\ f_3'(i_3) = 11 \\ f_4'(i_4) = 12 \\ f_5'(i_5)=13 \end{matrix} \qquad \begin{matrix} w_{01} = 14 \\ w_{02} = 15 \\ w_{13} = 16 \\ w_{14} = 17 \\ w_{23} = 18 \\ w_{24} = 19 \\ w_{34} = 20 \\ w_{35} = 21 \\ w_{45} = 22 \end{matrix} $$You should get the following:
$$\begin{align*} \dfrac{\partial E}{\partial w_{35}} &= 780 \\[5pt] \dfrac{\partial E}{\partial w_{45}} &= 936 \\[5pt] \dfrac{\partial E}{\partial w_{13}} &= 108108 \\[5pt] \dfrac{\partial E}{\partial w_{23}} &= 144144 \\[5pt] \dfrac{\partial E}{\partial w_{14}} &= 123552 \\[5pt] \dfrac{\partial E}{\partial w_{24}} &= 164736 \\[5pt] \dfrac{\partial E}{\partial w_{01}} &= 22980672 \\[5pt] \dfrac{\partial E}{\partial w_{02}} &= 28622880 \end{align*}$$
Note: I was going to have us load the Titanic survival data, but I think we need to talk about the parsing algorithm during class beforehand. So, this will need to wait until next week. Instead, we'll do some C++ and SQL.
On sqltest.net, create a sql table by copying the following script:
Then, compute the average assignment score of each student, along with the number of assignments they've completed. List the results from highest average score to lowest average score, and include the full names of the students.
This is what your output should look like:
name avgScore numCompleted
Sylvia Sanchez 95.0000 2
Ishmael Smith 91.2500 4
Franklin Walton 90.0000 1
Kinga Shenko 83.3333 3
Harry Ng 72.5000 4
Observe that the following code can be used to increase the entries in an array by some amount, via a helper function:
# include <iostream>
void incrementArray(int arr[], int length, int amt)
{
for (int i = 0; i < length; i++)
arr[i] += amt;
}
int main()
{
int array[] = {10, 20, 30, 40};
int length = sizeof(array) / sizeof(array[0]);
int amt = 3;
incrementArray(array, length, amt);
for (int i = 0; i < 4; i++)
std::cout << array[i] << " ";
return 0;
}
--- output ---
13 23 33 43
Write a function dotProduct that computes the dot product of two input arrays. (You'll need to include the length as an input, too.)
# include <iostream>
# include <cassert>
// write dotProduct here
int main()
{
int array1[] = {1, 2, 3, 4};
int array2[] = {5, 6, 7, 8};
int length = sizeof(array1) / sizeof(array1[0]);
int ans = dotProduct(array1, array2, length);
std::cout << "Testing...\n";
assert(ans == 70);
std::cout << "Success!";
return 0;
}
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
(You don't have to resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
Neural net overleaf: _____
C++ repl.it link: _____
sqltest.net link: _____
commits: _____
(assignment-problems, space-empires)
Created issue: _____
a. If you haven't already, fix any issues in your level 3 strategy and send it to me. During class, I heard that Colby, George, and Elijah may have things to fix:
Colby -- can't use tactics in level 3
George -- only colonies have "turn created"
Elijah -- can't get "type" from the combat dictionary. Instead, you need to get the player number and unit number from the combat dictionary, and then look up "type" in hidden_game_state_for_combat
.
b. Once everyone has finalized their strategies, simulate the matchups.
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
It sounds like a lot of our models were breaking on the Titanic dataset, which caused the previous assignment to require waaaaaay more debugging time than I had budgeted for. My apologies; I didn't intend for it to become a day-long problem.
You can have another 2 days to finish up your writeup. Let's focus on just getting the writeup done, even if the numbers aren't looking right. If some model breaks, and you're not able to fix it after a couple minutes, just move on to the rest of the models, even if the numbers look wrong.
Over the next several weeks, we'll step through the same modeling process more carefully, one step at a time, one model at a time, fixing any errors that arise along the way. Your task right now is just to run your existing models on the dataset and write up what you get, even if the numbers don't look right. In order to plan out the step-by-step approach, I need to know where we stand right now (i.e. what results you're currently getting for the models).
Submit corrections for any problem you got wrong. Try to do these corrections without looking at the recording of what we went over in class.
You don't have to explain what you got wrong or why. Just send in the correct results.
Put the answers to these questions in an overleaf doc.
In C++, you can think of strings as arrays of numbers that represent characters.
char myString[]{ "hello world" };
int length = sizeof(myString) / sizeof(myString[0]);
for(int i=0; i<length; i++) {
std::cout << myString[i];
}
std::cout << "\n";
std::cout << "the length of this string is " << length;
--- output ---
hello world
the length of this string is 12
Note that the length of the string is always one more than the number of characters (including spaces) in the string. This is because, under the hood, C++ needs to add a "null terminator" to the end of the string so that it knows where the string stops.
So the array contains all the numeric codes of the letters in the string, plus a null terminator at the end (which you don't see when the string is printed out).
Question. Suppose you create an array that contains all the lowercase letters of the English alphabet in alphabetical order. What would the length of this array be? (If your answer is 26, please re-read the paragraphs above.)
b. Read about pointers here: https://www.learncpp.com/cpp-tutorial/introduction-to-pointers/
Then, answer the following questions:
Suppose you use int x{ 5 } to set the variable x to have the value of 5. What is the difference between x and &x?
Suppose you want to make a pointer p that points to the memory address of x (from question 1). How do you initialize p?
Suppose you have
int v{ 5 };
int* ptr{ &v };
Without using the symbol v, what notation can you use to get the value of v? (Hint: get the value stored at the memory address of v)
Suppose you initialize a pointer as an int. Can you use it to point to the memory address of a variable that is a char?
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
(You don't have to resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
Modeling writeup: _____
Overleaf (C++ answers): _____
commit: _____
(machine-learning)
Created issue: _____
This problem is the beginning of some more involved machine learning tasks. To ease the transition, this will be the only problem on this assignment.
This problem is just as important as space empires and neural nets, and the modeling techniques covered will 100% be on future quizzes and the final. Be sure to do this problem well. If you've run into any issues with your space empires simulations, DO THIS PROBLEM FIRST before you go back to space empires.
Make an account on Kaggle.com so that we can walk through a Titanic prediction task.
Go to https://www.kaggle.com/c/titanic/data, scroll down to the bottom, and click "download all". You'll get a zip file called titanic.zip.
Upload titanic.zip into machine-learning/datasets/titanic/. Then, run unzip machine-learning/datasets/titanic/titanic.zip in the command line to unzip the file.
This gives us 3 files: train.csv, test.csv, and gender_submission.csv. The file train.csv contains data about a bunch of passengers along with whether or not they survived. Our goal is to use train.csv to build a model that will predict the outcome of passengers in test.csv (for which the survival data is not given).
IMPORTANT: To prevent confusion, rename train.csv to dataset_of_knowns.csv, rename test.csv to unknowns_to_predict.csv, and rename gender_submission.csv to predictions_from_gender_model.csv.
The file predictions_from_gender_model.csv is an example of predictions from a really, really basic model: if the passenger is female, predict that they survived; if the passenger is male, predict that they did not survive.
To build a model, we will proceed with the following steps:
Feature Selection - deciding which variables we want in our model. This is usually a subset of the original number of features.
Model Selection - ranking our models from best to worst, based on cross-validation performance. (We'll train each model on half the data, use it to predict the other half of the data, and see how accurate it is.)
Submission - taking our best model, training it on the full dataset_of_knowns.csv, running it on unknowns_to_predict.csv, generating a predictions.csv file, and uploading it to Kaggle.com for scoring.
For this problem, you will need to write what you did for each of these steps in an Overleaf doc (kind of like you would in a lab journal). So, open up one now and let's continue.
In your Overleaf doc, create a section called "Feature Selection". Make a bulleted list of all the features along with your justification for using or not using the feature in your model.
Important: There is a data dictionary at https://www.kaggle.com/c/titanic/data that describes what each feature means.
It will be helpful to look at the actual values of the variables as well. For example, here are the first 5 records in the dataset:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
For every feature that you decide to keep, give a possible theory for how the feature may help predict whether a passenger survived.
Example: keep the Age feature because it's likely that younger passengers were given priority when boarding lifeboats.
For every feature that you decide to remove, explain why it 1) is irrelevant to the prediction task, or 2) would take too long to transform into a worthwhile feature.
Example: remove the ticket feature because it's formatted so weirdly (e.g. A/5 21171). It's possible that there may be some information here, but it would take a while to figure out how to turn this into a worthwhile feature that we could actually plug into our model. (There are multiple parts to the ticket number and it's a combination of letters and numbers, so it's not straightforward how to use it.)
Split dataset_of_knowns.csv in half, selecting every other row for training and leaving the leftover rows for testing.
Fit several models to the training dataset:
Linear regressor - if output is greater than or equal to 0.5, predict category 1 (survived); if output is less than 0.5, predict category 0 (didn't survive). You may wish to include interaction terms that you think would be important, since linear regressors do not capture interactions by default.
Logistic regressor - same notes as above (for the linear regressor)
Gini decision tree - conveniently, decision trees predict categorical variables by default ("survived" has 2 categories, 0 and 1), and they also capture interactions by default (so don't include any interaction terms when you feed the data into the Gini tree). We'll try out 2 different models, one with max_depth=5 and another with max_depth=10.
Random forest - same notes as above (for the Gini decision tree). We'll try out 2 different models, one with max_depth=3 and num_trees=1000, and another with max_depth=5 and num_trees=1000.
Naive Bayes - note that you'll need to take any features that are quantitative and re-label their values by categories. By default, you can just use 2 categories: "low" and "high", where a value is labeled as "low" if it's less than or equal to the mean and "high" if it's above the mean.
k-Nearest Neighbors - for any variables that are quantitative, transform them as
$$
x \to \dfrac{x - \min(x)}{\max(x) - \min(x)}
$$
so that they fit into the interval [0,1].
For any variables that are categorical, leave them be. Use a "Manhattan" distance metric (the sum of absolute differences). Note that if a variable is categorical, then the distance between 2 values should be counted as $0$ if they are the same and $1$ if they are different. We'll try 2 different models, k=5 and k=10.
The reason for the Manhattan distance metric instead of the Euclidean distance metric is so that differences between categorical variables do not drastically overpower differences between quantitative variables.
For example, suppose you had two data points (0.2, 0.7, "dog", "red") and (0.5, 0.1, "cat", "red"). Then the distance would be as follows:
distance = |0.2-0.5| + |0.7-0.1| + int("dog"!="cat") + int("red"!="red")
         = 0.3 + 0.6 + 1 + 0
         = 1.9
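A sketch of the two ingredients for the k-nearest-neighbors model, min-max scaling for quantitative columns and the mixed-type Manhattan distance (the function names are just for illustration):

def min_max_scale(values):
    # map a quantitative column into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def manhattan_distance(point_a, point_b):
    # quantitative entries contribute |difference|; categorical entries contribute 0 or 1
    total = 0
    for a, b in zip(point_a, point_b):
        if isinstance(a, str) or isinstance(b, str):
            total += 0 if a == b else 1
        else:
            total += abs(a - b)
    return total

manhattan_distance((0.2, 0.7, "dog", "red"), (0.5, 0.1, "cat", "red"))
# 0.3 + 0.6 + 1 + 0 = 1.9 (up to floating-point rounding)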
Then, use these models to predict survival for the training dataset and the testing dataset separately. Make a table in your Overleaf doc that contains the resulting accuracy rates.
For example, suppose you fit a Gini decision tree with max_depth=10 on the training dataset, and then use it to predict on both the training and testing datasets. You get 98% of the predictions correct on the training dataset and 70% correct on the testing dataset (which, by the way, is an indication that you're overfitting -- your max_depth is probably too high). Then your table looks like this:

Model | Training Accuracy | Testing Accuracy
----------------------------------------------------------
Gini depth 10 | 98% | 70%
To be clear, your table should have 9 rows, one for each model: linear, logistic, Gini depth 5, Gini depth 10, random forest depth 3, random forest depth 5, naive Bayes, 5-nearest-neighbors, 10-nearest-neighbors.
Take your best model (i.e. the one with the highest testing accuracy), re-train it on the entire dataset_of_knowns.csv
, and evaluate its predictions on unknowns_to_predict.csv
.
Save your results as predictions.csv
, and make sure they follow the exact same format as predictions_from_gender_model.csv
.
Make sure that you save the 0's and 1's as integers, not strings. Make sure that you include the PassengerId column, and that the values in the PassengerId column match up exactly with those in predictions_from_gender_model.csv. The only thing that should be different is the values in the Survived column.

Click on the "Submit Predictions" button on the right side of the screen and submit your file predictions.csv
. You should get a screen that looks like the image below, but has your predictions.csv
instead of gender-submissions.csv
. You should hopefully get a higher score than 0.76555
(which is the baseline accuracy of the "all women survive" model).
Take a screenshot of this screen, post it on #machine-learning, and include it in your Overleaf writeup.
Just the Overleaf writeup and a commit link to your machine-learning
repo. That's it.
Once your strategy is finalized, Slack it to me and I'll upload it here.
https://github.com/eurisko-us/eurisko-us.github.io/tree/master/files/strategies/cohort-1/level-3
Then, once everyone's strategies are submitted, I'll make an announcement, and you can download the strategies from the above folder and run all pairwise battles for 100 games.
Go through max_turns=100
before declaring a draw. I think this should run quick enough, since we decreased from 500 games to 100 games, but if any 100-game matchups are taking longer than a couple minutes to run, then post about it and we'll figure something out.
Put your data in the spreadsheet:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Remember to switch the order of the players halfway through the simulation so that each player goes first an equal number of times.
Seed the games: game 1 has seed 1, game 2 has seed 2, and so on. This way, we should all get exactly the same results.
As usual, there will be prizes:
Compute $\dfrac{\textrm dE}{\textrm dw_{34}},$ $\dfrac{\textrm dE}{\textrm dw_{24}},$ $\dfrac{\textrm dE}{\textrm dw_{13}},$ $\dfrac{\textrm dE}{\textrm dw_{12}},$ and $\dfrac{\textrm dE}{\textrm dw_{01}}$ for the following network. (It's easiest to do it in that order.) Put your work in an Overleaf doc.
$$ \begin{matrix} & & n_4 \\ & \nearrow & & \nwarrow \\ n_2 & & & & n_3 \\ & \nwarrow & & \nearrow \\ & & n_1 \\ & & \uparrow \\ & & n_0 \\ \end{matrix} $$

Show ALL your work! Also, make sure to use the simplest notation possible (for example, instead of writing $f_k(i_k),$ write $a_k$)
Check your answer by substituting the following values:
$$ y_\textrm{actual}=1 \qquad \begin{matrix} a_0 = 2 \\ a_1 = 3 \\ a_2 = 4 \\ a_3 = 5 \\ a_4 = 6 \end{matrix} \qquad \begin{matrix} f_0'(i_0) = 7 \\ f_1'(i_1) = 8 \\ f_2'(i_2) = 9 \\ f_3'(i_3) = 10 \\ f_4'(i_4) = 11 \end{matrix} \qquad \begin{matrix} w_{01} = 12 \\ w_{12} = 13 \\ w_{13} = 14 \\ w_{24} = 15 \\ w_{34} = 16 \end{matrix} $$

You should get

$$ \dfrac{\textrm dE}{\textrm d w_{34}} = 550, \qquad \dfrac{\textrm dE}{\textrm d w_{24}} = 440, \qquad \dfrac{\textrm dE}{\textrm d w_{13}} = 52800, \qquad \dfrac{\textrm dE}{\textrm d w_{12}} = 44550, \qquad \dfrac{\textrm dE}{\textrm d w_{01}} = 7031200. $$
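Optional sanity check: here is a short Python sketch (illustrative only, not part of the assignment) that plugs the given values into the chain-rule formulas for this network ($n_0 \to n_1 \to \{n_2, n_3\} \to n_4$) and reproduces the answers above. The variable names are assumptions, not required notation.

# a = activities, fp = activation derivatives, w = weights
y_actual = 1
a  = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6}
fp = {0: 7, 1: 8, 2: 9, 3: 10, 4: 11}
w  = {(0,1): 12, (1,2): 13, (1,3): 14, (2,4): 15, (3,4): 16}

base = 2 * (a[4] - y_actual) * fp[4]                       # dE/d(i_4) = 110
dE_dw34 = base * a[3]                                      # 550
dE_dw24 = base * a[2]                                      # 440
dE_dw13 = base * w[(3,4)] * fp[3] * a[1]                   # 52800
dE_dw12 = base * w[(2,4)] * fp[2] * a[1]                   # 44550
dE_dw01 = base * (w[(2,4)] * fp[2] * w[(1,2)]
                  + w[(3,4)] * fp[3] * w[(1,3)]) * fp[1] * a[0]   # 7031200
print(dE_dw34, dE_dw24, dE_dw13, dE_dw12, dE_dw01)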
Write a recursive function merge
that merges two sorted lists. To do this, you can check the first elements of each list, and make the lesser one the next element, then merge the lists that remain.
merge (x:xs) (y:ys) = if x < y
then _______
else _______
merge [] xs = ____
merge xs [] = ____
main = print(merge [1,2,5,8] [3,4,6,7,10])
-- should return [1,2,3,4,5,6,7,8,10]
On sqltest.net, create a SQL table by copying the following script:
Then, compute the average assignment score of each student. List the results from highest to lowest, along with the full names of the students.
This is what your output should look like:
fullname avgScore
Ishmael Smith 90.0000
Sylvia Sanchez 86.6667
Kinga Shenko 85.0000
Franklin Walton 80.0000
Harry Ng 78.3333
Hint: You'll have to use a join and a group by.
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
(You don't have to resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
neural nets overleaf: _____
put your game results in the spreadsheet (but you don't have to paste the link)
Repl.it link to haskell file: _____
sqltest.net link: _____
commits: _____
(space-empires, assignment-problems)
Created issue: _____
Notation
$n_k$ - the $k$th neuron
$a_k$ - the activity of the $k$th neuron
$i_k$ - the input to the $k$th neuron. This is the weighted sum of activities of the parents of $n_k.$ If $n_k$ has no parents, then $i_k$ comes from the data directly.
$f_k$ - the activation function of the $k$th neuron. Note that in general, we have $a_k = f_k(i_k)$
$w_{k \ell}$ - the weight of the connection $n_k \to n_\ell.$ In your code, this is weights[(k,l)]
.
$E = (y_\textrm{predicted} - y_\textrm{actual})^2$ is the squared error that results from using the neural net to predict the value of the dependent variable, given values of the independent variables
$w_{k \ell} \to w_{k \ell} - \alpha \dfrac{\textrm dE}{\textrm dw_{k\ell}}$ is the gradient descent update, where $\alpha$ is the learning rate
Example
For a simple network $$ \begin{matrix} & & n_2 \\ & \nearrow & & \nwarrow \\ n_0 & & & & n_1,\end{matrix} $$ we have:
$$\begin{align*} y_\textrm{predicted} &= a_2 \\ &= f_2(i_2) \\ &= f_2(w_{02} a_0 + w_{12} a_1) \\ &= f_2(w_{02} f_0(i_0) + w_{12} f_1(i_1) ) \\ \\ \dfrac{\textrm dE}{\textrm dw_{02}} &= \dfrac{\textrm d}{\textrm dw_{02}} \left[ (y_\textrm{predicted} - y_\textrm{actual})^2 \right] \\ &= \dfrac{\textrm d}{\textrm dw_{02}} \left[ (a_2 - y_\textrm{actual})^2 \right] \\ &= 2(a_2 - y_\textrm{actual}) \dfrac{\textrm d}{\textrm dw_{02}} \left[ a_2 - y_\textrm{actual} \right] \\ &= 2(a_2 - y_\textrm{actual}) \dfrac{\textrm d }{\textrm dw_{02}} \left[ a_2 \right] \\ &= 2(a_2 - y_\textrm{actual}) \dfrac{\textrm d }{\textrm dw_{02}} \left[ f_2(i_2) \right] \\ &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) \dfrac{\textrm d }{\textrm dw_{02}} \left[ i_2 \right] \\ &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) \dfrac{\textrm d }{\textrm dw_{02}} \left[ w_{02} a_0 + w_{12} a_1 \right] \\ &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) a_0 \\ \\ \dfrac{\textrm dE}{\textrm dw_{12}} &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) a_1 \end{align*}$$

THE ACTUAL PROBLEM STATEMENT
Compute $\dfrac{\textrm dE}{\textrm dw_{23}},$ $\dfrac{\textrm dE}{\textrm dw_{12}},$ and $\dfrac{\textrm dE}{\textrm dw_{01}}$ for the following network. (It's easiest to do it in that order.) Put your work in an Overleaf doc.
$$ \begin{matrix} n_3 \\ \uparrow \\ n_2 \\ \uparrow \\ n_1 \\ \uparrow \\ n_0 \end{matrix} $$

Show ALL your work! Also, make sure to use the simplest notation possible (for example, instead of writing $f_k(i_k),$ write $a_k$)
Check your answer by substituting the following values:
$$ y_\textrm{actual}=1 \qquad \begin{matrix} a_0 = 2 \\ a_1 = 3 \\ a_2 = 4 \\ a_3 = 5 \end{matrix} \qquad \begin{matrix} f_0'(i_0) = 6 \\ f_1'(i_1) = 7 \\ f_2'(i_2) = 8 \\ f_3'(i_3) = 9 \end{matrix} \qquad \begin{matrix} w_{01} = 10 \\ w_{12} = 11 \\ w_{23} = 12 \end{matrix} $$

You should get

$$ \dfrac{\textrm dE}{\textrm d w_{23}} = 288, \qquad \dfrac{\textrm dE}{\textrm d w_{12}} = 20736, \qquad \dfrac{\textrm dE}{\textrm d w_{01}} = 1064448. $$
Note: On the next couple assignments, we'll do the same exercise with progressively more advanced networks. This problem is relatively simple so that you have a chance to get used to working with the notation.
Finish creating your game level 3 strategy. (See problem 93-1 for a description of game level 3, which you should have implemented by now.) Then, implement the following strategy and run it against your level 3 strategy:
NumbersBerserkerLevel3
- always buys as many scouts as possible, and each time it buys a scout, immediately sends it on a direct route to attack the opponent.

Post on #machine-learning with your strategy's stats against this strategy:
MyStrategy vs NumbersBerserker
- MyStrategy win rate: __%
- MyStrategy loss rate: __%
- draw rate: __%
On the next assignment, we'll have the official matchups.
Write a function calcSum(m,n)
that computes the sum of the matrix product of an ascending $m \times n$ and a descending $n \times m$ array, where the array entries are taken from $\{ 1, 2, ..., mn \}.$ For example, if $m=2$ and $n=3,$ then
#include <iostream>
#include <cassert>
// define calcSum
int main() {
// write an assert for the test case m=2, n=3
}
On sqltest.net, create the following tables:
CREATE TABLE age (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
lastname VARCHAR(30),
age VARCHAR(30)
);
INSERT INTO `age` (`id`, `lastname`, `age`)
VALUES ('1', 'Walton', '12');
INSERT INTO `age` (`id`, `lastname`, `age`)
VALUES ('2', 'Sanchez', '13');
INSERT INTO `age` (`id`, `lastname`, `age`)
VALUES ('3', 'Ng', '14');
INSERT INTO `age` (`id`, `lastname`, `age`)
VALUES ('4', 'Smith', '15');
INSERT INTO `age` (`id`, `lastname`, `age`)
VALUES ('5', 'Shenko', '16');
CREATE TABLE name (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
firstname VARCHAR(30),
lastname VARCHAR(30)
);
INSERT INTO `name` (`id`, `firstname`, `lastname`)
VALUES ('1', 'Franklin', 'Walton');
INSERT INTO `name` (`id`, `firstname`, `lastname`)
VALUES ('2', 'Sylvia', 'Sanchez');
INSERT INTO `name` (`id`, `firstname`, `lastname`)
VALUES ('3', 'Harry', 'Ng');
INSERT INTO `name` (`id`, `firstname`, `lastname`)
VALUES ('4', 'Ishmael', 'Smith');
INSERT INTO `name` (`id`, `firstname`, `lastname`)
VALUES ('5', 'Kinga', 'Shenko');
Then, write a query to get the full names of the people, along with their ages, in alphabetical order of last name. The output should look like this:
Harry Ng is 14.
Sylvia Sanchez is 13.
Kinga Shenko is 16.
Ishmael Smith is 15.
Franklin Walton is 12.
Tip: You'll need to use string concatenation and a join.
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
(You don't have to resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
Overleaf: _____
Repl.it link to C++ file: _____
sqltest.net link: _____
assignment-problems commit: _____
space-empires commit: _____
Created issue: _____
Reconcile remaining discrepancies in game level 2 so we can crown the winners:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Then, write your first custom strategy for the level 3 game. We'll start matchups on Wednesday. We'll go through several rounds of matchups on this level since the game is starting to become richer.
(We'll have the same extra credit prizes for 1st / 2nd / 3rd place)
Note: In decide_which_unit_to_attack
, be sure to use 'player'
and 'unit'
instead of 'player_index'
and 'unit_index'
.
# combat_state is a dictionary in the form coordinates : combat_order
# {
# (1,2): [{'player': 1, 'unit': 0},
# {'player': 0, 'unit': 1},
# {'player': 1, 'unit': 1},
# {'player': 1, 'unit': 2}],
# (2,2): [{'player': 2, 'unit': 0},
# {'player': 3, 'unit': 1},
# {'player': 2, 'unit': 1},
# {'player': 2, 'unit': 2}]
# }
Make sure you get this problem done completely. Neural nets have a very steep learning curve and they're going to be sticking with us until the end of the semester.
a. Given $\sigma(x) = \dfrac{1}{1+e^{-x}},$ prove that $\sigma'(x) = \sigma(x) (1-\sigma(x)).$ Write this proof in an Overleaf doc.
b. In neural networks, neurons are often given "activation functions", where
node.activity = node.activation_function(weighted sum of inputs to node)
In this problem, you'll extend your neural net to include activation functions. Then, you'll equip the neurons with activations so as to implement a logistic regressor.
>>> weights = {(0,2): -0.1, (1,2): 0.5}
>>> def linear_function(x):
return x
>>> def linear_derivative(x):
return 1
>>> def sigmoidal_function(x):
return 1/(1+math.exp(-x))
>>> def sigmoidal_derivative(x):
s = sigmoidal_function(x)
return s * (1 - s)
>>> activation_types = ['linear', 'linear', 'sigmoidal']
>>> activation_functions = {
'linear': {
'function': linear_function,
'derivative': linear_derivative
},
'sigmoidal': {
'function': sigmoidal_function,
'derivative': sigmoidal_derivative
}
}
>>> nn = NeuralNetwork(weights, activation_types, activation_functions)
>>> data_points = [
{'input': [1,0], 'output': [0.1]},
{'input': [1,1], 'output': [0.2]},
{'input': [1,2], 'output': [0.4]},
{'input': [1,3], 'output': [0.7]}
]
>>> for i in range(1,10001):
err = 0
for data_point in data_points:
nn.update_weights(data_point)
err += nn.calc_squared_error(data_point)
if i < 5 or i % 1000 == 0:
print('iteration {}'.format(i))
print(' gradient: {}'.format(nn.calc_gradient(data_point)))
print(' updated weights: {}'.format(nn.weights))
print(' error: {}'.format(err))
print()
iteration 1
gradient: {(0, 2): 0.03184692266577955, (1, 2): 0.09554076799733865}
updated weights: {(0, 2): -0.10537885784041535, (1, 2): 0.4945789883636697}
error: 0.40480006957774683
iteration 2
gradient: {(0, 2): 0.031126202300065627, (1, 2): 0.09337860690019688}
updated weights: {(0, 2): -0.11072951375555531, (1, 2): 0.48919868238711295}
error: 0.3989945995186133
iteration 3
gradient: {(0, 2): 0.030367826123201307, (1, 2): 0.09110347836960392}
updated weights: {(0, 2): -0.11605116651884796, (1, 2): 0.4838609744178689}
error: 0.3932640005281893
iteration 4
gradient: {(0, 2): 0.029572207383720784, (1, 2): 0.08871662215116236}
updated weights: {(0, 2): -0.12134303561025003, (1, 2): 0.4785677220228999}
error: 0.3876106111541695
iteration 1000
gradient: {(0, 2): -0.04248103992359947, (1, 2): -0.12744311977079842}
updated weights: {(0, 2): -1.441870816044744, (1, 2): 0.6320712307086241}
error: 0.03103391055967604
iteration 2000
gradient: {(0, 2): -0.026576913835657988, (1, 2): -0.07973074150697396}
updated weights: {(0, 2): -1.8462575194764488, (1, 2): 0.8112377281576201}
error: 0.010469324799663702
iteration 3000
gradient: {(0, 2): -0.019389915442213898, (1, 2): -0.058169746326641694}
updated weights: {(0, 2): -2.0580006793189596, (1, 2): 0.903267622168482}
error: 0.004993174823452696
iteration 4000
gradient: {(0, 2): -0.01536481706566838, (1, 2): -0.04609445119700514}
updated weights: {(0, 2): -2.187017035077964, (1, 2): 0.9588032475551099}
error: 0.002982405174006053
iteration 5000
gradient: {(0, 2): -0.012858896793162088, (1, 2): -0.038576690379486266}
updated weights: {(0, 2): -2.2717393677429842, (1, 2): 0.995065996436664}
error: 0.00211991513136444
iteration 6000
gradient: {(0, 2): -0.011201146193726709, (1, 2): -0.033603438581180124}
updated weights: {(0, 2): -2.3298248394321606, (1, 2): 1.0198377357361068}
error: 0.0017156674543843792
iteration 7000
gradient: {(0, 2): -0.010062009597155228, (1, 2): -0.030186028791465685}
updated weights: {(0, 2): -2.370740520022862, (1, 2): 1.037244660012689}
error: 0.0015153961429219282
iteration 8000
gradient: {(0, 2): -0.009259319779522148, (1, 2): -0.027777959338566444}
updated weights: {(0, 2): -2.400083365137227, (1, 2): 1.0497070597284772}
error: 0.0014124679719747604
iteration 9000
gradient: {(0, 2): -0.008683873946383038, (1, 2): -0.026051621839149115}
updated weights: {(0, 2): -2.4213875864199608, (1, 2): 1.058744505427183}
error: 0.0013582149901490035
iteration 10000
gradient: {(0, 2): -0.00826631063707707, (1, 2): -0.024798931911231212}
updated weights: {(0, 2): -2.4369901278483534, (1, 2): 1.065357551487286}
error: 0.001329102258719855
>>> nn.weights
should be close to
{(0,2): -2.44, (1,2): 1.07}
because the data points all lie approximately on the sigmoid
output = 1/(1 + e^(-(input[0] * -2.44 + input[1] * 1.07)) )
Super Important: You'll have to update your gradient descent to account for the activation functions. This will require using the chain rule. In our case, we'll have
squared_error = (y_predicted - y_actual)^2
d(squared_error)/d(weights)
= 2 (y_predicted - y_actual) d(y_predicted - y_actual)/d(weights)
= 2 (y_predicted - y_actual) [ d(y_predicted)/d(weights) - 0]
= 2 (y_predicted - y_actual) d(y_predicted)/d(weights)
y_predicted
= nodes[2].activity
= nodes[2].activation_function(nodes[2].input)
= nodes[2].activation_function(
weights[(0,2)] * nodes[0].activity
+ weights[(1,2)] * nodes[1].activity
)
= nodes[2].activation_function(
weights[(0,2)] * nodes[0].activation_function(nodes[0].input)
+ weights[(1,2)] * nodes[1].activation_function(nodes[1].input)
)
d(y_predicted)/d(weights[(0,2)])
= nodes[2].activation_derivative(nodes[2].input)
* d(nodes[2].input)/d(weights[(0,2)])
= nodes[2].activation_derivative(nodes[2].input)
* d(weights[(0,2)] * nodes[0].activity + weights[(1,2)] * nodes[1].activity)/d(weights[(0,2)])
= nodes[2].activation_derivative(nodes[2].input)
* nodes[0].activity
by the same reasoning as above:
d(y_predicted)/d(weights[(1,2)])
= nodes[2].activation_derivative(nodes[2].input)
* nodes[1].activity
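To make the idea concrete, here is a minimal sketch (assuming linear input neurons and a sigmoidal output neuron, as in the example above; the function names are illustrative and this is not the required class interface) of how the activation derivative enters the gradient:

import math

def sigmoidal_function(x):
    return 1 / (1 + math.exp(-x))

def sigmoidal_derivative(x):
    s = sigmoidal_function(x)
    return s * (1 - s)

def calc_gradient(weights, data_point):
    a0, a1 = data_point['input']                      # linear input neurons: a_k = i_k
    i2 = weights[(0,2)] * a0 + weights[(1,2)] * a1    # input to the output neuron
    y_predicted = sigmoidal_function(i2)
    y_actual = data_point['output'][0]
    common = 2 * (y_predicted - y_actual) * sigmoidal_derivative(i2)
    return {(0,2): common * a0, (1,2): common * a1}

print(calc_gradient({(0,2): -0.1, (1,2): 0.5}, {'input': [1,0], 'output': [0.1]}))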
Note: If no activation_functions
variable is passed in, then assume all activation functions are linear.
Write a class HashTable
that generalizes the hash table you previously wrote. This class should store an array of buckets, and the hash function should add up the alphabet indices of the input string and mod the result by the number of buckets.
>>> ht = HashTable(num_buckets = 3)
>>> ht.buckets
[[], [], []]
>>> ht.hash_function('cabbage')
2 (because 2+0+1+1+0+6+4 mod 3 = 14 mod 3 = 2)
>>> ht.insert('cabbage', 5)
>>> ht.buckets
[[], [], [('cabbage',5)]]
>>> ht.insert('cab', 20)
>>> ht.buckets
[[('cab', 20)], [], [('cabbage',5)]]
>>> ht.insert('c', 17)
>>> ht.buckets
[[('cab', 20)], [], [('cabbage',5), ('c',17)]]
>>> ht.insert('ac', 21)
>>> ht.buckets
[[('cab', 20)], [], [('cabbage',5), ('c',17), ('ac', 21)]]
>>> ht.find('cabbage')
5
>>> ht.find('cab')
20
>>> ht.find('c')
17
>>> ht.find('ac')
21
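Hint (optional): Python's built-in ord can help with the alphabet indices. A quick illustration, assuming lowercase-only input:

# 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
print(sum(ord(c) - ord('a') for c in 'cabbage') % 3)   # 14 mod 3 = 2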
This is a really quick problem, mostly just getting you to learn the ropes of the process we'll be using for doing SQL problems going forward (now that we're done with SQL Zoo).
On https://sqltest.net/, create a table with the following script:
CREATE TABLE people (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(30) NOT NULL,
age VARCHAR(50)
);
INSERT INTO `people` (`id`, `name`, `age`)
VALUES ('1', 'Franklin', '12');
INSERT INTO `people` (`id`, `name`, `age`)
VALUES ('2', 'Sylvia', '13');
INSERT INTO `people` (`id`, `name`, `age`)
VALUES ('3', 'Harry', '14');
INSERT INTO `people` (`id`, `name`, `age`)
VALUES ('4', 'Ishmael', '15');
INSERT INTO `people` (`id`, `name`, `age`)
VALUES ('5', 'Kinga', '16');
Then select all teenage people whose names do not start with a vowel, and order by oldest first.
In order to run the query, you need to click the "Select Database" dropdown in the very top-right corner (so top-right that it might partially run off your screen) and select MySQL 5.6.
This is what your result should be:
id name age
5 Kinga 16
3 Harry 14
2 Sylvia 13
Copy the link where it says "Link for sharing your example:". This is what you'll submit for your assignment.
There will be a quiz on Friday over things that we've done with C++, Haskell, SQL, and Neural Nets.
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
(You don't have to resolve any issues on this assignment)
For your submission, copy and paste your links into the following template:
Repl.it link to custom level 3 strategy: ____
Overleaf link to proof of derivative of sigmoid: ____
Repl.it link to neural network: ____
Repl.it link to hash table: ____
SQLtest.net link: ____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
Commit link for machine-learning repo: _____
Created issue: _____
Reconcile highlighted discrepancies:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Implement game level 3
Regular (repeated) economic phases -- once every turn
Change the starting CP back to 0 (now that we have repeated economic phases, we no longer need the extra CP boost at the beginning).
3 movement rounds on each turn
7x7 board - starting positions are now (3,0) and (3,6)
Since we had to postpone the neural net problem, you can use the extra time to begin implementing your custom player for the level 3 game (we'll have the level 3 battles soon).
Location: assignment-problems/hash_table.py
Under the hood, Python dictionaries are hash tables.
The most elementary (and inefficient) version of a hash table would be a list of tuples. For example, if we wanted to implement the dictionary {'a': [0,1], 'b': 'abcd', 'c': 3.14}
, then we'd have the following:
list_of_tuples = [('a', [0,1]), ('b', 'abcd'), ('c', 3.14)]
To add a new key-value pair to the dictionary, we'd just append the corresponding tuple to list_of_tuples
, and to look up the value for some key, we'd just loop through list_of_tuples
until we got to the tuple with the key we wanted (and return the value).
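Here is a tiny illustration of that lookup-by-looping idea (just for intuition; the assignment below adds the buckets):

list_of_tuples = [('a', [0,1]), ('b', 'abcd'), ('c', 3.14)]

def lookup(pairs, key):
    # linear search: fine for short lists, slow for long ones
    for k, v in pairs:
        if k == key:
            return v

print(lookup(list_of_tuples, 'b'))   # 'abcd'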
But searching through a long array is very slow. So, to be more efficient, we use several list_of_tuple
s (which we'll call "buckets"), and we use a hash_function
to tell us which bucket to put the new key-value pair in.
Complete the code below to implement a special case of an elementary hash table. We'll expand on this example soon, but let's start with something simple.
array = [[], [], [], [], []] # has 5 empty "buckets"
def hash_function(string):
# return the sum of character indices in the string
# (where "a" has index 0, "b" has index 1, ..., "z" has index 25)
# modulo 5
# for now, let's just assume the string consists of lowercase
# letters with no other characters or spaces
def insert(array, key, value):
# apply the hash function to the key to get the bucket index.
# then append the (key, value) pair to the bucket.
def find(array, key):
# apply the hash function to the key to get the bucket index.
# then loop through the bucket until you get to the tuple with the desired key,
# and return the corresponding value.
Here's an example of how the hash table will work:
>>> print(array)
array = [[], [], [], [], []]
>>> insert(array, 'a', [0,1])
>>> insert(array, 'b', 'abcd')
>>> insert(array, 'c', 3.14)
>>> print(array)
[[('a',[0,1])], [('b','abcd')], [('c',3.14)], [], []]
>>> insert(array, 'd', 0)
>>> insert(array, 'e', 0)
>>> insert(array, 'f', 0)
>>> print(array)
[[('a',[0,1]), ('f',0)], [('b','abcd')], [('c',3.14)], [('d',0)], [('e',0)]]
Test your code as follows:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
for i, char in enumerate(alphabet):
key = 'someletters'+char
value = [i, i**2, i**3]
insert(array, key, value)
for i, char in enumerate(alphabet):
key = 'someletters'+char
output_value = find(array, key)
desired_value = [i, i**2, i**3]
assert output_value == desired_value
Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
https://www.hackerrank.com/challenges/text-processing-in-linux-the-sed-command-3/problem
Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/Using_Null (queries 7, 8, 9, 10)
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Repl.it link to neural network: ____
Repl.it link to hash table: ____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for assignment-problems repo: _____
Commit link for machine-learning repo: _____
Created issue: _____
Resolved issue: _____
a. Once your strategy is finalized, Slack it to me and I'll upload it here.
https://github.com/eurisko-us/eurisko-us.github.io/tree/master/files/strategies/cohort-1/level-2
If your strategy is getting crushed by NumbersBerserker, keep in mind that it's okay to copy NumbersBerserker and then tweak it a little bit with your own spin. Your strategy should have some original component, but it does not need to be 100% original (or even mostly original).
Then, once everyone's strategies are submitted, download the strategies from the above folder and run all pairwise battles for 500 games.
Put your data in the spreadsheet:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Remember to switch the order of the players halfway through the simulation so that each player goes first an equal number of times.
Seed the games: game 1 has seed 1, game 2 has seed 2, and so on. This way, we should all get exactly the same results.
Assuming our games match up so that we can actually agree about who won, there will be prizes:
b. Time for an introduction to neural nets! In this problem, we'll create a really simple neural network that is essentially a "neural net"-style implementation of linear regression. We'll start off with something simple and familiar, but we'll implement much more advanced models in the near future.
Note: It seems like we need to merge our graph
library into our machine-learning
library. So, let's do that. The src folder of
your machine-learning
library should now look like this:
src/
- models/
- linear_regressor.py
- neural_network.py
- ...
- graphs/
- weighted_graph.py
- ...
(If you have a better idea for the structure of our library, feel free to do it your way and bring it up for discussion during the next class)
Create a NeuralNetwork
class that inherits from your weighted graph class. Pass in a dictionary of weights to determine the connectivity and initial weights.
>>> weights = {(0,2): -0.1, (1,2): 0.5}
>>> nn = NeuralNetwork(weights)
This is a graphical representation of the model:
nodes[2] ("output layer")
^ ^
/ \
weights[(0,2)] weights[(1,2)]
^ ^
/ \
nodes[0] nodes[1] ("input layer")
To make a prediction, our simple neural net computes a weighted sum of the input values. (Again, this will become more involved in the future, but let's not worry about that just yet.)
>>> nn.predict([1,3])
1.4
behind the scenes:
assign nodes[0] a value of 1 and nodes[1] a value of 3,
and then return the following:
weights[(0,2)] * nodes[0].value + weights[(1,2)] * nodes[1].value
= -0.1 * 1 + 0.5 * 3
= 1.4
If we know the output that's supposed to be associated with a given input, we can compute the error in the prediction.
We'll use the squared error, so that we can frame the problem of fitting the neural network as "choosing weights which minimize the squared error".
To find the weights which minimize the squared error, we can perform gradient descent. As we'll see in the future, calculating the gradient of the weights can get a little tricky (it requires a technique called "backpropagation"). But for now, you can just hard-code the process for this particular network.
>>> data_point = {'input': [1,3], 'output': [7]}
>>> nn.calc_squared_error(data_point)
31.36 [ because (7-1.4)^2 = 5.6^2 = 31.36 ]
>>> nn.calc_gradient(data_point)
{(0,2): -11.2, (1,2): -33.6}
behind the scenes:
squared_error = (y_actual - y_predicted)^2
d(squared_error)/d(weights)
= 2 (y_actual - y_predicted) d(y_actual - y_predicted)/d(weights)
= 2 (y_actual - y_predicted) [ 0 - d(y_predicted)/d(weights) ]
= -2 (y_actual - y_predicted) d(y_predicted)/d(weights)
remember that
y_predicted = weights[(0,2)] * nodes[0].value + weights[(1,2)] * nodes[1].value
so
d(y_predicted)/d(weights[(0,2)]) = nodes[0].value
d(y_predicted)/d(weights[(1,2)]) = nodes[1].value
Therefore
d(squared_error)/d(weights[(0,2)])
= -2 (y_actual - y_predicted) d(y_predicted)/d(weights[(0,2)])
= -2 (y_actual - y_predicted) nodes[0].value
= -2 (7 - 1.4) (1)
= -11.2
d(squared_error)/d(weights[(1,2)])
= -2 (y_actual - y_predicted) d(y_predicted)/d(weights[(1,2)])
= -2 (y_actual - y_predicted) nodes[1].value
= -2 (7 - 1.4) (3)
= -33.6
Once we've got the gradient, we can update the weights using gradient descent.
>>> nn.update_weights(data_point, learning_rate=0.01)
new_weights = old_weights - learning_rate * gradient
= {(0,2): -0.1, (1,2): 0.5}
- 0.01 * {(0,2): -11.2, (1,2): -33.6}
= {(0,2): -0.1, (1,2): 0.5}
+ {(0,2): 0.112, (1,2): 0.336}
= {(0,2): 0.012, (1,2): 0.836}
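If you want to double-check the arithmetic above, here is a quick throwaway script (illustrative only; in your class these computations should live in the methods described above):

weights = {(0,2): -0.1, (1,2): 0.5}
x, y_actual = [1, 3], 7
learning_rate = 0.01

y_predicted = weights[(0,2)] * x[0] + weights[(1,2)] * x[1]         # 1.4
squared_error = (y_actual - y_predicted) ** 2                       # 31.36
gradient = {(0,2): -2 * (y_actual - y_predicted) * x[0],            # -11.2
            (1,2): -2 * (y_actual - y_predicted) * x[1]}            # -33.6
new_weights = {key: w - learning_rate * gradient[key] for key, w in weights.items()}
print(new_weights)   # approximately {(0, 2): 0.012, (1, 2): 0.836}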
If we repeatedly loop through a dataset and update the weights for each data point, then we should get a model whose error is minimized.
Caveat: the minimum will be a local minimum, which is not guaranteed to be a global minimum.
Here is a test case with some data points that are on the line $y=1+2x.$ Our network is set up to fit any line of the form $y = \beta_0 \cdot 1 + \beta_1 \cdot x,$ where $\beta_0 = $ weights[(0,2)]
and $\beta_1=$ weights[(1,2)]
.
Note that this line can be written as
output = 1 * input[0] + 2 * input[1]
In this particular case, the weights should converge to the true values (1
and 2
).
>>> weights = {(0,2): -0.1, (1,2): 0.5}
>>> nn = NeuralNetwork(weights)
>>> data_points = [
{'input': [1,0], 'output': [1]},
{'input': [1,1], 'output': [3]},
{'input': [1,2], 'output': [5]},
{'input': [1,3], 'output': [7]}
]
>>> for _ in range(1000):
for data_point in data_points:
nn.update_weights(data_point)
>>> nn.weights
should be really close to
{(0,2): 1, (1,2): 2}
because the data points all lie on the line
output = input[0] * 1 + input[1] * 2
Once you've got your final weights, post them on #results.
Originally I was going to put the hash table problem here, but I figured we should discuss it in class first. Also, we should do quiz corrections. So it will be on the next assignment instead.
For this assignment, please correct any errors on your quiz (if you got a score under 100%). You'll just need to submit your repl.it links again, with the corrected code.
Remember that we went through the quiz during class, so if you have any questions or need any help, look at the recording first.
Note: Since this quiz corrections problem is much lighter than the usual problem that would go in its place, there will be a couple more Shell and SQL problems than usual.
Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Resources:
https://www.robelle.com/smugbook/regexpr.html
https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Problems:
https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-4/problem
https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-5/problem
https://www.hackerrank.com/challenges/text-processing-in-linux-the-sed-command-1/problem
https://www.hackerrank.com/challenges/text-processing-in-linux-the-sed-command-2/problem
Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/Using_Null (queries 1, 2, 3, 4, 5, 6)
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Repl.it link to neural network: ____
Repl.it links to quiz corrections (if applicable): _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for machine-learning repo: _____
Created issue: _____
Resolved issue: _____
a. Re-run your decision tree on the sex prediction problem. Make 5 train-test splits of 80% train and 20% test, like we originally did. Now that your Gini trees match up from the previous assignment, they should match up here. Also, make sure to propagate any changes in your Gini tree to your random tree. Our random forest results should be pretty close as well.
b. Create a custom strategy for the level 2 game. Test it against NumbersBerserkerLevel2 and FlankerLevel2. On Wednesday's assignment, we'll have our strategies battle against each other.
Put your results in the usual spreadsheet:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Commit your code to Github.
We'll skip reviews on this assignment, to save you a bit of time.
For your submission, copy and paste your links into the following template:
Repl.it link to hash table: _____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
This weekend, your only primary problem is to resolve discrepancies in your Gini decision tree & games (both level 1 and level 2).
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Please be sure to get the game discrepancies resolved, so that we can have our custom level 2 strategies battle next week. Then, I'll let Jason know we're ready to speak with Prof. Wierman about designing optimal strategies for our level 2 game.
At the beginning of the year, we wrote a Python function called simple_sort
that sorts a list by repeatedly finding the smallest element and appending it to a new list.
Now, you will sort a list in C++ using a similar technique. However, because working with arrays in C++ is a bit trickier, we will modify the implementation so that it only involves the use of a single array. The way we do this is by swapping:
For example:
array: [30, 50, 20, 10, 40]
indices to consider: 0, 1, 2, 3, 4
elements to consider: 30, 50, 20, 10, 40
smallest element: 10
swap with first element: [10, 50, 20, 30, 40]
---
array: [10, 50, 20, 30, 40]
indices to consider: 1, 2, 3, 4
elements to consider: 50, 20, 30, 40
smallest element: 20
swap with second element: [10, 20, 50, 30, 40]
---
array: [10, 20, 50, 30, 40]
indices to consider: 2, 3, 4
elements to consider: 50, 30, 40
smallest element: 30
swap with third element: [10, 20, 30, 50, 40]
...
final array: [10, 20, 30, 40, 50]
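For reference, here is the same single-array swapping idea written out in Python (the assignment itself asks for C++):

array = [30, 50, 20, 10, 40]
for i in range(len(array)):
    # find the index of the smallest element among the remaining positions
    min_index = i
    for j in range(i + 1, len(array)):
        if array[j] < array[min_index]:
            min_index = j
    # swap it into position i
    array[i], array[min_index] = array[min_index], array[i]
print(array)   # [10, 20, 30, 40, 50]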
Write your code in the template below.
# include <iostream>
# include <cassert>
int main()
{
int array[5]{ 30, 50, 20, 10, 40 };
// your code here
std::cout << "Testing...\n";
assert(array[0]==10);
assert(array[1]==20);
assert(array[2]==30);
assert(array[3]==40);
assert(array[4]==50);
std::cout << "Succeeded";
return 0;
}
Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Resources:
https://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/
Problems:
https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-1/problem
https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-2/problem
https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-3/problem
Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/More_JOIN_operations (queries 13, 14, 15)
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____
Created issue: _____
Resolved issue: _____
On this problem, we'll do some debugging based on the results from our spreadsheet:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
a. Compare your results to your classmates' results for Indices of misclassified data points (zero-indexed: the index of the first data point in the dataset would be index 0). If you and a classmate have different results, do some pair debugging to figure out what caused the difference and how you guys need to reconcile it.
b. Compare your results to your classmates' results for Flanker vs Berserker | Simulate 10 games with random seeds 1-10; list game numbers on which Flanker wins. If you and a classmate have different results, do some pair debugging to figure out what caused the difference and how you guys need to reconcile it.
c. Modify your level 2 game so that each player starts with 4 shipyards in addition to 3 scouts. (If a player doesn't start out with shipyards, then the NumbersBerserker
strategy can't actually do what it's intended to do.)
Then, re-run the game level 2 matchups and put your results in the sheet (put them on the sheet for the current assignment, #89).
d. Make the following adjustment to your random forest:
In your random decision tree, create a training_percentage
parameter that governs the percent of the training data that you actually use to fit the model.
In our case, we have about 70 records, and in each test-train split, we're using 80% as training data, so that's about 56 records. Now, if we set training_percentage = 0.3
, then we randomly choose $0.3 \times 56 \approx 17$ records from the training data to actually fit the decision tree.
When randomly selecting the records, use random selection with replacement. In other words, it's okay to select duplicate data records.
When you initialize the random forest, pass a training_percentage
parameter that, in turn, gets passed to the random decision trees.
The reason why choosing training_percentage < 1
can be useful is that it speeds up the time to train the random forest. Also, it allows different models to get different "perspectives" on the data, thereby creating a more diverse "hive mind" (and higher diversity generally leads to higher performance when it comes to ensemble models, i.e. models consisting of many smaller sub-models).
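For the sampling step, random.choices is one option, since it samples with replacement. A minimal sketch (the function name below is illustrative, not required):

import random

def sample_training_rows(training_data, training_percentage):
    k = round(training_percentage * len(training_data))
    # random.choices samples WITH replacement, so duplicate rows are allowed
    return random.choices(training_data, k=k)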
e. On the sex prediction dataset, train the following models on the first half of the data and test on the second half of the data.
A single random decision tree with max_depth = 4
and training_percentage = 0.3
.
Random forest with 10 trees with max_depth = 4
and training_percentage = 0.3
.
Random forest with 100 trees with max_depth = 4
and training_percentage = 0.3
.
Random forest with 1,000 trees with max_depth = 4
and training_percentage = 0.3
.
Random forest with 10,000 trees with max_depth = 4
and training_percentage = 0.3
.
Paste the accuracy into the spreadsheet.
First, observe the following Haskell code which computes the sum of all the squares under 1000:
>>> sum (takeWhile (<1000) (map (^2) [1..]))
10416
(If you don't see why this works, then run each part of the expression: first map (^2) [1..]
, and then takeWhile (<1000) (map (^2) [1..])
, and then the full expression sum (takeWhile (<1000) (map (^2) [1..]))
.)
Now, recall the Collatz conjecture (if you don't remember it, ctrl+F "collatz conjecture" to jump to the problem where we covered it).
The following Haskell code can be used to recursively generate the sequence or "chain" of Collatz numbers, starting with an initial number n
.
chain :: (Integral a) => a -> [a]
chain 1 = [1]
chain n
| even n = n:chain (n `div` 2)
| odd n = n:chain (n*3 + 1)
Here are the chains for several initial numbers:
>>> chain 10
[10,5,16,8,4,2,1]
>>> chain 1
[1]
>>> chain 30
[30,15,46,23,70,35,106,53,160,80,40,20,10,5,16,8,4,2,1]
Your problem: Write a Haskell function firstNumberWithChainLengthAtLeast n
that finds the first number whose chain length is at least n
.
Check: firstNumberWithChainLengthAtLeast 15
should return 7
.
To see why this check works, observe the first few chains shown below:
1: [1] (length 1)
2: [2,1] (length 2)
3: [3,10,5,16,8,4,2,1] (length 8)
4: [4,2,1] (length 3)
5: [5,16,8,4,2,1] (length 6)
6: [6,3,10,5,16,8,4,2,1] (length 9)
7: [7,22,11,34,17,52,26,13,40,20,10,5,16,8,4,2,1] (length 17)
7 is the first number whose chain is at least 15 numbers long.
Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Helpful resources:
https://www.geeksforgeeks.org/awk-command-unixlinux-examples/
https://www.thegeekstuff.com/2010/02/awk-conditional-statements/
Problems:
Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/More_JOIN_operations (queries 9, 10, 11, 12)
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Repl.it link to Haskell code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____
Created issue: _____
Resolved issue: _____
There will be a 45-minute quiz that you can take any time on Thursday. (We don't have school Friday.)
The quiz will cover C++ and Haskell.
For C++, you will need to be comfortable working with arrays.
For Haskell, you'll need to be comfortable working with list comprehensions and compositions of functions.
You will need to write C++ and Haskell functions to calculate some values. It will be somewhat similar to the meta-Fibonacci sum problem, except the computation will be different (and simpler).
This is the results spreadsheet that you'll paste your results into: https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
On the sex prediction dataset, train a Gini decision tree on the first half of the data and test on the second half of the data.
(If there's an odd number of data points, then round so that the first half of the data will have one more record than the second half)
Paste your prediction accuracy into the spreadsheet, along with the indices of any misclassified data points (zero-indexed: the index of the first data point in the dataset would be index 0).
Note the following from the rulebook:
(7.6) There is no limit to the number of Ship Yards that may occupy the same system
(8.2.2) Building Ship Yards: Ship Yards may only be built at planets that produced income (not new colonies) in the Economic Phase. Ship Yards may be purchased and placed at multiple planets, but no more than one per planet. Additional Ship Yards may be purchased at those planets in future Economic Phases. Ship Yards are produced by the Colony itself and therefore do not require Ship Yards to build them.
So, only a colony can buy a shipyard, and only once per economic phase. Shipyards cannot build other shipyards. If a colony with existing shipyards builds a shipyard, the building of the new shipyard does not affect how much hullsize the other shipyards can build on that turn. That is to say, building a shipyard at a colony uses up CP but not hullsize building capacity.
Problem: Using your level 1 game, simulate 20 games of Flanker vs Berserker with random seeds 1 through 20.
Define the random seed at the beginning of the game. Make sure your die rolls match up with those shown in the demonstration below.
Let Flanker go first on games 1-10, and let Berserker go first on games 11-20.
For each of the 20 games, store the game log in space-empires/logs/21-02-05-flanker-vs-berserker.txt
. In the game log, on each turn, you should log any ship movements, any battle locations, the combat order, the dice rolls on each attack during combat, and whether or not each attack resulted in a hit.
In the spreadsheet, paste the game numbers on which the Flanker won. For example, if Flanker won on games 2, 3, 5, 8, 9, 13, 15, 19, then you'd paste 2, 3, 5, 8, 9, 13, 15, 19
into the spreadsheet.
Check the game numbers you pasted in against those of your classmates. Any discrepancy corresponds to a game on which you and your classmate had different outcomes. So, for any discrepancies, inspect your game logs against your classmate's, and figure out where your game logs started to differ.
import random
import math
for game_num in range(1,6):
random.seed(game_num)
first_few_die_rolls = [math.ceil(10*random.random()) for _ in range(7)]
print('first few die rolls of game {}'.format(game_num))
print('\t',first_few_die_rolls,'\n')
---
first few die rolls of game 1
[2, 9, 8, 3, 5, 5, 7]
first few die rolls of game 2
[10, 10, 1, 1, 9, 8, 7]
first few die rolls of game 3
[3, 6, 4, 7, 7, 1, 1]
first few die rolls of game 4
[3, 2, 4, 2, 1, 5, 10]
first few die rolls of game 5
[7, 8, 8, 10, 8, 10, 1]
Implement toggles that you can use to set level 2 of the game:
Change initial CP to 10. So really, the players start with 10 CP, and then get 20 CP income, for a total of 30 CP that they're able to spend on ships / technology / maintenance.
Allow players to buy technology (but as for ships -- they can still only buy scouts)
Have 1 economic phase and that's it.
In the level 2 game, we will have matchups between several strategies.
NumbersBerserkerLevel2
- spends all its CP buying more scouts. This Berserker thinks that the best way to win is to bring in a bunch of unskilled reinforcements. Sends all the scouts directly towards the enemy home base.
MovementBerserkerLevel2
- buys movement technology first and then buys another scout. Then sends all the scouts directly towards the enemy home base.
AttackBerserkerLevel2
- buys attack technology first and then buys another scout. Then sends all the scouts directly towards the enemy home base.
DefenseBerserkerLevel2
- buys defense technology first and then buys another scout. Then sends all the scouts directly towards the enemy home base.
FlankerLevel2
- buys movement technology then buys another scout. Then uses that fast scout to perform the flanking maneuver.
Perform 1000 simulations for each matchup, just like you did with level 1. Remember to randomize the die rolls and switch who goes first at game 500. Put your results in the spreadsheet.
When doing the 1000 simulations, set random.seed(game_num)
like you are now doing with the level 1 game. This way, we'll be able to backtrack any discrepancies to the individual game number.
Implement the metaFibonacciSum
function in C++:
# include <iostream>
# include <cassert>
int metaFibonacciSum(int n)
{
// return the result immediately if n<2
// otherwise, construct an array called "terms"
// that contains the Fibonacci terms at indices
// 0, 1, ..., n
// construct an array called "extendedTerms" that
// contains the Fibonacci terms at indices
// 0, 1, ..., a_n (where a_n is the nth Fibonacci term)
// when you fill up this array, many of the terms can
// simply be copied from the existing "terms" array. But
// if you need additional terms, you'll have to compute
// them the usual way (by adding the previous 2 terms)
// then, create an array called "partialSums" that
// contains the partial sums S_0, S_1, ..., S_{a_n}
// finally, add up the desired partial sums,
// S_{a_0} + S_{a_1} + ... + S_{a_n},
// and return this result
}
int main()
{
std::cout << "Testing...\n";
assert(metaFibonacciSum(6)==74);
std::cout << "Success!";
return 0;
}
Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Helpful resource: https://www.geeksforgeeks.org/awk-command-unixlinux-examples/
Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/More_JOIN_operations (queries 5, 6, 7, 8)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Repl.it link to space-empires/logs: ___
Repl.it link to space empires game lvl 1 simulation runner: ___
Repl.it link to space empires game lvl 2 simulation runner: ___
Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Created issue: _____
Resolved issue: _____
This is the results spreadsheet that you'll paste your results into: https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
a.
It's possible that unit 0 might be a colony (not a scout), which would be problematic for the current implementation of the Flanker
strategy. Fix the implementation so that the flanking unit is chosen as the first scout, not just the unit at index 0
(since there is no guarantee this is a scout). Check your game logs to make sure that the scout is actually doing the flanking, as intended.
Re-specify hidden_game_state_for_combat
. Currently, it shows all of the opponent's units, but it should really only show those units involved in the particular combat that's taking place.
hidden_game_state_for_combat
- like hidden_game_state
, but reveal the type / hits_left / technology of the opponent's ships that are in the particular combat.

Run 1000 random simulations for each of the following matchups. Remember to have both strategies get an equal number of games as the player who goes first. Paste your results into the spreadsheet.
Note that there should be no ties. (If you're getting a tie, post on Slack so we can clear up what's going wrong.)
Make sure you're using a 10-sided die.
b. Re-run the sex prediction problem (Problem 77-1) and paste your results in the spreadsheet. Now that our decision trees and random forests are passing tests, we should get very similar accuracy results.
c. Submit quiz corrections -- say what you got wrong, why you got it wrong, what the correct answer is, and why it's correct.
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
Let $a_k$ be the $k$th Fibonacci number and let $S_k$ be the sum of the first $k$ Fibonacci numbers. Write a function metaFibonacciSum
that takes an input $n$ and computes the sum $$ \sum\limits_{k=0}^n S_{a_k}. $$
For example, if we wanted to compute the result for n=6
, then we'd need to
compute the Fibonacci numbers $a_0$ through $a_6$: $$ a_0=0, a_1=1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8 $$
compute the partial sums $S_0$ through $S_8$ (we need them up through $S_8$ since $a_6 = 8$): $$ \begin{align*} S_0 &= 0 \\ S_1 &= 0 + 1 = 1 \\ S_2 &= 0 + 1 + 1 = 2 \\ S_3 &= 0 + 1 + 1 + 2 = 4 \\ S_4 &= 0 + 1 + 1 + 2 + 3 = 7 \\ S_5 &= 0 + 1 + 1 + 2 + 3 + 5 = 12 \\ S_6 &= 0 + 1 + 1 + 2 + 3 + 5 + 8 = 20 \\ S_7 &= 0 + 1 + 1 + 2 + 3 + 5 + 8 + 13 = 33 \\ S_8 &= 0 + 1 + 1 + 2 + 3 + 5 + 8 + 13 + 21 = 54 \\ \end{align*} $$
Add up the desired sums:
$$ \begin{align*} \sum\limits_{k=0}^6 S_{a_k} &= S_{a_0} + S_{a_1} + S_{a_2} + S_{a_3} + S_{a_4} + S_{a_5} + S_{a_6} \\ &= S_{0} + S_{1} + S_{1} + S_{2} + S_{3} + S_{5} + S_{8} \\ &= 0 + 1 + 1 + 2 + 4 + 12 + 54 \\ &= 74 \end{align*} $$

Here's a template:
-- first, define a recursive function "fib"
-- to compute the nth Fibonacci number
-- once you've defined "fib", proceed to the
-- steps below
firstKEntriesOfSequence k = -- your code here; should return the list [a_0, a_1, ..., a_k]
kthPartialSum k = -- your code here; returns a single number
termsToAddInMetaSum n = -- your code here; should return the list [S_{a_0}, S_{a_1}, ..., S_{a_n}]
metaSum n = -- your code here; returns a single number
main = print (metaSum 6) -- should come out to 74
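If you want to sanity-check the value 74 before writing the Haskell, here is a quick Python check of the arithmetic in the worked example above (optional; it is not a substitute for the Haskell solution, and the function names are illustrative):

def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def partial_sum(k):
    # S_k = a_0 + a_1 + ... + a_k, following the worked example above
    return sum(fib(i) for i in range(k + 1))

def meta_fibonacci_sum(n):
    return sum(partial_sum(fib(k)) for k in range(n + 1))

print(meta_fibonacci_sum(6))   # 74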
Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Helpful resource: https://www.theunixschool.com/2012/07/10-examples-of-paste-command-usage-in.html
https://www.hackerrank.com/challenges/paste-1/problem
https://www.hackerrank.com/challenges/paste-2/problem
Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/More_JOIN_operations (queries 1, 2, 3, 4)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
REMEMBER TO PASTE YOUR RESULTS IN HERE:
https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Quiz corrections: ____
Repl.it link to Haskell code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
Created issue: _____
Resolved issue: _____
Primary problems; 60% of assignment grade; 90 minutes estimate
a. Assert that your decision trees pass some tests. (They likely will, so this problem will likely only take 10 minutes or so; I just want to make sure we're all clear before we go back to improving our random forest, modeling real-world datasets, and moving on to neural nets.)
(i) Assert that BOTH your gini decision tree and random decision tree pass the following test.
Create a dataset consisting of 100 points $$ \Big\{ (x,y,\textrm{label}) \mid x,y \in \mathbb{Z}, \,\, -5 \leq x,y \leq 5, \,\, xy \neq 0 \Big\}, $$ where $$ \textrm{label} = \begin{cases} \textrm{positive}, \quad x>0, y > 0 \\ \textrm{negative}, \quad \textrm{otherwise} \end{cases} $$ (A short sketch for generating this dataset is included after part (iii) below.)
Predict the label of this dataset. Train on 100% of the data and test on 100% of the data.
You should get an accuracy of 100%.
You should have exactly 2 splits
Note: Your tree should look exactly like one of these:
split y=0
/ \
y < 0 y > 0
pure neg split x=0
/ \
x < 0 x > 0
pure neg pure pos
.
or
.
split x=0
/ \
x < 0 x > 0
pure neg split y=0
/ \
y < 0 y > 0
pure neg pure pos
(ii) Assert that your gini decision tree passes Tests 1,2,3,4 from problem 84-1.
(iii) Assert that your random forest with 10 trees passes Tests 1,2,3,4 from problem 84-1.
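Here is the short sketch (optional) for generating the 100-point dataset described in part (i):

dataset = []
for x in range(-5, 6):
    for y in range(-5, 6):
        if x * y != 0:   # exclude points on either axis
            label = 'positive' if (x > 0 and y > 0) else 'negative'
            dataset.append((x, y, label))
assert len(dataset) == 100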
b. Run each Level1
player against each other for 100
random games. Do this in space-empires/analysis/level_1_matchups.py
. Then, post your results to #results:
(I've included FlankerStrategyLevel1
at the bottom of this problem.)
Simulation results for 100 games, level 1:
Random vs Dumb:
- Random wins __% of the time
- Dumb wins __% of the time
Berserker vs Dumb:
- Berserker wins __% of the time
- Dumb wins __% of the time
Berserker vs Random:
- Berserker wins __% of the time
- Random wins __% of the time
Flanker vs Random:
- Flanker wins __% of the time
- Random wins __% of the time
Flanker vs Berserker:
- Flanker wins __% of the time
- Berserker wins __% of the time
Important simulation requirements:
Use (at least) 100 simulated games to generate the win percentages for each matchup. This way, we can guarantee that we should all get similar win percentages.
Use actual random rolls (not just increasing or decreasing rolls). We want each of the 100 simulated games to occur under different rolling conditions.
Randomize who goes first. For example, in Flanker vs Berserker, Flanker should go first on 50 games and Berserker should go first on 50 games.
Use a 10-sided die (this is what's used in the official game)
class FlankerStrategyLevel1:
# Sends 2 of its units directly towards the enemy. home colony
# Sends 1 unit slightly to the side to avoid any combat
# that happens on the direct path between home colonies.
def __init__(self, player_index):
self.player_index = player_index
self.flank_direction = (1,0)
def decide_ship_movement(self, unit_index, hidden_game_state):
myself = hidden_game_state['players'][self.player_index]
opponent_index = 1 - self.player_index
opponent = hidden_game_state['players'][opponent_index]
unit = myself['units'][unit_index]
x_unit, y_unit = unit['coords']
x_opp, y_opp = opponent['home_coords']
translations = [(0,0), (1,0), (-1,0), (0,1), (0,-1)]
# unit 0 does the flanking
if unit_index == 0:
dist = abs(x_unit - x_opp) + abs(y_unit - y_opp)
delta_x, delta_y = self.flank_direction
reverse_flank_direction = (-delta_x, -delta_y)
# at the start, sidestep
if unit['coords'] == myself['home_coords']:
return self.flank_direction
# at the end, reverse the sidestep to get to enemy
elif dist == 1:
return reverse_flank_direction
# during the journey to the opponent, don't
# reverse the sidestep
else:
translations.remove(reverse_flank_direction)
best_translation = (0,0)
smallest_distance_to_opponent = 999999999999
for translation in translations:
delta_x, delta_y = translation
x = x_unit + delta_x
y = y_unit + delta_y
dist = abs(x - x_opp) + abs(y - y_opp)
if dist < smallest_distance_to_opponent:
best_translation = translation
smallest_distance_to_opponent = dist
return best_translation
def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index):
# attack opponent's first ship in combat order
combat_order = combat_state[coords]
player_indices = [unit['player_index'] for unit in combat_order]
opponent_index = 1 - self.player_index
for combat_index, unit in enumerate(combat_order):
if unit['player_index'] == opponent_index:
return combat_index
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
a. Skim the following section of http://learnyouahaskell.com/higher-order-functions.
Function composition
Consider the function $$ f(x,y) = \max \left( x, -\tan(\cos(y)) \right) $$
This function can be implemented as
>>> f x y = max x (negate (tan (cos y)))
or, we can implement it using function composition notation as follows:
>>> f x = max x . negate . tan . cos
Note that although max
is a function of two variables, max x
is a function of one variable (since one of the inputs is already supplied). So, we can chain it together with other single-variable functions.
Previously, you wrote a function tail'
in Tail.hs
that finds the last n
elements of a list by reversing the list, taking the first n
elements of the reversed list, and then reversing the result.
Rewrite the function tail'
using composition notation, so that it's cleaner. Run Tail.hs
again to make sure it still gives the same output as before.
b. Write a function isPrime
that determines whether a nonnegative integer x
is prime. You can use the same approach that you did with one of our beginning Python problems: loop through numbers between 2
and x-1
and see if you can find any factors.
Note that neither 0
nor 1
is prime.
Here is a template for your file isPrime.cpp
:
#include <iostream>
#include <cassert>
bool isPrime(int x)
{
// your code here
}
int main()
{
assert(!isPrime(0));
assert(!isPrime(1));
assert(isPrime(2));
assert(isPrime(3));
assert(!isPrime(4));
assert(isPrime(5));
assert(isPrime(7));
assert(!isPrime(9));
assert(isPrime(11));
assert(isPrime(13));
assert(!isPrime(15));
assert(!isPrime(16));
assert(isPrime(17));
assert(isPrime(19));
assert(isPrime(97));
assert(!isPrime(99));
assert(isPrime(13417));
std::cout << "Success!";
return 0;
}
Your program should work like this
>>> g++ isPrime.cpp -o isPrime
>>> ./isPrime
Success!
c. Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Here's a reference to the sort
command: https://www.thegeekstuff.com/2013/04/sort-files/
Note that the "tab" character must be specified as $'\t'
.
These problems are super quick, so we'll do several.
https://www.hackerrank.com/challenges/text-processing-sort-5/tutorial
https://www.hackerrank.com/challenges/text-processing-sort-6/tutorial
https://www.hackerrank.com/challenges/text-processing-sort-7/tutorial
d. Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/The_JOIN_operation (queries 12, 13)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to Github.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Repl.it link to machine-learning/tests/test_random_forest.py: _____
Repl.it link to space-empires/analysis/level_1_matchups.py: _____
Repl.it link to Haskell code: _____
Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for machine-learning repo: _____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
Created issue: _____
Resolved issue: _____
Primary problems; 60% of assignment grade; 90 minutes estimate
a. Make the following updates to your game:
put "board_size"
as an attribute in the game_state
. We've got our grid size set as $(5,5)$ for now (i.e. a $5 \times 5$ grid)
Player loses game when their home colony is destroyed
Ships cannot move diagonally
For most of the strategy functions, the input will need to be a partially-hidden game state. There are two types of hidden game states in particular:
hidden_game_state_for_combat
- has all the information except for planet locations and the opponent's CP
hidden_game_state
- has all the information except for planet locations, the opponent's CP, and the type / hits_left / technology of the opponent's units. (The opponent's units are still in array form, and you can see their locations, but that's it -- you don't know anything else about them.)
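As a rough illustration (not required code -- the exact keys you strip should match your own game_state), the two hidden views might be produced like this:
import copy

def make_hidden_game_state_for_combat(game_state, for_player_index):
    # Hide planet locations and the opponent's CP.
    hidden = copy.deepcopy(game_state)
    hidden.pop('planets', None)
    for index, player in enumerate(hidden['players']):
        if index != for_player_index:
            player.pop('cp', None)
    return hidden

def make_hidden_game_state(game_state, for_player_index):
    # Additionally hide everything about the opponent's units except coords.
    hidden = make_hidden_game_state_for_combat(game_state, for_player_index)
    for index, player in enumerate(hidden['players']):
        if index != for_player_index:
            player['units'] = [{'coords': unit['coords']} for unit in player['units']]
    return hidden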
b.
Some background...
We need to get some sort of game competition going in the upcoming week so that you guys have something to work on with Prof. Wierman from Caltech. Jason and I talked a bit and came to the conclusion that we need to get something working as soon as possible, even if it doesn't use all the stuff we've implemented so far. Plus, the main thing that will be of interest to Prof. Wierman is the types of algorithms you guys are using in your strategies (he won't care if the game doesn't have all the features we want -- it just needs to be rich enough to permit some different strategies).
So, let's focus on a very limited type of game and gradually expand it after we get it working. The first type of game we'll consider will be subject to the following constraints.
(i) Implement optional arguments in your game that "switch off" some of the parts when they are set to False
:
There are only 2 planets: one for each home colony. That's it. Switch off the part that creates additional planets.
Players start with 3 scouts and their home colony and that's it. No colonyships. No shipyards. Just switch off the line where you give the player the colonyships / shipyards.
Movement phase consists of just 1 round. Switch off the lines in your game for the other 2 rounds.
There will be no economic phase. Just switch off that line of your game.
Players are not allowed to screen ships. Switch off the line where the game asks the player what ships they want to screen.
So, the game will consist of each player starting with 3 scouts, moving them around the board, having combat whenever they meet, and trying to reach and destroy the opponent's home colony.
Note that nothing we've done is a wasted effort. We're just going to put the other features (technology, other ship types, planets, ship screening, etc) on pause until we get our games working under the simplest constraints. Then, we'll bring all that other stuff back in.
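Here is a minimal sketch of what the switched-off features might look like as optional arguments; the parameter names and constructor signature are made up, so adapt them to your own Game class:
class Game:
    def __init__(self, strategies, board_size=(5, 5),
                 use_extra_planets=False, use_economic_phase=False,
                 num_movement_rounds=1, allow_screening=False):
        # Each flag defaults to the "level 1" setting described above;
        # re-enable features later by passing True / larger values.
        self.strategies = strategies
        self.board_size = board_size
        self.use_extra_planets = use_extra_planets
        self.use_economic_phase = use_economic_phase
        self.num_movement_rounds = num_movement_rounds
        self.allow_screening = allow_screening
Inside the game loop, each switched-off feature is then just guarded by its flag (for example, only run the economic phase if use_economic_phase is True).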
(ii) I've included code for three strategies at the bottom of this problem. The strategies are named with Level1
at the end because this is like the "level 1" version of our game. We'll make a level 2 version in the next week, and then a level 3 version, and so on, until we've re-introduced all the features we've been working on.
DumbStrategyLevel1
- sends all of its units to the right
RandomStrategyLevel1
- moves its units randomly
BerserkerStrategyLevel1
- sends all of its units directly towards the enemy home colony.
Write some tests for these strategies:
If a BerserkerStrategyLevel1
plays against a DumbStrategyLevel1
, the BerserkerStrategyLevel1
wins, and in the final game state each player still has 3 scouts. Announce on Slack once you have this test working.
If a BerserkerStrategyLevel1
plays against a BerserkerStrategyLevel1
, there is a winner, and in the final game state one player has 0 scouts. Announce on Slack once you have this test working.
If a BerserkerStrategyLevel1
plays against a RandomStrategyLevel1
, the BerserkerStrategyLevel1
should win the majority of the time. (To test this, run 100 games and compute how many times BerserkerStrategyLevel1
wins.) Announce on Slack once you have this test working.
(iii) Write a custom strategy called YournameStrategyLevel1
. In an Overleaf doc, explain why you think your strategy will defeat DumbStrategyLevel1
and RandomStrategyLevel1
. Explain why you think it might beat BerserkerStrategyLevel1
.
(iv) Make sure your custom strategy passes the following tests.
Make sure it defeats the DumbStrategyLevel1
all the time. Announce on Slack once you have this test working.
Make sure it defeats RandomStrategyLevel1
the majority of the time. Announce on Slack once you have this test working.
Try to have your strategy defeat BerserkerStrategyLevel1
the majority of the time, too. (To test this, run 100 games and compute how many times BerserkerStrategyLevel1
wins.) Announce on Slack if you get this test working.
class DumbStrategyLevel1:
# Sends all of its units to the right
def __init__(self, player_index):
self.player_index = player_index
def decide_ship_movement(self, unit_index, hidden_game_state):
myself = hidden_game_state['players'][self.player_index]
unit = myself['units'][unit_index]
x_unit, y_unit = unit['coords']
board_size_x, board_size_y = hidden_game_state['board_size']
unit_is_at_edge = (x_unit == board_size_x-1)
if unit_is_at_edge:
return (0,0)
else:
return (1,0)
def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index):
# attack opponent's first ship in combat order
combat_order = combat_state[coords]
player_indices = [unit['player_index'] for unit in combat_order]
opponent_index = 1 - self.player_index
for combat_index, unit in enumerate(combat_order):
if unit['player_index'] == opponent_index:
return combat_index
class RandomStrategyLevel1:
# Moves its units randomly (note: uses the random module, so the file needs `import random`)
def __init__(self, player_index):
self.player_index = player_index
def decide_ship_movement(self, unit_index, hidden_game_state):
myself = hidden_game_state['players'][self.player_index]
unit = myself['units'][unit_index]
x_unit, y_unit = unit['coords']
translations = [(0,0), (1,0), (-1,0), (0,1), (0,-1)]
board_size_x, board_size_y = hidden_game_state['board_size']
while True:
translation = random.choice(translations)
delta_x, delta_y = translation
x_new = x_unit + delta_x
y_new = y_unit + delta_y
if 0 <= x_new and 0 <= y_new and x_new <= board_size_x-1 and y_new <= board_size_y-1:
return translation
def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index):
# attack opponent's first ship in combat order
combat_order = combat_state[coords]
player_indices = [unit['player_index'] for unit in combat_order]
opponent_index = 1 - self.player_index
for combat_index, unit in enumerate(combat_order):
if unit['player_index'] == opponent_index:
return combat_index
class BerserkerStrategyLevel1:
# Sends all of its units directly towards the enemy home colony
def __init__(self, player_index):
self.player_index = player_index
def decide_ship_movement(self, unit_index, hidden_game_state):
myself = hidden_game_state['players'][self.player_index]
opponent_index = 1 - self.player_index
opponent = hidden_game_state['players'][opponent_index]
unit = myself['units'][unit_index]
x_unit, y_unit = unit['coords']
x_opp, y_opp = opponent['home_coords']
translations = [(0,0), (1,0), (-1,0), (0,1), (0,-1)]
best_translation = (0,0)
smallest_distance_to_opponent = 999999999999
for translation in translations:
delta_x, delta_y = translation
x = x_unit + delta_x
y = y_unit + delta_y
dist = abs(x - x_opp) + abs(y - y_opp)
if dist < smallest_distance_to_opponent:
best_translation = translation
smallest_distance_to_opponent = dist
return best_translation
def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index):
# attack opponent's first ship in combat order
combat_order = combat_state[coords]
player_indices = [unit['player_index'] for unit in combat_order]
opponent_index = 1 - self.player_index
for combat_index, unit in enumerate(combat_order):
if unit['player_index'] == opponent_index:
return combat_index
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
a. Skim the following section of http://learnyouahaskell.com/higher-order-functions.
Maps and filters
Pay attention to the following examples:
>>> map (+3) [1,5,3,1,6]
[4,8,6,4,9]
>>> filter (>3) [1,5,3,2,1,6,4,3,2,1]
[5,6,4]
Create a Haskell file SquareSingleDigitNumbers.hs
and write a function squareSingleDigitNumbers
that takes a list and returns the squares of the values that are less than 10.
To check your function, print squareSingleDigitNumbers [2, 7, 15, 11, 5]
. You should get a result of [4, 49, 25]
.
This is a one-liner. If you get stuck for more than 10 minutes, ask for help on Slack.
b. Write a C++ program to calculate the height of a ball that falls from a tower.
constants.h
to hold your gravity constant:#ifndef CONSTANTS_H
#define CONSTANTS_H
namespace myConstants
{
const double gravity(9.8); // in meters/second squared
}
#endif
simulateFall.cpp
#include <iostream>
#include "constants.h"
double calculateDistanceFallen(int seconds)
{
// approximate distance fallen after a particular number of seconds
double distanceFallen = myConstants::gravity * seconds * seconds / 2;
return distanceFallen;
}
void printStatus(int time, double height)
{
std::cout << "At " << time
<< " seconds, the ball is at height "
<< height << " meters\n";
}
int main()
{
using namespace std;
cout << "Enter the initial height of the tower in meters: ";
double initialHeight;
cin >> initialHeight;
// your code here
// use calculateDistanceFallen to find the height now
// use calculateDistanceFallen and printStatus
// to generate the desired output
// if the height now goes negative, then the status
// should say that the height is 0 and the program
// should stop (since the ball stops falling at height 0)
return 0;
}
Your program should work like this
>>> g++ simulateFall.cpp -o simulateFall
>>> ./simulateFall
Enter the initial height of the tower in meters: 100
At 0 seconds, the ball is at height 100 meters
At 1 seconds, the ball is at height 95.1 meters
At 2 seconds, the ball is at height 80.4 meters
At 3 seconds, the ball is at height 55.9 meters
At 4 seconds, the ball is at height 21.6 meters
At 5 seconds, the ball is at height 0 meters
c. Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
Here's a reference to the sort
command: https://www.thegeekstuff.com/2013/04/sort-files/
These problems are super quick, so we'll do several.
https://www.hackerrank.com/challenges/text-processing-sort-1/tutorial
https://www.hackerrank.com/challenges/text-processing-sort-2/tutorial
https://www.hackerrank.com/challenges/text-processing-sort-3/tutorial
https://www.hackerrank.com/challenges/text-processing-sort-4/tutorial
d. Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
https://sqlzoo.net/wiki/The_JOIN_operation (queries 10, 11)
Review; 10% of assignment grade; 15 minutes estimate
Now, everyone should have a handful of issues on their repositories. So we'll go back to making 1 issue and resolving 1 issue.
Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
Resolve 1 GitHub issue on one of your own repositories.
For your submission, copy and paste your links into the following template:
Link to space empires tests with the new strategies: _____
Link to overleaf doc with your custom strategy rationale: _____
Repl.it link to Haskell code: _____
Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
Created issue: _____
Resolved issue: _____
Primary problems; 60% of assignment grade; 90 minutes estimate
a. Implement calc_shortest_path(start_node, end_node)
in your weighted graph.
To do this, you first need to carry out Dijkstra's algorithm to find the d-values.
Then, you need to find the edges for the shortest-path tree. To do this, loop through all the edges (a,b)
, and if the difference in d-values is equal to the weight, i.e. nodes[b].dvalue - nodes[a].dvalue == weight[(a,b)]
, include the edge in your list of edges for the shortest-path tree.
Using your list of edges for the shortest-path tree, create a Graph
object and run calc_shortest_path
on it. By constructing the shortest-path tree, we have reduced the problem of finding the shortest path in a weighted graph to the problem of finding the shortest path in an unweighted graph, which we have already solved.
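Here is a minimal sketch of the edge-selection step described above, assuming you already have a d_values dictionary (node -> d-value) from Dijkstra's algorithm and your weights dictionary; the function name is just illustrative.
def shortest_path_tree_edges(d_values, weights):
    tree_edges = []
    for (a, b), weight in weights.items():
        # Keep the edge when it accounts exactly for the difference in
        # d-values. Depending on how your weights dictionary stores edge
        # directions, you may also want to check the b -> a direction.
        if d_values[b] - d_values[a] == weight:
            tree_edges.append((a, b))
    return tree_edges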
Check your function by carrying out the following tests for the graph given in Problem 83-1.
>>> weighted_graph.calc_shortest_path(8,4)
[8, 0, 3, 4]
>>> weighted_graph.calc_shortest_path(8,7)
[8, 0, 1, 7]
>>> weighted_graph.calc_shortest_path(8,6)
[8, 0, 3, 2, 5, 6]
b. Assert that your random decision tree passes the following tests.
Test 1
Create a dataset consisting of 100 points $$ \Big[ (x,y,\textrm{label}) \mid x,y \in \mathbb{Z}, \,\, -5 \leq x,y \leq 5, \,\, xy \neq 0 \big], $$ where $$ \textrm{label} = \begin{cases} \textrm{positive}, \quad xy > 0 \\ \textrm{negative}, \quad xy < 0 \end{cases} $$
Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 100%.
Test 2
Create a dataset consisting of 150 points $$ \begin{align*} &\Big[ (x,y,\textrm{A}) \mid x,y \in \mathbb{Z}, \,\, -5 \leq x,y \leq 5, \,\, xy \neq 0 \Big] \\ &+ \Big[ (x,y,\textrm{B}) \mid x,y \in \mathbb{Z}, \,\, 1 \leq x,y \leq 5 \Big] \\ &+ \Big[ (x,y,\textrm{B}) \mid x,y \in \mathbb{Z}, \,\, 1 \leq x,y \leq 5 \Big]. \end{align*} $$ This dataset consists of $100$ data points labeled "A" distributed evenly throughout the plane and $50$ data points labeled "B" in quadrant I. Each integer pair in quadrant I will have $1$ data point labeled "A" and $2$ data points labeled "B".
Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 83.3% (25/150 misclassified)
Test 3
Create a dataset consisting of 1000 points $$ \Big[ (x,y,z,\textrm{label}) \mid x,y,z \in \mathbb{Z}, \,\, -5 \leq x,y,z \leq 5, \,\, xyz \neq 0 \big], $$ where $$ \textrm{label} = \begin{cases} \textrm{positive}, \quad xyz > 0 \\ \textrm{negative}, \quad xyz < 0 \end{cases} $$
Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 100%.
Note: These are a lot of data points, but the tree won't need to do many splits, so the code should run quickly. If the code takes a long time to run, it means you've got an issue, and you should post on Slack if you can't figure out why it's taking so long.
Test 4
Create a dataset consisting of 1250 points $$ \begin{align*} &\Big[ (x,y,z,\textrm{A}) \mid x,y,z \in \mathbb{Z}, \,\, -5 \leq x,y,z \leq 5, \,\, xyz \neq 0 \Big] \\ &+ \Big[ (x,y,z,\textrm{B}) \mid x,y,z \in \mathbb{Z}, \,\, 1 \leq x,y,z \leq 5 \Big] \\ &+ \Big[ (x,y,z,\textrm{B}) \mid x,y,z \in \mathbb{Z}, \,\, 1 \leq x,y,z \leq 5 \Big]. \end{align*} $$ This dataset consists of $1000$ data points labeled "A" distributed evenly throughout the eight octants and $250$ data points labeled "B" in octant I. Each integer triple in octant I will have $1$ data point labeled "A" and $2$ data points labeled "B".
Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 90% (125/1250 misclassified)
Note: These are a lot of data points, but the tree won't need to do many splits, so the code should run quickly. If the code takes a long time to run, it means you've got an issue, and you should post on Slack if you can't figure out why it's taking so long.
c. Update your game to use 0
at the head of the prices list for the technologies that start at level 1.
'technology_data': {
# lists containing the price to purchase the next level
'shipsize': [0, 10, 15, 20, 25, 30],
'attack': [20, 30, 40],
'defense': [20, 30, 40],
'movement': [0, 20, 30, 40, 40, 40],
'shipyard': [0, 20, 30]
}
This way, you can do this:
price = game_state['technology_data'][tech_type][level]
instead of this:
if tech_type in ['shipsize', 'movement', 'shipyard']:
price = game_state['technology_data'][tech_type][level-1]
else:
price = game_state['technology_data'][tech_type][level]
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/recursion.
A few more recursive functions
Pay attention to the following example. take n myList
returns the first n
entries of myList
.
take' :: (Num i, Ord i) => i -> [a] -> [a]
take' n _
| n <= 0 = []
take' _ [] = []
take' n (x:xs) = x : take' (n-1) xs
Create a Haskell file Tail.hs
and write a function tail'
that takes a number n and a list, and returns the last n
values of the list.
Here's the easiest way to do this...
Write a helper function reverseList
that reverses a list. This will be a recursive function, which you can define using the following template:
reverseList :: [a] -> [a]
reverseList [] = (your code here -- base case)
reverseList (x:xs) = (your code here -- recursive formula)
Here, x
is the first element of the input list and xs
is the rest of the elements. For the recursive formula, just call reverseList
on the rest of the elements and put the first element of the list at the end. You'll need to use the ++
operation for list concatenation.
Once you've written reverseList
and tested to make sure it works as intended, you can implement tail'
by reversing the input list, calling take'
on the reversed list, and reversing the result.
To check your function, print tail' 4 [8, 3, -1, 2, -5, 7]
. You should get a result of [-1, 2, -5, 7]
.
If you get stuck anywhere in this problem, don't spend a bunch of time staring at it. Be sure to post on Slack. These Haskell problems can be tricky if you're not taking the right approach from the beginning, but after a bit of guidance, it can become much simpler.
Complete these C++/Shell/SQL coding challenges and submit screenshots.
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++
https://www.hackerrank.com/challenges/inheritance-introduction/problem
Shell
https://www.hackerrank.com/challenges/text-processing-tr-1/problem
https://www.hackerrank.com/challenges/text-processing-tr-2/problem
https://www.hackerrank.com/challenges/text-processing-tr-3/problem
Helpful templates:
$ echo "Hello" | tr "e" "E"
HEllo
$ echo "Hello how are you" | tr " " '-'
Hello-how-are-you
$ echo "Hello how are you 1234" | tr -d [0-9]
Hello how are you
$ echo "Hello how are you" | tr -d [a-e]
Hllo how r you
More info on tr
here: https://www.thegeekstuff.com/2012/12/linux-tr-command/
These problems are all very quick. If you find yourself spending more than a couple minutes on these, be sure to ask for help.
SQL
https://sqlzoo.net/wiki/The_JOIN_operation (queries 7, 8, 9)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
For your submission, copy and paste your links into the following template:
Link to weighted graph tests: _____
Link to random decision tree tests: _____
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for graph repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____
Primary problems; 60% of assignment grade; 90 minutes estimate
a. In your random decision tree, make the random split selection a little bit smarter. First, randomly choose a feature (i.e. variable name) to split on. But then, instead of choosing a random split for that feature, choose the optimal split as determined by the Gini metric. (Later, we'll also add a max_depth
parameter plus another speedup trick.)
b. Run your analysis from 77-1 again, now that your random decision tree has been updated. Post your results on #results. (Don't worry about max_depth
yet -- we'll do that in the near future.)
c. Create a strategy class AggressiveStrategy
that buys ships/technology in the same way as CombatPlayer
, but sends all their ships directly upward (or downward) towards the enemy home colony.
This should ideally result in battles in multiple locations on the path between the two home colonies, and there should be an actual winner of the game.
Battle two AggressiveStrategy
players against each other. Post the following on #results:
Ascending die rolls:
- num turns: ___
- num combats: ___
- winner: ___ (Player 0 or Player 1?)
- Player 0 ending CP: ___
- Player 1 ending CP: ___
Descending die rolls:
- num turns: ___
- num combats: ___
- winner: ___ (Player 0 or Player 1?)
- Player 0 ending CP: ___
- Player 1 ending CP: ___
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/syntax-in-functions.
Hello recursion
Maximum awesome
Pay attention to the following example, especially:
maximum' :: (Ord a) => [a] -> a
maximum' [] = error "maximum of empty list"
maximum' [x] = x
maximum' (x:xs)
| x > maxTail = x
| otherwise = maxTail
where maxTail = maximum' xs
Create a Haskell file SmallestPositive.hs
and write a function findSmallestPositive
that takes a list and returns the smallest positive number in the list.
The format will be similar to that shown in the maximum'
example above.
To check your function, print findSmallestPositive [8, 3, -1, 2, -5, 7]
. You should get a result of 2
.
Important: In your function findSmallestPositive
, you will need to compare x
to 0
, which means we must assume that not only can items x
be ordered (Ord
), they are also numbers (Num
). So, you will need to have findSmallestPositive :: (Num a, Ord a) => [a] -> a
.
Note: It is not necessary to put a "prime" at the end of your function name, as is shown in the example.
Complete these C++/Shell/SQL coding challenges and submit screenshots.
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++
https://www.hackerrank.com/challenges/c-tutorial-class/problem
You can read more about C++ classes here: https://www.programiz.com/cpp-programming/object-class
If you get stuck for more than 20 minutes, post on Slack to get help
Shell
https://www.hackerrank.com/challenges/text-processing-tail-1/problem
https://www.hackerrank.com/challenges/text-processing-tail-2/problem
https://www.hackerrank.com/challenges/text-processing-in-linux---the-middle-of-a-text-file/problem
Helpful templates:
tail -n 11 # Last 11 lines
tail -c 20 # Last 20 characters
head -n 10 | tail -n 5 # Get the first 10 lines, and then
                       # get the last 5 lines of those
                       # 10 lines (so the final result is
                       # lines 6-10)
These problems are all one-liners. If you find yourself spending more than a couple minutes on these, be sure to ask for help.
SQL
https://sqlzoo.net/wiki/The_JOIN_operation (queries 4,5,6)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
For your submission, copy and paste your links into the following template:
Link to Overleaf doc: _____
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____
There will be a 45-minute quiz on Friday from 8:30-9:15. It will mainly be a review of the ML algorithms we've implemented so far, and their use for modeling purposes. Know how to do the following things:
Answer questions about similarities and differences between linear regression, logistic regression, k nearest neighbors, naive bayes, and Gini decision trees.
Answer questions about overfitting, underfitting, training datasets, testing datasets, train-test splits.
Primary problems; 60% of assignment grade; 90 minutes estimate
a. Schedule pair coding sessions to finish game refactoring. Once you've gotten someone else's strategy integrated, update the "Current Completion" portion of the progress sheet: https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing
Saturday:
Sunday:
b. Create a class WeightedGraph
where each edge has an edge weight. Include two methods calc_shortest_path
and calc_distance
that accomplish the same goals as in your Graph
class. But since this is a weighted graph, the actual algorithms for accomplishing those goals are a bit different.
Initialize the WeightedGraph
with a weights
dictionary instead of an edges
list. The edges
list just had a list of edges, whereas the weights
dictionary will have its keys as edges and its values as the weights of those edges.
Implement the method calc_distance
using Dijkstra's algorithm (https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Algorithm). This algorithm works by assigning all other nodes an initial d-value and then iteratively updating those d-values until they actually represent the distances to those nodes.
Initial d-values: initial node is assigned $0,$ all other nodes are assigned $\infty$ (use a large number like $9999999999$). Set current node to be the initial node.
For each unvisited neighbor of the current node, compute (current node's d-value) + (edge weight). If this sum is less than the neighbor's d-value, then replace the neighbor's d-value with the sum.
Update the current node to be the unvisited node that has the smallest d-value, and keep repeating the procedure until the terminal node has been visited. (Once the terminal node has been visited, its d-value is guaranteed to be correct.) Important: a node is not considered visited until it has been set as a current node. Even if you updated the node's d-value at some point, the node is not visited until it is the current node.
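As a reference for the procedure above, here is a compact sketch of the d-value updates, written over a plain weights dictionary (treating edges as undirected) rather than your WeightedGraph class; it is meant as an illustration of the steps, not the required implementation.
def calc_distance_sketch(weights, num_nodes, start, end):
    INFINITY = 9999999999
    d_value = [INFINITY] * num_nodes
    d_value[start] = 0
    unvisited = set(range(num_nodes))

    # Build an undirected adjacency structure from the weights dictionary.
    neighbors = {node: [] for node in range(num_nodes)}
    for (a, b), weight in weights.items():
        neighbors[a].append((b, weight))
        neighbors[b].append((a, weight))

    while unvisited:
        # The current node is the unvisited node with the smallest d-value.
        current = min(unvisited, key=lambda node: d_value[node])
        if current == end:
            return d_value[end]   # terminal node reached: its d-value is final
        unvisited.remove(current)
        for neighbor, weight in neighbors[current]:
            if neighbor in unvisited:
                # Replace the neighbor's d-value if this route is shorter.
                d_value[neighbor] = min(d_value[neighbor], d_value[current] + weight)
    return d_value[end]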
Test your code on the following example:
>>> weights = {
(0,1): 3,
(1,7): 4,
(7,2): 2,
(2,5): 1,
(5,6): 8,
(0,3): 2,
(3,2): 6,
(3,4): 1,
(4,8): 8,
(8,0): 4
}
>>> vertex_values = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
>>> weighted_graph = WeightedGraph(weights, vertex_values)
>>> weighted_graph.calc_distance(8,4)
7
>>> [weighted_graph.calc_distance(8,n) for n in range(9)]
[4, 7, 12, 6, 7, 13, 21, 11, 0]
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/syntax-in-functions.
Let it be
Pay attention to the following example, especially:
calcBmis :: (RealFloat a) => [(a, a)] -> [a]
calcBmis xs = [bmi | (w, h) <- xs, let bmi = w / h ^ 2, bmi >= 25.0]
Create a Haskell file ProcessPoints.hs
and write a function smallestDistances
that takes a list of 3-dimensional points and returns the distances of any points that are within 10 units from the origin.
To check your function, print smallestDistances [(5,5,5), (3,4,5), (8,5,8), (9,1,4), (11,0,0), (12,13,14)]
. You should get a result of [8.67, 7.07, 9.90]
.
Complete these C++/Shell/SQL coding challenges and submit screenshots.
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++
https://www.hackerrank.com/challenges/c-tutorial-struct/problem
You can read more about structs here: https://www.educative.io/edpresso/what-is-a-cpp-struct
If you get stuck for more than 10 minutes, post on Slack to get help
Shell
https://www.hackerrank.com/challenges/text-processing-cut-7/problem
https://www.hackerrank.com/challenges/text-processing-cut-8/problem
https://www.hackerrank.com/challenges/text-processing-cut-9/problem
https://www.hackerrank.com/challenges/text-processing-head-1/problem
https://www.hackerrank.com/challenges/text-processing-head-2/tutorial
Remember to check out the tutorial tabs.
Note that if you want to start at the index 2
and then go until the end of a line, you can just omit the ending index. For example, cut -c2-
means print characters $2$ and onwards for each line in the file.
Also remember the template cut -d',' -f2-4
, which means print fields $2$ through $4$ for each line the file, where the fields are separated by the delimiter ','
.
You can also look at this resource for some examples: https://www.folkstalk.com/2012/02/cut-command-in-unix-linux-examples.html
These problems are all one-liners. If you find yourself spending more than a couple minutes on these, be sure to ask for help.
SQL
https://sqlzoo.net/wiki/SUM_and_COUNT (queries 6,7,8)
https://sqlzoo.net/wiki/The_JOIN_operation (queries 1,2,3)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.
For your submission, copy and paste your links into the following template:
Repl.it link to WeightedGraph tests: _____
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for graph repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____
Primary problems; 60% of assignment grade; 90 minutes estimate
a. If your game doesn't already do this, make it so that if a player commits an invalid move (such as moving off the grid), the game stops.
b. Now that we've solved a bunch of issues in our games, it's time to slow down, and focus on 1 strategy at a time.
Schedule a pair coding session with your partner(s) below, sometime today or tomorrow. Let me know when you've scheduled it. During your session, make sure that your DumbStrategy passes their tests, and that their DumbStrategy passes your tests.
Riley & David
Elijah, Colby, & George
Refactoring will go a lot faster if we do it synchronously in small groups, instead of doing asynchronous refactoring with the entire group. By doing small-group synchronous refactoring, it'll be easier to keep a stream of communication going until the DumbStrategy works.
In case you need it: Problem 80-1 has templates of the game_state and the Strategy class.
c. Make the following adjustment to your random forest:
In your random decision tree, create a max_depth
parameter that stops splitting any nodes beyond the max_depth
. For example, if max_depth = 2
, then you would stop splitting a node once it is 2 units away from the root of the tree.
A consequence of this is that the terminal nodes might not be pure. If a terminal node is impure, then it predicts the majority class. (If there are equal amounts of each class, just choose randomly.) When you initialize the random forest, pass a max_depth parameter that, in turn, gets passed to the random decision trees.
We've got a couple more adjustments to make, but I figured we should break up this task over multiple assignments since it's a bit of work and we've also got to keep making progress on the game refactoring.
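Here is a minimal sketch of where the max_depth check might live; the node attributes (depth, is_pure) are placeholders for whatever your own tree classes use.
from collections import Counter

def should_stop_splitting(node, max_depth):
    # Stop when the node is pure or has reached the maximum allowed depth.
    return node.is_pure() or (max_depth is not None and node.depth >= max_depth)

def majority_class(labels):
    # An impure terminal node predicts its most common class.
    # Counter.most_common breaks exact ties by insertion order, so you may
    # want to randomize ties explicitly, as described above.
    return Counter(labels).most_common(1)[0][0]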
Supplemental problems; 30% of assignment grade; 60 minutes estimate
Location: assignment-problems
Observe the following example:
bmiTell :: (RealFloat a) => a -> a -> String
bmiTell weight height
| bmi <= underweightThreshold = "The patient may be underweight. If this is the case, the patient should be recommended a higher-calorie diet."
| bmi <= normalThreshold = "The patient may be at a normal weight."
| otherwise = "The patient may be overweight. If this is the case, the patient should be recommended exercise and a lower-calorie diet."
where bmi = weight / height ^ 2
underweightThreshold = 18.5
normalThreshold = 25.0
Create a Haskell file RecommendClothing.hs
and write a function recommendClothing
that takes the input degreesCelsius
, converts it to degreesFahrenheit
(multiply by $\dfrac{9}{5}$ and add $32$), and makes the following recommendations:
If the temperature is $ \geq 80 \, ^\circ \textrm{F},$ then recommend to wear a shortsleeve shirt.
If the temperature is $ > 65 \, ^\circ \textrm{F}$ but $ < 80 \, ^\circ \textrm{F},$ then recommend to wear a longsleeve shirt.
If the temperature is $ > 50 \, ^\circ \textrm{F}$ but $ \leq 65 \, ^\circ \textrm{F},$ then recommend to wear a sweater.
If the temperature is $ \leq 50 \, ^\circ \textrm{F},$ then recommend to wear a jacket.
Complete these C++/Shell/SQL coding challenges and submit screenshots.
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++
https://www.hackerrank.com/challenges/c-tutorial-strings/problem
Helpful template: myString.substr(1, 3) gives the substring of length 3 starting at index 1.
Shell
https://www.hackerrank.com/challenges/text-processing-cut-2/problem
https://www.hackerrank.com/challenges/text-processing-cut-3/problem
https://www.hackerrank.com/challenges/text-processing-cut-4/problem
https://www.hackerrank.com/challenges/text-processing-cut-5/problem
https://www.hackerrank.com/challenges/text-processing-cut-6/problem
Here are some useful templates:
cut -c2-4
means print characters $2$ through $4$ for each line in the file.
cut -d',' -f2-4
means print fields $2$ through $4$ for each line the file, where the fields are separated by the delimiter ','
.
You can also look at this resource for some examples: https://www.folkstalk.com/2012/02/cut-command-in-unix-linux-examples.html
These problems are all one-liners. If you find yourself spending more than a couple minutes on these, be sure to ask for help.
SQL
https://sqlzoo.net/wiki/SUM_and_COUNT (queries 1,2,3,4,5)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.
For your submission, copy and paste your links into the following template:
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____
Primary problems; 50% of assignment grade; 60 minutes estimate
a.
(i) In your game state, make the following updates:
Change "hits"
to "hits_left"
Change "Homeworld"
to "Colony"
Put another key in each player's entry of the game state: game_state["players"][player_index]["home_coords"]
Make sure there are no strings with spaces in them -- instead, we'll use underscores
Attack and defense tech starts at 0; movement, ship size, and shipyard tech all start at 1
Colony ships are not affected by technology in general
Add "ship_size_needed"
to the "unit_data"
key in the game state. Otherwise, the player doesn't know what ship size technology it needs before it can buy a ship. (A short sketch of such a check appears after this list of updates.)
Change the output of decide_purchases
to specify locations at which to build the ships, like this:
{
'units': [{'type': 'Scout', 'coords': (2,1)},
{'type': 'Scout', 'coords': (2,1)},
{'type': 'Destroyer', 'coords': (2,1)}],
'technology': ['defense', 'attack', 'attack']
}
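To illustrate why the ship size key matters, here is a minimal sketch of a purchase check a strategy might perform (the function name is just illustrative; note that in the game state below the key appears as 'shipsize_needed'):
def can_buy_unit(game_state, player_index, unit_type):
    # A unit can only be bought if the player's shipsize technology is
    # at least the unit's required ship size.
    needed = game_state['unit_data'][unit_type]['shipsize_needed']
    current = game_state['players'][player_index]['technology']['shipsize']
    return current >= needed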
The updated game state is shown below:
game_state = {
'turn': 4,
'phase': 'Combat', # Can be 'Movement', 'Economic', or 'Combat'
'round': None, # if the phase is movement, then round is 1, 2, or 3
'player_whose_turn': 0, # index of player whose turn it is (or whose ship is attacking during battle),
'winner': None,
'players': [
{'cp': 9,
'home_coords': (6,3),
'units': [
{'coords': (5,10),
'type': 'Scout',
'hits_left': 1,
'technology': {
'attack': 1,
'defense': 0,
'movement': 3
}},
{'coords': (1,2),
'type': 'Destroyer',
'hits_left': 1,
'technology': {
'attack': 0,
'defense': 0,
'movement': 2
}},
{'coords': (6,0),
'type': 'Homeworld',
'hits_left': 2,
'turn_created': 0
},
{'coords': (5,3),
'type': 'Colony',
'hits_left': 1,
'turn_created': 2
}],
'technology': {'attack': 1, 'defense': 0, 'movement': 3, 'shipsize': 1}
},
{'cp': 15,
'home_coords': (0,3),
'units': [
{'coords': (1,2),
'type': 'Battlecruiser',
'hits_left': 1,
'technology': {
'attack': 0,
'defense': 0,
'movement': 1
}},
{'coords': (1,2),
'type': 'Scout',
'hits_left': 1,
'technology': {
'attack': 1,
'defense': 0,
'movement': 1
}},
{'coords': (5,10),
'type': 'Scout',
'hits_left': 1,
'technology': {
'attack': 1,
'defense': 0,
'movement': 1
}},
{'coords': (6,12),
'type': 'Homeworld',
'hits_left': 3,
'turn_created': 0
},
{'coords': (5,10),
'type': 'Colony',
'hits_left': 3,
'turn_created': 1
}],
'technology': {'attack': 1, 'defense': 0, 'movement': 1, 'shipsize': 1}
}],
'planets': [(5,3), (5,10), (1,2), (4,8), (9,1)],
'unit_data': {
'Battleship': {'cp_cost': 20, 'hullsize': 3, 'shipsize_needed': 5, 'tactics': 5, 'attack': 5, 'defense': 2, 'maintenance': 3},
'Battlecruiser': {'cp_cost': 15, 'hullsize': 2, 'shipsize_needed': 4, 'tactics': 4, 'attack': 5, 'defense': 1, 'maintenance': 2},
'Cruiser': {'cp_cost': 12, 'hullsize': 2, 'shipsize_needed': 3, 'tactics': 3, 'attack': 4, 'defense': 1, 'maintenance': 2},
'Destroyer': {'cp_cost': 9, 'hullsize': 1, 'shipsize_needed': 2, 'tactics': 2, 'attack': 4, 'defense': 0, 'maintenance': 1},
'Dreadnaught': {'cp_cost': 24, 'hullsize': 3, 'shipsize_needed': 6, 'tactics': 5, 'attack': 6, 'defense': 3, 'maintenance': 3},
'Scout': {'cp_cost': 6, 'hullsize': 1, 'shipsize_needed': 1, 'tactics': 1, 'attack': 3, 'defense': 0, 'maintenance': 1},
'Shipyard': {'cp_cost': 3, 'hullsize': 1, 'shipsize_needed': 1, 'tactics': 3, 'attack': 3, 'defense': 0, 'maintenance': 0},
'Decoy': {'cp_cost': 1, 'hullsize': 0, 'shipsize_needed': 1, 'tactics': 0, 'attack': 0, 'defense': 0, 'maintenance': 0},
'Colonyship': {'cp_cost': 8, 'hullsize': 1, 'shipsize_needed': 1, 'tactics': 0, 'attack': 0, 'defense': 0, 'maintenance': 0},
'Base': {'cp_cost': 12, 'hullsize': 3, 'shipsize_needed': 2, 'tactics': 5, 'attack': 7, 'defense': 2, 'maintenance': 0},
},
'technology_data': {
# lists containing the price to purchase the next level
'shipsize': [0, 10, 15, 20, 25, 30],
'attack': [20, 30, 40],
'defense': [20, 30, 40],
'movement': [0, 20, 30, 40, 40, 40],
'shipyard': [0, 20, 30]
}
}
The Strategy
template is shown below:
class CombatStrategy:
def __init__(self, player_index):
self.player_index = player_index
def will_colonize_planet(self, coordinates, game_state):
...
return either True or False
def decide_ship_movement(self, unit_index, game_state):
...
return a "translation" which is a tuple representing
the direction in which the ship moves.
# For example, if a unit located at (1,2) wanted to
# move to (1,1), then the translation would be (0,-1).
def decide_purchases(self, game_state):
...
return {
'units': list of unit objects you want to buy,
'technology': list of technology attributes you want to upgrade
}
# for example, if you wanted to buy 2 Scouts, 1 Destroyer,
# upgrade defense technology once, and upgrade attack
# technology twice, you'd return
# {
# 'units': [{'type': 'Scout', 'coords': (2,1)},
# {'type': 'Scout', 'coords': (2,1)},
# {'type': 'Destroyer', 'coords': (2,1)}],
# 'technology': ['defense', 'attack', 'attack']
# }
def decide_removal(self, game_state):
...
return the unit index of the ship that you want to remove.
for example, if you want to remove the unit at index 2,
return 2
def decide_which_unit_to_attack(self, combat_state, coords, attacker_index):
# combat_state is a dictionary in the form coordinates : combat_order
# {
# (1,2): [{'player': 1, 'unit': 0},
# {'player': 0, 'unit': 1},
# {'player': 1, 'unit': 1},
# {'player': 1, 'unit': 2}],
# (2,2): [{'player': 2, 'unit': 0},
# {'player': 3, 'unit': 1},
# {'player': 2, 'unit': 1},
# {'player': 2, 'unit': 2}]
# }
# attacker_index is the index of your unit, whose turn it is
# to attack.
...
return the index of the ship you want to attack in the
combat order.
# in the above example, if you want to attack player 1's unit 1,
# then you'd return 2 because it corresponds to
# combat_state[coords][2]
def decide_which_units_to_screen(self, combat_state, coords):
# again, the combat_state is the combat_state for the
# particular battle
...
return the indices of the ships you want to screen
in the combat order
# in the above example, if you are player 1 and you want
# to screen units 1 and 2, you'd return [2,3] because
# the ships you want to screen are
# combat_state[coords][2] and combat_state[coords][3]
# NOTE: FOR COMBATSTRATEGY AND DUMBSTRATEGY,
# YOU CAN JUST RETURN AN EMPTY ARRAY
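For concreteness, here is a minimal sketch of what a simple decide_which_unit_to_attack could look like inside a strategy class, using the combat_state format documented in the comments above (it just attacks the first enemy unit in the combat order at the given coords):
def decide_which_unit_to_attack(self, combat_state, coords, attacker_index):
    combat_order = combat_state[coords]
    for combat_index, entry in enumerate(combat_order):
        if entry['player'] != self.player_index:
            return combat_index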
(ii) Once your game state / strategies are ready to be tested, post on #machine-learning to let your classmates know.
(iii) Run your classmates' strategies after they post that the strategies are ready. If there are any issues with their strategy, post on #machine-learning to let them know. I'm hoping that, possibly with a little back-and-forth fixing, we can have all the strategies working in everyone's games by the end of the long weekend.
b.
Create a steepest_descent_optimizer(n)
optimizer for the 8 queens problem, which starts with the best of 100 random locations arrays, and on each iteration, repeatedly compares all possible next location arrays that result from moving one queen by one space, and chooses the one that results in the minimum cost. The algorithm will run for n
iterations.
Some clarifications:
By "starts with the best of 100 random locations arrays", I mean that you should start by generating 100 random locations arrays and selecting the lowest-cost array to be your initial locations array.
There are $8$ queens, and each queen can move in one of $8$ directions (up, down, left, right, or in a diagonal direction) unless one of those directions is blocked by another queen or invalid due to being off the board.
So, the number of possible "next location arrays" resulting from moving one queen by one space will be around $8 \times 8 = 64,$ though probably a little bit less. This means that on each iteration, you'll have to check about $64$ possible next location arrays and choose the one that minimizes the cost function.
If multiple configurations minimize the cost, randomly select one of them. If every next configuration increases the cost, then terminate the algorithm and return the current locations.
Important: We didn't discuss this in class, so be sure to post on Slack if you get confused on any part of this problem.
Your function should again return the following dictionary:
{
'locations': array that resulted in the lowest cost,
'cost': the actual value of that lowest cost
}
Print out the cost of your steepest_descent_optimizer
for n=10,50,100,500,1000
. Once you have those printouts, post it on Slack in the #results channel.
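Here is a minimal sketch of the neighbor-generation and selection step described above. It assumes you already have a cost(locations) function from the earlier 8-queens problems and that locations is a list of 8 (row, col) tuples, so adjust it to your own representation.
import random

# All 8 one-square moves a queen can make.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def neighbor_location_arrays(locations):
    # All location arrays reachable by moving one queen one square,
    # staying on the board and not landing on another queen.
    neighbors = []
    occupied = set(locations)
    for queen_index, (row, col) in enumerate(locations):
        for d_row, d_col in DIRECTIONS:
            new_square = (row + d_row, col + d_col)
            on_board = 0 <= new_square[0] <= 7 and 0 <= new_square[1] <= 7
            if on_board and new_square not in occupied:
                new_locations = list(locations)
                new_locations[queen_index] = new_square
                neighbors.append(new_locations)
    return neighbors

def steepest_descent_step(locations, cost):
    # One iteration: move to a random lowest-cost neighbor, or stop if
    # every neighbor increases the cost.
    current_cost = cost(locations)
    candidates = neighbor_location_arrays(locations)
    best_cost = min(cost(candidate) for candidate in candidates)
    if best_cost > current_cost:
        return locations, current_cost
    best_candidates = [c for c in candidates if cost(c) == best_cost]
    return random.choice(best_candidates), best_cost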
Supplemental problems; 40% of assignment grade; 60 minutes estimate
Location: assignment-problems/refactor_string_processing.py
The following code is supposed to turn a string into an array. Currently, it's messy, and there's some subtle issues with the code. Clean up the code and get it to work.
Some particular things to fix are:
Putting whitespace where appropriate
Naming variables clearly
Deleting any pieces of code that aren't necessary
string = '"alpha","beta","gamma","delta"\n1,2,3,4\n5.0,6.0,7.0,8.0'
strings = [x.split(',') for x in string.split('\n')]
length_of_string = len(string)
arr = []
for string in strings:
newstring = []
if len(string) > 0:
for char in string:
if char[0]=='"' and char[-1]=='"':
char = char[1:]
elif '.' in char:
char = int(char)
else:
char = float(char)
newstring.append(char)
arr.append(newstring)
print(arr)
---
What it should print:
[['alpha', 'beta', 'gamma', 'delta'], [1, 2, 3, 4], [5.0, 6.0, 7.0, 8.0]]
What actually happens:
Traceback (most recent call last):
File "datasets/myfile.py", line 10, in <module>
char = int(char)
ValueError: invalid literal for int() with base 10: '5.0'
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/syntax-in-functions.
Pattern matching
Create Haskell file Fibonacci.hs
and write a function nthFibonacciNumber
that computes the n
th Fibonacci number, starting with $n=0$. Remember that the Fibonacci sequence is $0,1,1,2,3,5,8,\ldots$ where each number comes from adding the previous two.
To check your function, print nthFibonacciNumber 20
. You should get a result of 6765
.
Note: This part of the section will be very useful, since it talks about how to write a recursive function.
factorial :: (Integral a) => a -> a
factorial 0 = 1
factorial n = n * factorial (n - 1)
Complete these C++/Shell/SQL coding challenges and submit screenshots.
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++
https://www.hackerrank.com/challenges/arrays-introduction/problem
for (int i=0; i<n; i++) {
cin >> a[i];
}
You can read the array out in a similar way.
Shell
https://www.hackerrank.com/challenges/text-processing-cut-1/problem
while read line
do
(your code here)
done
Again, be sure to check out the top-right "Tutorial" tab.
SQL
https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial (queries 9,10)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.
For your submission, copy and paste your links into the following template:
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for machine-learning repo: ____
Commit link for space-empires repo: ____
Issue 1: _____
Issue 2: _____
Primary problems; 45% of assignment grade; 90 minutes estimate
Adjustments to the game...
If you have grid_size
, rename to board_size
. It should be a tuple (x,y)
instead of just 1 integer
Updates to combat strategy:
In decide_purchases
, the units should be strings (not objects)
Updates to game state:
Change "location"
to "coords"
Unit types should be strings
Add a dictionaries unit_data
and technology_data
Person-specific corrections (if you haven't addressed these already):
Colby
The conventions for his shipyard/colonyship class naming are incorrect: He has an underscore, he uses something called 'movement_round' in the gamestate.
There are a ton of other really small naming differences in the gamestate, like a player's units vs ships, and Board_size vs grid_size.
Riley
Board_size vs grid_size
decide_which_ship_to_attack arguments are reversed
combat_state is in a different format
George
His unit folder is called "units" (plural)
Shipyard and colonyship filenames don't have underscores
Shipyard vs ShipYard
coords vs pos in ship state
name vs type in ship state
David
It looks like he passes in some datastruct instead of a dictionary and accesses properties using a dot instead of square brackets.
He used some non-python syntax (++ increment) in his CombatStrategy
In decide_ship_movement he has the ship index as the argument, which is correct, but then he uses ship.coordinates as if he were handling the object.
game_state = {
'turn': 4,
'phase': 'Combat', # Can be 'Movement', 'Economic', or 'Combat'
'round': None, # if the phase is movement, then round is 1, 2, or 3
'player_whose_turn': 0, # index of player whose turn it is (or whose ship is attacking during battle),
'winner': None,
'players': [
{'cp': 9,
'units': [
{'coords': (5,10),
'type': 'Scout',
'hits': 0,
'technology': {
'attack': 1,
'defense': 0,
'movement': 3
}},
{'coords': (1,2),
'type': 'Destroyer',
'hits': 0,
'technology': {
'attack': 0,
'defense': 0,
'movement': 2
}},
{'coords': (6,0),
'type': 'Homeworld',
'hits': 0,
'turn_created': 0
},
{'coords': (5,3),
'type': 'Colony',
'hits': 0,
'turn_created': 2
}],
'technology': {'attack': 1, 'defense': 0, 'movement': 3, 'shipsize': 0}
},
{'cp': 15,
'units': [
{'coords': (1,2),
'type': 'Battlecruiser',
'hits': 1,
'technology': {
'attack': 0,
'defense': 0,
'movement': 1
}},
{'coords': (1,2),
'type': 'Scout',
'hits': 0,
'technology': {
'attack': 1,
'defense': 0,
'movement': 1
}},
{'coords': (5,10),
'type': 'Scout',
'hits': 0,
'technology': {
'attack': 1,
'defense': 0,
'movement': 1
}},
{'coords': (6,12),
'type': 'Homeworld',
'hits': 0,
'turn_created': 0
},
{'coords': (5,10),
'type': 'Colony',
'hits': 0,
'turn_created': 1
}],
'technology': {'attack': 1, 'defense': 0, 'movement': 1, 'shipsize': 1}
}],
'planets': [(5,3), (5,10), (1,2), (4,8), (9,1)],
'unit_data': {
'Battleship': {'cp_cost': 20, 'hullsize': 3, 'shipsize_needed': 5},
'Battlecruiser': {'cp_cost': 15, 'hullsize': 2, 'shipsize_needed': 4},
'Cruiser': {'cp_cost': 12, 'hullsize': 2, 'shipsize_needed': 3},
'Destroyer': {'cp_cost': 9, 'hullsize': 1, 'shipsize_needed': 2},
'Dreadnaught': {'cp_cost': 24, 'hullsize': 3, 'shipsize_needed': 6},
'Scout': {'cp_cost': 6, 'hullsize': 1, 'shipsize_needed': 1},
'Shipyard': {'cp_cost': 3, 'hullsize': 1, 'shipsize_needed': 1},
'Decoy': {'cp_cost': 1, 'hullsize': 0, 'shipsize_needed': 1},
'Colonyship': {'cp_cost': 8, 'hullsize': 1, 'shipsize_needed': 1},
'Base': {'cp_cost': 12, 'hullsize': 3, 'shipsize_needed': 2},
},
'technology_data': {
# lists containing the price to purchase the next level
'shipsize': [10, 15, 20, 25, 30],
'attack': [20, 30, 40],
'defense': [20, 30, 40],
'movement': [20, 30, 40, 40, 40],
'shipyard': [20, 30]
}
}
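To make the structure above concrete, here is a minimal sketch (not part of the required template) of how a strategy might read a few fields out of a game_state dictionary like this one; the helper names are hypothetical.
def count_unit_type(game_state, player_index, unit_type):
    # count how many units of a given type (e.g. 'Scout') a player currently has
    units = game_state['players'][player_index]['units']
    return len([unit for unit in units if unit['type'] == unit_type])

def scout_cost(game_state):
    # look up the CP cost of a Scout in the unit_data dictionary
    return game_state['unit_data']['Scout']['cp_cost']

# With the game_state above, count_unit_type(game_state, 0, 'Scout') is 1
# and scout_cost(game_state) is 6.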
class CombatStrategy:
def __init__(self, player_index):
self.player_index = player_index
def will_colonize_planet(self, coordinates, game_state):
...
return either True or False
def decide_ship_movement(self, unit_index, game_state):
...
return a "translation" which is a tuple representing
the direction in which the ship moves.
# For example, if a unit located at (1,2) wanted to
# move to (1,1), then the translation would be (0,-1).
def decide_purchases(self, game_state):
...
return {
'units': list of unit type strings you want to buy,
'technology': list of technology attributes you want to upgrade
}
# for example, if you wanted to buy 2 Scouts, 1 Destroyer,
# upgrade defense technology once, and upgrade attack
# technology twice, you'd return
# {
# 'units': ['Scout', 'Scout', 'Destroyer'],
# 'technology': ['defense', 'attack', 'attack']
# }
def decide_removal(self, game_state):
...
return the unit index of the ship that you want to remove.
for example, if you want to remove the unit at index 2,
return 2
def decide_which_unit_to_attack(self, combat_state, coords, attacker_index):
# combat_state is a dictionary in the form coordinates : combat_order
# {
# (1,2): [{'player': 1, 'unit': 0},
# {'player': 0, 'unit': 1},
# {'player': 1, 'unit': 1},
# {'player': 1, 'unit': 2}],
# (2,2): [{'player': 2, 'unit': 0},
# {'player': 3, 'unit': 1},
# {'player': 2, 'unit': 1},
# {'player': 2, 'unit': 2}]
# }
# attacker_index is the index of your unit, whose turn it is
# to attack.
...
return the index of the ship you want to attack in the
combat order.
# in the above example, if you want to attack player 1's unit 1,
# then you'd return 2 because it corresponds to
# combat_state[(1,2)][2]
def decide_which_units_to_screen(self, combat_state):
# again, the combat_state is the combat_state for the
# particular battle
...
return the indices of the ships you want to screen
in the combat order
# in the above example, if you are player 1 and you want
# to screen units 1 and 2, you'd return [2,3] because
# the ships you want to screen are
# combat_state['order'][2] and combat_state['order'][3]
# NOTE: FOR COMBATSTRATEGY AND DUMBSTRATEGY,
# YOU CAN JUST RETURN AN EMPTY ARRAY
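If it helps to see the template filled in, here is a minimal sketch of a very simple (and not very smart) strategy built on the conventions above; the class name and the particular choices are illustrative only, and it assumes each battle has at least one enemy unit in its combat order.
class SimpleExampleStrategy:
    def __init__(self, player_index):
        self.player_index = player_index

    def decide_ship_movement(self, unit_index, game_state):
        # never move: a translation of (0,0) leaves the ship where it is
        return (0, 0)

    def decide_purchases(self, game_state):
        # buy as many Scouts as the current CP allows; never buy technology
        cp = game_state['players'][self.player_index]['cp']
        scout_cost = game_state['unit_data']['Scout']['cp_cost']
        return {
            'units': ['Scout'] * (cp // scout_cost),
            'technology': []
        }

    def decide_which_unit_to_attack(self, combat_state, coords, attacker_index):
        # attack the first enemy unit in the combat order at these coordinates
        order = combat_state[coords]
        for index, entry in enumerate(order):
            if entry['player'] != self.player_index:
                return index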
Location: machine-learning/analysis/8_queens.py
We're going to be exploring approaches to solving the 8-queens problem on the next couple assignments.
The 8-queens problem is a challenge to place 8 queens on a chess board in a way that none can attack each other. Remember that in chess, queens can attack any piece that is on the same row, column, or diagonal. So, the 8-queens problem is to place 8 queens on a chess board so that none of them are on the same row, column, or diagonal.
a. Write a function show_board(locations)
that takes a list of locations of 8 queens and prints out the corresponding board by placing periods in empty spaces and the index of the location in any space occupied by a queen.
>>> locations = [(0,0), (6,1), (2,2), (5,3), (4,4), (7,5), (1,6), (2,6)]
>>> show_board(locations)
0 . . . . . . .
. . . . . . 6 .
. . 2 . . . 7 .
. . . . . . . .
. . . . 4 . . .
. . . 3 . . . .
. 1 . . . . . .
. . . . . 5 . .
Tip: To print out a row, you can first construct it as an array and then print the corresponding string, which consists of the array entries separated by two spaces:
>>> row_array = ['0', '.', '.', '.', '.', '.', '.', '.']
>>> row_string = '  '.join(row_array) # note that '  ' is TWO spaces
>>> print(row_string)
0 . . . . . . .
b. Write a function calc_cost(locations)
that computes the "cost", i.e. the number of pairs of queens that are on the same row, column, or diagonal.
For example, the cost of the board above is 10. Verify this:
>>> calc_cost(locations)
10
Tip 1: It will be easier to debug your code if you write several helper functions -- one which takes two coordinate pairs and determines whether they're on the same row, another which determines whether they're on the same column, another which determines if they're on the same diagonal.
Tip 2: To check if two locations are on the same diagonal, you can compute the slope between those two points and check if the slope comes out to $1$ or $-1.$
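Here is a minimal sketch of what the helpers from Tip 1 (and the diagonal check from Tip 2) might look like; the function names are just suggestions.
def same_row(p, q):
    return p[0] == q[0]

def same_column(p, q):
    return p[1] == q[1]

def same_diagonal(p, q):
    # the slope between the points is 1 or -1 exactly when the
    # vertical and horizontal distances are equal
    return abs(p[0] - q[0]) == abs(p[1] - q[1])

def calc_cost(locations):
    # count the pairs of queens that share a row, column, or diagonal
    cost = 0
    for i in range(len(locations)):
        for j in range(i + 1, len(locations)):
            p, q = locations[i], locations[j]
            if same_row(p, q) or same_column(p, q) or same_diagonal(p, q):
                cost += 1
    return cost
With the locations list from part (a), this sketch gives a cost of 10, matching the expected output above.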
c. Write a function random_optimizer(n)
that generates n
random locations
arrays for the 8 queens, and returns the following dictionary:
{
'locations': array that resulted in the lowest cost,
'cost': the actual value of that lowest cost
}
Then, print out the cost of your random_optimizer
for n=10,50,100,500,1000
. Once you have those printouts, post them on Slack in the #results channel.
Supplemental problems; 45% of assignment grade; 60 minutes estimate
Location: assignment-problems/refactor_linear_regressor.py
The following code is taken from a LinearRegressor
class. While most of the code will technically work, there may be a couple subtle issues, and the code is difficult to read.
Refactor this code so that it is more readable. It should be easy to glance at and understand what's going on. Some particular things to fix are:
Putting whitespace where appropriate
Naming variables clearly
Expanding out complicated one-liners
Deleting any pieces of code that aren't necessary
Important:
You don't have to actually run the code. This is just an exercise in improving code readability. You just need to copy and paste the code below into a file and clean it up.
Don't spend more than 20 min on this problem. You should fix the things that jump out at you as messy, but don't worry about trying to make it absolutely perfect.
def calculate_coefficients(self):
final_dict = {}
mat = [[1 for x in list(self.df.data_dict.values())[0][0]]]
mat_dict = {}
for key in self.df.data_dict:
if key != self.dependent_variable:
mat_dict[key] = self.df.data_dict[key]
for row in range(len(mat_dict)):
mat.append(list(self.df.data_dict.values())[row][0])
mat = Matrix(mat)
mat = mat.transpose()
mat_t = mat.transpose()
mat_mult = mat_t.matrix_multiply(mat)
mat_inv = mat_mult.inverse()
mat_pseudoinv = mat_inv.matrix_multiply(mat_t)
multiplier = [[num] for num in list(self.df.data_dict.values())[1][0]]
multiplier_mat = mat_pseudoinv.matrix_multiply(Matrix(multiplier))
for num in range(len(multiplier_mat.elements)):
if num == 0:
key = 'constant'
else:
key = list(self.df.data_dict.keys())[num-1]
final_dict[key] = [row[0] for row in multiplier_mat.elements][num]
return final_dict
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/syntax-in-functions.
Pattern matching
Create Haskell file CrossProduct.hs
and write a function crossProduct
in it that takes two 3-dimensional tuples as input, (x1,x2,x3)
and (y1,y2,y3)
and computes the cross product.
To check your function, print crossProduct (1,2,3) (3,2,1)
. You should get a result of (-4,8,-4)
.
Note: This part of the section will be very useful:
addVectors :: (Num a) => (a, a) -> (a, a) -> (a, a)
addVectors (x1, y1) (x2, y2) = (x1 + x2, y1 + y2)
Note that the top line just states the "type" of addVectors. This line says that addVectors works with Numbers a, and it takes two inputs of the form (a, a) and (a, a) and gives an output of the form (a, a). Here, a just stands for the type, Number.
Complete these C++/Shell/SQL coding challenges and submit screenshots.
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++
https://www.hackerrank.com/challenges/c-tutorial-pointer/problem
Shell
https://www.hackerrank.com/challenges/bash-tutorials---arithmetic-operations/problem
SQL
https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial (queries 7,8)
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.
For your submission, copy and paste your links into the following template:
PART 1
repl.it link for space-empires refactoring: ____
repl.it link for 8 queens: ____
PART 2
refactor_linear_regressor repl.it link: _____
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
PART 3
Issue 1: _____
Issue 2: _____
Primary problems; 45% of assignment grade; 30-75 minutes estimate
Make sure that Problem 77-1-a is done so that we can discuss the results next class. If you've already finished this, you can submit the same link that you did for Problem 77.
Note: your table should have only 5 entries, exactly 1 entry for each model. For each model, you should count all the correct predictions (over all train-test splits) and divide by the total number of predictions (over all train-test splits).
Also note that, altogether, it will probably take 5 minutes to train the models on all the splits. This is because we've implemented the simplest version of a random forest that could possibly be conceived, and it's really inefficient. We will make it more efficient next time.
For each classmate, make a list of specific things (if any) that they have to fix in their strategies in order for them to seamlessly integrate into our game.
Next time, we will aggregate and discuss all these fixes and hopefully our strategies will integrate seamlessly after that.
Supplemental problems; 60% of assignment grade; 75 minutes estimate
Recall the standard normal distribution:
$$ p(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2} $$Previously, you wrote a function calc_standard_normal_probability(a,b)
using a Riemann sum with step size 0.001
.
Now, you will generalize the function:
use an arbitrary number n of subintervals (the step size will be (b-a)/n), and
allow 5 different rules for computing the sum ("left endpoint"
, "right endpoint"
, "midpoint"
, "trapezoidal"
, "simpson"
)
The resulting function will be calc_standard_normal_probability(a,b,n,rule)
.
Note: The rules are from AP Calc BC. They are summarized below for a partition $\{ x_0, x_1, \ldots, x_n \}$ and step size $\Delta x.$
$$ \begin{align*} \textrm{Left endpoint rule} &= \Delta x \left[ f(x_0) + f(x_1) + \ldots + f(x_{n-1}) \right] \\[7pt] \textrm{Right endpoint rule} &= \Delta x \left[ f(x_1) + f(x_2) + \ldots + f(x_{n}) \right] \\[7pt] \textrm{Midpoint rule} &= \Delta x \left[ f \left( \dfrac{x_0+x_1}{2} \right) + f \left( \dfrac{x_1+x_2}{2} \right) + \ldots + f\left( \dfrac{x_{n-1}+x_{n}}{2} \right) \right] \\[7pt] \textrm{Trapezoidal rule} &= \Delta x \left[ 0.5f(x_0) + f(x_1) + f(x_2) + \ldots + f(x_{n-1}) + 0.5f(x_{n}) \right] \\[7pt] \textrm{Simpson's rule} &= \dfrac{\Delta x}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4) + \ldots + 4f(x_{n-1}) + f(x_{n}) \right] \\[7pt] \end{align*} $$For each rule, estimate $P(0 \leq x \leq 1)$ by making a plot of the estimate versus the number of subintervals for the even numbers $n \in \{ 2, 4, 6, \ldots, 100 \}.$ The resulting graph should look something like this. Post your plot on #computation-and-modeling once you've got it.
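To make the rules concrete, here is a minimal sketch of how calc_standard_normal_probability(a, b, n, rule) might be organized; only three of the five rules are shown, and the other two follow the same pattern.
import math

def f(x):
    # standard normal density
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def calc_standard_normal_probability(a, b, n, rule):
    dx = (b - a) / n
    x = [a + i * dx for i in range(n + 1)]  # the partition x_0, x_1, ..., x_n

    if rule == 'left endpoint':
        return dx * sum(f(x[i]) for i in range(n))
    elif rule == 'midpoint':
        return dx * sum(f((x[i] + x[i+1]) / 2) for i in range(n))
    elif rule == 'trapezoidal':
        return dx * (0.5 * f(x[0]) + sum(f(x[i]) for i in range(1, n)) + 0.5 * f(x[n]))
    # 'right endpoint' and 'simpson' can be handled analogously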
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/starting-out.
"Texas ranges" and "I'm a list comprehension"
Create Haskell file ComplicatedList.hs
and write a function calcList
in it that takes an input number n
and counts the number of ordered pairs [x,y]
that satisfy $-n \leq x,y \leq n$ and $x-y \leq \dfrac{xy}{2} \leq x+y$ and $x,y \notin \{ -2, -1, 0, 1, 2 \}.$ This function should generate a list comprehension and then count the length of that list.
To check your function, print calcList 50
. You should get a result of $16.$
Complete these C++/Shell/SQL coding challenges and submit screenshots.
https://www.hackerrank.com/challenges/c-tutorial-for-loop/problem
https://www.hackerrank.com/challenges/c-tutorial-functions/problem
https://www.hackerrank.com/challenges/bash-tutorials---comparing-numbers/problem
https://www.hackerrank.com/challenges/bash-tutorials---more-on-conditionals/problem
https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial (queries 4,5,6)
For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
Here's a helpful example of some bash syntax. (The spaces on the inside of the brackets are really important! It won't work if you remove the spaces, i.e. [$n -gt 100]
)
read n
if [ $n -gt 100 ] || [ $n -lt -100 ]
then
echo What a large number.
else
echo The number is smol.
if [ $n -eq 13 ]
then
echo And it\'s unlucky!!!
fi
fi
a.
b.
Remember that for a probability distribution $f(x),$ the cumulative distribution function (CDF) is $F(x) = P(X \leq x) = \displaystyle \int_{-\infty}^x f(x) \, \textrm dx.$
Remember that $EX$ means $\textrm E[X].$
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.
For your submission, copy and paste your links into the following template:
Commit link to machine-learning repo (if any changes were required): _____
Repl.it link to Haskell code: _____
Commit link for assignment-problems repo: _____
Link to C++/SQL screenshots (Overleaf or Google Doc): _____
Link to probability solutions (on Overleaf):
Issue 1: _____
Issue 2: _____
Primary problems; 45% of assignment grade; 75 minutes estimate
a. You'll need to do part 1 of the supplemental problem before you do this problem.
(i) Download the freshman_lbs.csv
dataset from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, read it into a DataFrame
, and create 5 test-train splits:
Note that you'll need to convert the appropriate entries to numbers (instead of strings) in the dataset. There are 2 options for doing this:
Option 1: don't worry about fixing the format within the read_csv
method. Just do something like
df = df.apply('weight', lambda x: int(x))
afterwards, before you pass the dataframe into your model.
Option 2: when you read in the csv, after you do the
lines = file.read().split('\n')
entries = [line.split(',') for line in lines]
thing, you can loop through the entries, and if entry[0]+entry[-1] == '""'
, then you can set entry = entry[1:-1]
to remove the quotes. Otherwise, if entry[0]+entry[-1] != '""'
, then you can try to do entry = float(entry). (A rough sketch of this cleanup is shown below.)
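For reference, the Option 2 cleanup might look roughly like this sketch; it assumes every field is either a quoted string or a plain number, and that lines holds the nonempty lines of the file.
entries = [line.split(',') for line in lines]
cleaned_entries = []
for row in entries:
    cleaned_row = []
    for entry in row:
        entry = entry.strip()  # drop stray whitespace around the field
        if entry[0] + entry[-1] == '""':
            cleaned_row.append(entry[1:-1])    # quoted string: remove the quotes
        else:
            cleaned_row.append(float(entry))   # unquoted field: treat it as a number
    cleaned_entries.append(cleaned_row)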
(ii) For each test-train split, fit each of the following models on the training data and use it to predict the sexes on the testing data. (You are predicting sex as a function of weight and BMI, and you can just use columns corresponding to September data.)
Decision tree using Gini split criterion
A single random decision tree
Random forest with 10 trees
Random forest with 100 trees
Random forest with 1000 trees
(iii) For each model, compute the accuracy (count the total number of correct classifications and divide by the total number of classifications). Put these results in a table in an Overleaf document.
Note that the total number of classifications should be equal to the total number of records in the dataset (you did 5 train-test splits, and each train-test split involved testing on 20% of the data).
(iv) Below the table, analyze the results. Did you expect these results, or did they surprise you? Why do you think you got the results you did?
b. For each of your classmates, copy over their DumbStrategy
and CombatStrategy
and run your DumbPlayer/CombatPlayer tests using your classmate's strategy. Fill out the following information for each classmate:
Name of classmate
When you copied over their DumbStrategy
and ran your DumbPlayer tests, did they pass? If not, then what's the issue? Is it a problem with your game, or with their strategy class?
When you copied over their CombatStrategy
and ran your CombatPlayer tests, did they pass? If not, then what's the issue? Is it a problem with your game, or with their strategy class?
Supplemental problems; 45% of assignment grade; 75 minutes estimate
In your machine-learning
repository, create a folder machine-learning/datasets/
. Go to https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, download the file airtravel.csv
, and put it in your datasets/
folder.
In Python, you can read a csv as follows:
>>> path_to_datasets = '/home/runner/machine-learning/datasets/'
>>> filename = 'airtravel.csv'
>>> with open(path_to_datasets + filename, "r") as file:
print(file.read())
"Month", "1958", "1959", "1960"
"JAN", 340, 360, 417
"FEB", 318, 342, 391
"MAR", 362, 406, 419
"APR", 348, 396, 461
"MAY", 363, 420, 472
"JUN", 435, 472, 535
"JUL", 491, 548, 622
"AUG", 505, 559, 606
"SEP", 404, 463, 508
"OCT", 359, 407, 461
"NOV", 310, 362, 390
"DEC", 337, 405, 432
Write a @classmethod
called DataFrame.from_csv(path_to_csv, header=True)
that constructs a DataFrame
from a csv file (similar to how DataFrame.from_array(arr)
constructs the DataFrame
from an array).
Test your method as follows:
>>> path_to_datasets = '/home/runner/machine-learning/datasets/'
>>> filename = 'airtravel.csv'
>>> filepath = path_to_datasets + filename
>>> df = DataFrame.from_csv(filepath, header=True)
>>> df.to_array()
[['"Month"', '"1958"', '"1959"', '"1960"'],
['"JAN"', '340', '360', '417'],
['"FEB"', '318', '342', '391'],
['"MAR"', '362', '406', '419'],
['"APR"', '348', '396', '461'],
['"MAY"', '363', '420', '472'],
['"JUN"', '435', '472', '535'],
['"JUL"', '491', '548', '622'],
['"AUG"', '505', '559', '606'],
['"SEP"', '404', '463', '508'],
['"OCT"', '359', '407', '461'],
['"NOV"', '310', '362', '390'],
['"DEC"', '337', '405', '432']]
Location: assignment-problems
Skim the following section of http://learnyouahaskell.com/starting-out.
An intro to lists
Create Haskell file ListProcessing.hs
and write a function prodFirstLast
in Haskell that takes an input list arr
and computes the product of the first and last elements of the list. Then, apply this function to the input [4,2,8,5]
.
Tip: use the !!
operator and the length
function.
Your file will look like this:
prodFirstLast arr = (your code here)
main = print (prodFirstLast [4,2,8,5])
Note that, to print out an integer, we use print
instead of putStrLn
.
(You can also use print
for most strings. The difference is that putStrLn
can show non-ASCII characters like "я" whereas print
cannot.)
Run your function and make sure it gives the desired output (which is 20).
a. Complete these introductory C++ coding challenges and submit screenshots:
https://www.hackerrank.com/challenges/c-tutorial-basic-data-types/problem
https://www.hackerrank.com/challenges/c-tutorial-conditional-if-else/problem
b. Complete these Bash coding challenges and submit screenshots:
https://www.hackerrank.com/challenges/bash-tutorials---a-personalized-echo/problem
https://www.hackerrank.com/challenges/bash-tutorials---the-world-of-numbers/problem
(Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.)
c. Complete SQL queries 1-3 here and submit screenshots:
(Each screenshot should include the problem number, the successful smiley face, and your query.)
a. As we will see in the near future, the standard normal distribution comes up A LOT in the context of statistics. It is defined as
$$ p(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}. $$The reason why we haven't encountered it until now is that it's difficult to integrate. In practice, it's common to use a pre-computed table of values to look up probabilities from this distribution.
The actual problem: Write a function calc_standard_normal_probability(a,b)
to approximate $P(a \leq X \leq b)$ for the standard normal distribution, using a Riemann sum with step size 0.001.
To check your function, print out estimates of the following probabilities:
$P(-1 \leq x \leq 1)$
$P(-2 \leq x \leq 2)$
$P(-3 \leq x \leq 3)$
Your estimates should come out close to 0.68, 0.955, 0.997 respectively. (https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
b.
"CDF" stands for Cumulative Distribution Function. The CDF of a probability distribution $f(x)$ is defined as $$ F(x) = P(X \leq x) = \int_{-\infty}^x f(x) \, \textrm dx. $$
Your answer for the CDF will be a piecewise function (3 pieces).
$EX$ means $E[X].$
c.
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.
Primary problems; 40% of assignment grade; 60 minutes estimate
a. Create a RandomForest
class in machine-learning/src/random-forest
that is initialized with a value n
that represents the number of random decision trees to use.
The RandomForest
should have a fit()
method and a predict()
method, just like the DecisionTree
.
The fit()
method should fit all the random decision trees.
The predict()
method should get a prediction from each random decision tree, and then return the prediction that occurred most frequently. (If there are multiple predictions that occurred most frequently, then choose randomly among them.)
So it should work like this:
rf = RandomForest(10) # random forest consisting of 10 random trees
rf.fit(df) # fit all 10 of those trees to the dataframe
rf.predict(observation) # have each of the 10 trees make a prediction, and
# return the majority vote of the 10 trees
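A minimal sketch of one way the class could be organized, assuming your DecisionTree supports split_metric='random' along with fit(df) and classify(observation) as described elsewhere in this assignment sequence:
import random
from collections import Counter

class RandomForest:
    def __init__(self, n):
        self.n = n
        self.trees = []

    def fit(self, df):
        # build and fit n random decision trees
        self.trees = []
        for _ in range(self.n):
            tree = DecisionTree(split_metric='random')
            tree.fit(df)
            self.trees.append(tree)

    def predict(self, observation):
        # majority vote over the trees' predictions, breaking ties randomly
        predictions = [tree.classify(observation) for tree in self.trees]
        counts = Counter(predictions)
        highest = max(counts.values())
        winners = [label for label, count in counts.items() if count == highest]
        return random.choice(winners)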
b. Refactor the combat_state
in your game.
Previously, it looked like this:
[
{'location': (1,2),
'order': [{'player': 1, 'unit': 0,},
{'player': 0, 'unit': 1},
{'player': 1, 'unit': 1}]
},
{'location': (5,10),
'order': [{'player': 0, 'unit': 0},
{'player': 1, 'unit': 2},
{'player': 1, 'unit': 4}]
}
]
Now, we will refactor the above into this:
{
(1,2): [{'player': 1, 'unit': 0},
{'player': 0, 'unit': 1},
{'player': 1, 'unit': 1},
{'player': 1, 'unit': 2}],
(2,2): [{'player': 2, 'unit': 0},
{'player': 3, 'unit': 1},
{'player': 2, 'unit': 1},
{'player': 2, 'unit': 2}]
}
As a result, we will also have to update the inputs to decide_which_unit_to_attack
. Originally, the inputs were as follows:
decide_which_unit_to_attack(self, combat_state, attacker_index)
Now, we will have to include an additional input location
as follows:
decide_which_unit_to_attack(self, combat_state, location, attacker_index)
c. Refactor your decide_removals
function into a function decide_removal
(singular, not plural) that returns the index of a single ship to remove. So, it will return a single integer instead of an array.
Then, refactor your game so that it calls decide_removal
repeatedly until no more removals are required. This will prevent a situation in which our game crashes because a player did not remove enough ships.
def decide_removal(self, game_state):
...
return the unit index of the ship that you want to remove.
for example, if you want to remove the unit at index 2,
return 2
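Inside the game, the repeated-removal idea might look roughly like this sketch; removals_needed and remove_unit are hypothetical helpers standing in for however your game checks maintenance costs and removes a unit.
def run_removals(self, player):
    # keep asking the strategy for one removal at a time until the player
    # no longer needs to remove ships (e.g. maintenance cost <= CP)
    while self.removals_needed(player):          # hypothetical helper
        unit_index = player.strategy.decide_removal(self.game_state)
        self.remove_unit(player, unit_index)     # hypothetical helper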
Supplemental problems; 50% of assignment grade; 75 minutes estimate
PART 1
Location: assignment-problems
Write a function random_draw(distribution)
that draws a random number from the probability distribution. Assume that the distribution is an array such that distribution[i]
represents the probability of drawing i
.
Here are some examples:
random_draw([0.5, 0.5])
will return 0
or 1
with equal probability
random_draw([0.25, 0.25, 0.5])
will return 0
a quarter of the time, 1
a quarter of the time, and 2
half of the time
random_draw([0.05, 0.2, 0.15, 0.3, 0.1, 0.2])
will return 0
5% of the time, 1
20% of the time, 2
15% of the time, 3
30% of the time, 4
10% of the time, and 5
20% of the time.
The way to implement this is to use a cumulative distribution, as illustrated in the following example:
Distribution:
[0.05, 0.2, 0.15, 0.3, 0.1, 0.2]
Cumulative distribution:
[0.05, 0.25, 0.4, 0.7, 0.8, 1.0]
Choose a random number between 0 and 1:
0.77431
The first value in the cumulative distribution that is
greater than 0.77431 is 0.8.
This corresponds to the index 4.
So, return 4.
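A minimal sketch of that procedure in Python, using random.random() for the draw between 0 and 1:
import random

def random_draw(distribution):
    # build the cumulative distribution
    cumulative = []
    total = 0
    for p in distribution:
        total += p
        cumulative.append(total)

    # pick a random number between 0 and 1 and return the index of the
    # first cumulative value that exceeds it
    r = random.random()
    for index, value in enumerate(cumulative):
        if value > r:
            return index
    return len(distribution) - 1  # guard against floating-point round-off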
To test your function, generate 1000 random numbers from each distribution and ensure that their average is close to the true expected value of the distribution.
In other words, for each of the following distributions, print out the true expected value, and then print out the average of 1000 random samples.
[0.5, 0.5]
[0.25, 0.25, 0.5]
[0.05, 0.2, 0.15, 0.3, 0.1, 0.2]
PART 2
Location: assignment-problems
Skim the following sections of http://learnyouahaskell.com/starting-out.
Create Haskell file ClassifyNumber.hs
and write a function classifyNumber
in Haskell that takes an input number x
and returns
"negative"
if x
is negative"nonnegative"
if x
is nonnegative.Then, apply this function to the input 5
.
Your file will look like this:
classifyNumber x = (your code here)
main = putStrLn (classifyNumber 5)
Now, run your function by typing the following into the command line:
>>> ghc --make ClassifyNumber
>>> ./ClassifyNumber
ghc
is a Haskell compiler. It will compile or "make" an executable object using your .hs
file. The command ./ClassifyNumber
actually runs your executable object.
PART 3
Complete this introductory C++ coding challenge: https://www.hackerrank.com/challenges/cpp-input-and-output/problem
Submit a screenshot that includes the name of the problem (top left), your username (top right), and Status: Accepted (bottom).
PART 4
Complete this introductory Shell coding challenge: https://www.hackerrank.com/challenges/bash-tutorials---looping-and-skipping/problem
The following example of a for
loop will be helpful:
for i in {2..10}
do
((n = 5 * i))
echo $n
done
Note: You can solve this problem with just a single for loop
Again, submit a screenshot that includes the name of the problem (top left), your username (top right), and Status: Accepted (bottom), just like in part 3.
PART 5
Complete queries 11-14 here: https://sqlzoo.net/wiki/SELECT_from_Nobel_Tutorial
As usual, include a screenshot for each problem that includes the problem number, the successful smiley face, and your query.
PART 6
Location: Overleaf
Complete the following probability problems:
a.
b.
Review; 10% of assignment grade; 15 minutes estimate
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.
Additionally, do the following:
Make a GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include a link to the issue you created.
Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)
Location: machine-learning/src/decision_tree.py
Grade Weighting: 40%
Update your DecisionTree
to have the option to build the tree via random splits. By "random splits", I mean that the tree should randomly choose from the possible splits, and it should keep splitting until each leaf node is pure.
>>> dt = DecisionTree(split_metric = 'gini')
>>> dt.fit(df)
Fits the decision tree using the Gini metric
>>> dt = DecisionTree(split_metric = 'random')
>>> dt.fit(df)
Fits the decision tree by randomly choosing splits
Estimated Time: 60 minutes
Grade Weighting: 40%
Submit corrections to final (put your corrections in an overleaf doc). I made a final review video that goes through each problem, available here: https://vimeo.com/496684498
For each correction, explain what misunderstanding led to the error, and how to get to the correct result.
Important: The majority of the misunderstandings should NOT be "I ran out of time", and when you explain how to get to the correct result, SHOW ALL WORK.
Grade Weighting: 20%
Make sure that problem 73-1 is done. In the next assignment, you will run everyone else's strategies and they will run yours as well. We should all get the same results.
Estimated Time: 20 minutes
Important! If you don't do the things below, your assignment will receive a grade of zero.
Commit your code to GitHub. When you submit your assignment, include a link to your commit(s).
Make a GitHub issue on your assigned classmate's repository (but NOT assignment-problems
). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include a link to the issue you created.
Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved.
Wrapping up the semester...
Read the edited version of your blog post here. If you'd like to see any changes, post on Slack by the end of the week. Otherwise, these are going up on the website!
Turn in any missing assignments / resubmissions / reviews / test corrections by Sunday 1/3 at the very latest. Finish strong! I want to give out strong grades, but I can only do that if you're up to date with all your work and you've done it well.
Study for the final!
Probability/Statistics
definitions of independent/disjoint, conditional probability, mean, variance, standard deviation, covariance, how variance/covariance are related to expectation, identifying probability distributions, solving for an unknown constant so that a probability distribution is valid, discrete uniform, continuous uniform, exponential, Poisson, using cumulative distributions i.e. P(a <= x < b) = P(x < b) - P(x < a), KL divergence, joint distributions, basic probability computations with joint distributions, likelihood distribution, posterior/prior distributions
Machine Learning
pseudoinverse, fitting a linear regression, fitting a logistic regression, end behaviors of linear and logistic regression, interaction terms, using linear regression to fit the coefficients of a nonlinear function, categorical variables, naive bayes, k-nearest neighbors, decision trees, leave-one-out cross validation, underfitting/overfitting, training/testing datasets (testing datasets are also known as validation datasets)
Algorithms
Intelligent search (backtracking), depth-first search, breadth-first search, shortest path in a graph using breadth-first search, quicksort, computing big-O notation given a recurrence, bisection search (also known as binary search)
Simulation
Euler estimation, SIR model, predator-prey model, Hodgkin-Huxley model, translating a description into a system of differential equations
Review
Basic string processing (something like separate_into_words and reverse_word_order from Quiz 1), Implementing a recursive sequence, unlisting, big-O notation, matrix multiplication, converting to reduced row echelon form, determinant using rref, determinant using cofactors, why determinant using rref is faster than determinant using cofactors, inverse via augmented matrix, tally sort, merge sort (also know how to merge two sorted lists), swap sort, Newton-Raphson (i.e. the “zero of tangent line” method), gradient descent, grid search (also know how to compute cartesian product), Linked list, tree, stack, queue, converting between binary and decimal
Estimated Time: 2 hours
Points: 20
Refactor your game so that strategies adhere to this format exactly. Put the strategies as separate files in src/strategies
.
Note: If you have any disagreements with the strategy template below, post on Slack, and we can discuss.
from units.base import Base
from units.battlecruiser import Battlecruiser
from units.battleship import Battleship
from units.colony import Colony
from units.cruiser import Cruiser
from units.destroyer import Destroyer
from units.dreadnaught import Dreadnaught
from units.scout import Scout
from units.shipyard import Shipyard
class CombatStrategy:
def __init__(self, player_index):
self.player_index = player_index
def will_colonize_planet(self, coordinates, game_state):
...
return either True or False
def decide_ship_movement(self, unit_index, game_state):
...
return a "translation" which is a tuple representing
the direction in which the ship moves.
# For example, if a unit located at (1,2) wanted to
# move to (1,1), then the translation would be (0,-1).
def decide_purchases(self, game_state):
...
return {
'units': list of unit objects you want to buy,
'technology': list of technology attributes you want to upgrade
}
# for example, if you wanted to buy 2 Scouts, 1 Destroyer,
# upgrade defense technology once, and upgrade attack
# technology twice, you'd return
# {
# 'units': [Scout, Scout, Destroyer],
# 'technology': ['defense', 'attack', 'attack']
# }
def decide_removals(self, game_state):
...
return a list of unit indices of ships that you want to remove.
# for example, if you want to remove your 0th and 3rd units, you'd
# return [0, 3]
def decide_which_unit_to_attack(self, combat_state, attacker_index):
# combat_state is the combat_state for the particular battle
# being considered. It will take the form
# {'location': (1,2),
# 'order': [{'player': 1, 'unit': 0},
# {'player': 0, 'unit': 1},
# {'player': 1, 'unit': 1},
# {'player': 1, 'unit': 2}],
# }.
# attacker_index is the index of your unit, whose turn it is
# to attack.
...
return the index of the ship you want to attack in the
combat order.
# in the above example, if you want to attack player 1's unit 1,
# then you'd return 2 because it corresponds to
# combat_state['order'][2]
def decide_which_units_to_screen(self, combat_state):
# again, the combat_state is the combat_state for the
# particular battle
...
return the indices of the ships you want to screen
in the combat order
# in the above example, if you are player 1 and you want
# to screen units 1 and 2, you'd return [2,3] because
# the ships you want to screen are
# combat_state['order'][2] and combat_state['order'][3]
# NOTE: FOR COMBATSTRATEGY AND DUMBSTRATEGY,
# YOU CAN JUST RETURN AN EMPTY ARRAY
Note: for technology upgrades, you'll likely have to translate between strings of technology names and technology stored as Player
attributes. The setattr
and getattr
functions may be helpful:
>>> class Cls:
pass
>>> obj = Cls()
>>> setattr(obj, "foo", "bar")
>>> obj.foo
'bar'
>>> getattr(obj, "foo")
'bar'
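For instance, applying a purchased technology upgrade might look something like the following sketch, assuming technology levels are stored directly as Player attributes (e.g. player.attack); the function name is just a suggestion.
def apply_technology_upgrade(player, technology_name):
    # technology_name is a string such as 'attack' or 'defense';
    # bump the corresponding Player attribute by one level
    current_level = getattr(player, technology_name)
    setattr(player, technology_name, current_level + 1)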
Estimated Time: 1 hour
Points: 15
We need to extend our EulerEstimator
to allow for "time delay". To do this, we'll need to keep a cache of data for the necessary variables. However, it's going to be very hard to build this if we always have to refer to variables by their index in the point. So, in this problem, we're going to update our EulerEstimator
so that we can refer to variables by their actual names.
Refactor your EulerEstimator
so that x
is a dictionary instead of an array. This way, we can reference components of x
by their actual labels rather than having to always use indices.
For example, to run our SIR model, we originally did this:
derivatives = [
(lambda t, x: -0.0003*x[0]*x[1]),
(lambda t, x: 0.0003*x[0]*x[1] - 0.02*x[1]),
(lambda t, x: 0.02*x[1])
]
starting_point = (0, (1000, 1, 0))
estimator = EulerEstimator(derivatives, starting_point)
Now, we need to refactor it into this:
derivatives = {
'susceptible': (lambda t, x: -0.0003*x['susceptible']*x['infected']),
'infected': (lambda t, x: 0.0003*x['susceptible']*x['infected'] - 0.02*x['infected']),
'recovered': (lambda t, x: 0.02*x['infected'])
}
starting_point = (0, {'susceptible': 1000, 'infected': 1, 'recovered': 0})
estimator = EulerEstimator(derivatives, starting_point)
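With the dictionary convention, the core Euler update turns into a dictionary comprehension; a minimal sketch of what a single step might look like (the function and variable names are illustrative, not a required interface):
def euler_step(t, x, derivatives, dt):
    # x and derivatives are both keyed by variable name, so one Euler step
    # updates every component by its derivative times the step size
    new_x = {name: x[name] + derivatives[name](t, x) * dt for name in x}
    return t + dt, new_x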
Update the code in test_euler_estimator.py
and 3_neuron_network.py
to adhere to this new convention.
When I check your submission, I'm going to check that your EulerEstimator
has been initialized with a dictionary in each of these files, and I'm going to run each of these files to make sure that they generate the same plots as before.
Estimated Time: 15 minutes
Location:
machine-learning/analysis/scatter_plot.py
Points: 5
Make a scatter plot of the following dataset consisting of the points (x, y, class)
. When the class is A
, color the dot red. When it is B
, color the dot blue. Post your plot on slack once you've got it.
data = [[2,13,'B'],[2,13,'B'],[2,13,'B'],[2,13,'B'],[2,13,'B'],[2,13,'B'],
[3,13,'B'],[3,13,'B'],[3,13,'B'],[3,13,'B'],[3,13,'B'],[3,13,'B'],
[2,12,'B'],[2,12,'B'],
[3,12,'A'],[3,12,'A'],
[3,11,'A'],[3,11,'A'],
[3,11.5,'A'],[3,11.5,'A'],
[4,11,'A'],[4,11,'A'],
[4,11.5,'A'],[4,11.5,'A'],
[2,10.5,'A'],[2,10.5,'A'],
[3,10.5,'B'],
[4,10.5,'A']]
In the plot, make the dot size proportional to the number of points at that location.
For example, to plot a data set
[
(1,1),
(2,4), (2,4),
(3,9), (3,9), (3,9), (3,9),
(4,16), (4,16), (4,16), (4,16), (4,16), (4,16), (4,16), (4,16), (4,16)
]
you would use the following code:
import matplotlib.pyplot as plt
plt.scatter(x=[1, 2, 3, 4], y=[1, 4, 9, 16], s=[20, 40, 80, 160], c='red')
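One way to get the counts (and therefore the sizes) is to tally the points with a Counter before plotting; here is a rough sketch that uses 20 per occurrence, though the exact scaling is up to you as long as the size is proportional to the count.
from collections import Counter
import matplotlib.pyplot as plt

points = [(1,1), (2,4), (2,4), (3,9), (3,9), (3,9), (3,9)]
counts = Counter(points)  # maps each (x, y) point to how many times it occurs

x_values = [point[0] for point in counts]
y_values = [point[1] for point in counts]
sizes = [20 * counts[point] for point in counts]

plt.scatter(x=x_values, y=y_values, s=sizes, c='red')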
Estimated Time: 10-60 minutes (depending on whether you've got bugs)
Location:
machine-learning/src/decision_tree.py
machine-learning/tests/test_decision_tree.py
Points: 10
Refactor your DecisionTree
so that the dataframe is passed in the fit
method (not when the decision tree is initialized). Also, create a method to classify
points.
Then, make sure your decision tree passes the following tests, using the data
from problem 71-1.
Note: Based on visually inspecting a plot of the data, I think these tests are correct, but if you get something different (that looks reasonable), post on Slack so I can check.
df = DataFrame.from_array(data, columns = ['x', 'y', 'class'])
>>> dt = DecisionTree()
>>> dt.fit(df)
The tree should look like this:
(13A, 15B)
/ \
(y < 12.5) (y >= 12.5)
(13A, 3B) (12B)
/ \
(x < 2.5) (x >= 2.5)
(2A, 2B) (11A, 1B)
/ \ / \
(y < 11.25) (y >= 11.25) (y < 10.75) (y >= 10.75)
(2A) (2B) (1A, 1B) (10A)
/ \
(x < 3.5) (x >= 3.5)
(1B) (1A)
>>> dt.root.best_split
('y', 12.5)
>>> dt.root.low.best_split
('x', 2.5)
>>> dt.root.low.low.best_split
('y', 11.25)
>>> dt.root.low.high.best_split
('y', 10.75)
>>> dt.root.low.high.low.best_split
('x', 3.5)
>>> dt.classify({'x': 2, 'y': 11.5})
'B'
>>> dt.classify({'x': 2.5, 'y': 13})
'B'
>>> dt.classify({'x': 4, 'y': 12})
'A'
>>> dt.classify({'x': 3.25, 'y': 10.5})
'B'
>>> dt.classify({'x': 3.75, 'y': 10.5})
'A'
Estimated time: 45 minutes
Location: Overleaf
Grading: 5 points
a.
b.
Estimated time: 45 minutes
Location: Overleaf
Grading: 5 points
Estimated Time: 45 minutes
Location:
machine-learning/src/decision_tree.py
machine-learning/tests/test_decision_tree.py
Points: 15
If you haven't already, create a split()
method in your DecisionTree
(not the same as the split()
method in your Node
!) that splits the tree at the node with highest impurity.
Then, create a fit()
method in your DecisionTree
that keeps on split()
-ing until all terminal nodes are completely pure.
Assert that the following tests pass:
>>> df = DataFrame.from_array(
[[1, 11, 'A'],
[1, 12, 'A'],
[2, 11, 'A'],
[1, 13, 'B'],
[2, 13, 'B'],
[3, 13, 'B'],
[3, 11, 'B']],
columns = ['x', 'y', 'class']
)
>>> dt = DecisionTree(df)
# currently, the decision tree looks like this:
(3A, 4B)
>>> dt.split()
# now, the decision tree looks like this:
(3A, 4B)
/ \
(y < 12.5) (y >= 12.5)
(3A, 1B) (3B)
>>> dt.split()
# now, the decision tree looks like this:
(3A, 4B)
/ \
(y < 12.5) (y >= 12.5)
(3A, 1B) (3B)
/ \
(x < 2.5) (x >= 2.5)
(3A) (1B)
>>> dt.root.high.row_indices
[3, 4, 5]
>>> dt.root.low.low.row_indices
[0, 1, 2]
>>> dt.root.low.high.row_indices
[6]
>>> dt = DecisionTree(df)
# currently, the decision tree looks like this:
(3A, 4B)
>>> dt.fit()
# now, the decision tree looks like this:
(3A, 4B)
/ \
(y < 12.5) (y >= 12.5)
(3A, 1B) (3B)
/ \
(x < 2.5) (x >= 2.5)
(3A) (1B)
>>> dt.root.high.row_indices
[3, 4, 5]
>>> dt.root.low.low.row_indices
[0, 1, 2]
>>> dt.root.low.high.row_indices
[6]
Estimated time: 45 minutes
Location: Overleaf
Grading: 10 points
Estimated time: 45 minutes
Location: Overleaf
Grading: 10 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
c.
Estimated time: 60 min
Grading: 10 points
Locations:
machine-learning/src/leave_one_out_cross_validator.py
machine-learning/tests/test_leave_one_out_cross_validator.py
Write a class LeaveOneOutCrossValidator
that computes percent_accuracy
(also known as "leave-one-out cross validation") for any input classifier. For a refresher, see problem 58-1.
Assert that LeaveOneOutCrossValidator
passes the following tests:
>>> df = the cookie dataset that's in test_k_nearest_neighbors_classifier.py
>>> knn = KNearestNeighborsClassifier(k=5)
>>> cv = LeaveOneOutCrossValidator(knn, df, prediction_column='Cookie Type')
[ Note: under the hood, the LeaveOneOutCrossValidator should
create a leave_one_out_df and do
knn.fit(leave_one_out_df, prediction_column='Cookie Type') ]
>>> cv.accuracy()
0.7894736842105263 (Updated!)
Note: the following is included to help you debug.
Row 0 -- True Class is Shortbread; Predicted Class was Shortbread
Row 1 -- True Class is Shortbread; Predicted Class was Shortbread
Row 2 -- True Class is Shortbread; Predicted Class was Shortbread
Row 3 -- True Class is Shortbread; Predicted Class was Shortbread
Row 4 -- True Class is Sugar; Predicted Class was Sugar
Row 5 -- True Class is Sugar; Predicted Class was Sugar
Row 6 -- True Class is Sugar; Predicted Class was Sugar
Row 7 -- True Class is Sugar; Predicted Class was Shortbread
Row 8 -- True Class is Sugar; Predicted Class was Shortbread
Row 9 -- True Class is Sugar; Predicted Class was Sugar
Row 10 -- True Class is Fortune; Predicted Class was Fortune (Updated!)
Row 11 -- True Class is Fortune; Predicted Class was Fortune
Row 12 -- True Class is Fortune; Predicted Class was Fortune
Row 13 -- True Class is Fortune; Predicted Class was Shortbread
Row 14 -- True Class is Fortune; Predicted Class was Fortune (Updated!)
Row 15 -- True Class is Shortbread; Predicted Class was Sugar
Row 16 -- True Class is Shortbread; Predicted Class was Shortbread
Row 17 -- True Class is Shortbread; Predicted Class was Shortbread
Row 18 -- True Class is Shortbread; Predicted Class was Shortbread
>>> accuracies = []
>>> for k in range(1, len(data)-1):
>>> knn = KNearestNeighborsClassifier(k)
>>> cv = LeaveOneOutCrossValidator(knn, df, prediction_column='Cookie Type')
>>> accuracies.append(cv.accuracy())
>>> accuracies
[0.5789473684210527,
0.5789473684210527, #(Updated!)
0.5789473684210527,
0.5789473684210527,
0.7894736842105263, #(Updated!)
0.6842105263157895,
0.5789473684210527,
0.5789473684210527, #(Updated!)
0.6842105263157895, #(Updated!)
0.5263157894736842,
0.47368421052631576, #(Updated!)
0.42105263157894735,
0.42105263157894735, #(Updated!)
0.3684210526315789, #(Updated!)
0.3684210526315789, #(Updated!)
0.3684210526315789, #(Updated!)
0.42105263157894735]
Estimated time: 45 minutes
Grading: 10 points
Location: Overleaf
Suppose you are a mission control analyst who is looking down at an enemy headquarters through a satellite view, and you want to get an estimate of how many tanks they have. Most of the headquarters is hidden, but you notice that near the entrance, there are four tanks visible, and these tanks are labeled with the numbers $52, 30, 68, 7.$ So, you assume that they have $N$ tanks that they have labeled with numbers from $1$ to $N.$
Your commander asks you for an estimate: with $95\%$ certainty, what's the max number of tanks they have? Be sure to show your work.
In this problem, you'll answer that question using the same process that you used in Problem 41-1. See here for some additional clarifications that were added to this problem when it was given to the Computation & Modeling class.
Grading: 10 points
George & David, this will be a catch-up problem for you guys. You guys are missing a handful of recent assignments, and there are some key problems that serve as foundations for future problems. These are the key problems: 67-1, 66-1, 62-1 (in that order of importance).
Colby, this is also a catch-up problem for you -- your task is to complete 68-1.
Eli & Riley you'll get 10 points for this problem because you're up-to-date.
Grading: extra credit (you can get 200% on this assignment)
Location: assignment-problems/sudoku_solver.py
Use "intelligent search" to solve the following mini sudoku puzzle. Fill in the grid so that every row, every column, and every 3x2 box contains the digits 1 through 6.
For a refresher on "intelligent search", see problem 44-1.
Format your output so that when your code prints out the result, it prints out the result in the shape of a sudoku puzzle:
-----------------
| . . 4 | . . . |
| . . . | 2 3 . |
-----------------
| 3 . . | . 6 . |
| . 6 . | . . 2 |
-----------------
| . 2 1 | . . . |
| . . . | 5 . . |
-----------------
Estimated Time: 2-3 hours
Location:
machine-learning/src/decision_tree.py
machine-learning/tests/test_decision_tree.py
Points: 15
In this problem, you will create the first iteration of a class DecisionTree
that builds a decision tree by repeatedly looping through all possible splits and choosing the split with the highest "goodness of split".
We will use the following simple dataset:
['x', 'y', 'class']
[1, 11, 'A']
[1, 12, 'A']
[2, 11, 'A']
[1, 13, 'B']
[2, 13, 'B']
[3, 13, 'B']
[3, 11, 'B']
For this dataset, "all possible splits" means all midpoints between distinct entries in the sorted data columns.
The sorted distinct entries of x
are 1, 2, 3.
The sorted distinct entries of y
are 11, 12, 13.
So, "all possible splits" are x=1.5, x=2.5, y=11.5, y=12.5.
Assert that the following tests pass. Note that you will need to create a Node
class for the nodes in your decision tree.
>>> df = DataFrame.from_array(
[[1, 11, 'A'],
[1, 12, 'A'],
[2, 11, 'A'],
[1, 13, 'B'],
[2, 13, 'B'],
[3, 13, 'B'],
[3, 11, 'B']],
columns = ['x', 'y', 'class']
)
>>> dt = DecisionTree(df)
>>> dt.root.row_indices
[0, 1, 2, 3, 4, 5, 6] # these are the indices of data points in the root node
>>> dt.root.class_counts
{
'A': 3,
'B': 4
}
>>> dt.root.impurity
0.490 # rounded to 3 decimal places
>>> dt.root.possible_splits.to_array()
# dt.possible_splits is a dataframe with columns
# ['feature', 'value', 'goodness of split']
# Note: below is rounded to 3 decimal places
[['x', 1.5, 0.085],
['x', 2.5, 0.147],
['y', 11.5, 0.085],
['y', 12.5, 0.276]]
>>> dt.root.best_split
('y', 12.5)
>>> dt.root.split()
# now, the decision tree looks like this:
(3A, 4B)
/ \
(y < 12.5) (y >= 12.5)
(3A, 1B) (3B)
# "low" refers to the "<" child node
# "high" refers to the ">=" child node
>>> dt.root.low.row_indices
[0, 1, 2, 6]
>>> dt.root.high.row_indices
[3, 4, 5]
>>> dt.root.low.impurity
0.375
>>> dt.root.high.impurity
0
>>> dt.root.low.possible_splits.to_array()
[['x', 1.5, 0.125],
['x', 2.5, 0.375],
['y', 11.5, 0.042]]
>>> dt.root.low.best_split
('x', 2.5)
>>> dt.root.low.split()
# now, the decision tree looks like this:
(3A, 4B)
/ \
(y < 12.5) (y >= 12.5)
(3A, 1B) (3B)
/ \
(x < 2.5) (x >= 2.5)
(3A) (1B)
>>> dt.root.low.low.row_indices
[0, 1, 2]
>>> dt.root.low.high.row_indices
[6]
>>> dt.root.low.low.impurity
0
>>> dt.root.low.high.impurity
0
Estimated time: 0-10 hours (?)
Grading: 1,000,000,000 points (okay, not actually that many, but this is IMPORTANT because we need to get our game working for the opportunity with Caltech)
Problem 59-2 was to refactor the DumbPlayer
tests to use the game state, and make sure they pass. If you haven't completed this yet, you'll need to do that before starting on this problem.
This problem involves refactoring the way we structure players. Currently, we have a class DumbPlayer
that does everything we'd expect from a dumb player. But really, the only reason why DumbPlayer
is dumb is that it uses a dumb strategy.
So, we are going to replace DumbPlayer
with a class DumbStrategy
, and refactor Player
so that we can initialize like this:
>>> dumb_player_1 = Player(strategy = DumbStrategy)
>>> dumb_player_2 = Player(strategy = DumbStrategy)
>>> game = Game(dumb_player_1, dumb_player_2)
a. Write a class DumbStrategy
in the file src/strategies/dumb_strategy.py
that contains the strategies for following methods:
will_colonize_planet(colony_ship, game_state)
: returns either True
or False
; will be called whenever a player's colony ship lands on an uncolonized planet
decide_ship_movement(ship, game_state)
: returns the coordinates to which the player wishes to move their ship.
decide_purchases(game_state)
: returns a list of ship and/or technology types that you want to purchase; will be called during each economic round.
decide_removals(game_state)
: returns a list of ships that you want to remove; will be called during any economic round when your total maintenance cost exceeds your CP.
decide_which_ship_to_attack(attacking_ship, game_state)
: looks at the ships in the combat order and decides which to attack; will be called whenever it's your turn to attack
b. Refactor your class Player
so that you can initialize a dumb player like this:
>>> dumb_player_1 = Player(strategy = DumbStrategy)
>>> dumb_player_2 = Player(strategy = DumbStrategy)
>>> game = Game(dumb_player_1, dumb_player_2)
c. Make sure that all your tests in tests/test_game_state_dumb_player.py
still pass.
d. Write a class CombatStrategy
in the file src/strategies/combat_strategy.py
that contains the strategies for the same methods as DumbStrategy
. But this time, the strategies should be the same as those that are used in CombatPlayer
.
e. Refactor your tests in tests/test_game_state_dumb_player.py
and make sure they still pass. When you initialize the game, you should do so like this:
>>> combat_player_1 = Player(strategy = CombatStrategy)
>>> combat_player_2 = Player(strategy = CombatStrategy)
>>> game = Game(combat_player_1, combat_player_2)
Take a look at all your assignments so far in this course. If there are any assignments with low grades, that you haven't already resubmitted, then be sure to resubmit them.
Also, if you haven't already, submit quiz corrections for all of the quizzes we've had so far!
Estimated time: 45 min
Locations:
machine-learning/src/k_nearest_neighbors_classifier.py
machine-learning/tests/test_k_nearest_neighbors_classifier.py
Grading: 15 points
Update your KNearestNeighborsClassifier so that
k is defined upon initialization, and
fitting is done by calling fit and passing in the data & dependent variable.
Update the tests, too, and make sure they still pass.
>>> df = the cookie dataset that's in test_k_nearest_neighbors_classifier.py
>>> knn = KNearestNeighborsClassifier(k=5)
>>> knn.fit(df, dependent_variable='Cookie Type') # dependent_variable is the new name for prediction_column
>>> observation = the observation that's in test_k_nearest_neighbors_classifier.py
>>> knn.classify(observation) # we no longer pass in k
Estimated Time: 45 min
Location: Overleaf
Grading: 15 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
c.
Estimated time: 30 min
Location: Overleaf
Grading: 10 points
Complete queries 1-5 in SQL Zoo Module 2. Take a screenshot of each successful query (with the successful smiley face showing) and put them in the overleaf doc.
Complete Module 8 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and put it in the overleaf doc.
Estimated Time: 30 min
Location: Overleaf
Grading: 15 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
c.
Remember that PMF means "probability mass function". This is just the function $P(Z=z).$
Tip: Find the possible values of $Z,$ and then find the probabilities of those values of $Z$ occurring. Your answer will be a piecewise function: $$ P(z) = \begin{cases} \_\_\_, \, z=\_\_\_ \\ \_\_\_, \, z=\_\_\_ \\ \ldots \end{cases} $$
Estimated time: 30 min
Location: Overleaf
Grading: 5 points
Complete queries 11-15 in the SQL Zoo. Take a screenshot of each successful query (with the successful smiley face showing) and put them in the overleaf doc.
Complete Module 7 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and put it in the overleaf doc.
Estimated time: 60 min
Location: assignment-problems/quicksort.py
Grading: 10 points
Previously, you wrote a variant of quicksort that involved splitting the list into two parts (one part $\leq$ the pivot, and another part $>$ the pivot), and then recursively calling quicksort on those parts.
However, this algorithm can be made more efficient by keeping everything in the same list (rather than creating two new lists). You can do this by swapping elements rather than breaking them out into new lists.
Your task is to write a quicksort algorithm that uses only one list, and uses swaps to re-order elements within that list, per the quicksort algorithm. Here is an example of how to do that.
Make sure your algorithm passes the same test as the quicksort without swaps (that you did on the previous assignment).
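If it helps, here is a minimal sketch of the swap-based approach (a Lomuto-style partition with the rightmost entry as the pivot); this is one way to organize it, not the only one.
def quicksort(arr, low=0, high=None):
    # sorts arr in place between indices low and high (inclusive)
    if high is None:
        high = len(arr) - 1
    if low < high:
        pivot = arr[high]  # rightmost entry as the pivot
        boundary = low     # everything left of boundary is <= pivot
        for i in range(low, high):
            if arr[i] <= pivot:
                arr[i], arr[boundary] = arr[boundary], arr[i]
                boundary += 1
        # put the pivot just after the "<= pivot" block
        arr[boundary], arr[high] = arr[high], arr[boundary]
        quicksort(arr, low, boundary - 1)
        quicksort(arr, boundary + 1, high)
    return arr

assert quicksort([5, 8, -1, 9, 10, 3.14, 2, 0, 7, 6]) == [-1, 0, 2, 3.14, 5, 6, 7, 8, 9, 10]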
Estimated time: 30 min
Location: Overleaf
Grading: 10 points
Complete queries 1-10 in the SQL Zoo. Here's a reference for the LIKE operator, which will come in handy.
Take a screenshot of each successful query and put them in an overleaf doc. When a query is successful, you'll see a smiley face appear. Your screenshots should look like this:
Estimated Time: 60 min
Location: Overleaf
Grading: 10 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
c.
If two events $X$ and $Y$ are "independent", then $P(X \cap Y) = P(X) P(Y).$
If two events $X$ and $Y$ are "disjoint", then $P(X \cap Y) = 0.$
d.
Estimated Time: 15 min
Location: Overleaf
Grading: 5 points
(Taken from Introduction to Statistical Learning)
This problem is VERY similar to the test/train analysis you did in the previous assignment. But this time, you don't have to actually code up anything. You just have to use the concepts of overfitting and underfitting to justify your answers.
Grading: 20 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
c.
d.
e.
f.
g.
Grading: 5 points
Complete Module 6 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Complete Module 4 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Grading: 5 points
Resolve the suggestions/comments on your blog post. Copy and paste everything back into Overleaf, and take a final proofread. Read it through to make sure everything is grammatically correct and makes sense. Submit your shareable Overleaf link along with the assignment.
Grading: 10 points
Create a class NaiveBayesClassifier
within machine-learning/src/naive_bayes_classifier.py
that passes the following tests. These tests should be written in tests/test_naive_bayes_classifier.py
using assert statements.
>>> df = DataFrame.from_array(
[
[False, False, False],
[True, True, True],
[True, True, True],
[False, False, False],
[False, True, False],
[True, True, True],
[True, False, False],
[False, True, False],
[True, False, True],
[False, True, False]
],
columns = ['errors', 'links', 'scam']
)
>>> naive_bayes = NaiveBayesClassifier(df, dependent_variable='scam')
>>> naive_bayes.probability('scam', True)
0.4
>>> naive_bayes.probability('scam', False)
0.6
>>> naive_bayes.conditional_probability(('errors',True), given=('scam',True))
1.0
>>> naive_bayes.conditional_probability(('links',False), given=('scam',True))
0.25
>>> naive_bayes.conditional_probability(('errors',True), given=('scam',False))
0.16666666666666666
>>> naive_bayes.conditional_probability(('links',False), given=('scam',False))
0.5
>>> observed_features = {
'errors': True,
'links': False
}
>>> naive_bayes.likelihood(('scam',True), observed_features)
0.1
>>> naive_bayes.likelihood(('scam',False), observed_features)
0.05
>>> naive_bayes.classify('scam', observed_features)
True
Note: in the event of a tie, choose the dependent variable that occurred most frequently in the dataset.
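For reference, here is a hedged sketch of one way such a class could be organized (assuming your DataFrame exposes to_array() and a columns attribute; the exact structure is up to you):
class NaiveBayesClassifier:

    def __init__(self, df, dependent_variable):
        self.dependent_variable = dependent_variable
        self.rows = [dict(zip(df.columns, row)) for row in df.to_array()]

    def probability(self, column, value):
        matches = [row for row in self.rows if row[column] == value]
        return len(matches) / len(self.rows)

    def conditional_probability(self, target, given):
        target_column, target_value = target
        given_column, given_value = given
        given_rows = [row for row in self.rows if row[given_column] == given_value]
        matches = [row for row in given_rows if row[target_column] == target_value]
        return len(matches) / len(given_rows)

    def likelihood(self, hypothesis, observed_features):
        # P(class) * product over observed features of P(feature | class)
        result = self.probability(*hypothesis)
        for feature, value in observed_features.items():
            result *= self.conditional_probability((feature, value), given=hypothesis)
        return result

    def classify(self, dependent_variable, observed_features):
        values = set(row[dependent_variable] for row in self.rows)
        # ties are broken in favor of the value that occurs most frequently in the dataset
        return max(values, key=lambda v: (self.likelihood((dependent_variable, v), observed_features),
                                          self.probability(dependent_variable, v)))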
Grading: 10 points
Location: assignment-problems/quicksort_without_swaps.py
Implement a function quicksort
that implements the variant of quicksort described here: https://www.youtube.com/watch?v=XE4VP_8Y0BU
quicksort
is very similar to mergesort
.Use your function to sort the list [5,8,-1,9,10,3.14,2,0,7,6]
(write a test with an assert statement). Choose the pivot as the rightmost entry.
Grading: 10 points
Location: Writeup in Overleaf; code in machine-learning/analysis/assignment_62.py
Watch this video FIRST: https://youtu.be/EuBBz3bI-aA?t=29
a. Create a dataset as follows:
$$ \left\{ (x, y) \, \Bigg| \, \begin{matrix} x=0.1, 0.2, \ldots, 10 \\ y=3+0.5x^2 + \epsilon, \, \epsilon \sim \mathcal{U}(-5, 5) \end{matrix} \right\} $$Split the dataset into two subsets:
To do this, you can randomly remove 20% of the data points from the dataset.
b. Fit 5 models to the data: a linear regressor, a quadratic regressor, a cubic regressor, a quartic regressor, and a quintic regressor. Compute the residual sum of squares (RSS) for each model on the training data. Which model is most accurate on the training data? Explain why.
c. Compute the RSS for each model on the testing data. Which model is most accurate on the testing data? Explain why.
d. Based on your findings, which model is the best model for the data? Justify your choice.
Location: Overleaf
Grading: 10 points
Construct a decision tree model for the following data. Include the Gini impurity and goodness of split at each node. You should choose the splits so as to maximize the goodness of split each time. Also, draw a picture of the decision boundary on the graph.
Location: simulation/analysis/3-neuron-network.py
Grading: 5 points
There are a couple things we need to update in our BiologicalNeuron
and BiologicalNeuralNetwork
, to make the model more realistic.
The first thing is that the synapse only releases neurotransmitters when a neuron has "fired". So, the voltage due to synapse inputs should not be a sum of all the raw voltages of the corresponding neurons. Instead, we should only sum the voltages that are over some threshold, say, $50 \, \textrm{mV}.$
So, our model becomes
$$\dfrac{\textrm dV}{\textrm dt} = \underbrace{\dfrac{1}{C} \left[ s(t) - I_{\text{Na}}(t) - I_{\text K}(t) - I_{\text L}(t) \right]}_\text{neuron in isolation} + \underbrace{\dfrac{1}{C} \left( \sum\limits_{\begin{matrix} \textrm{synapses from} \\ \textrm{other neurons} \\ \textrm{with } V(t) > 50 \end{matrix}} V_{\text{other neuron}}(t) \right)}_\text{interactions with other neurons}.$$Update your BiologicalNeuralNetwork
using the above model. The resulting graph should stay mostly the same (but this update to the model will be important when we're simulating many neurons).
Grading: 5 points
Make suggestions on your assigned classmate's blog post. If anything is unclear, uninteresting, or awkwardly phrased, make a suggestion to improve it. You should use the "suggesting" feature of Google Docs and type in how you would rephrase or rewrite the particular portions.
Be sure to look for and correct any grammar mistakes as well. This is the second round of review, so I'm expecting there to be NO grammar mistakes whatsoever after you're done reviewing.
Location: Overleaf
Grading: 5 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
Grading: 5 points
Complete Module 3 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Location: Overleaf
Grading: 5 points
For two positive functions $f(n)$ and $g(n),$ we say that $f = O(g)$ if
$$ \lim\limits_{n \to \infty} \dfrac{f(n)}{g(n)} < \infty, $$or equivalently, there exists a constant $c$ such that
$$ f(n) < c \cdot g(n) $$for all $n.$
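For example (separate from the parts below): $2n + 3 = O(n),$ since $\lim\limits_{n \to \infty} \dfrac{2n+3}{n} = 2 < \infty;$ equivalently, $2n + 3 < 6n$ for all $n \geq 1.$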
Using the definition above, prove the following:
a. $3n^2 + 2n + 1 = O(n^2).$
b. $O(f + g) = O(\max(f,g)).$
c. $O(f) \cdot O(g) = O(f \cdot g).$
d. If $f = O(g)$ and $g = O(h)$ then $f = O(h).$
Location: Overleaf
Grading: 5 points
(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)
a.
b.
c.
Location: Overleaf
Grading: 5 points
(Taken from Introduction to Statistical Learning)
IMPORTANT:
For part (a), write out the model for salary of a male in this dataset, and the model for salary of a female in this dataset, and use these models to justify your answer.
Perhaps counterintuitively, question (c) is false. I want you to provide a thorough explanation of why this is the case by coming up with a situation in which there would be a significant interaction, but the interaction term is small.
Grading: 5 points
Complete Module 5 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Complete Module 2 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Grading: 5 points (if you've completed these things already, then you get 5 points free)
Resolve any comments/suggestions in your blog post Google Doc
Catch up on any problems you haven't fully completed: BiologicalNeuralNetwork
, DumbPlayer
tests, percent_correct
with KNearestNeighborsClassifier
Location: Overleaf
Grading: 10 points
Construct a decision tree model for the following data, using the splits shown.
Remember that the formula for Gini impurity for a group with class distribution $\vec p$ is
$$ G(\vec p) = \sum_i p_i (1-p_i) $$and that the "goodness-of-split" is quantified as
$$ \text{goodness} = G(\vec p_\text{pre-split}) - \sum_\text{post-split groups} \dfrac{N_\text{group}}{N_\text{pre-split}} G(\vec p_\text{group}). $$See the updated Eurisko Assignment Template for an example of constructing a decision tree in latex for a graph with given splits.
Be sure to include the class counts, impurity, and goodness of split at each node
Be sure to label each edge with the corresponding decision criterion.
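As a quick sanity check of these formulas (a made-up example, not the assignment's data): a node containing 4 points of class A and 4 of class B has impurity $G = 0.5 \cdot 0.5 + 0.5 \cdot 0.5 = 0.5.$ If a split sends all 4 A's to one child and all 4 B's to the other, each child has impurity $0,$ so the goodness of split is $0.5 - \left( \frac{4}{8} \cdot 0 + \frac{4}{8} \cdot 0 \right) = 0.5,$ the best possible for that node.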
Grading: 10 points (5 points for writing tests, 5 points for passing tests)
Revise tests/test_dumb_player.py
, so that it uses the actual game state. You can refer to Problem 23-3 for the tests.
For example, the first test is as follows:
At the end of Turn 1 Movement Phase:
Player 0 has 3 scouts at (4,0)
Player 1 has 3 scouts at (4,4)
Phrased in terms of the game state, we could write the test as
game_state = game.generate_state()
player_0_scout_locations = [u.location for u in game_state.players[0].units if u.type == Scout]
player_1_scout_locations = [u.location for u in game_state.players[1].units if u.type == Scout]
assert set(player_0_scout_locations) == set([(4,0), (4,0), (4,0)])
assert set(player_1_scout_locations) == set([(4,4), (4,4), (4,4)])
Given the refactoring that we've been doing, your tests might not run successfully the first time. But don't spend all your time on this problem only. If your tests don't pass, then make sure to complete all the other problems in this assignment before you start debugging your game.
Grading: 10 points
Make suggestions on your assigned classmate's blog post. If anything is unclear, uninteresting, or awkwardly phrased, make a suggestion to improve it. You should use the "suggesting" feature of Google Docs and type in how you would rephrase or rewrite the particular portions.
Be sure to look for and correct any grammar mistakes as well. You'll be graded on how thorough your suggestions are. Everyone should be making plenty of suggestions (there are definitely at least 10 suggestions to be made on everyone's drafts).
Grading: 5 points
Complete Module 4 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Complete Module 1 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Grading: 10 points
Recall the following cookie dataset (that has been augmented with some additional examples):
['Cookie Type' ,'Portion Eggs','Portion Butter','Portion Sugar','Portion Flour' ]
[['Shortbread' , 0.14 , 0.14 , 0.28 , 0.44 ],
['Shortbread' , 0.10 , 0.18 , 0.28 , 0.44 ],
['Shortbread' , 0.12 , 0.10 , 0.33 , 0.45 ],
['Shortbread' , 0.10 , 0.25 , 0.25 , 0.40 ],
['Sugar' , 0.00 , 0.10 , 0.40 , 0.50 ],
['Sugar' , 0.00 , 0.20 , 0.40 , 0.40 ],
['Sugar' , 0.02 , 0.08 , 0.45 , 0.45 ],
['Sugar' , 0.10 , 0.15 , 0.35 , 0.40 ],
['Sugar' , 0.10 , 0.08 , 0.35 , 0.47 ],
['Sugar' , 0.00 , 0.05 , 0.30 , 0.65 ],
['Fortune' , 0.20 , 0.00 , 0.40 , 0.40 ],
['Fortune' , 0.25 , 0.10 , 0.30 , 0.35 ],
['Fortune' , 0.22 , 0.15 , 0.50 , 0.13 ],
['Fortune' , 0.15 , 0.20 , 0.35 , 0.30 ],
['Fortune' , 0.22 , 0.00 , 0.40 , 0.38 ],
['Shortbread' , 0.05 , 0.12 , 0.28 , 0.55 ],
['Shortbread' , 0.14 , 0.27 , 0.31 , 0.28 ],
['Shortbread' , 0.15 , 0.23 , 0.30 , 0.32 ],
['Shortbread' , 0.20 , 0.10 , 0.30 , 0.40 ]]
When fitting our k-nearest neighbors models, we have been using this dataset to predict the type of a cookie based on its ingredient portions. We've also seen that issues can arise when $k$ is too small or too large.
So, what is a good value of $k?$
a. To explore this question, plot the function $$ y= \text{percent_correct}(k), \qquad k=1,2,3, \ldots, 16, $$ where $\text{percent_correct}(k)$ is the percentage of points in the dataset that the $k$-nearest neighbors model would classify correctly.
for each data point:
1. fit a kNN model to all the data EXCEPT that data point
2. use the kNN model to classify the data point
3. determine whether the kNN classification matches up
with the actual class of the data point
percent_correct = num_correct_classifications / tot_num_data_points
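Here is a hedged sketch of that procedure in code (assuming the select_rows and to_array methods and a columns attribute on your DataFrame, and the KNearestNeighborsClassifier interface from the earlier problem):
def percent_correct(df, k, dependent_variable='Cookie Type'):
    rows = df.to_array()
    num_correct = 0
    for i in range(len(rows)):
        # fit a kNN model to all the data EXCEPT data point i
        training_df = df.select_rows([j for j in range(len(rows)) if j != i])
        knn = KNearestNeighborsClassifier(training_df, prediction_column=dependent_variable)
        # use the kNN model to classify data point i, and check the classification
        observation = dict(zip(df.columns, rows[i]))
        actual_class = observation.pop(dependent_variable)
        if knn.classify(observation, k=k) == actual_class:
            num_correct += 1
    return num_correct / len(rows)

percents = [percent_correct(df, k) for k in range(1, 17)]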
You should get the following result:
b. Based on what you know about what happens when $k$ is too small or too large, does the shape of your plot make sense?
c. What would be an appropriate value (or range of values) of $k$ for this modeling task? Justify your answer by referring to your plot.
Grading: 10 points
Get your BiologicalNeuralNetwork
fully working. Note that there's a particular issue with defining functions within for loops that Elijah pointed out on Slack:
Consider the following code:
funcs = []
for i in range(1, 11):
    funcs.append(lambda x: x * i)
for f in funcs:
    print(f(5))
You'd expect to see 5, 10, 15, 20, 25, etc.
BUT what you actually get is 50, 50, 50, 50, 50, etc.
Instead of i * 5 for each captured value of i, it is just doing 10 * 5 every time.
This is because the lambda refers to the place that i is stored in memory, which ends up as 10 once the loop finishes.
So instead of taking the current value of i and using it in the lambda, it will take the last value it was set to, before it is actually called. The way you fix this problem is to bind i as a default argument:
funcs = []
for i in range(1, 11):
    funcs.append(lambda x, i=i: x * i)
for f in funcs:
    print(f(5))
So, when you're getting your derivatives, you'll need to do the following:
network_derivatives = []
for i, neuron in enumerate(neurons):
    # indices of neurons that send synapses to neuron i
    parent_indices = [a for (a, b) in synapses if b == i]
    # x is [V0, n0, m0, h0,
    #       V1, n1, m1, h1,
    #       V2, n2, m2, h2, ...]
    # note: parent_indices must also be bound as a default argument,
    # for the same reason as i and neuron
    network_derivatives += [
        (
            lambda t, x, i=i, neuron=neuron, parent_indices=parent_indices:
                neuron.dV(t, x[4*i : (i+1)*4])
                + 1/neuron.C * sum(x[p*4] for p in parent_indices)
        ),
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dn(t, x[4*i : (i+1)*4])
        ),
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dm(t, x[4*i : (i+1)*4])
        ),
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dh(t, x[4*i : (i+1)*4])
        )
    ]
Grading: 10 points
For blog post draft #4, I want you to do the following:
George: Linear and Logistic Regression, Part 1: Understanding the Models
Make plots of the images you wanted to include, and insert them into your post. You can use character arguments in plt.plot()
; here's a reference
You haven't really hit on why the logistic model takes a sigmoid shape. You should talk about the $e^{\beta x}$ term, where $\beta$ is negative. What happens when $x$ gets really negative? What happens when $x$ gets really positive?
For your linear regression, you should use coefficients $\beta_0, \beta_1, \ldots$ just like you did in the logistic regression. This will help drive the point home that logistic regression is just a transformation of linear regression.
We're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.
Colby: Linear and Logistic Regression, Part 2: Fitting the Models
In your explanation of the pseudoinverse, make sure to state that in most modeling contexts, our matrix $X$ is taller than it is wide, because we have lots of data points. So our matrix $X$ is usually not invertible because it is a tall rectangular matrix.
In your explanation of the pseudoinverse, be more careful with your language: the pseudoinverse $(\mathbf{X}^T\mathbf{X})^{-1}$ is not equivalent to the standard inverse $\mathbf{X}^{-1}.$ The equation $\mathbf{X} \vec \beta = \vec y$ is usually not solvable, because the standard inverse $\mathbf{X}^{-1}$ usually does not exist. But the pseudoinverse $(\mathbf{X}^T\mathbf{X})^{-1}$ usually does exist, and the solution $\vec \beta = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \vec y$ minimizes the sum of squared error between the desired output $\vec y$ and the actual output $\mathbf{X} \vec \beta.$
Decide between bracket matrices (bmatrix
) and parenthesis matrices (pmatrix
). You sometimes use bmatrix
, and other times pmatrix
. Choose one convention and stick to it.
On page 2, show the following intermediate steps (fill in the dots): $$\begin{align*} \vec \beta &= \ldots \\ \vec \beta &= \begin{pmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{pmatrix}^{-1} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 4 \\ 0 & 0 & 2 & 5 \end{pmatrix} \begin{pmatrix} 0.1 \\ 0.2 \\ 0.5 \\ 0.6 \end{pmatrix} \\ \vec \beta &= \begin{pmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{pmatrix} \begin{pmatrix} 0.1 \\ 0.2 \\ 0.5 \\ 0.6 \end{pmatrix} \\ \vec \beta &= \ldots \end{align*}$$
Remove any parts of your code that are not relevant to the blog post (even if you used them for some part of an assignment). For example, you should remove the rounding part from apply_coeffs
.
Make sure your code follows the same conventions everywhere. For example, you sometimes say coefficients
, and other times just coeffs
. For the purposes of this blog post, we want to be as clear as possible, so always use coefficients
instead of coeffs
.
Make your code tight so that there is zero redundancy. For example, in __init__
, the argument ratings
is redundant because these values are already in your dataframe, and you already have a variable prediction_column
that indicates the relevant column in your dataframe. So eliminate ratings
from your code.
After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.
Riley: Linear and Logistic Regression, Part 3: Categorical Variables, Interaction Terms, and Nonlinear Transformations of Variables
In the statements of your models, you need a constant term, and you need to standardize your coefficient labels and your variable names. You use different conventions in a lot of places -- sometimes you call the coefficients $a,b,c,\ldots,$ other times $c_1, c_2, c_3,\ldots,$ and other times $\beta_1,\beta_2,\beta_3,\ldots.$ Likewise, you sometimes say "beef" and "pb", while other times you say "roast beef" and "peanut butter". You need to standardize these names. I think using $\beta$'s or $c$'s for coefficients and abbreviations for variable names is preferable. So, for example, one of your equations would turn into $y = \beta_0 + \beta_1(\textrm{beef}) + \beta_2(\textrm{pb}) + \beta_3(\textrm{mayo}) + \beta_4(\textrm{jelly})$ or $y = c_0 + c_1(\textrm{beef}) + c_2(\textrm{pb}) + c_3(\textrm{mayo}) + c_4(\textrm{jelly}).$
After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.
David: Predator-Prey Modeling with Euler Estimation
Explain this more clearly:
Euler estimation works by adding the derivative of an equation to each given value over and over again. This is because the derivatives are the instantaneous rates of change so we add it at each point to accurately show the the equation. Adding each point up from an equation is also equivalent to an integral.
Clean up your code on page 3. It's hard to tell what's going on. Put these code snippets into a single clean function, and change up the naming/structure so that it's clear what's going on. The code doesn't have to be the same as what's actually in your Euler estimator.
After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.
Elijah: Solving Magic Squares using Backtracking
In your nested for loops, you should be using range(1,10)
because 0
is not considered as an element of the magic square.
In your code snippets, you should name everything very descriptively so that it's totally obvious what things represent. For example, instead of s = int(len(arr)**0.5)
, you could say side_length = int(len(arr)**0.5)
.
After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.
Location: machine-learning/src/k_nearest_neighbors_classifier.py
Grading: 10 points
Create a class KNearestNeighborsClassifier
that works as follows. Leverage existing methods in your DataFrame class to do the brunt of the processing.
>>> df = DataFrame.from_array(
[['Shortbread' , 0.14 , 0.14 , 0.28 , 0.44 ],
['Shortbread' , 0.10 , 0.18 , 0.28 , 0.44 ],
['Shortbread' , 0.12 , 0.10 , 0.33 , 0.45 ],
['Shortbread' , 0.10 , 0.25 , 0.25 , 0.40 ],
['Sugar' , 0.00 , 0.10 , 0.40 , 0.50 ],
['Sugar' , 0.00 , 0.20 , 0.40 , 0.40 ],
['Sugar' , 0.10 , 0.08 , 0.35 , 0.47 ],
['Sugar' , 0.00 , 0.05 , 0.30 , 0.65 ],
['Fortune' , 0.20 , 0.00 , 0.40 , 0.40 ],
['Fortune' , 0.25 , 0.10 , 0.30 , 0.35 ],
['Fortune' , 0.22 , 0.15 , 0.50 , 0.13 ],
['Fortune' , 0.15 , 0.20 , 0.35 , 0.30 ],
['Fortune' , 0.22 , 0.00 , 0.40 , 0.38 ]],
columns = ['Cookie Type' ,'Portion Eggs','Portion Butter','Portion Sugar','Portion Flour' ]
)
>>> knn = KNearestNeighborsClassifier(df, prediction_column = 'Cookie Type')
>>> observation = {
'Portion Eggs': 0.10,
'Portion Butter': 0.15,
'Portion Sugar': 0.30,
'Portion Flour': 0.45
}
>>> knn.compute_distances(observation)
Returns a dataframe representation of the following array:
[[0.047, 'Shortbread'],
[0.037, 'Shortbread'],
[0.062, 'Shortbread'],
[0.122, 'Shortbread'],
[0.158, 'Sugar'],
[0.158, 'Sugar'],
[0.088, 'Sugar'],
[0.245, 'Sugar'],
[0.212, 'Fortune'],
[0.187, 'Fortune'],
[0.396, 'Fortune'],
[0.173, 'Fortune'],
[0.228, 'Fortune']]
Note: the above has been rounded to 3 decimal places for ease of viewing, but you should not round yourself.
>>> knn.nearest_neighbors(observation)
Returns a dataframe representation of the following array:
[[0.037, 'Shortbread'],
[0.047, 'Shortbread'],
[0.062, 'Shortbread'],
[0.088, 'Sugar'],
[0.122, 'Shortbread'],
[0.158, 'Sugar'],
[0.158, 'Sugar'],
[0.173, 'Fortune'],
[0.187, 'Fortune'],
[0.212, 'Fortune'],
[0.228, 'Fortune'],
[0.245, 'Sugar'],
[0.396, 'Fortune']]
Note: the above has been rounded to 3 decimal places for ease of viewing, but you should not round yourself.
>>> knn.compute_average_distances(observation)
{
'Shortbread': 0.067,
'Sugar': 0.162,
'Fortune': 0.239
}
Note: the above has been rounded to 3 decimal places for ease of viewing, but you should not round yourself.
>>> knn.classify(observation, k=5)
'Shortbread'
(In the case of a tie, choose whichever class has a lower average distance. If that is still a tie, then pick randomly.)
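Here is a hedged sketch of how the distance computation and classification might be structured (assuming DataFrame exposes to_array(), a columns attribute, from_array(), and order_by() as in the earlier DataFrame problems; the tie-breaking rule above is left as a comment):
from collections import Counter

class KNearestNeighborsClassifier:

    def __init__(self, df, prediction_column):
        self.df = df
        self.prediction_column = prediction_column

    def compute_distances(self, observation):
        distance_rows = []
        for row in self.df.to_array():
            row_dict = dict(zip(self.df.columns, row))
            # Euclidean distance over the feature columns in the observation
            distance = sum((row_dict[col] - value) ** 2
                           for col, value in observation.items()) ** 0.5
            distance_rows.append([distance, row_dict[self.prediction_column]])
        return DataFrame.from_array(distance_rows,
                                    columns=['distance', self.prediction_column])

    def nearest_neighbors(self, observation):
        return self.compute_distances(observation).order_by('distance', ascending=True)

    def compute_average_distances(self, observation):
        totals, counts = {}, {}
        for distance, label in self.compute_distances(observation).to_array():
            totals[label] = totals.get(label, 0) + distance
            counts[label] = counts.get(label, 0) + 1
        return {label: totals[label] / counts[label] for label in totals}

    def classify(self, observation, k):
        top_k = self.nearest_neighbors(observation).to_array()[:k]
        votes = Counter(label for _, label in top_k)
        # a full implementation would break voting ties by average distance, per the note above
        return votes.most_common(1)[0][0]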
Location: simulation/analysis/3-neuron-network.py
Grading: 15 points
IMPORTANT UPDATE: I'm not going to take points off if your BiologicalNeuralNetwork
isn't fully working. I'll expect to see at least the skeleton of it written, but there's this subtle thing with lambda functions that Elijah pointed out, that we need to talk about in class tomorrow.
Create a Github repository named simulation
, and organize the code as follows:
simulation/
|- src/
|- euler_estimator.py
|- biological_neuron.py
|- biological_neural_network.py
|- tests/
|- test_euler_estimator.py
|- analysis/
|- predator_prey.py
|- sir_epidemiology.py
Rename your class Neuron
to be BiologicalNeuron
. (This is to avoid confusion when we create another neuron class in the context of machine learning.)
Create a class BiologicalNeuralNetwork
to simulate a network of interconnected neurons. This class will be initialized with two arguments:
neurons
- a list of neurons in the network
synapses
- a list of "directed edges" that correspond to connections between neurons
To simulate your BiologicalNeuralNetwork
, you will use an EulerEstimator
where x
is a long array of V,n,m,h
for each neuron. So, if you are simulating 3 neurons (neurons 0,1,2), then you will be passing in $4 \times 3 = 12$ derivatives:
Note that you will have to add extra terms to the voltage derivatives to represent the synapses. The updated derivative of voltage is as follows: $$\dfrac{\textrm dV}{\textrm dt} = \underbrace{\dfrac{1}{C} \left[ s(t) - I_{\text{Na}}(t) - I_{\text K}(t) - I_{\text L}(t) \right]}_\text{neuron in isolation} + \underbrace{\dfrac{1}{C} \left( \sum\limits_{\begin{matrix} \textrm{synapses from} \\ \textrm{other neurons} \end{matrix}} V_{\text{other neuron}}(t) \right)}_\text{interactions with other neurons}.$$
So, in the case of 3 neurons connected as $0 \to 1 \to 2,$ the full system of equations would be as follows:
$$\begin{align*} \dfrac{\textrm dV_0}{\textrm dt} &= \textrm{neuron_0.dV}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dn_0}{\textrm dt} &= \textrm{neuron_0.dn}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dm_0}{\textrm dt} &= \textrm{neuron_0.dm}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dh_0}{\textrm dt} &= \textrm{neuron_0.dh}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dV_1}{\textrm dt} &= \textrm{neuron_1.dV}(V_1,n_1,m_1,h_1) + \dfrac{1}{\textrm{neuron_1.C}} V_0(t) \\ \dfrac{\textrm dn_1}{\textrm dt} &= \textrm{neuron_1.dn}(V_1,n_1,m_1,h_1) \\ \dfrac{\textrm dm_1}{\textrm dt} &= \textrm{neuron_1.dm}(V_1,n_1,m_1,h_1) \\ \dfrac{\textrm dh_1}{\textrm dt} &= \textrm{neuron_1.dh}(V_1,n_1,m_1,h_1) \\ \dfrac{\textrm dV_2}{\textrm dt} &= \textrm{neuron_2.dV}(V_2,n_2,m_2,h_2) + \dfrac{1}{\textrm{neuron_2.C}} V_1(t) \\ \dfrac{\textrm dn_2}{\textrm dt} &= \textrm{neuron_2.dn}(V_2,n_2,m_2,h_2) \\ \dfrac{\textrm dm_2}{\textrm dt} &= \textrm{neuron_2.dm}(V_2,n_2,m_2,h_2) \\ \dfrac{\textrm dh_2}{\textrm dt} &= \textrm{neuron_2.dh}(V_2,n_2,m_2,h_2) \end{align*}$$Test your BiologicalNeuralNetwork
as follows:
>>> def electrode_voltage(t):
if t > 10 and t < 11:
return 150
elif t > 20 and t < 21:
return 150
elif t > 30 and t < 40:
return 150
elif t > 50 and t < 51:
return 150
elif t > 53 and t < 54:
return 150
elif t > 56 and t < 57:
return 150
elif t > 59 and t < 60:
return 150
elif t > 62 and t < 63:
return 150
elif t > 65 and t < 66:
return 150
return 0
>>> neuron_0 = BiologicalNeuron(stimulus = electrode_voltage)
>>> neuron_1 = BiologicalNeuron()
>>> neuron_2 = BiologicalNeuron()
>>> neurons = [neuron_0, neuron_1, neuron_2]
>>> synapses = [(0,1), (1,2)]
The neural network resembles a directed graph:
0 --> 1 --> 2
>>> network = BiologicalNeuralNetwork(neurons, synapses)
>>> euler = EulerEstimator(
derivatives = network.get_derivatives(),
point = network.get_starting_point()
)
>>> plt.plot([n/2 for n in range(160)], [electrode_voltage(n/2) for n in range(160)])
>>> euler.plot([0, 80], step_size = 0.001)
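If it's helpful, here is a rough, hedged sketch of the class skeleton (the exact starting-point format depends on your EulerEstimator, and get_starting_values is a hypothetical helper on BiologicalNeuron, not a required name):
class BiologicalNeuralNetwork:

    def __init__(self, neurons, synapses):
        self.neurons = neurons
        self.synapses = synapses    # directed edges, e.g. [(0,1), (1,2)]

    def get_starting_point(self):
        # concatenate each neuron's (V, n, m, h) starting values into one long array
        x0 = []
        for neuron in self.neurons:
            x0 += list(neuron.get_starting_values())    # hypothetical helper
        return (0, x0)

    def get_derivatives(self):
        # one derivative function per state variable; see the
        # network_derivatives pseudocode elsewhere in this document
        ...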
Grading: 5 points
Complete Module 3 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.
Location: machine-learning/tests/test_data_frame.py
Grading: 10 points
Implement the following functionality in your DataFrame
, and assert that these tests pass.
a. Loading an array. You'll need to use @classmethod
for this one (read about it here).
>>> columns = ['firstname', 'lastname', 'age']
>>> arr = [['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]]
>>> df = DataFrame.from_array(arr, columns)
b. Selecting columns by name
>>> df.select_columns(['firstname','age']).to_array()
[['Kevin', 5],
['Charles', 17],
['Anna', 13],
['Sylvia', 9]]
c. Selecting rows by index
>>> df.select_rows([1,3]).to_array()
[['Charles', 'Trapp', 17],
['Sylvia', 'Mendez', 9]]
d. Selecting rows which satisfy a particular condition (given as a lambda function)
>>> df.select_rows_where(
lambda row: len(row['firstname']) >= len(row['lastname'])
and row['age'] > 10
).to_array()
[['Charles', 'Trapp', 17]]
e. Ordering the rows by given column
>>> df.order_by('age', ascending=True).to_array()
[['Kevin', 'Fray', 5],
['Sylvia', 'Mendez', 9],
['Anna', 'Smith', 13],
['Charles', 'Trapp', 17]]
>>> df.order_by('firstname', ascending=False).to_array()
[['Sylvia', 'Mendez', 9],
['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13]]
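As a hedged illustration, two of these methods might look roughly like the following (assuming to_array() and a columns attribute are available, and a constructor of the form DataFrame(rows, columns), which may differ from yours):
# (these would live inside your DataFrame class)
def select_rows_where(self, row_filter):
    # keep the rows for which the filter function returns True, passing each row as a dict
    kept_rows = [row for row in self.to_array()
                 if row_filter(dict(zip(self.columns, row)))]
    return DataFrame(kept_rows, self.columns)

def order_by(self, column, ascending=True):
    index = self.columns.index(column)
    sorted_rows = sorted(self.to_array(), key=lambda row: row[index],
                         reverse=not ascending)
    return DataFrame(sorted_rows, self.columns)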
Grading: 10 points
For blog post draft #3, I want you to do the following:
George: Linear and Logistic Regression, Part 1: Understanding the Models
Make plots of the images you wanted to include, and insert them into your post. You can use character arguments in plt.plot()
; here's a reference
You haven't really hit on why the logistic model takes a sigmoid shape. You should talk about the $e^{\beta x}$ term, where $\beta$ is negative. What happens when $x$ gets really negative? What happens when $x$ gets really positive?
For your linear regression, you should use coefficients $\beta_0, \beta_1, \ldots$ just like you did in the logistic regression. This will help drive the point home that logistic regression is just a transformation of linear regression.
We're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.
Colby: Linear and Logistic Regression, Part 2: Fitting the Models
You haven't defined what $y'$ means. Be sure to do that.
You've got run-on sentences and incorrect comma usage everywhere. Fix that. Read each sentence aloud and make sure it makes sense as a complete sentence.
In your explanation of the pseudoinverse, you say that it's a generalization of the matrix inverse when the matrix may not be invertible. That's correct. But you should also explain why, in our case, the matrix $X$ is not expected to be invertible. (Think: only square matrices are invertible. Is $X$ square?)
Your matrix entries are backwards. For example, the first row should be $x_{11}, x_{12}, \ldots, x_{1n}.$
Wherever you use ln
, you should use \ln
instead.
In your linear/logistic regression functions, you should use $x_1, x_2, \ldots, x_m$ instead of $a,b, \ldots z.$
In your examples of the linear and logistic regression, you should use 3-dimensional data points instead of just 2-dimensional data point. This way, your example can demonstrate how you deal with multiple input variables. Also, you should set up some context around the example. Come up with a concrete situation in which your data points could be observations, and you want to predict something.
Riley: Linear and Logistic Regression, Part 3: Categorical Variables, Interaction Terms, and Nonlinear Transformations of Variables
Instead of inserting a screenshot of the raw data, put it in a data table. See the template for how to create data tables.
You have some models in text: y = a(roast beef)+b(peanut butter) and y = a(roast beef)+b(peanut butter)+c(roast beef)(peanut butter).
These should be on their own lines, as equations.
There are some sections where the wording is really confusing. Proofread your paper and make sure that everything is expressed clearly. For example, this is not expressed clearly:
So for example we could not regress y = x^a. The most important attribute of this is that we can plot a logistic regression using this method. This is possible because the format of the logistic regression is ...
.
At the very end, when you talk about transforming a dataset to fit a quadratic, it's not clear what you're doing. (I know what you're trying to say, but if I didn't already know, then I'd probably be confused.) You should explain how, in general, if we want to fit a nonlinear regression model $$ y= \beta_1 f_2(x_1) + \beta_2 f_2(x_2) + \cdots, $$ then we have to transform the data as $$ (x_1, x_2, \ldots ,y) \to (f_1(x_1), f_2(x_2), \ldots, y) $$ and then fit a linear regression to the points of the form $(f_1(x_1), f_2(x_2), \ldots, y).$
David: Predator-Prey Modeling with Euler Estimation
Fix your latex formatting -- follow the latex commandments.
Use pseudocode formatting (see the template)
In the predator-prey model that you stated, you need to explain where each term comes from. You've sort of hit on this below the model, but you haven't explicitly paired each individual term with its explanation. Also, why do we multiply the $DW$ together for some of the terms? Imagine that the reader knows what a derivative is, but has no experience using a differential equation for modeling purposes.
You should also explain that this equation is difficult to solve analytically, so that's why we're going to turn to Euler estimation.
You should explain where these recurrences come from. Why does this provide a good estimation of the function? (You should talk about rates of change) $$\begin{align*}D(t + \Delta t) &\approx D(t) + D'(t) \Delta t \\ W(t + \Delta t) &\approx W(t) + W'(t) \Delta t \end{align*}$$
When explaining Euler estimation, you should show the computations for the first few points in the plot. This way, the reader can see a concrete example of the process that you're actually carrying out to generate the plot.
Elijah: Solving Magic Squares using Backtracking
Make sure that the rest of your content is there on the next draft:
How can you overcome the inefficiency of brute-force search using "backtracking", i.e. intelligent search? https://en.wikipedia.org/wiki/Sudoku_solving_algorithms#Backtracking
Write some code for how to implement backtracking using a bunch of nested for loops (i.e. the ugly solution). Run some actual simulations to see how long it takes you to find a solution to a 3x3 magic square using backtracking. Then try the 4x4, 5x5, etc and make a graph.
How can you write the code more compactly using a single while loop?
Complete Module 1 AND Module 2 of Sololearn's C++ Course. Take a screenshot of each completed module, with your user profile showing, and submit both screenshots along with the assignment.
Location: Overleaf
Grading: 12 points
Naive Bayes classification is a way to classify a new observation consisting of multiple features, if we have data about how other observations were classified. It involves choosing the class that maximizes the posterior distribution of the classes, given the observation.
$$\begin{align*} \text{class} &= \underset{\text{class}}{\arg\max} \, P(\text{class} \, | \, \text{observed features}) \\ &= \underset{\text{class}}{\arg\max} \, \dfrac{P(\text{observed features} \, | \, \text{class}) P(\text{class})}{P(\text{observed features})} \\ &= \underset{\text{class}}{\arg\max} \, P(\text{observed features} \, | \, \text{class}) P(\text{class})\\ &= \underset{\text{class}}{\arg\max} \, \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{class}) P(\text{class})\\ &= \underset{\text{class}}{\arg\max} \, P(\text{class}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{class}) \\ \end{align*}$$The key assumption (used in the final line) is that all the features are independent:
$$\begin{align*} P(\text{observed features} \, | \, \text{class}) = \prod\limits_{\text{observed} \\ \text{features}} P(\text{feature} \, | \, \text{class}) \end{align*}$$Suppose that you want to find a way to classify whether an email is a phishing scam or not, based on whether it has errors and whether it contains links.
After checking 10 emails in your inbox, you came up with the following data set:
Now, you look at 4 new emails. For each of the new emails, compute
$$ P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) $$and decide whether it is a scam.
a. No errors, no links. You should get
$$ P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = 0 \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{4}. $$b. Contains errors, contains links. You should get
$$ P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = \dfrac{3}{10} \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{20}. $$c. Contains errors, no links. You should get
$$ P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = \dfrac{1}{10} \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{20}. $$d. No errors, contains links. You should get
$$ P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = 0 \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{4}. $$Grading: 12 points
Refactor your Hodgkin-Huxley neuron simulation so that the functions governing the internal state of the neuron are encapsulated within a class Neuron
.
>>> def stimulus(t):
if t > 10 and t < 11:
return 150
elif t > 20 and t < 21:
return 150
elif t > 30 and t < 40:
return 150
elif t > 50 and t < 51:
return 150
elif t > 53 and t < 54:
return 150
elif t > 56 and t < 57:
return 150
elif t > 59 and t < 60:
return 150
elif t > 62 and t < 63:
return 150
elif t > 65 and t < 66:
return 150
return 0
>>> neuron = Neuron(stimulus)
>>> neuron.plot_activity()
The above code should generate the SAME plot that you generated previously.
Note: Do NOT make plot_activity()
into a gigantic function. Rather, you should keep your code modular, using helper functions when appropriate. Multiple helper functions will be needed to achieve this implementation with good code quality.
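One hedged way to organize the class (the derivative helpers wrap the same Hodgkin-Huxley equations from the previous problem; the exact names and argument conventions are up to you):
class Neuron:

    def __init__(self, stimulus):
        self.stimulus = stimulus
        self.C = 1.0

    # helper functions wrapping the Hodgkin-Huxley equations, each taking
    # the time t and the state (V, n, m, h)
    def dV(self, t, state): ...
    def dn(self, t, state): ...
    def dm(self, t, state): ...
    def dh(self, t, state): ...

    def get_starting_point(self):
        # V(0) = 0, with n, m, h at their asymptotic values, as before
        ...

    def plot_activity(self):
        euler = EulerEstimator(
            derivatives=[self.dV, self.dn, self.dm, self.dh],
            point=self.get_starting_point()
        )
        euler.plot([0, 80], step_size=0.01)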
Grading: 6 points
Create an EconomicEngine
that handles the following:
This will be in the spirit of how MovementEngine
handles movement and CombatEngine
handles combat.
In your EconomicEngine
, include a method generate_economic_state()
that generates the following information:
economic_state = {
'income': 20,
'maintenance cost': 5
}
Location: Overleaf
Grading: 10 points
Note: Points will be deducted for poor latex quality. If you're writing up your latex and anything looks off, make sure to post about it so you can fix it before you submit. FOLLOW THE LATEX COMMANDMENTS!
The dataset below displays the ratio of ingredients for various cookie recipes.
['ID', 'Cookie Type' ,'Portion Eggs','Portion Butter','Portion Sugar','Portion Flour' ]
[[ 1 , 'Shortbread' , 0.14 , 0.14 , 0.28 , 0.44 ],
[ 2 , 'Shortbread' , 0.10 , 0.18 , 0.28 , 0.44 ],
[ 3 , 'Shortbread' , 0.12 , 0.10 , 0.33 , 0.45 ],
[ 4 , 'Shortbread' , 0.10 , 0.25 , 0.25 , 0.40 ],
[ 5 , 'Sugar' , 0.00 , 0.10 , 0.40 , 0.50 ],
[ 6 , 'Sugar' , 0.00 , 0.20 , 0.40 , 0.40 ],
[ 7 , 'Sugar' , 0.10 , 0.08 , 0.35 , 0.47 ],
[ 8 , 'Sugar' , 0.00 , 0.05 , 0.30 , 0.65 ],
[ 9 , 'Fortune' , 0.20 , 0.00 , 0.40 , 0.40 ],
[ 10 , 'Fortune' , 0.25 , 0.10 , 0.30 , 0.35 ],
[ 11 , 'Fortune' , 0.22 , 0.15 , 0.50 , 0.13 ],
[ 12 , 'Fortune' , 0.15 , 0.20 , 0.35 , 0.30 ],
[ 13 , 'Fortune' , 0.22 , 0.00 , 0.40 , 0.38 ]]
Suppose you're given a cookie recipe and you want to determine whether it is a shortbread cookie, a sugar cookie, or a fortune cookie. The cookie recipe consists of 0.10 portion eggs, 0.15 portion butter, 0.30 portion sugar, and 0.45 portion flour. We will infer the classification of this cookie using the "$k$ nearest neighbors" approach.
Part 1: How to do $k$ nearest neighbors.
a. This cookie can be represented as the point $P(0.10, 0.15, 0.30, 0.45).$ Compute the Euclidean distance between $P$ and each of the points corresponding to cookies in the dataset.
b. Consider the 5 points that are closest to $P.$ (These are the 5 "nearest neighbors".) What cookie IDs are they, and what types of cookies are represented by these points?
c. What cookie classification showed up most often in the 5 nearest neighbors? What inference can you make about the recipe corresponding to the point $P$?
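As a quick sanity check for part (a): the distance from $P$ to cookie 1 is $\sqrt{(0.14-0.10)^2 + (0.14-0.15)^2 + (0.28-0.30)^2 + (0.44-0.45)^2} = \sqrt{0.0022} \approx 0.047.$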
Part 2: The danger of using too large a $k$
a. What happens if we try to perform the $k$ nearest neighbors approach with $k=13$ (i.e. the full dataset) to infer the cookie classification of point $P?$ What issue occurs, and why does it occur?
b. For each classification of cookie, find the average distance between $P$ and the points corresponding to the cookies in that classification. Explain how this resolves the issue you identified in part (a).
Location: Overleaf
Grading: 10 points
Note: Points will be deducted for poor latex quality. If you're writing up your latex and anything looks off, make sure to post about it so you can fix it before you submit. FOLLOW THE LATEX COMMANDMENTS!
Suppose you want to estimate the probability that you will get into a particular competitive college. You had a bunch of friends a year ahead of you that applied to the college, and these are their results:
Martha was accepted. She was the 95th percentile of her class, got a 33 on the ACT, and had an internship at a well-known company the summer before she applied to college.
Jeremy was rejected. He was in the 95th percentile of his class and got a 34 on the ACT.
Alphie was accepted. He was in the 92nd percentile of his class, got a 35 on the ACT, and had agreed to play on the college's basketball team if accepted.
Dennis was rejected. He was in the 85th percentile of his class, got a 30 on the ACT, and had committed to run on the college's track team if accepted.
Jennifer was accepted. She was in the 80th percentile of her class, got a 36 on the ACT, and had a side business in 3D printing that was making $15,000 per year.
Martin was rejected. He was in the 85th percentile of his class, got a 29 on the ACT, and was a finalist in an international science fair.
Mary was accepted. She was in the 95th percentile of her class, got a 36 on the ACT, and was a national finalist in the math olympiad.
Dean was rejected. He was in the 87th percentile of his class, got a 31 on the ACT, and was a national finalist in the chemistry olympiad.
Adam was accepted. He was in the 99th percentile of his class and got a 36 on the ACT.
Jeremy was rejected. He was in the 95th percentile of his class and got a 32 on the ACT.
Create a writeup in Overleaf that contains the following parts.
a. Create a quantitative dataset to represent this information, and include it in your writeup. Name your features appropriately.
Important: If you use 0 as the output for rejections and 1 as the output for acceptances, you do run into a problem where the regressor blows up. One solution is to use a small number like 0.001 for rejections and 0.999 for acceptances. But like Elijah pointed out, if you change that number to 0.0000001 or something else, then some of the outputs can drastically change.
But think about the college admissions process for these students: given these students' stats, how sure can you actually be that they will or won't get into a college? For example, 0.999999999 seems extreme because you can't be 99.9999999% sure you'll get into a competitive college. Something like, say, in the ballpark of 80% seems more reasonable as a max certainty. So instead of using 0's and 1's in your dataset, you should change those to numbers that would give a more realistic representation of the min certainty & max certainty that these students would get into the college.
There's not really a single correct answer, but you do need to provide some reasonable justification for the numbers that you chose to represent min and max acceptance probability. There's a saying that applies here: "All models are wrong, but some are useful".
b. Decide what type of model you will use to model the probability of acceptance as a function of the features in your dataset. State and justify the form of the model in your writeup.
c. Fit the model to the data. For each feature, answer the following questions:
According to your model, as that variable increases, does the estimated probability of acceptance increase or decrease? Does that result make sense? If so, why? (If not, then something is wrong with your model, and you need to figure out what's going wrong.)
d. Estimate the probability of being accepted for each of the data points that you used to fit the model. How well does this match up with reality?
e. Estimate your probability of being accepted if you are in the 95th percentile of your class and got a 34 on the ACT. Justify why your model's prediction is reasonable.
f. Now suppose that you have an opportunity to do an internship at a well-known company the summer before you apply to college. If you do it, what will your estimated probability of acceptance become? Based on this information, how much does the internship matter in terms of getting into the college you want?
Grading: 10 points
For blog post draft #2, I want you to finish addressing all of the key parts in the content. We'll worry about grammar / phrasing / writing style later.
Here are the things that you still need to address...
George: Linear and Logistic Regression, Part 1: Understanding the Models
You've talked about what regression can be used for. But what exactly is regression? In the beginning, you should write a bit about how regression involves fitting a function to the "general trend" of some data points, and then using that function to predict outcomes for data points where the outcome is not already known.
You should state the linear regression model before you state the logistic regression model, because linear regression is much simpler.
You've given the logistic regression model and stated its sigmoid shape. You should write a bit about why the model takes that shape. Explain why that form of function gives rise to a sigmoid shape.
You should also state how the two models can be generalized when there are multiple input variables.
You've stated that you can change the upper limit of the logistic regression, and you've given some examples of why you might want to do this, but you haven't actually explained how you do this in the equation. What number do you have to change?
Still need to do this: Explain how a logistic model can be adjusted to model variables that range from some general lower bound $a$ to some general upper bound $b.$ Give a couple examples of situations in which we'd want to do this.
Colby: Linear and Logistic Regression, Part 2: Fitting the Models
You've stated the matrix equation for linear regression, but you should explain a bit where it comes from. What is the system of equations you start with, and why does that correspond to the matrix equation you stated?
You've used the term "pseudoinverse" but you haven't stated exactly what that is in the equation. In particular: why do you have to multiply both sides by the transpose first? How come you couldn't just invert the matrix right off the bat?
You've stated that for the logistic regressor, the process is similar, but you have to transform y. But why do you transform y in this way? You need to start with the logistic regression equation, invert it, and then write down the system of equations. Then you can explain how this now looks like a linear regression, except that the right-hand side is just a transformation of y.
Riley: Linear and Logistic Regression, Part 3: Categorical Variables, Interaction Terms, and Nonlinear Transformations of Variables
Separate your text into paragraphs.
Typeset your math markup, putting equations on separate lines where appropriate.
David: Predator-Prey Modeling with Euler Estimation
In the predator-prey model that you stated, you need to explain where each term comes from. You've sort of hit on this below the model, but you haven't explicitly paired each individual term with its explanation. Also, why do we multiply the $DW$ together for some of the terms? Imagine that the reader knows what a derivative is, but has no experience using a differential equation for modeling purposes.
After you set up the system, you should start talking about the Euler estimation process. The plot should come at the very end, because it's the "punch line" of the blog post. You should also explain that this equation is difficult to solve analytically, so that's why we're going to turn to Euler estimation.
It looks like you've got some code snippets in an equation environment. Use the code environment provided in the template (I've updated the template recently). The code will look much better that way.
Before you dive into your code explanation of Euler estimation, you should do a brief mathematical explanation, referencing the main recurrences: $D(t + \Delta t) \approx D(t) + D'(t) \Delta t,$ $W(t + \Delta t) \approx W(t) + W'(t) \Delta t.$ You should also explain where these recurrences come from.
When explaining Euler estimation, you should show the computations for the first few points in the plot. This way, the reader can see a concrete example of the process that you're actually carrying out to generate the plot.
Elijah: Solving Magic Squares using Backtracking
Continue writing the rest of your draft. Here are the areas you still need to address:
How do you write the is_valid
function?
What's the problem with brute-force search? Run some actual simulations to see how long it takes you to find a solution to a 3x3 magic square using brute force. Then try the 4x4, 5x5, etc and make a graph. State the results as something ridiculous -- e.g. (it'll take me years to solve a something-by-something magic square)
What is the most obvious inefficiency in brute-force search? (It spends a lot of time exploring invalid combinations.)
How can you overcome this inefficiency using "backtracking", i.e. intelligent search? https://en.wikipedia.org/wiki/Sudoku_solving_algorithms#Backtracking
Write some code for how to implement backtracking using a bunch of nested for loops (i.e. the ugly solution). Run some actual simulations to see how long it takes you to find a solution to a 3x3 magic square using backtracking. Then try the 4x4, 5x5, etc and make a graph.
How can you write the code more compactly using a single while loop?
Location: Overleaf
Grading: 2 points for each part
Note: Points will be deducted for poor latex quality. If you're writing up your latex and anything looks off, make sure to post about it so you can fix it before you submit. FOLLOW THE LATEX COMMANDMENTS!
a. The covariance of two random variables $X_1,X_2$ is defined as
$$\text{Cov}[X_1,X_2] = \text{E}[(X_1 - \overline{X}_1)(X_2 - \overline{X}_2)].$$Given that $X \sim U[0,1],$ compute $\text{Cov}[X,X^2].$
b. Given that $X_1, X_2 \sim U[0,1],$ compute $\text{Cov}[X_1, X_2].$
c. Prove that
$$\text{Var}[X_1 + X_2] = \text{Var}[X_1] + \text{Var}[X_2] + 2 \text{Cov}[X_1,X_2].$$d. Prove that
$$\text{Cov}[X_1,X_2] = E[X_1 X_2] - E[X_1] E[X_2].$$Grading: 7 points
Create a MovementEngine
that handles the movement of ships, similar to how CombatEngine
handles combat.
In your MovementEngine
, include a method generate_movement_state()
that generates the following information:
movement_state = {
'round': 1, # Movement 1, 2, or 3
}
For now, most of the relevant information (e.g. the locations of ships) will already be included in the game state. But we may want to expand the movement state in the future, as we include more features of the game.
Grading: 1 point
Email me a bio AND a headshot that you want to be included on the website: https://eurisko.us/people/
Here's a sample, if you want to use a similar structure:
John Doe is a sophomore in Math Academy and App Academy at Pasadena High School. Outside of school, he enjoys running the 100m and 200m sprints in track, playing video games, and working on his Eagle Scout project that involves saving the lives of puppies from students who don’t regularly commit their Github code. For college, John wants to study math and computer science.
Location: Overleaf
NOTE: Points will be deducted for poor latex quality.
a. (2 points) Using the identity $\text{Var}[X] = \text{E}[X^2] - \text{E}[X]^2,$ compute $\text{Var}[X]$ if $X$ is sampled from the continuous uniform distribution $U[a,b].$
b. (2 points) Using the identity $\text{Var}[X] = \text{E}[X^2] - \text{E}[X]^2,$ compute $\text{Var}[X]$ if $X$ is sampled from the exponential distribution $p(x) = \lambda e^{-\lambda x}, \, x \geq 0.$
c. (2 points) Using the identity $\text{Var}[N] = \text{E}[N^2] - \text{E}[N]^2,$ compute $\text{Var}[N]$ if $N$ is sampled from the Poisson distribution $p(n) = \dfrac{\lambda^n e^{-\lambda}}{n!}, \, n \in \left\{ 0, 1, 2, \ldots \right\}.$
Location: assignment-problems/hodgkin_huxley.py
Grading: 14 points
The Nobel Prize in Physiology or Medicine 1963 was awarded jointly to Sir John Carew Eccles, Alan Lloyd Hodgkin and Andrew Fielding Huxley for their 1952 model of "spikes" (called "action potentials") in the voltage of neurons, using differential equations.
Watch this video to learn about neurons, and this video to learn about action potentials.
Here is a link to the Hodgkin-Huxley paper. I've outlined the key points of the model below.
Idea 0: Start with physics fundamentals
From physics, we know that current is proportional to the rate of change of voltage, with the constant of proportionality $C$ called the capacitance:
$$I(t) = C \dfrac{\textrm dV}{\textrm dt}$$So, the voltage of a neuron can be modeled as
$$\dfrac{\textrm dV}{\textrm dt} = \dfrac{I(t)}{C}.$$For neurons, we have $C \approx 1.0.$
Idea 1: Decompose the current into 4 main subcurrents (stimulus & ion channels)
The current $I(t)$ consists of
a stimulus $s(t)$ to the neuron (from an electrode or other neurons),
current flux across sodium and potassium ion channels ($I_{\text{Na}}(t)$ and $I_{\text K}(t)$), and
current leakage, treated as a channel $I_{\text L}(t).$
So, we have
$$\dfrac{\textrm dV}{\textrm dt} = \dfrac{1}{C} \left[ s(t) - I_{\text{Na}}(t) - I_{\text K}(t) - I_{\text L}(t) \right].$$
Idea 2: Model the ion channel currents
The current across an ion channel is proportional to the voltage difference, relative to the equilibrium voltage of that channel:
$$\begin{align*} I_{\text{Na}}(t) &= g_{\text{Na}}(t) \left( V(t) - V_\text{Na} \right), \quad& I_{\text{K}}(t) &= g_{\text{K}}(t) \left( V(t) - V_\text{K} \right), \quad& I_{\text{L}}(t) &= g_{\text{L}}(t) \left( V(t) - V_\text{L} \right), \\ V_\text{Na} &\approx 115, \quad& V_\text{K} &\approx -12, \quad& V_\text{L} &\approx 10.6 \end{align*}$$The constants of proportionality are conductances, which were modeled experimentally:
$$\begin{align} g_{\text{Na}}(t) &= \overline{g}_{\text{Na}} m(t)^3 h(t), \quad& g_{\text{K}}(t) &= \overline{g}_{\text{K}} n(t)^4, \quad& g_{\text L}(t) &= \overline{g}_\text{L}, \\ \overline{g}_{\text{Na}} &\approx 120, \quad& \overline{g}_{\text{K}} &\approx 36, \quad& \overline{g}_{\text{L}} &\approx 0.3, \end{align}$$where
$$\begin{align*} \dfrac{\text dn}{\text dt} &= \alpha_n(t)(1-n(t)) - \beta_n(t)n(t) \\ \dfrac{\text dm}{\text dt} &= \alpha_m(t)(1-m(t)) - \beta_m(t)m(t) \\ \dfrac{\text dh}{\text dt} &= \alpha_h(t)(1-h(t)) - \beta_h(t)h(t). \end{align*}$$and
$$\begin{align*} \alpha_n(t) &= \dfrac{0.01(10-V(t))}{\exp \left[ 0.1 (10-V(t)) \right] - 1}, \quad& \alpha_m(t) &= \dfrac{0.1(25-V(t))}{\exp \left[ 0.1 (25-V(t)) \right] - 1}, \quad& \alpha_h(t) &= 0.07 \exp \left[ -\dfrac{V(t)}{20} \right], \\ \beta_n(t) &= 0.125 \exp \left[ -\dfrac{V(t)}{80} \right], \quad& \beta_m(t) &= 4 \exp \left[ - \dfrac{V(t)}{18} \right], \quad& \beta_h(t) &= \dfrac{1}{\exp \left[ 0.1( 30-V(t)) \right] + 1}. \end{align*}$$
YOUR PROBLEM STARTS HERE..
Implement the Hodgkin-Huxley neuron model using Euler estimation. You can represent the state of the neuron at time $t$ using
$$ \Big( t, (V, n, m, h) \Big), $$and you can approximate the initial values by setting $V(0)=0$ and setting $n,$ $m,$ and $h$ equal to their asymptotic values for $V(0)=0\mathbin{:}$
$$\begin{align*} n(0) &= n_\infty(0) = \dfrac{\alpha_n(0)}{\alpha_n(0) + \beta_n(0)} \\ m(0) &= m_\infty(0) = \dfrac{\alpha_m(0)}{\alpha_m(0) + \beta_m(0)} \\ h(0) &= h_\infty(0) = \dfrac{\alpha_h(0)}{\alpha_h(0) + \beta_h(0)} \end{align*}$$(When we take $V(0)=0,$ we are letting $V$ represent the voltage offset from the usual resting potential.)
Simulate the system for $t \in [0, 80 \, \text{ms}]$ with step size $\Delta t = 0.01$ and stimulus
$$ s(t) = \begin{cases} 150, & t \in [10,11] \cup [20,21] \cup [30,40] \cup [50,51] \cup [53,54] \\ & \phantom{t \in [} \cup [56,57] \cup [59,60] \cup [62,63] \cup [65,66] \\ 0 & \text{otherwise}. \end{cases} $$You should get the following result:
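If you want a sanity check on the setup, here is a minimal, self-contained sketch using a plain Euler loop (a hedged example, not necessarily how your EulerEstimator-based version should be organized):
import math
import matplotlib.pyplot as plt

# constants and rate functions follow the model stated in this problem
C, V_Na, V_K, V_L = 1.0, 115, -12, 10.6
g_Na_bar, g_K_bar, g_L_bar = 120, 36, 0.3

def alpha_n(V): return 0.01 * (10 - V) / (math.exp(0.1 * (10 - V)) - 1)
def alpha_m(V): return 0.1 * (25 - V) / (math.exp(0.1 * (25 - V)) - 1)
def alpha_h(V): return 0.07 * math.exp(-V / 20)
def beta_n(V): return 0.125 * math.exp(-V / 80)
def beta_m(V): return 4 * math.exp(-V / 18)
def beta_h(V): return 1 / (math.exp(0.1 * (30 - V)) + 1)

def s(t):
    intervals = [(10, 11), (20, 21), (30, 40), (50, 51), (53, 54),
                 (56, 57), (59, 60), (62, 63), (65, 66)]
    return 150 if any(a <= t <= b for a, b in intervals) else 0

def dV(t, V, n, m, h):
    I_Na = g_Na_bar * m**3 * h * (V - V_Na)
    I_K = g_K_bar * n**4 * (V - V_K)
    I_L = g_L_bar * (V - V_L)
    return (s(t) - I_Na - I_K - I_L) / C

# initial conditions: V(0) = 0, with the gating variables at their asymptotic values
V = 0
n = alpha_n(V) / (alpha_n(V) + beta_n(V))
m = alpha_m(V) / (alpha_m(V) + beta_m(V))
h = alpha_h(V) / (alpha_h(V) + beta_h(V))

dt = 0.01
t = 0
ts, Vs = [t], [V]
while t < 80:
    dn = alpha_n(V) * (1 - n) - beta_n(V) * n
    dm = alpha_m(V) * (1 - m) - beta_m(V) * m
    dh = alpha_h(V) * (1 - h) - beta_h(V) * h
    V, n, m, h = V + dV(t, V, n, m, h) * dt, n + dn * dt, m + dm * dt, h + dh * dt
    t += dt
    ts.append(t)
    Vs.append(V)

plt.plot(ts, Vs)
plt.xlabel('t (ms)')
plt.ylabel('V')
plt.show()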
Location: Overleaf for explanations; assignment-problems/binary_search.py
for code
a. (2 points) Write a function binary_search(entry, sorted_list)
that finds an index of entry
in the sorted_list
. You should do this by repeatedly checking the midpoint of the list, and then recursing on the lower or upper half of the list as appropriate. (If there is no midpoint, then round up or round down consistently.)
Assert that your function passes the following test:
>>> binary_search(7, [2, 3, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16])
3
b. (1 point) Suppose you have a sorted list of 16 elements. What is the greatest number of iterations of binary search that would be needed to find the index of any particular element in the list? Justify your answer.
c. (2 points) State and justify a recurrence equation for the time complexity of binary search on a list of $n$ elements. Then, use it to derive the time complexity of binary search on a list of $n$ elements.
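For reference, here's one possible sketch of the recursive approach described in part a; the extra low/high arguments (with default values) are just one way to keep track of the current half of the list, not part of the required signature:

def binary_search(entry, sorted_list, low=0, high=None):
    # Recursively search the half-open index range [low, high) of sorted_list.
    if high is None:
        high = len(sorted_list)
    if low >= high:
        return None   # entry is not in the list
    mid = (low + high) // 2   # consistently round down
    if sorted_list[mid] == entry:
        return mid
    elif sorted_list[mid] < entry:
        return binary_search(entry, sorted_list, mid + 1, high)
    else:
        return binary_search(entry, sorted_list, low, mid)

assert binary_search(7, [2, 3, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16]) == 3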
Grading: 5 points
In CombatEngine
, write a method generate_combat_array()
that returns an array of combat states:
[
{'location': (1,2),
'order': [{'player': 1, 'unit': 0,},
{'player': 0, 'unit': 1},
{'player': 1, 'unit': 1}],
},
{'location': (5,10),
'order': [{'player': 0, 'unit': 0},
{'player': 1, 'unit': 2},
{'player': 1, 'unit': 4}],
},
],
Location: Overleaf
Grading: 20 points for a complete draft that is factually correct with proper grammar and usage of transitions / paragraphs, along with descriptions of several images and code snippets to be included in your post.
Write a first draft of your blog post. If you have any images / code snippets in mind, you can just describe them verbally, as follows. We'll fill in images and code later.
Here is an example snippet:
...
To squash the possible outputs into the interval [0,1], we
need to use a sigmoid function of the form
$$y = \dfrac{1}{1+e^{\beta x}}.$$
[image: graph of sigmoidal data with a sigmoid function
running through it]
When we wish to fit the logistic regression for multiple input
variables, we can replace the exponent with a linear combination
of features:
$$y = \dfrac{1}{1+e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n}}.$$
The sigmoid function is nonlinear, so it may seem that we
need to take a different approach to fitting the logistic
regression. However, using a bit of algebra, we can transform the
logistic regression problem into a linear regression problem:
\begin{align*}
y &= \dfrac{1}{1+e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n}} \\
1+e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n} &= \dfrac{1}{y} \\
e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n} &= \dfrac{1}{y} - 1 \\
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n &= \ln \left( \dfrac{1}{y} - 1 \right) \\
\end{align*}
So, if we transform the $y$-values of the dataset as
$$y \to \ln \left( \dfrac{1}{y} - 1 \right),$$
then the logistic regression problem reduces to a linear regression
problem using the transformed $y$-values. To obtain our logistic
regression coefficients, we transform $y$ and fit the linear regression
as follows:
[code: transform y, fit linear regression]
...
We're shooting for something in the style of Hacker News posts. Here are some examples of the kind of style that we're looking for:
George: Linear and Logistic Regression, Part 1: Understanding the Models
What is "regression", and why do we care about it?
In particular, what are linear and logistic regression? What "shape" of data do they model, what is the mathematical form for each of them? (Don't explain how to actually fit the model. This will be done by Colby in part 2.)
Give some examples of situations in which you'd use linear regression instead of logistic regression. Then, give some examples of situations in which you'd use logistic regression instead of linear regression. Make sure you explain why you'd use one model instead of the other.
Explain how in particular, logistic regression can be used to model probability. Give some examples of probabilities that we might model using logistic regression. Why can't we use linear regression to model probability?
Explain how a logistic regression model can be adjusted to model variables that range from 0 to some maximum other than 1. Give a couple examples of situations in which we'd want to do this (e.g. ratings on a scale of 0-10).
Explain how a logistic model can be adjusted to model variables that range from some general lower bound $a$ to some general upper bound $b.$ Give a couple examples of situations in which we'd want to do this.
Colby: Linear and Logistic Regression, Part 2: Fitting the Models
Briefly recap: what "shape" of data do linear and logistic regression model, and what are their mathematical forms?
How do you fit a linear regression model using matrix algebra? What is the pseudoinverse, and why is it needed? Talk through the general method, and illustrate on a concrete example.
How do you fit a logistic regression model using matrix algebra? Talk through the general method, and illustrate on a concrete example.
How is logistic regression related to linear regression? (Logistic regression is really just linear regression on a transformation of variables.)
Explain how you made LogisticRegressor really compact by having it inherit from LinearRegressor. Include a Python code snippet.
Riley: Linear and Logistic Regression, Part 3: Categorical Variables, Interaction Terms, and Nonlinear Transformations of Variables
What if you want to model a dataset using linear or logistic regression, but it has categorical variables? Give a couple examples of situations in which this might happen.
In order to model the dataset, what do you have to do to the categorical variables, and why?
What are "interactions" between variables? Give a couple examples of situations when you might need to incorporate interactions into the models.
As-is, can linear or logistic regression capture interactions between variables? Why not? Explain some bad things that might happen if you use vanilla linear or logistic regression in a case when it's important to capture interactions between variables.
What do you have to do to the dataset in order to for your linear or logistic regression to capture interactions?
You can transform the dataset in any way before fitting a linear regression. Consequently, you can actually model many types of nonlinear data by reducing the task down to a linear regression. For example, you can fit a polynomial model using linear regression. Give an example scenario of when you might want to do this. How can you do this?
Logistic regression reduces to linear regression. Polynomial regression reduces to linear regression. Is there any type of regression model that you can't just reduce down to a linear regression? How can you tell whether or not a regression model can be reduced to linear regression?
David: Predator-Prey Modeling with Euler Estimation
What is a predator-prey relationship? Give several concrete examples.
How can we model a predator-prey relationship using differential equations? How do you set up the system of differential equations, and why do you set the system up that way?
What is Euler estimation, and how can you use it to plot the approximate solution to systems of differential equations? Write some code.
How do you choose your step size? What bad thing happens if you choose a step size that is too big? What bad thing happens if you choose a really really really small step size?
Why do oscillations arise in the plot? What do they actually represent in terms of the predator and prey?
Elijah: Solving Magic Squares using Backtracking
What is a magic square? Talk about it in general, not just the 3x3 case. This link will be helpful: https://mathworld.wolfram.com/MagicSquare.html
How can you solve a magic square using brute-force search? Write some code.
What's the problem with brute-force search? Run some actual simulations to see how long it takes you to find a solution to a 3x3 magic square using brute force. Then try the 4x4, 5x5, etc and make a graph. State the results as something ridiculous -- e.g. (it'll take me years to solve a something-by-something magic square)
What is the most obvious inefficiency in brute-force search? (It spends a lot of time exploring invalid combinations.)
How can you overcome this inefficiency using "backtracking", i.e. intelligent search? https://en.wikipedia.org/wiki/Sudoku_solving_algorithms#Backtracking
Write some code for how to implement backtracking using a bunch of nested for loops (i.e. the ugly solution). Run some actual simulations to see how long it takes you to find a solution to a 3x3 magic square using backtracking. Then try the 4x4, 5x5, etc and make a graph.
How can you write the code more compactly using a single while loop?
Location: Overleaf
Grading: 7 points
Suppose a magician does a trick where they show you a die that has 6 sides labeled with the numbers $1$ through $6.$ Then, they start rolling the die, and they roll two $1$s and three $2$s. You suspect that they might have switched out the die for a "trick" die that is labeled with three sides of $1$ and three sides of $2.$
Let $\text{switch}$ be the event that the magician actually switched out the die.
PART 0
Given the data, compute the likelihood $P(\text{two rolls of 1 and three rolls of 2} \, | \, \text{switch}).$ Your work should lead to the following result: $$ P(\text{two rolls of 1 and three rolls of 2} \, | \, \text{switch}) = \begin{cases} 0.3125, \quad \textrm{switch = True} \\ 0.001286, \quad \textrm{switch = False} \\ \end{cases}$$
PART 1
Suppose that, before the magician rolled the die, you were agnostic: you believed there was a $50\%$ chance that the die was fair (i.e. a $50\%$ chance that the die was switched out for a biased one).
a. Given your prior belief, what is the prior distribution $P(\text{switch})?$
$$ P(\textrm{switch}) = \begin{cases} \_\_\_, \quad \textrm{switch = True} \\ \_\_\_, \quad \textrm{switch = False} \\ \end{cases}$$
b. What is the posterior distribution $P(\text{switch} \, | \, \text{two rolls of 1 and three rolls of 2})?$ Your work should lead to the following result: $$ P(\text{switch} \, | \, \text{two rolls of 1 and three rolls of 2}) = \begin{cases} 0.996, \quad \textrm{switch = True} \\ 0.004, \quad \textrm{switch = False} \\ \end{cases}$$
PART 2
Suppose that, before the magician rolled the die, you were optimistic: you believed there was a $99\%$ chance that the die was fair (i.e. a $1\%$ chance that the die was switched out for a biased one).
a. Given your prior belief, what is the prior distribution $P(\text{switch})?$
$$ P(\textrm{switch}) = \begin{cases} \_\_\_, \quad \textrm{switch = True} \\ \_\_\_, \quad \textrm{switch = False} \\ \end{cases}$$
b. What is the posterior distribution $P(\text{switch} \, | \, \text{two rolls of 1 and three rolls of 2})?$ Your work should lead to the following result: $$ P(\text{switch} \, | \, \text{two rolls of 1 and three rolls of 2}) = \begin{cases} 0.711, \quad \textrm{switch = True} \\ 0.289, \quad \textrm{switch = False} \\ \end{cases}$$
PART 3
Suppose that, before the magician rolled the die, you were pessimistic: you believed there was a $1\%$ chance that the die was fair (i.e. a $99\%$ chance that the die was switched out for a biased one).
a. Given your prior belief, what is the prior distribution $P(\text{switch})?$
$$ P(\textrm{switch}) = \begin{cases} \_\_\_, \quad \textrm{switch = True} \\ \_\_\_, \quad \textrm{switch = False} \\ \end{cases}$$
b. What is the posterior distribution $P(\text{switch} \, | \, \text{two rolls of 1 and three rolls of 2})?$ Your work should lead to the following result: $$ P(\text{switch} \, | \, \text{two rolls of 1 and three rolls of 2}) = \begin{cases} 0.99996, \quad \textrm{switch = True} \\ 0.00004, \quad \textrm{switch = False} \\ \end{cases}$$
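If you want a quick numeric sanity check of the target values above (which assume the two 1s and three 2s are counted without regard to order), something like the following sketch works:

from math import comb

def likelihood(switch):
    # P(two 1s and three 2s in five rolls | switch), counting all orderings
    if switch:
        p1, p2 = 1/2, 1/2   # trick die: three sides of 1, three sides of 2
    else:
        p1, p2 = 1/6, 1/6   # fair die
    return comb(5, 2) * p1**2 * p2**3

def posterior(prior_switch):
    numerator = likelihood(True) * prior_switch
    denominator = numerator + likelihood(False) * (1 - prior_switch)
    return numerator / denominator

print(likelihood(True), likelihood(False))   # about 0.3125 and 0.001286
for prior in [0.5, 0.01, 0.99]:              # Parts 1, 2, and 3 respectively
    print(prior, posterior(prior))           # about 0.996, 0.711, 0.99996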
Location: Overleaf
Grading: 7 points
One of the simplest ways to model the spread of disease using differential equations is the SIR model. The SIR model assumes three sub-populations: susceptible, infected, and recovered.
The number of susceptible people $(S)$ decreases at a rate proportional to the rate of meeting between susceptible and infected people (because susceptible people have a chance of catching the disease when they come in contact with infected people).
The number of infected people $(I)$ increases at a rate proportional to the rate of meeting between susceptible and infected people (because susceptible people become infected after catching the disease), and decreases at a rate proportional to the number of infected people (as the diseased people recover).
The number of recovered people $(R)$ increases at a rate proportional to the number of infected people (as the diseased people recover).
a. Write a system of differential equations to model the system.
$$\begin{cases} \dfrac{\textrm{d}S}{\textrm{d}t} &= \_\_\_, \quad S(0) = \_\_\_ \\ \dfrac{\textrm{d}I}{\textrm{d}t} &= \_\_\_, \quad I(0) = \_\_\_ \\ \dfrac{\textrm{d}R}{\textrm{d}t} &= \_\_\_, \quad R(0) = \_\_\_ \end{cases}$$
Make the following assumptions:
There are initially $1000$ susceptible people and $1$ infected person.
The number of meetings between susceptible and infected people each day is proportional to the product of the numbers of susceptible and infected people, by a factor of $0.01 \, .$ The transmission rate of the disease is $3\%.$ (In other words, $3\%$ of meetings result in transmission.)
Each day, $2\%$ of infected people recover.
Check: If you've written the system correctly, then at $t=0,$ you should have
$$ \dfrac{\textrm{d}S}{\textrm{d}t} = -0.3, \quad \dfrac{\textrm{d}I}{\textrm{d}t} = 0.3, \quad \dfrac{\textrm{d}R}{\textrm{d}t} = 0.2 \, . $$
b. Plot the system and include the plot in your Overleaf document. (You get to choose your own step size and interval. Choose a step size small enough that the model doesn't blow up, but large enough that the simulation doesn't take long to run. Choose an interval that displays all the main features of the differential equation -- you'll see what I mean if you play around with various plotting intervals.)
c. Explain what the plot shows, and explain why this happens.
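Here is a rough sketch of what the Euler simulation could look like. The variable names are my own, and the parameter values are just my reading of the assumptions above, so double-check them against your own system from part a:

import matplotlib.pyplot as plt

meeting_factor = 0.01      # meetings per (susceptible, infected) pair per day
transmission_rate = 0.03   # fraction of meetings that transmit the disease
recovery_rate = 0.02       # fraction of infected people who recover each day

S, I, R = 1000.0, 1.0, 0.0
dt, t_max = 0.1, 1000
t = 0.0
ts, Ss, Is, Rs = [t], [S], [I], [R]

while t < t_max:
    infections = meeting_factor * transmission_rate * S * I
    recoveries = recovery_rate * I
    S = S - dt * infections
    I = I + dt * (infections - recoveries)
    R = R + dt * recoveries
    t += dt
    ts.append(t)
    Ss.append(S)
    Is.append(I)
    Rs.append(R)

plt.plot(ts, Ss, label='S')
plt.plot(ts, Is, label='I')
plt.plot(ts, Rs, label='R')
plt.legend()
plt.savefig('sir_plot.png')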
Location: space-empires/src/game.py generate_state()
Grading: 7 points
Refactor your game state generator to use the structure shown below. Print out your game state at each moment in your DumbPlayer
tests. (In the next assignment, we will refactor the tests to actually use the game state, but for now, just print it out.)
game_state = {
'turn': 4,
'phase': 'Combat', # Can be 'Movement', 'Economic', or 'Combat'
'round': None, # if the phase is movement, then round is 1, 2, or 3
'player_whose_turn': 0, # index of player whose turn it is (or whose ship is attacking during battle),
'winner': None,
'players': [
{'cp': 9,
'units': [
{'location': (5,10),
'type': Scout,
'hits': 0,
'technology': {
'attack': 1,
'defense': 0,
'movement': 3
}},
{'location': (1,2),
'type': Destroyer,
'hits': 0,
'technology': {
'attack': 0,
'defense': 0,
'movement': 2
}},
{'location': (6,0),
'type': Homeworld,
'hits': 0,
'turn_created': 0
},
{'location': (5,3),
'type': Colony,
'hits': 0,
'turn_created': 2
}],
'technology': {'attack': 1, 'defense': 0, 'movement': 3, 'ship size': 0}
},
{'cp': 15,
'units': [
{'location': (1,2),
'type': Battlecruiser,
'hits': 1,
'technology': {
'attack': 0,
'defense': 0,
'movement': 0
}},
{'location': (1,2),
'type': Scout,
'hits': 0,
'technology': {
'attack': 1,
'defense': 0,
'movement': 0
}},
{'location': (5,10),
'type': Scout,
'hits': 0,
'technology': {
'attack': 1,
'defense': 0,
'movement': 0
}},
{'location': (6,12),
'type': Homeworld,
'hits': 0,
'turn_created': 0
},
{'location': (5,10),
'type': Colony,
'turn_created': 1
}],
'technology': {'attack': 1, 'defense': 0, 'movement': 0, 'ship size': 0}
}],
'planets': [(5,3), (5,10), (1,2), (4,8), (9,1)]
}
Location: Overleaf
Grading: 11 points total
a. (1 point) Given that $X \sim p(x),$ where $p(x)$ is a continuous distribution, prove that for any real number $a$ we have $E[aX] = aE[X].$
b. (1 point) Given that $X_1, X_2 \sim p(x),$ where $p(x)$ is a continuous distribution, prove that $E[X_1 + X_2] = E[X_1] + E[X_2].$
c. (3 points) Given that $X \sim p(x)$ where $p(x)$ is a continuous probability distribution, prove the identity $\text{Var}[X] = E[X^2] - E[X]^2.$
d. (3 points) Use bisection search to estimate $\sqrt{5}$ to $4$ decimal places by hand, showing your work at each step of the way. See problem 5-2 for a refresher on bisection search.
e. (3 points) Use "merge sort" to sort the list [4,8,7,7,4,2,3,1]
. Do the problem by hand and show your work at each step of the way. See problem 23-2 for a refresher on merge sort.
We've been doing a lot of likelihood estimation. However, we've been hand-waving the fact that the likelihood actually represents the correct probability distribution of a parameter once we normalize it.
The mathematical reason why the normalized likelihood actually represents the correct probability distribution is Bayes' theorem. Bayes' theorem states that for any two events $A$ and $B,$ if $B$ occurred, then the probability of $A$ occurring is
$$\begin{align} P(A \, | \, B) = \dfrac{P(A \text{ and } B)}{P(B)}. \end{align}$$Note that Bayes' theorem comes from the "multiplication law" for conditional probability:
$$\begin{align} P(A \text{ and } B) = P(A \, | \, B)P(B). \end{align}$$In most of our contexts, we're interested in the probability of a parameter taking a particular value, given some observed data. So, using Bayes' theorem and the multiplication law, we have
$$\begin{align*} P(\text{parameter}=k \, | \, \text{data}) &= \dfrac{P(\text{parameter}=k \text{ and data})}{P(\text{data})} \\ &= \dfrac{P(\text{data} \, | \, \text{parameter}=k) P(\text{parameter}=k)}{P(\text{data})}. \end{align*}$$Now, $P(\text{data})$ is just a constant, so we have
$$\begin{align*} P(\text{parameter}=k \, | \, \text{data}) &= \dfrac{P(\text{data} \, | \, \text{parameter}=k) P(\text{parameter}=k)}{\text{some constant}} \\ &\propto P(\text{data} \, | \, \text{parameter}=k) P(\text{parameter}=k) \end{align*}$$where the "$\propto$" symbol means "proportional to".
The term $P(\text{parameter}=k)$ is called the prior distribution and represents the information that we know about the parameter before we have observed the data. If we haven't observed any data, we often take this prior distribution to be the uniform distribution.
The term $P(\text{data} \, | \, \text{parameter}=k)$ is the probability of observing the data, given that the parameter is $k.$ This is equivalent to the likelihood of the parameter $k,$ given the data.
The term $P(\text{parameter}=k \, | \, \text{data})$ is called the posterior distribution and represents the information that we know about the parameter after we have observed the data.
In all the problems that we have done until now, we have taken the prior distribution to be the uniform distribution, meaning that we don't know anything about the parameter until we have gathered some data. Since the uniform distribution is constant, we have
$$\begin{align*} P(\text{parameter}=k \, | \, \text{data}) &\propto P(\text{data} \, | \, \text{parameter}=k) P(\text{parameter}=k) \\ &\propto P(\text{data} \, | \, \text{parameter}=k) (\text{some constant}) \\ &\propto P(\text{data} \, | \, \text{parameter}=k). \end{align*}$$Remember that $P(\text{data} \, | \, \text{parameter}=k)$ is just the likelihood, $\mathcal{L}(\text{parameter}=k \, | \, \text{data}),$ so we have
$$\begin{align*} P(\text{parameter}=k \, | \, \text{data}) \propto \mathcal{L}(\text{parameter}=k \, | \, \text{data}). \end{align*}$$And there we go! The distribution of the parameter is proportional to the likelihood.
THE ACTUAL QUESTION BEGINS HERE...
Grading: 13 points possible
Location: Overleaf
Suppose a wormhole opens and people begin testing it out for travel purposes. You want to use the wormhole too, but you're not sure whether it's safe. Some other people don't seem to care whether it's safe, so you decide to count the number of successful travels by others until you're $99\%$ sure the risk of disappearing forever into the wormhole is no more than your risk of dying from a car crash in a given year ($1$ in $10\,000$).
You model the situation as $P(\text{success}) = k$ and $P(\text{failure}) = 1-k.$ "Success" means successfully entering and successfully exiting the wormhole, while "failure" means entering but failing to exit the wormhole.
PART 1 (Bayesian inference with a uniform prior)
Start by assuming a uniform prior distribution (i.e. you know nothing about $k$ until you've collected some data). So, $k$ initially follows the uniform distribution over the interval $[0,1]\mathbin{:}$
$$P(k) = \dfrac{1}{1-0} = 1, \quad k \in [0,1]$$
a. (1 point) Other people make $1\,000$ successful trips through the wormhole with no failures. What is the likelihood function for $k$ given these $1\,000$ successful trips?
- Check that your function is correct by verifying a particular input/output pair: when you plug in $k=0.99,$ you should get $\mathcal{L}(k=0.99 \, | \, 1\,000 \text{ successes}) = 0.000043$ (rounded to 6 decimal places).
b. (1 point) What is the posterior distribution for $k$ given these $1\,000$ successful trips? (This is the same as just normalizing the likelihood function).
- Check that your function is correct by verifying a particular input/output pair: when you plug in $k=0.99,$ you should get $p(k=0.99 \, | \, 1\,000 \text{ successes}) = 0.043214$ (rounded to 6 decimal places).
c. (2 points) Assuming that you will use the wormhole $500$ times per year, what is the posterior probability that the risk of disappearing forever into the wormhole is no more than your risk of dying from a car crash in a given year ($1$ in $10\,000$)? In other words, what is $P \left( 1-k^{500} \leq \dfrac{1}{10\,000} \, \Bigg| \, 1\,000 \text{ successes} \right)?$
- Note that $k^{500}$ represents the probability of $500$ successes in a row, so $1-k^{500}$ represents the probability of at least one failure over the course of $500$ trips through the wormhole.
- Check your answer: you should get $0.000200$ (rounded to 4 decimal places)
PART 2 (Updating by inspecting the posterior)
You keep on collecting data until your posterior distribution is $P(k \, | \, \text{? successes}) = 5\,001 k^{5\,000}.$ But then you forget how many successes you have counted. Because this is a rather simple scenario, it's easy to find this number by inspecting your posterior distribution.
a. (1 point) Looking at the given posterior distribution, how many successes have you counted?
b. (2 points) Suppose you observe $2\,000$ more successes. What is the posterior distribution now?
- Check that your function is correct by verifying a particular input/output pair: when you plug $k=0.999$ into your posterior distribution, you should get $P(k=0.999 \, | \, 2\,000 \text{ more successes}) = 6.362$ (rounded to 3 decimal places).
PART 3 (Bayesian updating)
To get some practice with the general procedure of Bayesian inference with a non-uniform prior, let's re-do Part 2, supposing you weren't able to remember the number of successes by inspecting your posterior distribution.
This time, you'll use $P(k \, | \, \text{? successes}) = 5\,001 k^{5\,000}$ as your prior distribution.
a. (3 points) Suppose you observe $2\,000$ more successes. Fill in the blanks: $$\begin{align*} \text{prior distribution: }& P(k) = 5\,001 k^{5\,000} \\ \text{likelihood: }& P(2\,000 \text{ more successes} \, | \, k) = \_\_\_ \\ \text{prior } \times \text{likelihood: }& P(2\,000 \text{ more successes} \, | \, k)P(k) = \_\_\_ \\ \text{posterior distribution: } & P(k \, | \, 2\,000 \text{ more successes}) = \_\_\_ \end{align*}$$
- Check that your function is correct by verifying a particular input/output pair: when you plug $k=0.999$ into your likelihood, you should get $P(2\,000 \text{ more successes} \, | \, k = 0.999) = 0.1351$ (rounded to 4 decimal places)
- Your posterior distribution should come out exactly the same as in 2b, and you need to show the work for why this happens.
PART 4 (Inference)
Let's go back to the moment when you forgot the number of successes, and your posterior distribution was $P(k \, | \, \text{? successes}) = 5\,001 k^{5\,000}.$
a. (3 points) Assuming that you will use the wormhole $500$ times per year, how many more people do you need to observe successfully come out of the wormhole to be $99\%$ sure the risk of disappearing forever into the wormhole is no more than your risk of dying from a car crash in a given year ($1$ in $10\,000$)?
- In other words, find the least $N$ such that $P\left(1-k^{500} \leq \dfrac{1}{10\,000} \, \Bigg| \, N \text{ more successes} \right) = 0.99 \, .$
- Check that your function is correct by verifying a particular input/output pair: your answer should come out to $N = 23 \, 019 \, 699.$
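If you want to sanity-check the Part 1 numbers, a few lines of Python suffice. This sketch uses the closed-form antiderivative of $k^N,$ so it assumes your likelihood from part a really is $k^{1\,000}\mathbin{:}$

# Part 1 sanity checks (uniform prior, 1000 observed successes, no failures)
N = 1000

def likelihood(k):
    return k ** N                # probability of N successes in a row

def posterior(k):
    return (N + 1) * k ** N      # normalized, since the integral of k^N over [0,1] is 1/(N+1)

print(round(likelihood(0.99), 6))    # should be about 0.000043
print(round(posterior(0.99), 6))     # should be about 0.043214

# P(1 - k^500 <= 1/10000) = P(k >= k_min), where k_min = (1 - 1/10000)^(1/500)
k_min = (1 - 1 / 10_000) ** (1 / 500)
prob = 1 - k_min ** (N + 1)          # integral of (N+1) k^N from k_min to 1
print(round(prob, 6))                # should be about 0.0002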
If your EulerEstimator is not currently working, then fix it. Make sure to write a post if you run into any issues you're not able to solve.
Location: Write your answers in LaTeX on Overleaf.com using this template.
Suppose that you wish to model a deer population $D(t)$ and a wolf population $W(t)$ over time $t$ (where time is measured in years).
Initially, there are $100$ deer and $10$ wolves.
In the absence of wolves, the deer population would increase at the instantaneous rate of $60\%$ per year.
In the absence of deer, the wolf population would decrease at the instantaneous rate of $90\%$ per year.
The wolves and deer meet at a rate of $0.1$ times per wolf per deer per year, and every time a wolf meets a deer, it has a $50\%$ chance of successfully killing and eating the deer. In other words, deer are killed at a rate of $0.05$ deer per deer per wolf per year.
The rate at which the wolf population increases is proportional to the number of deer that are killed, by a factor of $0.4.$ In other words, the wolf population grows by a rate of $0.4$ wolves per deer killed per year.
a. (2 points) Set up a system of differential equations to model the situation:
$$\begin{cases} \dfrac{\text{d}D}{\textrm{d}t} = (\_\_\_) D + (\_\_\_) DW, \quad D(0) = \_\_\_ \\ \dfrac{\text{d}W}{\textrm{d}t} = (\_\_\_) W + (\_\_\_) DW, \quad W(0) = \_\_\_ \\ \end{cases}$$
Check your answer: at $t=0,$ you should have $\dfrac{\text{d}D}{\textrm{d}t} = 10$ and $\dfrac{\text{d}W}{\textrm{d}t} = 11.$
Here's some latex for you to use:
$$\begin{cases}
\dfrac{\text{d}D}{\textrm{d}t} = (\_\_\_) D + (\_\_\_) DW, \quad D(0) = \_\_\_ \\
\dfrac{\text{d}W}{\textrm{d}t} = (\_\_\_) W + (\_\_\_) DW, \quad W(0) = \_\_\_ \\
\end{cases}$$
b. (2 points) Plot the system of differential equations for $0 \leq t \leq 100,$ using a step size $\Delta t = 0.001.$ Then, download your plot and put it in your writeup. (I updated the latex template to include an example of inserting an image.)
c. (2 points) In the plot, you should see oscillations. What does this mean in terms of the wolf and deer populations? Why does this happen?
def f(x):
    return g(x)

def g(x):
    return 0

f(0)
a. (2 points) Refactor your Player
class so that a player is initialized with a strategy class, and the strategy class is stored within player.strategy
.
b. (4 points) In Game
, create a method generate_state()
that looks at the game board, players, units, etc and puts all the relevant information into a nested "state" dictionary. In the next class, we will come to an agreement about how this "state" dictionary should be structured, but for now, I want everyone to give it a shot and we'll see what we come up with.
Note: DON'T do this now, but just to let you know what's coming -- once we've come to an agreement regarding the structure of the "state" dictionary, you will be doing the following:
Rename dumb_player.py
to deprecated_dumb_player.py
, rename combat_player.py
to deprecated_combat_player.py
, and rename yourname_strategy_player.py
to deprecated_yourname_strategy_player.py
.
Write classes DumbStrategy
, CombatStrategy
, and YourNameStrategy
that each contain the strategies for the following methods:
will_colonize_planet(colony_ship, game_state)
: returns either True
or False
; will be called whenever a player's colony ship lands on an uncolonized planet
decide_ship_movement(ship, game_state)
: returns the coordinates to which the player wishes to move their ship.
decide_purchases(game_state)
: returns a list of ship and/or technology types that you want to purchase; will be called during each economic round.
decide_removals(game_state)
: returns a list of ships that you want to remove; will be called during any economic round when your total maintenance cost exceeds your CP.
decide_which_ship_to_attack(attacking_ship, game_state)
: looks at the ships in the combat order and decides which to attack; will be called whenever it's your turn to attack
Make sure your tests work when you use players with DumbStrategy
or CombatStrategy
Run a game with YourNameStrategy
vs DumbStrategy
and ensure that your strategy beats the dumb strategy.
Soon, a part of your homework will consist of writing up blog posts about things you've done. I've come up with 11 topics so far and put them in a spreadsheet. Each person is going to write about a different topic. So I'm going to try to match up everyone with topics they're most interested in. Take a look at this spreadsheet and rank your top 5 posts in order of preference (with 1 being the most preferable, 2 being the next-most preferable, and so on).
We're doing this because we need to build up some luck surface area. It's great that we're doing so much cool stuff, but part of the process of opening doors is telling people what you're doing. Writing posts is a way to do that. And it's also going to help contribute to developing your portfolios, so that you have evidence of what you're doing.
Location: Write your answers in LaTeX on Overleaf.com using this template.
Grading: 2 points per part
Several witnesses reported seeing a UFO during the following time intervals:
$$ \text{data}=\bigg\{ [12,13], \, [12,13.5], \, [14,15], \, [14,16] \bigg\} $$
The times represent hours in military time.
Suppose you want to quantify your certainty regarding when the UFO arrived and when it left.
Assume the data came from $\mathcal{U}[a,b],$ the uniform distribution on the interval $[a,b].$ This means the UFO arrived at time $a$ and left at time $b.$
Watch out! The data do NOT correspond to samples of $[a,b].$ Rather, the data correspond to subintervals of $[a,b].$
a. Compute the likelihood function $\mathcal{L}([a,b]\,|\,\text{data}).$
Your result should come out to $\dfrac{3}{(b-a)^4}.$
Hint: if the UFO was there from $t=a$ to $t=b,$ then what's the probability that a single random observation of the UFO would take place between 12:00 and 13:30? In other words, if you had to choose a random number between $a$ and $b,$ what's the probability that your random number would be between $12$ and $13.5?$
b. Normalize the likelihood function so that it can be interpreted as a probability density.
As an intermediate step to solve for the constant of normalization, you will have to take a double integral $\displaystyle \int_{b_\text{min}}^{b_\text{max}} \int_{a_\text{min}}^{a_\text{max}} \mathcal{L}([a,b]\,|\,\text{data}) \, \text{d}a \, \text{d}b$. Two of the bounds will be infinite, and two of the bounds will be finite.
To figure out the appropriate intervals of integration for $a,b,$ ask yourself the following:
What is the highest possible value for $a$ (i.e. the latest the UFO could have arrived) given the data?
What is the lowest possible value for $b$ (i.e. the earliest the UFO could have left) given the data?
c. What is the probability that the UFO came and left sometime during the day that it was sighted? In other words, what is the probability that $0<a<a_\text{max}$ and $b_\text{min} < b < 24?$
d. What is the probability that the UFO arrived before 10am?
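For part b, you can sanity-check your normalization constant numerically. The sketch below assumes the likelihood from part a and the integration bounds discussed above; scipy's dblquad handles the infinite limits:

import numpy as np
from scipy.integrate import dblquad

def likelihood(a, b):
    # L([a,b] | data) from part a; only meaningful for a <= 12 and b >= 16,
    # since the UFO must have been present during every reported interval.
    return 3 / (b - a) ** 4

# Integrate a over (-inf, 12] and b over [16, inf).
# dblquad integrates func(inner, outer); here the outer variable is b.
total, _ = dblquad(likelihood, 16, np.inf, lambda b: -np.inf, lambda b: 12)
print(total)   # compare with your hand computation; the normalized density is likelihood / total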
Begin implementing your custom strategy player as a class (YourName)StrategyPlayer
. In this assignment, you will write pseudocode for the components of your strategy related to movement. Again, assume a $13 \times 13$ board.
By "pseudocode", I mean that you should still write a class and use Python syntax, but your player doesn't actually have to "work" when you run the code. You just need to translate your strategy ideas into a sketch of what it would ideally look like if you coded it up.
For example, if your colonization strategy is to colonize a planet only if it is at least $2$ spaces away from a planet that you have already colonized, then the following "pseudocode" would be fine, even if (for example) you haven't actually implemented other_planet.colonizer_player_number
.
def calc_distance(self, coords_1, coords_2):
    x_1, y_1 = coords_1
    x_2, y_2 = coords_2
    return abs(x_2 - x_1) + abs(y_2 - y_1)

def will_colonize_planet(self, colony_ship, planet, game):
    for other_planet in game.board.planets:
        if other_planet.colonizer_player_number == self.player_number:
            distance = self.calc_distance(planet.coordinates, other_planet.coordinates)
            if distance < 2:
                return False
    return True
a. (2 points) Implement your colonization logic within the method Player.will_colonize_planet(colony_ship, planet, game)
that returns either True
or False
.
This function will be called whenever a player's colony ship lands on an uncolonized planet.
When we integrate this class into the game (you don't have to do it yet), the game will ask the player whether it wants to colonize a particular planet, and the player will use the game
data to determine whether it wants to colonize.
b. (6 points) Implement your movement logic within the method Player.decide_ship_movement(ship, game)
that returns the coordinates to which the player wishes to move their ship.
The player should use the game data to make a decision.
Location: machine-learning/analysis/signal_separation.py
Grading: 6 points total
Use your DataFrame
and LinearRegressor
to solve the following regression problems.
PART A (2 points)
The following dataset takes the form $$y = a + bx + cx^2 + dx^3$$ for some constants $a,b,c,d.$ Use linear regression to determine the best-fit values of $a,b,c,d.$
[(0.0, 4.0),
(0.2, 8.9),
(0.4, 17.2),
(0.6, 28.3),
(0.8, 41.6),
(1.0, 56.5),
(1.2, 72.4),
(1.4, 88.7),
(1.6, 104.8),
(1.8, 120.1),
(2.0, 134.0),
(2.2, 145.9),
(2.4, 155.2),
(2.6, 161.3),
(2.8, 163.6),
(3.0, 161.5),
(3.2, 154.4),
(3.4, 141.7),
(3.6, 122.8),
(3.8, 97.1),
(4.0, 64.0),
(4.2, 22.9),
(4.4, -26.8),
(4.6, -85.7),
(4.8, -154.4)]
PART B (4 points)
The following dataset takes the form $$y = a \sin(x) + b \cos(x) + c \sin(2x) + d \cos(2x)$$ for some constants $a,b,c,d.$ Use linear regression to determine the best-fit values of $a,b,c,d.$
[(0.0, 7.0),
(0.2, 5.6),
(0.4, 3.56),
(0.6, 1.23),
(0.8, -1.03),
(1.0, -2.89),
(1.2, -4.06),
(1.4, -4.39),
(1.6, -3.88),
(1.8, -2.64),
(2.0, -0.92),
(2.2, 0.95),
(2.4, 2.63),
(2.6, 3.79),
(2.8, 4.22),
(3.0, 3.8),
(3.2, 2.56),
(3.4, 0.68),
(3.6, -1.58),
(3.8, -3.84),
(4.0, -5.76),
(4.2, -7.01),
(4.4, -7.38),
(4.6, -6.76),
(4.8, -5.22)]
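If you want to sanity-check your DataFrame/LinearRegressor results, a plain numpy least-squares fit on the same feature columns should give matching coefficients. This sketch is not a substitute for the required implementation, and you'll need to paste in the full data lists from above:

import numpy as np

# Paste in the full (x, y) lists from Parts A and B above; abbreviated here.
data_a = [(0.0, 4.0), (0.2, 8.9), (0.4, 17.2)]    # ...rest of Part A data
data_b = [(0.0, 7.0), (0.2, 5.6), (0.4, 3.56)]    # ...rest of Part B data

def fit(data, feature_funcs):
    # least-squares fit of y = sum_i coeff_i * feature_funcs[i](x)
    x = np.array([point[0] for point in data])
    y = np.array([point[1] for point in data])
    X = np.column_stack([f(x) for f in feature_funcs])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

# Part A: y = a + b x + c x^2 + d x^3
print(fit(data_a, [lambda x: np.ones_like(x), lambda x: x,
                   lambda x: x ** 2, lambda x: x ** 3]))

# Part B: y = a sin(x) + b cos(x) + c sin(2x) + d cos(2x)
print(fit(data_b, [np.sin, np.cos,
                   lambda x: np.sin(2 * x), lambda x: np.cos(2 * x)]))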
Grading: 8 points total
Location: Write your answers in LaTeX on Overleaf.com using this template.
PART A (2 points per part)
The following statements are false. For each statement, explain why it is false, and give a concrete counterexample that illustrates that it is false.
To compute the expected values of all the variables $(x_1, x_2, \ldots, x_n)$ that follow a joint distribution $p(x_1, x_2, \ldots, x_n),$ you need to compute the equivalent of $n$ integrals (assuming that a double integral is the equivalent of $2$ integrals, a triple integral is the equivalent of $3$ integrals, and so on).
If $p(x)$ is a valid probability distribution, and we create a function $f(x,y) = p(x),$ then $f(x,y)$ is a valid joint distribution.
PART B (4 points)
The following statement is true. First, give a concrete example on which the statement holds true. Then, construct a thorough proof.
Grading: 10 points total (2 points for each strategy description + justification)
Location: Write your answers in LaTeX on Overleaf.com using this template.
Write up specifications for your own custom strategy player.
Make the specifications simple enough that you will be able to implement them next week without too much trouble.
Make sure that your player is able to consistently beat DumbPlayer
and CombatPlayer
. But don't tailor your strategies to these opponents -- your player will battle against other classmates' players soon.
In your LaTeX doc, outline your strategy in each of the following components, and justify your reasoning for the strategy. And again, make sure it's simple enough for you to implement next week.
How will you decide where to move your ships?
How will you decide what to buy during the economic phase?
If you are unable to pay all your maintenance costs, how will you decide what ship to remove?
If you come across a planet, how will you decide whether or not to colonize it?
During battle with the enemy, when it's your ship's turn to attack an enemy ship, how will you decide which ship to attack?
Note: You can assume we're playing on a $13 \times 13$ board.
Grading: 8 points
Locations:
assignment-problems/euler_estimator.py
assignment-problems/system_plot.py
Generalize your EulerEstimator
to systems of differential equations. For example, we should be able to model the system
starting at the point $\left( t, \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \right) = \left( 0, \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \right)$ as follows:
>>> euler = EulerEstimator(
derivatives = [
(lambda t,x: x[0] + 1),
(lambda t,x: x[0] + x[1]),
(lambda t,x: 2*x[1])
],
point = (0,(0,0,0))
)
>>> euler.point
(0, (0, 0, 0))
>>> euler.calc_derivative_at_point()
[1, 0, 0]
>>> euler.step_forward(0.1)
>>> euler.point
(0.1, (0.1, 0, 0))
>>> euler.calc_derivative_at_point()
[1.1, 0.1, 0]
>>> euler.step_forward(-0.5)
>>> euler.point
(-0.4, (-0.45, -0.05, 0))
>>> euler.go_to_input(5, step_size = 2)
notes to help you debug:
point: (-0.4, (-0.45, -0.05, 0))
derivative: (0.55, -0.5, -0.1)
deltas: (2, (1.1, -1, -0.2))
point: (1.6, (0.65, -1.05, -0.2))
derivative: (1.65, -0.4, -2.1)
deltas: (2, (3.3, -0.8, -4.2))
point: (3.6, (3.95, -1.85, -4.4))
derivative: (4.95, 2.1, -3.7)
deltas: (1.4, (6.93, 2.94, -5.18))
>>> euler.point
(5, (10.88, 1.09, -9.58))
Also, implement plotting capability.
(continued from above tests)
>>> euler.plot([-5,5], step_size = 0.1, filename = 'plot.png')
generates three plots of x_0(t), x_1(t), and x_2(t) all
on the same axis, for t-values in the interval [-5,5]
separated by a length of step_size
Once you've got a plot, post it on Slack to compare with your classmates.
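In case it helps to see the overall shape of the class, here is a minimal sketch of one way to organize it. It reproduces the transcript above up through go_to_input; the plot method here takes the simplifying assumption that the estimator already sits at the left end of the plotting interval, which you'll need to generalize for the full test:

import matplotlib.pyplot as plt

class EulerEstimator:
    def __init__(self, derivatives, point):
        # derivatives: list of functions f_i(t, x), where x is the tuple of current values
        # point: (t, (x_0, x_1, ...))
        self.derivatives = derivatives
        self.point = point

    def calc_derivative_at_point(self):
        t, x = self.point
        return [f(t, x) for f in self.derivatives]

    def step_forward(self, step_size):
        t, x = self.point
        derivative = self.calc_derivative_at_point()
        new_x = tuple(x_i + step_size * d_i for x_i, d_i in zip(x, derivative))
        self.point = (t + step_size, new_x)

    def go_to_input(self, t_target, step_size):
        # take full-size steps until the last one, which is shrunk to land exactly on t_target
        while self.point[0] + step_size < t_target:
            self.step_forward(step_size)
        self.step_forward(t_target - self.point[0])

    def plot(self, t_range, step_size, filename):
        # simplified: assumes the estimator currently sits at the left end of t_range
        t_end = t_range[1]
        ts, xs = [self.point[0]], [self.point[1]]
        while self.point[0] < t_end:
            self.step_forward(min(step_size, t_end - self.point[0]))
            ts.append(self.point[0])
            xs.append(self.point[1])
        for i in range(len(xs[0])):
            plt.plot(ts, [x[i] for x in xs], label='x_' + str(i))
        plt.legend()
        plt.savefig(filename)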
Grading: 3 points (1 point for fully catching up on each problem)
We'll use this assignment as a partial catch-up. If you didn't get full points on 44-1, 44-2, and 44-4, then revise and resubmit.
44-1: Make sure your final output is a fully correct magic square, meaning all the rows, columns, and diagonals (including the anti-diagonal) each add up to 15. Also, make sure your code runs quickly, meaning that you shouldn't be doing a plain brute-force search.
44-2: I've updated the problem with the correct answers so that you can verify your results. Also, you don't have to show every single little step in your work. Just show the main steps.
44-4: Make sure the game asks your player if it wants to colonize. There should be a method will_colonize()
in your player, and the game should check if player.will_colonize()
. For your DumbPlayer
and CombatPlayer
, you can just set will_colonize()
to return false. Also, make sure your DumbPlayer
tests and CombatPlayer
tests fully pass.
Location: assignment-problems/magic_square.py
In this problem, you will solve for all arrangements of digits $1,2,\ldots, 9$ in a $3 \times 3$ "magic square" where all the rows, columns, and diagonals add up to $15$ and no digits are repeated.
a. (4 points)
First, create a function is_valid(arr)
that checks if a possibly-incomplete array is a valid magic square "so far". In order to be valid, all the rows, columns, and diagonals in an array that have been completely filled in must sum to $15.$
>>> arr1 = [[1,2,None],
[None,3,None],
[None,None,None]]
>>> is_valid(arr1)
True (because no rows, columns, or diagonals are completely filled in)
>>> arr2 = [[1,2,None],
[None,3,None],
[None,None,4]]
>>> is_valid(arr2)
False (because a diagonal is filled in and it doesn't sum to 15)
>>> arr3 = [[1,2,None],
[None,3,None],
[5,6,4]]
>>> is_valid(arr3)
False (because a diagonal is filled in and it doesn't sum to 15)
(it doesn't matter that the bottom row does sum to 15)
>>> arr4 = [[None,None,None],
[None,3,None],
[5,6,4]]
>>> is_valid(arr4)
True (because there is one row that's filled in and it sums to 15)
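For reference, here's one possible sketch of is_valid that passes the tests above (your own organization may differ):

def is_valid(arr):
    # arr is a 3x3 array whose entries are digits or None
    lines = [row for row in arr]                                 # rows
    lines += [[arr[r][c] for r in range(3)] for c in range(3)]   # columns
    lines.append([arr[i][i] for i in range(3)])                  # main diagonal
    lines.append([arr[i][2 - i] for i in range(3)])              # anti-diagonal
    for line in lines:
        if None not in line and sum(line) != 15:
            return False
    return True

assert is_valid([[1, 2, None], [None, 3, None], [None, None, None]])
assert not is_valid([[1, 2, None], [None, 3, None], [None, None, 4]])
assert not is_valid([[1, 2, None], [None, 3, None], [5, 6, 4]])
assert is_valid([[None, None, None], [None, 3, None], [5, 6, 4]])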
b. (6 points)
Now, write a script to start filling in numbers of the array -- but whenever you reach a configuration that can no longer become a valid magic square, you should not explore that configuration any further. Once you reach a valid magic square, print it out.
You can do this using a bunch of nested for loops, along with continue statements where appropriate. (A continue statement allows you to immediately continue to the next item in a for loop, without executing any of the code below the continue statement.)
Some of the first steps are shown below to give a concrete demonstration of the procedure:
Filling...
[[_,_,_],
[_,_,_],
[_,_,_]]
[[1,_,_],
[_,_,_],
[_,_,_]]
[[1,2,_],
[_,_,_],
[_,_,_]]
[[1,2,3],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
[[1,2,4],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
[[1,2,5],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
...
[[1,2,9],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
[[1,3,2],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
[[1,3,4],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
[[1,3,5],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
...
[[1,3,9],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
[[1,4,2],
[_,_,_],
[_,_,_]]
^ no longer can become a valid magic square
...
[[1,5,9],
[_,_,_],
[_,_,_]]
[[1,5,9],
[2,_,_],
[_,_,_]]
[[1,5,9],
[2,3,_],
[_,_,_]]
[[1,5,9],
[2,3,4],
[_,_,_]]
^ no longer can become a valid magic square
[[1,5,9],
[2,3,5],
[_,_,_]]
^ no longer can become a valid magic square
...
Location: Going forward, we're going to start using $\LaTeX$ for stats problems. Make an account at https://www.overleaf.com/ and submit a shareable link to your document. Here is a link to an example template that you can use.
Grading: 1 point per item (8 points total)
A joint distribution is a probability distribution on two or more random variables. To work with joint distributions, you will need to use multi-dimensional integrals.
For example, given a joint distribution $p(x,y),$ the probability that $(X,Y) \in [a,b] \times [c,d]$ is given by
$$ \begin{align*} P((X,Y) \in [a,b] \times [c,d]) = \displaystyle \iint_{[a,b] \times [c,d]} p(x,y) \, \text{d}A, \end{align*} $$or equivalently,
$$ \begin{align*} P(a < X \leq b, \, c < Y \leq d) = \displaystyle \int_c^d \int_a^b p(x,y) \, \text{d}x \, \text{d}y. \end{align*} $$
Part A
The joint uniform distribution $\mathcal{U}([a,b]\times[c,d])$ is a distribution such that all points $(x,y)$ have equal probability in the region $[a,b]\times[c,d]$ and zero probability elsewhere. So, it takes the form
$$p(x,y) = \begin{cases} k & (x,y) \in [a,b] \times [c,d] \\ 0 & (x,y) \not\in [a,b] \times [c,d] \end{cases}$$for some constant $k.$
i. Find the value of $k$ such that $p(x,y)$ is a valid probability distribution. Your answer should be in terms of $a,b,c,d.$
ii. Given that $(X,Y) \sim p,$ compute $\text{E}[X]$ and $\text{E}[Y].$
iii. Given that $(X,Y) \sim p,$ compute $\text{Var}[X]$ and $\text{Var}[Y].$
iv. Given that $(X,Y) \sim p,$ compute $P\left( (X,Y) \in \left[ \dfrac{3a+b}{4}, \dfrac{a+3b}{4} \right] \times \left[ \dfrac{2c+d}{3}, \dfrac{c+2d}{3} \right] \right).$
Part B
Now consider the joint exponential distribution defined by
$$p(x,y) = \begin{cases} k e^{-n x - q y} & x,y \geq 0 \\ 0 & x<0 \text{ or } y < 0 \end{cases}.$$
i. Find the value of $k$ such that $p(x,y)$ is a valid probability distribution. Your answer should be in terms of $n,q.$
ii. Given that $(X,Y) \sim p,$ compute $\text{E}[X]$ and $\text{E}[Y].$
iii. Given that $(X,Y) \sim p,$ compute $\text{Var}[X]$ and $\text{Var}[Y].$
iv. Given that $(X,Y) \sim p,$ compute $P\left( X < a, \, Y < b \right).$
Grading: 5 points
Locations:
assignment-problems/euler_estimator.py
assignment-problems/parabola_plot.py
Update EulerEstimator
to make plots:
>>> euler = EulerEstimator(derivative = (lambda x: x+1),
point = (1,4))
>>> euler.plot([-5,5], step_size = 0.1, filename = 'plot.png')
generates a plot of the function for x-values in the
interval [-5,5] separated by a length of step_size
for this example, the plot should look like the parabola
y = 0.5x^2 + x + 2.5
a. (1 point)
Update the CP in your DumbPlayer
tests. Players start with 0 CP, and the home colony generates 20 CP per round.
b. (5 points)
If you haven't already, modify your game so that whenever a colony ship moves onto a space with a planet, the game asks the player if they want to colonize. Your DumbPlayer
and CombatPlayer
should both choose not to colonize any planets -- but the game should ask them anyway. When you submit, include the link to the file where your game asks the players if they want to colonize, so I can verify.
c. (2 points)
Make sure your DumbPlayer
and CombatPlayer
tests all pass. Let's get this done once and for all so that we can move onto a manual strategy player!
Grading: 4 points for correct predictions and 4 points for code quality
Locations:
machine-learning/src/logistic_regressor.py
machine-learning/tests/test_logistic_regressor.py
If you have existing logistic regressor files and tests, rename them as follows:
machine-learning/src/deprecated_logistic_regressor.py
machine-learning/tests/test_deprecated_logistic_regressor.py
Now, create a new class LogisticRegressor
in machine-learning/src/logistic_regressor.py
and assert
that it passes the following tests. Put your tests in tests/test_logistic_regressor.py
.
IMPORTANT: Your code must satisfy the following specifications:
LogisticRegressor
should inherit from LinearRegressor
.
LogisticRegressor
should compute its coefficients by first replacing the prediction_column
with the appropriate transformed values, and then fitting a linear regression.
LogisticRegressor
should make a prediction by first computing the linear regression output using its coefficients, and then passing the result as input into the sigmoid function.
Note: If you have a deprecated LogisticRegressor
, then you can use parts of it for inspiration, but I don't want to see any old dead code in this new class.
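To make the intended structure concrete, here is a minimal standalone sketch of the inherit / transform / sigmoid pattern. This is not your DataFrame-based class; the LinearRegressor below is a toy numpy stand-in, and all the other names are my own choices:

import numpy as np

class LinearRegressor:
    # Toy stand-in for the course class: X is a 2D array of inputs
    # (already including a constant column), y is a 1D array of outputs.
    def __init__(self, X, y):
        self.X = np.array(X, dtype=float)
        self.y = np.array(y, dtype=float)
        self.coefficients = self.compute_coefficients()

    def compute_coefficients(self):
        # fit with the pseudoinverse
        return np.linalg.pinv(self.X) @ self.y

    def predict(self, x):
        return float(np.dot(self.coefficients, x))

class LogisticRegressor(LinearRegressor):
    # Model: y = max_value / (1 + e^(beta . x))
    def __init__(self, X, y, max_value=1.0):
        self.max_value = max_value
        # transform the prediction column, then fit an ordinary linear regression
        transformed_y = np.log(max_value / np.array(y, dtype=float) - 1)
        super().__init__(X, transformed_y)

    def predict(self, x):
        linear_output = super().predict(x)
        return self.max_value / (1 + np.exp(linear_output))

# Tiny usage example with made-up numbers: rows are (x, constant), ratings in (0, 10)
X = [[1, 1], [2, 1], [3, 1], [4, 1]]
y = [9.0, 7.0, 3.0, 1.0]
model = LogisticRegressor(X, y, max_value=10)
print(model.coefficients)
print(model.predict([2.5, 1]))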
(continued from LinearRegressor tests...)
>>> df = df.apply('rating', lambda x: 0.1 if x==0 else x)
^ Note: a "lambda" function is a way of writing a function on a single line,
kind of like a list comprehension. The above lambda function just replaces any
values of 0 with 0.1.
>>> regressor = LogisticRegressor(df, prediction_column = 'rating', max_value = 10)
^ Note: the logistic regression model is
prediction_column = max_value / [ 1 + e^( sum of products of inputs and
their corresponding multipliers ) ]
>>> regressor.multipliers
{
'beef': -0.03900793,
'pb': -0.02047944,
'mayo': 1.74825378,
'jelly': -0.39777219,
'beef_pb': 0.14970983,
'beef_mayo': -0.74854916,
'beef_jelly': 0.46821312,
'pb_mayo': 0.32958369,
'pb_jelly': -0.5288267,
'mayo_jelly': 2.64413352,
'constant': 1.01248436
}
>>> regressor.predict({
'beef': 5,
'pb': 5,
'mayo': 1,
'jelly': 1,
})
0.023417479576338825
>>> regressor.predict({
'beef': 0,
'pb': 3,
'mayo': 0,
'jelly': 1,
})
7.375370259327203
>>> regressor.predict({
'beef': 1,
'pb': 1,
'mayo': 1,
'jelly': 0,
})
0.8076522077650409
>>> regressor.predict({
'beef': 6,
'pb': 0,
'mayo': 1,
'jelly': 0,
})
8.770303903540402
Grading: 0.5 points per test (6 tests total) and 3 points for code quality. So, 6 points total.
Locations:
assignment-problems/euler_estimator.py
assignment-problems/test_euler_estimator.py
Create a class EulerEstimator
in assignment-problems/euler_estimator.py
that uses Euler estimation to solve a differential equation. In assignment-problems/test_euler_estimator.py
, assert that the following tests pass:
>>> euler = EulerEstimator(derivative = (lambda x: x+1),
point = (1,4))
>>> euler.point
(1,4)
>>> euler.calc_derivative_at_point()
2 (because the derivative is f'(x)=x+1, so f'(1)=2)
>>> euler.step_forward(0.1)
>>> euler.point
(1.1, 4.2) (because 4 + 0.1*2 = 4.2)
>>> euler.calc_derivative_at_point()
2.1
>>> euler.step_forward(-0.5)
>>> euler.point
(0.6, 3.15) (because 4.2 + -0.5*2.1 = 3.15)
>>> euler.go_to_input(3, step_size = 0.5)
note: you should move to the x-coordinate of 3
using a step_size of 0.5, until the final step,
in which you need to reduce the step_size to hit 3
the following is provided to help you debug:
point, derivative, deltas
(0.6, 3.15), 1.6, (0.5, 0.8)
(1.1, 3.95), 2.1, (0.5, 1.05)
(1.6, 5), 2.6, (0.5, 1.3)
(2.1, 6.3), 3.1, (0.5, 1.55)
(2.6, 7.85), 3.6, (0.4, 1.44)
>>> euler.point
(3, 9.29)
Location: assignment-problems/poisson_distribution.txt
The Poisson distribution can be used to model how many times an event will occur within some continuous interval of time, given that occurrences of an event are independent of one another.
Its probability function is given by \begin{align*} p_\lambda(n) = \dfrac{\lambda^n e^{-\lambda}}{n!}, \quad n \in \left\{ 0, 1, 2, \ldots \right\}, \end{align*}
where $\lambda$ is the mean number of events that occur in the particular time interval.
SUPER IMPORTANT: Manipulating the Poisson distribution involves using infinite sums. However, these sums can be easily expressed using the Maclaurin series for $e^x\mathbin{:}$
\begin{align*} e^x = \sum_{n=0}^\infty \dfrac{x^n}{n!} \end{align*}
PART A (1 point per correct answer with supporting work)
Consider the Poisson distribution defined by $$p_2(n) = \dfrac{2^n e^{-2}}{n!}.$$ Show that this is a valid probability distribution, i.e. all the probability sums to $1.$
Given that $N \sim p_2,$ compute $P(10 < N \leq 12).$ Write your answer in closed form, as simplified as possible (but don't expand out factorials). Pay close attention to the "less than" vs "less than or equal to" symbols.
Given that $N \sim p_2,$ compute $E[N].$ Using the Maclaurin series for $e^x,$ your answer should come out to a nice clean integer.
Given that $N \sim p_2,$ compute $\text{Var}[N].$ Using the Maclaurin series for $e^x,$ your answer should come out to a nice clean integer.
PART B (1 point per correct answer with supporting work)
Consider the general Poisson distribution defined by $$p_\lambda(n) = \dfrac{\lambda^n e^{-\lambda}}{n!}.$$ Show that this is a valid probability distribution, i.e. all the probability sums to $1.$
Given that $N \sim p_\lambda,$ compute $P(10 < N \leq 12).$ Write your answer in closed form, as simplified as possible (but don't expand out factorials).
Given that $N \sim p_\lambda,$ compute $E[N].$ Using the Maclaurin series for $e^x,$ your answer should come out really simple.
Given that $N \sim p_\lambda,$ compute $\text{Var}[N].$ Using the Maclaurin series for $e^x,$ your answer should come out really simple.
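If you want to sanity-check your hand computations for the $\lambda=2$ distribution from Part A, you can truncate the infinite sums numerically (the terms shrink extremely fast, so a modest cutoff is plenty):

from math import exp, factorial

def p(n, lam=2):
    return lam ** n * exp(-lam) / factorial(n)

N_CUTOFF = 50   # terms beyond this are negligibly small for lambda = 2
total = sum(p(n) for n in range(N_CUTOFF))
mean = sum(n * p(n) for n in range(N_CUTOFF))
second_moment = sum(n ** 2 * p(n) for n in range(N_CUTOFF))

print(total)                              # should be very close to 1
print(mean)                               # compare with your E[N]
print(second_moment - mean ** 2)          # compare with your Var[N]
print(sum(p(n) for n in range(11, 13)))   # compare with your P(10 < N <= 12)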
This will be a lighter assignment in case you need to catch up with the DataFrame/LinearRegressor class or the Game.
Locations: assignment-problems/assignment_41_stats.txt
AND assignment-problems/assignment_41_stats.py
Grading: 1 point per part (6 points total)
In the previous assignment, you calculated the distribution of the bias of a coin by collecting some data, computing the likelihood function, and then "normalized" the likelihood function to turn it into a probability distribution.
This process is called Bayesian inference, and the probability distribution that you created using the likelihood function is called the Bayesian posterior distribution.
Later, we will learn more about the "Bayesian" in Bayesian inference, but for now, I want you to become familiar with the process. Now, you will perform another round of Bayesian inference, but this time on a different distribution.
Your friend is randomly stating positive integers that are less than some upper bound (which your friend knows, but you don't know). The numbers your friend states are as follows:
1, 17, 8, 25, 3
You assume that the numbers come from a discrete uniform distribution $U\left\{1,2,\ldots,k\right\}$ defined as follows:
$$p_k(x) = \begin{cases} \dfrac{1}{k} & x \in \left\{1,2,\ldots,k\right\} \\ 0 & x \not\in \left\{1,2,\ldots,k\right\} \end{cases}$$
SHOW YOUR WORK FOR ALL PARTS OF THE QUESTION!
Compute the likelihood function $\mathcal{L}(k | \left\{ 1, 17, 8, 25, 3 \right\} ).$ Remember that the likelihood is just the probability of getting the result $ \left\{ 1, 17, 8, 25, 3 \right\}$ under the assumption that the data was sampled from the distribution $p_k(x).$ Your answer should be a piecewise function expressed in terms of $k.$
"Normalize" the likelihood function by multiplying it by some constant $c$ such that sums to $1.$ That is, find the constant $c$ such that
$$\sum_{k=1}^\infty c \cdot \mathcal{L}(k | \left\{ 1, 17, 8, 25, 3 \right\}) = 1.$$
This way, we can treat $c \cdot \mathcal{L}$ as a probability distribution for $k:$
$$p(k) = c \cdot \mathcal{L}(k | \left\{ 1, 17, 8, 25, 3 \right\})$$
SUPER IMPORTANT: You won't be able to figure this out analytically (i.e. just using pen and paper). Instead, you should write a Python script in assignment-problems/assignment_41_stats.py
to approximate the sum by evaluating it for a very large number of terms. You should use as many terms as you need until the result appears to converge.
What is the most probable value of $k?$ You can tell this just by looking at the distribution $p(k),$ but make sure to justify your answer with an explanation.
The largest number in the dataset is $25.$ What is the probability that $25$ is actually the upper bound chosen by your friend?
What is the probability that the upper bound is less than or equal to $30?$
Fill in the blank: you can be $95\%$ sure that the upper bound is less than $\_\_\_.$
SUPER IMPORTANT: You won't be able to figure this out analytically (i.e. just using pen and paper). Instead, you should write another Python function in assignment-problems/assignment_41_stats.py
to approximate the value of $k$ needed (i.e. the number of terms needed) to have $p(K \leq k) = 0.95.$
SHOW YOUR WORK FOR ALL PARTS OF THE QUESTION!
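Here is one possible shape for the numeric script, in case it helps to see the overall structure. (It spells out the likelihood from part 1, so skip it if you want to work that out yourself first.)

def likelihood(k):
    # L(k | {1, 17, 8, 25, 3}): each stated number has probability 1/k under
    # U{1, ..., k}, and every number must be <= k, so k has to be at least 25.
    return (1 / k) ** 5 if k >= 25 else 0

# Approximate the normalizing constant by summing lots of terms; increase the
# cutoff until the total stops changing.
for cutoff in [1_000, 10_000, 100_000]:
    print(cutoff, sum(likelihood(k) for k in range(1, cutoff + 1)))

c = 1 / sum(likelihood(k) for k in range(1, 100_001))

def p(k):
    return c * likelihood(k)

# Smallest k with P(K <= k) >= 0.95
cumulative, k = 0, 0
while cumulative < 0.95:
    k += 1
    cumulative += p(k)
print(k)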
Grading: 4 points for correct predictions and 4 points for code quality
Locations:
machine-learning/src/linear_regressor.py
machine-learning/tests/test_linear_regressor.py
machine-learning/src/logistic_regressor.py
machine-learning/tests/test_logistic_regressor.py
If you have existing linear and polynomial regressor files and tests, rename them as follows:
machine-learning/src/deprecated_linear_regressor.py
machine-learning/src/deprecated_polynomial_regressor.py
machine-learning/tests/test_deprecated_linear_regressor.py
machine-learning/tests/test_deprecated_polynomial_regressor.py
Now, create a new class LinearRegressor
in machine-learning/src/linear_regressor.py
and assert
that it passes the following tests. Put your tests in tests/test_linear_regressor.py
.
Note: You can use parts of your deprecated LinearRegressor
for inspiration, but I don't want to see any old dead code in this new class.
>>> data_dict = {
'beef': [0, 0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 5, 5, 5, 5],
'pb': [0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5],
'condiments': [[],['mayo'],['jelly'],['mayo','jelly'],
[],['mayo'],['jelly'],['mayo','jelly'],
[],['mayo'],['jelly'],['mayo','jelly'],
[],['mayo'],['jelly'],['mayo','jelly']],
'rating': [1, 1, 4, 0, 4, 8, 1, 0, 5, 0, 9, 0, 0, 0, 0, 0]
}
>>> df = DataFrame(data_dict, column_order = ['beef', 'pb', 'condiments'])
>>> df.columns
['beef', 'pb', 'condiments']
>>> df = df.create_all_dummy_variables()
>>> df = df.append_pairwise_interactions()
>>> df = df.append_columns({
'constant': [1 for _ in range(len(data_dict['rating']))],
'rating': data_dict['rating']
})
>>> df.columns
['beef', 'pb', 'mayo', 'jelly',
'beef_pb', 'beef_mayo', 'beef_jelly', 'pb_mayo', 'pb_jelly', 'mayo_jelly',
'constant', 'rating']
>>> linear_regressor = LinearRegressor(df, prediction_column = 'rating')
>>> linear_regressor.coefficients
{
'beef': 0.25,
'pb': 0.4,
'mayo': -1.25,
'jelly': 1.5,
'beef_pb': -0.21,
'beef_mayo': 1.05,
'beef_jelly': -0.85,
'pb_mayo': -0.65,
'pb_jelly': 0.65,
'mayo_jelly': -3.25,
'constant': 2.1875
}
>>> linear_regressor.gather_all_inputs({
'beef': 5,
'pb': 5,
'mayo': 1,
'jelly': 1,
})
{
'beef': 5,
'pb': 5,
'mayo': 1,
'jelly': 1,
'beef_pb': 25,
'beef_mayo': 5,
'beef_jelly': 5,
'pb_mayo': 5,
'pb_jelly': 5,
'mayo_jelly': 1,
'constant': 1
}
^ Note: this should include all interaction terms AND any constant term that appears in the base dataframe
>>> linear_regressor.predict({
'beef': 5,
'pb': 5,
'mayo': 1,
'jelly': 1,
})
-1.8125
>>> linear_regressor.predict({
'beef': 0,
'pb': 3,
'mayo': 0,
'jelly': 1,
})
6.8375
>>> linear_regressor.predict({
'beef': 1,
'pb': 1,
'mayo': 1,
'jelly': 0,
})
1.7775
>>> linear_regressor.predict({
'beef': 6,
'pb': 0,
'mayo': 1,
'jelly': 0,
})
8.7375
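If you're unsure how the pieces fit together, here is a rough sketch of the prediction half only (not the least-squares solve), written as standalone functions over a coefficients dictionary. The underscore-splitting trick for interaction columns is just one possible convention; it happens to match the column names used in these tests.

# Sketch of gather_all_inputs / predict, assuming coefficients have already
# been solved for and interaction columns are named 'a_b' for base columns a, b.

def gather_all_inputs(coefficients, base_inputs):
    inputs = dict(base_inputs)
    for column in coefficients:
        if column == 'constant':
            inputs['constant'] = 1
        elif column not in inputs and '_' in column:
            a, b = column.split('_')
            inputs[column] = base_inputs[a] * base_inputs[b]
    return inputs

def predict(coefficients, base_inputs):
    inputs = gather_all_inputs(coefficients, base_inputs)
    return sum(coefficients[column] * inputs[column] for column in coefficients)

# e.g. predict(coefficients, {'beef': 5, 'pb': 5, 'mayo': 1, 'jelly': 1})
# should give -1.8125 with the coefficients listed above.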
Locations: assignment-problems/assignment_41_scenarios.txt
Grading: 1 point per part (2 points total)
a. Give a real-world scenario in which something could go drastically wrong if you mistakenly use a logistic regression when the data is actually linear.
b. Give a real-world scenario in which something could go drastically wrong if you mistakenly use a linear regression when the data is actually sigmoid-shaped.
2 points
Get your CombatPlayer
tests working. If they're working already, then congrats, you just got 2 points for free.
I'm making this worth 2 points so that there's an incentive to get this done as soon as possible (rather than just waiting to resubmit the previous assignment at some later date).
Once you get your tests working, remember to resubmit any previous assignments to get that 60% partial credit.
Location: assignment-problems/assignment_40_statistics.txt
Grading: 1 point for a correct answer to each part with all work shown.
Suppose you toss a coin $10$ times and get the result $\text{HHHHT HHHHH}.$ From this result, you estimate that the coin is biased and generally lands on heads $90\%$ of the time. But how sure can you be? Let's quantify it.
SHOW YOUR WORK FOR ALL PARTS OF THE QUESTION!
Compute the likelihood function $\mathcal{L}(k | \text{HHHHT HHHHH})$ where $P(H)=k.$ Remember that the likelihood is just the probability of getting the result $\text{HHHHT HHHHH}$ under the assumption that $P(H)=k.$ Your answer should be expressed in terms of $k.$
"Normalize" the likelihood function by multiplying it by some constant $c$ such that it integrates to $1.$ That is, find the constant $c$ such that $$\int_0^1 c \cdot \mathcal{L}(k | \text{HHHHT HHHHH}) \, \textrm{d}k = 1.$$ This way, we can treat $c \cdot \mathcal{L}$ as a probability distribution for $k:$ $$p(k) = c \cdot \mathcal{L}(k | \text{HHHHT HHHHH})$$
What is the most probable value of $k?$ In other words, what is the value of $k$ at which $p(k)$ reaches a maximum? USE THE FIRST OR SECOND DERIVATIVE TEST.
What is the probability that the coin is biased at all? In other words, what is $P(k > 0.5)?$
What is the probability that the coin is biased with $0.85 < P(H) < 0.95?$
Fill in the blank: you can be $95\%$ sure that $P(H)$ is at least $\_\_\_.$
WAIT! DID YOU SHOW YOUR WORK FOR ALL PARTS OF THE QUESTION?
Grading: 1 point for each working test (4 tests total), 4 points for code quality
In your DataFrame
, modify your create_dummy_variables()
method so that it also works when a column's entries are themselves lists of categorical entries.
Assert that your DataFrame
passes the following tests. Make sure to put your tests in machine-learning/tests/test_dataframe.py
.
>>> data_dict = {
'beef': [0, 0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 5, 5, 5, 5],
'pb': [0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5],
'condiments': [[],['mayo'],['jelly'],['mayo','jelly'],
[],['mayo'],['jelly'],['mayo','jelly'],
[],['mayo'],['jelly'],['mayo','jelly'],
[],['mayo'],['jelly'],['mayo','jelly']],
'rating': [1, 1, 4, 0, 4, 8, 1, 0, 5, 0, 9, 0, 0, 0, 0, 0]
}
>>> df = DataFrame(data_dict, column_order = ['beef', 'pb', 'condiments'])
>>> df.columns
['beef', 'pb', 'condiments']
>>> df.to_array()
[[ 0, 0, []],
[ 0, 0, ['mayo']],
[ 0, 0, ['jelly']],
[ 0, 0, ['mayo','jelly']],
[ 5, 0, []],
[ 5, 0, ['mayo']],
[ 5, 0, ['jelly']],
[ 5, 0, ['mayo','jelly']],
[ 0, 5, []],
[ 0, 5, ['mayo']],
[ 0, 5, ['jelly']],
[ 0, 5, ['mayo','jelly']],
[ 5, 5, []],
[ 5, 5, ['mayo']],
[ 5, 5, ['jelly']],
[ 5, 5, ['mayo','jelly']]]
>>> df = df.create_dummy_variables()
>>> df.columns
['beef', 'pb', 'mayo', 'jelly']
>>> df.to_array()
[[ 0, 0, 0, 0],
[ 0, 0, 1, 0],
[ 0, 0, 0, 1],
[ 0, 0, 1, 1],
[ 5, 0, 0, 0],
[ 5, 0, 1, 0],
[ 5, 0, 0, 1],
[ 5, 0, 1, 1],
[ 0, 5, 0, 0],
[ 0, 5, 1, 0],
[ 0, 5, 0, 1],
[ 0, 5, 1, 1],
[ 5, 5, 0, 0],
[ 5, 5, 1, 0],
[ 5, 5, 0, 1],
[ 5, 5, 1, 1]]
>>> df = df.append_pairwise_interactions()
>>> df.columns
['beef', 'pb', 'mayo', 'jelly',
'beef_pb', 'beef_mayo', 'beef_jelly',
'pb_mayo', 'pb_jelly',
'mayo_jelly']
>>> df.to_array()
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[ 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
[ 5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 5, 0, 1, 0, 0, 5, 0, 0, 0, 0],
[ 5, 0, 0, 1, 0, 0, 5, 0, 0, 0],
[ 5, 0, 1, 1, 0, 5, 5, 0, 0, 1],
[ 0, 5, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 5, 1, 0, 0, 0, 0, 5, 0, 0],
[ 0, 5, 0, 1, 0, 0, 0, 0, 5, 0],
[ 0, 5, 1, 1, 0, 0, 0, 5, 5, 1],
[ 5, 5, 0, 0, 25, 0, 0, 0, 0, 0],
[ 5, 5, 1, 0, 25, 5, 0, 5, 0, 0],
[ 5, 5, 0, 1, 25, 0, 5, 0, 5, 0],
[ 5, 5, 1, 1, 25, 5, 5, 5, 5, 1]]
>>> df.append_columns({
'constant': [1 for _ in range(len(data_dict['rating']))],
'rating': data_dict['rating']
}, column_order = ['constant', 'rating'])
>>> df.columns
['beef', 'pb', 'mayo', 'jelly',
'beef_pb', 'beef_mayo', 'beef_jelly',
'pb_mayo', 'pb_jelly',
'mayo_jelly',
'constant', 'rating']
>>> df.to_array()
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
[ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
[ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 4],
[ 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
[ 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4],
[ 5, 0, 1, 0, 0, 5, 0, 0, 0, 0, 1, 8],
[ 5, 0, 0, 1, 0, 0, 5, 0, 0, 0, 1, 1],
[ 5, 0, 1, 1, 0, 5, 5, 0, 0, 1, 1, 0],
[ 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 1, 5],
[ 0, 5, 1, 0, 0, 0, 0, 5, 0, 0, 1, 0],
[ 0, 5, 0, 1, 0, 0, 0, 0, 5, 0, 1, 9],
[ 0, 5, 1, 1, 0, 0, 0, 5, 5, 1, 1, 0],
[ 5, 5, 0, 0, 25, 0, 0, 0, 0, 0, 1, 0],
[ 5, 5, 1, 0, 25, 5, 0, 5, 0, 0, 1, 0],
[ 5, 5, 0, 1, 25, 0, 5, 0, 5, 0, 1, 0],
[ 5, 5, 1, 1, 25, 5, 5, 5, 5, 1, 1, 0]]
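Here is a rough sketch of one way the list-valued case could be handled inside create_dummy_variables. Only the new branch is shown, and the names are stand-ins for however your DataFrame actually stores its data.

# Sketch: expand a column whose entries are lists of categories into 0/1 columns.
# self.data_dict maps column name -> list of entries (illustrative names).

def create_dummy_variables(self):
    new_dict = {}
    for column, entries in self.data_dict.items():
        if all(isinstance(entry, list) for entry in entries):
            # collect every category that appears, preserving first-seen order
            categories = []
            for entry in entries:
                for category in entry:
                    if category not in categories:
                        categories.append(category)
            for category in categories:
                new_dict[category] = [1 if category in entry else 0
                                      for entry in entries]
        else:
            new_dict[column] = entries   # plus your existing single-category handling
    return DataFrame(new_dict, column_order=list(new_dict.keys()))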
2 points
Get your CombatPlayer
tests working. If they're working already, then congrats, you just got 2 points for free.
I'm making this worth 2 points so that there's an incentive to get this done as soon as possible (rather than just waiting to resubmit the previous assignment at some later date).
Once you get your tests working, remember to resubmit any previous assignments to get that 60% partial credit.
Grading: 12 points total = 6 points for code quality, 1 point for passing each test (there are 6 tests).
Update your DataFrame
to implement the following functionality.
Put all of your DataFrame
tests as assertions in a file machine-learning/tests/test_dataframe.py
.
>>> data_dict = {
'id': [1, 2, 3, 4],
'color': ['blue', 'yellow', 'green', 'yellow']
}
>>> df1 = DataFrame(data_dict, column_order = ['id', 'color'])
>>> df2 = df1.create_dummy_variables()
>>> df2.columns
['id', 'color_blue', 'color_yellow', 'color_green']
>>> df2.to_array()
[[1, 1, 0, 0]
[2, 0, 1, 0]
[3, 0, 0, 1]
[4, 0, 1, 0]]
>>> df3 = df2.remove_columns(['id', 'color_yellow'])
>>> df3.columns
['color_blue', 'color_green']
>>> df3.to_array()
[[1, 0]
[0, 0]
[0, 1]
[0, 0]]
>>> df4 = df3.append_columns({
'name': ['Anna', 'Bill', 'Cayden', 'Daphnie'],
'letter': ['a', 'b', 'c', 'd']
})
>>> df4.columns
['color_blue', 'color_green', 'name', 'letter']
>>> df4.to_array()
[[1, 0, 'Anna', 'a']
[0, 0, 'Bill', 'b']
[0, 1, 'Cayden', 'c']
[0, 0, 'Daphnie', 'd']]
Grading: 12 points total = 6 points for code quality, 0.5 points for passing each test (there are 12 tests).
In graph/src/directed_graph.py
, create a class DirectedGraph
that implements a directed graph. (Yes, you have to name it with the full name.)
In a directed graph, nodes have parents and children instead of just "neighbors".
For example, a Tree
is a special case of a DirectedGraph
.
Important: do NOT try to inherit from Graph
.
Implement the following tests as assertions in graph/tests/test_directed_graph.py
.
>>> edges = [(0,1),(1,2),(3,1),(4,3),(1,4),(4,5),(3,6)]
Note: the edges are in the form (parent,child)
>>> directed_graph = DirectedGraph(edges)
Note: if no vertices are passed, just assume that vertices = node index list, i.e.
vertices = [0, 1, 2, 3, 4, 5, 6]
at this point, the directed graph looks like this:
0-->1-->2
    ^ \
    |  v
6<--3<--4-->5
>>> [[child.index for child in node.children] for node in directed_graph.nodes]
[[1], [2,4], [], [1,6], [3,5], [], []]
>>> [[parent.index for parent in node.parents] for node in directed_graph.nodes]
[[], [0,3], [1], [4], [1], [4], [3]]
>>> directed_graph.calc_distance(0,3)
3
>>> directed_graph.calc_distance(3,5)
3
>>> directed_graph.calc_distance(0,5)
3
>>> directed_graph.calc_distance(4,1)
2
>>> directed_graph.calc_distance(2,4)
False
>>> directed_graph.calc_shortest_path(0,3)
[0, 1, 4, 3]
>>> directed_graph.calc_shortest_path(3,5)
[3, 1, 4, 5]
>>> directed_graph.calc_shortest_path(0,5)
[0, 1, 4, 5]
>>> directed_graph.calc_shortest_path(4,1)
[4, 3, 1]
>>> directed_graph.calc_shortest_path(2,4)
False
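A minimal sketch of the BFS idea for calc_distance, assuming each node object stores its children and its index as in the tests above; calc_shortest_path is the same traversal plus the previous-pointer bookkeeping described in the calc_shortest_path assignment further down.

# Sketch of BFS distance on a directed graph: only follow parent -> child edges.
from collections import deque

def calc_distance(self, i, j):
    queue = deque([(self.nodes[i], 0)])
    visited = {i}
    while queue:
        node, distance = queue.popleft()
        if node.index == j:
            return distance
        for child in node.children:
            if child.index not in visited:
                visited.add(child.index)
                queue.append((child, distance + 1))
    return False   # j is not reachable from i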
Location: Put your answers in assignment-problems/assignment_39_probability.txt
SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK
Part A (For 1-4 you get 1 point per correct answer with supporting work)
Suppose that you take a bus to work every day. Bus A arrives at 8am but is $x$ minutes late with $x \sim U(0,15).$ Bus B arrives at 8:05am but is $x$ minutes late with $x \sim U(0,10).$ The bus ride is 20 minutes and you need to arrive at work by 8:30.
Tip: Remember that $U(a,b)$ means the uniform distribution on $[a,b].$
If you take bus A, what time do you expect to arrive at work?
If you take bus B, what time do you expect to arrive at work?
If you take bus A, what is the probability that you will arrive on time to work?
If you take bus B, what is the probability that you will arrive on time to work?
SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK
Part B (For 1-4 you get 1 point per correct answer with supporting work; for 5 you get 2 points.)
Continuing the scenario above, there is a third option that you can use to get to work: you can jump into a wormhole and (usually) come out almost instantly at the other side. The only issue is that time runs differently inside the wormhole, and while you're probably going to arrive at the other end very quickly, there's a small chance that you could get stuck in there for a really long time.
The number of seconds it takes you to come out the other end of the wormhole follows an exponential distribution with parameter $\lambda=10,$ meaning that $x \sim p(x)$ where $$p(x) = \begin{cases} 10e^{-10x} & \text{ if } x > 0 \\ 0 &\text{ otherwise.} \end{cases}$$
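As a reminder of the main computation involved (you still need to show the integration yourself): for an exponential density $p(x) = \lambda e^{-\lambda x}$ on $x > 0,$ the probability of waiting longer than $t$ is $$P(X > t) = \int_t^\infty \lambda e^{-\lambda x} \, \mathrm{d}x = e^{-\lambda t}.$$ Since $\lambda = 10$ here is a rate per second, convert any times (minutes, days, weeks) into seconds before plugging in.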
How long do you expect it to take you to come out of the wormhole?
What's the probability of taking longer than a minute to come out of the wormhole?
Fill in the blank: the probability of coming out of the wormhole within ___ seconds is $99.999\%.$
What is the probability that it would take you longer than a day to come out of the wormhole? Give the exact answer, not a decimal approximation.
Your friend says that you shouldn't use the wormhole because there's always a chance that you might get stuck in it for over a week, and if you use the wormhole twice per day, then that'll probably happen sometime within your lifetime. Is this a reasonable fear? Why or why not? Justify your answer using probability.
SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK - SHOW YOUR WORK
4 points
Get your CombatPlayer
tests working from the previous assignment. If they're working already, then congrats, you just got 4 points for free.
I'm making this worth 4 points so that there's an incentive to get this done as soon as possible (rather than just waiting to resubmit the previous assignment at some later date).
Once you get your tests working, remember to resubmit the previous assignment to get that 60% partial credit.
Grading: 2.5 points for code quality, 2.5 points for passing tests (0.5 points each).
Rename your method distance(i,j)
to calc_distance(i,j)
.
Implement a method calc_shortest_path(i,j)
in your Graph
class that uses breadth-first search to compute a shortest path from the node at index i
to the node at index j
.
You can use an approach similar to a breadth-first traversal starting from the node at index i
. The only additional thing you need to do is have a list previous
which will store the previous node index for every node visited. Then, once the breadth-first search has completed, you can use previous
to generate the shortest path.
In graph/tests/test_graph.py
, assert that your method passes the following tests:
>>> edges = [(0,1),(1,2),(1,3),(3,4),(1,4),(4,5)]
>>> graph = Graph(edges)
at this point, the graph looks like this:
0 -- 1 -- 2
     | \
     3--4 -- 5
>>> graph.calc_shortest_path(0,4)
[0, 1, 4]
>>> graph.calc_shortest_path(5,2)
[5, 4, 1, 2]
>>> graph.calc_shortest_path(0,5)
[0, 1, 4, 5]
>>> graph.calc_shortest_path(4,1)
[4, 1]
>>> graph.calc_shortest_path(3,3)
[3]
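For reference, here is a bare-bones sketch of the previous-pointer approach described above. It assumes each node has an index attribute and a neighbors list, as in the earlier Graph assignments; adapt the names to your own class.

# Sketch: BFS from i, recording the index each node was reached from,
# then walking the previous pointers back from j.
from collections import deque

def calc_shortest_path(self, i, j):
    previous = {i: None}          # node index -> index it was reached from
    queue = deque([i])
    while queue:
        current = queue.popleft()
        if current == j:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]     # reverse: the walk goes j -> ... -> i
        for neighbor in self.nodes[current].neighbors:
            if neighbor.index not in previous:
                previous[neighbor.index] = current
                queue.append(neighbor.index)
    return False                  # no path from i to j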
Grading: 2 points for code quality, 2 points for passing tests.
Update your DataFrame
to implement the following functionality.
Put all of your DataFrame
tests as assertions in a file machine-learning/tests/test_dataframe.py
, including the tests you wrote on the previous assignment.
>>> data_dict = {
'Pete': [1, 0, 1, 0],
'John': [2, 1, 0, 2],
'Sarah': [3, 1, 4, 0]
}
>>> df1 = DataFrame(data_dict, column_order = ['Pete', 'John', 'Sarah'])
>>> df2 = df1.append_pairwise_interactions()
>>> df2.columns
['Pete', 'John', 'Sarah', 'Pete_John', 'Pete_Sarah', 'John_Sarah']
>>> df2.to_array()
[[1, 2, 3, 2, 3, 6]
[0, 1, 1, 0, 0, 1]
[1, 0, 4, 0, 4, 0]
[0, 2, 0, 0, 0, 0]]
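A sketch of one way append_pairwise_interactions could work (names illustrative): for every pair of existing columns, append a column of elementwise products named with an underscore between the two column names.

# Sketch: append a product column 'a_b' for every pair of existing columns.
def append_pairwise_interactions(self):
    new_dict = dict(self.data_dict)
    new_columns = list(self.columns)
    for position, a in enumerate(self.columns):
        for b in self.columns[position + 1:]:
            name = a + '_' + b
            new_dict[name] = [x * y for x, y in zip(self.data_dict[a],
                                                    self.data_dict[b])]
            new_columns.append(name)
    return DataFrame(new_dict, column_order=new_columns)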
Location: assignment-problems/uniform_distribution.txt
Grading: 1 point per correct answer with supporting work
This problem deals with integrals, so you need to show your work by hand (i.e. not using code). You can just type out words in place of symbols, as usual.
A uniform distribution on the interval $[a,b]$ is a probability distribution $p(x)$ that takes the following form for some constant $k\mathbin{:}$
$$p(x) = \begin{cases} k & x \in [a,b] \\ 0 & x \not\in [a,b] \end{cases}$$

a) Find the value of $k$ such that $p(x)$ is a valid probability distribution. Your answer should be in terms of $a$ and $b.$
b) Given that $X \sim p,$ compute the cumulative distribution $P(X \leq x).$ Your answer should be a piecewise function:
$$P(X \leq x) = \begin{cases} \_\_\_ &\text{ if } x < a \\ \_\_\_ &\text{ if } a \leq x \leq b \\ \_\_\_ &\text{ if } b < x \end{cases}$$

c) Given that $X \sim p,$ compute $E[X].$

d) Given that $X \sim p,$ compute $\text{Var}[X].$
Location: machine-learning/analysis/regression_by_hand.txt
Grading: each part is worth 5 points.
Suppose you are given the following dataset:
data = [(1,0.2), (2,0.25), (3,0.5)]
a) Fit a linear regression model $y=a+bx$ by hand. Show all of your steps in a text file. No code allowed!
b) Fit a logistic regression model $y=\dfrac{1}{1+e^{a+bx} }$ by hand. Show all of your steps in a text file. No code allowed!
Grading: 5 points for code quality, 5 points for passing tests.
Update your DataFrame
to implement the following functionality.
Put all of your DataFrame
tests as assertions in a file machine-learning/tests/test_dataframe.py
, including the tests you wrote on the previous assignment.
>>> data_dict = {
'Pete': [1, 0, 1, 0],
'John': [2, 1, 0, 2],
'Sarah': [3, 1, 4, 0]
}
>>> df1 = DataFrame(data_dict, column_order = ['Pete', 'John', 'Sarah'])
>>> df1.data_dict
{
'Pete': [1, 0, 1, 0],
'John': [2, 1, 0, 2],
'Sarah': [3, 1, 4, 0]
}
>>> df1.to_array()
[[1, 2, 3]
[0, 1, 1]
[1, 0, 4]
[0, 2, 0]]
>>> def multiply_by_4(x):
return 4*x
>>> df2 = df1.apply('John', multiply_by_4)
>>> df2.to_array()
[[1, 8, 3]
[0, 4, 1]
[1, 0, 4]
[0, 8, 0]]
Grading: 5 points for code quality, 5 points for passing tests.
Implement a method find_distance(i,j)
in your Graph
class that uses breadth-first search to compute the number of steps that must be taken to get from the node at index i
to the node at index j
.
You should do a breadth-first traversal starting from the node at index i
, and count how many generations of children you have to go through until you reach the node with index j
.
>>> edges = [(0,1),(1,2),(1,3),(3,4),(1,4),(4,5)]
>>> graph = Graph(edges)
at this point, the graph looks like this:
0 -- 1 -- 2
     | \
     3--4 -- 5
>>> graph.distance(0,4)
2
>>> graph.distance(5,2)
3
>>> graph.distance(0,5)
3
>>> graph.distance(4,1)
1
>>> graph.distance(3,3)
0
There is a correction we need to make to our combat player tests. The home base colony, aka the "home world", generates 20 CP per round. Previously, we had been using a different amount (which applies to non-home-world colonies).
I've updated the combat player tests for ascending rolls and descending rolls. Now it's your job to update your combat player tests and make sure they pass. You just need to set your home colony to generate 20 CP per round, and then update your tests with those that I've given below. Provided your tests were working before, this should be a pretty quick problem.
You get 2 points for passing each test. So there are 2 x 6 = 12 points total. Let's get this out of the way, finally, so that we can build our first complete strategy player over the weekend!
Extra credit may be possible on this problem. Hopefully, these tests are completely correct. But as usual with Space Empires tests, there's a chance that I may have made a mistake somewhere. If you find an error that leads to a mistake in a test, you can get a "bounty point" for bringing it to my attention and explaining where the underlying error is. You can earn up to a maximum of 6 bounty points (1 per test).
Put both the ascending rolls test and the descending rolls test in space-empires/tests/test_combat_player.py
.
Ascending Rolls
At the end of Turn 1 economic phase, Player 1 has 7 CP and Player 2 has 1 CP
At the end of Turn 2 movement phase, Player 1 has 3 scouts at (2,2), while Player 2 has 1 destroyer at (2,2).
At the end of Turn 2 combat phase, Player 1 has 1 scout at (2,2), while Player 2 has no ships (other than its colony/shipyards at its home planet).
Descending Rolls
At the end of Turn 1 economic phase, Player 1 has 1 CP and Player 2 has 7 CP
At the end of Turn 2 movement phase, Player 1 has 1 destroyer at (2,2), while Player 2 has 3 scouts at (2,2).
At the end of Turn 2 combat phase, Player 2 has 3 scouts at (2,2), while Player 1 has no ships (other than its colony/shipyards at its home planet).
Ascending Rolls
STARTING CONDITIONS
Players 1 and 2
are CombatTestingPlayers
start with 0 CPs
have an initial fleet of 3 scouts, 3 colony ships, 4 ship yards
---
TURN 1 - MOVEMENT PHASE
Player 1:
Scouts 1,2,3: (2,0) --> (2,2)
Colony Ships 4,5,6: (2,0) --> (2,2)
Player 2:
Scouts 1,2,3: (2,4) --> (2,2)
Colony Ships 4,5,6: (2,4) --> (2,2)
COMBAT PHASE
Colony Ships are removed
| PLAYER | SHIP | HEALTH |
----------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 1
Attacker: Player 1 Scout 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 1
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 2
Attacker: Player 1 Scout 2
Defender: Player 2 Scout 2
Largest Roll to Hit: 3
Die Roll: 2
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 3 | 1 |
Attack 3
Attacker: Player 1 Scout 3
Defender: Player 2 Scout 3
Largest Roll to Hit: 3
Dice Roll: 3
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
Combat phase complete
------------------------
TURN 1 - ECONOMIC PHASE
Player 1
INCOME/MAINTENANCE (starting CP: 0)
colony income: +20 CP from home colony
maintenance costs: -1 CP/Scout x 3 Scouts = -3 CP
PURCHASES (starting CP: 17)
ship size technology 2: -10 CP
REMAINING CP: 7
Player 2
INCOME/MAINTENANCE (starting CP: 0)
colony income: +20 CP from home colony
maintenance costs: 0
PURCHASES (starting CP: 20)
ship size technology 2: -10 CP
destroyer: -9 CP
REMAINING CP: 1
------------------------
TURN 2 - MOVEMENT PHASE
Player 1:
Scouts 1,2,3: stay at (2,2)
Player 2:
Destroyer 1: (2,4) --> (2,2)
------------------------
TURN 2 - COMBAT PHASE
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 2 | Destroyer 1 | 1 |
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
Attack 1
Attacker: Player 2 Destroyer 1
Defender: Player 1 Scout 1
Largest Roll to Hit: 4
Dice Roll: 4
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 2 | Destroyer 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
Attack 2
Attacker: Player 1 Scout 2
Defender: Player 2 Destroyer 1
Largest Roll to Hit: 3
Dice Roll: 5
Hit or Miss: Miss
Attack 3
Attacker: Player 1 Scout 3
Defender: Player 2 Destroyer 1
Largest Roll to Hit: 3
Dice Roll: 6
Hit or Miss: Miss
Attack 4
Attacker: Player 2 Destroyer 1
Defender: Player 1 Scout 2
Largest Roll to Hit: 4
Dice Roll: 1
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 2 | Destroyer 1 | 1 |
| 1 | Scout 3 | 1 |
Attack 5
Attacker: Player 1 Scout 3
Defender: Player 2 Destroyer 1
Largest Roll to Hit: 3
Dice Roll: 2
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 3 | 1 |
combat phase completed
Descending Rolls
STARTING CONDITIONS
Players 1 and 2
are CombatTestingPlayers
start with 0 CPs
have an initial fleet of 3 scouts, 3 colony ships, 4 ship yards
---
TURN 1 - MOVEMENT PHASE
Player 1:
Scouts 1,2,3: (2,0) --> (2,2)
Colony Ships 4,5,6: (2,0) --> (2,2)
Player 2:
Scouts 1,2,3: (2,4) --> (2,2)
Colony Ships 4,5,6: (2,4) --> (2,2)
COMBAT PHASE
Colony Ships are removed
| PLAYER | SHIP | HEALTH |
----------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 1
Attacker: Player 1 Scout 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 6
Hit or Miss: Miss
Attack 2
Attacker: Player 1 Scout 2
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 5
Hit or Miss: Miss
Attack 3
Attacker: Player 1 Scout 3
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 4
Hit or Miss: Miss
Attack 4
Attacker: Player 2 Scout 1
Defender: Player 1 Scout 1
Largest Roll to Hit: 3
Die Roll: 3
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
----------------------------------------
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 5
Attacker: Player 2 Scout 2
Defender: Player 1 Scout 2
Largest Roll to Hit: 3
Die Roll: 2
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
----------------------------------------
| 1 | Scout 3 | 1 |
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 6
Attacker: Player 2 Scout 3
Defender: Player 1 Scout 3
Largest Roll to Hit: 3
Die Roll: 1
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
----------------------------------------
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Combat phase complete
------------------------
TURN 1 - ECONOMIC PHASE
Player 1
INCOME/MAINTENANCE (starting CP: 0)
colony income: +20 CP from home colony
maintenance costs: 0
PURCHASES (starting CP: 20)
ship size technology 2: -10 CP
destroyer: -9 CP
REMAINING CP: 1
Player 2
INCOME/MAINTENANCE (starting CP: 0)
colony income: +20 CP from home colony
maintenance costs: -1 CP/Scout x 3 Scouts = -3 CP
PURCHASES (starting CP: 17)
ship size technology 2: -10 CP
REMAINING CP: 7
------------------------
TURN 2 - MOVEMENT PHASE
Player 1:
Destroyer 1 : (2,0) --> (2,2)
Player 2:
Scouts 1,2,3: stay at (2,2)
------------------------
TURN 2 - COMBAT PHASE
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Destroyer 1 | 1 |
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 1
Attacker: Player 1 Destroyer 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 4
Dice Roll: 6
Hit or Miss: Miss
Attack 2
Attacker: Player 2 Scout 1
Defender: Player 1 Destroyer 1
Largest Roll to Hit: 3
Dice Roll: 5
Hit or Miss: Miss
Attack 3
Attacker: Player 2 Scout 2
Defender: Player 1 Destroyer 1
Largest Roll to Hit: 3
Dice Roll: 4
Hit or Miss: Miss
Attack 4
Attacker: Player 2 Scout 3
Defender: Player 1 Destroyer 1
Largest Roll to Hit: 3
Dice Roll: 3
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
combat phase completed
Location: assignment-problems/exponential_distribution.txt
Continuous distributions are defined similarly to discrete distributions. There are only 2 big differences:
We use an integral to compute expectation: if $X \sim p,$ then $$E[X] = \int_{-\infty}^\infty x \, p(x) \, \mathrm{d}x.$$
We talk about probability on an interval rather than at a point: if $X \sim p,$ then $$P(a < X \leq b) = \int_a^b p(x) \, \mathrm{d}x$$
Note the following definitions:
This problem deals with integrals, so you need to show your work by hand (i.e. not using code). You can just type out words in place of symbols, e.g.
integral of e^(-5x) from x=0 to x=1
= -1/5 e^(-5x) from x=0 to x=1
= -1/5 e^(-5) - (-1/5)
= [ 1 - e^(-5) ] / 5
PART A (1 point per correct answer with supporting work)
Consider the exponential distribution defined by $$p_2(x) = \begin{cases} 2 e^{-2 x} & x \geq 0 \\ 0 & x < 0 \end{cases}.$$ Using integration, show that this is a valid distribution, i.e. all the probability integrates to $1.$
Given that $X \sim p_2,$ compute $P(0 < X \leq 1).$
Given that $X \sim p_2,$ compute the cumulative distribution function $P(X \leq x).$
Given that $X \sim p_2,$ compute $E[X].$
Given that $X \sim p_2,$ compute $\text{Var}[X].$
PART B (1 point per correct answer with supporting work)
Consider the general exponential distribution defined by $p_\lambda(x) = \lambda e^{-\lambda x}$ for $x \geq 0$ (and $0$ otherwise). Using integration, show that this is a valid distribution, i.e. all the probability integrates to $1.$
Given that $X \sim p_\lambda,$ compute $P(0 < X < 1).$ Your answer should be general in terms of $\lambda,$ and it should match your answer from Part A if you set $\lambda = 2.$
Given that $X \sim p_\lambda,$ compute the cumulative distribution function $P(X < x).$ Your answer should be general in terms of $\lambda$ and $x,$ and it should match your answer from Part A if you set $\lambda = 2.$
Given that $X \sim p_\lambda,$ compute $E[X].$ Your answer should be general in terms of $\lambda,$ and it should match your answer from Part A if you set $\lambda = 2.$
Given that $X \sim p_\lambda,$ compute $\text{Var}[X].$ Your answer should be general in terms of $\lambda,$ and it should match your answer from Part A if you set $\lambda = 2.$
Location: machine-learning/src/dataframe.py
Grading: 5 points for code quality, 5 points for passing tests.
Create a class DataFrame
that implements the following tests:
>>> data_dict = {
'Pete': [1, 0, 1, 0],
'John': [2, 1, 0, 2],
'Sarah': [3, 1, 4, 0]
}
>>> df1 = DataFrame(data_dict, column_order = ['Pete', 'John', 'Sarah'])
>>> df1.data_dict
{
'Pete': [1, 0, 1, 0],
'John': [2, 1, 0, 2],
'Sarah': [3, 1, 4, 0]
}
>>> df1.to_array()
[[1, 2, 3]
[0, 1, 1]
[1, 0, 4]
[0, 2, 0]]
>>> df1.columns
['Pete', 'John', 'Sarah']
>>> df2 = df1.filter_columns(['Sarah', 'Pete'])
>>> df2.to_array()
[[3, 1],
[1, 0],
[4, 1],
[0, 0]]
>>> df2.columns
['Sarah', 'Pete']
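To give a sense of scale, here is a bare-bones sketch of a class that would satisfy these particular tests (your real DataFrame will grow well beyond this as later assignments add methods):

# Bare-bones sketch of the DataFrame behavior exercised by the tests above.
class DataFrame:
    def __init__(self, data_dict, column_order):
        self.data_dict = data_dict
        self.columns = column_order

    def to_array(self):
        # one row per record, with columns taken in self.columns order
        num_rows = len(self.data_dict[self.columns[0]])
        return [[self.data_dict[column][row] for column in self.columns]
                for row in range(num_rows)]

    def filter_columns(self, columns):
        filtered = {column: self.data_dict[column] for column in columns}
        return DataFrame(filtered, column_order=columns)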
Location: assignment-problems/assignment_35_statistics.py
To say that a random variable $X$ follows a probability distribution $p(x)$ is to say that $P(X=x) = p(x).$ Symbolically, we write $X \sim p.$
The expected value (also known as the mean) of a random variable $X \sim p$ is defined as the weighted sum of possible values, where the weights are given by the probability.
In other words, $E[X] = \sum x p(x).$
It is common to denote $E[X]$ by $\bar{X}$ or $\left< X \right>.$
The variance of a random variable is the expected squared deviation from the mean.
The standard deviation of a random variable is the square root of the variance.
Warning: No points will be given if you don't show your work.
Part A (1 point per correct answer with supporting work)
Write the probability distribution $p_{4}(x)$ for getting $x$ heads on $4$ coin flips, where the coin is a fair coin (i.e. it lands on heads with probability $0.5$).
Let $X$ be the number of heads in $4$ coin flips. Then $X \sim p_{4}.$ Intuitively, what is the expected value of $X$? Explain the reasoning behind your intuition.
Compute the expected value of $X,$ using the definition $E[X] = \sum x \, p(x).$ The answer you get should match your answer from (b).
Compute the variance of $X,$ using the definition $\text{Var}[X] = E[(X-\bar{X})^2].$
Compute the standard deviation of $X,$ using the definition $\text{Std}[X] = \sqrt{ \text{Var}[X] }.$
Part B (1 point per correct answer with supporting work)
Write the probability distribution $p_{4,\lambda}(y)$ for getting $y$ heads on $4$ coin flips, where the coin is a biased coin that lands on heads with probability $\lambda.$
Let $Y$ be the number of heads in $4$ coin flips. Then $Y \sim p_{4,\lambda}.$ Intuitively, what is the expected value of $Y$? Your answer should be in terms of $\lambda.$ Explain the reasoning behind your intuition.
Compute the expected value of $Y,$ using the definition $E[Y] = \sum y \, p(y).$ The answer you get should match your answer from (b).
Compute the variance of $Y,$ using the definition $\text{Var}[Y] = E[(Y-\bar{Y})^2].$
Compute the standard deviation of $Y,$ using the definition $\text{Std}[Y] = \sqrt{ \text{Var}[Y] }.$
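Since the location is a .py file, you may want to sanity-check your hand computations numerically once you're done. Here is a small sketch for the fair-coin case in Part A; it uses the standard binomial form of the distribution, which you still need to derive and justify yourself.

# Numeric sanity check for Part A (fair coin, 4 flips).
from math import comb, sqrt

p4 = {x: comb(4, x) * 0.5**4 for x in range(5)}      # P(X = x heads)

mean = sum(x * p for x, p in p4.items())
variance = sum((x - mean)**2 * p for x, p in p4.items())

print('E[X]   =', mean)
print('Var[X] =', variance)
print('Std[X] =', sqrt(variance))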
Location: machine-learning/analysis/assignment_35.py
Recall the following sandwich dataset:
Slices Beef | Tbsp Peanut Butter | Condiments | Rating |
--------------------------------------------------------------
0 | 0 | - | 1 |
0 | 0 | mayo | 1 |
0 | 0 | jelly | 4 |
0 | 0 | mayo, jelly | 0 |
5 | 0 | - | 4 |
5 | 0 | mayo | 8 |
5 | 0 | jelly | 1 |
5 | 0 | mayo, jelly | 0 |
0 | 5 | - | 5 |
0 | 5 | mayo | 0 |
0 | 5 | jelly | 9 |
0 | 5 | mayo, jelly | 0 |
5 | 5 | - | 0 |
5 | 5 | mayo | 0 |
5 | 5 | jelly | 0 |
5 | 5 | mayo, jelly | 0 |
In Assignment 29-1, you transformed Condiments
into dummy variables and fit a linear model to the data.
rating = beta_0
+ beta_1 ( slices beef ) + beta_2 ( tbsp pb ) + beta_3 ( mayo ) + beta_4 ( jelly )
+ beta_5 ( slices beef ) ( tbsp pb ) + beta_6 ( slices beef ) ( mayo ) + beta_7 ( slices beef ) ( jelly )
+ beta_8 ( tbsp pb ) ( mayo ) + beta_9 ( tbsp pb ) ( jelly )
+ beta_10 ( mayo ) ( jelly )
The linear model captured the overall idea of the data, but it often gave weird predictions like negative ratings for a really bad sandwich.
YOUR TASK
This time, you will fit a logistic model to the data. This will squeeze the predicted ratings into the interval between $0$ and $10.$
Normally, we use a logistic curve of the form $y=\dfrac{1}{1+e^{ \sum \beta_i x_i}}$ to model probability. But here, we are modeling ratings on a scale from $0$ to $10$ instead of probability on a scale of $0$ to $1.$ So we just need to change the numerator accordingly:
$$y=\dfrac{10}{1+e^{\sum \beta_i x_i}}$$

Note that any ratings of 0 will cause an issue when fitting the model. So, change them to 0.1.
With your model, assert
that your predictions are within $0.25$ of the following correct predictions:
PREDICTED RATINGS
no ingredients: 2.66
mayo only: 0.59
mayo and jelly: 0.07
5 slices beef + mayo: 7.64
5 tbsp pb + jelly: 8.94
5 slices beef + 5 tbsp pb + mayo + jelly: 0.02
You get 1 point for each matching prediction, and 4 points for code quality.
Because you guys have been good about participating in #help this week... I will be nice and give you the final transformed dataset so you don't have to worry about data entry and processing. Here it is:
columns = ['beef', 'pb', 'mayo', 'jelly', 'rating',
'beef_pb', 'beef_mayo', 'beef_jelly', 'pb_mayo', 'pb_jelly',
'mayo_jelly', 'constant']
data = [[ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[ 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[ 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 1],
[ 0, 0, 1, 1, 0.1, 0, 0, 0, 0, 0, 1, 1],
[ 5, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 1],
[ 5, 0, 1, 0, 8, 0, 5, 0, 0, 0, 0, 1],
[ 5, 0, 0, 1, 1, 0, 0, 5, 0, 0, 0, 1],
[ 5, 0, 1, 1, 0.1, 0, 5, 5, 0, 0, 1, 1],
[ 0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 0, 1],
[ 0, 5, 1, 0, 0.1, 0, 0, 0, 5, 0, 0, 1],
[ 0, 5, 0, 1, 9, 0, 0, 0, 0, 5, 0, 1],
[ 0, 5, 1, 1, 0.1, 0, 0, 0, 5, 5, 1, 1],
[ 5, 5, 0, 0, 0.1, 25, 0, 0, 0, 0, 0, 1],
[ 5, 5, 1, 0, 0.1, 25, 5, 0, 5, 0, 0, 1],
[ 5, 5, 0, 1, 0.1, 25, 0, 5, 0, 5, 0, 1],
[ 5, 5, 1, 1, 0.1, 25, 5, 5, 5, 5, 1, 1]]
(ML; 60 min)
Location: machine-learning/analysis/logistic-regression-space-empires.py
Suppose that you have a dataset of points $(x,y)$ where $x$ is the number of hours that a player has practiced Space Empires and $y$ is their probability of winning against an average player.
data = [(0, 0.01), (0, 0.01), (0, 0.05), (10, 0.02), (10, 0.15), (50, 0.12), (50, 0.28), (73, 0.03), (80, 0.10), (115, 0.06), (150, 0.12), (170, 0.30), (175, 0.24), (198, 0.26), (212, 0.25), (232, 0.32), (240, 0.45), (381, 0.93), (390, 0.87), (402, 0.95), (450, 0.98), (450, 0.85), (450, 0.95), (460, 0.91), (500, 0.95)]
Fit a function $y=\dfrac{1}{1+e^{\beta_0 + \beta_1 x}}$ to the data above, using matrix methods involving the pseudoinverse ($\vec{\beta} \approx (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \vec{y}$).
First, you will need to transform the data into a set of linear equations in $\beta_0$ and $\beta_1.$ Here's how to do this:
\begin{align*} y &=\dfrac{1}{1+e^{\beta_0 + \beta_1 x}} \\ 1+e^{\beta_0 + \beta_1 x} &= \dfrac{1}{y} \\ e^{\beta_0 + \beta_1 x} &= \dfrac{1}{y}-1 \\ \beta_0 + \beta_1 x &= \ln \left( \dfrac{1}{y}-1 \right) \end{align*}

So,

\begin{align*} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \ln \left( \dfrac{1}{y_1}-1 \right) \\ \ln \left( \dfrac{1}{y_2}-1 \right) \\ \vdots \\ \ln \left( \dfrac{1}{y_n}-1 \right) \end{bmatrix} \end{align*}

a. (2 pts) If you practice for $300$ hours, what is your expected chance of winning against an average player? Round your answer to 3 decimal places.
Check: if your answer is correct, the digits will sum to 18.
b. (2 pts) Plot the predicted $y$-values for $0 \leq x \leq 750.$ Save your plot as machine-learning/analysis/logistic_regression_space_empires.png
.
c. (2 pts) Make sure your code is clean. (Follow the coding commandments!)
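Here is a rough numpy sketch of the pseudoinverse fit described above, in case you want a starting point. The plotting details are just one option, and the data list is copied from the problem statement.

# Sketch: fit y = 1 / (1 + e^(b0 + b1*x)) via the linearizing transform above.
import numpy as np
import matplotlib.pyplot as plt

data = [(0, 0.01), (0, 0.01), (0, 0.05), (10, 0.02), (10, 0.15), (50, 0.12),
        (50, 0.28), (73, 0.03), (80, 0.10), (115, 0.06), (150, 0.12),
        (170, 0.30), (175, 0.24), (198, 0.26), (212, 0.25), (232, 0.32),
        (240, 0.45), (381, 0.93), (390, 0.87), (402, 0.95), (450, 0.98),
        (450, 0.85), (450, 0.95), (460, 0.91), (500, 0.95)]

x = np.array([point[0] for point in data], dtype=float)
y = np.array([point[1] for point in data], dtype=float)

X = np.column_stack([np.ones_like(x), x])        # rows of [1, x_i]
target = np.log(1 / y - 1)                       # ln(1/y - 1)

beta = np.linalg.inv(X.T @ X) @ X.T @ target     # (X^T X)^-1 X^T y
b0, b1 = beta

def predict(hours):
    return 1 / (1 + np.exp(b0 + b1 * hours))

print('P(win) after 300 hours:', round(float(predict(300)), 3))   # part (a)

xs = np.linspace(0, 750, 500)
plt.scatter(x, y)
plt.plot(xs, predict(xs))
plt.xlabel('hours practiced')
plt.ylabel('probability of winning')
plt.savefig('machine-learning/analysis/logistic_regression_space_empires.png')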
(Stats; 60 min)
Note: No points will be awarded if you do not show your work.
Location: assignment-problems/coin-flip-likelihood.py
Suppose we have a coin that lands on heads with probability $p$ and tails with probability $1-p.$
We flip the coin $5$ times and get HHTTH.
a. (1 pt) Compute the likelihood of the observed outcome if the coin were fair (i.e. $p=0.5$). SHOW YOUR WORK and round your answer to 5 decimal places.
\begin{align*} \mathcal{L}(p=0.5 \, | \, \text{HHTTH}) &= P(\text{H}) \cdot P(\text{H}) \cdot P(\text{T}) \cdot P(\text{T}) \cdot P(\text{H}) \\ &= \, ? \end{align*}

Check: if your answer is correct, the digits will sum to 11.
b. (1 pt) Compute the likelihood of the observed outcome if the coin were slightly biased towards heads, say $p=0.55.$ SHOW YOUR WORK and round your answer to 5 decimal places.
\begin{align*} \mathcal{L}(p=0.55 \, | \, \text{HHTTH}) &= P(\text{H}) \cdot P(\text{H}) \cdot P(\text{T}) \cdot P(\text{T}) \cdot P(\text{H}) \\ &= \, ? \end{align*}

Check: if your answer is correct, the digits will sum to 21.
c) (1 pt) Compute the likelihood of the observed outcome for a general value of $p.$ Your answer should be a function of $p.$
\begin{align*} \mathcal{L}(p \, | \, \text{HHTTH}) &= P(\text{H}) \cdot P(\text{H}) \cdot P(\text{T}) \cdot P(\text{T}) \cdot P(\text{H}) \\ &= \, ? \end{align*}

Check: When you plug in $p=0.5,$ you should get the answer from part (a), and when you plug in $p=0.55,$ you should get the answer from part (b).
d) (2 pts) Plot a graph of $\mathcal{L}(p \, | \, \text{HHTTH})$ for $0 \leq p \leq 1.$ Save your graph as assignment-problems/coin-flip-likelihood.png
.
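For part (d), here is a minimal plotting sketch; the likelihood expression just writes out the product $P(\text{H}) P(\text{H}) P(\text{T}) P(\text{T}) P(\text{H})$ from part (c).

# Sketch for coin-flip-likelihood.py, part (d): plot L(p | HHTTH) over [0, 1].
import numpy as np
import matplotlib.pyplot as plt

def likelihood(p):
    return p * p * (1 - p) * (1 - p) * p   # P(H) P(H) P(T) P(T) P(H)

p_values = np.linspace(0, 1, 500)
plt.plot(p_values, likelihood(p_values))
plt.xlabel('p')
plt.ylabel('likelihood of HHTTH')
plt.savefig('assignment-problems/coin-flip-likelihood.png')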
(Space Empires; 45 min)
Create another game events file for 3 turns using Combat Players. This time, use DESCENDING die rolls: 6, 5, 4, 3, 2, 1, 6, 5, 4, ...
The game events file for ascending die rolls is provided below. So, you should use the same template, using DESCENDING die rolls.
Save your file as notes/combat_player_game_events_descending_rolls.txt
, and rename your other file to notes/combat_player_game_events_ascending_rolls.txt
This problem is worth 2 points for completion, and 2 points for being correct.
STARTING CONDITIONS
Players 1 and 2
are CombatTestingPlayers
start with 20 CPs
have an initial fleet of 3 scouts, 3 colony ships, 4 ship yards
---
TURN 1 - MOVEMENT PHASE
Player 1:
Scouts 1,2,3: (2,0) --> (2,2)
Colony Ships 4,5,6: (2,0) --> (2,2)
Player 2:
Scouts 1,2,3: (2,4) --> (2,2)
Colony Ships 4,5,6: (2,4) --> (2,2)
COMBAT PHASE
Colony Ships are removed
| PLAYER | SHIP | HEALTH |
----------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 1 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 1
Attacker: Player 1 Scout 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 1
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 2 | 1 |
| 2 | Scout 3 | 1 |
Attack 2
Attacker: Player 1 Scout 2
Defender: Player 2 Scout 2
Largest Roll to Hit: 3
Die Roll: 2
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 3 | 1 |
Attack 3
Attacker: Player 1 Scout 3
Defender: Player 2 Scout 3
Largest Roll to Hit: 3
Dice Roll: 3
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
Combat phase complete
------------------------
TURN 1 - ECONOMIC PHASE
Player 1
INCOME/MAINTENANCE (starting CP: 20)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: -1 CP/Scout x 3 Scouts = -3 CP
PURCHASES (starting CP: 20)
ship size technology 2: -10 CP
destroyer: -9 CP
REMAINING CP: 1
Player 2
INCOME/MAINTENANCE (starting CP: 20)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: 0
PURCHASES (starting CP: 23)
ship size technology 2: -10 CP
destroyer: -9 CP
REMAINING CP: 4
------------------------
TURN 2 - MOVEMENT PHASE
Player 1:
Scouts 1,2,3: stay at (2,2)
Destroyer 1 : (2,0) --> (2,2)
Player 2:
Destroyer 1: (2,4) --> (2,2)
------------------------
TURN 2 - COMBAT PHASE
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Destroyer 1 | 1 |
| 2 | Destroyer 1 | 1 |
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
Attack 1
Attacker: Player 1 Destroyer 1
Defender: Player 2 Destroyer 1
Largest Roll to Hit: 4
Dice Roll: 4
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Destroyer 1 | 1 |
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
------------------------
TURN 2 - ECONOMIC PHASE
Player 1
INCOME/MAINTENANCE (starting CP: 1)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: -1 CP/Scout x 3 Scouts -1 CP/Destroyer x 1 Destroyer = -4 CP
REMAINING CP: 0
Player 2
INCOME/MAINTENANCE (starting CP: 4)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: 0
PURCHASES (starting CP: 7)
scout: -6 CP
REMAINING CP: 1
------------------------
TURN 3 - MOVEMENT PHASE
Player 1:
Scouts 1,2,3: stay at (2,2)
Destroyer 1 : stay at (2,2)
Player 2:
Scout 1: (2,4) --> (2,2)
------------------------
TURN 3 - COMBAT PHASE
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Destroyer 1 | 1 |
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
| 2 | Scout 1 | 1 |
Attack 1
Attacker: Player 1 Destroyer 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 4
Dice Roll: 5
Hit or Miss: Miss
Attack 2
Attacker: Player 1 Scout 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Dice Roll: 6
Hit or Miss: Miss
Attack 3
Attacker: Player 1 Scout 2
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Dice Roll: 1
Hit or Miss: Hit
| PLAYER | SHIP | HEALTH |
---------------------------------------------
| 1 | Destroyer 1 | 1 |
| 1 | Scout 1 | 1 |
| 1 | Scout 2 | 1 |
| 1 | Scout 3 | 1 |
------------------------
TURN 3 - ECONOMIC PHASE
Player 1
INCOME/MAINTENANCE (starting CP: 0)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: -1 CP/Scout x 3 Scouts -1 CP/Destroyer x 1 Destroyer = -4 CP
REMOVALS
remove scout 3 due to inability to pay maintenance costs
REMAINING CP: 0
Player 2
INCOME/MAINTENANCE (starting CP: 1)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: 0
REMAINING CP: 4