Introduction to Data Science Project

The main set of slides to reference was created by Justin Post here at NC State.

Additionally, here is the bible for data science, which covers essentially all of the main Machine Learning methods.

More resources:

There is an overwhelming number of resources. I link these only because I like them, and they provide an excellent jumping-off point if you are interested.

Note that reading without actually implementing a project or summarizing what you have read is not going to be very helpful (even though it feels like it is).

https://r-graph-gallery.com/ <- The go-to place for data visualization

https://happygitwithr.com/index.html <- How does git work???

https://arrow-user2022.netlify.app/ <- How do I do stuff with large datasets?

https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet <- Just a quick reference guide on markdown

Below are some other books:

https://stat545.com/index.html <- Intro to R

https://r4ds.hadley.nz/ <- Intro to R

https://dtkaplan.github.io/DataComputingEbook/index.html#table-of-contents <- Intro to R

https://do4ds.com/chapters/intro.html <- DevOps for production code in data science

https://edwinth.github.io/ADSwR/index.html <- Agile philosophy

Project Goal:

By the end of this project, you should have a reproducible, visually appealing report.

Questions to address in your report include:

Are there any differences in the people who switch leagues?
Are there specific hitters that over/under performed relative to their salary?
Can you identify any groups of similar batters?
Are players who spend longer in the league worth more?
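
As one possible way into the salary question, the sketch below fits a linear model and looks at its residuals. It runs on simulated toy data, and the column names (`player`, `hits`, `years`, `salary`) are hypothetical stand-ins, not the names in the real file.

```r
# Toy data; every column name here is a hypothetical stand-in.
set.seed(7)
toy <- data.frame(
  player = paste0("player_", 1:50),
  hits   = rpois(50, lambda = 100),
  years  = sample(1:15, 50, replace = TRUE)
)
toy$salary <- 50 + 3 * toy$hits + 40 * toy$years + rnorm(50, sd = 80)

# Model salary from performance and experience.
fit <- lm(salary ~ hits + years, data = toy)
summary(fit)  # coefficients, standard errors, R-squared

# Negative residual: paid less than the model expects for that performance.
# Positive residual: paid more than the model expects.
toy$residual <- resid(fit)
head(toy[order(toy$residual), c("player", "salary", "residual")], 3)
```

Which predictors belong in the model is a judgment call, and a different choice will rank players differently.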

Please explain in a paragraph any potential ethical issues that might arise from the creation of this report.

Stretch goals:

See if you can predict whether a player will be paid higher or lower than the median salary.
Are there any questions that come to your mind that you think are worthwhile to explore? Investigate! Explore!
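
For the median-salary stretch goal, one possible starting point is a logistic regression on a binary "above the median" outcome. This is a minimal sketch on simulated toy data; the column names (`years`, `hits`, `salary`) are assumptions and will need to be replaced with the ones in the real file.

```r
# Toy data standing in for the real file; column names are hypothetical.
set.seed(42)
toy <- data.frame(
  years = sample(1:20, 200, replace = TRUE),
  hits  = rpois(200, lambda = 100)
)
toy$salary <- 100 + 30 * toy$years + 2 * toy$hits + rnorm(200, sd = 100)

# Binary outcome: TRUE when a player earns more than the median salary.
toy$above_median <- toy$salary > median(toy$salary)

# Logistic regression of the binary outcome on the predictors.
fit <- glm(above_median ~ years + hits, data = toy, family = binomial)

# Predicted probabilities, thresholded at 0.5, compared with the truth.
pred <- predict(fit, type = "response") > 0.5
accuracy <- mean(pred == toy$above_median)
accuracy
```

The accuracy here is computed on the same rows used to fit the model, so it is optimistic. Holding out a test set gives a more honest estimate.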

Skills you will acquire along the way:

* Understand the basics of manipulating data in R
* Understand the basics of Statistical Inference
    + Gain experience grappling with randomness
* Interpret multivariate regression output
* Gain familiarity with common Machine Learning tools
* Learn the basics of plotting within R

Start of project:

a <- c(1, 2, 3)
print(a)

## [1] 1 2 3

$$\pi + 2$$

b = [1, 2, 3]
print(b)

## [1, 2, 3]

Download the data from here:

Alternatively, use your browser's inspect-element tool to find the web address of the file. Search for “Baseball Data”.

# data <- read.csv("path/to/your/downloaded/file.csv")
data <- read.csv(
  "https://vincentarelbundock.github.io/Rdatasets/URL"
)
summary(data)
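
Beyond `summary()`, a few other base R calls can help with a first look at the data. The sketch below uses a small made-up data frame, with hypothetical column names, so that it runs without the download.

```r
# Toy data standing in for the downloaded file; column names are hypothetical.
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  hits   = c(120, 95, NA, 143),
  salary = c(500, NA, 320, 875)
)

str(toy)                        # column types and a preview of the values
head(toy, 3)                    # first few rows
missing <- colSums(is.na(toy))  # number of missing values per column
missing

# Keep only rows with no missing values (one option among several).
complete <- na.omit(toy)
nrow(complete)
```

Dropping incomplete rows is the simplest choice, but it changes which players end up in the analysis, so it is worth noting in the report.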

Functions that may be helpful in your analysis:

hclust()
prcomp()
lm(), glm(), glmnet()
kmeans()
t.test()
wilcox.test()
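
To give a feel for how these functions are called, here is a rough sketch on the built-in `iris` data, which is used purely for illustration and has nothing to do with the project data. The choice of three clusters is arbitrary.

```r
# Built-in data, used only to show the function calls.
num <- scale(iris[, 1:4])        # standardize the four numeric columns

pcs <- prcomp(num)               # principal component analysis
tree <- hclust(dist(num))        # hierarchical clustering on Euclidean distances
groups <- cutree(tree, k = 3)    # cut the tree into 3 groups

set.seed(1)                      # k-means starts from random centers
km <- kmeans(num, centers = 3)   # k-means with 3 clusters

# Two-sample comparisons of sepal length between two species.
two <- droplevels(subset(iris, Species != "virginica"))
tt <- t.test(Sepal.Length ~ Species, data = two)
wt <- wilcox.test(Sepal.Length ~ Species, data = two, exact = FALSE)
tt$p.value
wt$p.value
```

`glmnet()` is not part of base R; it comes from the glmnet package, which has to be installed separately.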

Here is an example (modify the file path to match your own) of the neat things that R can do:

# install.packages("tidyverse")
library(tidyverse)

# https://simplemaps.com/data/world-cities
WorldCities <- read_csv("data/worldcities.csv")


BigCities <-
  WorldCities |>
  arrange(desc(population)) |>
  head(4000) |>
  # Assumes the simplemaps file names these columns lng and lat
  select(longitude = lng, latitude = lat)

# k-means starts from random centers, so fix the seed for a reproducible result
set.seed(1)
clusts2 <-
  BigCities |>
  kmeans(centers = 6) |>
  fitted("classes") |>
  as.character()

BigCities <-
  BigCities |>
  mutate(cluster = clusts2)

BigCities |>
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster, shape = cluster)) +
  theme(legend.position = "top")