How to get from Point A to Point Data Scientist. The Introduction.

08 May 2015

by: Beth Stankevich

Awash in a sea of resources. It is easy to find resources to aid in becoming a Data Scientist. And yet, having so many paths to get to the same end can be paralyzing. Thus begins the account of our group’s sojourn (meandering) towards becoming Data Scientists. It is by no means the path. These posts will serve as both a list of the series of steps one might decide to take as well as a reflection on what worked and what didn’t. (Side note: We are loosely following this lovely infographic because its simplicity is refreshing)

From the outset, the endgame of this group was to collaborate on Data Science projects. All of us have datasets in mind that we want to explore. With this endpoint in mind, the decision was made to begin by taking online courses together to gain background knowledge in the Data Science field and the tools required.

#####Step 1: Python - Code Academy Our group consists of Neuroscience and Psychology graduate students (also post-graduate real people). We all fall on a continuum of comfort with coding in R and/or Matlab from decent to proficient. A couple of us had even done some coding in Python (having taken Python courses previously or having learned PsychoPy and vis-a-vis Python to code behavioral tasks). We settled on learning Python via the Code Academy online course.

Reasons we chose Python:

  • powerful
  • simple
  • many packages already developed
  • great online resources for help
  • used within the Data Science community
  • maps relatively well onto the R knowledge we all have to varying degrees

Reasons we chose Code Academy:

  • free
  • 13hrs of course work (not too much time that dissertation work will suffer and the PI will notice)

Other Python courses:

#####Step 1b: Post-Python life The Code Academy course was fairly easy. A good way to toe into Python. It also gave us a lot of practice writing functions, loops, etc. This is great, especially coming from a background in other languages, as it commits to memory the nuances of Python syntax (98% of my problems stemmed from forgetting a colon, sigh). While the course itself does not have a ton of in-course help (some “Tips”), Stackoverflow always comes to your rescue.

Code academy had an in-browser console for interacting with Python. Perfect for the course. Not perfect for using Python on our own. Nicholas DiQuattro broke it down for us:

  1. Installing Python The cool element here is the idea of “virutal environments” which you use to create a new copy of python for each project. This lets you install modules without worrying about conflicts or having to update to a new version of a module that breaks old code because of a new project.

    Sidenote: There’s an issue between Python 2.7 and 3.3. Many modules are based on 2.7, which is why it’s still around, but there are actual differences between the versions. It seems to come down to if some module you want to use only works with 2.7 or 3.3.

  2. Choosing an IDE There are a couple in the link. I ended up using IEP because its simple and has solid code completion. However, recently it’s been hard to get it to plot easily with matplotlib

  3. Modules for data science Once you set up python you can install modules with a shell command like “pip install numpy”. Here are some good ones:
    • numpy: provides arrays type so you can do matlab like stuff
    • matplotlib: plotting
    • pandas: provides a data frame type, also grouping based analysis
    • flask: simple web apps/pages
    • scipy: stats and other functions
  4. BONUS: All-purpose text editor: ATOM This is made by GitHub and has easy packages for downloading added features (like syntax highlighting for almost anything or console support). It also works really well with git files, keeping track of what you have changed since the last commit.

#####Step 1c: Coding Considerations Best summed up by the superb Randall Munroe in this comic. Almost none of us has ever taken a computer science class. Coding style matters. We all know this from our own coding in R and Matlab…especially if we’ve ever had to track down scripts written in the early years of graduate school (raises hand) only to find it a mess and poorly commented (raises hand). Python style guidelines will help [see below], but this discussion led to Step 2: taking a basic introduction to computer science class.

Style guides:

#####Step 2: Introduction to Computer Science - Udacity As stated above, we are Neuroscience and Psychology graduate students (and post-graduate real people). Nary a Computer Scientist amongst us. We are all lacking in the basic knowledge and language of the computer science realm. So, while we are confident in our analytic techniques, statistical know-how, and ability to learn and use new coding languages, we felt we all had a large blindspot with having (almost) zero computer science background.

Reasons we chose Udacity:

  • free
  • relatively short (7 modules, expected to take ~1 month of working a few hours/week)
  • taught in Python (it fit the theme of continuing to get confident in Python)

Other Introduction to Computer Science courses:

As of this post, we are taking the Computer Science class. The first few modules of the course are mostly review for us after having taken a Python course. We all look forward to working on the course project, which is to build a search engine. It seems like a practical way to learn the basics of computer science.

To say that our path is linear, would be incorrect. Data projects have already begun because engaging in a project is likely the best way to learn to utilize the skills necessary for Data Science. Our current path could be summed up as “Chutes and Ladders”.