Course overview
About this course
Data science is an applied study of data for statistical analysis and problem solving. This path of courses covers the data science pipeline needed by the everyday data scientist: data wrangling, analysis, machine learning, and communication and visualization.
Audience profile
Individuals with some programming and math experience working toward implementing data science in their everyday work.
At course completion
After completing this course, students will have a working knowledge of:
- Data Science Overview
- Data Gathering
- Data Filtering
- Data Transformation
- Data Exploration
- Data Integration
- Data Analysis Concepts
- Data Classification and Machine Learning
- Data Communication and Visualization
Course details
Data Science Overview
Module 1: Defining Data Science
- What is Data Science?
- What is Data Wrangling?
- What is Big Data?
- What is Machine Learning?
- Implementing Data Science
Module 2: Data Science Terminology
- Data Communication
- Data Science Pipeline
- Data Science Tools
Data Gathering
Module 3: Data Extraction
- Basic Data Gathering
- Gathering Web Data
- Extracting Spreadsheet Data with in2csv
- Extracting Spreadsheet Data with Agate
- Extracting Legacy Data from dBASE Tables
- Extracting HTML Data
Module 4: Metadata
- Gathering Metadata
- Working with HTTP Headers
- Working with Linux Log Files
- Working with Email Headers
Module 5: Remote Data
- Connecting to Remote Data
- Copying Remote Data
- Synchronizing Remote Data
Data Filtering
Module 6: Introduction to Data Filtering
- Data Filtering Techniques and Tools
- Processing Date Formats
- Filtering HTTP Headers
- Filtering CSV Data
- Replacing Values with sed
- Dropping Duplicate Data
- Working with JPEG Headers
- Filtering PDF Files
- Filtering for Invalid Data
- Parsing robots.txt
Data Transformation
Module 7: File Format Conversions
- Converting CSV to JSON
- Converting XML to JSON
- Converting CSV to SQL
- Converting SQL to CSV
- Changing CSV Delimiters
Module 8: Data Conversions
- Converting Dates
- Converting Numbers
- Rounding Numbers
Module 9: Optical Character Recognition
- OCR JPEG Images
- Extracting Text from PDF Files
Data Exploration
Module 10: Introduction to Data Exploration
- Exploring CSV Data
- Exploring CSV Statistics
- Querying CSV Data
- Plotting from the Command Line
- Counting Words
- Exploring Directory Trees
- Determining Word Frequencies
- Taking Random Samples
- Finding the Top Rows
- Finding Repeated Records
- Identifying Outliers in Data
Data Integration
Module 11: Introduction to Data Integration
- Joining CSV Data
- Concatenating Log Files
- Sorting Text Files
- Merging XML Data
- Aggregating Data
- Normalizing Data
- Denormalizing Data
- Pivoting Data Tables
- Homogenizing Rows
Data Analysis Concepts
Module 12: Data Science Math
- Basic Data Science Math
- Linear Algebra Vector Math
- Linear Algebra Matrix Math
- Linear Algebra Matrix Decomposition
Module 13: Data Analysis Concepts
- Data Formation
- Introduction to Probability
- Working with Events
- Working with Probability
- Continuous Probability Distributions
- Discrete Probability Distributions
- Introduction to Bayes' Theorem
Module 14: Estimates and Measures
- Sampling Data
- Statistical Measures
- Estimators
- Sampling Distributions
- Confidence Intervals
- Hypothesis Tests
- Chi-Square
Data Classification and Machine Learning
Module 15: Machine Learning Introduction
- Introduction to Supervised Learning
- Introduction to Unsupervised Learning
- Understanding Linear Regression
- Working with Predictors
Module 16: Regression and Classification
- Understanding Logistic Regression
- Understanding Dummy Variables
- Using Naïve Bayes Classification
- Working with Decision Trees
Module 17: Clustering
- K-means Clustering
- Using Cluster Validation
- Using Principal Component Analysis
Module 18: Errors and Validation
- Introduction to Errors
- Defining Underfitting
- Defining Overfitting
- Using K-fold Cross-Validation
- Using Neural Networks
- Support Vector Machines (SVM)
Data Communication and Visualization
Module 19: Introduction to Data Communication
- Effective Communication and Visualization
- Correlation Versus Causation
- Simpson’s Paradox
- Presenting Data
- Documenting Data Science
- Visual Data Exploration
Module 20: Plotting
- Creating Scatter Plots
- Plotting Line Graphs
- Creating Bar Charts
- Creating Histograms
- Creating Box Plots
- Creating Network Visualizations
- Creating a Bubble Plot
- Creating Interactive Plots
Prerequisites
No prerequisites
Enquiry
Course : Data Science Essentials