Exploratory Data Analysis

DS01

20 Credits

Introduction

Prior to attempting this coursework assignment, learners must familiarise themselves with the following policies:

• Centre Specification

o Can be found at https://qualifi.net/qualifi-level-7-diploma-data-science/

• Qualifi Quality Assurance Standards

• Qualifi Quality Policy Statement

Plagiarism and Collusion

In submitting the assignment, Learners must complete a statement of authenticity confirming that the work submitted for all tasks is their own. The statement should also include the word count.

Your accredited study centre will direct you to the appropriate software that checks the level of similarity. Qualifi recommends the use of https://www.turnitin.com as a part of the assessment.

Plagiarism and collusion are treated very seriously. Plagiarism involves presenting work, excerpts, ideas or passages of another author without appropriate referencing and attribution.

Collusion occurs when two or more learners submit work which is so alike in ideas, content, wording and/or structure that the similarity goes beyond what might have been mere coincidence.

Please familiarise yourself with Qualifi’s Malpractice and Maladministration policy, where you can find further information.

Referencing

A professional approach to work is expected from all learners. Learners must therefore identify and acknowledge ALL sources/methodologies/applications used.

The learner must use an appropriate referencing system to achieve this. Marks are not awarded for the use of English; however, the learner must express ideas clearly and ensure that appropriate terminology is used to convey accuracy in meaning.

Qualifi recommends using Harvard Style of Referencing throughout your work.

Appendices

You may include appendices to support your work; however, appendices must only contain additional supporting information and must be clearly referenced in your assignment.

Tables, graphs, diagrams, Gantt charts and flowcharts that support the main report should be incorporated into the back of the assignment report that is submitted.

Any published secondary information, such as annual reports and company literature, should be referenced in the main text of the assignment in accordance with Harvard Style Referencing, and listed in the references at the end of the assignment.

Confidentiality

Where a Learner is using organisational information that deals with sensitive material or issues, they must seek advice and permission from that organisation about its inclusion.

Where confidentiality is an issue, Learners are advised to anonymise their assignment report so that it cannot be attributed to that particular organisation.

Word Count Policy

Learners must comply with the required word count, within a margin of +10%. These rules exclude the index, headings, tables, images, footnotes, appendices and information contained within references and bibliographies.

When an assessment task requires learners to produce presentation slides with supporting notes, the word count applies to the supporting notes only.

Submission of Assignments

All work is to be submitted on the due date, as advised by the Centre.

All work must be submitted in a single electronic document (.doc file), or via Turnitin, where applicable.

This should go to the tutor and Centre Manager/Programme Director, plus one hard copy posted to the Centre Manager (if required).

Marking and grades

Qualifi uses a standard marking rubric for all assignments, and you can find the details at the end of this document.

Unless stated elsewhere, Learners must answer all questions in this document.

Assignment Question

Task 1.1

Handle and manage multiple datasets within R and Python environments.

1.1 Work smoothly in R and Python development environments.

1.2 Import and export data sets and create data frames within R and Python in accordance with instructions.

1.3 Sort, merge, aggregate and append data sets in accordance with instructions.

Assessment Criteria

• Learning the UI and the basic rules of programming in both R and Python

• Create and import external datasets in R and Python

• Export R data frames into external flat files

• Data Management in R and Python (Sort, merge, aggregate and subset)

Task 1.2

Use measures of central tendency to summarize data and assess both the symmetry and variation in the data.

1.1 Differentiate between variable types and measurement scales.

1.2 Calculate the most appropriate (mean, median or mode etc.) measure of central tendency based on variable type.

1.3 Compare variation in two datasets using the coefficient of variation.

1.4 Assess symmetry of data using measures of skewness.

Assessment Criteria

• Introduction to basic concepts of Statistics, such as measures of central tendency, variation, skewness, kurtosis
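
A minimal sketch of criteria 1.3 and 1.4, using only the Python standard library. It compares the relative variation of two small samples with the coefficient of variation and gauges symmetry with Pearson's median skewness coefficient, which is one of several possible skewness measures. The data values and function names are hypothetical and chosen for illustration only.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


def pearson_median_skewness(values):
    """Return Pearson's median skewness: 3 * (mean - median) / sd."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


# Hypothetical samples: daily sales figures for two branches.
branch_a = [10, 12, 11, 13, 14]
branch_b = [5, 20, 8, 30, 12]

cv_a = coefficient_of_variation(branch_a)
cv_b = coefficient_of_variation(branch_b)
skew_a = pearson_median_skewness(branch_a)
skew_b = pearson_median_skewness(branch_b)

print(f"CV: A = {cv_a:.1f}%, B = {cv_b:.1f}%")
print(f"Skewness: A = {skew_a:.2f}, B = {skew_b:.2f}")
```

A larger coefficient of variation indicates greater spread relative to the mean. A skewness value near zero suggests a roughly symmetric sample, while a positive value suggests a longer right tail.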

Task 1.3

Present and summarise distributions of data and the relationships between variables graphically.

1.1 Select the most appropriate graph to present the data.

1.2 Assess distribution using Box-Plot and Histogram.

1.3 Visualize bivariate relationships using scatter-plots.

1.4 Present time-series data using motion charts.

Assessment Criteria

• Frequency tables, crosstabs and bivariate correlation analysis

• Data visualization: what and why? Grammar of graphics, handling data for visualization

• Commonly used charts and graphs using ggplot2 package in R and matplotlib in Python

• Advanced graphics in R and Python

• Data Management in R and Python (Sort, merge, aggregate and subset)

Task 1.4

Evaluate standard discrete and standard continuous distributions.

1.1 Analyse the statistical distribution of a discrete random variable.

1.2 Calculate probabilities using R for Binomial and Poisson Distribution.

1.3 Fit Binomial and Poisson distributions to observed data.

1.4 Evaluate the properties of Normal and Log Normal distributions.

1.5 Calculate probabilities using R for normal and Log normal distributions.

1.6 Fit normal, Log normal and exponential distributions to observed data.

1.7 Evaluate the concept of sampling distribution (t, F and Chi Square).

Assessment Criteria

• Concept of random variables and statistical distribution

• Discrete vs. Continuous Random Variables

• t tests (one sample, independent samples, paired sample)

• Standard discrete distributions-Bernoulli, Binomial and Poisson

• Using R to calculate probabilities

• Fitting of discrete distributions to observed data

• Standard continuous distributions-Normal, Log Normal, Exponential

• Introduction to sampling distributions
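
As an illustration of criterion 1.2, the sketch below evaluates Binomial and Poisson probabilities directly from their probability mass functions using the Python standard library. The criterion itself asks for R, where `dbinom` and `dpois` give the same quantities. The function names and the example parameters here are hypothetical.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical questions.
# Probability of exactly 2 heads in 4 tosses of a fair coin.
p_two_heads = binomial_pmf(2, 4, 0.5)

# Probability of no calls in an hour when 2 calls are expected on average.
p_no_calls = poisson_pmf(0, 2.0)

# Cumulative probability P(X <= 1) for the same Poisson variable.
p_at_most_one = sum(poisson_pmf(k, 2.0) for k in range(2))

print(p_two_heads, p_no_calls, p_at_most_one)
```

Cumulative probabilities are obtained by summing the mass function over the relevant values, as in the last calculation.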

Task 1.5

Formulate research hypotheses and perform hypothesis testing.

1.1 Write R and Python programmes that evaluate appropriate hypothesis tests.

1.2 Draw statistical inference using output in R.

1.3 Translate research problems into statistical hypotheses.

1.4 Assess the most appropriate statistical test for a hypothesis.

Assessment Criteria

• Statistical Hypothesis Testing-concepts and terminology

• Parameter, test statistics, level of significance, power, critical region

• Parametric vs. non-Parametric Tests

• Z tests for proportions (single and independent samples)

• Non-parametric tests (Mann-Whitney U, Wilcoxon signed-rank)

• Tests for Normality, Q-Q plot

Task 2.1

Analyse the concept of analysis of variance (ANOVA) and select an appropriate ANOVA or ANCOVA model.

2.1 Define variable, factor and level for a given research problem.

2.2 Evaluate the sources of variation, explained variation and unexplained variation.

2.3 Define a linear model for ANOVA/ANCOVA.

2.4 Confirm the validity of assumption based on definitions and analysis of variation.

2.5 Perform analysis using R and Python programs to confirm validity of assumptions.

2.6 Draw inferences from statistical analysis of the research problem.

Assessment Criteria

• What is analysis of variance?

• Definitions: Variable, factor, levels

• One Way Analysis of Variance

• Two Way Analysis of Variance (including interaction effects)

• Multi Way Analysis of Variance

• Analysis of Covariance

• Kruskal-Wallis Test

• Friedman Test

Task 2.2

Carry out global and individual testing of parameters used in defining predictive models.

2.1 Evaluate dependent variables and predictors.

2.2 Develop linear models using the lm function in R and the .ols function in Python.

2.3 Interpret signs and values of estimated regression coefficients.

2.4 Interpret output of global testing using F distributions.

2.5 Identify significant and insignificant variables.

Assessment Criteria

• Concept of random variables and statistical distribution

• Concept of a statistical model

• Estimation of model parameters using Least Square Method

• Interpreting regression coefficients

• Assessing the goodness of fit of a model

• Global hypothesis testing using F distribution

• Individual testing using t distributions

Task 2.3

Validate assumptions in multiple linear regression.

2.1 Resolve multicollinearity problems.

2.2 Revise a model after resolving the problem.

2.3 Assess the performance of the ridge regression model.

2.4 Perform residual analysis – graphically & using statistical tests to analyse results.

2.5 Resolve problems of non-normality of errors and heteroscedasticity.

Assessment Criteria

• Concept of Multicollinearity

• Calculating Variance Inflation Factors

• Resolving problem by dropping variables

• Ridge regression method

• Stepwise regression as a strategy

• Residual analysis

• Shapiro Wilk test, K-S test and Q-Q plot for residuals

• White’s test and Breusch-Pagan Test

• Partitioning data using the caret package

Task 2.4

Validate models via data partitioning, out of sample testing and cross-validation.

2.1 Develop models and implement them on testing data in accordance with the specification.

2.2 Evaluate the stability of the models using k-fold cross validation.

2.3 Evaluate influential observations using Cook’s distance and hat matrix.

Assessment Criteria

• Model development on training data

• Model validation on testing data using R squared and RMSE

• Concept of k-fold cross validation

• Performing k-fold cross validation using the caret package

• Identifying influential observations
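
The validation measures named above can be sketched with the Python standard library. The snippet computes RMSE and R squared for a set of predictions and generates train/test index splits for k-fold cross validation. The function names and data are hypothetical, and the folds are contiguous for simplicity; in practice rows are usually shuffled first, and the caret package named in the criteria automates the partitioning in R.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """1 - SS_residual / SS_total, computed on the supplied data."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical observed values and model predictions on a testing set.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
test_rmse = rmse(actual, predicted)
test_r2 = r_squared(actual, predicted)
print(test_rmse, test_r2)

# Three folds over six rows: each row appears in exactly one test fold.
folds = list(k_fold_indices(6, 3))
print(folds[0])
```

A model is judged stable when its RMSE and R squared remain similar across the k test folds.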

Task 2.5

Develop models using binary logistic regression and assess their performance.

2.1 Evaluate when to use Binary Logistic Regression correctly.

2.2 Develop realistic models using functions in R and Python.

2.3 Interpret output of global testing using the Likelihood Ratio Test (LRT) in order to assess the results.

2.4 Perform out of sample validation that tests predictive quality of the model.

Assessment Criteria

• Model definition and parameter estimation

• Estimation of model parameters using MLE

• Interpreting regression coefficients and odds ratio

• Assessing goodness of fit of the model

• Global hypothesis testing using LRT distribution

• Individual testing using Wald’s test

Task 3.1

Develop applications of multinomial logistic regression and ordinal logistic regression.

3.1 Select method for modelling categorical variable.

3.2 Develop models for nominal and ordinal scaled dependent variable in R and Python correctly.

Assessment Criteria

• Classification table

• ROC curve

• K-S Statistic

• Multinomial and Ordinal Logistic Regression – model building and parameter estimation

• Interpretation of regression coefficients

• Classification table and deviance test

Task 3.2

Develop generalised linear models and carry out survival analysis and Cox regression.

3.1 Evaluate the concept of generalised linear models.

3.2 Apply the Poisson regression model and negative binomial regression to count data correctly.

3.3 Model a ‘time to event’ variable using Cox regression.

Assessment Criteria

• Concept of GLM and link function and .GLM

• Poisson Regression

• Negative Binomial Regression

• Survival Analysis Introduction

• Cox Regression

Task 3.3

Assess the concepts and uses of time series analysis and test for stationarity in time series data.

3.1 Create time series object in R and Python correctly including decomposing time series and assessing different components.

3.2 Assess whether a time series is stationary.

3.3 Transform non-stationary time series data into stationary time series data.

Assessment Criteria

• Components of time series

• Seasonal decomposition

• Trend analysis

• Auto-correlogram

• Partial auto-correlogram

• Dickey-Fuller test

• Converting non-stationary time series data into stationary time series data
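
One common way of converting a non-stationary series into a stationary one is differencing. The sketch below is a minimal illustration with a hypothetical series and a hypothetical helper function; it does not test for stationarity, which would be done separately, for example with the Dickey-Fuller test listed above.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series whose increases grow over time.
sales = [100, 104, 109, 115, 122, 130]

# First differences still trend upward for this series.
first_diff = difference(sales)

# Second differences are constant, so the trend has been removed.
second_diff = difference(first_diff)

print(first_diff)
print(second_diff)
```

The number of times a series must be differenced before it appears stationary corresponds to the d term of an ARIMA model.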

Task 3.4

Validate ARIMA (Auto Regressive Integrated Moving Average) models and use estimation.

3.1 Identify p, d and q of ARIMA model using ACF (auto-correlation function) and a PACF (partial auto-correlation function) to describe how well values are related.

3.2 Develop ARIMA models using R and Python and evaluate whether errors follow the white noise process.

3.3 Finalize the model and forecast n-period ahead to make accurate predictions.

Assessment Criteria

• Concepts of AR, MA and ARIMA models

• Model identification using ACF and PACF

• Parameter estimation

• Residual analysis (testing for white noise process)

• Selection of optimal model

Task 3.5

Implement panel data regression methods.

3.1 Evaluate the concept of panel data regression.

3.2 Analyse the features of panel data.

3.3 Build panel data regression models in a range of contexts.

3.4 Evaluate the difference between fixed effect and random effect models.

Assessment Criteria

• What is Panel data?

• Need for different models for Panel data

• Panel data regression methods

Task 4.1

Define Principal Component Analysis (PCA) and its derivations and assess their application.

4.1 Evaluate the need for data reduction.

4.2 Perform principal component analysis and develop scoring models using R and Python to minimise data loss and improve interpretability of data.

4.3 Resolve multi-collinearity using Principal Component Regression.

Assessment Criteria

• Concept of Data reduction

• Definition of first, second, … p-th principal component

• Deriving principal component using Eigenvectors

• Deciding optimum number of principal components

• Developing scoring models using PCA

• Principal component regression

Task 4.2

Understand hierarchical and non-hierarchical cluster analysis and assess their outputs.

4.1 Perform data reduction and derive interpretable factors and use factor scores to interpret the data set.

4.2 Obtain a brand perception map using multi-dimensional scaling.

Assessment Criteria

• Orthogonal factor model

• Estimation of loading matrix

• Interpreting factor solution

• Deciding optimum number of factors

• Using factor scores for further analysis

• Factor rotation

• Concept of MDS

• Variable reduction using MDS

Task 4.3

Evaluate the concept of panel data regression and implement panel data methods.

4.1 Evaluate the need for cluster analysis.

4.2 Obtain clusters using suitable methods.

4.3 Interpret cluster solutions and analyse the use of clusters for business strategies.

Assessment Criteria

• Concept of cluster analysis

• Hierarchical cluster analysis methods (linkage methods)

• Using dendrogram to estimate optimum number of clusters

• k-means clustering methods

• Using k-means runs function in R and Python to find optimum number of k

Task 4.4

Appraise classification methods including Naïve Bayes and the support vector machine algorithm.

4.1 Evaluate different methods of classification and the performance of classifiers.

4.2 Design optimum classification rules to achieve minimum error rates.

Assessment Criteria

• Bayes theorem and its applications

• Constructing classifier using Naïve Bayes method

• Concept of Hyperplane

• Support vector machine algorithm

• Comparison with Binary Logistic Regression

Task 4.5

Apply decision tree and random forest algorithms to classification and regression problems.

4.1 Use decision trees for classification and regression problems in comparison with classical methodologies.

4.2 Analyse concepts of bootstrapping and bagging.

4.3 Apply the random forest method in a range of business and social contexts.

Assessment Criteria

• Basics of Decision Tree

• Concept of CART

• CHAID algorithm

• ctree function in R

• Bootstrapping and bagging

• Random forest algorithm

Task 5.1

Analyse Market Baskets and apply neural networks to classification problems.

5.1 Analyse transactions data for possible associations and derive baskets of associated products.

5.2 Apply neural networks to a classification problem in domains such as speech recognition, image recognition and document categorisation.

Assessment Criteria

• Definitions of support, confidence and lift

• Apriori algorithm for market basket analysis

• Neural networks for classification problems
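
The definitions of support, confidence and lift can be sketched as follows for a single association rule. The transactions and function names are hypothetical, and the snippet only evaluates one rule; it does not implement the Apriori search itself.

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Share of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent) / support(consequent)


# Rule under evaluation: customers who buy bread also buy milk.
antecedent, consequent = {"bread"}, {"milk"}
rule_support = support(antecedent | consequent)
rule_confidence = confidence(antecedent, consequent)
rule_lift = lift(antecedent, consequent)
print(rule_support, rule_confidence, rule_lift)
```

A lift below 1, as in this example, suggests the two items appear together less often than would be expected if they were bought independently.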

Task 5.2

Perform text mining on social media data.

5.1 Appraise the concepts and techniques used in text mining.

5.2 Analyse unstructured data and perform sentiment analysis of Twitter data to identify the positive, negative or neutral tone of the text.

Assessment Criteria

• What is text mining?

• Term Document Matrix

• Word cloud

• Establishing connection with Twitter using twitteR package and Tweepy in Python

Task 5.3

Develop web pages using the SHINY package.

5.1 Build interpretable dashboards using the SHINY package.

5.2 Host standalone applications on a web page to present the results of data analysis.

Assessment Criteria

• Introduction to SHINY

• Introduction to R Markdown

• Build dashboards

• Host standalone apps on a webpage or embed them in R Markdown documents or build dashboards.

Task 5.4

Apply the Hadoop framework in Big Data Analytics.

5.1 Evaluate core concepts of Hadoop.

5.2 Appraise applications of Big Data Analytics in various industries.

5.3 Evaluate the use of the HADOOP platform for performing Big Data Analytics.

Assessment Criteria

• What is Big Data?

• Features of Big Data (Volume, Velocity and Variety)

• Big Data in different industries (Healthcare, Telecom, etc.)

• HADOOP architecture

• Introduction to R HADOOP package

Task 5.5

Evaluate the fundamental concepts of artificial intelligence.

5.1 Build a simple AI model using common machine learning algorithms that support business analysis and decision-making, in comparison with traditional assumptions from business theory.

Assessment Criteria

• What is AI and Theory behind AI

• What is Q learning

• The Monte Carlo theory

Task 6.1

Use SQL programming for data analysis.

6.1 Evaluate core SQL for data analytics.

6.2 Carry out data wrangling and analysis in SQL to uncover insights in underutilized data.

Assessment Criteria

• SQL programming Basics

• Data Wrangling and analysis

• Text mining of Twitter data
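
A minimal sketch of data wrangling and analysis in SQL, run here through Python's built-in `sqlite3` module so that it is self-contained. The `orders` table, its columns and its rows are hypothetical. The query treats missing amounts as zero, aggregates by region, filters the groups and sorts the result.

```python
import sqlite3

# In-memory database holding a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'North', 120.0),
        (2, 'South', 80.0),
        (3, 'North', 200.0),
        (4, 'South', NULL),
        (5, 'East', 50.0);
    """
)

# Replace NULL amounts with 0, total by region, keep regions above 60.
query = """
    SELECT region,
           COUNT(*)                 AS n_orders,
           SUM(COALESCE(amount, 0)) AS total_amount
    FROM orders
    GROUP BY region
    HAVING SUM(COALESCE(amount, 0)) > 60
    ORDER BY total_amount DESC;
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Whether a missing value should be treated as zero, excluded or imputed is an analytical decision that should be justified for the dataset in question.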

Task 6.2

Evaluate the concept of transformation and the key technologies that drive it.

6.1 Analyse the technologies that underpin digital transformation.

6.2 Assess the managerial challenges associated with implementing digital transformation successfully.

Assessment Criteria

• Fundamentals of Cloud Computing

• Compare and contrast cloud computing with traditional computing models

Task 6.3

Assess the strategic impact of the application of Big Data and Artificial Intelligence on business organisations.

6.1 Evaluate theories of strategy and their application to the digital economy and business.

6.2 Analyse examples of the application of Artificial intelligence on business operations or strategy.

Assessment Criteria

• Software as a Service

• Platform as a Service

• Infrastructure as a Service

• Business impact of Cloud Computing

• Historical development of Artificial Intelligence

Task 6.4

Appraise theories of innovation and distinguish between disruptive and incremental change.

6.1 Evaluate theories of disruptive innovation and how they explain the impact of innovation on industries.

6.2 Evaluate the managerial challenges of promoting and implementing innovation within organizations.

Assessment Criteria

• Vs of data – Volume, velocity, variety, veracity and value

• Christensen’s theory of disruptive innovation

Task 6.5

Evaluate ethics practices within organisations and how they relate to issues in Data Science.

6.1 Assess the role that codes of ethics play in the operation and sustainability of organisations.

6.2 Evaluate the importance of reporting and disclosure for ethical practice.

Assessment Criteria

• Ethical dilemmas and issues in Artificial Intelligence and Big Data

Criteria and bands: Distinguished (80+), Excellent (70), Good (60), Proficient (50), Basic (40), Marginal (30), Unacceptable (0)

Content (alignment with assessment criteria)

• Distinguished (80+): Extensive evaluation and synthesis of ideas; includes substantial original thinking

• Excellent (70): Comprehensive critical evaluation and synthesis of ideas; includes coherent original thinking

• Good (60): Adequate evaluation and synthesis of key ideas beyond basic descriptions; includes original thinking

• Proficient (50): Describes main ideas with evidence of evaluation; includes some original thinking

• Basic (40): Describes some of the main ideas but omits some concepts; limited evidence of evaluation; confused original thinking

• Marginal (30): Largely incomplete description of main issues; misses key concepts; no original thinking

• Unacceptable (0): Inadequate information or containing information not relevant to the topic

Application of Theory and Literature

• Distinguished (80+): In-depth, detailed and relevant application of theory; expertly integrates literature to support ideas and concepts

• Excellent (70): Clear and relevant application of theory; fully integrates literature to support ideas and concepts

• Good (60): Appropriate application of theory; integrates literature to support ideas and concepts

• Proficient (50): Adequate application of theory; uses literature to support ideas and concepts

• Basic (40): Limited application of theory; refers to literature but may not use it consistently

• Marginal (30): Confused application of theory; does not use literature for support

• Unacceptable (0): Little or no evidence of application of theory and relevant literature

Knowledge and Understanding

• Distinguished (80+): Extensive depth of understanding and exploration beyond key principles and concepts

• Excellent (70): Comprehensive knowledge and depth of understanding of key principles and concepts

• Good (60): Sound understanding of principles and concepts

• Proficient (50): Basic knowledge and understanding of key concepts and principles

• Basic (40): Limited and superficial knowledge and understanding of key concepts and principles

• Marginal (30): Confused or inadequate knowledge and understanding of key concepts and principles

• Unacceptable (0): Little or no evidence of knowledge or understanding of key concepts and principles

Presentation and Writing Skills

• Distinguished (80+): Logical, coherent and polished presentation exceeding expectations at this level; free from errors in mechanics and syntax

• Excellent (70): Logical, coherent presentation demonstrating mastery; free from errors in mechanics and syntax

• Good (60): Logical structure to presentation; makes few errors in mechanics and syntax which do not prohibit meaning

• Proficient (50): Orderly presentation; minor errors in mechanics and syntax

• Basic (40): Somewhat weak presentation; errors in mechanics and syntax may interfere with meaning

• Marginal (30): Confused presentation; errors in mechanics and syntax often interfere with meaning

• Unacceptable (0): Illogical presentation lacking cohesion; contains significant errors that interfere with meaning

Referencing

• Distinguished (80+): Advanced use of in-text citation and references

• Excellent (70): Mastery of in-text citation and referencing

• Good (60): Appropriate use of in-text citation and referencing

• Proficient (50): Adequate use of in-text citation and referencing

• Basic (40): Limited use of in-text citation and referencing

• Marginal (30): Inadequate use of citation and referencing

• Unacceptable (0): Little or no evidence of appropriate referencing or use of sources

Directions:

1. For each of the criteria listed in the first column, circle one box in the corresponding column to the right which best reflects the student’s work on this particular assessment activity (e.g., project, presentation, essay).

2. Provide specific feedback to a student about each of the criteria scores he/she earned by writing comments and suggestions for improvement in the last row titled “Instructor’s comments.”

3. To arrive at the final mark, total the scores for the five criteria and divide by 5.

Example:

Ranges: Distinguished 80-100; Excellent 70-79; Good 60-69; Proficient 50-59; Basic 40-49; Marginal 35-39; Unacceptable 0-34

Criteria and scores:

• Content: 50

• Application of Theory and Literature: 40

• Knowledge and Understanding: 50

• Presentation/Writing Skills: 40

• Referencing: 40

Total Score: 220/5 = 44, Basic
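
The arithmetic in this example can be sketched in Python as follows. The band boundaries are taken from the ranges given in the example; the function and variable names are hypothetical, and the treatment of fractional averages (assigning each mark to the highest band whose lower bound it reaches) is an assumption, as the directions do not state a rounding rule.

```python
# Lower bound of each band, taken from the ranges in the example.
BANDS = [
    (80, "Distinguished"),
    (70, "Excellent"),
    (60, "Good"),
    (50, "Proficient"),
    (40, "Basic"),
    (35, "Marginal"),
    (0, "Unacceptable"),
]


def final_mark(scores):
    """Average the criterion scores and look up the matching band."""
    mark = sum(scores.values()) / len(scores)
    # Assumption: a mark belongs to the highest band whose floor it reaches.
    band = next(name for floor, name in BANDS if mark >= floor)
    return mark, band


example_scores = {
    "Content": 50,
    "Application of Theory and Literature": 40,
    "Knowledge and Understanding": 50,
    "Presentation/Writing Skills": 40,
    "Referencing": 40,
}
mark, band = final_mark(example_scores)
print(mark, band)  # 44.0 Basic
```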

HEAD OFFICE

7 Acorn Business Park Commercial Gate, Nottingham Nottinghamshire

NG18 1EX

LONDON OFFICE

Golden Cross House

8 Duncannon Street, London WC2N 4JF

info@qualifi.net

Copyright 2019 Qualifi Ltd
