Mathematics is the bedrock of any contemporary discipline of science. Almost all the techniques of modern data science, including machine learning, have a deep mathematical underpinning. A solid foundation helps with tasks such as:
- Modeling a process (physical or informational) by probing its underlying dynamics
- Rigorously estimating the quality of a data source
- Quantifying the uncertainty around data and predictions
- Identifying hidden patterns in a stream of information
- Understanding the limitations of a model
- Understanding mathematical proofs and the abstract logic behind them
Statistics and probability are an absolute must-know for anyone growing as a data scientist. The importance of a solid grasp of their essential concepts cannot be overstated in any discussion of data science; many practitioners call classical (non-neural-network) machine learning nothing but statistical learning. The subject is vast, so focused planning is critical to cover the most essential concepts:
- Data summaries and descriptive statistics: central tendency, variance, covariance, correlation
- Basic probability: expectation, probability calculus, conditional probability, Bayes' theorem
- Probability distributions: uniform, normal, binomial, chi-square, Student's t-distribution; the central limit theorem
- Sampling, measurement error, random number generation
- Hypothesis testing, A/B testing, confidence intervals, p-values
- ANOVA, t-tests
- Linear regression, regularization
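As a quick illustration of several of these topics (sampling, the normal distribution, and the central limit theorem), the sketch below uses only Python's standard library to draw repeated samples from a uniform distribution and show that the sample means concentrate around the population mean:

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    """Mean of n independent uniform(0, 1) draws."""
    return statistics.fmean(random.random() for _ in range(n))

# Population: uniform on [0, 1), true mean 0.5, variance 1/12.
# Central limit theorem: the distribution of sample means is
# approximately normal with mean 0.5 and standard deviation
# sqrt((1/12)/n), even though the population is uniform.
means = [sample_mean(100) for _ in range(2000)]

print(round(statistics.fmean(means), 3))   # close to 0.5
print(round(statistics.stdev(means), 3))   # close to sqrt(1/1200), about 0.029
```

The same simulation idea underlies why averages of many noisy measurements behave predictably, which is what makes confidence intervals and hypothesis tests possible.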
Statistical data collection is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling. The initial analysis of the data often follows the study protocol specified before the study is conducted. The data from a study can also be analyzed to consider secondary hypotheses inspired by the initial results or to suggest new studies. A secondary analysis of the data from a planned study uses tools from data analysis, and the process of doing this is mathematical statistics.
Data analysis is divided into:
Descriptive statistics – the part of statistics that describes data, i.e., summarizes the data and their typical properties.
Inferential statistics – the part of statistics that draws conclusions from data (using some model for the data): for example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of that model, and quantifying the involved uncertainty (e.g., using confidence intervals).
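To make the distinction concrete, here is a minimal sketch using only the standard library (the sample data are invented for illustration). The descriptive statistics summarize the values actually observed; the confidence interval is an inferential statement about the unseen population mean, here using a normal approximation with z ≈ 1.96 rather than a t-critical value:

```python
import math
import statistics

# Invented sample of 20 measurements (for illustration only)
data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9,
        5.2, 4.8, 5.0, 5.1, 4.9, 5.3, 4.6, 5.0, 5.2, 4.9]

# Descriptive statistics: summarize the data we actually observed
mean = statistics.fmean(data)
sd = statistics.stdev(data)      # sample standard deviation

# Inferential statistics: a statement about the unseen population.
# 95% confidence interval for the population mean, using the normal
# approximation (z = 1.96); a t-critical value would give a slightly
# wider interval for n = 20.
se = sd / math.sqrt(len(data))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(f"mean = {mean:.3f}, sd = {sd:.3f}")
print(f"95% CI for population mean: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note the shift in language: the mean and standard deviation describe this sample, while the interval quantifies uncertainty about a quantity we never observe directly.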
While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data, such as natural experiments and observational studies, in which case the inference depends on the model chosen by the statistician and is therefore subjective.