Introducing the basic concepts of statistical analysis, a cornerstone of data science. Available both in PDF (more graphs and tables) and web form.
Statistical analysis gives a better understanding of the basic structure of the data and provides a number of procedures to identify possible correlations between variables, recognize trends and, as an integral part of a data mining tool, enable predictions. Below we present two main features of a statistical analysis:
• Crosstabulation.
• Descriptive Statistics.
1) Types of Variables
The variables typically used in statistical analysis fall into one of the following three basic categories.
1.1 Arithmetic (or Quantitative or Scale) Variables
These variables take values in an interval of the real line and include the height, weight or income of an individual, the distance traveled by some automobile, the life-span of a machine, etc.
1.2 Categorical (or Qualitative or Nominal) Variables
These variables record qualitative attributes of the objects under consideration. The possible categories are usually called the levels of the nominal variable. Examples of categorical variables include the political preference (Right, Center, Left), the preferred kind of music (Rock, Jazz, Classical, Country, Folk, etc.), etc.
1.3 Ordinal Variables
These are nominal variables whose levels can be ordered in some logical sense; however, the distances between the various levels are not exactly known. Examples of ordinal variables include the age group of an individual (teenager, middle-aged, old), the opinion on some matter (absolutely disagree, disagree, rather disagree, rather agree, agree, absolutely agree), the grade of a student in some exam (A, B, C, D, F), etc.
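As a minimal sketch of what "ordered levels with unknown distances" means in practice, the snippet below ranks exam grades by their position in an ordered list of levels. The names `GRADE_LEVELS` and `grade_rank` are illustrative assumptions, not part of any particular tool. Comparisons and sorting are meaningful for an ordinal variable, while arithmetic on the ranks (e.g. averaging them) is not justified, because the gaps between levels are unknown.

```python
# Ordered levels of an ordinal variable, from lowest to highest (illustrative).
GRADE_LEVELS = ["F", "D", "C", "B", "A"]


def grade_rank(grade):
    """Return the position of a grade in the ordering (0 = lowest)."""
    return GRADE_LEVELS.index(grade)


grades = ["B", "A", "C", "F", "B"]

# Ordering-based operations are valid for ordinal data.
best = max(grades, key=grade_rank)
ordered = sorted(grades, key=grade_rank)
median_grade = ordered[len(ordered) // 2]  # middle case of an odd-sized sample

print(best, ordered, median_grade)
```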
2) Frequency Tables
A frequency table lists items and uses tally marks to record and show the number of times they occur. Each entry of such a table contains the frequency or count of the occurrences of values within a particular group or interval, and in this way the table summarizes the distribution of values in the sample. Frequency tables can be used for any dependent (e.g. answer to a poll question) or independent (e.g. age, gender, etc.) variable. It is often useful to include percentages, taking the missing values into account.
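A frequency table of this kind can be sketched with the Python standard library. The function name `frequency_table` and the convention that `None` marks a missing answer are assumptions made for this illustration. Each row reports the count, the percentage of all cases, and the percentage of valid (non-missing) cases.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, count, percent of all cases, percent of valid cases)."""
    counts = Counter(values)
    total = len(values)
    missing = counts.pop(None, 0)  # assumption: None marks a missing value
    valid = total - missing
    rows = []
    for value, count in sorted(counts.items()):
        rows.append((value, count, 100 * count / total, 100 * count / valid))
    if missing:
        # Missing values have no "valid percent" by definition.
        rows.append(("(missing)", missing, 100 * missing / total, None))
    return rows


answers = ["yes", "no", "yes", None, "yes", "no", "undecided", None]
for row in frequency_table(answers):
    print(row)
```

Reporting both percentages makes it visible how much of the sample is missing, rather than silently dropping those cases.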
3) Crosstabs or Contingency Tables
Cross-tabulation is the process of creating a table from the joint frequency distribution of two statistical variables, tabulating the results of one variable against the other. Such tables are called contingency tables and give a basic picture of the interrelation of the two variables. In contingency tables, independent variables (e.g. the gender) are usually displayed as rows and dependent variables (e.g. an answer to a poll question) as columns. It is often useful to include percentages by row, by column, or of the total. Contingency tables can analogously be defined for three or more variables; however, for more than three variables they are hard to use and are usually avoided.
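The following is a minimal sketch of a two-variable contingency table with row percentages, using only the standard library. The function names `crosstab` and `row_percentages` and the sample data are made up for illustration. The independent variable (gender) is placed in the rows and the dependent variable (answer) in the columns, as described above.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count joint occurrences of paired observations (row value, column value)."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    return [[100 * n / sum(row) for n in row] for row in table]


# Illustrative data: one (gender, answer) pair per respondent.
gender = ["F", "M", "F", "M", "F", "M", "F", "M"]
answer = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]

row_levels, col_levels, table = crosstab(gender, answer)
percentages = row_percentages(table)

print("columns:", col_levels)
for level, counts_row, pct_row in zip(row_levels, table, percentages):
    print(level, counts_row, pct_row)
```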
4) Pearson chi-square (χ²) test
This test is used for contingency tables and tests whether two categorical variables are dependent or not. For example, one can check whether the answer to the question "Do you agree with Newsweek's cover suggesting that Obama must go?" depends on the gender or age of the person responding. This test makes use of a test statistic proposed by Karl Pearson in 1900, which is a function of the squares of the deviations of the observed counts from their expected values, weighted by the reciprocals of their expected values.
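The test statistic described above can be sketched as follows. The expected count of a cell under independence is taken as (row total × column total) / grand total, which is the standard choice. The function name `chi_square_statistic` and the example counts are assumptions for illustration, and the sketch assumes every row and column total is positive.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            # Squared deviation weighted by the reciprocal of the expected count.
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Illustrative 2 x 2 table: rows = gender, columns = agree / disagree.
observed = [[30, 20],
            [20, 30]]
chi2, dof = chi_square_statistic(observed)
print(chi2, dof)
```

The statistic is then compared with a chi-square distribution with `dof` degrees of freedom. With one degree of freedom, for instance, the critical value at the 5% level is about 3.84, so larger values suggest that the two variables are dependent.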
5) Descriptives
Descriptive measures make sense for statistical analysis of quantitative data and include the following:
5.1 The Central Tendency
Measures of central tendency are statistics which describe the location of the distribution of a quantitative variable. They include the mean, the median, and the sum.
• The Mean is the arithmetic average, i.e. the sum of the values of the quantitative variable divided by the number of cases.
• The Median is the value above and below which half of the cases fall. It is also called the 50th percentile. With the cases sorted in ascending or descending order, the median equals the average of the two middle cases if the number of cases is even, and the middle case if it is odd. The median is a measure of central tendency not sensitive to outlying values (unlike the mean, which can be affected by a few extremely high or low values).
• The Sum equals the sum of the values across all cases with non-missing values.
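A small sketch with Python's `statistics` module shows how a single outlying value moves the mean but barely affects the median. The data are invented for illustration, and `None` is assumed to mark a missing value that is excluded before computing.

```python
import statistics

# Illustrative incomes (in thousands); None = missing value, 250 = outlier.
incomes = [21, 23, 25, 27, 29, None, 250]

# All three measures use only the cases with non-missing values.
valid = [x for x in incomes if x is not None]

total = sum(valid)
mean = statistics.mean(valid)
median = statistics.median(valid)  # even number of cases: average of the two middle ones

# The same measures without the outlier, for comparison.
without_outlier = [x for x in valid if x != 250]
mean_clean = statistics.mean(without_outlier)
median_clean = statistics.median(without_outlier)

print(total, mean, median)       # 375 62.5 26.0
print(mean_clean, median_clean)  # 25 25
```

Removing the outlier changes the mean from 62.5 to 25, while the median only moves from 26 to 25.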
5.2 The Dispersion
Measures of dispersion are statistics which measure the amount of variation or spread in the data. They include the standard deviation, the variance, the range, the minimum, the maximum, and the standard error of the mean.
- The Standard deviation is a measure of dispersion around the mean. In any distribution, at least 93.75% of the cases fall within four standard deviations of the mean (Chebyshev's inequality). In a normal distribution, about 68% of cases fall within one standard deviation of the mean and about 95% of cases fall within two standard deviations. For example, if the mean age is 45 years with a standard deviation of 10 years, then in a normal distribution about 95% of the cases lie between 25 and 65 years of age.
- The Variance measures the amount of variation around the mean and is equal to the sum of squared deviations from the mean divided by one less than the number of cases. The variance is measured in units that are the square of those of the variable itself.
- The Range equals the difference between the largest and smallest values of a quantitative variable.
- The Minimum is the smallest value of a quantitative variable.
- The Maximum is the largest value of a quantitative variable.
- The Standard Error of the Mean measures how much the value of the mean may vary from sample to sample taken from the same distribution and can be used to roughly compare the observed mean to a hypothesized value.
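The dispersion measures listed above can be sketched with the standard library. The sample data and the helper names `chebyshev_lower_bound` and `two_sd_interval` are assumptions for illustration. The standard error of the mean is computed with the usual formula, the standard deviation divided by the square root of the number of cases.

```python
import math
import statistics

ages = [25, 35, 45, 55, 65]  # illustrative sample
n = len(ages)

variance = statistics.variance(ages)  # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(ages)      # square root of the variance
minimum, maximum = min(ages), max(ages)
value_range = maximum - minimum
std_error = std_dev / math.sqrt(n)    # standard error of the mean


def chebyshev_lower_bound(k):
    """Minimum fraction of cases within k standard deviations (any distribution)."""
    return 1 - 1 / k**2


def two_sd_interval(mean, sd):
    """Interval holding about 95% of cases if the distribution is normal."""
    return mean - 2 * sd, mean + 2 * sd


print(variance, value_range)       # 250 40
print(std_dev, std_error)
print(chebyshev_lower_bound(4))    # 0.9375
print(two_sd_interval(45, 10))     # (25, 65)
```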
5.3 The Distribution
Measures of distribution include the skewness and the kurtosis, which are statistics describing the shape and symmetry of the distribution. These statistics are displayed with their standard errors.
- The Skewness is a measure of the asymmetry of a distribution. The normal distribution is symmetric and has zero skewness. Distributions with a significant positive skewness have a long right tail, while distributions with a significant negative skewness have a long left tail. As a guideline, a skewness value more than twice its standard error is taken to indicate departure from symmetry.
- The Kurtosis is a measure of the extent to which observations cluster around a central point. Positive kurtosis indicates that, relative to a normal distribution of the same mean and variance, the observations are more clustered about the center of the distribution and have thinner tails up to the extreme values of the distribution; at these extreme values, the tails of such leptokurtic distributions are thicker than those of a normal distribution. Negative kurtosis indicates that, relative to a normal distribution of the same mean and variance, the observations cluster less and have thicker tails up to the extreme values of the distribution; at these extreme values, the tails of such platykurtic distributions are thinner than those of a normal distribution.
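As a rough sketch, skewness and kurtosis can be computed from the central moments of the sample. The formulas below are the simple moment-based versions, with kurtosis reported as excess over the normal distribution's value of 3, so that a normal distribution scores zero on both. This choice is an assumption of the sketch: statistical packages commonly report bias-adjusted variants that differ somewhat for small samples. The function names are illustrative.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all cases."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]  # one large value creates a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric sample has zero skewness and negative excess kurtosis (flat, evenly spread values), while the second sample has positive skewness because of its long right tail.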
5.4 Graphs
Graphs enable us to understand the relationship between variables, and interpret the behavior of objects under consideration in a simple, pictorial way, easily understood by almost everyone.
In the following table we present a list of graphs suitable for the study of variables or combinations of variables of specific type.

