In an interview you can be asked any question related to machine learning, deep learning, natural language processing, or sometimes even computer vision concepts. So you should have complete knowledge of the core concepts.
We will go deep into a few concepts and dig into all the questions related to those data science concepts. That will also make you confident.
Tough questions are often built on basic concepts like features, accuracy, and train-test splitting, and repetitive questions are asked from the tougher topics and concepts.
This is a detailed guide on top data science interview questions, covering data science topics from basic to advanced:
- interquartile range
- univariate analysis and bivariate analysis
- central limit theorem
- feature scaling
We will start from the very basics and discuss every doubt related to these concepts.
Concept #1 : Interquartile range (IQR)
How do you find the interquartile range?
To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
What is interquartile range with example?
For example, suppose Q1 = 64 and Q3 = 77. The interquartile range is 77 – 64 = 13; the interquartile range is the range of the middle 50% of the data. With an odd sample size, the median and quartiles are determined in the same way. For instance, if the lowest value (62) were excluded from a sample of 10, the sample size would be n = 9.
What is Q1 and Q3?
The lower quartile, or first quartile (Q1), is the value under which 25% of data points are found when they are arranged in increasing order. The upper quartile, or third quartile (Q3), is the value under which 75% of data points are found when arranged in increasing order.
How do I find the first quartile?
How to Calculate Quartiles
- Order your data set from lowest to highest values.
- Find the median. This is the second quartile Q2.
- At Q2 split the ordered data set into two halves.
- The lower quartile Q1 is the median of the lower half of the data.
- The upper quartile Q3 is the median of the upper half of the data.
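The steps above can be sketched with NumPy. The data set below is hypothetical, and note that NumPy's default linear interpolation can give slightly different quartiles than the median-of-halves method described above for some sample sizes:

```python
import numpy as np

# Hypothetical ordered data set (n = 11), for illustration only
data = np.array([62, 63, 64, 64, 70, 72, 76, 77, 81, 81, 81])

q1 = np.percentile(data, 25)   # lower quartile
q2 = np.percentile(data, 50)   # median (second quartile)
q3 = np.percentile(data, 75)   # upper quartile
iqr = q3 - q1                  # interquartile range

print(q1, q2, q3, iqr)         # 64.0 72.0 79.0 15.0
```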
Is Q2 the median or mean?
The median is considered the second quartile (Q2). The interquartile range is the difference between upper and lower quartiles. The semi-interquartile range is half the interquartile range.
What is the relationship between quartiles and percentiles?
Percentiles and quartiles share the following relationship:
- 0th percentile = 0th quartile (also called the minimum)
- 25th percentile = 1st quartile
- 50th percentile = 2nd quartile (also called the median)
- 75th percentile = 3rd quartile
- 100th percentile = 4th quartile (also called the maximum)
What does NP percentile do?
The numpy.percentile() function is used to compute the nth percentile of the given data (array elements) along the specified axis.
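A minimal sketch of numpy.percentile(), including the axis argument:

```python
import numpy as np

arr = np.array([[10, 20, 30],
                [40, 50, 60]])

# 50th percentile (median) of the flattened array
print(np.percentile(arr, 50))           # 35.0

# 50th percentile along each column and each row
print(np.percentile(arr, 50, axis=0))   # [25. 35. 45.]
print(np.percentile(arr, 50, axis=1))   # [20. 50.]
```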
How does Python calculate quantile?
Python calculates quantiles with functions such as numpy.quantile() and pandas' .quantile(), which take the desired quantile as a fraction between 0 and 1. From the quartiles, the IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of a set of data. Specifically, IQR = Q3 – Q1.
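A short example, assuming pandas is available; Series.quantile() takes the quantile as a fraction between 0 and 1:

```python
import pandas as pd

# Hypothetical sample, for illustration only
s = pd.Series([3, 7, 8, 5, 12, 14, 21, 13, 18])

q1 = s.quantile(0.25)   # first quartile
q3 = s.quantile(0.75)   # third quartile
iqr = q3 - q1           # interquartile range

print(q1, q3, iqr)      # 7.0 14.0 7.0
```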
If you haven’t started yet and are looking for the best free data science course, check this blog.
Concept #2 : Central limit theorem and the normal distribution
The central limit theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size increases, regardless of the shape of the original distribution.
The law of large numbers states that as the number of trials increases, the average of the results will become closer to the expected value. For example, the proportion of heads in 1,000 fair coin tosses is closer to 0.5 than it is in 10 tosses.
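The coin-toss example above can be sketched with NumPy (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# 1 = heads, 0 = tails for a fair coin
tosses_10 = rng.integers(0, 2, size=10)
tosses_1000 = rng.integers(0, 2, size=1_000)

# The proportion of heads in 1,000 tosses is typically much closer to 0.5
print(abs(tosses_10.mean() - 0.5))
print(abs(tosses_1000.mean() - 0.5))
```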
What is central limit theorem formula?
The central limit theorem is applicable for sufficiently large sample sizes (n ≥ 30). The formula for the central limit theorem can be stated as follows: μx̄ = μ and σx̄ = σ/√n, where μ and σ are the population mean and standard deviation and n is the sample size.
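A quick simulation can illustrate this: even when the population is heavily skewed, the sample means cluster around μ with spread close to σ/√n. The population and seed below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential with mean 1 and standard deviation 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw 5,000 samples of size n = 30 and record each sample mean
n = 30
sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)

# CLT: mean of the sample means is close to mu = 1,
# and their spread is close to sigma / sqrt(n)
print(sample_means.mean())
print(sample_means.std(), population.std() / np.sqrt(n))
```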
Check out this blog for more details.
What are the three rules of the central limit theorem?
- The data must be sampled randomly.
- Samples should be independent of each other; one sample should not influence the other samples.
- The sample size should be no more than 10% of the population when sampling is done without replacement.
Where do we use central limit theorem in machine learning?
The central limit theorem has important implications in applied machine learning. The theorem does inform the solution to linear algorithms such as linear regression, but not exotic methods like artificial neural networks that are solved using numerical optimization methods.
What is the normal distribution rule?
Key Takeaways. The Empirical Rule states that 99.7% of data observed following a normal distribution lies within 3 standard deviations of the mean. Under this rule, 68% of the data falls within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations of the mean.
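The empirical rule can be checked numerically by sampling from a standard normal distribution (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Percentage of samples within k standard deviations of the mean
within = {k: np.mean(np.abs(x) < k) * 100 for k in (1, 2, 3)}
for k, pct in within.items():
    print(f"within {k} sd: {pct:.1f}%")   # roughly 68%, 95%, 99.7%
```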
How do you determine normal distribution?
In order to be considered a normal distribution, a data set (when graphed) must follow a bell-shaped symmetrical curve centered around the mean. It must also adhere to the empirical rule that indicates the percentage of the data set that falls within (plus or minus) 1, 2 and 3 standard deviations of the mean.
What are the 5 properties of normal distribution?
The shape of the distribution changes as the parameter values change.
- Mean. The mean is used by researchers as a measure of central tendency.
- Standard deviation.
- It is symmetric.
- The mean, median, and mode are equal.
- Empirical rule.
- Skewness and kurtosis.
What does a z-score mean?
If a z-score is equal to 0, it is on the mean. A positive z-score indicates the raw score is higher than the mean average. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean. A negative z-score reveals the raw score is below the mean average.
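A minimal sketch of the z-score formula z = (x − mean) / sd, using a hypothetical scale with mean 100 and standard deviation 15:

```python
# z-score: how many standard deviations a raw score lies from the mean
mean, sd = 100, 15   # hypothetical scale, for illustration only

for x in (100, 115, 85):
    z = (x - mean) / sd
    print(x, z)      # 100 -> 0.0, 115 -> 1.0, 85 -> -1.0
```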
What is the importance of normal distribution in statistics?
The normal distribution is the most important probability distribution in statistics because many continuous variables in nature and psychology display this bell-shaped curve when compiled and graphed.
What is the difference between z-score and standard deviation?
A z-score indicates how much a given value differs from the mean, in units of the standard deviation. The z-score, or standard score, is the number of standard deviations a given data point lies above or below the mean. Standard deviation is essentially a reflection of the amount of variability within a given data set.
What is the z-score for 90%, 95% and 99 %?
The z-value at the 99% confidence interval is 2.58; for 95% it is 1.96 and for 90% it is 1.645.
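These critical values can be reproduced with Python's standard library: for a two-sided interval at confidence level c, the critical value is the inverse normal CDF evaluated at (1 + c) / 2:

```python
from statistics import NormalDist

# Two-sided critical z-value: z* = inverse CDF of (1 + c) / 2
critical = {c: NormalDist().inv_cdf((1 + c) / 2) for c in (0.90, 0.95, 0.99)}

for c, z in critical.items():
    print(f"{c:.0%}: z = {z:.3f}")   # 90%: 1.645, 95%: 1.960, 99%: 2.576
```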
You should also have a clear understanding of how pandas, Python, and NumPy are used in data science, along with machine learning.
Concept #3 : Feature Scaling
What is scaling of data?
Feature scaling transforms numeric features so that they lie on a comparable scale. Feature scaling is not necessary when using tree-based algorithms, e.g. decision trees, random forests, and gradient boosting, as they are invariant to the scale of the features. Decision trees split each node using one feature at a time and, as a result, are unaffected by the scales of the other features in the dataset.
For gradient descent-based and distance-based algorithms, however, feature scaling can result in an improvement in model performance.
What are the types of feature scaling?
Types of Feature Scaling
- Standardization. Standardization is a scaling technique in which the mean becomes equal to zero and the standard deviation equal to one.
- Min-Max Scaling. Min-max scaling (normalization) rescales each feature to a fixed range, typically 0 to 1.
- Robust Scaler. Robust scaling is one of the best scaling techniques when we have outliers present in our dataset.
- Gaussian Transformation.
Why feature scaling is important?
Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. Standardization involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
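A minimal sketch of standardization with NumPy; after the transformation the mean is (numerically) zero and the standard deviation is one:

```python
import numpy as np

# Hypothetical feature values, for illustration only
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Z-score standardization: subtract the mean, divide by the std
z = (x - x.mean()) / x.std()

print(z.mean())   # ~0 (up to floating-point error)
print(z.std())    # 1.0
```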
Does principal component analysis require standardization?
Yes, it is necessary to standardize data before performing PCA. PCA calculates a new projection of your data set, and the new axes are based on the standard deviation of your variables.
Why do we use MinMaxScaler?
MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data. Note that MinMaxScaler doesn’t reduce the importance of outliers. The default range for the features returned by MinMaxScaler is 0 to 1.
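A sketch of min-max scaling done by hand (this mirrors what MinMaxScaler computes with its default range); note how the outlier still dominates the scaled values, i.e. the shape of the distribution is preserved:

```python
import numpy as np

# Hypothetical feature with one outlier (100)
x = np.array([1.0, 5.0, 10.0, 100.0])

# Min-max scaling to [0, 1]: (x - min) / (max - min)
scaled = (x - x.min()) / (x.max() - x.min())

print(scaled)   # outlier maps to 1.0; other values are squeezed near 0
```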
Which is better MinMaxScaler or StandardScaler?
StandardScaler is useful for features that follow a normal distribution. MinMaxScaler may be used when the upper and lower boundaries are well known from domain knowledge (e.g. pixel intensities that go from 0 to 255 in the RGB color range).
What is standard scaling?
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Concept #4 : Univariate analysis and bivariate analysis
Univariate analysis is the analysis of one (“uni”) variable. Bivariate analysis is the analysis of exactly two variables. Multivariate analysis is the analysis of more than two variables.
Examples of univariate and bivariate analysis
When you conduct a study that looks at a single variable, that study involves univariate data. For example, you might study a group of college students to find out their average SAT scores, or you might study a group of diabetic patients to find their weights. Bivariate data is when you are studying two variables together.
How do you do a univariate analysis?
Example of Univariate Analysis (using SPSS)
- Prepare your data set.
- Choose Analyze > Descriptive Statistics > Frequencies.
- Click Statistics, choose what you want to analyze, and click Continue.
- Click Charts.
- Choose the chart that you want to show, and click Continue.
- Click OK to finish your analysis.
- See and interpret your output.
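The menu steps above come from a GUI tool; in Python, a comparable univariate summary can be produced with pandas (the scores below are hypothetical):

```python
import pandas as pd

# Hypothetical single-variable sample (e.g. SAT scores)
scores = pd.Series([1200, 1350, 1100, 1480, 1350, 1250])

print(scores.describe())       # count, mean, std, min, quartiles, max
print(scores.value_counts())   # frequency table for the single variable
```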
What is a bivariate table?
Bivariate table: a table that illustrates the relationship between two variables by displaying the distribution of one variable across the categories of a second variable. Cross-tabulation: a technique used to explore the relationship between two variables that have been organized in a table.
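A bivariate (cross-tabulation) table can be built with pandas.crosstab; the data below is hypothetical:

```python
import pandas as pd

# Hypothetical categorical variables
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "passed": ["yes", "yes", "no", "no", "yes"],
})

# Distribution of "passed" across the categories of "gender"
table = pd.crosstab(df["gender"], df["passed"])
print(table)
```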
What are the types of bivariate analysis?
- Numerical and numerical: both variables of the bivariate data, independent and dependent, have numerical values.
- Categorical and categorical: both variables are categorical.
- Numerical and categorical: one variable is numerical and one is categorical.
Which plot can be used for both univariate and bivariate analysis?
It is called a scatter plot. If you would like to explore bivariate data sets more, you can use the Regression Activity to observe the correlation.
Which chart can be used for bivariate analysis?
Graphical methods: graphs that are appropriate for bivariate analysis depend on the type of variable. For two continuous variables, a scatterplot is a common graph. When one variable is categorical and the other continuous, a box plot is common, and when both are categorical, a mosaic plot is common.
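For two continuous variables, the relationship shown in a scatterplot can also be quantified with a correlation coefficient; the data below is hypothetical:

```python
import numpy as np

# Two continuous variables (hypothetical heights in cm and weights in kg)
height = np.array([150, 160, 165, 170, 180])
weight = np.array([50, 58, 63, 66, 75])

# Pearson correlation quantifies the linear relationship
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 3))   # close to 1: strong positive relationship
```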
These exact questions may not be the ones that hit you. But as you read this complete detailed guide, you will subconsciously develop a sense of how such questions are answered to the point. In it, we have tried to put together the top data science questions often asked in interviews.