Data Science Interview Questions and Answers
Q1. Explain the key concepts of prior probability, likelihood, and marginal likelihood in the context of the naive Bayes algorithm.
The prior probability is the proportion of each class of the dependent variable (here, a binary value) in the dataset, before any evidence is considered. We can explain it through an example. Consider an email account where the dependent variable is binary: 1 for spam and 0 for not spam. If 70% of the emails are spam and the remaining 30% are not, then, before looking at its contents, any new email has a prior probability of 70% of being labelled as spam.
The likelihood is the probability of observing the given features assuming a particular class, and it is estimated from a large amount of previously collected data.
The marginal likelihood is the probability of observing the features regardless of the class: it is obtained by marginalizing (summing) the joint probability over all possible values of the target, and it acts as the normalizing constant in Bayes' theorem.
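To make the three quantities concrete, here is a minimal sketch (with invented toy counts) that computes the prior, the likelihood, and the marginal likelihood for a single binary feature, then combines them with Bayes' theorem:

```python
# Toy spam example: prior, likelihood, and marginal likelihood (hypothetical counts).
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]    # 1 = spam, 0 = not spam
has_link = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]  # feature: email contains a link

# Prior probability: proportion of each class before seeing any features.
prior_spam = sum(labels) / len(labels)     # P(spam) = 0.7

# Likelihood: P(feature | class), estimated from the collected data.
spam_rows = [f for f, y in zip(has_link, labels) if y == 1]
lik_link_given_spam = sum(spam_rows) / len(spam_rows)   # P(link | spam)
ham_rows = [f for f, y in zip(has_link, labels) if y == 0]
lik_link_given_ham = sum(ham_rows) / len(ham_rows)      # P(link | not spam)

# Marginal likelihood: P(feature), marginalized over both classes.
marginal_link = (lik_link_given_spam * prior_spam
                 + lik_link_given_ham * (1 - prior_spam))

# Bayes' theorem: posterior probability of spam given that the email has a link.
posterior_spam = lik_link_given_spam * prior_spam / marginal_link
print(round(prior_spam, 2), round(posterior_spam, 2))
```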
Q2. Why do we rotate components in principal component analysis (PCA)? Why is rotation so important?
Orthogonal rotation (such as varimax) is vital because it maximizes the difference between the variances captured by the components, which makes the components easier to interpret. It should also be noted that the motive of principal component analysis is to select fewer components than there are features while capturing the maximum variance in the dataset. Rotation does not change the relative locations of the components; it only changes the actual coordinates of the points.
If we do not rotate the components, the effect of PCA diminishes considerably, and we would have to select a larger number of components to explain the same amount of variance in the dataset.
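As a sketch of the variance concentration that makes component selection possible, the following computes unrotated principal components and their explained variance from a covariance matrix (synthetic data; varimax rotation itself is not shown here):

```python
import numpy as np

# Minimal PCA sketch with NumPy on synthetic, strongly correlated data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of variance explained by each principal component.
explained = eigvals / eigvals.sum()
print(explained)    # the first component captures most of the variance
```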
Q3. How do we pick the really important variables while working with a dataset?
Some notable methods traditionally used to select important variables when sorting through large datasets:
- Remove the correlated variables beforehand, then choose from the remaining variables.
- Fit a linear regression and choose variables on the basis of their p-values.
- Use selection methods such as forward selection, backward elimination, and stepwise selection.
- Use feature importance from Random Forest or XGBoost, or chart-based variable analysis.
- Use Lasso regression, a proven technique that shrinks the coefficients of unimportant variables to zero.
- Measure the information gain of the available features and pick the top n features.
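The first bullet, removing correlated variables beforehand, can be sketched as follows (synthetic data; the 0.9 threshold is an illustrative choice, not a rule):

```python
import numpy as np

# Sketch: drop one of each pair of highly correlated features before selection.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = a + rng.normal(scale=0.05, size=100)   # nearly a duplicate of a
c = rng.normal(size=100)                   # an independent feature
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)
to_drop = set()
n_features = X.shape[1]
for i in range(n_features):
    for j in range(i + 1, n_features):
        if j not in to_drop and abs(corr[i, j]) > 0.9:
            to_drop.add(j)                 # keep the first of each correlated pair

kept = [k for k in range(n_features) if k not in to_drop]
print(kept)                                # feature b (index 1) is dropped
```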
Q4. What are the differences between Gradient Boosting Machines (GBM) and Random Forest?
The fundamental difference is that Random Forest uses bagging to make predictions, whereas GBM uses boosting.
In bagging, the dataset is split into n samples using randomized sampling with replacement. A model is then built on each sample using a single learning algorithm, and the resulting predictions are combined by voting or averaging. The models in bagging are built in parallel, independently of one another.
In boosting, after the initial round of predictions, the algorithm weighs misclassified observations more heavily, so that they can be corrected in the following round. This sequential process of up-weighting misclassified predictions continues until a stopping criterion is reached.
Random Forest improves model accuracy mainly by reducing variance: the trees are grown to be uncorrelated, so averaging them decreases variance without increasing bias. GBM improves accuracy by reducing both bias and variance in the model.
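The bagging half of the comparison can be sketched as follows, using a simple straight-line fit as the base learner on synthetic regression data:

```python
import numpy as np

# Sketch of bagging: fit the same simple learner on bootstrap samples
# (sampling with replacement) and average the predictions.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = 3 * x + rng.normal(scale=0.3, size=50)

n_models = 25
preds = []
for _ in range(n_models):
    idx = rng.integers(0, len(x), size=len(x))    # bootstrap sample indices
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    preds.append(slope * x + intercept)

# Averaging the individual models reduces the variance of the ensemble.
bagged = np.mean(preds, axis=0)
print(bagged.shape)
```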
Q5. How to evaluate a logistic regression model?
By definition, logistic regression is a statistical model that relates one or more independent variables to a binary dependent variable. Because the model predicts probabilities, the AUC-ROC curve, along with the confusion matrix, is widely used to measure its performance.
The metric analogous to R² in logistic regression is the Akaike Information Criterion (AIC). AIC is a measure of fit that penalizes the model for the number of model coefficients; hence we prefer the model with the minimum AIC value.
Null deviance shows how well the response is predicted by a model with only an intercept: the lower the value, the better the model. Residual deviance shows how well the response is predicted by a model that includes the independent variables.
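A minimal sketch of both evaluation tools on a toy set of predicted probabilities (the pairwise AUC formula below ignores tied scores):

```python
import numpy as np

# Toy labels and predicted probabilities from a hypothetical logistic model.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Confusion matrix counts at a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)
tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

# AUC: probability that a random positive is ranked above a random negative.
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
auc = np.mean([p > n for p in pos for n in neg])
print(tp, tn, fp, fn, auc)
```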
Q6. Which machine learning algorithm is better to use with a given dataset?
The choice of machine learning algorithm rests largely on the type of data. If we are given a dataset that exhibits linearity, then linear regression would be the best algorithm to use. If we are working with images or audio, a neural network would help us develop a robust model.
If the data consists of non-linear interactions, then a boosting or bagging algorithm is a good option. If the business requirement is a model that can be deployed and explained, we would use regression or a decision tree model, which are easy to interpret, instead of black-box algorithms like Support Vector Machines (SVM) and Gradient Boosting Machines (GBM).
In essence, there is no single master algorithm for all these situations. We must be diligent enough to understand which algorithm to use for which dataset.
Q7. Distinguish between univariate, bivariate and multivariate analysis.
To analyze a dataset and bring forth meaningful insights from it, there are three important methods of analysis.
These are descriptive statistical techniques that differ in the number of variables involved at a given point in time. For instance, a pie chart of sales by region involves just one variable and is referred to as univariate analysis.
If the analysis tries to understand the relationship between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For instance, analyzing sales volume against expenditure is an example of bivariate analysis.
Analysis that deals with more than two variables, to understand the effect of the variables on the responses, is referred to as multivariate analysis.
In the end, what actually matters is the number of variables involved; the type of analysis is based solely on that.
Q8. How to treat the missing values during an analysis?
The extent of the missing values' impact is assessed after identifying the variables that contain them. If any patterns are identified, the analyst should focus on them, as they could lead to worthwhile and meaningful business insights. If no patterns are found, the missing values can be replaced with the mean or median value (imputation), or they can simply be dropped. Several factors must be taken into consideration when answering this question:
Understand the problem statement and the data first, then provide the answer. A fixed value anywhere between the minimum and the maximum can be assigned; getting to know the data is vital.
If it is a categorical variable, the missing value is assigned a default value.
If we know the distribution of the data, for a normal distribution we can impute the mean value.
If 80% of the values for a variable are missing, we would drop the variable instead of treating its missing values.
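A minimal sketch of mean imputation and the 80%-missing drop rule described above (toy values):

```python
import numpy as np

# Sketch: mean imputation for a numeric variable, and dropping a variable
# whose missing fraction is too high (threshold 0.8 as in the text).
col = np.array([2.0, np.nan, 4.0, 6.0, np.nan, 8.0])

missing_frac = np.isnan(col).mean()
if missing_frac < 0.8:
    # Replace each missing entry with the mean of the observed values.
    filled = np.where(np.isnan(col), np.nanmean(col), col)
else:
    filled = None          # drop the variable instead of imputing
print(filled)
```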
Q9. When do we use a Support Vector Machine rather than the Random Forest algorithm, and vice versa?
Support Vector Machines (SVM) and Random Forest are both used in classification problems. Some standard guidelines for choosing between them:
- If we are certain that the data is clean and free of outliers, we can go for SVM. The converse also holds: if the data contains outliers, Random Forest would be the better option.
- SVM usually consumes more computational power than Random Forest, so if we are constrained on memory or compute, it is better to go for Random Forest.
- Random Forest gives us a very good picture of variable importance in our data, so if we want variable importance, picking the Random Forest algorithm is advisable.
- Random Forest is preferred for multi-class problems.
- SVM is preferred in high-dimensional problem sets, such as text classification.
However, a good data scientist experiments with both of them and tests which performs better.
Q10. How can handling missing data introduce selection bias?
Missing value treatment is one of the fundamental tasks a data scientist must do before beginning data analysis. There are several methods of missing value treatment; if not done appropriately, the outcome can suffer from selection bias. Let us see a few missing value treatments and their impact on selection.
Complete case treatment: we eliminate an entire row of data even if only one value is missing. We can incur selection bias if the values are not missing at random and follow some pattern. Suppose we are conducting a survey and some people did not mention their gender: it is not wise to remove all those who did not specify their gender from the entire analysis.
Available case analysis: say we are estimating a correlation matrix, and we eliminate the missing values only from the variables required for each particular correlation coefficient. In this case, the coefficients will not be fully comparable, as each is derived from a different subset of the population.
Mean substitution: in this method, missing values are substituted with the mean of the other available values. This can bias the distribution; for instance, standard deviation, correlation, and regression estimates all rely on the spread of the variables around the mean, which mean substitution artificially shrinks.
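The shrinking effect of mean substitution can be demonstrated on synthetic data:

```python
import numpy as np

# Sketch: mean substitution understates the spread of the data.
rng = np.random.default_rng(3)
complete = rng.normal(loc=50, scale=10, size=1000)

values = complete.copy()
values[:300] = np.nan                       # 30% of the values go missing
observed = values[~np.isnan(values)]
imputed = np.where(np.isnan(values), observed.mean(), values)

# The imputed column has a smaller standard deviation than the observed data,
# which biases downstream correlation and regression estimates.
print(observed.std(), imputed.std())
```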
Hence, missing-data procedures can introduce selection bias into our data if not chosen carefully.
Q11. Why is data cleaning important in the data analysis process?
Cleaning data from several sources, to transform it into a format that data analysts or data scientists can work with, is a tedious process: as the number of data sources grows, the time it takes to clean the data grows with the number of sources and the volume of data they produce. Data cleaning alone can take up to 80% of the time, making it a crucial part of the analysis work.
Q12. What are the major differences between Cluster and Systematic Sampling?
Cluster sampling is a technique used when it is hard to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame. In systematic sampling, the list is traversed in a circular manner: once we reach the end of the list, we continue from the top again. A classic example of systematic sampling is the equal-probability method, in which every kth element of the frame is selected.
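A minimal sketch of circular systematic sampling (the function name and step rule are illustrative, not a standard API):

```python
# Sketch of circular systematic sampling: pick every k-th element from an
# ordered frame, wrapping around when the end of the list is reached.
def systematic_sample(frame, n, start=0):
    k = len(frame) // n                  # sampling interval
    return [frame[(start + i * k) % len(frame)] for i in range(n)]

population = list(range(1, 21))          # an ordered frame of 20 units
print(systematic_sample(population, n=5, start=2))   # → [3, 7, 11, 15, 19]
```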
Q13. What do an eigenvalue and an eigenvector represent?
Eigenvectors are mainly used for understanding linear transformations. In data analysis, we generally compute the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching.
An eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector, i.e. the factor by which that direction is stretched or compressed.
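A small NumPy check of the defining relation, A v = λ v, on an illustrative 2×2 matrix:

```python
import numpy as np

# Eigenvectors are directions preserved by the transformation;
# eigenvalues are the stretch factors along those directions.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)

v = eigvecs[:, 0]
# A maps the eigenvector onto a scaled copy of itself: A v = lambda v.
print(np.allclose(A @ v, eigvals[0] * v))   # True
```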
Q14. What are important steps in an analytics project?
- First, understand the business issues.
- Explore the data and become accustomed to it.
- Get the data ready for modeling by detecting outliers, treating missing values, and transforming variables.
- After the data preparation, start running the model, analyze the results, and tweak the approach. This step is iterated until the best possible outcome is attained.
- Validate the model using a brand-new dataset.
- Implement the model and track the results to analyze the performance of the model over time.
Q15. When do we resample the data?
- To estimate the accuracy of sample statistics by using subsets of the available data, or by drawing randomly with replacement from a set of data points.
- To exchange the labels on data points when performing significance (permutation) tests.
- To validate models by using random subsets (bootstrapping, cross-validation).
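The first use case, drawing randomly with replacement to gauge the accuracy of a sample statistic, can be sketched with a bootstrap of the mean (synthetic data):

```python
import numpy as np

# Sketch: bootstrap estimate of the sampling variability of the mean.
rng = np.random.default_rng(4)
data = rng.normal(loc=10, scale=2, size=100)

# Resample the data with replacement many times and record the mean each time.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# Percentile confidence interval for the mean.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(round(low, 2), round(high, 2))
```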
Q16. What is cross-validation?
Cross-validation is a model validation technique for estimating how the results of a statistical analysis will generalize to an independent dataset. It is chiefly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice. The aim of cross-validation is to set aside a portion of the data to test the model during the training phase (the validation set), in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent dataset.
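A minimal sketch of how k-fold cross-validation partitions the indices (a hypothetical helper, not a library API):

```python
# Sketch of k-fold cross-validation splitting: each observation is used
# exactly once for validation and k-1 times for training.
def k_fold_indices(n, k):
    folds = []
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, valid))
        start += size
    return folds

splits = k_fold_indices(n=10, k=5)
print([valid for _, valid in splits])   # → [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```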
Q17. What is Star Schema?
Star schema is a traditional, commonly used database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table via ID fields; these tables are known as lookup tables and are mainly useful in real-time applications, as they save a lot of memory. Star schemas sometimes involve several layers of summarization to retrieve information faster.
Q18. How do we assess a good logistic regression model?
There are several ways to assess the results of a logistic regression analysis:
- Using a classification matrix to look at the true negatives and false positives.
- Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening.
- Lift, which helps assess the logistic model by comparing it with random selection.
Q19. What are the differences between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
In a Bayesian estimate, we have some prior knowledge about the data/problem. There may be several values of the parameters that explain the data, so we can look for multiple parameter settings (say, five gammas and five lambdas) that do this. As an outcome of Bayesian estimation, we get multiple models for making multiple predictions, one for each pair of parameters but with the same prior. So, when a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.
Maximum likelihood does not take the prior into account (it ignores the prior), so it is like being a Bayesian who uses some sort of flat prior.
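The contrast can be sketched for a coin-flip (Bernoulli) parameter; the Beta(2, 2) prior is an illustrative assumption encoding a mild belief that the coin is roughly fair:

```python
# Sketch: MLE vs. Bayesian estimation of a coin's heads probability.
heads, flips = 7, 10

# Maximum likelihood ignores any prior: just the observed frequency.
mle = heads / flips                                   # 0.7

# Bayesian posterior mean with a Beta(a, b) prior (Beta-Bernoulli conjugacy).
a, b = 2, 2
posterior_mean = (heads + a) / (flips + a + b)        # pulled toward the prior mean 0.5
print(mle, round(posterior_mean, 3))                  # → 0.7 0.643
```

With more data, the influence of the prior fades and the Bayesian estimate converges toward the MLE.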