What are the best tools used for data analysis?
- Google Fusion Tables
What is the purpose of the KNN imputation method?
In KNN imputation, missing attribute values are filled in using the values from the records that are most similar to the record with the missing values. The similarity of two records is determined using a distance function.
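As a rough illustration, KNN imputation can be sketched in plain Python. The function name `knn_impute`, the use of Euclidean distance, and the choice to average the k nearest complete rows are assumptions made for this example; it also assumes at least one row has no missing values.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest complete rows."""
    # Only rows without missing values are used as neighbour candidates (an assumption).
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def distance(other):
            # Euclidean distance over the columns that are present in `row`.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled = [
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed

data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 8.0, 7.0],
    [1.0, 2.0, None],  # the third value is missing
]
result = knn_impute(data, k=2)
```

Here the last row is closest to the first two rows, so its missing value is filled with the average of their third column.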
When we undertake an analytics project, what steps should be taken?
An analytics project typically proceeds through the following steps.
- Define the problem
- Explore the data
- Prepare the data
- Build the model
- Validate the data
- Implement and track
What are the best statistical methods for data analysis?
The following statistical methods are widely used by data analysts and data scientists.
- Bayesian method
- Imputation techniques
- Simplex algorithm
- Outlier detection
- Rank statistics and percentile
- Markov process
- Mathematical optimization
- Spatial and cluster processes
Explain KPI, design of experiments, and the 80/20 rule.
A Key Performance Indicator (KPI) is a business metric. It is typically tracked through some combination of spreadsheets, reports, and charts covering the business process. The 80/20 rule states that roughly 80 percent of outcomes come from 20 percent of causes; for example, about 80 percent of work output may come from just 20 percent of analysts. It is widely used to identify where most of the productivity comes from. Design of experiments is the initial process used to split the data, sample it, and set it up for statistical analysis.
What is a hash table collision, and how can it be handled?
A hash table collision happens when two different keys hash to the same value. Two items cannot be stored in the same slot, so there are two major techniques for handling a collision.
Open Addressing: Searches for other slots using a second function and stores the item in the first empty slot that is found.
Separate Chaining: Uses a data structure to store multiple items that hash to the same slot.
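The two techniques can be sketched as follows. The class names are hypothetical, the tables have a fixed size with no resizing or deletion, and linear probing is used as one simple choice of probe function for open addressing.

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list of (key, value) pairs."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # colliding keys share the bucket

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)


class LinearProbingHashTable:
    """Open addressing: on a collision, probe the following slots for the first empty one."""

    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                break  # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


chained = ChainedHashTable()
probing = LinearProbingHashTable()
for table in (chained, probing):
    table.put(1, "a")
    table.put(9, "b")  # 1 and 9 both map to slot 1 when the size is 8
```

With chaining, both items end up in the list at slot 1. With linear probing, the second item moves to the next free slot.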
What is clustering? Explain its properties.
Clustering is a method of grouping data without predefined labels. A clustering algorithm divides a dataset into natural groups, or clusters.
Properties of a clustering algorithm are:
- Hierarchical or flat
- Hard or soft
What is correlogram analysis?
Correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients, each calculated for a different spatial relationship. When the raw data is expressed as distances rather than values at individual points, it can be used to build a correlogram for distance-based data.
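As a simplified sketch, the series of autocorrelation coefficients can be computed for evenly spaced samples along a single line, so that the lag stands in for distance. This one-dimensional setup, the function names, and the sample values are assumptions for illustration; real spatial correlograms usually work with distance classes in two dimensions.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return num / denom

def correlogram(values, max_lag):
    """Autocorrelation coefficients for lags 0..max_lag."""
    return [autocorrelation(values, lag) for lag in range(max_lag + 1)]

# Hypothetical measurements taken at equal intervals along a transect.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
coefficients = correlogram(samples, max_lag=3)
```

The coefficient at lag 0 is always 1, and plotting the remaining coefficients against lag gives the correlogram.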
Explain the K-means algorithm.
K-means is a widely used partitioning method. Objects are classified as belonging to one of K groups, where K is chosen a priori.
K-means assumes that the clusters are spherical, with the data points in a cluster centered around that cluster's mean. It also assumes that the clusters have similar variance, and each data point is assigned to the nearest cluster.
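A minimal sketch of the usual iterative procedure (assign each point to the nearest centroid, then recompute each centroid as the mean of its points) is shown below. The function name, the fixed iteration count, and the naive choice of the first K points as starting centroids are assumptions for this example; practical implementations use better initialization and a convergence check.

```python
import math

def kmeans(points, k, iterations=20):
    # Naive initialisation (an assumption): use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
centroids, assignments = kmeans(points, k=2)
```

On this small dataset the three points near the origin end up in one cluster and the three points near (8.5, 9) in the other.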
Explain the differences between data profiling and data mining.
The difference between data profiling and data mining is:
Data profiling focuses on the instance analysis of individual attributes. It provides information about each attribute, such as its value range, discrete values and their frequency, occurrence of null values, data type, and length.
Data mining focuses on cluster analysis, detection of unusual records, sequence discovery, dependencies, and relations that hold between several attributes.
What steps should we follow for data cleaning?
- First of all, sort the data by different attributes.
- For large datasets, clean the data stepwise and improve it with each step until you achieve good data quality.
- For large datasets, break them into smaller chunks. Working with less data increases iteration speed.
- For each column, analyze summary statistics such as the mean, standard deviation, and number of missing values.
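The per-column summary mentioned in the last step might look like the sketch below. The `column_summary` helper and the sample records are hypothetical, and missing values are assumed to be represented as `None`.

```python
import statistics

def column_summary(rows, column):
    """Mean, sample standard deviation, and missing-value count for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # needs at least two present values
        "missing": len(values) - len(present),
    }

# Hypothetical records; None marks a missing value.
records = [
    {"age": 30, "income": 52000},
    {"age": 41, "income": None},
    {"age": None, "income": 61000},
    {"age": 35, "income": 58000},
]
summary = {col: column_summary(records, col) for col in ("age", "income")}
```

Columns with a high missing count or an unexpectedly large standard deviation are natural candidates for closer inspection.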
Explain time series analysis.
Time series analysis can be done in two domains: the frequency domain and the time domain. It forecasts the output of a process by analyzing previous data with methods such as exponential smoothing and log-linear regression.
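Simple exponential smoothing, one of the methods named above, can be sketched in a few lines. The function name, the smoothing factor `alpha=0.5`, and the sample series are assumptions for illustration.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha in (0, 1] weights recent observations."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [10.0, 12.0, 11.0, 15.0]  # hypothetical past observations
smoothed = exponential_smoothing(demand, alpha=0.5)
forecast = smoothed[-1]  # one-step-ahead forecast under simple smoothing
```

A larger `alpha` makes the forecast react faster to recent changes, while a smaller one smooths out more noise.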
Explain the concept of collaborative filtering.
Collaborative filtering is a simple algorithm for building a recommendation system based on user behavioral data. A good example is the “people also like” suggestions that online shopping sites show based on our browsing history.
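One very simple way to produce such suggestions is to count which items appear together in users' histories. The sketch below assumes this co-occurrence approach; the `also_liked` function and the sample histories are hypothetical, and real systems typically use similarity measures or matrix factorization instead.

```python
from collections import Counter

def also_liked(histories, item, top_n=2):
    """Items most often liked by users who also liked `item`."""
    counts = Counter()
    for liked in histories.values():
        if item in liked:
            counts.update(liked - {item})  # count every other item this user liked
    return [other for other, _ in counts.most_common(top_n)]

# Hypothetical behavioral data: user -> set of liked items.
histories = {
    "ann": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "cat": {"laptop", "mouse", "keyboard", "monitor"},
    "dan": {"phone", "case"},
}
suggestions = also_liked(histories, "laptop")
```

For someone viewing the laptop, the mouse is suggested first because all three laptop buyers also liked it.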
What should we do with suspected or missing data?
There may be times when we encounter suspected or missing data. In that case, we should do the following.
- Prepare a validation report that provides information about all suspected data, including the validation criteria it failed and the date and time of occurrence.
- Have experienced personnel examine the suspicious data to determine its acceptability.
- Assign invalid data a validation code and replace it.
- To work on the missing data, use an appropriate analysis strategy such as the deletion method, single imputation methods, or model-based methods.
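Two of the strategies in the last step, deletion and single imputation, can be sketched as follows. The function names are hypothetical, missing values are assumed to be `None`, and the column mean is used as one simple choice of imputed value.

```python
import statistics

def listwise_deletion(rows):
    """Deletion method: drop every row that has a missing value."""
    return [row for row in rows if None not in row]

def mean_imputation(rows):
    """Single imputation: replace each missing value with its column mean."""
    columns = list(zip(*rows))
    means = [statistics.mean([v for v in col if v is not None]) for col in columns]
    return [[means[i] if v is None else v for i, v in enumerate(row)] for row in rows]

rows = [
    [1.0, 10.0],
    [2.0, None],  # missing second value
    [3.0, 30.0],
]
deleted = listwise_deletion(rows)
imputed = mean_imputation(rows)
```

Deletion keeps only fully observed rows at the cost of losing data, while mean imputation keeps every row but reduces the variance of the imputed column.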