Data science is fundamentally a multidisciplinary field that uses automated methods, algorithms, processes and systems to examine large amounts of data in various forms and extract knowledge and insights from the examined data.

Data science sits within the field of Big Data and is geared toward extracting meaningful information from massive amounts of complex data. Data science, or data-driven science, brings together different fields of work in statistics and computation in order to interpret data for the purpose of decision making.

History:-

In 1960, Peter Naur used the term data science as a replacement for computer science. In 2001, William S. Cleveland introduced data science as an independent discipline. In 2012, Harvard Business Review published an article calling data scientist “the sexiest job of the 21st century.” Data science is often used interchangeably with earlier concepts like predictive modelling, statistics, business analytics and business intelligence.

Role of Data Science in today’s world:-

Data science has done a great deal to bring the financial industry into the tech-savvy era. With the assistance of data science, organisations employ big data to deliver value to their consumers. Entertainment companies like Netflix use big data to find out what their users are really interested in, and then use that information to decide which TV shows to make.

Ques 1. What is Data Science?

Data science is fundamentally a multidisciplinary field that uses automated methods, algorithms, processes and systems to examine large amounts of data in various forms and extract knowledge and insights from the examined data.

Data science can turn the huge amounts of data the digital age generates into new knowledge by combining aspects of computer science, mathematical statistics and applied mathematics.

Ques 2. What are the skills required to perform data analysis using Python?

Some of the important skills required for performing data analysis using Python are as follows:-

A proper understanding of the built-in data types, particularly dictionaries, tuples, lists and sets.

Skill with N-dimensional NumPy arrays.

Skill with pandas dataframes.

The ability to perform element-wise vector and matrix operations on NumPy arrays.

The ability to write efficient list comprehensions rather than traditional for loops.

The ability to write small, pure, clean functions that do not alter their arguments.
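The skills above can be sketched in one short, hypothetical example (using only the standard library and NumPy; the pandas side is omitted for brevity):

```python
import numpy as np

# Built-in data types: a dict mapping column names to lists.
records = {"age": [25, 32, 47], "salary": [40_000, 55_000, 72_000]}

# N-dimensional NumPy arrays and element-wise vectorised operations.
ages = np.array(records["age"])
salaries = np.array(records["salary"])
salary_per_year_of_age = salaries / ages          # element-wise division

# A list comprehension instead of a traditional for loop.
adults_over_30 = [a for a in records["age"] if a > 30]

# A small pure function: it returns a new array and does not
# mutate its argument.
def normalise(values):
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()

print(adults_over_30)            # [32, 47]
print(normalise(records["age"]).round(2))
```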

Ques 3. Can you explain the importance of selection bias?

Selection bias takes place when proper randomisation is not achieved while choosing the groups, data or individuals to be analysed. It means that the sample obtained does not accurately represent the population that was actually intended to be analysed. Selection bias includes sampling bias, time interval, data and attrition bias.

Selection bias is sometimes referred to as the selection effect. It is a distortion of the statistical analysis resulting from the way the samples were collected. If selection bias is not taken into account, some conclusions of the study may not be accurate.

Ques 4. What are the types of Selection bias?

There are four main types of selection bias:-

Sampling bias: a systematic error resulting from a non-random sample of a population, making some members of the population less likely to be included than others and producing a biased sample.

Time interval: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have the same mean.

Data: when specific subsets of data are selected to support a conclusion, or bad data are rejected on arbitrary grounds rather than according to previously stated or generally agreed criteria.

Attrition: attrition bias is a kind of selection bias caused by attrition (loss of participants), i.e. discounting trial subjects or tests that did not run to completion.

Ques 5. Can you explain the difference between “long” and “wide” format data?

S No. | Long format | Wide format
1 | Each row is one time point per subject, so a subject's repeated responses appear in multiple rows. | A subject's repeated responses are in a single row, with each response in a separate column. Data in wide format can be recognised by the fact that columns broadly represent groups.
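Assuming pandas is available, the reshaping between the two formats can be sketched with hypothetical measurement data:

```python
import pandas as pd

# Wide format: one row per subject, repeated measurements as columns.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "week1": [5.1, 4.8],
    "week2": [5.4, 5.0],
})

# Wide -> long: each row becomes one time point per subject.
long = wide.melt(id_vars="subject", var_name="week", value_name="score")
print(long)

# Long -> wide again with pivot.
wide_again = long.pivot(index="subject", columns="week", values="score")
print(wide_again)
```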

Ques 6. What do you mean by statistical power of sensitivity?

Sensitivity is commonly used to validate the accuracy of a classifier (SVM, logistic regression, random forest, etc.).

Sensitivity is “True Positives / Total actual positive events”, i.e. TP / (TP + FN). Here, true positives are the events that were actually positive and that the model also predicted as positive.
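A minimal sketch of the calculation, with made-up true and predicted labels:

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 1]   # four actual positive events
y_pred = [1, 0, 1, 0, 1, 1]   # the model caught three of them
print(sensitivity(y_true, y_pred))  # 3 / 4 -> 0.75
```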

Ques 7. Explain the term Normal Distribution.

Data can be distributed in many different ways: skewed to the left, skewed to the right, or jumbled up.

However, there is a chance that data are distributed around a central value without any skew to the left or right. Such data follow a normal distribution and form a bell-shaped curve.

Ques 8. Explain the properties of a Normal Distribution curve?

The properties of a Normal Distribution are:-

Unimodal – it has only one mode

Symmetrical – the left and right halves are mirror images of each other

Bell-shaped – the maximum height (mode) is at the mean

Mean, median and mode are all located at the centre

Asymptotic – the tails approach the x-axis but never touch it
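These properties can be checked empirically on a simulated sample (a sketch using only the Python standard library; the centre of 50 and spread of 10 are arbitrary choices):

```python
import random
import statistics

random.seed(0)  # reproducible draw

# Draw 100,000 values from a normal distribution centred at 50.
sample = [random.gauss(mu=50, sigma=10) for _ in range(100_000)]

mean = statistics.mean(sample)
median = statistics.median(sample)

# Unimodal and symmetric: the mean and median sit together at the centre.
print(round(mean, 1), round(median, 1))

# Symmetry: roughly as many values fall below the mean as above it.
below = sum(1 for x in sample if x < mean)
print(below / len(sample))
```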

Ques 9. What is the difference between Data Science and Machine Learning?

S No. | Criteria | Data Science | Machine Learning
1 | Scope | Multidisciplinary | Training machines
2 | Role | Can take on a business role | Purely technical
3 | Artificial Intelligence | Loosely integrated | Tightly integrated

Ques 10. List the differences between supervised and unsupervised learning?

S No. | Supervised Learning | Unsupervised Learning
1 | Uses labelled data as input. | Uses unlabelled data as input.
2 | Has a feedback mechanism. | Does not have a feedback mechanism.

Ques 11. Can you explain the term logistic regression in data science and how logistic regression is done?

Logistic regression, often referred to as the logit model, is a method to predict a binary outcome from a linear combination of predictor variables.

Logistic regression measures the relationship between one or more independent variables and the dependent variable by estimating probabilities using the logistic (sigmoid) function. Here, the dependent variable is the label we would like to predict, and the independent variables are our features.

Real-life example:- predicting whether the BJP will win an election. In this scenario, the outcome of the prediction is binary, i.e. either 0 or 1 (win or lose). The predictor variables here could be the amount of money spent on the election campaign, the amount of time spent campaigning, and so on.
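The prediction step can be sketched in plain Python; the weights and bias below are hypothetical placeholders for coefficients that would normally be fitted by maximum-likelihood estimation:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """P(y = 1) from a linear combination of predictor variables."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical fitted coefficients for two predictor variables.
weights = [0.8, -0.4]
bias = -0.1

p = predict_probability([1.5, 2.0], weights, bias)
print(round(p, 3))               # probability of the positive class
label = 1 if p >= 0.5 else 0     # threshold at 0.5 for the binary outcome
```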

Ques 13. Explain the steps in making a decision tree?

Step 1:- Take the entire data set as input.

Step 2:- Look for a split that maximises the separation of the classes. A split is a test that divides the data into two sets.

Step 3:- Apply the split to the input data (divide step).

Step 4:- Re-apply Steps 1 and 2 to each of the divided sets.

Step 5:- Stop when a stopping criterion is met.

Step 6:- Clean up the tree if the splits went too far. This step is called pruning.
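The steps above can be sketched as a small recursive implementation: a toy version using Gini impurity on a tiny made-up data set, not a production algorithm (pruning is approximated here by a depth limit):

```python
def gini(labels):
    """Impurity of a set of class labels (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Step 2: find the feature/threshold that best separates the classes."""
    best = None, None, float("inf")
    for f in range(len(rows[0])):
        for row in rows:
            t = row[f]
            left = [l for r, l in zip(rows, labels) if r[f] < t]
            right = [l for r, l in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[2]:
                best = f, t, score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 3-5: split, recurse on each half, stop on purity or depth."""
    if len(set(labels)) == 1 or depth == max_depth:
        return max(set(labels), key=labels.count)   # leaf: majority class
    f, t, _ = best_split(rows, labels)
    if f is None:
        return max(set(labels), key=labels.count)
    left = [(r, l) for r, l in zip(rows, labels) if r[f] < t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))

def predict(tree, row):
    if not isinstance(tree, tuple):
        return tree
    f, t, lo, hi = tree
    return predict(lo if row[f] < t else hi, row)

# Tiny data set: one feature, a threshold at 3 separates the classes.
rows = [[1], [2], [3], [4]]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
print(predict(tree, [1.5]), predict(tree, [3.5]))   # a b
```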

Ques 14. What is the goal of A/B Testing?

A/B testing is a statistical hypothesis test for a randomised experiment with two variants, A and B. The objective of A/B testing is to detect any change to a web page that maximises or increases an outcome of interest. A/B testing is an excellent method for figuring out the best online promotional and marketing strategies for a business, and it can be used to test everything from website copy to search ads to sales emails.
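One common way to analyse an A/B test is a two-proportion z-test; a sketch with hypothetical conversion counts:

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant A: 200 of 5,000 visitors converted; variant B: 260 of 5,000.
z = two_proportion_z(200, 5000, 260, 5000)
print(round(z, 2))
# |z| > 1.96 -> significant at the 5% level (two-sided test)
print(abs(z) > 1.96)
```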

Ques 15. What is Overfitting in data science?

In statistics and machine learning, the most common task is to fit a model to a set of training data so that it can make reliable predictions on general, unseen data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting happens when a model is too complex, for example when it has too many parameters relative to the number of observations. An overfitted model has poor predictive performance, as it overreacts to minor fluctuations in the training data.
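A small sketch of the effect, assuming NumPy is available: a degree-9 polynomial fitted to ten noisy points has as many parameters as observations, so it chases the noise rather than the underlying line:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy training points drawn from a simple linear relationship.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.2, size=10)

# A degree-9 polynomial can interpolate the noise; a straight line
# keeps the model simple.
complex_fit = np.polyfit(x, y, deg=9)
simple_fit = np.polyfit(x, y, deg=1)

def train_error(coeffs):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# The complex model "wins" on the data it was trained on...
print(train_error(complex_fit) < train_error(simple_fit))  # True
# ...but its near-zero training error reflects noise, not the
# underlying relationship: the hallmark of overfitting.
```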

Ques 16. How can the Overfitting of the model be avoided?

There are three methods through which overfitting of a model can be avoided:-

Keep the model as simple as possible:- take fewer variables into account, thereby removing some of the noise in the training data.

Use cross-validation techniques:- k-fold cross-validation is one such technique.

Use regularisation techniques:- a regularisation technique such as LASSO penalises certain model parameters if they are likely to cause overfitting.
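A minimal sketch of how k-fold cross-validation partitions the data (index bookkeeping only; model fitting is omitted):

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

# Each sample is used for validation exactly once across the k folds;
# the model would be trained k times and its scores averaged.
for train, validation in k_fold_indices(10, k=5):
    print(len(train), validation)
```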

Ques 17. What is Underfitting in data science?

Underfitting happens when a statistical model or machine learning algorithm is unable to capture the underlying trend of the data. Underfitting would typically occur when fitting a linear model to non-linear data. Such a model will have low predictive performance.

Ques 18. What is the meaning of the term dimensionality reduction? What are its benefits in data science?

Dimensionality reduction is the process of converting a data set with a vast number of dimensions into one with fewer dimensions while conveying similar information concisely.

Benefits of Dimensionality reduction:-

It helps compress the data and reduces the storage space required.

It decreases the computation time.

It removes redundant features.
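One standard dimensionality reduction technique is principal component analysis (PCA); a sketch assuming NumPy, with synthetic correlated features:

```python
import numpy as np

def pca(data, n_components):
    """Project data onto the directions of greatest variance."""
    centred = data - data.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
    top = eigenvectors[:, ::-1][:, :n_components]     # largest variance first
    return centred @ top

# Three highly correlated (redundant) features compressed into one.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
data = np.hstack([base,
                  2 * base + 0.01 * rng.normal(size=(100, 1)),
                  -base + 0.01 * rng.normal(size=(100, 1))])

reduced = pca(data, n_components=1)
print(data.shape, "->", reduced.shape)   # (100, 3) -> (100, 1)
```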

Ques 19. What do you mean by Recommender systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user will give to a product. Recommender systems are widely used for movies, research articles, social tags, news, products, music and more.

For Example:- product recommenders on e-commerce sites like eBay, Amazon and Flipkart, YouTube video recommendations, and movie recommenders on Netflix, IMDb, BookMyShow, etc.

Ques 20. Can you explain the process of Collaborative filtering?

Collaborative filtering is a filtering process used by almost all recommender systems to find patterns or information by combining viewpoints, multiple agents and several data sources.

For Example:- predicting the rating a specific user will give a movie based on his or her ratings of other movies and other users’ ratings of all the movies.
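A minimal sketch of user-based collaborative filtering with cosine similarity; the ratings below are made up, and a real system would use mean-centred ratings and far larger data:

```python
import math

# Hypothetical user -> {movie: rating} data; alice has not rated m3.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m4": 1},
    "bob":   {"m1": 5, "m2": 5, "m4": 2, "m3": 1},
    "carol": {"m1": 1, "m2": 2, "m4": 5, "m3": 5},
}

def cosine(u, v):
    """Similarity between two users over the movies both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    return dot / (math.sqrt(sum(u[m] ** 2 for m in common)) *
                  math.sqrt(sum(v[m] ** 2 for m in common)))

def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    others = [(cosine(ratings[user], r), r[movie])
              for name, r in ratings.items()
              if name != user and movie in r]
    total = sum(sim for sim, _ in others)
    return sum(sim * rating for sim, rating in others) / total

# Predict alice's rating for m3; the result is pulled toward bob's
# rating because bob's tastes are more similar to alice's.
print(round(predict_rating("alice", "m3"), 2))
```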

Ques 21. Why does data cleaning play a critical role in analysis?

Data cleaning plays a very important role in analysis because:

Cleaning data from multiple sources helps transform it into a format that data analysts or data scientists can work with.

Data cleaning helps increase the accuracy of a machine learning model.

Data cleaning is a cumbersome process: as the number of data sources increases, the time taken to clean the data grows with the volume of data those sources generate.

Cleaning can take up to 80% of the total analysis time, making data cleaning a critical part of the analysis task.
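Assuming pandas is available, a few typical cleaning steps can be sketched on made-up messy records:

```python
import pandas as pd

# Messy records from two hypothetical sources: inconsistent casing,
# stray whitespace, a missing value, and a duplicate row.
raw = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", "Carol"],
    "age": [25, 25, None, 47],
    "city": ["Delhi ", "Delhi ", "Mumbai", None],
})

clean = (
    raw.assign(name=raw["name"].str.lower(),   # one consistent format
               city=raw["city"].str.strip())   # trim stray whitespace
       .drop_duplicates()                      # remove repeated records
       .dropna(subset=["age"])                 # drop rows missing age
)
print(clean)
```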

Ques 22. What do you mean by Cluster Sampling?

Cluster sampling is a technique used when it is difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a group, or cluster, of elements.

For Example:- a researcher wants to survey the academic performance of high school students in India. The researcher can divide the total population of India into different clusters (cities), then select a number of clusters through simple or systematic random sampling, depending on the research design.
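A sketch of one-stage cluster sampling with hypothetical cities and students:

```python
import random

random.seed(42)  # reproducible selection

# Hypothetical clusters: cities, each with its own list of students.
clusters = {
    "Delhi": ["d1", "d2", "d3"],
    "Mumbai": ["m1", "m2"],
    "Chennai": ["c1", "c2", "c3", "c4"],
    "Kolkata": ["k1", "k2"],
}

# Step 1: randomly select whole clusters, not individual students.
chosen_cities = random.sample(list(clusters), k=2)

# Step 2 (one-stage cluster sampling): survey every student
# inside each selected cluster.
sample = [student for city in chosen_cities for student in clusters[city]]
print(chosen_cities, sample)
```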

Ques 23. Explain about Eigenvectors and Eigenvalues?

Eigenvectors are used to understand linear transformations. In data analysis, eigenvectors are usually calculated for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts only by flipping, stretching or compressing.

The eigenvalue is the strength of the transformation in the direction of its eigenvector, i.e. the factor by which the stretching or compression occurs.
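The defining relationship, A v = lambda v, can be verified numerically (a sketch assuming NumPy):

```python
import numpy as np

# A small covariance-like (symmetric) matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)

# Each eigenvector is only stretched by the transformation:
# A @ v equals eigenvalue * v, the eigenvalue being the strength
# of the stretch along that direction.
for value, vector in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ vector, value * vector)

print(eigenvalues)   # [1. 3.]
```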