DEV Community

Ajith R


75+ data science interview questions: most asked interview questions

  1. What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?

  2. How do you deal with unbalanced binary classification?

  3. What is the difference between a box plot and a histogram?

  4. Describe different regularization methods, such as L1 and L2 regularization.

  5. What are neural networks?

  6. What is cross-validation?

  7. How do you define and select metrics?

  8. Explain what precision and recall are.

  9. Explain what a false positive and a false negative are. Why is it important to distinguish these from each other?

  10. Provide examples where false positives are more important than false negatives, where false negatives are more important than false positives, and where these two types of errors are equally important.

  11. What is the difference between supervised learning and unsupervised learning? Give concrete examples.

  12. Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model.

  13. What does NLP stand for?

  14. When would you use random forests vs. SVM, and why?

  15. Why is dimensionality reduction important?

  16. What is principal component analysis? Explain the sort of problems you would use PCA for.

  17. Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses Naive Bayes?

  18. What are the drawbacks of a linear model?

  19. Do you think 50 small decision trees are better than a large one? Why?

  20. Why is mean square error a bad measure of model performance? What would you suggest instead?

  21. What are the assumptions required for linear regression? What if some of these assumptions are violated?

  22. What is collinearity, and what do you do about it? How do you remove multicollinearity?

  23. How do you check whether a regression model fits the data well?

  24. What is a decision tree?

  25. What is a random forest? Why is it good?

  26. What is a kernel? Explain the kernel trick.

  27. What is the Central Limit Theorem? Explain it. Why is it important?

  28. What is statistical power?

  29. Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?

  30. What is overfitting?

  31. What is boosting?

  32. The probability that an item is at location A is 0.6, and 0.8 at location B. What is the probability that the item would be found on the Amazon website?

  33. You randomly draw a coin from 100 coins (1 unfair coin with heads on both sides and 99 fair coins) and flip it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?

  34. Difference between convex and non-convex cost function; what does it mean when a cost function is non-convex?

  35. Walk through the probability fundamentals.

  36. Describe Markov chains.

  37. A box has 12 red cards and 12 black cards. Another box has 24 red cards and 24 black cards. You want to draw two cards at random from one of the two boxes, one card at a time. Which box has a higher probability of getting cards of the same color and why?

  38. You are at a casino and have two dice to play with. You win $10 every time you roll a 5. If you play till you win and then stop, what is the expected payout?

  39. How can you tell if a given coin is biased?

  40. Make an unfair coin fair.

  41. You are about to get on a plane to London, and you want to know whether you have to bring an umbrella or not. You call three of your random friends and ask each one of them if it's raining. The probability that your friend is telling the truth is 2/3 and the probability that they are playing a prank on you by lying is 1/3. If all 3 of them tell you that it is raining, what is the probability that it is actually raining in London?

  42. You are given 40 cards in four different colors: 10 green cards, 10 red cards, 10 blue cards, and 10 yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find the probability that the cards picked are not of the same number and same color.

  43. How do you assess the statistical significance of an insight?

  44. Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

  45. Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

  46. Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?

  47. Is mean imputation of missing data acceptable practice? Why or why not?

  48. What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is, how you might screen for them, and what you would do if you found them in your dataset.

  49. How do you handle missing data? What imputation techniques do you recommend?

  50. You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?

  51. Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problems do they bring?

  52. You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

  53. Give examples of data that has neither a Gaussian nor a log-normal distribution.

  54. What is root cause analysis? How do you distinguish a cause from a correlation? Give examples.

  55. Give an example where the median is a better measure than the mean.

  56. Given two fair dice, what is the probability of getting scores that sum to 4? To 8?

  57. What is the Law of Large Numbers?

  58. How do you calculate the needed sample size?

  59. When you sample, what bias are you inflicting?

  60. How do you control for biases?

  61. What are confounding variables?

  62. What is A/B testing?

  63. How do you prove that males are on average taller than females by knowing just gender and height?

  64. Infection rates at a hospital above 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.

  65. You flip a biased coin (p(head)=0.8) five times. What's the probability of getting three or more heads?

  66. A random variable X is normal with mean 1020 and a standard deviation 50. Calculate P(X>1200).

  67. Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period?

  68. An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e., the probability that the subject is HIV positive)?

  69. You are running for office and your pollster polled one hundred people. Sixty of them claimed they will vote for you. Can you relax?

  70. A Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.

  71. The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?

  72. Consider influenza epidemics for two-parent heterosexual families. Suppose that the probability is 17% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 12% while the probability that both the mother and father have contracted the disease is 6%. What is the probability that the mother has contracted influenza?

  73. Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35-44 year old has a DBP less than 70?

  74. In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student's T confidence interval for the mean brain volume in this new population?

  75. A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow-up minus baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?

  76. In a study of emergency room waiting times, investigators consider a new and a standard triage system. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System - Old System).

  77. To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the 95% independent group confidence interval with unequal variances suggest vis-a-vis this hypothesis? (Because there are so many observations per group, just use the Z quantile instead of the T.)
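The last triage question (100 nights per group, means of 4 and 6 hours, standard deviations of 0.5 and 2 hours) asks for an unequal-variance interval using the Z quantile. One possible way to work through it in Python is sketched below; the function name `z_interval_unequal_var` is illustrative and not from the original post, and the sketch assumes the two groups of nights are independent.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_unequal_var(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Z-based confidence interval for (mean_a - mean_b) with unequal variances."""
    # Standard error of the difference between two independent sample means.
    std_err = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    # Two-sided normal quantile, roughly 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = mean_a - mean_b
    return diff - z * std_err, diff + z * std_err


# New system minus old system, using the numbers from the question.
low, high = z_interval_unequal_var(4, 0.5, 100, 6, 2, 100)
print(f"95% interval for (new - old): ({low:.2f}, {high:.2f})")  # about (-2.40, -1.60)
```

Under these assumptions the whole interval lies below zero, which is consistent with the hypothesis that the new system decreases the mean MWT.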
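Two of the questions in this list (the two-headed coin drawn from 100 coins, and the HIV test with 99.7% sensitivity and 98.5% specificity) come down to Bayes' rule. The sketch below shows that arithmetic in Python; the helper name `posterior` is made up for illustration, and the coin example assumes the unfair coin always lands heads.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a binary hypothesis: P(hypothesis | evidence)."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1 - prior) * likelihood_if_false
    return numerator / evidence


# Coin question: 1 two-headed coin among 100; the evidence is 10 heads in a row.
# A fair coin produces 10 heads with probability 0.5 ** 10.
p_unfair = posterior(prior=1 / 100,
                     likelihood_if_true=1.0,
                     likelihood_if_false=0.5 ** 10)

# HIV test question: prevalence 0.1%, sensitivity 99.7%, specificity 98.5%.
# The false positive rate is 1 - specificity.
p_hiv = posterior(prior=0.001,
                  likelihood_if_true=0.997,
                  likelihood_if_false=1 - 0.985)

print(f"P(unfair | 10 heads)  = {p_unfair:.3f}")  # about 0.912
print(f"P(HIV+ | positive)    = {p_hiv:.3f}")     # about 0.062
```

The second result is the usual base-rate lesson: even with a very accurate test, a low prevalence keeps the precision low.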
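Several of the numeric questions in this list are direct tail probabilities of a binomial, normal, or Poisson distribution. A minimal sketch using only the Python standard library follows; the helper names `binomial_at_least` and `poisson_at_most` are illustrative, and the Poisson example assumes the hourly rate of 2.5 stays constant over the four hours, giving a mean of 10.

```python
from math import comb, exp, factorial
from statistics import NormalDist


def binomial_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def poisson_at_most(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(k + 1))


# Biased coin, p(head) = 0.8, five flips: three or more heads.
p_three_plus_heads = binomial_at_least(3, 5, 0.8)

# X ~ Normal(mean 1020, sd 50): P(X > 1200), i.e. 3.6 standard deviations out.
p_x_above_1200 = 1 - NormalDist(1020, 50).cdf(1200)

# Bus station arrivals, 2.5 per hour for 4 hours: at most three people.
p_at_most_three = poisson_at_most(3, 2.5 * 4)

# Diastolic blood pressure ~ Normal(mean 80, sd 10): P(DBP < 70).
p_dbp_below_70 = NormalDist(80, 10).cdf(70)

print(f"P(>= 3 heads in 5)  = {p_three_plus_heads:.4f}")  # about 0.9421
print(f"P(X > 1200)         = {p_x_above_1200:.6f}")      # about 0.000159
print(f"P(<= 3 arrivals)    = {p_at_most_three:.4f}")     # about 0.0103
print(f"P(DBP < 70)         = {p_dbp_below_70:.4f}")      # about 0.1587
```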
