- #1
fog37
- TL;DR Summary
- Understanding the sample best-fit line and its relation to the population best-fit line
Hello,
1) Let's consider a population of 1,000,000 data points with each data point being represented by the pair of values (x,y).
Let's assume that, when plotted on a graph, the 1,000,000 points look like a spread out cloud with an overall positive linear trend. These 1,000,000 points represent the population. The best-fit line calculated using all the 1,000,000 points will have a specific slope and intercept.
Given a particular x value, the y value given by the best-fit line equation will exactly equal the arithmetic average of the y values of all the data points that share that x value. Is that correct? In essence, the average value of the y variable would depend linearly on the value of the x variable.
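To make question 1) concrete, here is a minimal sketch in Python with NumPy that one could use to check the claim numerically. Everything in it is an assumption made for illustration: the synthetic population, the coefficients, the noise level, the integer-valued x (chosen so that many points share each x value), and the helper name `max_gap`. It compares the per-x group averages of y with the fitted line, once for a population whose conditional mean is linear by construction and once for one whose conditional mean is curved.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

N = 1_000_000
# Assumption: x takes the integer values 0..10, so each x value has many points.
x = rng.integers(0, 11, size=N).astype(float)

# Two hypothetical populations sharing the same x values:
y_lin = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=N)   # conditional mean linear in x
y_curved = 0.1 * x**2 + rng.normal(0.0, 1.0, size=N)   # conditional mean curved in x

def max_gap(x, y):
    """Largest |group average of y - fitted line value| over the distinct x values."""
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line on the whole population
    xs = np.unique(x)
    group_means = np.array([y[x == v].mean() for v in xs])
    return np.max(np.abs(group_means - (intercept + slope * xs)))

gap_linear = max_gap(x, y_lin)
gap_curved = max_gap(x, y_curved)
print("linear case:", gap_linear)
print("curved case:", gap_curved)
```

In the first case the gap should be small but not exactly zero, since each group average still carries some noise. In the second case the least-squares line is still well defined, but it no longer passes through the group averages, so the statement in 1) seems to rely on the averages actually lying on a straight line.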
2) Now suppose that, instead of using all 1,000,000 data points to plot the graph and calculate the best-fit line, we only use a random sample of 100 points. The best-fit line obtained from the 100 random data points is a different line from the best-fit line calculated using the 1,000,000 points. We can take a different sample of 100 random points and the best-fit line will again be different (but similar in intercept and slope to the previous sample line). In essence, both the slope and the intercept, calculated for each different random sample of size 100, are random variables. Very often we can only work with a sample and not with the full population of 1,000,000 data points. Under which conditions will the sample best-fit line be a good approximation of the population best-fit line? The larger the sample, the closer the sample best-fit line will be to the population best-fit line... What conditions must be met to guarantee that the sample line is close to the population line?
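The situation in 2) can also be simulated. The sketch below is only an illustration under assumed values: the synthetic population, the sample sizes of 100 and 2,500, the 300 repetitions, and the helper name `sample_slopes` are all my own choices. It repeatedly draws simple random samples without replacement, fits a line to each, and looks at how the sample slopes scatter around the population slope.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, arbitrary choice

# Hypothetical population: a spread-out cloud with a positive linear trend.
N = 1_000_000
x = rng.uniform(0.0, 10.0, size=N)
y = 2.0 + 0.5 * x + rng.normal(0.0, 2.0, size=N)

# Best-fit line computed from the whole population.
pop_slope, pop_intercept = np.polyfit(x, y, 1)

def sample_slopes(n, reps=300):
    """Slopes of the best-fit lines for `reps` simple random samples of size n."""
    slopes = np.empty(reps)
    for i in range(reps):
        idx = rng.choice(N, size=n, replace=False)  # random sample without replacement
        slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]
    return slopes

slopes_100 = sample_slopes(100)
slopes_2500 = sample_slopes(2500)

print("population slope:", pop_slope)
print("n=100:  mean", slopes_100.mean(), "spread", slopes_100.std())
print("n=2500: mean", slopes_2500.mean(), "spread", slopes_2500.std())
```

In this setup the sample slopes should center on the population slope, with a spread that shrinks as the sample size grows. The sketch relies on the samples being drawn at random from the whole population; it says nothing about samples selected in some other way.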
Thank you!