Linear regression question -- Is the slope equivalent to individually finding the slope between all the datapoints?

  • #1
sawtooth500
So the linear regression formula can be found here: https://www.ncl.ac.uk/webtemplate/a...and-correlation/simple-linear-regression.html

Question - is the slope given by the regression formula mathematically equivalent to individually finding the slope between all the datapoints, and then averaging the slopes out? I'm a programmer, and I need to write code that runs a linear regression across parts (note: only parts) of a very large dataset. I'm only interested in the slope of the regression line for each part, nothing more. Different parts will have some overlapping data points, though. I'm thinking that if I find the individual slope between each point once, and then average those slopes over the set of points I need, that would work. It would certainly be more efficient code than running an entire regression equation over and over again.... My intuition says yes, I will get the same result, but I've forgotten the math necessary to prove that. Thank you!
 
  • #2
sawtooth500 said:
So the linear regression formula can be found here: https://www.ncl.ac.uk/webtemplate/a...and-correlation/simple-linear-regression.html

Question - is the slope given by the regression formula mathematically equivalent to individually finding the slope between all the datapoints, and then averaging the slopes out?
No. Remember that the regression line minimizes the sum of SQUARED errors between the line and the sample y-values. So a sample y value that is far from the line has much more effect than it would if the slopes were just averaged. Also, what slopes are you talking about? Slope from what to what? The sample might have many different y values for the same x value. How would you define a slope then?
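A quick numerical check makes the difference concrete. This is only a minimal sketch: it assumes "the slopes" means the slopes between neighboring points (sorted by x), and the function names and the four data points are made up for illustration.

```python
def ols_slope(xs, ys):
    """Least-squares slope: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


def mean_adjacent_slope(xs, ys):
    """Plain average of the slopes between neighboring points (xs sorted)."""
    slopes = [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]
    return sum(slopes) / len(slopes)


# Illustrative data: flat for three points, then one jump at the end.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 0.0, 3.0]

print(ols_slope(xs, ys))            # 0.9
print(mean_adjacent_slope(xs, ys))  # 1.0
```

With equally spaced x values the neighboring slopes telescope, so their average depends only on the first and last points, while the regression slope uses every point.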
sawtooth500 said:
I'm a programmer, and I need to write code that runs a linear regression across parts (note: only parts) of a very large dataset. I'm only interested in the slope of the regression line for each part, nothing more.
You haven't said how many dimensions you have in your independent variable(s). I will assume that you are talking about simple linear regression.
sawtooth500 said:
Different parts will have some overlapping data points, though. I'm thinking that if I find the individual slope between each point once, and then average those slopes over the set of points I need, that would work. It would certainly be more efficient code than running an entire regression equation over and over again....
There are fairly efficient calculations. See this. You can use:
##\hat {\beta} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n\sum {x^2_i} - ( \sum x_i)^2}##
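Because every term in that formula is a plain sum, overlapping parts of the dataset can share work: running totals of ##x##, ##y##, ##xy## and ##x^2## are built once, and each window's sums come from two lookups. The sketch below is one possible way to do that, not a definitive implementation. It assumes the data are sorted by x and that a window is given as an index range `[lo, hi)`; the names `build_prefix_sums` and `window_slope` are hypothetical. The x values are shifted by the first x, which does not change the slope but keeps the sums smaller when x is a large timestamp.

```python
from itertools import accumulate


def build_prefix_sums(xs, ys):
    """Running totals of x, y, x*y and x*x, each with a leading 0."""
    x0 = xs[0]  # shifting x leaves the slope unchanged
    shifted = [x - x0 for x in xs]

    def prefix(values):
        return [0.0] + list(accumulate(values))

    return (
        prefix(shifted),
        prefix(ys),
        prefix(x * y for x, y in zip(shifted, ys)),
        prefix(x * x for x in shifted),
    )


def window_slope(prefix_sums, lo, hi):
    """Regression slope over points lo..hi-1, using the formula above."""
    sx, sy, sxy, sxx = (p[hi] - p[lo] for p in prefix_sums)
    n = hi - lo
    denom = n * sxx - sx * sx
    if n < 2 or denom == 0:
        raise ValueError("need at least two distinct x values")
    return (n * sxy - sx * sy) / denom


# Illustrative data only.
xs = [10, 11, 12, 13, 14]
ys = [1, 3, 2, 5, 4]
sums = build_prefix_sums(xs, ys)

print(window_slope(sums, 0, 5))  # 0.8  (all five points)
print(window_slope(sums, 1, 4))  # 1.0  (points 1..3 only)
```

For a time-based window, `lo` and `hi` could be found with `bisect` on the sorted x values. With very long datasets, floating-point cancellation in the differences of running totals may become noticeable, so it is worth comparing against a direct calculation on a few windows.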
 
  • #4
So a bit of clarification -

For any given X value, there will only be one Y value.

Good point about the errors being squared in a regression - that does give the points further away more sway - at this time we don't want to do that.

The calculation we're doing can be imagined like this - imagine X-axis values A-Z

So we need to calculate the average slope of -

A B C D E
B C D E F
C D E F G
D E F G H

And so on.... except in my actual dataset we are initially working with about 350,000 X values (each X value actually represents a timestamp in nanoseconds), so basically we are taking 5 second "lookbacks" at every 1 second interval - and depending on the lookback, some times are "busier" than others, so we can have hundreds to thousands of individual data points in a 5 second block. Because of this lookback nature, you can see that just in a 350,000-point dataset we'll likely have tens of millions of calculations, and later datasets have millions of initial entries....

So I was thinking of just having the program go through the ENTIRE dataset once to find the slope between each point and the next... then just average those slopes for each lookback.
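As a rough sketch of that plan, assuming the points are sorted by X, each X has one Y, and a lookback is given as an index range `[lo, hi)` (the function names are hypothetical): the neighboring slopes are computed once, kept as a running total, and each lookback's average is then a single subtraction and division. Note that this computes the average of neighboring slopes, not the regression slope.

```python
from itertools import accumulate


def adjacent_slope_prefix(xs, ys):
    """Running total of slopes between neighboring points, leading 0."""
    slopes = [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]
    return [0.0] + list(accumulate(slopes))


def window_mean_slope(prefix, lo, hi):
    """Average neighboring slope among points lo..hi-1."""
    pairs = hi - lo - 1  # number of neighboring pairs in the window
    if pairs < 1:
        raise ValueError("need at least two points in the window")
    return (prefix[hi - 1] - prefix[lo]) / pairs


# Illustrative data only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 2.0, 3.0, 3.0, 7.0]
prefix = adjacent_slope_prefix(xs, ys)

print(window_mean_slope(prefix, 0, 5))  # 1.75
print(window_mean_slope(prefix, 1, 4))  # 0.5
```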

Of course it will be interesting to try a linear regression model after we do this one, since the errors are squared in the regression model... and see how the results compare.
 
  • #5
Suppose that your x values are ordered, ##x_0 \lt x_1 \lt \dots \lt x_n##. An average of the slopes from point to point can give strange results unless the x values are reasonably equally spaced. Otherwise, you might have two x values very close together where even a small difference in the y values gives a huge slope.
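A small made-up example of that effect, as a sketch only: four points that lie close to the line y = x, with two of the x values almost coinciding.

```python
def mean_adjacent_slope(xs, ys):
    """Plain average of the slopes between neighboring points (xs sorted)."""
    slopes = [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]
    return sum(slopes) / len(slopes)


# Illustrative data: x = 1.0 and x = 1.001 are nearly the same.
xs = [0.0, 1.0, 1.001, 2.0]
ys = [0.0, 1.0, 1.1, 2.0]

# Slope from the first point to the last, for comparison.
endpoint_slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])

print(endpoint_slope)               # 1.0
print(mean_adjacent_slope(xs, ys))  # roughly 34, driven by the one close pair
```

The middle pair alone contributes a slope of about 100, so the average no longer reflects the overall trend.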
It sounds like you do not want to use the sum of squared errors of linear regression, so you may have to invent your own method. Be careful.
When you mention a 5 second "lookback", it makes me think that some sort of weighted average, where older values have less weight, might be appropriate.
In any case, you should not be too intimidated by tens of millions of calculations unless you require hard real-time results. Today's computers are VERY fast at simple calculations like this.
 
