Bad RMSE when predicting Price with Linear Regression


Hi. I have a dataset of price data. A record looks like this:

10002523454360.332022-03-24 14:00

The dataset has about 1M records.

I feature engineered it in the following way:


Each component of DateTimeOfPrice got its own column.

We have 3 branches. To keep the algorithm from treating the "branch" column as some kind of priority (ordinal) value, I created 3 new columns, one per branch. If the item belongs to branch2, for example, the branch2 column gets the value 1; otherwise it gets 0.
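The encoding described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the asker's actual code: the field names (`branch`, `price_date`), the branch labels, and the `encode_row` helper are assumptions, and the timestamp format is inferred from the sample record.

```python
from datetime import datetime

# Hypothetical branch labels; the post only says there are 3 branches.
BRANCHES = ["branch1", "branch2", "branch3"]

def encode_row(row):
    """Split the timestamp into components and one-hot encode the branch."""
    ts = datetime.strptime(row["price_date"], "%Y-%m-%d %H:%M")
    features = {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
    }
    # One indicator column per branch: 1 if the row belongs to it, else 0.
    for name in BRANCHES:
        features["is_" + name] = 1 if row["branch"] == name else 0
    return features

example = encode_row({"branch": "branch2", "price_date": "2022-03-24 14:00"})
print(example)
```

With a pandas DataFrame, `pandas.get_dummies` performs the same one-hot step on a categorical column.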

I ran the built-in Linear Learner and XGBoost algorithms and also SageMaker Autopilot. In every run, the best RMSE was 60, and predictions on the validation set are sometimes far from the actual value. I also tried running XGBoost from a notebook with the following parameters:

hyperparams = {
    "max_depth": "7",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "100",
    "verbosity": "2",
}

Still, the RMSE is around 60.
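RMSE is expressed in the same units as the price, so whether 60 is bad depends on the price scale. One possible way to put the number in context is to compare it with a naive baseline that always predicts the mean training price. The sketch below uses made-up numbers and a hypothetical `rmse` helper; it is not taken from the notebook mentioned above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up prices, for illustration only.
train_prices = [100.0, 250.0, 80.0, 400.0, 120.0]
valid_prices = [110.0, 300.0, 90.0]

# Naive baseline: always predict the mean of the training prices.
mean_price = sum(train_prices) / len(train_prices)
baseline_rmse = rmse(valid_prices, [mean_price] * len(valid_prices))
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

If a trained model's RMSE is not clearly below this baseline, the features probably carry little signal about the price.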

Please advise what can be done to improve the metric and the predictions.

asked 2 years ago, 304 views
3 Answers

Since I see you have a timestamp field in your data, would it be fair to assume your use case is mainly aimed at forecasting future prices - rather than estimating missing historical prices at different points in time?

If so, plain tabular regression (Autopilot regression task type) is probably not a good way to tackle this problem as forecasting techniques would work better instead. You could instead explore:

  • SageMaker Canvas, which offers a forecasting model (see the docs here to make sure your input timestamp is recognised so that Canvas shows you the forecasting option)
  • Amazon Forecast, a dedicated managed forecasting service separate from SageMaker
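
Forecasting tools generally work on one time series per item: an item identifier, a timestamp, and a target value at a regular frequency. As a rough sketch of that reshaping, the snippet below groups price records by item and calendar day and averages the prices within each day. The rows, the field order, and the `daily_series` helper are made up for illustration; the daily frequency and the averaging are assumptions, not something the services require.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows: (item_code, timestamp, price).
rows = [
    ("A1", "2022-03-24 14:00", 10.0),
    ("A1", "2022-03-24 18:00", 12.0),
    ("A1", "2022-03-25 09:00", 11.0),
    ("B7", "2022-03-24 10:00", 99.0),
]

def daily_series(rows):
    """Group rows by item and calendar day, averaging the prices per day."""
    buckets = defaultdict(list)
    for item, ts, price in rows:
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M").date()
        buckets[(item, day)].append(price)
    series = defaultdict(list)
    # Sorting the (item, day) keys keeps each item's series in date order.
    for (item, day), prices in sorted(buckets.items()):
        series[item].append((day.isoformat(), sum(prices) / len(prices)))
    return dict(series)

series = daily_series(rows)
print(series)
```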
answered 2 years ago
  • I followed your suggestion and used SageMaker Canvas.

    I modified the data structure as follows, keeping only 5 fields:

    ItemPrice Branch Discount ItemCode PriceDate

    I chose ItemCode as the "id" and grouped by "branch". However, the prediction score is very poor: 22%.

    According to the analysis, the reason is the Discount column. So I removed it and ran the process again, and the score was even lower :(


I followed your suggestion and used SageMaker Canvas.

I modified the data structure in the following way


I chose ItemCode as the "id" and grouped by "branch". However, the prediction score is very poor: 22%.

According to the analysis, the reason is the Discount column. So I removed it and ran the process again, and the score was even lower :(

answered 2 years ago

I suggest doing some data exploration before you start building your model. Does your data have seasonality? Some items are just not seasonal.
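One simple way to look for seasonality, sketched here under assumptions, is to compare the autocorrelation of an item's price series at a candidate seasonal lag with the autocorrelation at other lags. The `autocorrelation` helper and the weekly price pattern below are made up for illustration; real data would need to be resampled to a regular frequency first.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return cov / var

# Made-up daily prices with a repeating 7-day pattern, for illustration only.
weekly_pattern = [10, 10, 10, 10, 10, 15, 15]
prices = weekly_pattern * 8

print(autocorrelation(prices, 7))  # lag equal to the period of the pattern
print(autocorrelation(prices, 3))  # a lag that does not match the period
```

A value near 1 at the seasonal lag, and clearly lower values elsewhere, hints at a repeating pattern; values near 0 at every lag suggest the item is not seasonal.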

answered 2 years ago
