Bad RMSE when predicting Price with Linear Regression


Hi. I have a dataset of price data. It looks like this:

Price | Branch | ItemCode | Discount | DateTimeOfPrice
1000  | 2      | 52345436 | 0.33     | 2022-03-24 14:00

The dataset has about 1M records.

I feature-engineered it in the following way:

Price | Discount | ItemCode | Year | Month | Day | Hour | Branch1 | Branch2 | Branch3
10    | 0.33     | 52345436 | 2022 | 03    | 24  | 14   | 0       | 1       | 0

Each component of DateTimeOfPrice got its own column. We have 3 branches. To keep the algorithm from treating the "Branch" column as some kind of priority (ordinal) column, I created 3 new columns, one per branch. If the item belongs to branch 2, the Branch2 column gets the value 1; otherwise it is 0.
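The feature engineering described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the actual pipeline: the helper name `engineer_row`, the branch ids 1 to 3, and the sample values are assumptions made for the example.

```python
from datetime import datetime

BRANCHES = (1, 2, 3)  # assumption: branches are identified as 1, 2 and 3


def engineer_row(row):
    """Split the timestamp into parts and one-hot encode the branch."""
    ts = datetime.strptime(row["DateTimeOfPrice"], "%Y-%m-%d %H:%M")
    features = {
        "Price": row["Price"],
        "Discount": row["Discount"],
        "ItemCode": row["ItemCode"],
        "Year": ts.year,
        "Month": ts.month,
        "Day": ts.day,
        "Hour": ts.hour,
    }
    # One indicator column per branch: 1 for the row's branch, 0 otherwise.
    for branch in BRANCHES:
        features[f"Branch{branch}"] = 1 if row["Branch"] == branch else 0
    return features


# Illustrative row only; the values are not taken from the real dataset.
raw = {
    "Price": 1000,
    "Branch": 2,
    "ItemCode": 52345436,
    "Discount": 0.33,
    "DateTimeOfPrice": "2022-03-24 14:00",
}
engineered = engineer_row(raw)
print(engineered)
```

Note that ItemCode is still passed through as a plain number here, so a linear model would treat it as a magnitude in the same way the original Branch column would have been.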

I ran the built-in Linear Learner and XGBoost algorithms, and also SageMaker Autopilot. In every run the best RMSE was 60, and predictions on the validation data are sometimes far from the actual value. I also tried running XGBoost from a notebook with the following parameters:

hyperparams = {
    "max_depth": "7",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "100",
    "eval_metric":"rmse",
    "verbosity": "2",
}

Still, the RMSE is around 60.
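For context, RMSE is expressed in the same units as Price, so a value of 60 is easier to judge next to a naive baseline such as always predicting the mean price. A minimal sketch of that comparison, assuming made-up validation values (the `actual` and `predicted` lists below are hypothetical, not from the dataset):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical validation prices and model predictions.
actual = [100.0, 250.0, 80.0, 400.0]
predicted = [130.0, 200.0, 120.0, 330.0]

# Naive baseline: predict the mean price for every row.
# (For simplicity the mean is taken from the same values; in practice
# it would come from the training split.)
mean_price = sum(actual) / len(actual)
baseline_rmse = rmse(actual, [mean_price] * len(actual))
model_rmse = rmse(actual, predicted)

print(f"model RMSE:    {model_rmse:.1f}")
print(f"baseline RMSE: {baseline_rmse:.1f}")
print(f"model RMSE relative to mean price: {model_rmse / mean_price:.0%}")
```

If the model's RMSE is not clearly below the baseline's, the features are carrying little signal, whichever algorithm is used.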

Please advise what can be done to improve the metric and the predictions.

AWS
Mi_Sha
asked 2 years ago · 304 views
3 Answers

Since I see you have a timestamp field in your data, would it be fair to assume your use case is mainly aimed at forecasting future prices - rather than estimating missing historical prices at different points in time?

If so, plain tabular regression (the Autopilot regression task type) is probably not a good way to tackle this problem; forecasting techniques would likely work better. You could explore:

  • SageMaker Canvas, which offers a forecasting model (see the docs here to make sure your input timestamp is recognised so that Canvas shows you the forecasting option)
  • Amazon Forecast, a dedicated managed forecasting service separate from SageMaker
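To illustrate the difference in framing: a forecasting approach treats each item (and here, each branch) as its own time-ordered series, and validation holds out the most recent points rather than a random sample of rows. A minimal sketch under those assumptions follows; the records and the `time_split` helper are hypothetical, and this is not how either service is implemented.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records in the original column layout.
records = [
    {"ItemCode": 52345436, "Branch": 2, "Price": 1010.0, "DateTimeOfPrice": "2022-03-24 15:00"},
    {"ItemCode": 52345436, "Branch": 2, "Price": 1000.0, "DateTimeOfPrice": "2022-03-24 14:00"},
    {"ItemCode": 11111111, "Branch": 1, "Price": 20.0, "DateTimeOfPrice": "2022-03-24 14:00"},
]

# Build one series per (ItemCode, Branch) pair.
series = defaultdict(list)
for rec in records:
    ts = datetime.strptime(rec["DateTimeOfPrice"], "%Y-%m-%d %H:%M")
    series[(rec["ItemCode"], rec["Branch"])].append((ts, rec["Price"]))

# Put every series in chronological order.
for points in series.values():
    points.sort(key=lambda point: point[0])


def time_split(points, holdout):
    """Hold out the most recent `holdout` points (holdout >= 1) for validation."""
    if holdout < 1:
        raise ValueError("holdout must be at least 1")
    return points[:-holdout], points[-holdout:]


train, validation = time_split(series[(52345436, 2)], holdout=1)
print(train, validation)
```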
AWS
EXPERT
Alex_T
answered 2 years ago


I followed your suggestion and used SageMaker Canvas.

I modified the data structure in the following way:

ItemPrice | Branch | Discount | ItemCode | PriceDate
Data      | Data   | Data     | Data     | Data
Data      | Data   | Data     | Data     | Data

I chose ItemCode as the "id" and grouped by "Branch". However, the prediction score is very poor: 22%.

According to the analysis, the Discount column is the reason. So I removed it and ran the process again, and the score was even lower :(

AWS
Mi_Sha
answered 2 years ago

Before you start building a model, I suggest doing some data exploration. Does your data have seasonality? Some items are simply not seasonal.
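One simple way to start that exploration is to compare the average price per calendar month for each item. The sketch below is only an illustration: the `monthly_profile` helper, the item codes "A" and "B", and all prices are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_profile(rows):
    """Return {item_code: {month: mean price}} from (item_code, timestamp, price) rows."""
    buckets = defaultdict(lambda: defaultdict(list))
    for item_code, timestamp, price in rows:
        month = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").month
        buckets[item_code][month].append(price)
    return {
        item: {month: mean(prices) for month, prices in sorted(months.items())}
        for item, months in buckets.items()
    }


# Hypothetical rows: item "A" is pricier in December, item "B" is flat.
rows = [
    ("A", "2021-06-10 14:00", 100.0),
    ("A", "2021-06-20 14:00", 102.0),
    ("A", "2021-12-10 14:00", 150.0),
    ("A", "2021-12-20 14:00", 154.0),
    ("B", "2021-06-10 14:00", 50.0),
    ("B", "2021-12-10 14:00", 50.0),
]

profile = monthly_profile(rows)
for item, months in profile.items():
    print(item, months)
```

A large month-to-month difference that repeats across several years points to seasonality; with a single year of data it cannot be told apart from a trend.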

AWS
answered 2 years ago
