Redshift unload problem with Parquet and negative numbers

0

Looks like there's a problem unloading negative numbers from Redshift to Parquet.

For example, my table has a column that's numeric(19,6), and a row with a value of -2237.430000. When I unload to a Parquet file and read it back with a Python program, the value comes back as 18446744071472.121616 (the two's-complement bit pattern read as an unsigned number).
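For what it's worth, that garbled number is exactly what you get when the 64-bit two's-complement bit pattern of the scaled integer is read back as unsigned. A quick arithmetic check (a sketch; it assumes Redshift stores -2237.430000 as the scale-6 integer -2237430000):

from decimal import Decimal

unscaled = -2237430000                    # -2237.430000 at scale 6, as a scaled integer
as_unsigned = unscaled % 2**64            # reinterpret the two's-complement bits as unsigned
print(Decimal(as_unsigned).scaleb(-6))    # prints 18446744071472.121616, the bad value above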

Can anyone confirm? Should be easy enough to reproduce.

Let me know if there's a fix.

Thanks.
-LP

lp2020
asked 4 years ago · 857 views
16 Answers
0
Accepted Answer

Just for curiosity's sake, what's the unloaded number when read back via Redshift Spectrum or Athena?

klarson
answered 4 years ago
0

I just read the Parquet file back into Redshift, and the value was inserted correctly into the new table as a negative number.

COPY my_table
FROM 's3://my_bucket/my_parquet_file'
IAM_ROLE 'arn:aws:iam::...'
FORMAT AS PARQUET;

So it looks like Redshift is doing it correctly, and the problem I'm encountering points to PyArrow.

Thanks.

Edited by: lp2020 on Jan 7, 2020 8:53 AM

lp2020
answered 4 years ago
0

Marking as not answered -- see previous reply.

Edited by: lp2020 on Jan 7, 2020 11:36 AM

lp2020
answered 4 years ago
0

Actually, I'm seeing the incorrect value when I read the Parquet file with multiple readers. For example, this web site shows me the two's-complement number when I upload the Parquet file.

http://parquet-viewer-online.com/

Therefore, it seems Redshift is not off the hook. Redshift itself can read the Parquet file back and correctly reconstruct the negative number, but other readers do not read it back as negative. These include:

  • Azure Data Factory (the original place that I encountered this problem)
  • PyArrow (read_table() and to_pandas() functions)
  • Pandas (pd.read_parquet() function)
  • parquet-viewer-online.com web site (however they're doing it)

Can someone else please confirm / try to reproduce this problem?
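For anyone trying to reproduce, here is a minimal read-back sketch covering the two Python paths listed above (the file path is a placeholder for an UNLOAD output file downloaded from S3):

import pandas as pd
import pyarrow.parquet as pq

path = "my_unloaded_file.parquet"     # placeholder: a Parquet file produced by UNLOAD

# PyArrow path: read_table() then to_pandas()
table = pq.read_table(path)
print(table.to_pandas())

# pandas path: read_parquet() (uses a Parquet engine such as PyArrow underneath)
print(pd.read_parquet(path))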

Thanks.
-LP

Edited by: lp2020 on Jan 7, 2020 11:37 AM

lp2020
answered 4 years ago
0

It certainly seems like Redshift's coercion of the number when reading it back from Parquet behaves differently from the other Parquet readers you cite. That alone feels like grounds for a Redshift support case on the reading side, regardless of whether there's a separate problem on the writing side.

klarson
answered 4 years ago
0

Is the Redshift support case something I need to submit? Or will someone from the Redshift team pick it up from this thread?

lp2020
answered 4 years ago
0

Hi lp2020,

If you have AWS Enterprise support, I'd recommend submitting a support case via the Support Center link in the upper-left "Support" drop-down on any AWS Console page.

Regards,
-Kurt

klarson
answered 4 years ago
0

Hi,
We have fixed this UNLOAD issue in the next Amazon Redshift patch release. The patch will be deployed to your cluster automatically if its maintenance track is set to CURRENT. If this is an urgent issue, please file a support case so we can generate a private patch and deploy it to your cluster.

Regards
AWS Redshift Team

answered 4 years ago
0

Thank you very much!!

lp2020
answered 4 years ago
0

Hi lp2020,

Were you able to see where the problem was in the Parquet data type or the Parquet metadata in the unloaded files?

If so, could you please share what the issue was, so other customers (myself included) can check whether any of the data we have unloaded to Parquet also suffers from this problem?

If not, the next best thing would be to retain an unloaded Parquet file that has this problem and compare it to another unload of the same data after the fix. It should then be possible to identify the fingerprint of the defect by diffing the metadata of a bad and a good Parquet file.
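As a sketch, that diff could be done with PyArrow (file names are placeholders for a pre-fix and a post-fix unload of the same data):

import pyarrow.parquet as pq

# Placeholders: the same rows unloaded before and after the patch.
for path in ("unload_before_fix.parquet", "unload_after_fix.parquet"):
    pf = pq.ParquetFile(path)
    print(path)
    print(pf.schema)       # physical/logical types, e.g. how the decimal column is encoded
    print(pf.metadata)     # writer, row group, and column-statistics details

Comparing the two printouts should expose what changed.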

Thanks,
-Kurt

klarson
answered 4 years ago
0

Hi lp2020,

I'm looking for a quicker, more direct way to determine whether other Redshift customers like me already have artifacts of this problem in data we have unloaded to Parquet. You said you saw the problem reading the unloaded data with the following tools:

  • Azure Data Factory
  • PyArrow
  • Pandas
  • parquet-viewer-online.com web site

Did you, or can you, also see the problem with any of these other tools, which could more quickly check existing unloaded data that is part of external partitioned tables in the Glue Data Catalog?

  • Redshift Spectrum (using the Glue Data Catalog)
  • Athena (uses the Glue Data Catalog by definition)
  • EMR Presto (using the Glue Data Catalog)
  • EMR Spark SQL (using the Glue Data Catalog)

Basically, I'm looking for a quick, low-cost way to identify the problem in unloaded data already sitting in S3, so we can either rule the problem out or turn our focus to remediating the damage this bug has done to our data.
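One cheap client-side spot check, as a sketch (path, column name, and threshold are all placeholders): values corrupted this way come back near 2^64 / 10^scale, so anything wildly above the column's plausible range flags an affected file.

import pandas as pd

SUSPICIOUS = 10**15    # far above any plausible business value for this column

def looks_affected(path: str, column: str) -> bool:
    # Corrupted negatives read back as huge positives near 2**64 / 10**scale.
    values = pd.read_parquet(path, columns=[column])[column].astype(float)
    return bool((values > SUSPICIOUS).any())

print(looks_affected("part_0000.parquet", "my_numeric_column"))    # placeholders

The same threshold test should also work as a WHERE clause in an Athena or Spectrum query over the external tables.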

Thanks,
-Kurt

klarson
answered 4 years ago
0

@klarson, in case it helps, we also noticed this issue; here is what we could find out:

  1. It seems to impact only NUMERIC data types; casting those columns to double precision while unloading produces "correct" negative values (a sketch follows below).
  2. The issue can be seen using Athena as well as S3 Select. I haven't checked since the last update whether this is fixed, but it would be easy to verify with the services mentioned above.
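A sketch of that casting workaround (table, column, bucket, and IAM role values are placeholders; run it with any Redshift client):

# Workaround sketch: cast the NUMERIC column to double precision inside the
# UNLOAD query. All identifiers and the role ARN below are placeholders.
unload_sql = """
UNLOAD ('SELECT id, CAST(amount AS double precision) AS amount FROM my_table')
TO 's3://my_bucket/my_prefix/'
IAM_ROLE 'arn:aws:iam::...'
FORMAT AS PARQUET;
"""
# e.g. cursor.execute(unload_sql) with a psycopg2 or redshift_connector cursor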
answered 4 years ago
0

Hi kevinv-dh,

Thanks so much for replying to my question so quickly. We can immediately use Athena to check the large amount of data we have already unloaded to Parquet for the symptom, and I strongly suspect we will see it.

Do you know of any other solutions for remediating the bad Parquet data besides loading it back into Redshift and then re-unloading it with the cast to double precision? I suspect the clock may be ticking on using Redshift for that: once we're patched with the fix, I don't think we should expect it to keep working, because as lp2020 reported, Redshift's COPY currently reads the Parquet data back in a way that undoes what UNLOAD did. So it looks like there was a symmetric bug in both UNLOAD and COPY.

Thanks,
-Kurt

klarson
answered 4 years ago
0

Small update: so far, the issue remains for us (Frankfurt).
I haven't found a way to correct the data outside of Redshift. The best option is probably to get in touch with support to see whether this is an actual bug or some compatibility issue. In any case, if you need to recover data at some point and Redshift is the only option, support would probably be able to downgrade a cluster to a version with the potentially bugged COPY so you can recover.

answered 4 years ago
0

Hi kevinv-dh,

You're probably correct that support could downgrade a cluster to a version that has the bug, as long as the patch with the fix doesn't span an OS patch as well. The last release cycle, which we just got in us-east-1 this week (on the trailing track), included an OS patch too. You can see this coming ahead of time in the Redshift events that tell you when your clusters will get patched: 2026 events containing the words "system maintenance" mean an OS patch, and 2025 events containing the words "database maintenance" mean a Redshift software version upgrade.

However, we're trying to get this fixed sooner than we'll receive the bug fix, which from this thread sounds like it may land in the next release cycle's current-track release. That means we'd have to wait two release cycles, i.e. ~6 weeks, because we're on the trailing track.

We'll try reloading the data with COPY and then re-exporting it with UNLOAD using the cast to double precision that you mentioned. We'll verify it as a workaround, and I'll post back the results.
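As a sketch, that round trip would look like this (all names and the role ARN are placeholders; it relies on COPY still having the symmetric behavior lp2020 reported):

# Remediation sketch: COPY the affected Parquet back in (undoing the sign
# error), then re-export with the double precision cast. Placeholders throughout.
copy_sql = """
COPY staging_table
FROM 's3://my_bucket/bad_unload/'
IAM_ROLE 'arn:aws:iam::...'
FORMAT AS PARQUET;
"""
unload_sql = """
UNLOAD ('SELECT id, CAST(amount AS double precision) AS amount FROM staging_table')
TO 's3://my_bucket/fixed_unload/'
IAM_ROLE 'arn:aws:iam::...'
FORMAT AS PARQUET;
"""
# Run both statements in order with any Redshift client cursor.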

Thanks for your help,
-Kurt

klarson
answered 4 years ago
0

A two's-complement artifact like this is never a good sign; it's reminiscent of buffer overflow errors.

Check what value is returned when the unloaded data is read back into Redshift directly with a SELECT statement. That will confirm whether the issue only appears after exporting to Parquet.

Consider unloading to a different file format such as ORC or CSV to see whether the negative number is preserved correctly; that would isolate the problem to the Parquet output path.

File an AWS support case with example code and data. The support team can investigate further and determine whether this is a known issue with the Redshift Parquet unload process.
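As a sketch of that cross-format comparison (file names are placeholders for the same rows unloaded once as Parquet and once as text/CSV):

import pandas as pd

parquet_df = pd.read_parquet("unload_check.parquet")
csv_df = pd.read_csv("unload_check.csv", header=None)   # UNLOAD text output has no header row

print(parquet_df.head())    # shows the huge two's-complement value if affected
print(csv_df.head())        # the text output carries the sign literally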

EXPERT
answered a month ago
