- Más nuevo
- Más votos
- Más comentarios
UPDATED QUERY:
Hey AlexR, could you please try running this query?
WITH systemid_partition_day AS ( SELECT systemid, partition_day FROM event_index ) SELECT ed.* FROM event_data ed JOIN systemid_partition_day spd ON ed.systemid = spd.systemid WHERE ed.partition_day = spd.partition_day LIMIT 10;
This query joins
event_data
withevent_index
based onsystemid
andpartition_day
, filteringevent_data
to include only rows where thepartition_day
matches thepartition_day
associated with eachsystemid
inevent_index
, limiting the result to 10 rows.
or
SELECT ed.* FROM event_data ed JOIN ( SELECT systemid, partition_day FROM event_index ) spd ON ed.systemid = spd.systemid WHERE ed.partition_day = spd.partition_day LIMIT 10;
Sources:
Hi, I was looking at your earlier question on this topic too - i wonder how it behaves if you explicitly set the
event_data.partition_day=xxxxx event_index.partition_day=xxxxx as well as doing the join
It (should) be able to just work this out for itself as it's implicit by the join statement - but maybe being explicit helps it work out the more efficient pruned plan?
Cheers, Rich
Setting the partition_day explicitly "works" (the error goes away) but is not the correct query: the event_index can contain more than one partition_day for the same given systemid. The event_data table contains 1000x more data than the event_index table, so it is critical that the event_index must be queried first, followed by reading only the relevant partitions from event_data. I can make this work by making two separate queries and setting the explicit partition_day in the second query. That's effectively moving the JOIN to application code. That's an anti-pattern, a form of technical debt that I would like to resolve.
Contenido relevante
- OFICIAL DE AWSActualizada hace 2 años
- OFICIAL DE AWSActualizada hace un año
- OFICIAL DE AWSActualizada hace 2 años
- OFICIAL DE AWSActualizada hace 2 años
Unfortunately this is not the same as the original query. The partition_day is not known in advance. The event_index table exists to map systemid to partition_day. The systemid is known at query time (it is an input to the query), but the partition_day is not. As an analogy to help you understand, think about an airport tracking the ID of traveler's wi-fi devices. When you want to query the logs for a given ID, you do not know in advance which days the device was at the airport. That is the purpose of the event_index table.
Hey AlexR, I've made some updates to the query. Could you give it a try and let me know if it works?