Let's say I have a table such as the one below, which may or may not contain duplicates for a given field:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://ift.tt/SN63Zy
003 http://ift.tt/SdtquN
I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://ift.tt/SdtquN
The Pig GROUP BY operator returns, for each ID, a bag of the tuples that share that ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
For my purposes, I do not care which of the rows with ID 002 are returned.
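To make the GROUP BY idea concrete, here is a minimal sketch of one possible approach: group by ID, then keep a single tuple from each bag with a nested LIMIT. The input path 'urls.tsv', the tab delimiter, and the alias names (records, by_id, first, one_per_id) are assumptions made for illustration and are not part of my actual data.

```pig
-- Hypothetical input: a tab-separated file with two columns, id and url.
records = LOAD 'urls.tsv' USING PigStorage('\t') AS (id:chararray, url:chararray);

-- One output tuple per distinct id, each holding a bag of the matching rows.
by_id = GROUP records BY id;

-- Inside the nested block, keep a single tuple from each bag and
-- flatten it back into an ordinary (id, url) row.
one_per_id = FOREACH by_id {
    first = LIMIT records 1;
    GENERATE FLATTEN(first);
};

DUMP one_per_id;
```

Which of the two rows with ID 002 survives is not specified by LIMIT, which is acceptable here since either one will do. If a particular row were needed, an ORDER statement inside the nested block before the LIMIT would be one way to choose it.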