Tuesday, May 27, 2014

In Apache Pig, select DISTINCT rows based on a single column






Let's say I have a table such as the one below, which may or may not contain duplicate values in a given field:



ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://ift.tt/SN63Zy
003 http://ift.tt/SdtquN


I would like to write a Pig script that keeps only rows that are distinct on the value of a single field. For instance, deduplicating the table above by ID should return something like the following:



ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://ift.tt/SdtquN


The Pig GROUP operator (GROUP ... BY ID) returns one bag of tuples per ID, which would work if I knew how to get just the first tuple from each bag (perhaps a separate question).
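
For illustration, the GROUP idea could be sketched roughly as follows. This is only a sketch under assumptions: the file name urls.tsv, the tab delimiter, the aliases, and the field names id and url are all hypothetical, and it relies on a Pig version that allows LIMIT inside a nested FOREACH block. Which row survives for each ID is arbitrary.

```pig
-- Hypothetical input: a headerless, tab-delimited file with two columns.
data = LOAD 'urls.tsv' USING PigStorage('\t') AS (id:chararray, url:chararray);

-- One tuple per distinct id: (group, {bag of all rows with that id}).
grouped = GROUP data BY id;

-- Keep a single, arbitrary row from each bag and flatten it back into a row.
deduped = FOREACH grouped {
    one_row = LIMIT data 1;
    GENERATE FLATTEN(one_row);
};

DUMP deduped;
```

If a particular row per ID were needed, an ORDER inside the nested block before the LIMIT would be one way to make the choice deterministic.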


The Pig DISTINCT operator compares entire rows, so in this case all four rows would be considered unique, which is not what I want.
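
A short sketch of that DISTINCT behavior, under the same kind of assumptions (urls.tsv, the tab delimiter, and the names id and url are hypothetical): applying DISTINCT to the full relation keeps every row of the sample table, while projecting the ID first removes the duplicates but drops the URL.

```pig
-- Hypothetical input: a headerless, tab-delimited file with two columns.
data = LOAD 'urls.tsv' USING PigStorage('\t') AS (id:chararray, url:chararray);

-- Whole-row comparison: the two 002 rows differ in url, so all four rows are kept.
distinct_rows = DISTINCT data;

-- Projecting first yields the distinct ids, but the url column is lost.
ids = FOREACH data GENERATE id;
distinct_ids = DISTINCT ids;

DUMP distinct_rows;
DUMP distinct_ids;
```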


For my purposes, I do not care which of the rows with ID 002 is returned.



asked 15 secs ago

Arel





