3D grphique: In Apache Pig, select DISTINCT rows based on a single column

Tuesday, 27 May 2014

In Apache Pig, select DISTINCT rows based on a single column

Vote count: 0

Let's say I have a table such as the one below, which may or may not contain duplicate values in a given field:


ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://ift.tt/SN63Zy
003    http://ift.tt/SdtquN

I would like to write a Pig script that returns only DISTINCT rows, where distinctness is judged by the value of a single field. For instance, deduplicating the table above by ID should return something like the following:


ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://ift.tt/SdtquN

The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
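One possible way to take a single tuple per bag is a nested FOREACH with LIMIT 1 over each group. The following is only a sketch under assumptions: the input path 'urls.tsv', the tab-delimited layout, and the relation and field names are hypothetical, and it presumes a Pig version that accepts LIMIT inside a nested FOREACH.

```pig
-- Hypothetical input: tab-separated file with two fields, id and url.
urls = LOAD 'urls.tsv' USING PigStorage('\t') AS (id:chararray, url:chararray);

-- One group per id; each group carries a bag named 'urls' holding all its rows.
by_id = GROUP urls BY id;

-- Keep a single, arbitrary tuple from each bag and flatten it back to (id, url).
one_per_id = FOREACH by_id {
    one_row = LIMIT urls 1;
    GENERATE FLATTEN(one_row);
};

DUMP one_per_id;
```

Which of the two rows with ID 002 survives is not defined here; adding an ORDER BY inside the nested block before the LIMIT would make the choice deterministic.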

The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
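To illustrate the whole-row behaviour, here is a small sketch using the same hypothetical 'urls.tsv' input and field names (assumptions, not part of the original data set):

```pig
-- Hypothetical input, same layout as the table above.
urls = LOAD 'urls.tsv' USING PigStorage('\t') AS (id:chararray, url:chararray);

-- DISTINCT compares whole tuples, so all four (id, url) rows survive.
all_rows = DISTINCT urls;

-- Projecting to id first yields the three unique ids, but drops the url field.
ids = FOREACH urls GENERATE id;
unique_ids = DISTINCT ids;
```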

For my purposes, I do not care which of the rows with ID 002 is returned.

asked 15 secs ago

Arel

175
