How To Find The Pathing Flow And Rank Them Using Pig Or Hive?
Solution 1:
You can reference this question where an OP was asking something similar. If I am understanding your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other. So 1 -> 1 -> 2 -> 1
would become 1 -> 2 -> 1
. If this is correct, then you can't just group and distinct
(as I'm sure you have noticed) because it will remove all duplicates. An easy solution is to write a UDF to remove those duplicates while preserving the distinct path of the user.
UDF:
package something;
import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
publicclassRemoveSequentialDuplicatesUDFextendsUDF {
public ArrayList<Text> evaluate(ArrayList<Text> arr) {
ArrayList<Text> newList = newArrayList<Text>();
newList.add(arr.get(0));
for (inti=1; i < arr.size(); i++) {
Stringfront= arr.get(i).toString();
Stringback= arr.get(i-1).toString();
if (!back.equals(front)) {
newList.add(arr.get(i));
}
}
return newList;
}
}
To build this jar you will need a hive-core.jar
and hadoop-core.jar
, you can find these here in the Maven Repository. Make sure you get the version of Hive and Hadoop that you are using in your environment. Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF. After the jar is built, import it and run this query:
Query:
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary functioncollectas "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";
select screen_flow, count
, dense_rank() over (orderby count desc) rank
from (
select screen_flow
, count(*) count
from (
select session_id
, concat_ws("->", remove_dups(screen_array)) screen_flow
from (
select session_id
, collect(screen_name) screen_array
from (
select*from database.table
orderby screen_launch_time ) a
groupby session_id ) b
) c
groupby screen_flow ) d
Output:
s1->s2->s3 21
s1->s2 12
s1->s2->s3->s1 12
Hope this helps.
Solution 2:
Input
990004916946605-1404157897784,S1,1404157898275990004916946605-1404157897784,S1,1404157898286990004916946605-1404157897784,S2,1404157898337990004947764274-1435162269418,S1,1435162274044990004947764274-1435162269418,S2,1435162274057990004947764274-1435162269418,S3,1435162274081990004947764274-1435162287965,S2,1435162690002990004947764274-1435162287965,S1,1435162690001990004947764274-1435162287965,S3,1435162690003990004947764274-1435162287965,S1,1435162690004990004947764274-1435162212345,S1,1435168768574990004947764274-1435162212345,S2,1435168768585990004947764274-1435162212345,S3,1435168768593
register /home/cloudera/jar/ScreenFilter.jar;
screen_records = LOAD '/user/cloudera/inputfiles/screen.txt'USING PigStorage(',') AS(session_id:chararray,screen_name:chararray,launch_time:long);
screen_rec_order = ORDER screen_records by launch_time ASC;
session_grped = GROUP screen_rec_order BY session_id;
eached = FOREACH session_grped
{
ordered = ORDER screen_rec_order by launch_time;
GENERATE groupas session_id, REPLACE(BagToString(ordered.screen_name),'_','-->') as screen_str;
};
screen_each = FOREACH eached GENERATE session_id, GetOrderedScreen(screen_str) as screen_pattern;
screen_grp = GROUP screen_each by screen_pattern;
screen_final_each = FOREACH screen_grp GENERATE groupas screen_pattern, COUNT(screen_each) as pattern_cnt;
ranker = RANK screen_final_each BY pattern_cnt DESC DENSE;
output_data = FOREACH ranker GENERATE screen_pattern, pattern_cnt, $0as rank_value;
dump output_data;
I am not able to find a way to use Pig Builtin function to remove adjacent screens for a same session_id ,hence I have used JAVA UDF inorder to remove the adjacent screen names.
I created a JAVA UDF called GetOrderedScreen and covrerted that UDF in to jar and named that jar as ScreenFilter.jar and registered that jar in this Pig Script
Below is the Code for that GetOrderedScreen Java UDF
publicclassGetOrderedScreenextendsEvalFunc<String> {
@Override
publicString exec(Tuple input) throws IOException {
String incoming_screen_str= (String)input.get(0);
String outgoing_screen_str ="";
String screen_array[] =incoming_screen_str.split("-->");
String full_screen=screen_array[0];
for (int i=0; i<screen_array.length;i++)
{
String prefix_screen= screen_array[i];
String suffix_screen="";
int j=i+1;
if(j< screen_array.length)
{
suffix_screen = screen_array[j];
}
if (!prefix_screen.equalsIgnoreCase(suffix_screen))
{
full_screen = full_screen+ "-->" +suffix_screen;
}
}
outgoing_screen_str =full_screen.substring(0, full_screen.lastIndexOf("-->"));
return outgoing_screen_str;
}
}
Output
(S1-->S2-->S3,2,1)
(S1-->S2,1,2)
(S1-->S2-->S3-->S1,1,2)
Hope this helps you!.. Also Wait for some more time, Some good brains who see this Question will answer effectively(without JAVA UDF)
Post a Comment for "How To Find The Pathing Flow And Rank Them Using Pig Or Hive?"