This example demonstrates a simple join between two streams.
To join data correctly in a distributed environment, classification must
precede any operation that combines streams (i.e., join and/or union*). This ensures
that two or more streams, each represented as individual partitions, have comparable
data in each partition.
The following case has two data sets:
-one-
1 Oracle
2 Amazon
. . .
-two-
Arun Murthy 3
Larry Ellison 1
. . .
Classification is performed using the common "id", which ensures that
'1 Oracle' and 'Larry Ellison 1' end up in the same partition.
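
The classification step described above can be sketched in plain Java. This is only an illustration, not the framework's API: the class and method names (ClassifySketch, idOfOne, idOfTwo, partitionOf) are hypothetical, and hash-modulo partitioning is an assumption that happens to be consistent with the partition layout shown in this example.

```java
public class ClassifySketch {

    // Hypothetical classifier for data set "-one-": the id is the first token ("1 Oracle").
    static int idOfOne(String line) {
        return Integer.parseInt(line.trim().split("\\s+")[0]);
    }

    // Hypothetical classifier for data set "-two-": the id is the last token ("Larry Ellison 1").
    static int idOfTwo(String line) {
        String[] tokens = line.trim().split("\\s+");
        return Integer.parseInt(tokens[tokens.length - 1]);
    }

    // Assumption: a record goes to partition hash(id) mod parallelism.
    static int partitionOf(int id, int parallelism) {
        return Math.floorMod(Integer.hashCode(id), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 3;
        int first = partitionOf(idOfOne("1 Oracle"), parallelism);
        int second = partitionOf(idOfTwo("Larry Ellison 1"), parallelism);
        // Both records carry id 1, so both are assigned to the same partition.
        System.out.println(first + " " + second); // prints "1 1"
    }
}
```

Because both classifiers extract the same id, records from either data set with equal ids are assigned the same partition number regardless of which stream they came from.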
This example also shows a useful side effect of classification,
since it uses the 'dstream.parallelism=3' configuration.
Because the number of distinct classification values (3) matches the 'parallelism' value,
the result resembles 'join' behavior: each of the three partitions
contains only the data for a single classification id, giving you the following result:
=> PARTITION:0
3 Hortonworks
Rob Bearden 3
Herb Cunitz 3
Tom McCuch 3
Oleg Zhurakousky 3
Arun Murthy 3
=> PARTITION:1
1 Oracle
Larry Ellison 1
Thomas Kurian 1
=> PARTITION:2
2 Amazon
Jeff Bezos 2
Jeffrey Blackburn 2
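
Why the layout above depends on the 'parallelism' value can be sketched as follows. This is a minimal illustration under the assumption of hash-modulo partitioning (consistent with the layout shown, but not confirmed here); ParallelismSketch and idsPerPartition are hypothetical names, not part of the framework.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ParallelismSketch {

    // Groups classification ids by the partition they would be assigned to,
    // assuming partition = id mod parallelism.
    static Map<Integer, Set<Integer>> idsPerPartition(List<Integer> ids, int parallelism) {
        Map<Integer, Set<Integer>> partitions = new TreeMap<>();
        for (int id : ids) {
            int partition = Math.floorMod(id, parallelism);
            partitions.computeIfAbsent(partition, k -> new TreeSet<>()).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3);

        // Three distinct ids, three partitions: one id per partition.
        System.out.println(idsPerPartition(ids, 3)); // prints {0=[3], 1=[1], 2=[2]}

        // Fewer partitions than ids: ids 1 and 3 share a partition.
        System.out.println(idsPerPartition(ids, 2)); // prints {0=[2], 1=[1, 3]}
    }
}
```

With fewer partitions than distinct ids, records for the same id would still be co-located, but a partition may hold more than one id, so the output would no longer look like a per-id join.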