Recently I have been tasked with building a flexible, user friendly, and fast data streaming platform for a client. It needed to be flexible in a way that we could demand the way data would flow through a pipeline without needing to recompile code for every change. User friendly in a way that new developers and users of the system didn’t need to take several months to learn. Finally, it needed to be fast enough to process tens of thousands of transactions per second. After evaluating several streaming frameworks I came across Twitter Heron. Heron had just about all we needed, but we still needed the ability to build topologies without recompilation. From this need, the idea of ECO (Extensible Component Orchestrator) came about. Over the past few months, I’ve been working with the devs and architects at Twitter and Streamlio to build another way to design and deploy topologies with Heron. It’s been a lot of late nights and hard work, but we managed to come up with the first beta release of this API. I am going to give a brief explanation of what Heron is used for, how conversations of ECO came about, the capabilities of, and how ECO works.
Heron is a system that allows realtime, distributed, and fault-tolerant stream processing and is currently in the application process to be an incubating Apache project. It was created by Twitter after experiencing pains with their previous systems that were used to process billions of tweets in real time. Heron allows you to define data pipelines called “topologies” for processing data sets. This allows you to make decisions on data that enter your pipeline at sub second speeds. At 1904labs one of our clients has a need for processing large amounts of data quickly while keeping the manageability of the system as a #1 priority. This need is what led us to explore options already being developed. After a month of research I came across Heron. It was created from the same people who wrote Storm and addresses some of the hard to manage areas that Storm users had to deal with. In my opinion, Storm offered one benefit over Heron; this is Flux. Flux allows you to build topologies from yaml files which removed the need to design topologies in Java code. Fueled by my own personal agenda, I decided to pitch the idea of “Flux” to the Heron community. They bit. ECO had been approved to be built.
Fast forward a few months and you have your first beta release of the ECO API out for public consumption and testing. Defining the initial release of ECO was simple: being able to take a Storm Flux topology and run it as a Heron ECO topology with minimal changes. The concept behind ECO is a little different than it is with Heron’s existing Streamlet and Topology APIs. With the Streamlet and the Topology APIs you are defining the path your data moves through a topology with Java (other languages are soon to be supported), but you are left to recompile the project if you decide to change the path of how data flows between the components in a topology. This is where ECO steps in.
ECO’s concept is that you bundle all of the spouts and bolts you plan to use in a jar. You then assemble the way data is supposed to flow between each spout and bolt via the ECO configuration file. The configuration file consists of three main parts, the configuration definitions, the topology definition and stream definitions; all defined in yaml. The configuration definition section allows you to set your container, topology, and component specific configurations. Extracting this from outside of your Java topology code means changes to configuration no longer require recompilation.The topology definition is where you list your components, spouts, and bolts. You can find a sample topology definition below this paragraph. Components will typically be configuration classes that spouts and bolts depend on. This way, if you have a spout or bolt that requires property injection other than primitives through constructor or setter injection you have the ability to do so. Streams define the direction data flows between two components and the method of which that data is transferred.
name: "simple-wordcount-topology" # topology configuration # this will be passed to the submitter as a map of config options config: topology.workers: 2 topology.component.resourcemap: - id: "spout-1" ram: 256 MB - id: "bolt-1" ram: 256 MB topology.component.jvmoptions: - id: "spout-1" options: ["-XX:NewSize=300m", "-Xms2g"] # spout definitions spouts: - id: "spout-1" className: "com.twitter.heron.examples.eco.TestNameSpout" parallelism: 1 # bolt definitions bolts: - id: "bolt-1" className: "com.twitter.heron.examples.eco.TestNameCounter" parallelism: 1 - id: "bolt-2" className: "com.twitter.heron.examples.eco.LogInfoBolt" parallelism: 1 # stream definitions define connections between spouts and bolts. streams: - from: "spout-1" to: "bolt-1" grouping: type: FIELDS args: [ "word"] - from: "bolt-1" to: "bolt-2" grouping: type: SHUFFLE
ECO’s core is built from the core code of Flux for compatibility. Because of this, Storm users can now take almost any topology they are existing in Storm and run it in Heron with minimal changes. The future roadmap for ECO is going to be influenced heavily by the community. At this time we chose to only support org.apache.storm spouts and bolts, based on adoption and/or demand we may change this. ECO is supported from Heron version 0.17.6 and onwards. Check out the ECO docs at http://heronstreaming.io/docs/developers/java/eco-api for more information on how to run these topologies