Artificial intelligence (AI) is a must-have in today's economy, and the benefits of data-driven business can be seen in the success of AI trailblazers like Google and Amazon. Today, companies of all sizes are collecting large amounts of data to stay competitive through data-driven decision making. Until recently, however, ingesting, processing, and distributing large amounts of data in real time across a single platform was a huge challenge, especially for companies aggressively charting a path to AI.
Enter Apache…and the power of the open source community.
The Apache Software Foundation is a leader in open-source software, stewarding hundreds of active projects. The foundation receives over 35 million page views weekly and boasts more than 460,000 active community members. One of its most successful projects to date has been the distributed messaging system Apache Kafka. As a result of the efforts of this large community, Kafka has become a mainstream technology for the largest enterprises in the world; in fact, more than 30% of Fortune 500 companies use Apache Kafka, an event streaming platform for high-performance data pipelines, real-time analytics, and data infrastructure. It is also open source, meaning any company can build and distribute its own software with it.
The purpose of this article is to highlight four enterprises that have transformed their businesses using open-source software from the Apache Foundation. Additionally, the article will touch on the future of AI and distributed messaging with a rising Apache innovator.
Listed alphabetically, here are the Top Apache Innovators:
LinkedIn:
The engineering team at LinkedIn built what would become Kafka because it needed a single, distributed pub-sub platform to accommodate its growing membership and site complexity. LinkedIn open-sourced the project in 2011, and it graduated to become a top-level project of The Apache Foundation, Apache Kafka, in 2012. Today, LinkedIn maintains over 100 Kafka clusters with more than 4,000 brokers serving 100,000+ topics. This infrastructure allows LinkedIn to track activity data, message exchanges, and operational metrics. LinkedIn also customizes Kafka to maximize operability at its large scale.
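The pub-sub model described above can be sketched in a few lines: producers append records to named topics (ordered logs held by brokers), and any number of consumers read from any position independently. This is a minimal, in-memory illustration with hypothetical names, not the real Kafka client API, which partitions topics across many brokers.

```python
from collections import defaultdict

class MiniBroker:
    """Toy illustration of Kafka-style publish-subscribe: producers append
    to named topics (append-only logs), and consumers read from any offset.
    Hypothetical sketch; real Kafka partitions and replicates these logs."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only log

    def publish(self, topic, message):
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1  # offset of the new record

    def read(self, topic, offset=0):
        return self.topics[topic][offset:]

broker = MiniBroker()
broker.publish("page-views", {"member": 42, "page": "/jobs"})
broker.publish("page-views", {"member": 7, "page": "/feed"})

# Two independent consumers can read the same topic without interfering.
print(len(broker.read("page-views")))       # 2
print(broker.read("page-views", offset=1))  # only the second event
```

Because the log is append-only and addressed by offset, activity tracking, message exchange, and metrics pipelines can all share one topic without coordinating with each other.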
Netflix:
Netflix's use of Kafka focuses on real-time monitoring and event processing. Its Keystone pipeline uses a dual-cluster system: Fronting Kafka clusters, which receive messages from producers, and Consumer Kafka clusters, which route topics to consumers in real time. With 36 Kafka clusters and 4,000+ broker instances, Netflix is able to ingest an average of more than 700 billion messages per day.
Oracle:
Oracle provides a service called Oracle Service Bus (OSB), which connects its enterprise service bus to Kafka and allows developers to customize implementations for data pipelines. Additionally, Oracle recently introduced its Cloud Infrastructure Streaming service, which lets customers move streaming data into autonomous warehouses for analytics, capture database change data, build event-driven applications on top of streams, and more. With Kafka, Oracle provides high-volume data ingestion and storage with real-time processing.
Spotify:
Spotify's music streaming service has over 200 million users and 40+ million available tracks. Kafka is the key component of its log delivery system, adopted to accommodate growing volumes of data and to reduce the average transfer time for logs from four hours to ten seconds. Spotify's previous production load peaked at roughly 700 thousand events per second; thanks to Kafka, Spotify can support 2 million events per second from a single data center. Now, the daily transport of data from all hosts to a centralized repository is quick and effective.
As four very influential tech players take advantage of Apache Kafka, what does the future hold for this technology?
In recent years, Kafka has dominated event streaming at the highest scale, and the big data world relies on it heavily to ingest and transport large amounts of data quickly. However, as demand for machine learning initiatives rises, Kafka alone is no longer sufficient.
Kafka is not cloud-native, does not scale horizontally in an effective manner, and does not separate storage from compute. Additionally, Kafka's brokers do not track which consumers a topic has or who has consumed which messages, leaving that bookkeeping to the consumers. As data volumes explode in an ML context, Kafka users are left provisioning excess hardware and performing constant performance tuning to avoid costly outages.
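The consumer-side bookkeeping described above can be made concrete with a small sketch: the log itself records nothing about its readers; each consumer simply keeps and advances its own offset. This is an illustrative model with hypothetical names, not a real Kafka client.

```python
class OffsetConsumer:
    """Toy sketch of consumer-managed position tracking: the log (a plain
    list here) knows nothing about who has read what; each consumer keeps
    its own offset and advances it after every poll. Hypothetical names."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # this consumer's private bookmark into the log

    def poll(self, max_records=10):
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)  # "commit" by advancing our own pointer
        return batch

log = ["e1", "e2", "e3"]
a, b = OffsetConsumer(log), OffsetConsumer(log)
a.poll(max_records=2)  # consumer A reads e1, e2
print(a.poll())        # ['e3'] - A resumes where it left off
print(b.poll())        # ['e1', 'e2', 'e3'] - B is unaffected by A
```

The upside is that the broker stays simple; the downside, as the paragraph above notes, is that every consuming application must manage (and recover) its own position, which multiplies operational work as the number of ML data feeds grows.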
Newer technologies like Apache Pulsar add data processing capabilities essential for feeding data into analytics and AI applications. Additionally, Pulsar offers higher overall throughput with lower latency to minimize data loss. Pulsar has recently gained real traction in the market: companies including Capital One, Verizon Media, Splunk, Tencent, Yahoo Japan, and many others have adopted the technology.
Here are a few of Pulsar's key features:
- Stateless brokers
- All-in-one streaming and queuing
- Distributed ledgers rather than logs
- Easy geo-replication
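The "all-in-one streaming and queuing" feature comes from Pulsar's subscription modes: an exclusive subscription gives one consumer the whole ordered stream, while a shared subscription spreads messages across consumers like a work queue. The sketch below models those two delivery patterns in memory; it is a simplified illustration, not the pulsar-client API.

```python
from itertools import cycle

def dispatch(messages, consumers, mode):
    """Toy model of Pulsar-style subscription modes: 'exclusive' feeds a
    single consumer the full ordered stream (streaming semantics), while
    'shared' distributes messages round-robin (queue semantics).
    Simplified sketch with hypothetical names."""
    delivered = {c: [] for c in consumers}
    if mode == "exclusive":
        for m in messages:
            delivered[consumers[0]].append(m)  # one reader, full order
    elif mode == "shared":
        for c, m in zip(cycle(consumers), messages):
            delivered[c].append(m)  # round-robin work sharing
    return delivered

msgs = ["m1", "m2", "m3", "m4"]
print(dispatch(msgs, ["c1"], "exclusive"))     # c1 gets all four, in order
print(dispatch(msgs, ["c1", "c2"], "shared"))  # work split across c1 and c2
```

Supporting both patterns against the same topic is what lets one Pulsar deployment replace a separate streaming platform and message queue.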
Rising Apache Innovator – Powered by Apache Pulsar:
Pandio helps companies connect their data to AI/ML models in the cloud more efficiently, leveraging Apache Pulsar as its core operating framework. For data science teams, Pandio's hosted solution frees up time to focus on tuning and operationalizing ML models. For software architects, the technology is a catalyst for shifting applications, databases, and systems toward a distributed environment. From the CTO's perspective, Apache Pulsar delivers 2.5x the performance of any other messaging platform available at just 60% of the cost. For the CFO, Pandio's hosted solution is tuned and optimized by means of a federated neural network, eliminating the need for a costly DevOps team. And for the CEO, Pandio is a catalyst for AI/ML adoption, making the access, ingestion, and movement of data at scale a reality.
About the Author:
Stephanie Moore is a Data Scientist at The Data Standard, the premier user community for Data Science, AI, ML, and Cyber Security thought leaders. Stephanie's background is in data wrangling, machine learning, and data visualization. Prior to working at The Data Standard, Stephanie graduated from the University of California San Diego as a scholar-athlete and Data Science major. She is passionate about turning complex data into actionable insights through analysis and storytelling, with the hope of using her skills to create innovative solutions with a worldwide impact.