- Benchmarking in the Real World: Apache Spark™ 3.5.0 on EC2
In our conversations with clients, we've found that Spark is a go-to tool for many data engineering and analytics tasks. It's battle-tested, robust, and capable of handling virtually any amount of data. However, a common challenge we hear about is the difficulty of managing an organization's Spark deployment. Which clusters should be used? How many clusters are needed? Should you use EMR or EMR Serverless? The answer often depends on the specific characteristics of your organization's data and workloads. Still, there are some general guidelines that can improve your total cost of ownership (TCO) without resorting to "it depends."

At Underspend, we've benchmarked various Spark programs on different EC2 instance types and found a difference of over 100% in the cost of running a given program. The key takeaway is that it's worth investing time in the choice of instances used for Spark, as the cost difference can be significant.

Our results are based on a real-world PySpark program provided by one of our clients. The program does not use UDFs, translates to clean Spark code, and runs on Spark 3.5.0. While the TPC-DS benchmark is commonly used and definitely has its uses, we've found that this program is more representative of what companies run in the real world.

Here are our findings: the most cost-effective instances were c7a.xlarge and c7g.xlarge, both showing meaningful improvements over their previous generation (c6a and c6g). The difference between the most expensive instance (r6i.xlarge) and the cheapest ones is $0.54, which is 142% of the cost of the query on the cheapest instance.

Your workloads might differ in various aspects (e.g., memory usage, data) from the one we tested. Nevertheless, if you're running a workload frequently, it's advisable to benchmark it on various instance types and pick the one that achieves the best total cost of ownership.

Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
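To make the per-run comparison concrete, here is a minimal sketch of the arithmetic involved. The instance types are the ones mentioned above, but the hourly prices, runtimes, and node count are illustrative placeholders, not our measured benchmark results or current AWS list prices.

```python
# Illustrative sketch: estimate the cost of one benchmark run per instance type.
# Hourly prices and runtimes below are placeholders, NOT measured results.
hourly_price = {            # $/hr, assumed/illustrative values
    "c7a.xlarge": 0.205,
    "c7g.xlarge": 0.145,
    "c6a.xlarge": 0.153,
    "r6i.xlarge": 0.252,
}

runtime_seconds = {         # wall-clock time of the same job, placeholder values
    "c7a.xlarge": 1100,
    "c7g.xlarge": 1250,
    "c6a.xlarge": 1400,
    "r6i.xlarge": 1600,
}

def cost_per_run(instance: str, nodes: int = 4) -> float:
    """Cost of a single run = node count * runtime in hours * hourly price."""
    return nodes * (runtime_seconds[instance] / 3600) * hourly_price[instance]

for instance in sorted(hourly_price, key=cost_per_run):
    print(f"{instance}: ${cost_per_run(instance):.3f} per run")
```

Running the same job on each candidate instance type and feeding the measured runtimes into a comparison like this is usually enough to spot the cheapest option.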
- Programming Language Showdown: Rust vs. C vs. C++ for Software Optimization
"Which programming language should we use?" Embarking on a new project with a clean slate theoretically allows for the selection of any programming language. However, in practical scenarios, project choices are often constrained by factors like company standards and necessary integrations. At Underspend, TypeScript is our go-to choice. But when we optimize libraries for companies, we work within the environment that company is already operating in. Still, there is some wiggle room when choosing a language to work with. The ideal language varies based on the situation: In Java, the Java Native Interface (JNI) enables calling native applications and libraries in different languages. Yet, external calls through JNI are often performance-intensive. Due to a lack of built-in integration with native code, the choice of programming language does not impact performance. In Node.js, WebAssembly (WASM) allows code in C/C++/Rust to be compiled and executed at nearly native speed. This capability enables writing performance-intensive sections of a Node.js application in any low-level language that compiles to WASM. In Python, Cython provides a method to write inline C code, which is then compiled into the Python library. For optimizing Python applications, we leverage Cython to craft performance-critical code segments. In scenarios where the overhead of invoking external libraries is not a major concern, alternative languages may be considered, but C is often preferred for its performance advantages when integrated with Cython. A similar principle applies to Golang and its cgo feature. Sometimes it's fun to build using a wide variety of languages. Other times you're reminded why people mostly no longer write code in C :)
- Striking the Balance: Compute Savings vs. Networking Costs in the Public Cloud
Managing costs in public cloud environments such as AWS, Azure, and GCP is a complex challenge. One appealing solution is the use of spot instances. Also known as "spot VMs" on Azure and GCP, these instances offer cheaper compute with the caveat that they might be interrupted on short notice (120 seconds on AWS, 30 seconds on Azure and GCP). In theory, if your workload can handle interruptions, you should always use spot instances. It's also recommended to expand the search for spot instances across all of the availability zones in a region to secure the best price.

However, in practice, the cost of compute isn't the sole consideration. Companies that extensively use spot instances may find themselves with a large, unexpected networking bill due to the traffic between the spot instance and other services.

A closer look at the numbers, for an m5.large workload that runs for 5 minutes:

- On-demand price: $0.096/hr
- Average spot price: $0.061/hr
- Spot price in availability zone us-east-1b: $0.0488/hr

Should you use us-east-1b for your workload? The savings would be $0.0122/hr, or approximately $0.001 for the 5-minute run. In AWS, the list price for data transfer between availability zones in us-east-1 is $0.01/GB. If this workload transfers more than about 100 MB during its lifetime, the switch to us-east-1b will cost more than you are saving (the arithmetic is sketched below).

One might ask: how do I know how much data is being transferred? In some cases, it's fairly simple to benchmark based on a few examples. But if you want a deeper view into your networking costs, feel free to reach out at datatransfer@underspend.com. We have a powerful tool that provides a full picture of who in your network is talking to whom, and how much it's costing you.
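Here is the break-even arithmetic from the example above as a small script. The hourly prices, the 5-minute runtime, and the $0.01/GB inter-AZ rate are the figures quoted in the post.

```python
# Break-even check: cheaper spot AZ vs. the cross-AZ data transfer it may trigger.
avg_spot_price_per_hr = 0.061      # average m5.large spot price (from the example)
cheapest_az_price_per_hr = 0.0488  # us-east-1b spot price (from the example)
runtime_hours = 5 / 60             # a 5-minute run
inter_az_rate_per_gb = 0.01        # us-east-1 inter-AZ data transfer price

compute_savings = (avg_spot_price_per_hr - cheapest_az_price_per_hr) * runtime_hours
break_even_gb = compute_savings / inter_az_rate_per_gb

print(f"Compute savings per run:  ${compute_savings:.4f}")        # ~$0.0010
print(f"Break-even data transfer: {break_even_gb * 1024:.0f} MB")  # ~104 MB
```

Any cross-AZ traffic beyond roughly 100 MB per run wipes out the compute savings, which is why the AZ choice should never be made on the spot price alone.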
- Logging Cost Creep: Unpacking the Challenge
Software teams love to add logs to their code, and for good reason. High-quality logs allow you to debug production code and gain insights into user behavior without requiring new and elaborate tooling. But it's easy for logging to spin out of control, resulting in a dramatic increase in costs. Built-in cloud logging tools like Amazon CloudWatch, Azure Monitor Logs, and GCP Cloud Logging make this especially easy: a developer can add a simple console.log call to their code and later discover that this data is being saved in perpetuity in their cloud environment. The result can be thousands or even tens of thousands of dollars in unnecessary monthly spend that simply isn't providing value to the business.

We see logging cost creep at most of the clients we work with. For example, an international public company was paying over $3,000 a month for logs that consisted entirely of auto-generated translations. The translation feature was important, but the engineers at the company said that they never looked at these logs and didn't need them. Nobody remembered why they were added in the first place.

These kinds of cases aren't rare. In fact, most companies we work with discover a meaningful amount of logging cost creep, and the numbers can add up quite rapidly. Sadly, it's hard to understand the source of logging costs in the top three cloud environments. At Underspend we've built software that gives companies a granular view of their logging costs and their sources. So far this tool has helped companies save hundreds of thousands of dollars on an annualized basis, a figure which is rapidly growing. If you're interested in learning more, reach out to us at logs@underspend.com.
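Independent of any cost-analysis tooling, one practical first step is simply to stop retaining logs forever. As a minimal sketch (assuming boto3 is installed and AWS credentials are configured, with 30 days as an arbitrary example value), the following sets a finite retention period on every CloudWatch log group that currently never expires:

```python
# Minimal sketch: cap retention on CloudWatch log groups that keep data forever.
# 30 days is an arbitrary example value - choose what your debugging and
# compliance needs actually require.
import boto3

logs = boto3.client("logs")

for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        # Log groups without 'retentionInDays' are set to "Never expire".
        if "retentionInDays" not in group:
            print(f"Setting 30-day retention on {group['logGroupName']}")
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=30,
            )
```

This does not tell you which logs are worth keeping, but it puts an upper bound on how long an accidental console.log can keep costing you.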
- Block Storage: The Silent Cost Sink
When we talk to customers, we frequently hear that cloud compute (EC2, Compute Engine, Azure Virtual Machines) is the #1 source of cost in their cloud environment. Compute cost is considered difficult to reduce because it's at the core of many mission-critical applications, and engineering teams are wary of risking the production (or even dev) environment in the name of saving costs.

That said, there is one low-hanging fruit that companies tend to ignore: block storage (AWS EBS, GCP Persistent Disk, Azure Disk Storage). When a compute instance is spun up, it comes with an associated disk volume, which is billed separately from the instance itself. When a company runs many instances and does not police the use of these volumes, the cost can add up quite significantly. This is especially true for spot instances: a 100 GB volume is a common default for them, yet spot workloads very rarely come close to using that much space.

We recommend that any company with a significant compute deployment look into its block storage bill and reconsider the size of its compute volumes. We also offer a free tool that detects unused EBS volumes in all regions. The tool has already saved companies hundreds of thousands of dollars annually. Feel free to ask for a demo on the website or shoot us an email at blockstorage@underspend.com. No strings attached.

P.S. We're also developing cutting-edge technologies to reduce compute costs without changing your code and without risking the production environment. If you want to spend less on your compute, shoot us an email at compute@underspend.com.
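To give a sense of what such a check involves (this is not the Underspend tool, just a minimal sketch assuming boto3 and configured AWS credentials), the following lists unattached EBS volumes in every region along with their total size:

```python
# Minimal sketch: find EBS volumes not attached to any instance ("available"
# status) across all regions. Requires boto3 and AWS credentials.
import boto3

regions = [
    r["RegionName"]
    for r in boto3.client("ec2", region_name="us-east-1").describe_regions()["Regions"]
]

for region in regions:
    ec2 = boto3.client("ec2", region_name=region)
    unattached = []
    for page in ec2.get_paginator("describe_volumes").paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        unattached.extend(page["Volumes"])
    if unattached:
        total_gib = sum(v["Size"] for v in unattached)
        print(f"{region}: {len(unattached)} unattached volumes, {total_gib} GiB")
```

Note that this only catches volumes that are fully detached; oversized volumes still attached to (spot) instances require a look at actual disk usage.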
- Cost Drill Down: The New AWS Aurora I/O-Optimized Database
On May 10, Amazon announced a new configuration for AWS Aurora: Aurora I/O-Optimized. This new offering targets companies that heavily utilize I/O operations in their Postgres and MySQL databases, with the key selling point being zero cost for I/O operations. However, AWS has not completely relinquished revenue from I/O operations; it compensates for the I/O savings by charging a higher storage cost. So, is it a favorable option for your company? Naturally, the answer depends on your usage.

Let's take a look at the costs in the us-east-1 region. If you have a 5 TB database, for example, your break-even point for Aurora I/O-Optimized will be about 3.13 billion requests per month (the $625 difference in monthly storage cost equals $625 in I/O charges); the calculation is worked through below. To determine whether this is a worthwhile deal, check the recent I/O activity in the Monitoring tab of the relevant database in the RDS console.

Based on our experience, most companies do not have consistently high I/O intensity in their database usage. However, for companies that do, this could be an appealing offer. It might also be interesting for companies that use databases as caches, where there is relatively little data but a substantial number of I/O operations.

It's worth noting that you have the flexibility to switch between Aurora I/O-Optimized and Aurora Standard every 30 days. This billing flexibility can prove useful if you observe changes in your I/O patterns over time.
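Here is the break-even calculation spelled out. The prices used below (Aurora Standard storage at $0.10/GB-month plus $0.20 per million I/O requests, versus Aurora I/O-Optimized storage at $0.225/GB-month) are our assumption of the us-east-1 list prices at the time of writing; verify them against the current Aurora pricing page before deciding.

```python
# Break-even sketch for Aurora I/O-Optimized vs. Aurora Standard in us-east-1.
# Prices are assumed list prices at the time of writing - check the current
# Aurora pricing page before relying on them.
STANDARD_STORAGE_PER_GB_MONTH = 0.10    # $/GB-month, Aurora Standard
OPTIMIZED_STORAGE_PER_GB_MONTH = 0.225  # $/GB-month, Aurora I/O-Optimized
STANDARD_IO_PER_MILLION = 0.20          # $ per 1M I/O requests (Standard only)

db_size_gb = 5_000  # the 5 TB example from the post

extra_storage_cost = db_size_gb * (
    OPTIMIZED_STORAGE_PER_GB_MONTH - STANDARD_STORAGE_PER_GB_MONTH
)  # $625/month for 5 TB

# I/O-Optimized pays off once monthly I/O charges on Standard exceed that delta.
break_even_requests = extra_storage_cost / STANDARD_IO_PER_MILLION * 1_000_000

seconds_per_month = 30 * 24 * 3600
print(f"Extra storage cost: ${extra_storage_cost:.0f}/month")
print(f"Break-even I/O volume: {break_even_requests / 1e9:.2f} billion requests/month")
print(f"Equivalent average rate: {break_even_requests / seconds_per_month:,.0f} requests/second")
```

For the 5 TB example that works out to a sustained average of roughly 1,200 I/O requests per second; if the Monitoring tab shows read plus write activity consistently above that level, I/O-Optimized is likely the cheaper configuration, and otherwise Standard remains the better deal.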
- Compression Is (Almost) All You Need: Reducing Data Transfer Costs in the Cloud
Notoriously, the top sources of cost in the cloud are Compute, Database, Storage, and Big Data Analysis (e.g., BigQuery, AWS EMR, Azure Synapse). Another significant source of cost is Data Transfer ("Networking" in GCP, "Bandwidth" in Azure), which can be divided into two types: inter-AZ costs, and egress costs from the cloud environment to the Internet. For some companies these costs can accumulate significantly. Each type of traffic requires a different solution.

Egress from the cloud environment to the Internet is often a necessity driven by the company's business model and cannot be avoided. The best way to mitigate these costs is to add an external Internet endpoint capable of accepting compressed data. While this may not be a simple undertaking and requires resources outside of the cloud environment, it can result in substantial cost savings, with only a minor increase in the compute cost of compression.

Inter-AZ costs, on the other hand, are generally unnecessary and tend to accrue due to suboptimal placement of instances or services. The challenge lies in identifying these data transfer "leaks," since cloud providers do not make it easy to determine which service and instance are responsible for data transmitted across AZ lines. We have witnessed several organizations invest significant time attempting to understand the causes of high costs, such as AWS Data Transfer USE1-APS1-AWS-Out-Bytes, without success.

At Underspend, we have developed a free Data Transfer tool that operates within the cloud environment, analyzing flow logs and deployed services. This tool precisely identifies which instance, database, and service is communicating with whom and provides cost information for each interaction. The results obtained from the tool often empower DevOps teams to reallocate resources to the appropriate AZ, leading to substantial cost savings. In cases where resource relocation is not feasible, compressing the data before transferring it between AZs proves to be an effective solution. While compression may seem straightforward, its impact is remarkable (a minimal sketch appears below).

Using the Underspend Data Transfer tool, we have been able to save our customers millions of dollars annually, and we would be delighted to assist more companies. If you want to reduce your Data Transfer costs, please feel free to reach out to us at datatransfer@underspend.com or simply click on "Get A Demo" on our website.
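To show how little code the compression step itself requires, here is a minimal sketch that gzip-compresses a JSON payload before it is sent to an endpoint in another AZ (or outside the cloud). The payload, endpoint URL, and compression level are placeholders; the receiving side must of course be set up to accept a gzip-encoded request body.

```python
# Minimal sketch: compress a payload before sending it across AZ (or egress)
# boundaries. The URL and payload are placeholders; the receiver must accept
# Content-Encoding: gzip for this to work.
import gzip
import json
import urllib.request

payload = {"events": [{"id": i, "value": "some repetitive telemetry"} for i in range(1000)]}
raw = json.dumps(payload).encode("utf-8")
compressed = gzip.compress(raw, compresslevel=6)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes "
      f"({100 * len(compressed) / len(raw):.0f}% of original)")

request = urllib.request.Request(
    "https://collector.example.internal/ingest",  # placeholder endpoint
    data=compressed,
    headers={"Content-Type": "application/json", "Content-Encoding": "gzip"},
    method="POST",
)
# urllib.request.urlopen(request)  # uncomment once the receiving endpoint exists
```

For repetitive payloads such as logs and telemetry, the reduction is typically large enough that the extra CPU time is negligible next to the per-GB transfer charge.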