Best Practices for Data Pipeline Costs and Setting Up Guardrails?

Senior Data Engineer at Taro Community · a month ago

Our AWS clusters run a large number of instances, and multiple teams deploy new data workflows every week.

We once had a situation where a team member added numerous nodes to a cluster for a backfill operation, resulting in an unexpectedly high bill that shocked us.

To prevent such incidents in the future, we're exploring ways to streamline monitoring and accurately calculate the cost of each instance. However, this can be time-consuming.

How does your company establish guardrails to prevent unexpectedly high costs from the use of expensive machines?


Discussion (4 comments)
  • Staff Software Engineer [L6] at Google · a month ago

    I don't know the AWS feature set in detail, but here are some ideas you can combine:

    • Set up quotas: configure a soft and a hard limit on how many nodes a cluster can have, perhaps with quotas per team (see the sketch after this list).
    • Two-party control: any change to the nodes in a cluster must be approved by one additional person.
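
    Since EMR comes up later in this thread, here's a minimal sketch of what a node quota could look like, assuming EMR managed scaling and boto3 are options for you; the cluster ID and the limits are placeholders, not a prescription.

```python
# Sketch: cap how many instances a cluster can scale to via EMR managed
# scaling. The cluster ID and the limits below are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLECLUSTERID",  # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,   # floor for normal operation
            "MaximumCapacityUnits": 20,  # hard ceiling, even for backfills
        }
    },
)
```

    A team-level quota on top of this would likely need something custom (e.g. a scheduled check on total running instances per team), since the managed scaling limit is per cluster.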
    • Senior Data Engineer [OP] · Taro Community · 25 days ago


      Right now, the platform team lets us do our thing with a decentralized setup, giving us the freedom to deploy jobs on our own. We can't tweak soft and hard limits ourselves, but if the platform team decides to set them, we just have to roll with it, and it might not help us much.

      Maybe the second approach would be easier to adopt: any cluster change would need approval from an additional person.

  • Taro Community · a month ago

    Just a few quick thoughts.

    > To prevent such incidents in the future, we're exploring ways to streamline monitoring and accurately calculate the cost of each instance. However, this can be time-consuming.

    You probably don't want to calculate it manually (whether "each instance" refers to each EMR cluster, node, or incident). AWS has a daily cost report, and it should be straightforward to set up alarms / emails on top of it: https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html. I've heard my manager gets an AWS usage report monthly (or at some other cadence), and based on that AWS doc, a daily report seems possible.
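
    One low-effort way to get those alert emails, if AWS Budgets fits your setup, is a budget with a notification threshold. A rough sketch, where the account ID, budget name, amount, and email address are all placeholders:

```python
# Sketch: email alert when actual monthly spend crosses 80% of a limit,
# using AWS Budgets. Account ID, name, amount, and address are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "data-pipeline-monthly-cap",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "data-team@example.com"}
            ],
        }
    ],
)
```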

    You mentioned "multiple teams". Is all of your org's EMR cost landing on your team's bill? If so, it's worth considering a campaign to have each team use its own account / bill. This can feel like playing organizational politics to kick the bill around, but I think it's fair to do so; maybe some teams don't even realize they have a huge bill.
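
    Even without separate accounts, you can get a per-team breakdown from Cost Explorer if the resources carry a team tag. A minimal sketch, assuming a hypothetical "team" cost-allocation tag is already applied to your clusters:

```python
# Sketch: one month's cost grouped by a (hypothetical) "team"
# cost-allocation tag, via the Cost Explorer API.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # formatted as "team$<value>"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${cost:,.2f}")
```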

    Depending on your specific situation, maybe it's time to reconsider the computation itself or the query. Are you using unnecessary joins, and what are the bottlenecks of the query? Is there duplicated computation across different workflows? Are you using parallelization effectively? (For example, Spark needs dynamic resource allocation (DRA) enabled to actually use more nodes, a different file-read approach could help, and there's more performance-tuning material at https://spark.apache.org/docs/latest/sql-performance-tuning.html.)
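
    On the DRA point, a minimal sketch of enabling it with an explicit executor cap in PySpark; the executor counts are placeholders, and depending on your cluster manager you may rely on an external shuffle service instead of shuffle tracking:

```python
# Sketch: dynamic resource allocation with a hard executor cap, so a job
# can scale out without grabbing the whole cluster. Counts are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("backfill-with-dra-cap")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")  # hard ceiling
    # Allows executors to be released without an external shuffle service.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```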

    Process-wise, it's good to bring your team together to discuss ideas rather than having individuals think really hard in silos. More discussion usually leads to more ideas, and then you can funnel down which ones to chase.

    • Senior Data Engineer [OP] · Taro Community · 25 days ago

      Right now, the platform team lets us do our thing with a decentralized setup, giving us the freedom to deploy jobs on our own. We can't tweak soft and hard limits ourselves, but if the platform team decides to set them, we just have to roll with it, and it might not help us much.

      I'm on the product team, and I only have control over what we deploy. So far we've just reminded everyone to be mindful of costs. However, some team members, in their rush to deliver quickly and run a backfill ASAP, have been deploying multiple times until the load kicks in, and we have no monitoring in place for that right now.

      Certainly, some jobs could use optimization. We're currently refactoring and fine-tuning some of those tasks.