RedisDays 2019 - New York: How Shopify is Scaling Up Its Redis Message Queues

Please welcome Mochi, infrastructure engineer at Shopify.

Hello everyone, I'm very excited to be here today. This is my second time being invited to speak at RedisDay, so thanks for having me. I work for Shopify. Shopify is the leading omni-channel commerce platform, and what that means is that we allow businesses of all sizes to design, set up, and manage their businesses online on all sorts of channels, be it web, mobile, or brick-and-mortar POS, you name it. We allow anyone to sell anywhere.

On the technical side of things, we're one of the oldest and largest Ruby on Rails monoliths. We have over a thousand developers today, we merge over a thousand pull requests every day, we have seen traffic peaks of over 170,000 requests per second, and, relevant for my talk today, we process 2 billion background jobs every day on top of Redis. That is the responsibility of my team for the most part. So today I would like to give you an overview of how that works at Shopify and some of the challenges we face doing it, with a few examples of technical limitations and how we overcame them.

Background jobs at Shopify are used to process emails, process webhooks, and do delayed checkout processing and payment processing, so not very different from most web applications' use cases for background jobs: basically anything that can be used to speed up a web request. We also use them as the backbone of our database schema migrations. To encapsulate all this logic we use our own library, Hedwig. We first started by using Resque, but we had diverging needs from what the library supported, so we kept adding more patches and the complexity kept growing; moving to our own library was a worthwhile investment to encapsulate all the background queue operations that our workers need to do. What that translates to on the Redis side is a lot of queue operations, so a lot of list operations, a lot of popping and pushing, because queues are abstracted as lists inside of Redis. That's the main architectural takeaway.
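To make the list-based queueing pattern concrete, here is a minimal sketch in Ruby using the redis gem: producers LPUSH serialized job payloads and workers block on BRPOP. The queue name, payload shape, and worker loop are illustrative only; Shopify's Hedwig library is internal and its interface is not described in the talk.

```ruby
require "redis"
require "json"

redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379/0"))

# Producer side: serialize the job and push it onto the list for its queue.
job = { class: "DeliverWebhookJob", args: [42], enqueued_at: Time.now.to_i }
redis.lpush("queue:webhooks", JSON.generate(job))

# Worker side: block until a job is available, then process it.
loop do
  _queue, payload = redis.brpop("queue:webhooks", timeout: 5)
  next unless payload

  job = JSON.parse(payload)
  puts "processing #{job['class']} with args #{job['args'].inspect}"
end
```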
We have this thing at Shopify called flash sales. If you're familiar with hypebeast sneakers or very exclusive makeup sold online, a lot of our merchants are in the business of selling very limited-quantity items, and that drives huge amounts of traffic to our platform. This stresses the platform in many ways, but rather than steer our merchants away from something that was basically putting us on the brink of breaking every day, we fully embraced these flash sales and are making them a core feature of the platform. We encourage our merchants to have these sales, and it's actually a competitive advantage for us. But these sales can drive orders of magnitude more traffic from a single merchant onto our platform, and the platform needs to respond positively to that. A key feature of this traffic is that it is very write-heavy: a lot of bookkeeping needs to happen during a checkout, so inventory needs to be updated, payments need to be processed, user information needs to be updated, and so on.

Historically we started as a very simple Ruby on Rails monolith: web workers that process web requests, job workers that process jobs from Redis, and a single MySQL instance to hold all the persisted data. But with these flash sales, the write traffic started stressing that MySQL instance to the point where growing the single MySQL server started causing resiliency concerns. Yes, we can get a bigger machine for MySQL, with a bigger CPU, and handle all those writes, but having a single MySQL instance that can fail because of a single flash sale is a serious single point of failure. So MySQL was the very first thing that we had to horizontally scale, and we partitioned it into shards that have the exact same schema but hold different subsets of merchants. If we have a hundred thousand merchants, we can have ten shards of ten thousand merchants each and route traffic to each shard as needed. Scaling up workers is usually not as big of a deal: you can just provision new nodes and bring them into rotation, and that is super easy to do with things like Kubernetes.

So that is a snapshot of our architecture around three years ago: we had a reliable way of scaling MySQL and an easy way of scaling our compute power, but we had no need, and then no way, of scaling Redis as it started to become a bottleneck as well. So we decided to piggyback on the exact same partitioning scheme we used for MySQL and apply the same idea to Redis, into what we call a Shopify pod. Now each subset of Shopify stores has its queues on a single Redis instance, and the compute power is shared across all these pods. This solves a few things: in the case of flash sales we can share job workers between different pods, which gives us capacity elasticity, and secondly it ensures fairness, because if one pod is catching fire, we know the other pods are protected from it. This is the main feature we built into the platform to handle flash sales.

However, in the past year we have been starting to see the limitations of this approach. It works very well, but we need to do some more work. We're now at the point where a single one of those flash sales can entirely overwhelm a single Redis instance: the sale generates enough queueing traffic, so queueing commands like LPOP and LPUSH and so on, that we can hit the maximum CPU usage of the Redis server. We also found out this was partly due to some inefficient usage patterns in our internal library. When this happens we usually see other negative symptoms, such as latency spikes that lead to cascading failures in different parts of the platform.

One thing we do to help in such instances is use circuit breakers. Circuit breakers are a resiliency abstraction around our Redis clients that keep track of the failure rate of the upstream Redis server and open the circuit if a certain error threshold is reached. When the circuit is open, the client fails immediately rather than trying to send the request to a struggling Redis server. The end goal is that we fail fast: if a single pod is failing, the job worker moves on to another pod, and we hopefully allow the Redis server to recover, because essentially we dynamically disconnect it when it's struggling. However, this is potentially costly. This graph shows two spikes of open circuits that we've seen during a flash sale. When the circuit is open, our application code falls back to default values, so if the Redis server can't respond, hopefully there's some fallback mechanism in place. But there are instances where bugs or omissions in the fallback mechanism, or other issues, can put us in an inconsistent state, and that is very painful to have to explain to our merchants; it is essentially a customer-facing incident at that point.
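A circuit breaker in this style can be sketched in a few lines of Ruby. The thresholds, the cool-down behaviour, and the error class handled below are assumptions for illustration; the talk does not describe Shopify's actual implementation.

```ruby
require "redis"

class CircuitOpenError < StandardError; end

class RedisBreaker
  def initialize(redis, error_threshold: 5, reset_timeout: 10)
    @redis           = redis
    @error_threshold = error_threshold # consecutive failures before opening
    @reset_timeout   = reset_timeout   # seconds to stay open before a retry
    @failures        = 0
    @opened_at       = nil
  end

  # Run the block against Redis, failing fast while the circuit is open
  # instead of piling more requests onto a struggling server.
  def call
    raise CircuitOpenError, "circuit open, failing fast" if open?

    result = yield @redis
    @failures = 0
    result
  rescue Redis::BaseError
    @failures += 1
    @opened_at = monotonic_now if @failures >= @error_threshold
    raise
  end

  private

  def open?
    return false unless @opened_at
    return true if monotonic_now - @opened_at < @reset_timeout

    # Cool-down elapsed: close the circuit and let a trial request through.
    @opened_at = nil
    @failures  = 0
    false
  end

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# breaker = RedisBreaker.new(Redis.new)
# breaker.call { |r| r.brpop("queue:checkout", timeout: 1) }
```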
One thing we had to look into first is what we at Shopify call the error queue. Inside a Ruby process, when an exception occurs, we need to generate a payload with metadata, stack traces, and so on for that exception and send it to Bugsnag, the third-party exception aggregator that we use. That requires an API call to Bugsnag, which we wrap in a background job so it can be executed asynchronously. This was a fine use case for Redis and our background job system, but with massive flash sales it turned out that flash sales can also cause spikes of exceptions. So now Redis is being overwhelmed by error reporting rather than by handling the flash-sale commands, which meant that during peak capacity a fixed amount of the Redis CPU was not being used to help us recover from the degraded state.

For this reason we decided that a message streaming bus was better suited for this use case, and we moved error reporting on top of Kafka. You must be thinking, why are you here talking about Kafka, are you crazy? But this was purely pragmatic: we already had a very experienced team operating Kafka at Shopify, and we were able to move error reporting out of Redis in a matter of weeks. We built a simple Kafka consumer in Go and made our web and job workers produce payloads to Kafka instead, and Kafka is also scalable, though that is outside the scope of today. The consumer then took care of relaying these payloads to Bugsnag. By doing this we freed up around 25% of CPU capacity on Redis during peak loads, which was now fully usable for queuing and dequeuing, which is what we want.
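On the producer side, publishing an exception payload to Kafka instead of enqueueing a Redis-backed job might look roughly like the sketch below. The ruby-kafka gem, the topic name, and the payload fields are assumptions; the talk only says that web and job workers produce payloads to Kafka and that a Go consumer relays them to Bugsnag.

```ruby
require "kafka" # ruby-kafka gem, assumed here; the talk does not name a client
require "json"

KAFKA = Kafka.new(ENV.fetch("KAFKA_BROKERS", "kafka:9092").split(","),
                  client_id: "exception-reporter")

# Instead of going through the Redis-backed error queue, the worker publishes
# the exception payload to a topic; a separate consumer forwards it to Bugsnag.
def report_exception(error)
  payload = {
    class:       error.class.name,
    message:     error.message,
    backtrace:   error.backtrace&.first(50),
    reported_at: Time.now.to_i,
  }
  KAFKA.deliver_message(JSON.generate(payload), topic: "exception-payloads")
end
```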
Another problematic pattern came from our ever-growing compute power, that is, the number of processes that talk to Redis, either to process web requests or to work off jobs. We started seeing way too many connections: because we shared job workers across different pods, which gave us both fairness and elasticity, we were also making each Redis instance connect to every single worker in the cluster. That meant around 20% of the Redis CPU at any given point in time was just busy handling connections, reconnecting, retrying, and so on, which is a lot of overhead. Our first attempt at mitigating this was to partition the workers into subsets that connected to randomly assigned Redis instances. That meant we still had some capacity sharing, a little less, but we substantially reduced the number of connections each Redis had to maintain. However, a truly future-proof solution was to use a proxy. We're currently in the process of deploying Envoy as a proxy, and on top of solving the problem of too many connections, since the proxy can maintain a connection pool, this comes with a lot of benefits. It also allows us to distribute the load of Redis commands across a pool of Redis servers, even though we're not there yet; for jobs we still have a single Redis instance for now. It also lets us set up high availability, so we can do a master-and-replica setup and deploy with zero downtime and so on. That is something we're investing in.

Another challenge we had to optimize in order to scale our job queues is that some background jobs require locking. Developers expect to be able to enqueue a job and not have any other job of the same type, with the same parameters, execute at the same time; basically, execution uniqueness. For this we use Redis to store the uniqueness locks: before a job worker picks up a job, it acquires the lock for that job, processes the job, and then releases the lock. A key thing, though, is that many of the background jobs that execute during flash sales, such as checkout and payment processing, have this constraint; they need to be unique. So again, during peak traffic and peak usage of Redis, these locking operations were a significant overhead on top of Redis. For this we decided to stay on Redis but dedicate a separate instance. We were able to come up with a zero-downtime migration scheme that allowed us to safely acquire and release locks on a totally separate instance and dedicate the previous Redis instance exclusively to queuing operations. We have a blog post on the topic written by a teammate; you can grab the URL from the slides, otherwise come talk to me later and I'll be happy to share it with you.
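The lock-around-a-job pattern described above can be sketched with SET NX EX against a dedicated locks instance and an atomic compare-and-delete release. The key naming, TTL, and helper below are illustrative; the teammate's blog post mentioned in the talk describes Shopify's actual migration and scheme.

```ruby
require "redis"
require "securerandom"
require "digest"
require "json"

LOCKS = Redis.new(url: ENV.fetch("LOCKS_REDIS_URL", "redis://localhost:6379/1"))

# Release the lock only if we still own it (compare-and-delete done atomically
# inside Redis via a small Lua script).
RELEASE_LOCK = <<~LUA
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  end
  return 0
LUA

def with_unique_execution(job_class, args, ttl: 300)
  key   = "unique:#{job_class}:#{Digest::SHA256.hexdigest(JSON.generate(args))}"
  token = SecureRandom.uuid

  # SET NX EX acquires the lock only if no identical job currently holds it;
  # the TTL guards against workers that die without releasing.
  return :skipped unless LOCKS.set(key, token, nx: true, ex: ttl)

  begin
    yield
  ensure
    LOCKS.eval(RELEASE_LOCK, keys: [key], argv: [token])
  end
end

# with_unique_execution("ProcessPaymentJob", [checkout_id]) { process_payment! }
```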
Ultimately, no matter how well we optimize our usage of Redis, we anticipate that the flash sales are going to keep getting bigger and bigger, so eventually queuing operations themselves will be enough to overwhelm a single Redis instance's CPU. For that reason we need to explore ways of distributing the job queues themselves across multiple instances, and we're currently looking into two potential solutions. The first and easy way is to assign each job queue a separate Redis instance: if we have ten queues, for example, we provision ten Redis instances and make workers connect to each instance and enqueue jobs there as needed, dynamically. The downside is that this is a huge operational overhead, because if we have a hundred pods and ten queues, we now have a thousand Redis instances. The other downside is that these queues are not equally important or equally busy: the webhook queue might be far more demanding and need much more Redis capacity, while a low-traffic queue is basically empty. Another approach is to horizontally distribute every single queue across a fixed number of Redis instances per pod. This is nice because we can achieve equal usage across a pool of Redis instances, but the problem then is how we make workers aware of that cluster. We run into the problem Salvatore alluded to earlier: you want the worker not to be aware of the cluster, but there is still a cluster behind it, so operations that need multiple queues become tricky, among other limitations. The thing is, though, that having a proxy like what we're doing with Envoy will allow us to distribute commands across partitions and to scale up and down more easily than without that proxy setup.

In conclusion, we think that scaling your Redis infrastructure is really about knowing your usage patterns really, really well. In our case the driving factor was the single-tenant traffic we see during flash sales. This forced us to look deeper into our Redis use cases and evaluate the way forward in each case. Each case had a separate performance and scalability bottleneck that we addressed in a pragmatic and simple way: for asynchronous error reporting over HTTP we leveraged a message streaming technology, to deal with an overload of connections we went with a proxy, for locking operations we went with a dedicated Redis instance, and ultimately, for scaling the queues themselves, we're looking at horizontal scaling and employing a cluster of instances rather than a single instance. If you'd like to talk about any of this or have questions, come find me; I'll be here throughout the day and I'm always happy to talk. If you have questions about Shopify, we're also hiring and all that stuff. All right, thanks.
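The second queue-scaling option described in the talk, spreading every queue across a fixed pool of Redis instances per pod, could look roughly like the following sketch. This is an illustration of the idea only, not Shopify's implementation: the shard URLs are made up, it glosses over ordering and fairness, and it ignores the multi-queue operations the talk notes become tricky; in practice a proxy such as Envoy could own this routing instead of the workers.

```ruby
require "redis"
require "json"

# Hypothetical fixed pool of queue shards for one pod.
SHARD_URLS = %w[redis://queues-0:6379 redis://queues-1:6379 redis://queues-2:6379]
SHARDS = SHARD_URLS.map { |url| Redis.new(url: url) }

# Producer: spread individual jobs of the same queue across the pool.
def enqueue(queue, job)
  SHARDS.sample.lpush("queue:#{queue}", JSON.generate(job))
end

# Worker: poll each shard with a short blocking timeout until a job shows up.
def dequeue(queue)
  SHARDS.shuffle.each do |shard|
    _key, payload = shard.brpop("queue:#{queue}", timeout: 1)
    return JSON.parse(payload) if payload
  end
  nil
end
```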
