2022.4
Using the STAR framework: Situation, Task, Action, Result
S: We had a new requirement from a customer: make two additional columns selectable to extract. We ran the performance test with 100 concurrent users, and it failed at a 30% error rate due to out-of-memory errors in the container, even though the extract API is limited to 250K records per extract. The reason is that adding two more columns to the data retrieval takes much more heap during query time. One solution was to add sort keys on these two columns, but we had already reached the deadline.
T: Figure out whether we could still ship the feature safely despite the failed performance test.
A: First, I calculated how much extra memory we would need in the new scenario. Second, I pulled the production data from the last 3 months in terms of users and the total requests made by each, and found that the peak concurrent-user count was less than 20. Then I emailed the customer to explain the situation, and fortunately we got their approval, confirming they would avoid invoking the API with more than 100 concurrent users.
R: We were lucky to deliver this feature successfully and on time.
L: Performance testing is critical, and we have to take the worst-case situation into consideration.
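For reference, the deferred fix, adding sort keys on the two new columns, would look roughly like this as a one-off JDBC job; a minimal sketch where the connection details, table, and column names are all hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddSortKeys {
        public static void main(String[] args) throws Exception {
            // Connect to the Redshift cluster over JDBC (placeholder URL/credentials).
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:redshift://cluster-host:5439/dev", "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Redshift can alter a table's sort key in place; "extract_data",
                // "new_col_a" and "new_col_b" are hypothetical names.
                stmt.execute("ALTER TABLE extract_data ALTER COMPOUND SORTKEY (new_col_a, new_col_b)");
            }
        }
    }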
----------------------------
5 users - pass for the new IBM field columns
10+ users - fail with a high error rate for the new IBM field columns
I understand there is a plan to fix this in the next NGRA 22.4 release. We'll need this in writing.
Also, we (product team?) need to prove IBM will not use 10+ users concurrently; then we can sign off. Otherwise we'll have to roll back this change.
This is what the last 4 weeks look like in Prod and UAT in terms of users and the total requests made by each. There are only 2-3 customers in each env who are consuming the api at a meaningful rate..
Per month approx
So about 300 per day, and approx 25 per hour (assuming a 12-hour day).
Would the APIs be called spread through the day, or, say, at 9am at the beginning of the day for example?
yes.. so in general we have only 2 users who can compete with each other for large volumes of data..
the 9370 requests are spread across the 1 month
which is a very low rate
only 390 requests per day..
OK, so we are OK with the 5-user API test results, as IBM requests would be 32.5 per hour total, correct? (assuming a 12-hour day)
Prod autoscaling for the api is 12 times the current SAT capacity.. we are running only 1 node in SAT, but in prod, if there is more load, it can spin up to 12 nodes....
Nice. What is the starting API node count in Prod?
in prod IBM is the only customer who is running these high volumes (1 user); the second user is Orica.. and so on... Prod will run a minimum of 2 nodes
S: We were changing report generation from a single synchronous pull into an async task.
The old mechanism: only when the user logged in again would we pull the data from the S3 bucket's long-lived folder (never expires, but transitions to Glacier after 7 days) into a standard folder (which also gets glaciered after 7 days).
Our new mechanism: whenever a file is uploaded into the S3 bucket, S3 sends a notification to an SQS queue, and the queue consumer processes it by moving the file from the long-lived folder into the standard folder. That means the file gets moved without the user needing to log in.
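A minimal sketch of what that queue consumer might look like with the AWS SDK for Java v2; the bucket name, queue URL, folder prefixes, and object key are hypothetical, and the real consumer would parse the key out of the S3 event JSON in the message body:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
    import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
    import software.amazon.awssdk.services.sqs.model.Message;
    import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

    public class ReportMover {
        static final String QUEUE_URL =
                "https://sqs.us-east-1.amazonaws.com/123456789012/report-events"; // hypothetical
        static final String BUCKET = "report-bucket"; // hypothetical

        public static void main(String[] args) {
            SqsClient sqs = SqsClient.create();
            S3Client s3 = S3Client.create();
            while (true) {
                // Long-poll the queue for S3 upload notifications.
                for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(QUEUE_URL).waitTimeSeconds(20).maxNumberOfMessages(10).build())
                        .messages()) {
                    // Normally parsed from the S3 event JSON in m.body(); fixed here for brevity.
                    String key = "long-lived/report-123.csv"; // hypothetical
                    // "Move" = copy into the standard folder, then delete the original.
                    s3.copyObject(CopyObjectRequest.builder()
                            .sourceBucket(BUCKET).sourceKey(key)
                            .destinationBucket(BUCKET)
                            .destinationKey(key.replaceFirst("^long-lived/", "standard/"))
                            .build());
                    s3.deleteObject(DeleteObjectRequest.builder().bucket(BUCKET).key(key).build());
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(QUEUE_URL).receiptHandle(m.receiptHandle()).build());
                }
            }
        }
    }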
After we deployed to SAT, we found a corner case we had forgotten to handle.
A client requested the export and left; then the new deployment came in. If he didn't log in to pull the file during that interval, the file got lost, as the new mechanism only handles newly generated files.
T:
A: After diagnosing the root cause, I quickly applied a hotfix that enabled both mechanisms.
R: The sync code was re-enabled while the async mechanism remained: a double-safe guarantee.
Learned: Pay attention to every small detail during the development process, and document them in time so that we won't forget. It is also critical to have a backup solution for the worst case. I need to make sure that everything I deliver is fully qualified.
Mistake (earn trust, customer obsession):
-- Message queue missing old scheduled reports; API performance test failed.
But from this mistake, I learned that details are definitely important, but I also need to pay attention to the whole schedule. I need to always keep good communication with my teammates about my plan, and make sure that my schedule won't affect others' schedules.
-- S3 event notifications: we only had the eventType as Put and forgot "multipart upload completed"; we found the root cause in the DLQ (see the sketch below).
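A sketch of the corrected notification setup with the AWS SDK for Java v2; subscribing to all ObjectCreated events (s3:ObjectCreated:*, which covers Put, Post, Copy, and CompleteMultipartUpload) avoids that gap. Bucket name and queue ARN are hypothetical:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.Event;
    import software.amazon.awssdk.services.s3.model.NotificationConfiguration;
    import software.amazon.awssdk.services.s3.model.PutBucketNotificationConfigurationRequest;
    import software.amazon.awssdk.services.s3.model.QueueConfiguration;

    public class FixBucketNotifications {
        public static void main(String[] args) {
            S3Client s3 = S3Client.create();
            QueueConfiguration queueConfig = QueueConfiguration.builder()
                    .queueArn("arn:aws:sqs:us-east-1:123456789012:report-events") // hypothetical
                    // s3:ObjectCreated:* covers Put AND CompleteMultipartUpload,
                    // instead of only Event.S3_OBJECT_CREATED_PUT.
                    .events(Event.S3_OBJECT_CREATED)
                    .build();
            s3.putBucketNotificationConfiguration(PutBucketNotificationConfigurationRequest.builder()
                    .bucket("report-bucket") // hypothetical
                    .notificationConfiguration(NotificationConfiguration.builder()
                            .queueConfigurations(queueConfig)
                            .build())
                    .build());
        }
    }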
"Message 1001 not found" - how to resolve it?
We were seeing the intermittent error "Message 1001 not found", even after increasing the pool size and checking the cluster metrics.
Checked email threads with AWS & TIBCO --> enabled the driver's spy logging plus auditing and monitoring.
Manually replaced the driver jar to get a more detailed error message.
Created a new JNDI connection instead of a plain JDBC one.
Email attached at the end.
Changed from JDBC to JNDI.
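A minimal sketch of that change, assuming the app server exposes the Redshift pool under a JNDI name (the URL, credentials, and JNDI name are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    public class ConnectionLookup {
        // Before: a direct JDBC connection created by the application itself.
        static Connection viaJdbc() throws Exception {
            return DriverManager.getConnection(
                    "jdbc:redshift://cluster-host:5439/dev", "user", "password"); // placeholders
        }

        // After: a container-managed, pooled connection looked up via JNDI.
        static Connection viaJndi() throws Exception {
            InitialContext ctx = new InitialContext();
            DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/RedshiftDS"); // hypothetical name
            return ds.getConnection();
        }
    }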
Bug --> dive deep into the different components to find where the problem occurred.
Challenging work:
Move from SAP to JasperSoft (move Oracle to Postgres)
DynamoDB load to GCP via BigQuery
Move Redshift to GCP (current work)
Design the scheduled report to make it schedulable (beyond the scope; makes the class more extensible in the future.)
So through this process, I learned that it is so important to be curious and keep learning; the more you read and learn, the more problems you can solve. The feeling of ownership is really important: the product is just like your own child. So "I don't know how to do it" will never be an excuse.
Migration Service:
maintain the old one, deal with old-product production issues,
POC
Document
High-level design meeting; break down the epic into subtasks and estimate
After the solution got approved, set up discussions with the several teams involved, like DB, development, QA
Request new resources from DevOps, e.g. building a custom pipeline
DB design, discussion with the DBA
Write the test plan
Take care of the deployment: pre-deployment instructions, providing the documentation and passing it to the deployment team, e.g. manually updating a system variable, or updating a Parameter Store value / Secrets Manager secret (see the sketch below).
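As one concrete example, a deployment-time Parameter Store update of the kind handed to the deployment team might look like this with the AWS SDK for Java v2 (the parameter name and value are hypothetical):

    import software.amazon.awssdk.services.ssm.SsmClient;
    import software.amazon.awssdk.services.ssm.model.ParameterType;
    import software.amazon.awssdk.services.ssm.model.PutParameterRequest;

    public class UpdateParamStore {
        public static void main(String[] args) {
            SsmClient ssm = SsmClient.create();
            // Overwrite an existing configuration value as part of the release steps.
            ssm.putParameter(PutParameterRequest.builder()
                    .name("/reporting/feature-x/enabled") // hypothetical
                    .value("true")
                    .type(ParameterType.STRING)
                    .overwrite(true)
                    .build());
        }
    }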
Tight deadline (customer obsession, bias for action):
-- Messaging queue failed; enabled both approaches to be safe
-- API Performance Test failed.
Firstly, I figured out a temporary alternative solution with my manager. If I can't finish tasks before the deadline, I discuss it with my colleagues to figure out a way to improve efficiency, and if necessary, I use my private time to keep working on the task. After all, finishing the task with high quality as soon as possible is what we want. I'll never sacrifice the customer experience or the quality of the product because of a deadline; customer experience is always the most important thing, and we must make sure the product we deliver is qualified. We can sacrifice our own time to try to finish the tasks. If we still cannot finish them, we communicate with customers and the related people, tell them why, and earn their trust, while trying our best to finish as soon as possible.
Conflict:
Tell me about a time when you disagreed with others and how you persuaded them.
Make it data-driven, not opinion-driven. If data is not available, use testing to gather the data.
Be confident. ...
Introduce a logical argument. Initially, your coworker might resist, but you can use a logical argument to explain that he/she is better equipped to handle that section, making both of you look good and helping the company in the process.
Make it seem beneficial to the other party. ...
Choose your words carefully. ...
Use flattery. ...
Be patient, but persistent.
Dead-letter queue scheduled long polling every 6 hours --> DataDog monitoring + a manual message-processing API (take both; see the sketch after this list)
Fast-moving queue / slow-moving queue (design a rule, rule service) (ME) -> two ECS services
API: add sort key / add timeout.
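A minimal sketch of the scheduled DLQ drain with the AWS SDK for Java v2 (the queue URL is hypothetical); the every-6-hours cadence would come from whatever scheduler runs the job:

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
    import software.amazon.awssdk.services.sqs.model.Message;
    import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

    public class DlqDrainJob {
        static final String DLQ_URL =
                "https://sqs.us-east-1.amazonaws.com/123456789012/report-events-dlq"; // hypothetical

        // Invoked on a fixed schedule (e.g. every 6 hours).
        public static void drainOnce() {
            SqsClient sqs = SqsClient.create();
            for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(DLQ_URL)
                    .waitTimeSeconds(20)      // long polling
                    .maxNumberOfMessages(10)
                    .build()).messages()) {
                // Reprocess or log the dead-lettered message; with DataDog in place,
                // an alert on queue depth can trigger the manual-processing API instead.
                System.out.println("DLQ message: " + m.body());
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(DLQ_URL).receiptHandle(m.receiptHandle()).build());
            }
        }
    }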
Invent and simplify something / creative:
Separate an independent health-check connection pool and introduce a callable-level request timeout that returns 429 (see the sketch after this list)
Fast-moving queue / slow-moving queue (design a rule, rule service) -> two ECS services
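A minimal sketch of the callable-level timeout (framework-agnostic; the pool size, 5-second timeout, and status handling are illustrative):

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class TimeoutGuard {
        // A pool dedicated to request work, kept separate from the health-check
        // pool so slow queries can't starve health checks.
        static final ExecutorService REQUEST_POOL = Executors.newFixedThreadPool(16);

        static final int HTTP_TOO_MANY_REQUESTS = 429;

        // Wraps the actual work in a Callable and bounds how long we wait for it.
        static int handle(Callable<String> work) throws Exception {
            Future<String> future = REQUEST_POOL.submit(work);
            try {
                future.get(5, TimeUnit.SECONDS); // per-request timeout
                return 200;                      // the real handler would return the body too
            } catch (TimeoutException e) {
                future.cancel(true);             // free the worker thread
                return HTTP_TOO_MANY_REQUESTS;   // shed load instead of piling up
            }
        }
    }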
Ownership/ helping peers:
Git command / Java formatting:
An intern messed up the git history, and it seemed he didn't know how git works or which commands he had to use to meet the requirement. We used our after-work time, sat together, and drew a picture of
how git manages the develop, master, release, and feature branches.
If there is a conflict, how to resolve the conflict.
Then I showed him how to squash, cherry-pick, and amend commits.
Then I showed him how we cut the release branch, and how we merge back to develop if there is a hotfix.
I also let him practice on his messed-up feature branch and how to restore it, and he fixed it with my help. I also mentored him on having patience, as these skills take time to develop but are worth it in the end.
As a beginner developer, it is critical to build good dev habits.
I also told him our standards for git, plus some formatting stuff. A few of the commands we walked through are sketched below.
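A rough sketch of the commands we walked through (branch names and the commit hash are hypothetical):

    # cut a release branch from develop
    git checkout -b release/22.4 develop

    # squash the last 3 commits on a feature branch interactively
    git rebase -i HEAD~3

    # amend the most recent commit (message or contents)
    git commit --amend

    # pick a single hotfix commit onto the release branch
    git checkout release/22.4
    git cherry-pick abc1234

    # merge the hotfix back to develop
    git checkout develop
    git merge release/22.4

    # restore a messed-up feature branch to a known-good state
    git reflog
    git reset --hard HEAD@{2}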
-- "Message 1001" intermittent error: ask senior people as early as possible and look into the source code
AWS email:
Hi Mauricio,
Thanks for reaching out to AWS Premium Support. This is Ramya, I will be assisting you with this case.
Based on case notes, I understand that you are seeing intermittent error "Message 1001 not found", even after increasing the pool size and checking the cluster metrics.
To troubleshoot further, can you please share the below details:
Run this query and share the results in Excel along with headers: select * from STL_ERROR where pid = ; (when you see the error again)
The log file from when the error occurred.
Is this issue observed recently, or from the time you started to use JDBC?
Which JDBC tool are you using?
Did this error show up after any recent updates or installation of the tool?
JDBC version
As this is an intermittent issue, can you run the query which failed in JDBC immediately in the AWS console and let me know if you see any issues.
Date when the issue started
Will be waiting for your response!!
Best regards, Ramya V. Amazon Web Services
-----------------------
We have been seeing this issue for around a year and we have not prioritized solving it as it was infrequent and hard to reproduce.
However, we have seen an increase in failures when we try to execute queries in all our environments and different Redshift instances (we have separate instances for UAT and PROD environments and two regions, us-east-1 and eu-central-1). The only feedback we get is the message "Message 1001 not found" in our logs.
We tried to solve the issue by troubleshooting the queries we found generating this error, but recently we have observed that there is no pattern and the error can happen to any query. Also the error happens randomly, sometimes the same query executes successfully and the next try it can fail.
We use a JDBC connection. We have also tried to solve it by increasing the pool size of the connection without success. When we monitor the Database instance we see we are not close to using all the connections and the CPU use is not 100% so we don't see the issue being with running out of connections.
This issue is now affecting our UAT and PROD environments so we need an urgent solution.
Thanks,
Mauricio

Cluster Name/Region or Cluster ARN: a206447-deal-server-preprod-develop-us-east-1-redshift
Complete Error Message: Message 1001 not found
Timeframe in UTC of start of the issue:
QueryIDs:
API failed performance test
API sort key / add request timeout
Fast-moving queue / slow-moving queue (design a rule, rule service) -> two ECS services
Design the scheduled report to make it schedulable (beyond the scope; makes the class more extensible in the future.)
sync - async (message queue): enabled both approaches to be safe
S3 event notifications: only had the eventType as Put, forgot multipart upload completed; found the root cause in the DLQ.
Dead-letter queue scheduled long polling every 6 hours --> DataDog monitoring + a manual message-processing API (take both)
"Message 1001" intermittent error / source code investigation
DevOps work & git
BigQuery migration