Hi Randy,
Actually it’s quite a big misconception that PHP’s sleep() is a resource hog. It tells the kernel of any modern OS to take the process off the run queue entirely: the process sits there blocked, consuming no CPU at all, and the scheduler simply ignores it until the timer expires and wakes it back up. It doesn’t spin; it costs next to nothing.
pepsi@planb:~$ time php
<?php
echo time()."\n";
sleep(30);
echo time()."\n";
?>
1344217588
1344217618
real 0m31.040s
user 0m0.100s
sys 0m0.078s
The only reason the real time is over 30 seconds is that it took me 1.04s to paste the quick program and hit ^D to have it parsed and executed. The actual combined time spent in the kernel and userspace was 0.178 seconds, and I dare say the bulk of that went on parsing my pasted program rather than on the sleep itself, which did essentially nothing.
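To see the contrast for yourself, compare a blocking sleep with an equivalent busy-wait. This is a rough sketch and the figures in the comments are ballpark, not exact:

```shell
# Blocking sleep: the process is taken off the run queue, so the kernel
# charges it almost no CPU time for the whole two seconds.
time sleep 2
# real ~2.0s, user+sys ~0.00s

# Busy-wait of the same length: real time is identical, but user+sys
# climbs because the loop (and the `date` forks) keep a core busy throughout.
time sh -c 'end=$(($(date +%s) + 2)); while [ "$(date +%s)" -lt "$end" ]; do :; done'
```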
With regard to your infolog, 500 MiB of data isn’t a great deal. I’ve seen a few instances where the “SQL ninja” inside of me could write a more efficient query, but I’ve thought that portability is sometimes worth more than custom SQL that takes full advantage of the feature set and foibles of a single RDBMS and doesn’t work at all in anything else; that’s the trade-off when writing portable SQL. Your best bet is probably to activate the slow-query log, EXPLAIN whatever takes time, and see where you can optimise the database rather than the SQL. I know you can get automatic database-tuning systems, but in our experience these cause merry hell with eGW when they start making sweeping changes. 20 seconds does horrify me quite a bit, though; even a tenth of that would be of grave concern.
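If you go the slow-query-log route, a minimal sketch looks something like this (the variable names are the stock MySQL 5.1+ ones; the EXPLAIN is a made-up infolog query, so substitute whatever the log actually catches):

```shell
mysql -u root -p <<'SQL'
-- Log every statement slower than one second, without a server restart.
SET GLOBAL slow_query_log  = ON;
SET GLOBAL long_query_time = 1;
SHOW VARIABLES LIKE 'slow_query_log%';

-- Then EXPLAIN the offenders the log turns up; a hypothetical example:
EXPLAIN SELECT * FROM egw_infolog
 WHERE info_owner = 5
 ORDER BY info_startdate DESC
 LIMIT 20;
SQL
```

Watch the EXPLAIN output for full table scans (`type: ALL`) and missing keys; an index usually fixes those without touching the portable SQL itself. (Server-configuration fragment; it needs a live MySQL to run against.)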
With regard to your issue proper, I now understand that it’s definitely tied to one specific installation and not affecting the others. Because of this I highly doubt your delay is coming from Apache, since Apache would similarly stop serving pages for all the other VirtualHosts on the machine. It seems far more likely to come from the database, perhaps from a locked table, or you may be hitting the maximum-connections limit or some other database parameter.
Note that if you send a SIGSTOP to the MySQL server your header file connects to, or create some other kind of blocking condition, Apache will sit there for a while before timing out (or, depending on your configuration, it may sit there indefinitely), and you may not get any error message if the browser gives up first. That would affect only the one eGW installation (assuming all four connect to different databases). The hack you suggested just seems like the wrong way to go about it, and I’m not convinced it will work if the problem is DB-related.
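Here is a tiny self-contained illustration of what SIGSTOP does, using a harmless `sleep` as a stand-in for mysqld; anything waiting on the stopped process gets no error, it simply blocks until SIGCONT:

```shell
# Freeze a background process and confirm the kernel marks it stopped ('T'),
# then thaw it again. A stopped mysqld behaves the same way from a client's
# point of view: connections don't fail, they just hang.
sleep 60 &
PID=$!
kill -STOP "$PID"
ps -o stat= -p "$PID"   # first letter is 'T' (stopped)
kill -CONT "$PID"
ps -o stat= -p "$PID"   # typically back to 'S' (sleeping)
kill "$PID"
```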
Without some more investigating I don’t know what to suggest, and I don’t know much about the async jobs since we don’t use them much. Somebody else may have some ideas, but it’s all a bit pie-in-the-sky without looking in log files, gathering some metrics and running reproducible test cases. As I said in my last post about virtualisation, this might be the perfect time to give it a go. Clone your server and then you can test it, dismantle it and rip out its inner workings to discover the issue, all in the safety of a self-contained sandbox where it can’t do any damage and you can simply throw it away afterwards.
The data I mentioned is about 2.5 TiB of MySQL data containing a lot of pERP information; the rest is supporting file assets for that data. It’s an example of where direct database import systems had to be written to extract and load the relevant data, and they still take several hours to run a full batch. But I tell you, it was easier to get it into pERP than it would have been to get it into anything SAP. Nathan has done a good job with the schema.
I think you should wait for the problem to occur and simultaneously look at the Apache scoreboard and the MySQL processlist to get a signpost pointing in the right direction. Good luck,
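For the record, the two snapshots I mean are these (mod_status has to be enabled for the scoreboard, and the host/credentials are placeholders; both need your live servers, so this is an operational sketch rather than something to run as-is):

```shell
# Apache scoreboard: a wall of 'W' (sending reply) workers all stuck on the
# one vhost suggests requests waiting on something downstream of Apache.
curl -s http://localhost/server-status

# MySQL's view at the same moment: look for large 'Time' values and 'Locked'
# in the State column, and note which database the offenders belong to.
mysql -u root -p -e 'SHOW FULL PROCESSLIST'
```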
WLD
Our custom app relies heavily on infolog. It’s only 500 MB of data, but a query can take up to 20 seconds. On the other hand we have 25 GB in filemanager/files with no major issues there, except for loading infolog in fileselect.
I have one hog of a function that faxes, and as a matter of poor design, while waiting for the confirmation we use PHP sleep(), which just spins the server’s wheels waiting for a response. However, other users are not affected by that bottleneck; only the user who started that session has to wait for its completion before Apache responds to another request from that user.
The async functions seem to not allow any other users access to that instance. It seems almost like an Apache configuration change would be needed, but at this point I was looking for a cheat as opposed to researching :-P.
What sort of data are you storing to get to 7-8 TiB, out of curiosity?
I am thinking maybe a hack to start up with a new install ID, assuming async is started by cron and not a user session. I have not got into the code yet to know whether that would work.
Thanks,
Randy
eGroupWare-developers mailing list
eGroupWare-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/egroupware-developers