mwells
dcc775eae7
added more langs to url filters drop down
2014-09-21 18:16:11 -07:00
mwells
e45c0d32f6
Merge branch 'diffbot-testing' into testing
2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c
various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
58f5a2dd57
save conf files safely to disk so we don't
lose them because the disk is full.
2014-07-29 10:02:43 -07:00
mwells
0409571262
Merge branch 'diffbot-testing' into testing
Conflicts:
Spider.cpp
2014-07-28 14:37:44 -07:00
Matt Wells
343f783592
another fix for &restartRound=1
2014-07-28 13:58:36 -07:00
mwells
9347b1fc79
Merge branch 'diffbot-testing' into testing
Conflicts:
Collectiondb.cpp
Spider.cpp
2014-07-15 19:30:34 -07:00
Matt Wells
3421befd3a
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-07-15 16:10:50 -07:00
Matt Wells
c1c31c1364
fix for using more than 32k colls
2014-07-15 16:10:35 -07:00
mwells
cd48799030
try to fix core on neo
2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9
fix core from adding/deleting collection
2014-07-12 08:23:40 -07:00
mwells
2f8207ccf7
qa fixes
2014-07-11 19:07:49 -07:00
Matt Wells
b393a1bbbe
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
2014-07-10 10:06:55 -07:00
mwells
0da6063983
bring tags back in site list / url filters.
2014-07-10 07:44:16 -07:00
mwells
6434e5cc04
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
Parms.h
2014-07-07 09:49:59 -07:00
Matt Wells
886063a3bd
fixes for query reindex.
2014-07-03 12:24:14 -07:00
Matt Wells
e6dd317664
Merge branch 'diffbot-testing' into diffbot-matt
2014-06-30 11:37:12 -07:00
Matt Wells
5e39b7870d
fix for bad crawl info stats
2014-06-30 10:53:11 -07:00
Matt Wells
98b317b421
Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
Parms.cpp
Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
2227d1fca7
Merge branch 'diffbot-matt' of github.com:gigablast/open-source-search-engine into diffbot-matt
Conflicts:
Collectiondb.cpp
2014-06-27 17:18:20 -07:00
Matt Wells
2137e150e7
Merge branch 'testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
Make.depend
Parms.cpp
2014-06-27 17:17:14 -07:00
Matt Wells
3162c83473
add some debug msgs
2014-06-27 08:28:28 -07:00
Matt Wells
e9ff8c48d8
try to remove the sluggishness from
all hosts... should really reduce load.
2014-06-25 17:46:28 -07:00
mwells
651f0f27ac
only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
Matt Wells
48a98df71d
make &s=20000 search much faster by skipping
generation of first 20000 summaries if
deduping is off, site clustering is off and
gigabit generation is off (&dr=0&sc=0&dsrt=0).
turn gigabits off on load for all customcrawls(diffbot)
2014-06-23 14:44:21 -07:00
mwells
6da972704b
bring back custom home page html into search controls
2014-06-21 09:57:51 -07:00
mwells
a09d4cd723
Merge branch 'master' into diffbot-matt
Conflicts:
Collectiondb.cpp
Pages.cpp
XmlDoc.cpp
gb.conf
2014-06-20 09:35:39 -07:00
Matt Wells
1bef36c03c
emergency bug fixes
2014-06-18 05:04:45 -07:00
mwells
584af942d4
Merge branch 'testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
Make.depend
Parms.cpp
2014-06-16 20:42:28 -07:00
mwells
928456b102
fix merge
2014-06-15 15:05:28 -07:00
mwells
2be7c78f2f
Merge branch 'diffbot-testing' into testing
Conflicts:
Collectiondb.cpp
Parms.cpp
2014-06-15 15:02:17 -07:00
mwells
5c0b371dc9
Merge branch 'testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
HttpServer.cpp
Make.depend
Parms.cpp
Parms.h
2014-06-13 11:00:09 -07:00
Matt Wells
ab7717d065
now use &roundStart=0 to trigger the next crawl round.
now assume all crawl jobs are "repeat", but those that
have repeat of "0" just assume 10 year frequency, 3652.5 days.
that way the &roundStart=0 will do another round of crawling
for them as well.
2014-06-11 18:45:58 -07:00
Matt Wells
365f29b293
made &spiderRoundStart=1 (or 0) force the next
spider round to begin.
also added pageUrl to XmlDoc::getContentHashJSON32()
so it's not included in the hash to fix some spider-time
deduping issues.
2014-06-10 14:20:41 -07:00
mwells
108c281c33
fix annoying bug when adding new parms.
2014-06-10 12:29:50 -07:00
Matt Wells
56af753c3e
fixed nasty bug of resetting RdbBases for
random collnums, causing data loss and corruption.
2014-06-09 10:16:29 -07:00
Matt Wells
74f0a41290
bulk jobs give up after downloading a url
3 times. crawls don't give up on tmperrors,
but retry every 30 days.
2014-06-05 23:11:14 -07:00
Matt Wells
970eb33a83
sanity checks to ensure fakefirstip
was able to convert to a real good firstip
2014-06-05 16:13:33 -07:00
Matt Wells
d98cf4b2b0
try to prevent slamming diffbot backend
with bulk jobs consisting of hundreds of
different domains/ips.
2014-06-04 12:37:49 -07:00
Daniel Steinberg
fc5cfa2a62
move list of bulk urls to new directory earlier. May fix Defect #2218 if there is something that is causing the bulk job to restart before this function returns
2014-05-15 13:35:32 -07:00
mwells
45b8bb3421
log msg cleanups
2014-05-11 21:55:44 -07:00
mwells
2b37f56e4c
Merge branch 'diffbot-matt' into testing
2014-05-10 07:56:45 -07:00
mwells
38a79888b6
Merge branch 'diffbot-testing' into testing
2014-05-10 07:49:29 -07:00
mwells
6048ae849b
added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
3bf52f0f2d
if "wpid" is supplied try to update sitelist
for that wpid. hopefully we can get the wp admin
tools to send a /search?wpid=xxxx&sites=xyz.com request so
we can start spidering those sites before they even see the
widget. also it is simpler than trying to update m_siteListBuf
each time someone does a query since those can be hundreds
a second.
2014-05-07 16:10:26 -07:00
mwells
ff121a76d9
fix formatting bugs
2014-04-30 14:17:39 -06:00
Matt Wells
066a01cba6
Merge branch 'diffbot-testing' into diffbot-matt
2014-04-28 14:15:02 -07:00
Matt Wells
e21e0a404c
fixed bug for product title extraction.
titledb-saved.dat tree loop corruption bug.
no main coll bug.
put the ajax widget on spider status page so you can
see spider going in realtime. will give customers
a good idea of the spider moving along.
more widget fixes, to use new base64 thumbs, etc.
2014-04-28 13:30:24 -07:00
Matt Wells
20a2729827
added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
Matt Wells
f3c06ced57
try to fix core from deleting coll
2014-04-25 11:52:17 -07:00