218 Commits

Author SHA1 Message Date
Matt Wells
b8886c399c show start/end job times on pagecrawlbot. 2014-05-21 13:55:01 -07:00
Daniel Steinberg
6afa3f2561 save spots to disk as space separated 2014-05-14 14:40:46 -07:00
Matt Wells
20a2729827 added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
mwells
ac5cf7971b more misc updates. 2014-04-05 18:09:04 -07:00
Daniel Steinberg
0efac8c156 Defect #2080: seed URLs duplicated 2014-03-25 17:25:55 -07:00
Daniel Steinberg
ab90c06d8d add TODO for regex checking 2014-03-25 13:05:43 -07:00
Daniel Steinberg
1ff6c1fae0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan 2014-03-25 12:53:37 -07:00
Daniel Steinberg
b8836745f0 use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122) 2014-03-25 12:51:08 -07:00
Daniel Steinberg
f27d549fc6 Defect #2122: If a crawl and there are no urlCrawlPattern or urlCrawlRegEx values, only return URLs from that domain 2014-03-11 19:46:38 -07:00
Daniel Steinberg
85a5954256 only apply Defect #2099 updates if it's a bulk job. I didn't see that variable yesterday 2014-03-11 18:52:14 -07:00
Matt Wells
312438a32b Merge branch 'diffbot-dan' into diffbot-testing 2014-03-11 17:02:59 -07:00
Matt Wells
84784d8d76 minor fixups 2014-03-11 17:02:24 -07:00
Daniel Steinberg
f9fdc96563 no use in newline separating the list of urls if they're going to be read back in and need to be space separated 2014-03-10 15:22:43 -07:00
Daniel Steinberg
e293d465a3 snprintf instead of sprintf 2014-03-10 14:03:28 -07:00
Daniel Steinberg
41e3988fbc not a conf file 2014-03-10 13:57:13 -07:00
Daniel Steinberg
4a7bf5d4d0 Story #2040: store raw URL submissions for customer bulk jobs 2014-03-10 13:50:30 -07:00
Matt Wells
bd4484db3c Merge branch 'testing' into diffbot-testing 2014-03-10 12:08:23 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
27e8e810d2 use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
b1381cc610 make csv streamable, faster and take almost no memory. 2014-03-04 10:45:57 -08:00
Matt Wells
5f3aa24805 took out restrictDomain logic. now we always
only follow links on the same domain as the seed
UNLESS a url crawl pattern or a url crawl regex
was specified.
2014-02-27 19:53:17 -08:00
Matt Wells
f11e25024a Merge branch 'diffbot' into diffbot-testing 2014-02-26 20:34:06 -08:00
Matt Wells
8208178c79 remove "Initial crawl request" dups from the urls.csv.
do not count fake firstip spider requests attempts
in xmldoc.cpp as crawlbot page download attempts
since we just re-add that request with the correct
firstip and bail. it basically doubles this count
from what users would expect.
2014-02-26 15:48:52 -08:00
Matt Wells
33c8123288 more fixes for new link info code. 2014-02-25 13:53:41 -08:00
Matt Wells
b58d88c57f fix sections infinite loop bug. 2014-02-25 11:09:07 -08:00
Matt Wells
ceb623bb8f do not dedup bulks.
only respider urls if error is tmp.
mess with msg1 in spider.cpp so niceness
is MAX_NICENESS and not 0 because it was
not able to trigger a doledb dump.
2014-02-23 20:04:46 -08:00
Matt Wells
ae2aed7066 try to fix a few cores from deleting collections.
try to spider urls again if user changes
certain crawling parms. like regex, patterns, etc.
2014-02-18 09:44:15 -08:00
Matt Wells
0a4963f597 do not allow spot/seeds to be added to collnum being repaired
or rebuilt.
2014-02-16 15:18:50 -08:00
Matt Wells
cd6069e5a6 send single space to socket if not streaming
and search results still not ready after 10 seconds.
send it every 10 seconds to prevent client from closing socket.
sped up all downloads, json and csv, by not doing "fuzzy"
deduping of search results, but just deduping on page
content hash. added TcpSocket::m_numDestroys to ensure we
do not send heartbeat on a socket that was closed and
re-opened for another client.
2014-02-13 08:45:13 -08:00
Matt Wells
a9737ea97d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-12 21:20:01 -08:00
Matt Wells
d42e2377e7 return json download as search results now.
all smokes have passed.
2014-02-12 21:19:32 -08:00
Matt Wells
25eae3da39 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-12 13:21:57 -08:00
Matt Wells
0e48bbcea9 fix a core from bad return values 2014-02-12 13:21:30 -08:00
Matt Wells
69fa6662bc EDOCUNCHANGED fixes for diffbot 2014-02-10 16:23:39 -08:00
Matt Wells
f420bd2769 checkpoint 2014-02-09 15:09:48 -07:00
Matt Wells
e593b6e1de basic controls code checkpoint. 2014-02-08 15:10:06 -07:00
Matt Wells
dabd691626 basic admin controls page structure 2014-02-08 00:34:45 -07:00
Matt Wells
2d4af1aefe index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
Matt Wells
63e95c3b2d show lastSpidered time at end of json item.
it's a float so we should probably store it
as an int as well. we lose 128 seconds of resolution.
2014-02-06 18:56:38 -08:00
Matt Wells
8d534b8ed8 many more fixes for streaming mode 2014-02-06 18:21:22 -08:00
Matt Wells
874311ae52 fixes for streaming mode. 2014-02-06 16:28:42 -08:00
Matt Wells
4cfe69a96f minor link updates 2014-02-06 14:41:33 -08:00
Matt Wells
f9dbd64056 get streaming time sliced results working 2014-02-06 14:25:44 -08:00
Matt Wells
c60dcf4ecb show userobots for bulk jobs 2014-02-05 15:45:39 -08:00
Matt Wells
392d043bd8 undo canonical deduping.
added dump round stats when uploading
json files.
2014-01-31 14:53:49 -08:00
Matt Wells
7b424a6236 always use kstart.
fixed restrictDomain bug of not saving parm.
sped up csv download around 2x.
2014-01-28 14:37:21 -08:00
Matt Wells
8f39c41962 just print out cached page straight, it is
just the diffbot json reply pretty much
verbatim, except for being tokenized.
should no longer escape forward slashes.
2014-01-28 11:04:53 -08:00
Matt Wells
1a9a5e53a7 show if coll has urls ready to spider in html page 2014-01-25 14:49:55 -08:00
Matt Wells
321fc90ff6 fix some cores.
NOTE: emails disabled here... need to fix.
2014-01-24 12:07:28 -08:00
Matt Wells
27b6ceffa8 fix bug of sending notification email twice
for really really tiny jobs.
2014-01-23 21:22:39 -08:00
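
Note on the "128 seconds of resolution" mentioned in commits 63e95c3b2d and 2d4af1aefe: the figure is consistent with storing a Unix timestamp in a 32-bit IEEE 754 float, which has a 24-bit significand. For values between 2^30 and 2^31 (any timestamp from 2004 through 2038) adjacent representable floats are 2^7 = 128 apart. The sketch below only illustrates that arithmetic. It is not code from the repository: it assumes the stored value is a single-precision float, uses Python rather than the project's C++ for brevity, and the helper names and the sample timestamp are hypothetical.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between adjacent float32 values near a positive, normal x."""
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)  # 24 significand bits in float32


# Hypothetical spider time: a Unix timestamp from early February 2014.
spidered = 1391749074.0

print(float32_spacing(spidered))  # gap between neighbouring float32 values
print(to_float32(spidered))       # value actually kept by a float32 field
# Two timestamps 26 seconds apart become indistinguishable as float32:
print(to_float32(spidered) == to_float32(spidered + 26.0))
```

Indexing the same value as an integer, as commit 2d4af1aefe describes, keeps one-second resolution, so sorting by spider date no longer groups documents into 128-second buckets.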