Matt Wells
b8886c399c
show start/end job times on pagecrawlbot.
2014-05-21 13:55:01 -07:00
Daniel Steinberg
6afa3f2561
save spots to disk as space separated
2014-05-14 14:40:46 -07:00
Matt Wells
20a2729827
added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
mwells
ac5cf7971b
more misc updates.
2014-04-05 18:09:04 -07:00
Daniel Steinberg
0efac8c156
Defect #2080 : seed URLs duplicated
2014-03-25 17:25:55 -07:00
Daniel Steinberg
ab90c06d8d
add TODO for regex checking
2014-03-25 13:05:43 -07:00
Daniel Steinberg
1ff6c1fae0
Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
2014-03-25 12:53:37 -07:00
Daniel Steinberg
b8836745f0
use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122 )
2014-03-25 12:51:08 -07:00
Daniel Steinberg
f27d549fc6
Defect #2122 : If it's a crawl and there are no urlCrawlPattern or urlCrawlRegEx values, only return URLs from that domain
2014-03-11 19:46:38 -07:00
Daniel Steinberg
85a5954256
only apply Defect #2099 updates if it's a bulk job. I didn't see that variable yesterday
2014-03-11 18:52:14 -07:00
Matt Wells
312438a32b
Merge branch 'diffbot-dan' into diffbot-testing
2014-03-11 17:02:59 -07:00
Matt Wells
84784d8d76
minor fixups
2014-03-11 17:02:24 -07:00
Daniel Steinberg
f9fdc96563
no use in newline separating the list of urls if they're going to be read back in and need to be space separated
2014-03-10 15:22:43 -07:00
Daniel Steinberg
e293d465a3
snprintf instead of sprintf
2014-03-10 14:03:28 -07:00
Daniel Steinberg
41e3988fbc
not a conf file
2014-03-10 13:57:13 -07:00
Daniel Steinberg
4a7bf5d4d0
Story #2040 : store raw URL submissions for customer bulk jobs
2014-03-10 13:50:30 -07:00
Matt Wells
bd4484db3c
Merge branch 'testing' into diffbot-testing
2014-03-10 12:08:23 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
27e8e810d2
use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
b1381cc610
make csv streamable, faster and take almost no memory.
2014-03-04 10:45:57 -08:00
Matt Wells
5f3aa24805
took out restrictDomain logic. now we always
only follow links on the same domain as the seed
UNLESS a url crawl pattern or a url crawl regex
was specified.
2014-02-27 19:53:17 -08:00
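The link-following rule described in commit 5f3aa24805 can be sketched roughly as follows. This is only an illustration, not the code from Spider.cpp: the function name `should_follow` and its parameters are hypothetical, the crawl pattern is assumed to be a plain substring match, the regex is assumed to take precedence when both are given, and comparing hostnames stands in for whatever "same domain" test the crawler really uses.

```python
import re
from urllib.parse import urlsplit


def should_follow(link, seed, url_crawl_pattern=None, url_crawl_regex=None):
    """Decide whether a discovered link should be followed (hypothetical helper)."""
    # Assumption: a regex, when given, replaces the same-domain restriction.
    if url_crawl_regex:
        return re.search(url_crawl_regex, link) is not None
    # Assumption: a crawl pattern is a plain substring of the URL.
    if url_crawl_pattern:
        return url_crawl_pattern in link
    # Default: stay on the seed's host (a simplification of "same domain").
    return urlsplit(link).hostname == urlsplit(seed).hostname
```

With no pattern or regex, `should_follow("http://other.org/a", "http://example.com/")` is rejected, while the same link is accepted once `url_crawl_pattern="/a"` is supplied.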
Matt Wells
f11e25024a
Merge branch 'diffbot' into diffbot-testing
2014-02-26 20:34:06 -08:00
Matt Wells
8208178c79
remove "Initial crawl request" dups from the urls.csv.
do not count fake firstip spider request attempts
in xmldoc.cpp as crawlbot page download attempts
since we just re-add that request with the correct
firstip and bail. it basically doubles this count
from what users would expect.
2014-02-26 15:48:52 -08:00
Matt Wells
33c8123288
more fixes for new link info code.
2014-02-25 13:53:41 -08:00
Matt Wells
b58d88c57f
fix sections infinite loop bug.
2014-02-25 11:09:07 -08:00
Matt Wells
ceb623bb8f
do not dedup bulks.
only respider urls if error is tmp.
mess with msg1 in spider.cpp so niceness
is MAX_NICENESS and not 0 because it was
not able to trigger a doledb dump.
2014-02-23 20:04:46 -08:00
Matt Wells
ae2aed7066
try to fix a few cores from deleting collections.
try to spider urls again if user changes
certain crawling parms, like regex, patterns, etc.
2014-02-18 09:44:15 -08:00
Matt Wells
0a4963f597
do not allow spot/seeds to be added to collnum being repaired
or rebuilt.
2014-02-16 15:18:50 -08:00
Matt Wells
cd6069e5a6
send single space to socket if not streaming
and search results still not ready after 10 seconds.
send it every 10 seconds to prevent client from closing socket.
sped up all downloads, json and csv, by not doing "fuzzy"
deduping of search results, but just deduping on page
content hash. added TcpSocket::m_numDestroys to ensure we
do not send heartbeat on a socket that was closed and
re-opened for another client.
2014-02-13 08:45:13 -08:00
Matt Wells
a9737ea97d
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-02-12 21:20:01 -08:00
Matt Wells
d42e2377e7
return json download as search results now.
all smokes have passed.
2014-02-12 21:19:32 -08:00
Matt Wells
25eae3da39
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-02-12 13:21:57 -08:00
Matt Wells
0e48bbcea9
fix a core from bad return values
2014-02-12 13:21:30 -08:00
Matt Wells
69fa6662bc
EDOCUNCHANGED fixes for diffbot
2014-02-10 16:23:39 -08:00
Matt Wells
f420bd2769
checkpoint
2014-02-09 15:09:48 -07:00
Matt Wells
e593b6e1de
basic controls code checkpoint.
2014-02-08 15:10:06 -07:00
Matt Wells
dabd691626
basic admin controls page structure
2014-02-08 00:34:45 -07:00
Matt Wells
2d4af1aefe
index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
Matt Wells
63e95c3b2d
show lastSpidered time at end of json item.
it's a float so we should probably store it
as an int as well. we lose 128 seconds of resolution.
2014-02-06 18:56:38 -08:00
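The "128 seconds of resolution" figure in the two commits above follows from the 32-bit float format, assuming the spider date is a Unix timestamp in seconds stored as a single-precision float. A 2014 timestamp lies between 2**30 and 2**31, and with 23 stored mantissa bits, adjacent floats in that range are 2**(30 - 23) = 128 apart. A minimal sketch (the helper name `to_float32` and the sample timestamps are illustrative, not taken from the source):

```python
import struct


def to_float32(value):
    """Round a number to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


# Two hypothetical spider timestamps one minute apart (Unix seconds, 2014).
first = 1391748864
second = first + 60

# Both collapse onto the same 32-bit float, so a sort cannot order them.
collapsed = to_float32(first) == to_float32(second)

# Spacing between adjacent 32-bit floats in [2**30, 2**31).
spacing = 2 ** (30 - 23)
```

Storing the same value as an integer keeps every second distinct, which is why the later commit indexes numbers as integers as well.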
Matt Wells
8d534b8ed8
many more fixes for streaming mode
2014-02-06 18:21:22 -08:00
Matt Wells
874311ae52
fixes for streaming mode.
2014-02-06 16:28:42 -08:00
Matt Wells
4cfe69a96f
minor link updates
2014-02-06 14:41:33 -08:00
Matt Wells
f9dbd64056
get streaming time sliced results working
2014-02-06 14:25:44 -08:00
Matt Wells
c60dcf4ecb
show userobots for bulk jobs
2014-02-05 15:45:39 -08:00
Matt Wells
392d043bd8
undo canonical deduping.
added dump round stats when uploading
json files.
2014-01-31 14:53:49 -08:00
Matt Wells
7b424a6236
always use kstart.
fixed restrictDomain bug of not saving parm.
sped up csv download around 2x.
2014-01-28 14:37:21 -08:00
Matt Wells
8f39c41962
just print out cached page straight, it is
just the diffbot json reply pretty much
verbatim, except for being tokenized.
should no longer escape forward slashes.
2014-01-28 11:04:53 -08:00
Matt Wells
1a9a5e53a7
show if coll has urls ready to spider in html page
2014-01-25 14:49:55 -08:00
Matt Wells
321fc90ff6
fix some cores.
NOTE: emails disabled here... need to fix.
2014-01-24 12:07:28 -08:00
Matt Wells
27b6ceffa8
fix bug of sending notification email twice
for really really tiny jobs.
2014-01-23 21:22:39 -08:00